Abstract
We propose a new technique for the study of multivariate count data and apply it to the number of individuals of several fossil species found at a set of geographical observation points. First, we propose a multivariate model based on Poisson distributions that allows both positive and negative correlations between the components. The model extends the log-linear Poisson model to the multivariate case through conditional distributions. For this model, we obtain the maximum likelihood estimates and compute several goodness-of-fit statistics. Finally, we illustrate the proposed method on several simulated data sets and on a count data set of various fossil species.
Keywords: Poisson log-linear model, maximum likelihood estimation, selection model, multivariate count data, conditional modeling
1. Introduction
Count data have been the subject of an increasing number of proposals in several scientific areas. There are numerous examples in the scientific literature, but models that take multivariate counts into account are still fairly rare. In investigations directed at variables such as career interruptions, number of children, scores of soccer games or days in a hospital, the discrete Poisson distribution can be used. In general, for single counts, this distribution provides good results in modelling the phenomena. On occasion, some modifications are necessary to avoid difficulties such as under-dispersion, over-dispersion and/or excess zeros. Besides, when the interest is focused on a single count variable in a regression context, the building of the model and its analysis are widely developed by means of the generalized linear model [21].
For multiple counts, however, the application of the Poisson distribution is not so clear. The literature reports good works using a bivariate Poisson distribution based on the models described by Johnson and Kotz [13], Kocherlakota and Kocherlakota [19] and Cameron and Johansson [3]. The disadvantage of these distributions is that the correlation between count variables is restricted to being positive. This disadvantage is not severe for small counts, where negative correlation is not realistic; negative correlation only occurs between high counts. Some works that study and use these multivariate versions are Jung and Winkelmann [15], Karlis [16], Karlis and Meligkotsidou [17] and Li et al. [20].
Other approaches to analyzing multivariate count data have been proposed. Alfò and Tovato [1] use semi-parametric mixture models, Famoye [7] develops a bivariate negative binomial regression model, Hellstrom [11] applies a bivariate mixed Poisson log-normal model, Chib and Winkelmann [6] use the Poisson-lognormal distribution, and Karlis and Meligkotsidou [18] examine finite mixtures of multivariate Poisson distributions. Cameron et al. [4], McHale and Scarf [22] and Nikoloulopoulos and Karlis [23] propose the use of copula-based models. Famoye [8] proposes a multivariate generalized Poisson regression model based on a multivariate distribution with several parameters to model the overdispersion and several parameters to model the correlation between the count variables. Berkhout and Plug [2] introduce a bivariate model based on conditioning Poisson distributions. This model allows for positive as well as negative correlation and therefore for a more flexible correlation structure.
In many studies, the objective is to analyze the relationship between a set of explanatory variables and a set of count variables (count vector) with an internal dependency structure that follows the pattern: the kth component of the count vector depends on the values obtained by the first k − 1 components. This dependency structure between the count variables is not explicitly taken into account in the multivariate models proposed in the works cited above: these models do not specify the effect of some count variables on others according to the ordering structure, yet such an order is present in many real data sets.
For example, in a model for studying the number of tourists (T), the number of visits (V) to a tourism point (amusement park, museum, historical building, etc.) and the number of official guide sales (S), the order should be (T, V, S). In a model for studying the number of individuals of various species in a type of ecosystem, the order can be known from the natural relations between the species.
However, if complete information about this order is not available beforehand, one has to select the most adequate model among the possible models. An example is the study of the number of individuals of several fossil species found at a set of geographical observation points. This generic pattern is followed by the work of Ruíz et al. [25], which analyzes the foraminifera collected in three estuaries (Guadiana, Piedras, Tinto-Odiel) of southwestern Spain. The foraminifera are a large group of amoeboid protists. Generally, the foraminifera are regarded as one of the most important groups of marine microfossils because they are very abundant in marine sediments. Because of their diversity, abundance and complex morphology, fossil foraminiferal assemblages are useful for biostratigraphy and paleontology.
In this experience, the count variables considered as objective are the numbers of individuals of three species of foraminifera. The researchers try to model these count variables through their intrinsic dependency structure and their relationship with various variables that characterize the physical environment of the observations (water salinity during high tide and tidal height about the lowest astronomical tide).
In this work, we develop a multivariate linear model to describe the relationships between a vector of explanatory variables and a count vector with the aforementioned internal dependency structure. To illustrate the usefulness of the model, it will be applied to the experience of foraminifera species.
In Section 2, the multivariate model is defined and an iterative algorithm for computing the maximum likelihood estimates of the parameters is described. In addition, various goodness-of-fit measures for the model are included. The distributional hypothesis of the model requires that the components of the response vector are ordered. This order might be known in some applications, but in general it is unknown. Section 2.3 offers an efficient algorithm to identify the most appropriate ordering of the response variables, and the potential problem of no real ordering in the data is also considered.
In Section 3, we provide a simulation study to check the model, the inference process and the fit of the model to the simulated data sets.
Finally, the application to real data in paleontology (three species of foraminifera) in Section 4 confirms the usefulness of the model in experiences associated with multivariate count data and the ease with which researchers can interpret its results. Section 5 ends with some conclusions.
2. Multivariate log-linear conditional Poisson model
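Following the known scheme of the generalized linear model, we can construct a model that allows analyzing a p-dimensional count vector ($Y = (Y_1, \ldots, Y_p)'$) through a d-dimensional vector of covariates ($x$) with the dependency structure and the distribution hypothesis listed below:
- (I) $Y_1$ depends on $x$. For $k = 2, \ldots, p$, $Y_k$ depends on $x$ and $Y_1, \ldots, Y_{k-1}$. Further, the conditioned distributions of the components of $Y$ follow the Poisson distribution. Let $Y_{(k)}$ be the random subvector $(Y_1, \ldots, Y_k)'$; then $Y_1 \mid x \sim \mathcal{P}(\mu_1)$ and $Y_k \mid Y_{(k-1)} = y_{(k-1)}, x \sim \mathcal{P}(\mu_k)$ for $k = 2, \ldots, p$.
- (II) The structural assumption of the model is
$$\log \mu_1 = x'\beta_1, \qquad \log \mu_k = x'\beta_k + \gamma_k' y_{(k-1)},$$
where $\beta_k \in \mathbb{R}^d$ for $k = 1, \ldots, p$ and $\gamma_k \in \mathbb{R}^{k-1}$ for $k = 2, \ldots, p$.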
This model can be considered a multivariate extension of the log-linear Poisson model [21]. Thus the model is determined through a distributional hypothesis (I) defined on the vector of objective variables and a structural hypothesis (II) defined on the parameters of the conditioned distributions. This model can be called a multivariate log-linear conditional Poisson model (MLCP).
The dependence between $Y_k$ and the first $k-1$ variables is modeled through the parameter of the Poisson distribution, which results from multiplying a term ($\lambda_k = \exp(x'\beta_k)$) that depends on the covariates by the exponential function of a linear combination of those variables ($\gamma_k' y_{(k-1)}$),
$$\mu_k = \lambda_k \exp\left(\gamma_k' y_{(k-1)}\right). \tag{1}$$
So, the parameter of the Poisson distribution of $Y_k$ is determined by an intrinsic factor of this variable and a correction factor defined on the preceding variables. The model assumes that the explanatory variables influence only the intrinsic factor, not the correction factor. Obviously, if $\gamma_k = 0$ then the parameter coincides with the intrinsic factor, and $Y_k$ is independent of the first $k-1$ variables.
Given that the correction factor is an exponential function, the parameters that determine the linear combination, $\gamma_k$, cannot be very large in absolute value because, in the negative case, $\mu_k$ is almost null and, in the positive case, it could take excessively large values.
The explanatory variables can be metric, binary or qualitative. In the last case, the covariates have to be appropriately coded by a dummy vector. To include an intercept in the model, we can set the first component of $x$ equal to 1.
This multivariate model can be used to model experiences that are described by a set of correlated count variables. The correlations between the variables can be positive or negative, so the model offers flexibility for describing these types of experiences. Applications in which it could yield interesting conclusions include: (a) the competition for food between two species; (b) the environmental factors controlling the presence of two polymorphs in the same species; (c) speciation and phylogeny; or (d) evolutionary patterns.
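As a minimal sketch of how the conditional structure generates data, the following Python code draws a count vector component by component. It assumes the reading of the model in which the Poisson mean of each component is an intrinsic factor exp(x'beta_k) times a correction factor exp(gamma_k' y_(k-1)). The function names, the parameter values and the simple Poisson sampler are illustrative assumptions, not part of the original work.

```python
import math
import random

def conditional_mean(x, y_prev, beta, gamma):
    """Poisson mean of one component: intrinsic factor times correction factor."""
    intrinsic = math.exp(sum(b * xj for b, xj in zip(beta, x)))
    correction = math.exp(sum(g * yj for g, yj in zip(gamma, y_prev)))
    return intrinsic * correction

def sample_poisson(mu, rng):
    """Knuth's multiplication method; suitable only for moderate means."""
    threshold = math.exp(-mu)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_counts(x, betas, gammas, rng):
    """Draw the count vector component by component, in the model's order."""
    y = []
    for beta, gamma in zip(betas, gammas):
        # gamma has one entry per preceding component, so y is the right length here.
        y.append(sample_poisson(conditional_mean(x, y, beta, gamma), rng))
    return y

# Hypothetical two-component example: intercept plus one covariate.
betas = [[1.0, 0.5], [0.5, -0.5]]
gammas = [[], [-0.1]]  # a negative gamma lowers the second mean as the first count grows
print(simulate_counts([1.0, 0.2], betas, gammas, random.Random(0)))
```

A negative entry in `gammas` induces negative dependence between components and a positive entry induces positive dependence, which is the flexibility the model is meant to provide.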
2.1. Likelihood inference
In order to obtain the maximum likelihood estimates (MLE) of the parameters of the model, we consider a sample
The parameters of model are and for . Denoting and , the likelihood of the model can be expressed, except for the multiplicative constant, as follows:
Therefore, the log-likelihood can be expressed by where
The maximum value is obtained with the iterated Fisher scoring algorithm, that is, starting with an initial estimate ,
| (2) |
where is the vector of first partial derivatives and is the Hessian matrix of second partial derivatives. As
for , the vector of first partial derivatives can be expressed by where are the -dimensional vectors
The Hessian matrix is with
and, for ,
For the initial values of the iterative procedure, the estimators of the univariate log-linear Poisson models associated with each objective variable can be used. The stopping rule can be determined in terms of the absolute value of the gradient and the maximum number of iterations.
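The iteration can be sketched in Python for a single conditional component. The sketch assumes that the log-likelihood separates into one term per component, so that each component can be fitted as a Poisson log-linear regression whose regressors are the covariates followed by the preceding counts. Under that assumption the score is the sum of (y_i - mu_i) z_i and the expected information is the sum of mu_i z_i z_i'. The names `solve` and `fit_component` are hypothetical, and the sketch starts from zero rather than from the univariate estimates mentioned above.

```python
import math

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_component(z, y, max_iter=50, tol=1e-8):
    """Fisher scoring for one conditional Poisson component.

    z: rows of regressors (covariates followed by the preceding counts).
    y: observed counts of the component being modelled.
    Stops when the largest absolute score entry is below tol or after max_iter steps.
    """
    q = len(z[0])
    theta = [0.0] * q
    for _ in range(max_iter):
        mu = [math.exp(sum(t * zj for t, zj in zip(theta, row))) for row in z]
        score = [sum((yi - mi) * row[j] for row, yi, mi in zip(z, y, mu))
                 for j in range(q)]
        if max(abs(s) for s in score) < tol:
            break
        info = [[sum(mi * row[j] * row[l] for row, mi in zip(z, mu))
                 for l in range(q)] for j in range(q)]
        step = solve(info, score)
        theta = [t + s for t, s in zip(theta, step)]
    return theta

# Intercept-only check: the estimate is the log of the sample mean.
print(fit_component([[1.0]] * 4, [1, 2, 3, 6]))
```

Because the log link is canonical for the Poisson distribution, the observed and expected information coincide, so this update is also a Newton-Raphson step.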
2.2. Goodness of fit
It can be confirmed that the distribution of the MLE of is asymptotically normal with approximate mean and approximate covariance matrix . Thus the goodness-of-fit may be checked by the deviance statistic or likelihood ratio statistic. This statistic is given by
where is the individual log-likelihood in which is replaced by (the maximum log-likelihood achievable), that is,
So, the deviance can be equivalently expressed as , twice the difference between the maximum log-likelihood achievable and the log-likelihood of the fitted model.
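For Poisson data, replacing each fitted mean by the observed count gives the maximum achievable log-likelihood, and the deviance reduces to the familiar closed form computed below. This is a sketch for a single component; for the multivariate model the contributions of all components would be summed, which is an assumption about how the terms combine. The function name is hypothetical.

```python
import math

def poisson_deviance(y, mu):
    """Deviance 2 * sum[y*log(y/mu) - (y - mu)], taking y*log(y/mu) = 0 when y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

# The deviance vanishes when the fitted means reproduce the data exactly.
print(poisson_deviance([1, 2, 3], [1.0, 2.0, 3.0]))
```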
The formal use of this statistic becomes unsuitable in this model because the data are not grouped [21]. However, it is possible to consider an R-squared measure of goodness of fit. Cameron and Windmeijer [5] proposed an R-squared measure based on deviances for univariate Poisson regression models, instead of a measure based on residuals. Other pseudo R-squared measures have been defined for these models (see Waldhor et al. [28], Heinzl et al. [10]). In this paper, we propose a multivariate extension of this measure.
In addition to the saturated model and the postulated model, we can consider other models. First, the model without covariates ( not dependent on , that is, the intrinsic factors are , with , for ). Second, the intercept-only model, i.e. without covariates and under independence of the Poisson variables ( not dependent on and , for ).
The without-covariates model coincides with the MLCP model with a single predictor equal to 1, so the MLEs can be obtained by a simple adaptation of the method described in the previous subsection. The maximum likelihood is reached at the MLE of the parameters, that is, and , for . Thus the maximum log-likelihood of this model is being
So, its deviance statistic is
Considering the intercept-only model, the maximum log-likelihood for this model is
where is the sample average of . The deviance statistic of this model can be expressed as
Obviously,
| (3) |
Based on (3), we can define two coefficients of determination or R-squared measures. These coefficients are based on the deviance for the MLCP model and a reference interval. First, if we consider the interval , the following ratio can be defined as the determination coefficient
This ratio can be called the overall R-squared measure or overall determination coefficient. This coefficient lies between 0 and 1 because the fitted log-likelihood increases as regressors are added and the maximum value is . Furthermore, it can be interpreted as the ratio between the variation of the explained log-likelihood and the variation of the achievable log-likelihood. Also, can be interpreted as the relative reduction in deviance due to the complete model, that is, the model under both the distributional assumption and the structural assumption. Hence, values of this measure close to zero indicate a lack of global fit of the model. This lack of fit can be caused by a faulty distributional hypothesis or by a lack of explanatory capacity in the covariates. In addition, if we define the variation ratio of the log-likelihood explained by the distributional assumption (interdependence between objective variables) by
and the variation ratio of the log-likelihood explained by the structural assumption (dependence of covariates) by
then the overall R-squared measure can be expressed as the sum of these variation ratios, that is,
Second, for the reference interval , the following ratio can be considered as a determination coefficient
It can be called the relative R-squared measure or relative determination coefficient. Obviously, this coefficient lies between 0 and 1 and can be interpreted as a measure of the fit of the data to the structural assumption under the distributional hypothesis. So, values of this measure close to zero indicate a lack of explanatory capacity in the covariates.
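The verbal definitions above can be turned into a small computation. The sketch below assumes one reading of them: the overall measure is the reduction in deviance relative to the intercept-only model, its two additive parts are measured on that same scale, and the relative measure is the reduction relative to the without-covariates model. The function name, argument names and numeric values are hypothetical.

```python
def r_squared_measures(dev_fitted, dev_no_covariates, dev_intercept_only):
    """R-squared measures from three deviances.

    Expected ordering: dev_fitted <= dev_no_covariates <= dev_intercept_only.
    """
    overall = 1.0 - dev_fitted / dev_intercept_only
    # Part explained by the dependence between the count variables.
    distributional = (dev_intercept_only - dev_no_covariates) / dev_intercept_only
    # Part explained by the covariates once that dependence is in the model.
    structural = (dev_no_covariates - dev_fitted) / dev_intercept_only
    relative = 1.0 - dev_fitted / dev_no_covariates
    return {
        "overall": overall,
        "distributional": distributional,
        "structural": structural,
        "relative": relative,
    }

# Illustrative deviances only; they are not taken from the data sets of this work.
print(r_squared_measures(100.0, 400.0, 1000.0))
```

By construction the distributional and structural parts add up to the overall measure, which mirrors the decomposition stated in the text.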
2.3. Selection of the model
In the MLCP model, the order of the dependent variables is essential. This order can be known in many applications. For example, in a model for studying the number of individuals of various species in a type of ecosystem, a part of the order can be known from the natural relations between the species. In a model for studying the number of tourists (T) and the number of visits (V) to a tourist point of interest (amusement park, museum, historical building, etc.), the order should be (T, V). However, this information on order is not known in the experience on foraminifera species presented in the introduction to this work.
If complete information about this order is not available beforehand, one has to select the most adequate model among the possible models. In this section, we present a procedure for the case in which there is no information about the order, that is, a procedure to choose the most adequate model among all p! possible models. Obviously, the procedure can be based on selecting the model with the smallest deviance statistic. So the following procedure can be considered:
- Step 1. Selection of the first variable.
  - Study of the p univariate models:
    that are denoted by . Let be the deviance statistic of , for . Thus, select such that
  - The selected variable is denoted by and the rest by .
- Step 2. Selection of the second variable.
  - Step 2.1. Study of the bivariate models
    that are denoted by . Let be the deviance statistics, for . Thus select such that
  - Step 2.2. Study of the bivariate model .
    - If then retain
    - Otherwise, retain
  - The selected variables, in the indicated order, are denoted by , and the rest by .
- Step 3. Selection of the third variable.
  - Step 3.1. Fit the models , with the associated statistics . Thus select such that
  - Step 3.2. Study the model .
    - If then retain the model .
    - If then study the model .
      - If then retain the model
      - Otherwise, retain .
  - The selected variables, in the indicated order, are denoted by , , and the rest by .
- Step k ( ). Selection of the variable.
  - Step k.1. Study the models with the associated statistics
    Thus select such that
  - Step k.2. Study the model .
    - If
      then retain the model .
    - Otherwise, study the model and proceed in a similar way.
For this procedure, the maximum number of models that must be analyzed is . For p > 3 this maximum is smaller than the number of possible orderings (p!). Besides, the models included in the procedure have dimensions smaller than the dimension of the objective vector, except in the last step. Therefore this procedure diminishes the computational complexity.
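One possible reading of the stepwise procedure can be sketched in Python. At each step the new variable is first appended after the variables already ordered, and is then moved one position earlier for as long as that lowers the deviance. The function `select_ordering` and its `deviance` callback are hypothetical names. The lookup table holds the deviances reported in Table 1, so the example only replays that table and does not fit any model.

```python
def select_ordering(variables, deviance):
    """Greedy ordering search; `deviance` maps a tuple of variables to a deviance."""
    remaining = list(variables)
    first = min(remaining, key=lambda h: deviance((h,)))
    order = (first,)
    remaining.remove(first)
    while remaining:
        # Step k.1: append each remaining variable and keep the smallest deviance.
        new = min(remaining, key=lambda h: deviance(order + (h,)))
        remaining.remove(new)
        best = order + (new,)
        # Step k.2: move the new variable one position earlier while it helps.
        for pos in range(len(order) - 1, -1, -1):
            candidate = order[:pos] + (new,) + order[pos:]
            if deviance(best) <= deviance(candidate):
                break
            best = candidate
        order = best
    return order

# Deviances copied from Table 1 (artificial data set, p = 3).
TABLE_1 = {
    (1,): 108.7242, (2,): 206.8058, (3,): 161.8564,
    (1, 2): 222.2701, (1, 3): 218.7594, (3, 1): 267.1031,
    (1, 3, 2): 2356.432, (1, 2, 3): 330.6761, (2, 1, 3): 2265.350,
}
print(select_ordering([1, 2, 3], TABLE_1.__getitem__))
```

In a real application the callback would fit the model for the given ordering, and caching its results would avoid refitting the same ordering twice.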
This procedure is based on the idea that once the order of dependency between the two variables is determined then this order should not be modified with the presence of a new variable. Possibly in many biological studies focusing on the study of the number of individuals of several species, this premise can be applied.
Our proposed procedure is successfully applied to several artificial data sets in the next section, detecting the correct order as defined in the population model. However, real data sets could arise from a generation process in which no established order is present, and the proposed procedure would nonetheless arrive at a final ordering. To avoid this problem, the following idea is suggested.
If the CPU time required to fit a model with a certain ordering is not excessive, fit all the models. Otherwise, the proposed procedure is applied until the next-to-last step, and the last step is modified in such a way that only a fraction f of the remaining orderings is considered.
The deviance or the AIC of the generated models is finally analyzed. We should doubt the appropriateness of the model if the criterion for the selected ordering is not clearly the best. For example, very similar values of the criterion, or the existence of other orderings with a better value, can reveal that no ordering is clearly defined for the data set.
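When fitting every ordering is affordable, the comparison can be organized as in the following sketch, which ranks all orderings by a criterion and reports how far the best one is from the runner-up. The names `rank_orderings` and `relative_gap` are hypothetical, the toy criterion stands in for a fitted deviance or AIC, and what counts as a clear gap is left to the analyst because the text gives no threshold.

```python
from itertools import permutations

def rank_orderings(variables, criterion):
    """Evaluate a criterion (deviance or AIC) for every ordering, best first."""
    return sorted((criterion(order), order) for order in permutations(variables))

def relative_gap(ranked):
    """Relative distance between the best and the second-best criterion value."""
    best, runner_up = ranked[0][0], ranked[1][0]
    return (runner_up - best) / abs(best)

# Toy criterion for illustration only: it is smallest for the ordering (0, 1, 2).
toy_criterion = lambda order: 1.0 + sum(abs(v - i) for i, v in enumerate(order))
ranked = rank_orderings((0, 1, 2), toy_criterion)
print(ranked[0], relative_gap(ranked))
```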
3. Simulations: artificial data sets
We run a small simulation study to analyze the performance of the estimators obtained in Section 2. First, we generate an artificial data set. We have used the following procedure of generation. We consider n = 120, p = 3, number of objective variables ( ) and d = 2, one explanatory variable (X) and the intercept. For , the following values are generated
: value of a distribution .
: value of a distribution , with
: value of a distribution , with
: value of a distribution , with
The results of the selection model algorithm are presented in Table 1. The application of this procedure provides the precise ordering, .
Table 1.
Artificial data set: model selection algorithm.
| Step | Model | Deviance | Decision |
|---|---|---|---|
| Step 1 | h = 1 | 108.7242 | Select |
| | h = 2 | 206.8058 | |
| | h = 3 | 161.8564 | |
| Step 2.1 | h = 2 | 222.2701 | |
| | h = 3 | 218.7594 | Select |
| Step 2.2 | h = 1, t = 3 | 218.7594 | Select |
| | h = 3, t = 1 | 267.1031 | |
| Step 3.2 | h = 1, t = 3, b = 2 | 2356.432 | |
| | h = 1, t = 2, b = 3 | 330.6761 | Select |
| | h = 2, t = 1, b = 3 | 2265.350 | |
Two aspects of the implementation of this algorithm are of interest: first, the initial values of the estimators in each step are obtained by fitting the univariate Poisson regression model for each objective variable; second, the stopping rule of the iteration process is determined in terms of the absolute value of the gradient (AVG) and the maximum number of iterations (NI). The stopping rule is: or . We have written a script in R [26] to fit our model.
In Table 1, we can see that the model selection algorithm detects the order determined in the data generation process , even though in the second step the variable selected is . Furthermore, the estimation procedure provides accurate estimates (see Table 2). This table presents the real values of the parameters used in the data generation process, the estimated values and the asymptotic confidence intervals.
Table 2.
Artificial data set: estimated parameters.
| Parameter | Real value | Estimated value | Confidence interval: lower | Confidence interval: upper |
|---|---|---|---|---|
| | 5.00 | 4.91250 | 4.69609 | 5.12891 |
| | −0.50 | −0.47867 | −0.52670 | −0.43063 |
| | 1.00 | 0.83382 | 0.40777 | 1.25986 |
| | 0.90 | 0.91601 | 0.85716 | 0.97485 |
| | −0.30 | −0.29130 | −0.30506 | −0.27755 |
| | 1.00 | 0.71988 | −1.02076 | 2.46053 |
| | −0.50 | −0.45440 | −0.75899 | −0.14981 |
| | 0.10 | 0.09985 | 0.07350 | 0.12620 |
| | −0.01 | −0.00521 | −0.01534 | 0.00491 |
Finally, the study of adjustment of model gives the following statistics:
. The global adjustment must be considered excellent.
and . The global fit is provided by the log-likelihood explained by the distributional assumption.
. If we consider this coefficient to measure the explanatory capacity of the explanatory variables under the distributional hypothesis, then we can conclude that the variable X provides relevant information about the objective variables.
The values of these measures show that the estimated model gives a quite good fit to the simulated data set.
Second, we generated three artificial data sets with multivariate count data of dimensions 4, 5 and 6, respectively. The parameters of the generation process are included in Table 3, with the explanatory variable X generated according to a distribution . Given the exponential dependence between the count variables, the model parameters must be small. Some parameters have been excluded in order to simulate models in which one count variable does not influence the following variables. In all three data sets, applying the procedure to choose the most appropriate model leads to the ordering used in the generation process. Table 4 presents the estimated values of real parameters, with a clear agreement between real and estimated parameters.
Table 3.
Artificial data sets: coefficients.
| Variable | Intercept | X | $Y_1$ | $Y_2$ | $Y_3$ | $Y_4$ | $Y_5$ |
|---|---|---|---|---|---|---|---|
| Dimension p = 4, sample size n = 200 | | | | | | | |
| $Y_1$ | 1.0 | 0.5 | – | – | – | – | – |
| $Y_2$ | 2.0 | −0.5 | 0.1 | – | – | – | – |
| $Y_3$ | 1.0 | 0.5 | −0.1 | 0.1 | – | – | – |
| $Y_4$ | 2.0 | −0.5 | 0.1 | −0.1 | 0.1 | – | – |
| Dimension p = 5, sample size n = 200 | | | | | | | |
| $Y_1$ | 3.0 | −0.5 | – | – | – | – | – |
| $Y_2$ | 3.0 | 0.5 | −0.1 | – | – | – | – |
| $Y_3$ | 1.0 | −0.5 | 0 | 0.1 | – | – | – |
| $Y_4$ | 3.0 | −0.5 | −0.1 | 0 | 0.1 | – | – |
| $Y_5$ | 3.0 | −0.5 | 0 | 0 | 0.1 | −0.1 | – |
| Dimension p = 6, sample size n = 200 | | | | | | | |
| $Y_1$ | 3.5 | 1.0 | – | – | – | – | – |
| $Y_2$ | 3.0 | 1.5 | −0.01 | – | – | – | – |
| $Y_3$ | 3.5 | 1.5 | −0.10 | 0.10 | – | – | – |
| $Y_4$ | 3.0 | 0 | −0.05 | 0 | 0.05 | – | – |
| $Y_5$ | 3.0 | 0 | −0.01 | 0 | −0.01 | 0.01 | – |
| $Y_6$ | 3.0 | 1.0 | 0 | 0.01 | 0 | 0 | −0.05 |
Table 4.
Artificial data sets: estimated parameters.
| Variable | Intercept | X | $Y_1$ | $Y_2$ | $Y_3$ | $Y_4$ | $Y_5$ |
|---|---|---|---|---|---|---|---|
| Dimension p = 4 | | | | | | | |
| $Y_1$ | 0.96 | 0.52 | – | – | – | – | – |
| $Y_2$ | 2.04 | −0.47 | 0.08 | – | – | – | – |
| $Y_3$ | 0.90 | 0.54 | −0.10 | 0.11 | – | – | – |
| $Y_4$ | 2.05 | −0.38 | 0.07 | −0.08 | 0.08 | – | – |
| Dimension p = 5 | | | | | | | |
| $Y_1$ | 2.96 | −0.51 | – | – | – | – | – |
| $Y_2$ | 2.98 | 0.49 | −0.10 | – | – | – | – |
| $Y_3$ | 1.00 | −0.51 | 0.00 | 0.10 | – | – | – |
| $Y_4$ | 2.78 | −0.58 | −0.09 | 0.01 | 0.10 | – | – |
| $Y_5$ | 3.02 | −0.50 | 0.00 | 0.00 | 0.10 | −0.10 | – |
| Dimension p = 6 | | | | | | | |
| $Y_1$ | 3.49 | 1.00 | – | – | – | – | – |
| $Y_2$ | 2.93 | 1.47 | −0.01 | – | – | – | – |
| $Y_3$ | 3.71 | 1.59 | −0.10 | 0.09 | – | – | – |
| $Y_4$ | 3.47 | 0.26 | −0.06 | −0.01 | 0.05 | – | – |
| $Y_5$ | 3.09 | 0.09 | −0.02 | 0.01 | −0.03 | 0.02 | – |
| $Y_6$ | 3.08 | 0.99 | 0.00 | 0.01 | 0.00 | 0.00 | −0.06 |
Finally, we have obtained the fit measures of the models associated with all possible orderings (p factorial for the data set with p count variables). The results are included in Table 5, with the statistics: deviance, Akaike information criterion (AIC), root mean square error (RMSE) and the overall R-squared ( ). Table 5 includes:
These measures for the model associated with the order used in the data generation process.
A summary of the measures for all models: mean, minimum and maximum.
Table 5.
Artificial data sets: measures of model fit.
| | Deviance | AIC | RMSE | Overall R-squared |
|---|---|---|---|---|
| Dimension p = 4 | | | | |
| Order used in the generation process | 819.096 | 3613.377 | 2.519 | 0.7573 |
| Minimum | 819.096 | 3613.377 | 2.519 | 0.7244 |
| Mean | 884.212 | 3678.493 | 2.606 | 0.7511 |
| Maximum | 942.855 | 3737.136 | 2.660 | 0.7803 |
| Value obtained for the order (1,2,3,4) | | | | |
| Value obtained for the order (2,3,1,4) | | | | |
| Dimension p = 5 | | | | |
| Order used in the generation process | 936.654 | 4660.821 | 3.503 | 0.974 |
| Minimum | 936.654 | 4660.821 | 3.503 | 0.714 |
| Mean | 3782.414 | 7507.021 | 19.776 | 0.888 |
| Maximum | 8786.090 | 12510.256 | 44.131 | 0.974 |
| Value obtained for the order (1,2,3,4,5) | | | | |
| Dimension p = 6 | | | | |
| Order used in the generation process | 1109.437 | 5818.742 | 4.144 | 0.977 |
| Minimum | 1109.437 | 5818.742 | 4.144 | 0.939 |
| Mean | 1897.034 | 6606.098 | 5.492 | 0.960 |
| Maximum | 2847.234 | 7557.387 | 6.621 | 0.978 |
| Value obtained for the order (1,2,3,4,5,6) | | | | |
| Value obtained for the order (1,2,3,5,4,6) | | | | |
In conclusion, in the three cases (p = 4, 5, 6), the ordering determined by the selection procedure coincides with the ordering used in the data generation process and with the optimal ordering according to the deviance, AIC and RMSE. Only a small discrepancy is observed in the criterion based on the overall R-squared coefficient.
4. Application: environmental control of foraminifera distribution
The foraminifera are a large group of amoeboid protists and are regarded as one of the most important groups of marine microfossils. In particular, the scientific interest in these microfossils focuses on their great utility in paleontology and biostratigraphy.
Ruíz et al. [25] analyze the foraminifera collected in three estuaries (Guadiana, Piedras, Tinto-Odiel) of southwestern Spain. Fifty-three observations were collected in the different sedimentary environments (subtidal channels, intertidal channels, channel borders, low salt marshes and high salt marshes) of the three estuaries studied. In this experience, the following variables were taken into account: water salinity during high tide (S), tidal height about the lowest astronomical tide (TH) and number of individuals of three species of foraminifera: Elphidium crispum (EC), Jadammina macrescens (JM), and Trochammina inflata (TI). TH is positive in subaerial environments and increases from the channel border to the high salt marshes, whereas it is negative in the channels and indicates the water depth in the first two environments mentioned above.
Table 6 presents descriptive statistics of the counts and the observed variables. For each species, single log-linear Poisson models are obtained in the paper [24]. However, the single models do not allow joint information to be extracted because they do not model the correlation between the populations. In order to study the relation between the abundance of the three species and the saline content and tidal height, we consider the MLCP model constituted by the count variables EC, JM and TI, and by the explanatory variables S and TH. The application of the model selection algorithm provides the order: JM, TI and EC.
Table 6.
Foraminifera data set: descriptive statistics.
| Mean | St.Dev. | Min | Max | |
|---|---|---|---|---|
| Elphidium crispum | 3.56604 | 12.40736 | 0.0 | 64.0 |
| Jadammina macrescens | 7.84906 | 15.28875 | 0.0 | 71.0 |
| Trochammina inflata | 15.05660 | 42.84651 | 0.0 | 287.0 |
| Water salinity | 32.76415 | 7.12256 | 6.0 | 36.3 |
| Tidal height | −0.37886 | 2.55246 | −6.0 | 3.1 |
The results of the algorithm are presented in Table 7. Thus, the analyzed MLCP model relates the count variables to the explanatory variables .
Table 7.
Foraminifera data set: model selection algorithm.
| Step | Model | Deviance | Decision |
|---|---|---|---|
| Step 1 | | 533.9910 | |
| | | 527.6821 | Select |
| | | 1722.5150 | |
| Step 2.1 | | 912.760 | Select |
| | | 2232.855 | |
| Step 2.2 | | 1025.142 | |
| | | 912.760 | Select |
| Step 3.2 | | 2697.231 | |
| | | 2612.888 | Select |
| | | 2629.255 | |
The parameters of model are with for j = 1, 2, 3, and . So, the expected values of the Poisson distributions are , and such that
The results of the analysis are summarized in Table 8. The capacity of global explanation of the model should be considered significant, particularly the distributional hypothesis. The results obtained from univariate models are:
for EC,
for JM,
for TI,
Table 8.
Foraminifera data set: estimated parameters.
| Parameter | Estimated value | Confidence interval: lower | Confidence interval: upper | |
|---|---|---|---|---|
| Component: JM | ||||
| Intercept | 0.03392 | −0.68576 | 0.75359 | |
| Tidal height | 0.75441 | 0.66776 | 0.84106 | |
| Salinity | 0.03267 | 0.01257 | 0.05277 | |
| Component: TI | ||||
| Intercept | 2.60108 | 2.26086 | 2.94131 | |
| Tidal height | 0.57358 | 0.51001 | 0.63716 | |
| Salinity | −0.01916 | −0.02888 | −0.00944 | |
| JM | 0.00845 | 0.00457 | 0.01233 | |
| Component: EC | ||||
| Intercept | −80.67440 | −100.46653 | −60.88226 | |
| Tidal height | 0.29280 | 0.21467 | 0.37093 | |
| Salinity | 2.31655 | 1.76659 | 2.86650 | |
| JM | −0.29630 | −0.51093 | −0.08166 | |
| TI | −0.17067 | −0.34921 | 0.00786 | |
| Goodness of fit | RMSE = 23.3 | |||
| AIC = 2861.0 | ||||
That is, the multivariate model provides a better fit to the data than the univariate models. Therefore, the model allows the interpretation of the relationships between the three count variables and the two explanatory variables.
From Table 8, tidal height and JM show a high relation ( ), because the highest abundances of this species are found in the high salt marsh deposits, located over the mean high tide level. TI is the dominant species in the low salt marsh deposits and consequently presents a positive but lower relation ( ) with the tidal height. In addition, the positive relation ( ) of this variable with EC is explained by the presence of frequent individuals of this species in the channel border. Consequently, a height gradient is found when this multivariate method is applied to the sedimentary distribution of these three species, coinciding with the arrangement (JM−TI−EC) provided by the model selection algorithm and very similar to that observed in British estuaries [12].
The very high relation ( ) between salinity and EC can be explained by the marine character of this species (Villanueva [27]), which is clearly concentrated near the river mouths with salinity ranges between and . Nevertheless, JM and TI show slight positive ( ) and negative ( ) relations, respectively. These species are not very sensitive to changes in this parameter, although this positive relation between salinity and JM has also been detected in Canadian estuaries [9,14].
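Because the model is log-linear, each coefficient in Table 8 acts multiplicatively on the conditional Poisson mean: a one-unit increase in a regressor multiplies the mean by the exponential of its coefficient, all else held fixed. The short Python sketch below converts three estimates copied from Table 8 into such rate ratios. The labels and variable names are illustrative.

```python
import math

# Estimates copied from Table 8 (coefficients on the log scale).
effects = {
    "tidal height on JM": 0.75441,
    "salinity on EC": 2.31655,
    "JM count on EC": -0.29630,
}

# exp(coefficient) is the factor applied to the conditional mean per unit increase.
rate_ratios = {name: math.exp(coef) for name, coef in effects.items()}

for name, ratio in rate_ratios.items():
    print(f"{name}: mean multiplied by {ratio:.3f} per unit increase")
```

A ratio above 1 corresponds to a positive relation and a ratio below 1 to a negative one, so the negative JM coefficient in the EC component means that the expected EC count falls as the JM count rises.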
5. Conclusion
Count data has been the subject of an increasing number of scientific works, but models taking into account multivariate count data are still scarcely applied. For single counts, the Poisson distribution is commonly used. For multiple counts, the literature reports some works using several multivariate Poisson distribution models. The disadvantage of some of these models is that the correlation between count variables is restricted to being positive.
In this work, we propose a new regression model for multivariate count data that is based on the Poisson distribution. The proposed model not only extends the bivariate model of Berkhout and Plug [2] but also allows modelling both the dependence structure between the count variables and their relationship with a set of explanatory variables.
The model may have certain limitations: complexity in the correlation structure and difficulty in selecting the order of the components. However, the flexibility of the correlation structure should be regarded as an advantage of this model. Another advantage is that the model estimators require little computational power and are relatively easy to obtain. To complete the study of the model, we address two issues: goodness-of-fit measures and model selection.
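The low computational cost follows from the conditional construction: the joint likelihood of an ordered count vector is a product of Poisson terms, so each component can be fitted as an ordinary Poisson log-linear regression in which the preceding counts enter as additional regressors. The following is a minimal sketch of that log-likelihood, assuming conditional log-means of the form x'β_j + Σ_{k<j} γ_jk y_k. The function names and the arguments `betas` and `gammas` are illustrative assumptions, not the notation of this paper.

```python
import math


def poisson_logpmf(y, log_mu):
    """Log Poisson probability of count y when the log of the mean is log_mu."""
    return y * log_mu - math.exp(log_mu) - math.lgamma(y + 1)


def conditional_chain_loglik(counts, covariates, betas, gammas):
    """Joint log-likelihood of ordered count vectors under a conditional chain.

    counts     : list of observations, each a list [y_1, ..., y_m] in the chosen order
    covariates : list of covariate vectors (include a leading 1.0 for an intercept)
    betas      : betas[j] holds the covariate coefficients of component j
    gammas     : gammas[j][k] is the coefficient of y_k (k < j) in component j
                 (assumed parametrisation; gammas[0] is an empty list)
    """
    total = 0.0
    for y, x in zip(counts, covariates):
        for j in range(len(y)):
            # Linear predictor: covariate part plus the preceding counts.
            eta = sum(b * xv for b, xv in zip(betas[j], x))
            eta += sum(gammas[j][k] * y[k] for k in range(j))
            total += poisson_logpmf(y[j], eta)
    return total


# Illustrative call with made-up counts: two components, intercept only,
# and a negative dependence of the second count on the first.
data = [[2, 1], [0, 3], [4, 0]]
x = [[1.0], [1.0], [1.0]]
print(conditional_chain_loglik(data, x, betas=[[0.5], [1.0]], gammas=[[], [-0.4]]))
```

Because the sum separates by component, maximising it is equivalent to maximising each component's term on its own, and a negative `gammas` entry yields negative dependence between counts.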
We propose several goodness-of-fit statistics based on the likelihood of the model. The first is an overall measure ( ) that can be expressed as the sum of two components: the component explained by the distributional hypothesis and the component explained by the structural hypothesis. The second is a relative measure ( ) that can be interpreted as a measure of fit to the structural assumptions under the distributional hypothesis. These measures provide useful information about the fit of the model and have an additional advantage: they can be applied to any parametric multivariate model whose estimators are obtained by maximum likelihood.
The distributional hypothesis of the model depends on the order of the count variables. In many experimental studies the order of the dependency structure between the count variables is known. However, there are situations in which it is not, and for these we propose an algorithm to choose an adequate order. In practice, any available information on the dependency between count variables must be incorporated into the study, and the proposed procedure should be modified accordingly.
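When nothing is known about the order and the number of components is small, one simple alternative is to score every permutation and keep the best one. The sketch below does exactly that; it is an exhaustive search, not the selection algorithm proposed in this paper. Both `select_order` and the user-supplied `score_order` callable (for instance, a function that fits the model in the given order and returns its AIC) are hypothetical names introduced for illustration.

```python
from itertools import permutations


def select_order(component_names, score_order):
    """Return the ordering with the smallest score, together with all scores.

    component_names : labels of the count variables
    score_order     : callable mapping an ordering (tuple of labels) to a
                      criterion to minimise, e.g. the AIC of the fitted model
    """
    scores = {order: score_order(order) for order in permutations(component_names)}
    best = min(scores, key=scores.get)
    return best, scores


# Placeholder criterion for demonstration only; a real study would fit the
# conditional model for each ordering and return a goodness-of-fit criterion.
def toy_score(order):
    return sum(position * len(label) for position, label in enumerate(order))


best, scores = select_order(["a", "bb", "ccc"], toy_score)
print(best, len(scores))
```

The number of orderings grows as m!, so this is only practical for a handful of count variables. Prior knowledge about the dependency structure can be imposed by restricting the permutations that are scored.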
To illustrate this multivariate technique, a real application is included: the relation between the abundance of three benthic foraminifera and several environmental variables. The challenge of analysing these populations jointly led to the development of the multivariate model. This application illustrates the usefulness of the model for the analysis of species-counting experiments in scientific areas such as cell biology, botany, zoology and virology.
Other methods have recently been proposed for multivariate count regression, such as mixture Poisson models and copula-based Poisson models. Both approaches are very flexible, allowing the description and modelling of a wide range of correlation structures that can appear in real data sets. However, these models are not easily interpretable, and they can lead to conclusions that are far from the actual process generating the data. Our proposed model can offer a more reliable depiction of the random process behind the data in many real problems. When the hypothesis about the ordering of the variables is questionable, the other models must also be considered, and the most appropriate one selected with the aid of a goodness-of-fit criterion.
Finally, other aspects of the model remain to be explored, for example a detailed study of the properties of the maximum likelihood estimators, the estimation of the marginal moments of the count vector components, the design of algorithms for selecting the explanatory variables, and the development of diagnostic techniques.
The proposed model assumes no overdispersion or zero-inflation in the conditional distributions. Extending the distributional hypothesis to models containing these features (overdispersion and zero-inflation) can be very interesting from a practical viewpoint, but it requires a broad development of its theoretical and computational issues. The study of these aspects should be the subject of further works.
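Whether the no-overdispersion assumption is reasonable for a given conditional component can be checked informally with the Pearson dispersion statistic, a standard diagnostic for Poisson regression that is not part of the method developed in this paper. The sketch below assumes the fitted conditional means are already available; `pearson_dispersion` is an illustrative name.

```python
def pearson_dispersion(counts, fitted_means, n_params):
    """Pearson chi-square statistic divided by the residual degrees of freedom.

    counts       : observed counts of one component
    fitted_means : fitted conditional Poisson means for the same observations
    n_params     : number of estimated parameters in that component's regression
    """
    if len(counts) != len(fitted_means):
        raise ValueError("counts and fitted_means must have the same length")
    dof = len(counts) - n_params
    if dof <= 0:
        raise ValueError("not enough observations for the number of parameters")
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted_means))
    return chi2 / dof
```

Values close to 1 are compatible with the Poisson assumption, whereas values well above 1 point to overdispersion and values well below 1 to underdispersion.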
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Alfò M. and Trovato G., Semiparametric mixture models for multivariate count data, with application, Econometrics J. 7 (2004), pp. 426–454.
- 2. Berkhout P. and Plug E., A bivariate Poisson count data model using conditional probabilities, Stat. Neerl. 58 (2004), pp. 349–364.
- 3. Cameron A.C. and Johansson P., Count data regressions using series expansions with applications, J. Appl. Econom. 12 (1997), pp. 203–224.
- 4. Cameron A.C., Li T., Trivedi P.K., and Zimmer D.M., Modelling the differences in counted outcomes using bivariate copula models with application to mismeasured counts, Econom. J. 7 (2004), pp. 566–584.
- 5. Cameron A.C. and Windmeijer F.A.G., R-squared measures for count data regression models with applications to health-care utilization, J. Business Econom. Stat. 14 (1996), pp. 209–220.
- 6. Chib S. and Winkelmann R., Markov chain Monte Carlo analysis of correlated count data, J. Business Econom. Stat. 19 (2001), pp. 428–435.
- 7. Famoye F., On the bivariate negative binomial regression model, J. Appl. Stat. 37 (2010), pp. 969–981.
- 8. Famoye F., A multivariate generalized Poisson regression model, Commun. Stat.-Theor. Meth. 44 (2015), pp. 497–511.
- 9. Guilbault J.P., Clague J.J., and Lapointe M., Amount of subsidence during a late Holocene earthquake-evidence from fossil marsh foraminifera at Vancouver Island, west coast of Canada, Palaeogeogr. Palaeoclimatol. Palaeoecol. 118 (1998), pp. 49–71.
- 10. Heinzl H. and Mittlbock M., Pseudo R-squared measures for Poisson regression models with over or underdispersion, Comput. Statist. Data Anal. 44 (2003), pp. 253–271.
- 11. Hellstrom J., A bivariate count data model for household tourism demand, J. Appl. Econom. 21 (2006), pp. 213–226.
- 12. Horton B.P., Edwards R.J., and Lloyd J.M., UK intertidal foraminiferal distributions: implications for sea-level studies, Palaeogeogr. Palaeoclimatol. Palaeoecol. 149 (1999), pp. 127–149.
- 13. Johnson N.L. and Kotz S., Distributions in Statistics: Discrete Distributions, John Wiley & Sons, New York, 1969.
- 14. Jonasson K.E. and Patterson R.T., Preservation potential of salt marsh foraminifera from the Fraser delta river, British Columbia, Mar. Micropaleontol. 38 (1992), pp. 289–301.
- 15. Jung C.J. and Winkelmann R., Two aspects of labor mobility: a bivariate Poisson regression approach, Empir. Econ. 18 (1993), pp. 543–556.
- 16. Karlis D., An EM algorithm for multivariate Poisson distribution and related models, J. Appl. Stat. 30 (2003), pp. 63–77.
- 17. Karlis D. and Meligkotsidou L., Multivariate Poisson regression with covariance structure, Stat. Comput. 15 (2005), pp. 255–265.
- 18. Karlis D. and Meligkotsidou L., Finite mixtures of multivariate Poisson distributions with application, J. Stat. Plan. Inference 137 (2007), pp. 1942–1960.
- 19. Kocherlakota S. and Kocherlakota K., Bivariate Discrete Distributions, Marcel Dekker, New York, 1992.
- 20. Li C.S., Lu J.C., Brinkley P.A., and Peterson J.P., Multivariate zero-inflated Poisson models and their applications, Technometrics 41 (1999), pp. 29–38.
- 21. McCullagh P. and Nelder J.A., Generalized Linear Models, 2nd ed., Chapman and Hall, London, UK, 1989.
- 22. McHale I. and Scarf P., Modelling soccer matches using bivariate discrete distributions with general dependence structure, Stat. Neerl. 61 (2007), pp. 432–445.
- 23. Nikoloulopoulos A.K. and Karlis D., Modeling multivariate count data using copulas, Commun. Stat. Simul. Comput. 39 (2009), pp. 172–187.
- 24. Ruiz F., González M.L., Abad M., Munoz-Pichardo J.M., and Pino-Mejias R., New micropaleontological applications of the Poisson distribution: statistical models applied to benthic foraminiferal populations, Terra Nova 19 (2007), pp. 367–372.
- 25. Ruiz F., González M.L., Pendón J.G., Abad M., Olías M., and Munoz-Pichardo J.M., Correlation between foraminifera and sedimentary environments in recent estuaries of southwestern Spain: applications to holocene reconstructions, Quat. Int. 140-141 (2005), pp. 21–36.
- 26. R Development Core Team, A language and environment for statistical computation, R Foundation for Statistical Computing. http://www.r-project.org/
- 27. Villanueva P., Implicaciones oceanográficas de los foraminíferos bentónicos recientes en la bahía y plataforma gaditana. Taxonomía y asociaciones, Ph.D. Thesis, Cádiz University, Spain, 1994.
- 28. Waldhor T., Haidinger G., and Schober E., Comparison of measures for Poisson regression by simulation, J. Epidemiol. Biostat. 3 (1998), pp. 209–215.
