Journal of Applied Statistics. 2021 Jan 22;48(13-15):2525–2541. doi: 10.1080/02664763.2021.1877637

A multivariate Poisson regression model for count data

J M Muñoz-Pichardo a, R Pino-Mejías a, J García-Heras a, F Ruiz-Muñoz b, M Luz González-Regalado b
PMCID: PMC9041711  PMID: 35707072

Abstract

We propose a new technique for the study of multivariate count data, motivated by the study of the number of individuals of several fossil species found at a set of geographical observation points. First, we propose a multivariate model based on Poisson distributions that allows positive and negative correlations between the components. The model extends the log-linear Poisson model to the multivariate case through conditional distributions. For this model, we obtain the maximum likelihood estimates and compute several goodness-of-fit statistics. Finally, we illustrate the proposed method on several simulated data sets and on a count data set of various fossil species.

Keywords: Poisson log-linear model, maximum likelihood estimation, selection model, multivariate count data, conditional modeling

1. Introduction

Count data has been the subject of an increasing number of proposals in several scientific areas. There are numerous examples in the scientific literature, but models that take multivariate counts into account are still fairly rare. In investigations of variables such as career interruptions, number of children, scores of soccer games or days in a hospital, the discrete Poisson distribution can be used. In general, for single counts, this distribution provides good results in modelling the phenomena. On occasion, some modifications are necessary to avoid difficulties such as under-dispersion or over-dispersion and/or excess zeros. Moreover, when interest is focused on a single count variable in a regression context, the building of the model and its analysis are well developed by means of the generalized linear model [21].

For multiple counts, however, the application of the Poisson distribution is not so clear. The literature reports good works using a bivariate Poisson distribution based on the models described by Johnson and Kotz [13], Kocherlakota and Kocherlakota [19] and Cameron and Johansson [3]. The disadvantage of these distributions is that the correlation between count variables is restricted to being positive. This disadvantage is not severe, since negative correlation is not realistic for small counts and only occurs between high counts. Some works that study and use these multivariate versions are Jung and Winkelmann [15], Karlis [16], Karlis and Meligkotsidou [17] and Li et al. [20].

Other approaches to analyzing multivariate count data have been proposed. Alfò and Tovato [1] use semi-parametric mixture models, Famoye [7] develops a bivariate negative binomial regression model, Hellstrom [11] applies a bivariate mixed Poisson log-normal model, Chib and Winkelmann [6] use the Poisson-lognormal distribution, and Karlis and Meligkotsidou [18] examine finite mixtures of multivariate Poisson distributions. Cameron et al. [4], McHale and Scarf [22] and Nikoloulopoulos and Karlis [23] propose the use of copula-based models. Famoye [8] proposes a multivariate generalized Poisson regression model based on a multivariate distribution with several parameters to model the overdispersion and several parameters to model the correlation between the count variables. Berkhout and Plug [2] introduce a bivariate model based on conditioning Poisson distributions, which allows for positive as well as negative correlation and therefore for a more flexible correlation structure.

In many studies, the objective is to analyze the relationship between a set of explanatory variables and a set of count variables (count vector) with an internal dependency structure that follows the pattern: the kth component of the count vector depends on the values obtained by the first (k−1) components. This dependency structure between the count variables is not explicitly taken into account in the multivariate models proposed in the works cited above: these models do not specify the effect of some count variables on others according to the ordering structure, although such an order is present in many real data sets.

For example, in a model for studying the number of tourists (T), the number of visits (V) to a tourism point (amusement park, museum, historical building, etc.) and the number of official guide sales (S), the order should be (T,V,S). In a model for studying the number of individuals of various species in a type of ecosystem, the order can be known from the natural relations between the species.

However, if complete information about this order is not available beforehand, one has to select the most adequate model among the possible models. An example is the study of the number of individuals of several fossil species found at a set of geographical observation points. This generic pattern is followed by the work of Ruíz et al. [25], which analyzes the foraminifera collected in three estuaries (Guadiana, Piedras, Tinto-Odiel) of southwestern Spain. The foraminifera are a large group of amoeboid protists. Generally, they are regarded as one of the most important groups of marine microfossils because they are very abundant in marine sediments. Because of their diversity, abundance and complex morphology, fossil foraminiferal assemblages are useful for biostratigraphy and paleontology.

In this study, the objective count variables are the numbers of individuals of three species of foraminifera. The researchers try to model these count variables through their intrinsic dependency structure and their relationship with various variables that characterize the physical environment of the observations (water salinity during high tide and tidal height about the lowest astronomical tide).

In this work, we develop a multivariate linear model to describe the relationships between a vector of explanatory variables and a count vector with the aforementioned internal dependency structure. To illustrate the usefulness of the model, it is applied to the foraminifera data.

In Section 2, the multivariate model is defined and an iterative algorithm for computing the maximum likelihood estimates of the parameters is described (Section 2.1). In addition, various goodness-of-fit measures for the model are included (Section 2.2). The distributional hypothesis of the model requires that the components of the response vector be ordered. This order might be known in some applications, but in general it is unknown. Section 2.3 offers an efficient algorithm to identify the most appropriate ordering of the response variables, and the potential problem of no real ordering in the data is also considered.

In Section 3, we provide a simulation study to check the model, the inference process and the fit of the model to the simulated data sets.

Finally, the application to real data in paleontology (three species of foraminifera) in Section 4 confirms the usefulness of the model in studies involving multivariate count data and the ease with which researchers can interpret its results. Section 5 ends with some conclusions.

2. Multivariate log-linear conditional Poisson model

Following the known scheme of the generalized linear model, we can construct a model that allows analyzing a p-dimensional count vector ($Y$) through a d-dimensional vector of covariates ($X$) with the dependency structure and the distribution hypothesis listed below:

  I. $Y_1$ depends on $X$. For $k=2,\dots,p$, $Y_k$ depends on $Y_1,\dots,Y_{k-1}$ and $X$. Further, the conditional distributions of the components of $Y$ follow the Poisson distribution. Let $Y_{(h)}$ be the random subvector $[Y_1 \cdots Y_h]^t$; then
    $Y_k \mid Y_{(k-1)}, X \sim P(\lambda_k)$.
  II. The structural assumption of the model is
    $\log\lambda_1 = x^t\beta_1$ and $\log\lambda_j = x^t\beta_j + y_{(j-1)}^t\alpha_j$, for $j=2,\dots,p$,
    where $\beta_j\in\mathbb{R}^d$ and $\alpha_j=[\alpha_{j1},\dots,\alpha_{j(j-1)}]^t\in\mathbb{R}^{j-1}$ for $j=1,\dots,p$.

This model can be considered a multivariate extension of the log-linear Poisson model [21]. Thus the model is determined through a distributional hypothesis (I) defined on the vector of objective variables and a structural hypothesis (II) defined on the parameters of the conditioned distributions. This model can be called a multivariate log-linear conditional Poisson model (MLCP).

The dependence between $Y_j$ and the first $(j-1)$ variables is modeled through the parameter of the Poisson distribution, which results from multiplying a term ($e^{\eta_j}=e^{x^t\beta_j}$) that depends on the covariates by the exponential function of a linear combination of those variables ($e^{y_{(j-1)}^t\alpha_j}$),

$$\lambda_j = e^{x^t\beta_j}\, e^{y_{(j-1)}^t\alpha_j} \qquad (1)$$

So, the parameter of the Poisson distribution of $Y_j$ is determined by an intrinsic factor of this variable and a correction factor defined on the preceding variables. The model assumes that the explanatory variables influence only the intrinsic factor, not the correction factor. Obviously, if $\alpha_j=0$ then the parameter coincides with the intrinsic factor, $Y_j \mid X=x \sim P(\lambda_j(x))$, and $Y_j$ is independent of the first $(j-1)$ variables.

Given that the correction factor is an exponential function, the parameters $\alpha_j$ that determine the linear combination cannot be very large in absolute value: in the negative case $\lambda_j$ is almost null and, in the positive case, $\lambda_j$ could take excessively large values.
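The sensitivity of the rate to the correction factor can be seen in a small numerical sketch of equation (1). The function name `conditional_rate` and all input values are illustrative assumptions, not part of the original work.

```python
import math

def conditional_rate(x, beta, y_prev, alpha):
    """Rate of Y_j given covariates x and the preceding counts y_prev, as in equation (1)."""
    intrinsic = math.exp(sum(xi * bi for xi, bi in zip(x, beta)))         # e^{x^t beta_j}
    correction = math.exp(sum(yi * ai for yi, ai in zip(y_prev, alpha)))  # e^{y_(j-1)^t alpha_j}
    return intrinsic * correction

# Illustrative values (assumed): intercept plus one covariate, one preceding count.
x = [1.0, 5.0]
beta = [1.0, 0.9]
y_prev = [12]
for a in (-0.3, 0.0, 0.3):
    print(a, conditional_rate(x, beta, y_prev, [a]))
```

With the same intrinsic factor, moving the single coefficient from -0.3 to 0.3 changes the rate by a factor of exp(0.6 * 12), which is why moderate values of $\alpha_j$ are needed when the preceding counts are not small.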

The explanatory variables can be metric, binary or qualitative. In the last case, the covariates have to be appropriately coded by a dummy vector. To include an intercept in the model, the first component of $X$ can be set equal to 1.

This multivariate model can be used to model studies that are described by a set of correlated count variables. The correlations between the variables can be positive or negative, so the model offers flexibility for the description of these types of studies. Some applications in which such conclusions could be of interest are: (a) the competition for food between two species; (b) the environmental factors controlling the presence of two polymorphs in the same species; (c) speciation and phylogeny; or (d) evolutionary patterns.

2.1. Likelihood inference

In order to obtain the maximum likelihood estimates (MLE) of the parameters of the model, we consider a sample

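$$\{(x_i,y_i): x_i=[x_{1i},\dots,x_{di}]^t,\ y_i=[y_{1i},\dots,y_{pi}]^t,\ i=1,\dots,n\}.$$

The parameters of the model are $\beta_j\in\mathbb{R}^d$ and $\alpha_j\in\mathbb{R}^{j-1}$ for $j=1,\dots,p$. Denoting $\alpha=(\alpha_2^t \cdots \alpha_p^t)^t$ and $\beta=(\beta_1^t \cdots \beta_p^t)^t$, the likelihood of the model can be expressed, except for the multiplicative constant, as follows:

$$L(\beta,\alpha)=\prod_{i=1}^{n}L_i(\beta,\alpha)=\prod_{i=1}^{n}\prod_{h=1}^{p}\exp(-\lambda_h)\exp(y_{hi}\log\lambda_h)$$

Therefore, the log-likelihood can be expressed by $l(\beta,\alpha)=\sum_{i=1}^{n}l_i(\beta,\alpha)$ where

$$l_i(\beta,\alpha)=\left[-\exp(x_i^t\beta_1)+y_{1i}\,x_i^t\beta_1\right]+\sum_{h=2}^{p}\left[-\exp\!\left(x_i^t\beta_h+y_{i(h-1)}^t\alpha_h\right)+y_{hi}\left(x_i^t\beta_h+y_{i(h-1)}^t\alpha_h\right)\right]$$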
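The individual contribution $l_i$ can be written as a short function. This is a minimal sketch; the name `loglik_i` and the list-of-lists layout of the coefficients (with an empty list standing in for the non-existent $\alpha_1$) are assumptions made for illustration.

```python
import math

def loglik_i(x, y, betas, alphas):
    """Contribution l_i(beta, alpha), dropping the constant term -sum(log(y_h!)).

    x      : covariate values of one observation (length d)
    y      : counts of one observation (length p), in model order
    betas  : p coefficient lists of length d
    alphas : p coefficient lists; alphas[h] has length h (alphas[0] is empty)
    """
    total = 0.0
    for h in range(len(y)):
        eta = sum(a * b for a, b in zip(x, betas[h]))         # x^t beta_h
        eta += sum(a * b for a, b in zip(y[:h], alphas[h]))   # y_(h-1)^t alpha_h
        total += -math.exp(eta) + y[h] * eta
    return total

# Illustrative check with p = 2, d = 1 (assumed values).
value = loglik_i([1.0], [2, 3], betas=[[0.0], [0.0]], alphas=[[], [0.5]])
print(value)
```

The full log-likelihood is the sum of `loglik_i` over the n observations.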

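The maximum value is obtained with the iterated Fisher scoring algorithm, that is, starting with an initial estimate $(\hat\beta^{(0)},\hat\alpha^{(0)})$,

$$(\hat\beta^{(k+1)},\hat\alpha^{(k+1)})=(\hat\beta^{(k)},\hat\alpha^{(k)})-\left[H(\hat\beta^{(k)},\hat\alpha^{(k)})\right]^{-1}s(\hat\beta^{(k)},\hat\alpha^{(k)})\quad\text{for } k=0,1,2,\dots \qquad (2)$$

where $s(\hat\beta^{(k)},\hat\alpha^{(k)})$ is the vector of first partial derivatives and $H(\hat\beta^{(k)},\hat\alpha^{(k)})$ is the Hessian matrix of second partial derivatives. As

$$\frac{\partial l_i}{\partial\beta_1}=\left[y_{1i}-\exp(x_i^t\beta_1)\right]x_i,\quad \frac{\partial l_i}{\partial\beta_h}=\left[y_{hi}-\exp\!\left(x_i^t\beta_h+y_{i(h-1)}^t\alpha_h\right)\right]x_i,\quad \frac{\partial l_i}{\partial\alpha_h}=\left[y_{hi}-\exp\!\left(x_i^t\beta_h+y_{i(h-1)}^t\alpha_h\right)\right]y_{i(h-1)},$$

for $h=2,\dots,p$, the vector of first partial derivatives can be expressed by $s(\beta,\alpha)=\sum_{i=1}^{n}s_i(\beta,\alpha)$ where $s_i$ are the $(pd+p(p-1)/2)$-dimensional vectors

$$s_i(\beta,\alpha)=\left[\left(\frac{\partial l_i}{\partial\beta_1}\right)^t,\left(\frac{\partial l_i}{\partial\beta_2}\right)^t,\left(\frac{\partial l_i}{\partial\alpha_2}\right)^t,\dots,\left(\frac{\partial l_i}{\partial\beta_p}\right)^t,\left(\frac{\partial l_i}{\partial\alpha_p}\right)^t\right]^t$$

The Hessian matrix is $H(\beta,\alpha)=\mathrm{diag}\{H^{(h)}(\beta,\alpha)\}$ with

$$H^{(1)}(\beta,\alpha)=-\sum_{i=1}^{n}\exp(x_i^t\beta_1)\,x_i x_i^t$$

and, for $h=2,\dots,p$,

$$H^{(h)}(\beta,\alpha)=-\sum_{i=1}^{n}\exp\!\left(x_i^t\beta_h+\alpha_h^t y_{i(h-1)}\right)\begin{pmatrix} x_i x_i^t & x_i y_{i(h-1)}^t\\ y_{i(h-1)} x_i^t & y_{i(h-1)} y_{i(h-1)}^t\end{pmatrix}$$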

For the selection of the initial values of the iterated procedure, the estimators of the univariate log-linear Poisson models associated with each objective variable can be considered. The stopping rule can be determined in terms of the absolute value of the gradient and the maximum number of iterations.
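Because the log-likelihood is a sum of p terms, each involving only $(\beta_h,\alpha_h)$, and the Hessian is block diagonal, iteration (2) can be run one block at a time. The following is a minimal pure-Python sketch for a single block, not the authors' R script. The names `solve` and `fit_block`, the tolerance, the starting value and the toy data are assumptions.

```python
import math

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

def fit_block(rows, counts, start, tol=1e-8, max_iter=50):
    """Iteration (2) for block h; rows[i] is x_i followed by y_{i(h-1)}."""
    theta = list(start)
    q = len(theta)
    for _ in range(max_iter):
        score = [0.0] * q
        info = [[0.0] * q for _ in range(q)]   # minus the Hessian block H^{(h)}
        for w, y in zip(rows, counts):
            lam = math.exp(sum(wi * ti for wi, ti in zip(w, theta)))
            for a in range(q):
                score[a] += (y - lam) * w[a]
                for b in range(q):
                    info[a][b] += lam * w[a] * w[b]
        if max(abs(s) for s in score) < tol:   # stopping rule on the gradient
            break
        step = solve(info, score)              # theta - H^{-1} s = theta + info^{-1} s
        theta = [t + d for t, d in zip(theta, step)]
    return theta

# Toy block (assumed data): intercept plus one binary regressor.
rows = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
counts = [1, 3, 4, 8]
theta_hat = fit_block(rows, counts, start=[math.log(4.0), 0.0])
print(theta_hat)
```

In this toy case the fitted means must equal the two group means (2 and 6), so the estimates can be checked against log 2 and log 3. For block h of the MLCP model, each row would hold the d covariates followed by the h−1 preceding counts.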

2.2. Goodness of fit

It can be confirmed that the distribution of the MLE of $(\beta_1,\dots,\beta_p,\alpha_2,\dots,\alpha_p)$ is asymptotically normal with approximate mean $(\beta_1,\dots,\beta_p,\alpha_2,\dots,\alpha_p)$ and approximate covariance matrix $[-H(\hat\beta,\hat\alpha)]^{-1}$. Thus the goodness of fit may be checked by the deviance statistic or likelihood ratio statistic. This statistic is given by

$$D(y;\hat\beta,\hat\alpha)=-2\sum_{i=1}^{n}\left[l_i(\hat\beta,\hat\alpha)-l_i(y_i)\right]$$

where $l_i(y_i)$ is the individual log-likelihood in which $\lambda_j$ is replaced by $y_{ji}$ (the maximum log-likelihood achievable), that is,

$$l_i(y_i)=\sum_{k=1}^{p}\left(-y_{ki}+y_{ki}\log y_{ki}\right)$$

So, the deviance can be equivalently expressed as $D(y;\hat\beta,\hat\alpha)=2[l(y)-l(\hat\beta,\hat\alpha)]$, twice the difference between the maximum log-likelihood achievable and the log-likelihood of the fitted model.

The formal use of this statistic becomes unsuitable in this model because the data are not grouped [21]. However, it is possible to consider an R-squared measure of goodness of fit. Cameron and Windmeijer [5] proposed an R-squared measure based on deviances for univariate Poisson regression models, instead of a measure based on residuals. Other pseudo R-squared measures have been defined for these models (see Waldhor et al. [28], Heinzl et al. [10]). In this paper, we propose a multivariate extension of this measure.

In addition to the saturated model and the postulated model, we can consider other models. First, the model without covariates ($\eta\in\mathbb{R}^p$ not dependent on $X$, that is, the intrinsic factors are $\exp(\eta_j)$, with $\eta_j\in\mathbb{R}$, for $j=1,\dots,p$). Second, the intercept-only model, i.e. without covariates and under the independence of the Poisson variables ($\eta\in\mathbb{R}^p$ not dependent on $X$ and $\alpha_j=0$, for $j=2,\dots,p$).

The without-covariates model coincides with the MLCP model with a single predictor equal to 1. Therefore, its MLE can be obtained by a simple adaptation of the method described in the previous subsection. The maximum likelihood is reached at the MLE of the parameters, that is, $\tilde\eta$ and $\tilde\alpha_h$, for $h=2,\dots,p$. Thus the maximum log-likelihood of this model is $l(\tilde\eta,\tilde\alpha)=\sum_{i=1}^{n}l_i(\tilde\eta,\tilde\alpha)$, with

$$l_i(\eta,\alpha)=\left[-\exp(\eta_1)+y_{1i}\eta_1\right]+\sum_{j=2}^{p}\left[-\exp\!\left(\eta_j+y_{i(j-1)}^t\alpha_j\right)+y_{ji}\left(\eta_j+y_{i(j-1)}^t\alpha_j\right)\right]$$

So, its deviance statistic is $D(y;\tilde\eta,\tilde\alpha)=2[l(y)-l(\tilde\eta,\tilde\alpha)]$.

Considering the intercept-only model, the maximum log-likelihood for this model is

$$l(\bar y)=\sum_{i=1}^{n}\sum_{k=1}^{p}\left(-\bar y_k+y_{ki}\log\bar y_k\right),$$

where $\bar y_k$ is the sample average of $Y_k$. The deviance statistic of this model can be expressed as

$$D(y;\bar y)=2[l(y)-l(\bar y)]=2\sum_{i=1}^{n}\sum_{k=1}^{p}y_{ki}\log(y_{ki}/\bar y_k)$$
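The closed form of the intercept-only deviance can be checked numerically against its definition as twice a log-likelihood difference. This is a small sketch on made-up counts. The function names are assumptions, and the convention 0·log 0 = 0 is assumed for zero counts.

```python
import math

def column_means(ys):
    n = len(ys)
    return [sum(row[k] for row in ys) / n for k in range(len(ys[0]))]

def saturated_loglik(ys):
    # l(y): each rate replaced by the observed count; 0*log(0) is taken as 0 (assumed convention)
    return sum(-v + (v * math.log(v) if v > 0 else 0.0) for row in ys for v in row)

def intercept_only_loglik(ys):
    # l(y_bar): each rate replaced by the sample mean of its component
    means = column_means(ys)
    return sum(-m + v * math.log(m) for row in ys for v, m in zip(row, means))

def intercept_only_deviance(ys):
    # closed form 2 * sum y * log(y / y_bar); zero counts contribute nothing
    means = column_means(ys)
    return 2.0 * sum(v * math.log(v / m) for row in ys for v, m in zip(row, means) if v > 0)

ys = [[3, 0], [5, 2], [1, 4]]   # illustrative counts, n = 3 and p = 2
lhs = 2.0 * (saturated_loglik(ys) - intercept_only_loglik(ys))
rhs = intercept_only_deviance(ys)
print(lhs, rhs)
```

The two values agree because the terms $\bar y_k - y_{ki}$ sum to zero over the sample for each component.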

Obviously,

$$l(\bar y)\le l(\tilde\eta,\tilde\alpha)\le l(\hat\beta,\hat\alpha)\le l(y) \qquad (3)$$

Based on (3), we can define two coefficients of determination or R-squared measures. These coefficients are based on the deviance for the MLCP model and the reference interval. First, if we consider the interval $[l(\bar y),l(y)]$, the following ratio can be defined as the determination coefficient

$$R_O^2=1-\frac{D(y;\hat\beta,\hat\alpha)}{D(y;\bar y)}=\frac{l(\hat\beta,\hat\alpha)-l(\bar y)}{l(y)-l(\bar y)}.$$

This ratio can be called the overall R-squared measure or overall determination coefficient. This coefficient lies between 0 and 1 because the fitted log-likelihood increases as regressors are added and the maximum value is $l(y)$. Furthermore, it can be interpreted as the ratio between the variation of the explained log-likelihood and the variation of the achievable log-likelihood. Also, $R_O^2$ can be interpreted as the relative reduction in deviance due to the complete model, that is, the model under the distributional assumption and the structural assumption. Hence, values of this measure close to zero indicate a lack of global fit of the model. This lack of fit can be caused by a faulty distributional hypothesis or a lack of explanatory capacity in the covariates. In addition, if we define the variation ratio of the log-likelihood explained by the distributional assumption (interdependence between objective variables) by

$$VRL_Y=\frac{l(\tilde\eta,\tilde\alpha)-l(\bar y)}{l(y)-l(\bar y)}$$

and the variation ratio of the log-likelihood explained by the structural assumption (dependence on covariates) by

$$VRL_X=\frac{l(\hat\beta,\hat\alpha)-l(\tilde\eta,\tilde\alpha)}{l(y)-l(\bar y)}$$

then the overall R-squared measure can be expressed as the sum of these variation ratios, that is,

$$R_O^2=VRL_Y+VRL_X$$

Second, for the reference interval $[l(\tilde\eta,\tilde\alpha),l(y)]$, the following ratio can be considered as a determination coefficient

$$R_r^2=\frac{l(\hat\beta,\hat\alpha)-l(\tilde\eta,\tilde\alpha)}{l(y)-l(\tilde\eta,\tilde\alpha)}=\frac{VRL_X}{1-VRL_Y}$$

It can be called the relative R-squared measure or relative determination coefficient. Obviously, this coefficient lies between 0 and 1 and can be interpreted as a measure of the fit of the data to the structural assumption under the distributional hypothesis. So, values of this measure close to zero indicate a lack of explanatory capacity in the covariates.
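The four measures follow directly from the four maximized log-likelihoods in (3). The sketch below uses made-up log-likelihood values purely to show the arithmetic; the function name and the inputs are assumptions.

```python
def r_squared_measures(l_null, l_nocov, l_fit, l_sat):
    """R-squared measures from l(y_bar), l(eta~, alpha~), l(beta^, alpha^) and l(y)."""
    span = l_sat - l_null
    vrl_y = (l_nocov - l_null) / span          # explained by the distributional assumption
    vrl_x = (l_fit - l_nocov) / span           # explained by the structural assumption
    r_overall = (l_fit - l_null) / span
    r_relative = (l_fit - l_nocov) / (l_sat - l_nocov)
    return {"R2_O": r_overall, "VRL_Y": vrl_y, "VRL_X": vrl_x, "R2_r": r_relative}

# Hypothetical log-likelihood values satisfying the ordering in (3).
m = r_squared_measures(l_null=-1000.0, l_nocov=-400.0, l_fit=-250.0, l_sat=-200.0)
print(m)
```

Both identities of this subsection can be read off the result: `R2_O` equals `VRL_Y + VRL_X`, and `R2_r` equals `VRL_X / (1 - VRL_Y)`.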

2.3. Selection of the model

In the MLCP model, the order of the dependent variables is essential. This order can be known in many applications. For example, in a model for studying the number of individuals of various species in a type of ecosystem, a part of the order can be known from the natural relations between the species. In a model for studying the number of tourists (T) and the number of visits (V) to a tourist point of interest (amusement park, museum, historical building, etc.), the order should be (T,V). However, this information on order is not known in the foraminifera study presented in the introduction to this work.

If complete information about this order is not available beforehand, one has to select the most adequate model among the possible models. In this section, we present a procedure for the case in which there is no information about the order, that is, a procedure to choose the most adequate model among the $p!$ possible models. Obviously, the procedure can be based on the selection of the model with the smallest deviance statistic. So the following procedure can be considered:

  1. Selection of the first variable. Study the $p$ univariate models
    $Y_h \mid X=x \sim P(\lambda_h^{(1)})$: $\log\lambda_h^{(1)}=x^t\beta_h^{(1)}$, $h=1,\dots,p$,
    which are denoted by $M^{(1)}[h]$. Let $D(M^{(1)}[h])$ be the deviance statistic of $M^{(1)}[h]$, for $h=1,\dots,p$. Select $Y_j$ such that
    $D(M^{(1)}[j])=\min_{h=1,\dots,p}D(M^{(1)}[h])$.
    The selected variable is denoted by $Z_1^{(1)}$ and the rest by $Z_2^{(1)},\dots,Z_p^{(1)}$.
  2. Selection of the second variable.

    1. Study the $(p-1)$ bivariate models
      $Z_1^{(1)} \mid X=x \sim P(\lambda_1^{(2)})$: $\log\lambda_1^{(2)}=x^t\beta_1^{(2)}$,
      $Z_h^{(1)} \mid X=x, Z_1^{(1)}=z_1^{(1)} \sim P(\lambda_h^{(2)})$: $\log\lambda_h^{(2)}=x^t\beta_h^{(2)}+\alpha^{(2)}z_1^{(1)}$, $h=2,\dots,p$,
      which are denoted by $M^{(2)}[1,h]$. Let $D(M^{(2)}[1,h])$ be their deviance statistics, for $h=2,\dots,p$. Select $Z_j^{(1)}$ such that
      $D(M^{(2)}[1,j])=\min_{h=2,\dots,p}D(M^{(2)}[1,h])$.
    2. Study the bivariate model $M^{(2)}[j,1]$.
      • If $D(M^{(2)}[1,j])\le D(M^{(2)}[j,1])$, retain $M^{(2)}[1,j]$.
      • Otherwise, retain $M^{(2)}[j,1]$.
      The selected variables, in the indicated order, are denoted by $Z_1^{(2)}$, $Z_2^{(2)}$ and the rest by $Z_3^{(2)},\dots,Z_p^{(2)}$.
  3. Selection of the third variable.

    1. Fit the $(p-2)$ models $M^{(3)}[1,2,h]$, with the associated statistics $D(M^{(3)}[1,2,h])$. Select $Z_j^{(2)}$ such that
      $D(M^{(3)}[1,2,j])=\min_{h=3,\dots,p}D(M^{(3)}[1,2,h])$.
    2. Study the model $M^{(3)}[1,j,2]$.
      • If $D(M^{(3)}[1,2,j])\le D(M^{(3)}[1,j,2])$, retain the model $M^{(3)}[1,2,j]$.
      • If $D(M^{(3)}[1,2,j])>D(M^{(3)}[1,j,2])$, study the model $M^{(3)}[j,1,2]$.
        • If $D(M^{(3)}[1,j,2])\le D(M^{(3)}[j,1,2])$, retain the model $M^{(3)}[1,j,2]$.
        • Otherwise, retain $M^{(3)}[j,1,2]$.
      The selected variables, in the indicated order, are denoted by $Z_1^{(3)}$, $Z_2^{(3)}$, $Z_3^{(3)}$ and the rest by $Z_4^{(3)},\dots,Z_p^{(3)}$.
  4. Step $k$ ($3<k\le p$). Selection of the kth variable.
    1. Study the $(p-k+1)$ models $M^{(k)}[1,\dots,k-1,h]$ with the associated statistics
      $D(M^{(k)}[1,\dots,k-1,h])$.
      Select $Z_j^{(k-1)}$ such that
      $D(M^{(k)}[1,\dots,k-1,j])=\min_{h=k,\dots,p}D(M^{(k)}[1,\dots,k-1,h])$.
    2. Study the model $M^{(k)}[1,\dots,j,k-1]$.
      • If
        $D(M^{(k)}[1,\dots,k-1,j])\le D(M^{(k)}[1,\dots,j,k-1])$,
        retain the model $M^{(k)}[1,\dots,k-1,j]$.
      • Otherwise, study the model $M^{(k)}[1,\dots,j,k-2,k-1]$ and proceed in a similar way.
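The steps above can be sketched as a greedy search: append the best remaining variable, then move it towards the front while the deviance decreases. The sketch assumes a user-supplied function `deviance(order)` that fits the model with the response variables in the given order and returns its deviance (a hypothetical helper). Here it is replaced by a lookup of the deviances reported in Table 1 for the first artificial data set.

```python
def select_order(p, deviance):
    """Greedy ordering search; deviance(order) takes a tuple of variable indices."""
    remaining = list(range(1, p + 1))
    first = min(remaining, key=lambda h: deviance((h,)))   # step 1
    order = [first]
    remaining.remove(first)
    while remaining:
        # step k.1: try each remaining variable in the last position
        j = min(remaining, key=lambda h: deviance(tuple(order + [h])))
        remaining.remove(j)
        best = order + [j]
        best_dev = deviance(tuple(best))
        # step k.2: move j one position forward while the deviance decreases
        pos = len(order)
        while pos > 0:
            cand = order[:pos - 1] + [j] + order[pos - 1:]
            cand_dev = deviance(tuple(cand))
            if best_dev <= cand_dev:
                break
            best, best_dev = cand, cand_dev
            pos -= 1
        order = best
    return tuple(order)

# Deviances of the models visited in Table 1 (artificial data set, p = 3).
TABLE_1 = {
    (1,): 108.7242, (2,): 206.8058, (3,): 161.8564,
    (1, 2): 222.2701, (1, 3): 218.7594, (3, 1): 267.1031,
    (1, 3, 2): 2356.432, (1, 2, 3): 330.6761, (2, 1, 3): 2265.350,
}
chosen = select_order(3, lambda order: TABLE_1[order])
print(chosen)
```

With these values the search visits exactly the nine models listed in Table 1 and ends at the ordering (1, 2, 3). In practice `deviance` would refit the model at each call, so caching its results may be worthwhile.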

For this procedure, the maximum number of models that must be analyzed is $p^2$. For $p>3$ this maximum is smaller than the number of possible orderings ($p!$). Besides, the models included in the procedure have dimensions smaller than the dimension of the objective vector, except in the last step. Therefore this procedure reduces the computational complexity.
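The bound can be recovered by counting: step k fits p−k+1 candidate models and then at most k−1 alternative positions for the selected variable. A short sketch of this count (the function name is an assumption) compares it with the number of orderings.

```python
import math

def max_models(p):
    """Upper bound on fitted models: (p-k+1) candidates plus at most (k-1) re-orderings at step k."""
    return sum((p - k + 1) + (k - 1) for k in range(1, p + 1))

for p in range(2, 8):
    print(p, max_models(p), math.factorial(p))
```

Each step contributes at most p fits, so the total is $p^2$, which falls below $p!$ from p = 4 onwards.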

This procedure is based on the idea that once the order of dependency between two variables is determined, this order should not be modified by the presence of a new variable. This premise can possibly be applied in many biological studies focusing on the number of individuals of several species.

Our proposed procedure is successfully applied to several artificial data sets in the next section, detecting the correct order as defined in the population model. However, real data sets could arise from a generation process where no established order is present, and the proposed procedure would nonetheless arrive at a final ordering. To avoid this problem, the following idea is suggested.

If the CPU time required to fit a model with a certain ordering is not excessive, fit all the $p!$ models. Otherwise, the proposed procedure is applied until the next-to-last step, and the last step is modified in such a way that only a fraction $f$ of the $p!-p$ remaining orderings is considered.

The deviance or the AIC of the generated models is finally analyzed. We should doubt the appropriateness of the model if the criterion for the selected ordering is not clearly the best. For example, very similar values of the criterion or the existence of other orderings with a better value can reveal that no ordering is clearly defined for the data set.

3. Simulations: artificial data sets

We run a small simulation study to analyze the performance of the estimators obtained in Section 2. First, we generate an artificial data set with the following procedure. We consider n = 120 observations, p = 3 objective variables ($Y_1,Y_2,Y_3$) and d = 2 (one explanatory variable $X$ and the intercept). For $i=1,\dots,n$, the following values are generated:

  • $x_i$: value of a distribution $N(5,1)$.

  • $y_{1i}$: value of a distribution $P(e^{\eta_1})$, with $\eta_1=5.0-0.5x_i$.

  • $y_{2i}$: value of a distribution $P(e^{\eta_2})$, with $\eta_2=1.0+0.9x_i-0.3y_{1i}$.

  • $y_{3i}$: value of a distribution $P(e^{\eta_3})$, with $\eta_3=1.0-0.5x_i+0.1y_{1i}-0.01y_{2i}$.
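The generation scheme above can be sketched with the Python standard library. This is not the authors' R script: the Poisson sampler `rpois` (Knuth's multiplication method applied in chunks), the function name `simulate` and the seed are assumptions, so the draws will not reproduce the data set analyzed below.

```python
import math
import random

def rpois(lam, rng):
    """Poisson draw: Knuth's method on chunks of the rate, summed (a sum of independent Poissons is Poisson)."""
    total = 0
    while lam > 0.0:
        chunk = min(lam, 500.0)              # keeps exp(-chunk) away from underflow
        limit, prod = math.exp(-chunk), rng.random()
        while prod > limit:
            total += 1
            prod *= rng.random()
        lam -= chunk
    return total

def simulate(n=120, seed=1):
    """Generate (x, y1, y2, y3) following the conditional scheme of this section."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(5.0, 1.0)
        y1 = rpois(math.exp(5.0 - 0.5 * x), rng)
        y2 = rpois(math.exp(1.0 + 0.9 * x - 0.3 * y1), rng)
        y3 = rpois(math.exp(1.0 - 0.5 * x + 0.1 * y1 - 0.01 * y2), rng)
        data.append((x, y1, y2, y3))
    return data

data = simulate()
print(data[:3])
```

Each count is drawn only after the preceding counts are known, which is exactly the ordering that the selection procedure is later asked to recover.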

The results of the model selection algorithm are presented in Table 1. The application of this procedure recovers the correct ordering, $Y_1,Y_2,Y_3$.

Table 1.

Artificial data set: model selection algorithm.

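| Step | Statistic | Model | Deviance | Decision |
|---|---|---|---|---|
| Step 1 | $D(M^{(1)}[h])$ | h = 1 | 108.7242 | Select |
| | | h = 2 | 206.8058 | |
| | | h = 3 | 161.8564 | |
| Step 2, Step 2.1 | $D(M^{(2)}[1,h])$ | h = 2 | 222.2701 | |
| | | h = 3 | 218.7594 | Select |
| Step 2, Step 2.2 | $D(M^{(2)}[h,t])$ | h = 1, t = 3 | 218.7594 | Select |
| | | h = 3, t = 1 | 267.1031 | |
| Step 3, Step 3.2 | $D(M^{(3)}[h,t,b])$ | h = 1, t = 3, b = 2 | 2356.432 | |
| | | h = 1, t = 2, b = 3 | 330.6761 | Select |
| | | h = 2, t = 1, b = 3 | 2265.350 | |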

Two aspects of interest of the implementation of this algorithm are: first, the initial values of the estimators in each step are obtained by fitting the univariate Poisson regression model for each objective variable; second, the stopping rule of the iteration process is determined in terms of the absolute value of the gradient (AVG) and the number of iterations (NI). The stopping rule is: $\mathrm{AVG}<10^{-3}$ or $\mathrm{NI}\ge 50$. We have written a script in R [26] to fit our model.

In Table 1, we can see that the model selection algorithm detects the order determined in the data generation process ($Y_1,Y_2,Y_3$), even though in the second step the variable selected is $Y_3$. Furthermore, the estimation procedure provides accurate estimates (see Table 2). This table presents the real values of the parameters used to generate the data set, the estimated values and the asymptotic confidence intervals.

Table 2.

Artificial data set: estimated parameters.

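| Parameter | Real value | Estimated value | 95% CI lower | 95% CI upper |
|---|---|---|---|---|
| $\beta_{11}$ | 5.00 | 4.91250 | 4.69609 | 5.12891 |
| $\beta_{12}$ | −0.50 | −0.47867 | −0.52670 | −0.43063 |
| $\beta_{21}$ | 1.00 | 0.83382 | 0.40777 | 1.25986 |
| $\beta_{22}$ | 0.90 | 0.91601 | 0.85716 | 0.97485 |
| $\alpha_{21}$ | −0.30 | −0.29130 | −0.30506 | −0.27755 |
| $\beta_{31}$ | 1.00 | 0.71988 | −1.02076 | 2.46053 |
| $\beta_{32}$ | −0.50 | −0.45440 | −0.75899 | −0.14981 |
| $\alpha_{31}$ | 0.10 | 0.09985 | 0.07350 | 0.12620 |
| $\alpha_{32}$ | −0.01 | −0.00521 | −0.01534 | 0.00491 |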

Finally, the study of the fit of the model gives the following statistics:

  • $R_O^2=0.9810$. The global fit must be considered excellent.

  • $VRL_Y=0.9029$ and $VRL_X=0.0781$. The global fit is mainly provided by the log-likelihood explained by the distributional assumption.

  • $R_r^2=0.8045$. If we consider this coefficient to measure the explanatory capacity of the explanatory variables under the distributional hypothesis, then we can conclude that the variable $X$ provides relevant information about the objective variables.

The values of these measures show that the estimated model gives a quite good fit to the simulated data set.

Second, we generated three artificial data sets with multivariate count data of dimensions 4, 5 and 6, respectively. The parameters of the generation process are included in Table 3, with the explanatory variable X generated according to a distribution N(0,1). Given the exponential dependence between the count variables, the model parameters must be small. Some parameters have been excluded in order to simulate models in which one count variable does not influence the following variables. In all three data sets, applying the procedure to choose the most appropriate model leads to the ordering used in the generation process. Table 4 presents the estimated values of real parameters, with a clear agreement between real and estimated parameters.

Table 3.

Artificial data sets: coefficients.

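| Variable | Intercept | X | Y1 | Y2 | Y3 | Y4 | Y5 |
|---|---|---|---|---|---|---|---|
| Dimension p = 4, sample size n = 200 | | | | | | | |
| Y1 | 1.0 | 0.5 | | | | | |
| Y2 | 2.0 | −0.5 | 0.1 | | | | |
| Y3 | 1.0 | 0.5 | −0.1 | 0.1 | | | |
| Y4 | 2.0 | −0.5 | 0.1 | −0.1 | 0.1 | | |
| Dimension p = 5, sample size n = 200 | | | | | | | |
| Y1 | 3.0 | −0.5 | | | | | |
| Y2 | 3.0 | 0.5 | −0.1 | | | | |
| Y3 | 1.0 | −0.5 | 0 | 0.1 | | | |
| Y4 | 3.0 | −0.5 | −0.1 | 0 | 0.1 | | |
| Y5 | 3.0 | −0.5 | 0 | 0 | 0.1 | −0.1 | |
| Dimension p = 6, sample size n = 200 | | | | | | | |
| Y1 | 3.5 | 1.0 | | | | | |
| Y2 | 3.0 | 1.5 | −0.01 | | | | |
| Y3 | 3.5 | 1.5 | −0.10 | 0.10 | | | |
| Y4 | 3.0 | 0 | −0.05 | 0 | 0.05 | | |
| Y5 | 3.0 | 0 | −0.01 | 0 | −0.01 | 0.01 | |
| Y6 | 3.0 | 1.0 | 0 | 0.01 | 0 | 0 | −0.05 |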

Table 4.

Artificial data sets: estimated parameters.

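| Variable | Intercept | X | Y1 | Y2 | Y3 | Y4 | Y5 |
|---|---|---|---|---|---|---|---|
| Dimension p = 4 | | | | | | | |
| Y1 | 0.96 | 0.52 | | | | | |
| Y2 | 2.04 | −0.47 | 0.08 | | | | |
| Y3 | 0.90 | 0.54 | −0.10 | 0.11 | | | |
| Y4 | 2.05 | −0.38 | 0.07 | −0.08 | 0.08 | | |
| Dimension p = 5 | | | | | | | |
| Y1 | 2.96 | −0.51 | | | | | |
| Y2 | 2.98 | 0.49 | −0.10 | | | | |
| Y3 | 1.00 | −0.51 | 0.00 | 0.10 | | | |
| Y4 | 2.78 | −0.58 | −0.09 | 0.01 | 0.10 | | |
| Y5 | 3.02 | −0.50 | 0.00 | 0.00 | 0.10 | −0.10 | |
| Dimension p = 6 | | | | | | | |
| Y1 | 3.49 | 1.00 | | | | | |
| Y2 | 2.93 | 1.47 | −0.01 | | | | |
| Y3 | 3.71 | 1.59 | −0.10 | 0.09 | | | |
| Y4 | 3.47 | 0.26 | −0.06 | −0.01 | 0.05 | | |
| Y5 | 3.09 | 0.09 | −0.02 | 0.01 | −0.03 | 0.02 | |
| Y6 | 3.08 | 0.99 | 0.00 | 0.01 | 0.00 | 0.00 | −0.06 |

Significance levels used in the table: p-value < 0.001; p-value < 0.05.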

Finally, we have obtained the fit measures of the models associated with all possible orderings ($p!$ for the data set with $p$ count variables). The results are included in Table 5, with the statistics: deviance, Akaike information criterion (AIC), root mean square error (RMSE) and the overall R-squared ($R_O^2$). Table 5 includes:

  • These measures for the model associated with the order used in the data generation process.

  • A summary of the measures for all models: mean, minimum and maximum.

Table 5.

Artificial data sets: measures of model fit.

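| Dimension | Row | Deviance | AIC | RMSE | $R_O^2$ |
|---|---|---|---|---|---|
| p = 4 | Order (1,2,3,4) | 819.096 | 3613.377 | 2.519 | 0.7573 |
| | Minimum | 819.096 | 3613.377 | 2.519 | 0.7244 |
| | Mean | 884.212 | 3678.493 | 2.606 | 0.7511 |
| | Maximum | 942.855 | 3737.136 | 2.660 | 0.7803 |
| p = 5 | Order (1,2,3,4,5) | 936.654 | 4660.821 | 3.503 | 0.974 |
| | Minimum | 936.654 | 4660.821 | 3.503 | 0.714 |
| | Mean | 3782.414 | 7507.021 | 19.776 | 0.888 |
| | Maximum | 8786.090 | 12510.256 | 44.131 | 0.974 |
| p = 6 | Order (1,2,3,4,5,6) | 1109.437 | 5818.742 | 4.144 | 0.977 |
| | Minimum | 1109.437 | 5818.742 | 4.144 | 0.939 |
| | Mean | 1897.034 | 6606.098 | 5.492 | 0.960 |
| | Maximum | 2847.234 | 7557.387 | 6.621 | 0.978 |

Notes. p = 4: the minimum deviance, AIC and RMSE are obtained for the order (1,2,3,4); the maximum $R_O^2$ is obtained for the order (2,3,1,4). p = 5: the minimum deviance, AIC and RMSE and the maximum $R_O^2$ are obtained for the order (1,2,3,4,5). p = 6: the minimum deviance, AIC and RMSE are obtained for the order (1,2,3,4,5,6); the maximum $R_O^2$ is obtained for the order (1,2,3,5,4,6).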

In conclusion, in the three cases (p = 4, 5, 6), the ordering determined by the selection procedure coincides with the ordering used in the data generation process and with the optimal ordering according to the deviance, AIC and RMSE. Only a small discrepancy is observed in the criterion based on the coefficient $R_O^2$.

4. Application: environmental control of foraminifera distribution

The foraminifera are a large group of amoeboid protists and are regarded as one of the most important groups of marine microfossils. In particular, scientific interest in these microfossils focuses on their great utility in paleontology and biostratigraphy.

Ruíz et al. [25] analyze the foraminifera collected in three estuaries (Guadiana, Piedras, Tinto-Odiel) of southwestern Spain. Fifty-three observations were collected in the different sedimentary environments (subtidal channels, intertidal channels, channel borders, low salt marshes and high salt marshes) of the three estuaries studied. In this study, the following variables were taken into account: water salinity during high tide (S), tidal height about the lowest astronomical tide (TH) and number of individuals of three species of foraminifera: Elphidium crispum (EC), Jadammina macrescens (JM), and Trochammina inflata (TI). TH is positive in subaerial environments and increases from the channel border to the high salt marshes, whereas it is negative in the channels and indicates the water depth in the first two environments mentioned above.

Table 6 presents descriptive statistics of the counts and the observed variables. For each species, single log-linear Poisson models are obtained in [24]. However, the single models do not allow joint information to be extracted because they do not model the correlation between the populations. In order to study the relation between the abundance of the three species and the saline content and tidal height, we consider the MLCP model constituted by the count variables EC, JM and TI, and by the explanatory variables S and TH. The application of the model selection algorithm provides the order: JM, TI and EC.

Table 6.

Foraminifera data set: descriptive statistics.

| Variable | Mean | St.Dev. | Min | Max |
|---|---|---|---|---|
| Elphidium crispum | 3.56604 | 12.40736 | 0.0 | 64.0 |
| Jadammina macrescens | 7.84906 | 15.28875 | 0.0 | 71.0 |
| Trochammina inflata | 15.05660 | 42.84651 | 0.0 | 287.0 |
| Water salinity | 32.76415 | 7.12256 | 6.0 | 36.3 |
| Tidal height | −0.37886 | 2.55246 | −6.0 | 3.1 |

The results of the algorithm are presented in Table 7. Thus, the analyzed MLCP model relates the count variables $Y=(JM,TI,EC)^t$ to the explanatory variables $X=(1,TH,S)^t$.

Table 7.

Foraminifera data set: model selection algorithm.

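| Step | Statistic | Model | Deviance | Decision |
|---|---|---|---|---|
| Step 1 | $D(M^{(1)}[h])$ | EC, h = 1 | 533.9910 | |
| | | JM, h = 2 | 527.6821 | Select |
| | | TI, h = 3 | 1722.5150 | |
| Step 2, Step 2.1 | $D(M^{(2)}[2,h])$ | EC, h = 1 | 912.760 | Select |
| | | TI, h = 3 | 2232.855 | |
| Step 2, Step 2.2 | $D(M^{(2)}[h,t])$ | (EC,JM), (h = 1, t = 2) | 1025.142 | |
| | | (JM,EC), (h = 2, t = 1) | 912.760 | Select |
| Step 3, Step 3.2 | $D(M^{(3)}[h,t,b])$ | (JM,EC,TI), (h = 2, t = 1, b = 3) | 2697.231 | |
| | | (JM,TI,EC), (h = 2, t = 3, b = 1) | 2612.888 | Select |
| | | (TI,JM,EC), (h = 3, t = 2, b = 1) | 2629.255 | |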

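The parameters of the model are $\{(\beta_1,\beta_2,\beta_3,\alpha_2,\alpha_3): \beta_j\in\mathbb{R}^3, \alpha_h\in\mathbb{R}^{h-1}\}$ with $\beta_j=(\beta_{j1},\beta_{j2},\beta_{j3})^t$ for $j=1,2,3$, $\alpha_2=\alpha_{21}$ and $\alpha_3=(\alpha_{31},\alpha_{32})^t$. So, the expected values of the Poisson distributions are $\lambda_1$, $\lambda_2$ and $\lambda_3$ such that

$$\begin{aligned}\log(\lambda_1)&=\beta_{11}+\beta_{12}\,TH+\beta_{13}\,S\\ \log(\lambda_2)&=\beta_{21}+\beta_{22}\,TH+\beta_{23}\,S+\alpha_{21}\,JM\\ \log(\lambda_3)&=\beta_{31}+\beta_{32}\,TH+\beta_{33}\,S+\alpha_{31}\,JM+\alpha_{32}\,TI\end{aligned}$$

The results of the analysis are summarized in Table 8. The capacity of global explanation of the model should be considered significant, particularly the distributional hypothesis. The results obtained from univariate models are:

  • for EC, $R_O^2(EC)=0.3757964$

  • for JM, $R_O^2(JM)=0.5250740$

  • for TI, $R_O^2(TI)=0.3585918$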

Table 8.

Foraminifera data set: estimated parameters.

Parameter Estimated value Confidence interval ( 95%)
Component: JM      
Intercept β11 0.03392 −0.68576 0.75359
Tidal height β12 0.75441 0.66776 0.84106
Salinity β13 0.03267 0.01257 0.05277
Component: TI      
Intercept β21 2.60108 2.26086 2.94131
Tidal height β22 0.57358 0.51001 0.63716
Salinity β23 −0.01916 −0.02888 −0.00944
JM α21 0.00845 0.00457 0.01233
Component: EC      
Intercept β31 −80.67440 −100.46653 −60.88226
Tidal height β32 0.29280 0.21467 0.37093
Salinity β33 2.31655 1.76659 2.86650
JM α31 −0.29630 −0.51093 −0.08166
TI α32 −0.17067 −0.34921 0.00786
Goodness of fit $R_0^2$ = 0.5691 VRLY = 0.3288 RMSE = 23.3
  $R_r^2$ = 0.3579 VRLX = 0.2402 AIC = 2861.0

That is, with an overall value of 0.5691 (Table 8), higher than the value obtained for each of the three univariate models, the multivariate model provides a better fit to the data. The model therefore allows the relationships between the three count variables and the two explanatory variables to be interpreted.

From Table 8, tidal height and JM are strongly related ($\beta_{12}=0.75$), because the highest abundances of this species are found in the high salt marsh deposits, located above the mean high tide level. TI is the dominant species in the low salt marsh deposits and consequently shows a positive but weaker relation ($\beta_{22}=0.57$) with tidal height. In addition, the positive relation ($\beta_{32}=0.29$) between tidal height and EC is explained by the frequent presence of individuals of this species at the channel border. Consequently, applying this multivariate method to the sedimentary distribution of these three species reveals a height gradient that coincides with the ordering (JM, TI, EC) provided by the model selection algorithm and is very similar to that observed in British estuaries [12].

The very strong relation ($\beta_{33}=2.316$) between salinity and EC can be explained by the marine character of this species (Villanueva [27]), which is clearly concentrated near the river mouths, where salinity ranges between 33‰ and 36‰. In contrast, JM and TI show only slight relations with salinity, positive ($\beta_{13}=0.032$) and negative ($\beta_{23}=-0.019$), respectively. These species are not very sensitive to changes in this parameter, although the positive relation between salinity and JM has also been detected in Canadian estuaries [9,14].
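Since the model uses a log link, each coefficient can also be read on a multiplicative scale: a one-unit increase in a term multiplies the corresponding conditional mean by the exponential of its coefficient, with the other terms held fixed. The Python sketch below applies this standard transformation to the estimates and 95% confidence limits of Table 8. The names `LOG_SCALE`, `rate_ratio` and `RATE_RATIOS` are hypothetical and serve only this illustration.

```python
import math

# (estimate, lower limit, upper limit) on the log scale, copied from Table 8.
LOG_SCALE = {
    ("JM", "tidal height"): (0.75441, 0.66776, 0.84106),
    ("JM", "salinity"): (0.03267, 0.01257, 0.05277),
    ("TI", "tidal height"): (0.57358, 0.51001, 0.63716),
    ("TI", "salinity"): (-0.01916, -0.02888, -0.00944),
    ("EC", "tidal height"): (0.29280, 0.21467, 0.37093),
    ("EC", "salinity"): (2.31655, 1.76659, 2.86650),
    ("EC", "JM"): (-0.29630, -0.51093, -0.08166),
    ("EC", "TI"): (-0.17067, -0.34921, 0.00786),
}


def rate_ratio(estimate, lower, upper):
    """Exponentiate a log-scale estimate and its interval limits.

    exp is monotone, so the transformed limits still bound the transformed estimate.
    """
    return math.exp(estimate), math.exp(lower), math.exp(upper)


RATE_RATIOS = {key: rate_ratio(*values) for key, values in LOG_SCALE.items()}

for (component, term), (ratio, low, high) in RATE_RATIOS.items():
    print(f"{component} ~ {term}: x{ratio:.3f} per unit "
          f"(95% CI {low:.3f} to {high:.3f})")
```

An interval that contains 1 on this scale corresponds to a log-scale interval that contains 0, as is the case for the effect of TI on EC in Table 8.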

5. Conclusion

Count data has been the subject of an increasing number of scientific works, but models taking into account multivariate count data are still scarcely applied. For single counts, the Poisson distribution is commonly used. For multiple counts, the literature reports some works using several multivariate Poisson distribution models. The disadvantage of some of these models is that the correlation between count variables is restricted to being positive.

In this work, we propose a new regression model for multivariate count data based on the Poisson distribution. The proposed model not only extends the bivariate model of Berkhout and Plug [2] but also models the dependence structure between the count variables and their relationship with a set of explanatory variables.

The model may have certain limitations: complexity in the correlation structure and difficulty in selecting the order of the components. However, the flexibility of the correlation structure should be regarded as an advantage of this model. Another advantage is that the model estimators require little computational power and are relatively easy to obtain. To complete the study of the model, we address two issues: the analysis of goodness-of-fit (adjustment) measures and the selection of the model.

We propose several goodness-of-fit statistics based on the likelihood of the model. The first is an overall measure ($R_O^2$) that can be expressed as the sum of two components: the component explained by the distributional hypothesis and the component explained by the structural hypothesis. The second is a relative measure ($R_r^2$) that can be interpreted as a measure of fit to the structural assumptions under the distributional hypothesis. These measures provide relevant and useful information about the fit of the model and have an additional advantage: they can be applied to any parametric multivariate model whose estimators are based on the maximum likelihood method.

The distributional hypothesis of the model depends on the order of the count variables. In many experimental studies, the order of the dependency structure between the count variables is known. However, there are situations in which the order is not known. In this case, we propose an algorithm to choose an adequate order. In practice, any available information on the dependency between the count variables should be incorporated into the study, and the proposed procedure modified accordingly.

To illustrate this multivariate technique, a real application is included: the relation between the abundance of three benthic foraminifera and two environmental variables (water salinity and tidal height). The challenge of jointly analyzing these populations led to the development of the multivariate model. This application illustrates the usefulness of the model for analyzing species-count studies in scientific areas such as cell biology, botany, zoology and virology.

Other methods have recently been proposed for multivariate count regression, such as mixture Poisson models and copula-based Poisson models. Both approaches are very flexible, allowing the description and modeling of a wide range of correlation structures that can appear in real data sets. However, these models are not easily interpretable, and they can lead to conclusions that are distant from the actual process generating the data set. Our proposed model can offer a more reliable depiction of the random process behind the data in many real problems. When the hypothesis about the ordering of the variables is questionable, the other models must also be considered, and the most appropriate one selected with the aid of some goodness-of-fit criterion.

Finally, other aspects of the model need to be explored: for example, a detailed study of the properties of the maximum likelihood estimators, the estimation of the marginal moments of the count vector components, the design of algorithms for selecting the explanatory variables, and the development of diagnostic techniques.

The proposed model assumes no overdispersion or zero-inflation in the conditional distributions. Extending the distributional hypothesis to models containing these features (overdispersion and zero-inflation) can be very interesting from a practical viewpoint, but it requires a broad development of its theoretical and computational issues. The study of these aspects should be the subject of further works.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Alfò M. and Trovato G., Semiparametric mixture models for multivariate count data, with application, Econom. J. 7 (2004), pp. 426–454.
  • 2. Berkhout P. and Plug E., A bivariate Poisson count data model using conditional probabilities, Stat. Neerl. 58 (2004), pp. 349–364.
  • 3. Cameron A.C. and Johansson P., Count data regressions using series expansions with applications, J. Appl. Econom. 12 (1997), pp. 203–224.
  • 4. Cameron A.C., Li T., Trivedi P.K., and Zimmer D.M., Modelling the differences in counted outcomes using bivariate copula models with application to mismeasured counts, Econom. J. 7 (2004), pp. 566–584.
  • 5. Cameron A.C. and Windmeijer F.A.G., R-squared measures for count data regression models with applications to health-care utilization, J. Business Econom. Stat. 14 (1996), pp. 209–220.
  • 6. Chib S. and Winkelmann R., Markov chain Monte Carlo analysis of correlated count data, J. Business Econom. Stat. 19 (2001), pp. 428–435.
  • 7. Famoye F., On the bivariate negative binomial regression model, J. Appl. Stat. 37 (2010), pp. 969–981.
  • 8. Famoye F., A multivariate generalized Poisson regression model, Commun. Stat.-Theor. Meth. 44 (2015), pp. 497–511.
  • 9. Guibault J.P., Clague J.J., and Lapointe M., Amount of subsidence during a late Holocene earthquake-evidence from fossil marsh foraminifera at Vancouver Island, west coast of Canada, Palaeogeogr. Palaeoclimatol. Palaeoecol. 118 (1998), pp. 49–71.
  • 10. Heinzl H. and Mittlbock M., Pseudo R-squared measures for Poisson regression models with over or underdispersion, Comput. Statist. Data Anal. 44 (2003), pp. 253–271.
  • 11. Hellstrom J., A bivariate count data model for household tourism demand, J. Appl. Econom. 21 (2006), pp. 213–226.
  • 12. Horton B.P., Edwards R.J., and Lloyd J.M., UK intertidal foraminiferal distributions: implications for sea-level studies, Palaeogeogr. Palaeoclimatol. Palaeoecol. 149 (1999), pp. 127–149.
  • 13. Johnson N.L. and Kotz S., Distributions in Statistics: Discrete Distributions, John Wiley & Sons, New York, 1969.
  • 14. Jonasson K.E. and Patterson R.T., Preservation potential of salt marsh foraminifera from the Fraser delta river, British Columbia, Mar. Micropaleontol. 38 (1992), pp. 289–301.
  • 15. Jung C.J. and Winkelmann R., Two aspects of labor mobility: a bivariate Poisson regression approach, Empir. Econ. 18 (1993), pp. 543–556.
  • 16. Karlis D., An EM algorithm for multivariate Poisson distribution and related models, J. Appl. Stat. 30 (2003), pp. 63–77.
  • 17. Karlis D. and Meligkotsidou L., Multivariate Poisson regression with covariance structure, Stat. Comput. 15 (2005), pp. 255–265.
  • 18. Karlis D. and Meligkotsidou L., Finite mixtures of multivariate Poisson distributions with application, J. Stat. Plan. Inference 137 (2007), pp. 1942–1960.
  • 19. Kocherlakota S. and Kocherlakota K., Bivariate Discrete Distributions, Marcel Dekker, New York, 1992.
  • 20. Li C.S., Lu J.C., Brinkley P.A., and Peterson J.P., Multivariate zero-inflated Poisson models and their applications, Technometrics 41 (1999), pp. 29–38.
  • 21. McCullagh P. and Nelder J.A., Generalized Linear Models, 2nd ed., Chapman and Hall, London, UK, 1989.
  • 22. McHale I. and Scarf P., Modelling soccer matches using bivariate discrete distributions with general dependence structure, Stat. Neerl. 61 (2007), pp. 432–445.
  • 23. Nikoloulopoulos A.K. and Karlis D., Modeling multivariate count data using copulas, Commun. Stat. Simul. Comput. 39 (2009), pp. 172–187.
  • 24. Ruiz F., González M.L., Abad M., Munoz-Pichardo J.M., and Pino-Mejias R., New micropaleontological applications of the Poisson distribution: statistical models applied to benthic foraminiferal populations, Terra Nova 19 (2007), pp. 367–372.
  • 25. Ruiz F., González M.L., Pendón J.G., Abad M., Olías M., and Munoz-Pichardo J.M., Correlation between foraminifera and sedimentary environments in recent estuaries of southwestern Spain: applications to holocene reconstructions, Quat. Int. 140-141 (2005), pp. 21–36.
  • 26. R Development Core Team, A language and environment for statistical computation, R Foundation for Statistical Computing. http://www.r-project.org/
  • 27. Villanueva P., Implicaciones oceanográficas de los foraminíferos bentónicos recientes en la bahía y plataforma gaditana. Taxonomía y asociaciones, Ph.D. Thesis, Cádiz University, Spain, 1994.
  • 28. Waldhor T., Haidinger G., and Schober E., Comparison of R2 measures for Poisson regression by simulation, J. Epidemiol. Biostat. 3 (1998), pp. 209–215.

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
