Journal of Applied Statistics. 2021 Apr 24;49(11):2740–2766. doi: 10.1080/02664763.2021.1918649

A multivariate zero-inflated binomial model for the analysis of correlated proportional data

Dianliang Deng a, Yiguang Sun a, Guo-Liang Tian b
PMCID: PMC9336504  PMID: 35909665

Abstract

In this paper, a new multivariate zero-inflated binomial (MZIB) distribution is proposed to analyse correlated proportional data with excessive zeros. The distributional properties of the proposed model are studied. The Fisher scoring algorithm and the EM algorithm are given for computing the parameter estimates in the proposed MZIB model with/without covariates. Score tests and likelihood ratio tests are derived for assessing both zero-inflation and the equality of multiple binomial probabilities in correlated proportional data. A limited simulation study is performed to evaluate the performance of the derived EM algorithms for estimating the parameters in the model with/without covariates, and to compare the nominal levels and powers of the score tests and the likelihood ratio tests. The whitefly data are used to illustrate the proposed methodologies.

Keywords: Correlated proportional data, EM algorithm, likelihood ratio test, multivariate zero-inflated binomial, score test, stochastic representation

1. Introduction

Count and proportional data have been used in a wide variety of fields of study, including education, sociology, psychology, biology, toxicology, epidemiology, insurance, public health, engineering, ecology, econometrics, agriculture, manufacturing and horticulture. When analysing such data, generalized linear models (GLMs) are extensively used. However, these data often contain more zero observations than would normally arise from the standard count and proportional distributions. When this issue is not properly addressed, analyses based on the usual GLMs, such as binomial and Poisson models, or even over-dispersed GLMs, may not provide a good fit and may fail to explain the variation in the actual data. Therefore, statisticians have proposed so-called zero-inflated models to fit such data.

Work on zero-inflated models has a long history that can be traced back to at least the 1960s, when Cohen [5] and Johnson and Kotz [13] discussed zero-inflated Poisson (ZIP) models without covariates for count data. Later, ZIP models with covariates were studied by Lambert [15] for application to defects in manufacturing. Zero-inflated negative binomial (ZINB) models were studied by Deng and Paul [7] for count data with both zero-inflation and over-dispersion. Hall [9] and Vieira et al. [23] proposed zero-inflated binomial (ZIB) distributions for modelling proportional data with extra zeros. Zero-inflated beta-binomial (ZIBB) models were also applied by Deng and Paul [7]. Moreover, score tests for zero-inflation in a generalized linear model were studied by Broek [3] and Deng and Paul [6,7]. Hall and Berenhaut [10] developed the score test for heterogeneity and over-dispersion in zero-inflated Poisson and binomial regression models. Jansakul and Hinde [12], Ridout et al. [21], and Min and Gzado [19] also compared the power of the likelihood ratio test, the Wald test and the score test for testing zero-inflation. Most recently, Song [22] established simultaneous statistical modelling of excess zeros, over/underdispersion, and multimodality. Alevizakos and Koukouvinos [1] used zero-inflated binomial processes with a double exponentially weighted moving average statistic to monitor quality characteristics of high-yield processes. In such processes, where proportional data contain a large number of zero observations, ZIB models are more appropriate than ordinary binomial models. Furthermore, Alqawba and Diawara [2] proposed Markov zero-inflated count time series models based on a joint distribution constructed through copula functions.

As previously stated, most available studies in the area of zero-inflation concentrate on univariate distributions. As more complex data have arisen in many subjects, statisticians have extended univariate distributions to their multivariate analogues (e.g. Fang et al. [8]). Johnson and Kotz [13] introduced the multivariate Poisson distribution for modelling several types of defects. Li et al. [17] studied several possible ways to construct a multivariate zero-inflated Poisson (MZIP) distribution. Liu and Tian [18] proposed the Type I MZIP distribution and compared it with the MZIP distribution in Li et al. [17]. On the other hand, the multivariate binomial distribution was studied by Krishnamoorthy [14]. Chandrasekar and Balakrishnan [4] obtained some properties and a characterization of the multivariate binomial distribution. Furthermore, as in the univariate case, excessive zeros are not unusual in multivariate correlated proportional data, and the univariate ZIB model is typically not sufficient for modelling such data. In order to address the over-dispersion problem, fit multivariate proportional data well, and obtain more accurate results, a new distribution called the ‘multivariate zero-inflated binomial (MZIB) distribution’ is proposed in this paper. This distribution is developed along the lines of the symmetric multivariate distributions of Fang et al. [8] and is based on the stochastic representation of the univariate ZIB random variable. A random variable with this new distribution is a q-dimensional response vector generated by a mixture of a common degenerate distribution with unit mass at the origin of $\mathbb{R}^q$ and q independent binomial distributions. Further, the correlations among the components of the multivariate zero-inflated binomial variable are addressed in an intuitive way, even though the binomial components are independent. Moreover, unlike the random effects ZIB model, our proposed model gives an explicit expression for the correlation coefficients among the components of the multivariate zero-inflated binomial variable.

The remainder of this paper is organized as follows. In Section 2, we propose a multivariate zero-inflated binomial distribution, which is inspired by good distributional properties of Type I MZIP distribution of Liu and Tian [18] and driven by the stochastic representation of univariate ZIB random variable. We then obtain joint probability mass function, joint cumulative distribution function, and mixed moments of the MZIB distribution. The likelihood-based statistical inference about parameters of interest is performed in Section 3. Moreover, the Fisher scoring algorithm and EM algorithm are given for the computation of estimates of parameters on the proposed model with/without covariates. The score tests and the likelihood ratio tests are also developed for assessing the zero-inflation and the equality of all binomial probabilities in this section. In Section 4, simulation studies are performed to evaluate the performance of proposed score tests and the likelihood ratio tests in terms of nominal levels and powers, and of EM algorithm for the computation of estimates of parameters for the proposed MZIB model with/without covariates. The whitefly data is analysed as an application of the proposed methodology in Section 5 with the discussion in Section 6.

2. A multivariate zero-inflated binomial distribution

Let $Z\sim\mathrm{Bernoulli}(1-\omega)$ and $X\sim\mathrm{Binomial}(m,\pi)$, with $Z$ and $X$ independent (denoted by $Z\perp X$). Define the random variable $Y=ZX$; then $Y$ follows the univariate zero-inflated binomial (ZIB) distribution, denoted by $Y\overset{d}{=}ZX\sim\mathrm{ZIB}(\omega,m,\pi)$. This stochastic representation of the univariate ZIB random variable extends naturally to a multivariate version. In what follows, we give the definition of the multivariate ZIB distribution, whose correlation structure arises from a common Bernoulli variable $Z$ shared by all components.

Definition 2.1

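A q-dimensional discrete random vector $Y=(Y_1,\dots,Y_q)^\top$ bounded by a given upper vector $m=(m_1,\dots,m_q)^\top$ is said to follow a multivariate ZIB distribution with parameters $\omega\in[0,1)$ and $\pi=(\pi_1,\dots,\pi_q)^\top\in(0,1)^q$ if

$$Y\overset{d}{=}ZX=\begin{cases}0_q & \text{with probability } \omega,\\ X & \text{with probability } 1-\omega,\end{cases} \tag{1}$$

where $Z\sim\mathrm{Bernoulli}(1-\omega)$, $X=(X_1,\dots,X_q)^\top$, $X_r\sim\mathrm{Binomial}(m_r,\pi_r)$ for $r=1,\dots,q$, and $(Z,X_1,\dots,X_q)$ are mutually independent. The multivariate ZIB distribution is denoted by $Y\sim\mathrm{MZIB}(\omega;m_1,\dots,m_q;\pi_1,\dots,\pi_q)$ or $Y\sim\mathrm{MZIB}_q(\omega,m,\pi)$, where $X$ is called the base vector of $Y$.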

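From the stochastic representation (1), the joint probability mass function of $Y\sim\mathrm{MZIB}_q(\omega,m,\pi)$ can be expressed as

$$f(y\mid\omega,m,\pi)=\begin{cases}\omega+(1-\omega)\prod_{r=1}^q(1-\pi_r)^{m_r} & \text{if } y=0,\\[4pt] (1-\omega)\prod_{r=1}^q\binom{m_r}{y_r}\pi_r^{y_r}(1-\pi_r)^{m_r-y_r} & \text{if } y\neq 0\end{cases}\;=\;\omega\Pr(\xi=y)+(1-\omega)\Pr(X=y), \tag{2}$$

where $\xi$ has the degenerate distribution with all its mass at $0$. The corresponding joint cumulative distribution function is given by

$$\Pr(Y\le y)=\omega+(1-\omega)\prod_{r=1}^q\sum_{s=0}^{\lfloor y_r\rfloor}\binom{m_r}{s}\pi_r^{s}(1-\pi_r)^{m_r-s}=\omega+(1-\omega)\prod_{r=1}^q B_{1-\pi_r}(m_r-\lfloor y_r\rfloor,\lfloor y_r\rfloor+1),$$

where $y=(y_1,\dots,y_q)^\top$ is a non-negative real vector in $\mathbb{R}^q$, $\lfloor y_r\rfloor$ is the ‘floor’ function of $y_r$, denoting the largest integer less than or equal to $y_r$, and

$$B_{1-\pi_r}(m_r-\lfloor y_r\rfloor,\lfloor y_r\rfloor+1)=(m_r-\lfloor y_r\rfloor)\binom{m_r}{\lfloor y_r\rfloor}\int_0^{1-\pi_r}t^{\,m_r-\lfloor y_r\rfloor-1}(1-t)^{\lfloor y_r\rfloor}\,dt$$

is the regularized incomplete beta function.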
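As an illustrative check of (2) and of the joint cumulative distribution function, the following minimal sketch evaluates both with the Python standard library only. The function names `mzib_pmf` and `mzib_cdf` are hypothetical helpers introduced here, and the parameter values are borrowed from one of the simulation configurations purely for illustration. The cumulative distribution function is computed from the binomial sums rather than from the incomplete beta form.

```python
import math
from itertools import product

def mzib_pmf(y, omega, m, pi):
    """Joint pmf (2) of MZIB_q(omega, m, pi) at an integer vector y."""
    base = 1.0
    for yr, mr, pr in zip(y, m, pi):
        base *= math.comb(mr, yr) * pr**yr * (1.0 - pr) ** (mr - yr)
    # The extra mass omega is placed only at the origin.
    return (1.0 - omega) * base + (omega if not any(y) else 0.0)

def mzib_cdf(y, omega, m, pi):
    """Joint cdf Pr(Y <= y) for a non-negative real vector y."""
    prod = 1.0
    for yr, mr, pr in zip(y, m, pi):
        k = min(math.floor(yr), mr)  # floor of y_r, capped at m_r
        prod *= sum(math.comb(mr, s) * pr**s * (1.0 - pr) ** (mr - s)
                    for s in range(k + 1))
    return omega + (1.0 - omega) * prod

# Illustrative values (assumed for this sketch).
omega, m, pi = 0.2, (6, 10), (0.2, 0.1)
support = list(product(*(range(mr + 1) for mr in m)))

total_mass = sum(mzib_pmf(y, omega, m, pi) for y in support)
cdf_brute = sum(mzib_pmf(y, omega, m, pi)
                for y in support if y[0] <= 2 and y[1] <= 3)
cdf_formula = mzib_cdf((2.5, 3.0), omega, m, pi)
print(total_mass, cdf_brute, cdf_formula)
```

The brute-force sum over the support and the product formula should agree up to floating-point error, and the probabilities should sum to one.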

Note that although defining the MZIB distribution through the stochastic representation is convenient for deriving its properties (see below), this definition has the limitation that the zero-inflation parameter $\omega$ must lie in the interval $[0,1)$. However, it is not necessary to assume $\omega\in[0,1)$. Therefore, we can define the multivariate ZIB distribution based on the probability mass function as follows.

Definition 2.2

A q-dimensional discrete random vector $Y=(Y_1,\dots,Y_q)^\top$ bounded by a given upper vector $m=(m_1,\dots,m_q)^\top$ is said to follow an MZIB distribution with parameters $\omega$ and $\pi=(\pi_1,\dots,\pi_q)^\top\in(0,1)^q$ if its probability mass function has the form of (2).

From this definition, one can see that $\omega$ may take values less than zero, provided that

$$\omega\ge-\frac{\prod_{r=1}^q(1-\pi_r)^{m_r}}{1-\prod_{r=1}^q(1-\pi_r)^{m_r}},$$

with equality corresponding to zero-truncation. The distribution is zero-deflated if $\omega$ is negative. Also, the value of $\omega$ should be less than one. (The distribution would degenerate to a point mass at zero if $\omega=1$.)
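The lower bound on $\omega$ can be recovered from (2) by requiring the mass at the origin to be non-negative. The short derivation below is a sketch; the shorthand $P_0$ is introduced here only for readability and does not appear in the paper.

```latex
P_0 := \prod_{r=1}^{q}(1-\pi_r)^{m_r}\in(0,1), \qquad
f(0\mid\omega,m,\pi)=\omega+(1-\omega)P_0 \ge 0
\iff \omega\,(1-P_0)\ge -P_0
\iff \omega \ge -\frac{P_0}{1-P_0}.
```

At the boundary value the probability of the zero vector is exactly zero, which is the zero-truncated case; the masses at $y\neq 0$ stay non-negative as long as $\omega\le 1$.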

It should be pointed out that when $\omega<0$, the stochastic representation given in (1) does not hold. Since zero-deflation ($\omega<0$) seldom happens in proportional data and the current paper concentrates on assessing zero-inflation in multivariate proportional data, we mainly consider the case $\omega\ge 0$, where (1) can always be used to investigate properties of the MZIB distribution and to make statistical inference for the MZIB model. We now derive expressions for the moments using this representation. We note that the resulting formulas continue to hold for the case $\omega<0$, which can be checked using the probability mass function (2) of the MZIB distribution. From (1), the mixed moments of $Y\sim\mathrm{MZIB}_q(\omega,m,\pi)$ can be obtained as follows:

$$E\left(\prod_{r=1}^q Y_r^{t_r}\right)=(1-\omega)\prod_{r=1}^q E\left(X_r^{t_r}\right),$$

where $t_1,\dots,t_q\ge 0$. Also, by setting $m\pi=(m_1\pi_1,\dots,m_q\pi_q)^\top$, with products of vectors taken componentwise, we have

$$E(Y)=(1-\omega)\,m\pi,\qquad E(YY^\top)=(1-\omega)\left[\mathrm{diag}(m\pi(1-\pi))+m\pi(m\pi)^\top\right],\qquad \mathrm{cov}(Y)=(1-\omega)\left[\mathrm{diag}(m\pi(1-\pi))+\omega\,m\pi(m\pi)^\top\right].$$

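Therefore,

$$\mathrm{var}(Y_r)=(1-\omega)\,m_r\pi_r\left[(1-\pi_r)+\omega m_r\pi_r\right],\qquad \mathrm{cov}(Y_r,Y_s)=(1-\omega)\,\omega\,m_r\pi_r m_s\pi_s,$$

and the correlation coefficient for $Y_r$ and $Y_s$ is

$$\rho_{rs}\triangleq\mathrm{cor}(Y_r,Y_s)=\frac{\omega\sqrt{m_r\pi_r m_s\pi_s}}{\sqrt{\left(\omega m_r\pi_r+(1-\pi_r)\right)\left(\omega m_s\pi_s+(1-\pi_s)\right)}},\qquad r\neq s. \tag{3}$$

In particular, when $m_r=m_s=m$ and $\pi_r=\pi_s=\pi$, we obtain

$$\rho_{rs}\triangleq\mathrm{cor}(Y_r,Y_s)=\frac{\omega m\pi}{\omega m\pi+(1-\pi)},\qquad r\neq s.$$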

It is worth noting that our proposed model addresses the correlations among all components of the multivariate zero-inflated binomial variable and gives an explicit expression for the correlation coefficients (see (3)). Furthermore, from the expression of the correlation coefficient, one can see that the components of the multivariate zero-inflated binomial variable Y are positively (negatively) correlated if the parameter ω is greater (less) than zero, although the components of the base vector X are independent. The correlation is induced by imposing the same zero-mass probability on all components.
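The moment formulas and the correlation coefficient (3) can be verified numerically by summing over the finite support of the distribution. The sketch below does this with the standard library only; `mzib_pmf`, `exact_mean_cov` and `corr_formula` are hypothetical helper names, and the parameter values (including the mildly negative ω, chosen to lie above the zero-deflation bound) are assumptions made for illustration.

```python
import math
from itertools import product

def mzib_pmf(y, omega, m, pi):
    """Joint pmf (2); also valid for negative omega above the deflation bound."""
    base = 1.0
    for yr, mr, pr in zip(y, m, pi):
        base *= math.comb(mr, yr) * pr**yr * (1.0 - pr) ** (mr - yr)
    return (1.0 - omega) * base + (omega if not any(y) else 0.0)

def exact_mean_cov(omega, m, pi):
    """Mean vector and covariance matrix by brute-force summation."""
    q = len(m)
    mean = [0.0] * q
    second = [[0.0] * q for _ in range(q)]
    for y in product(*(range(mr + 1) for mr in m)):
        p = mzib_pmf(y, omega, m, pi)
        for r in range(q):
            mean[r] += p * y[r]
            for s in range(q):
                second[r][s] += p * y[r] * y[s]
    cov = [[second[r][s] - mean[r] * mean[s] for s in range(q)]
           for r in range(q)]
    return mean, cov

def corr_formula(omega, m, pi, r, s):
    """Closed-form correlation coefficient (3)."""
    a, b = m[r] * pi[r], m[s] * pi[s]
    return omega * math.sqrt(a * b) / math.sqrt(
        (omega * a + 1.0 - pi[r]) * (omega * b + 1.0 - pi[s]))

m, pi = (6, 10), (0.2, 0.1)
for omega in (0.2, -0.05):  # zero-inflated and mildly zero-deflated cases
    mean, cov = exact_mean_cov(omega, m, pi)
    rho = cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])
    print(omega, rho, corr_formula(omega, m, pi, 0, 1))
```

The sign of the printed correlation follows the sign of ω, in line with the remark above.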

3. Likelihood based inferences for MZIB model

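In this section, we consider statistical inference for the MZIB model. Since the current research focuses on inference about zero-inflation in multivariate proportional data with excessive zeros, it is assumed in what follows that the zero-inflation parameter $\omega$ lies in the interval $[0,1)$ and that all statistical inference on the MZIB model is based on the stochastic representation (1).

Let $Y_1,\dots,Y_n$ be independent random vectors, where $Y_i=(Y_{i1},\dots,Y_{iq})^\top$, $i=1,2,\dots,n$, follows the q-dimensional ZIB distribution $\mathrm{MZIB}_q(\omega_i,m_i,\pi_i)$. Here, for $i=1,\dots,n$, $m_i=(m_{i1},\dots,m_{iq})^\top$ are the known vectors of binomial denominators, $\pi_i=(\pi_{i1},\dots,\pi_{iq})^\top$ are the unknown vectors of binomial probabilities and $\omega_i$ are the unknown zero-inflation parameters. Now suppose $y_i=(y_{i1},\dots,y_{iq})^\top$ is the realization of the random vector $Y_i$; then the observed data and associated binomial denominators are represented by $y_{obs}=\{y_1,\dots,y_n\}$ and $m_{obs}=\{m_1,\dots,m_n\}$. Furthermore, for convenience, let $y_{\cdot r}=\sum_{i=1}^n y_{ir}$ and $m_{\cdot r}=\sum_{i=1}^n m_{ir}$ for $r=1,\dots,q$, and $y_{i\cdot}=\sum_{r=1}^q y_{ir}$ and $m_{i\cdot}=\sum_{r=1}^q m_{ir}$ for $i=1,\dots,n$. Based on the joint probability mass function of Y given by (2), the likelihood function for the parameters $(\omega,\pi)=(\omega_1,\dots,\omega_n,\pi_1,\dots,\pi_n)$ is

$$L(\omega,\pi\mid y_{obs})=\prod_{i=1}^n\left[\omega_i+(1-\omega_i)\prod_{r=1}^q(1-\pi_{ir})^{m_{ir}}\right]^{I(y_i=0)}\times\left[(1-\omega_i)\prod_{r=1}^q\binom{m_{ir}}{y_{ir}}\pi_{ir}^{y_{ir}}(1-\pi_{ir})^{m_{ir}-y_{ir}}\right]^{I(y_i\neq0)}.$$

By reparameterization, let $\gamma_i=\omega_i/(1-\omega_i)$. Then the likelihood function for $(\gamma,\pi)=(\gamma_1,\dots,\gamma_n,\pi_1,\dots,\pi_n)$ is

$$L(\gamma,\pi\mid y_{obs})=\prod_{i=1}^n\left[\frac{\gamma_i}{1+\gamma_i}+\left(1-\frac{\gamma_i}{1+\gamma_i}\right)\prod_{r=1}^q(1-\pi_{ir})^{m_{ir}}\right]^{I(y_i=0)}\times\left[\left(1-\frac{\gamma_i}{1+\gamma_i}\right)\prod_{r=1}^q\binom{m_{ir}}{y_{ir}}\pi_{ir}^{y_{ir}}(1-\pi_{ir})^{m_{ir}-y_{ir}}\right]^{I(y_i\neq0)} \tag{4}$$

so that the log-likelihood function is

$$\ell(\gamma,\pi\mid y_{obs})=-\sum_{i=1}^n\ln(1+\gamma_i)+\sum_{i=1}^n\ln\left[\gamma_i+\prod_{r=1}^q(1-\pi_{ir})^{m_{ir}}\right]I(y_i=0)+\sum_{i=1}^n\sum_{r=1}^q\left[\ln\binom{m_{ir}}{y_{ir}}+y_{ir}\ln\frac{\pi_{ir}}{1-\pi_{ir}}+m_{ir}\ln(1-\pi_{ir})\right]I(y_i\neq0). \tag{5}$$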

3.1. MLEs of parameter for MZIB model without covariates

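Based on the discussion above, we first derive the maximum likelihood estimates of the parameters for the MZIB model without covariates. In this case, the zero-inflation parameters $\gamma_i$ and the probability vectors $\pi_i$ are common to all observations, that is, $\gamma_i=\gamma$ and $\pi_i=\pi=(\pi_1,\dots,\pi_q)^\top$. Hence the log-likelihood (5) simplifies to

$$\ell(\gamma,\pi\mid y_{obs})=-n\ln(1+\gamma)+\sum_{i=1}^n\ln\left[\gamma+\prod_{r=1}^q(1-\pi_r)^{m_{ir}}\right]I(y_i=0)+\sum_{i=1}^n\sum_{r=1}^q\left[\ln\binom{m_{ir}}{y_{ir}}+y_{ir}\ln\frac{\pi_r}{1-\pi_r}+m_{ir}\ln(1-\pi_r)\right]I(y_i\neq0). \tag{6}$$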

3.1.1. MLEs of parameters via Fisher scoring algorithm

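In this subsection, the Fisher scoring algorithm is derived to calculate the MLEs of the parameters $\theta=(\gamma,\pi_1,\dots,\pi_q)^\top=(\gamma,\pi^\top)^\top$, where $\gamma=\omega/(1-\omega)$ and $\pi=(\pi_1,\dots,\pi_q)^\top$. Fisher scoring is a common method for computing maximum likelihood estimates. Compared with the EM algorithm, it can be more stable, even in multi-parameter cases. In addition, the expected Fisher information matrix is always positive definite when the model is not over-parameterized (Lauritzen [16]). However, Fisher scoring requires more involved calculations than the EM algorithm because the expected Fisher information matrix has to be derived, and this matrix may not be tractable for complicated models. Since estimation under the multivariate zero-inflated binomial distribution is a multi-parameter problem, the Fisher scoring algorithm is worth studying. Based on equation (6), the score vector is

$$\nabla\ell(\theta\mid y_{obs})=\left(\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\gamma},\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\pi_1},\dots,\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\pi_q}\right)^\top,$$

where

$$\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\gamma}=-\frac{n}{1+\gamma}+\sum_{i=1}^n\frac{I(y_i=0)}{\gamma+\prod_{s=1}^q(1-\pi_s)^{m_{is}}}, \tag{7}$$

$$\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\pi_r}=\sum_{i=1}^n\left\{\frac{m_{ir}\,\gamma\,I(y_i=0)}{(1-\pi_r)\left(\gamma+\prod_{s=1}^q(1-\pi_s)^{m_{is}}\right)}+\frac{y_{ir}-m_{ir}\pi_r}{\pi_r(1-\pi_r)}\right\} \tag{8}$$

for $r=1,\dots,q$. In order to apply the Fisher scoring algorithm, the Hessian matrix is needed first. Writing $\ell=\ell(\gamma,\pi\mid y_{obs})$, it is

$$\nabla^2\ell(\theta\mid y_{obs})=\nabla^2\ell(\gamma,\pi\mid y_{obs})=\begin{pmatrix}\dfrac{\partial^2\ell}{\partial\gamma^2} & \dfrac{\partial^2\ell}{\partial\gamma\,\partial\pi_1} & \cdots & \dfrac{\partial^2\ell}{\partial\gamma\,\partial\pi_q}\\ \dfrac{\partial^2\ell}{\partial\pi_1\,\partial\gamma} & \dfrac{\partial^2\ell}{\partial\pi_1^2} & \cdots & \dfrac{\partial^2\ell}{\partial\pi_1\,\partial\pi_q}\\ \vdots & \vdots & \ddots & \vdots\\ \dfrac{\partial^2\ell}{\partial\pi_q\,\partial\gamma} & \dfrac{\partial^2\ell}{\partial\pi_q\,\partial\pi_1} & \cdots & \dfrac{\partial^2\ell}{\partial\pi_q^2}\end{pmatrix}. \tag{9}$$

Then, the Fisher information matrix $J(\theta)=E\{-\nabla^2\ell(\gamma,\pi\mid y_{obs})\}$ is

$$J(\theta)=J(\gamma,\pi)=\begin{pmatrix}J_{\gamma\gamma} & J_{\gamma\pi}\\ J_{\pi\gamma} & J_{\pi\pi}\end{pmatrix}.$$

The derivation of the formulas for $J_{\gamma\gamma}$, $J_{\gamma\pi}$, $J_{\pi\gamma}$ and $J_{\pi\pi}$ is given in the supplemental file.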
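As a sanity check on the score equations (7) and (8), the sketch below compares them with central finite differences of the log-likelihood (6). It uses only the standard library. The helper names and the small data set are invented for illustration, and the expected information matrix from the supplemental file is not reproduced here, so this is a check of the score only and not a full Fisher scoring implementation.

```python
import math

def p_zero(pi, m):
    """prod_s (1 - pi_s)^{m_is}: probability that the base vector is zero."""
    return math.prod((1.0 - p) ** mr for p, mr in zip(pi, m))

def loglik(gamma, pi, ys, ms):
    """Log-likelihood (6) for common gamma and pi."""
    total = -len(ys) * math.log1p(gamma)
    for y, m in zip(ys, ms):
        if not any(y):
            total += math.log(gamma + p_zero(pi, m))
        else:
            total += sum(math.log(math.comb(mr, yr))
                         + yr * math.log(p / (1.0 - p))
                         + mr * math.log(1.0 - p)
                         for yr, mr, p in zip(y, m, pi))
    return total

def score(gamma, pi, ys, ms):
    """Score vector from (7) and (8), ordered as (gamma, pi_1, ..., pi_q)."""
    u_gamma = -len(ys) / (1.0 + gamma)
    u_pi = [0.0] * len(pi)
    for y, m in zip(ys, ms):
        zero = not any(y)
        denom = gamma + p_zero(pi, m)
        if zero:
            u_gamma += 1.0 / denom
        for r, p in enumerate(pi):
            if zero:
                u_pi[r] += m[r] * gamma / ((1.0 - p) * denom)
            u_pi[r] += (y[r] - m[r] * p) / (p * (1.0 - p))
    return [u_gamma] + u_pi

def numerical_score(gamma, pi, ys, ms, h=1e-6):
    """Central finite-difference gradient of the log-likelihood."""
    theta = [gamma] + list(pi)
    grad = []
    for k in range(len(theta)):
        up, dn = theta[:], theta[:]
        up[k] += h
        dn[k] -= h
        grad.append((loglik(up[0], up[1:], ys, ms)
                     - loglik(dn[0], dn[1:], ys, ms)) / (2.0 * h))
    return grad

# Toy data (made up for this sketch): responses and binomial denominators.
ys = [(0, 0), (1, 2), (0, 0), (2, 0), (0, 1), (0, 0)]
ms = [(6, 10), (5, 8), (6, 10), (7, 9), (6, 10), (4, 8)]
analytic = score(0.25, [0.2, 0.1], ys, ms)
numeric = numerical_score(0.25, [0.2, 0.1], ys, ms)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_gap)
```

Note that the second term of (8) carries no indicator: for an all-zero observation it combines with the first term to give the derivative of the zero-mass term in (6).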

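Now let $\theta^{(0)}$ be the initial value for the MLE $\hat\theta$ of $\theta=(\gamma,\pi^\top)^\top$. If $\theta^{(t)}$ denotes the $t$th approximation of $\hat\theta$, then the $(t+1)$th approximation can be computed by

$$\theta^{(t+1)}=\theta^{(t)}+J^{-1}(\theta^{(t)})\,\nabla\ell(\theta^{(t)}\mid y_{obs}). \tag{10}$$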

3.1.2. MLEs of parameters via the EM algorithm

In this subsection, we develop the EM algorithm to compute the MLEs of the parameters in the proposed MZIB model. Although the Fisher scoring algorithm possesses quadratic convergence, it does not guarantee that the MLEs of $\omega$ and $\pi_r$, $r=1,2,\dots,q$, lie in the unit interval $[0,1)$. When the initial value $(\gamma^{(0)},\pi^{(0)})$ of the Fisher scoring algorithm is sufficiently near $(\hat\gamma,\hat\pi)$, the iterations converge very fast. However, the algorithm is sensitive to initial values under the MZIB distribution: when the chosen initial value $(\gamma^{(0)},\pi^{(0)})$ is far from $(\hat\gamma,\hat\pi)$, the iterations might not converge. Therefore, the expectation-maximization (EM) algorithm is given for the calculation of the MLEs in the MZIB model.

The EM algorithm is a popular tool for computing maximum likelihood estimates in statistical models with latent variables by iterating between an E-step and an M-step. The E-step computes the conditional expectation of the complete-data log-likelihood given the observed data and the current parameter values. The M-step computes the parameter values that maximize the expected log-likelihood found in the E-step. These updated parameters are then used to recompute the expectations of the unobserved latent variables in the next E-step.

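For each $y_i=(y_{i1},\dots,y_{iq})^\top$ with $i=1,2,\dots,n$, based on (1) we introduce independent latent variables

$$Z_i\overset{\text{iid}}{\sim}\mathrm{Bernoulli}(1-\omega),\qquad X_{ir}\overset{\text{ind}}{\sim}\mathrm{Binomial}(m_{ir},\pi_r) \tag{11}$$

for $r=1,2,\dots,q$. We denote the latent/missing data by $Y_{mis}=\{z_i,\{x_{ir}\}_{r=1}^q\}_{i=1}^n$ and the complete data by $Y_{com}=\{Y_{obs},Y_{mis}\}=Y_{mis}$ (the observed data are determined by the latent data through $y_i=z_ix_i$), where $z_i$ and $x_{ir}$ are the realizations of $Z_i$ and $X_{ir}$, respectively. Thus, the complete-data likelihood function is given by

$$L(\theta\mid Y_{com})=\prod_{i=1}^n\left[\omega^{1-z_i}(1-\omega)^{z_i}\prod_{r=1}^q\binom{m_{ir}}{x_{ir}}\pi_r^{x_{ir}}(1-\pi_r)^{m_{ir}-x_{ir}}\right]$$

and the complete-data log-likelihood function is, up to an additive constant,

$$\ell(\theta\mid Y_{com})\propto\sum_{i=1}^n\left[(1-z_i)\log\omega+z_i\log(1-\omega)\right]+\sum_{i=1}^n\sum_{r=1}^q\left[x_{ir}\log\frac{\pi_r}{1-\pi_r}+m_{ir}\log(1-\pi_r)\right].$$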

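The M-step is to calculate the complete-data MLEs, which are given by

$$\omega=1-\frac{1}{n}\sum_{i=1}^n z_i, \tag{12}$$

$$\pi_r=\frac{\sum_{i=1}^n x_{ir}}{\sum_{i=1}^n m_{ir}}=\frac{x_{\cdot r}}{m_{\cdot r}} \tag{13}$$

for $r=1,2,\dots,q$. The E-step is to replace $\sum_{i=1}^n z_i$ and $\sum_{i=1}^n x_{ir}=x_{\cdot r}$ in (12) and (13) with their conditional expectations:

$$E\left(\sum_{i=1}^n Z_i\,\Big|\,Y_{obs},\theta\right)=n-\omega\sum_{i\in J}a_i^{-1} \tag{14}$$

and

$$E\left(\sum_{i=1}^n X_{ir}\,\Big|\,Y_{obs},\theta\right)=y_{\cdot r}+\sum_{i\in J}\frac{\omega}{a_i}\,m_{ir}\pi_r, \tag{15}$$

where $y_{\cdot r}=\sum_{i=1}^n y_{ir}$, $J=\{i:y_i=0\}$ and

$$a_i=\omega+(1-\omega)\prod_{r=1}^q(1-\pi_r)^{m_{ir}},\qquad i=1,2,\dots,n.$$

Details of the derivation of (14) and (15) are given in A.2 of the supplemental file.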

Note that the latent variables $\{Z_i\}_{i=1}^n$ and $\{\{X_{ir}\}_{r=1}^q\}_{i=1}^n$ introduced in (11) are independent Bernoulli random variables and binomial random variables, respectively. Thus, the left-hand side of (14) must be less than or equal to $n$, and the left-hand side of (15) must be between 0 and $\sum_{i=1}^n m_{ir}=m_{\cdot r}$. In other words, the EM algorithm (12)–(15) guarantees that the MLEs of $\{\omega,\{\pi_r\}_{r=1}^q\}$ fall within the unit interval $[0,1)$, resulting in a clear statistical interpretation for these parameters in the distribution. This is an advantage of the EM algorithm over the Fisher scoring algorithm. However, it is worth noting that the EM algorithm is based on the stochastic representation (1), which implicitly assumes that the zero-inflation parameter $\omega$ is in the unit interval $[0,1)$. It does not work for the case of zero-deflation.
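The iteration defined by (12)–(15) can be sketched in a few lines of standard-library Python. The function name `em_mzib`, the starting values, the stopping rule and the simulated data below are all assumptions made for illustration; they are not taken from the paper.

```python
import math
import random

def em_mzib(ys, ms, omega=0.5, pi=None, tol=1e-10, max_iter=10000):
    """EM iterations (12)-(15) for the MZIB model without covariates."""
    n, q = len(ys), len(ys[0])
    pi = list(pi) if pi else [0.5] * q
    y_col = [sum(y[r] for y in ys) for r in range(q)]   # column totals of y
    m_col = [sum(m[r] for m in ms) for r in range(q)]   # column totals of m
    zero_rows = [m for y, m in zip(ys, ms) if not any(y)]  # index set J
    for _ in range(max_iter):
        # E-step: w_i = omega / a_i for the all-zero observations.
        w = [omega / (omega + (1.0 - omega)
                      * math.prod((1.0 - p) ** mr for p, mr in zip(pi, m)))
             for m in zero_rows]
        ez = n - sum(w)                                          # (14)
        ex = [y_col[r] + sum(wi * m[r] * pi[r]
                             for wi, m in zip(w, zero_rows))
              for r in range(q)]                                 # (15)
        # M-step: complete-data MLEs.
        new_omega = 1.0 - ez / n                                 # (12)
        new_pi = [ex[r] / m_col[r] for r in range(q)]            # (13)
        change = max([abs(new_omega - omega)]
                     + [abs(a - b) for a, b in zip(new_pi, pi)])
        omega, pi = new_omega, new_pi
        if change < tol:
            break
    return omega, pi

# Simulated data via the stochastic representation (1); values are illustrative.
random.seed(7)
true_omega, m, true_pi = 0.2, (6, 10), (0.2, 0.1)
ys, ms = [], []
for _ in range(500):
    z = random.random() < 1.0 - true_omega
    x = tuple(sum(random.random() < p for _ in range(mr))
              for mr, p in zip(m, true_pi))
    ys.append(x if z else (0,) * len(m))
    ms.append(m)

omega_hat, pi_hat = em_mzib(ys, ms)
print(omega_hat, pi_hat)
```

Because each update is a ratio of a bounded conditional expectation to its maximum, the iterates stay inside the unit interval, as argued above.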

Now let $(\hat\omega,\hat\pi_1,\dots,\hat\pi_q)$ denote the MLEs of $(\omega,\pi_1,\dots,\pi_q)$ obtained via the EM algorithm (12)–(15). Wald-type confidence intervals for the parameters can be obtained from the square roots of the diagonal elements of the estimated inverse Fisher information matrix $[J(\hat\omega,\hat\pi_1,\dots,\hat\pi_q)]^{-1}$. However, the zero-inflation parameter $\omega$ and the binomial probabilities $\pi_1,\pi_2,\dots,\pi_q$ are restricted to the unit interval, and some upper (or lower) limits of these confidence intervals may be larger (or less) than 1 (or 0), resulting in useless confidence intervals. Instead of the Wald-type method, the bootstrap approach can be used to compute a bootstrap confidence interval for any component of $(\omega,\pi_1,\dots,\pi_q)$. First, an independent sample $y_1^*,y_2^*,\dots,y_n^*$ is generated from the distribution $\mathrm{MZIB}_q(\hat\omega,m,\hat\pi)$, where $\hat\omega$ and $\hat\pi$ are the MLEs of $\omega$ and $\pi$ based on the original sample. Then, based on the generated sample $y_1^*,y_2^*,\dots,y_n^*$, the MLE $(\hat\omega^*,\hat\pi_1^*,\dots,\hat\pi_q^*)$ of $(\omega,\pi_1,\dots,\pi_q)$ is calculated. Independently repeating this procedure $G$ times gives $G$ MLEs $\{(\hat\omega_g^*,\hat\pi_{1g}^*,\dots,\hat\pi_{qg}^*)\}_{g=1}^G$ of $(\omega,\pi_1,\dots,\pi_q)$, and thus the confidence intervals of $(\omega,\pi_1,\dots,\pi_q)$ can be constructed as $[\omega_L,\omega_U],[\pi_{1L},\pi_{1U}],\dots,[\pi_{qL},\pi_{qU}]$, where $\omega_L,\pi_{1L},\dots,\pi_{qL}$ and $\omega_U,\pi_{1U},\dots,\pi_{qU}$ are the $100(\alpha/2)\%$ and $100(1-\alpha/2)\%$ percentiles of $\{(\hat\omega_g^*,\hat\pi_{1g}^*,\dots,\hat\pi_{qg}^*)\}_{g=1}^G$, respectively.

3.2. MLEs of parameters for MZIB model with covariates

3.2.1. The formulation of MZIB model with covariates

Again, let $Y_1,\dots,Y_n$ be independent random vectors, where $Y_i=(Y_{i1},\dots,Y_{iq})^\top$, $i=1,2,\dots,n$, follows the q-dimensional ZIB distribution $\mathrm{MZIB}_q(\omega_i,m_i,\pi_i)$. Here, for $i=1,\dots,n$, $m_i=(m_{i1},\dots,m_{iq})^\top$ are the known vectors of binomial denominators, $\pi_i=(\pi_{i1},\dots,\pi_{iq})^\top$ are the unknown vectors of binomial probabilities and $\omega_i$ are the unknown zero-inflation parameters. Further, let $v_i$ and $u_i$ be the covariates associated with the zero-inflation parameters $\omega_i$ and the binomial probabilities $\pi_i=(\pi_{i1},\dots,\pi_{iq})^\top$ ($i=1,2,\dots,n$), respectively. Now suppose $y_i=(y_{i1},\dots,y_{iq})^\top$ is the realization of the random vector $Y_i$; then the observed data and associated binomial denominators are represented by $y_{obs}=\{y_1,\dots,y_n\}$ and $m_{obs}=\{m_1,\dots,m_n\}$. To investigate the relationship between the parameters $(\omega,\pi)$ and the covariates $v_i$ and $u_i$, we consider the following regression model:

$$y_i=(y_{i1},\dots,y_{iq})^\top\sim\mathrm{MZIB}(\omega_i,\pi_i,m_i),\quad i=1,2,\dots,n,\qquad \log\frac{\omega_i}{1-\omega_i}=v_i^\top\alpha,\qquad \log\frac{\pi_{ir}}{1-\pi_{ir}}=u_i^\top\beta_r,\quad r=1,2,\dots,q,$$

where $v_i=(1,v_{1i},\dots,v_{pi})^\top$ and $u_i=(1,u_{1i},\dots,u_{pi})^\top$ are not necessarily identical covariate vectors associated with subject $i$, and $\alpha=(\alpha_0,\alpha_1,\dots,\alpha_p)^\top$ and $\beta_r=(\beta_{r0},\beta_{r1},\dots,\beta_{rp})^\top$ are the corresponding regression coefficients. The primary purpose of this section is to estimate the parameter vector $\theta=(\alpha^\top,\beta_1^\top,\dots,\beta_q^\top)^\top$.

3.2.2. MLEs via the EM algorithm embedded with Fisher scoring algorithms at each M-step

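The complete-data likelihood function in Section 3.1.2 now becomes

$$L(\theta\mid Y_{com})=\prod_{i=1}^n\left[\omega_i^{1-z_i}(1-\omega_i)^{z_i}\prod_{r=1}^q\binom{m_{ir}}{x_{ir}}\pi_{ir}^{x_{ir}}(1-\pi_{ir})^{m_{ir}-x_{ir}}\right]$$

and the complete-data log-likelihood function is, up to an additive constant,

$$\ell(\theta\mid Y_{com})\propto\sum_{i=1}^n\left[(1-z_i)v_i^\top\alpha-\log\left(1+e^{v_i^\top\alpha}\right)\right]+\sum_{i=1}^n\sum_{r=1}^q\left[x_{ir}u_i^\top\beta_r-m_{ir}\log\left(1+e^{u_i^\top\beta_r}\right)\right].$$

The first and negative second partial derivatives of the complete-data log-likelihood function are given by

$$\begin{aligned}
\frac{\partial\ell(\theta\mid Y_{com})}{\partial\alpha}&=\sum_{i=1}^n(1-z_i-\omega_i)v_i=v^\top(1-z-\omega),\\
\frac{\partial\ell(\theta\mid Y_{com})}{\partial\beta_r}&=\sum_{i=1}^n(x_{ir}-m_{ir}\pi_{ir})u_i=u^\top(x_r-m_r\pi_r),\\
-\frac{\partial^2\ell(\theta\mid Y_{com})}{\partial\alpha\,\partial\alpha^\top}&=\sum_{i=1}^n\omega_i(1-\omega_i)v_iv_i^\top=v^\top\mathrm{diag}[\omega(1-\omega)]\,v\triangleq J_{com}(\alpha),\\
-\frac{\partial^2\ell(\theta\mid Y_{com})}{\partial\beta_r\,\partial\beta_r^\top}&=\sum_{i=1}^n m_{ir}\pi_{ir}(1-\pi_{ir})u_iu_i^\top=u^\top\mathrm{diag}[m_r\pi_r(1-\pi_r)]\,u\triangleq J_{com}(\beta_r),
\end{aligned}$$

where $1=(1,\dots,1)^\top$, $v=(v_1,\dots,v_n)^\top$, $u=(u_1,\dots,u_n)^\top$, $z=(z_1,\dots,z_n)^\top$, $\omega=(\omega_1,\dots,\omega_n)^\top$, $x_r=(x_{1r},\dots,x_{nr})^\top$, $m_r=(m_{1r},\dots,m_{nr})^\top$, $\pi_r=(\pi_{1r},\dots,\pi_{nr})^\top$, products of vectors are taken componentwise, $\mathrm{diag}[\omega(1-\omega)]=\mathrm{diag}[\omega_1(1-\omega_1),\dots,\omega_n(1-\omega_n)]$ and $\mathrm{diag}[m_r\pi_r(1-\pi_r)]=\mathrm{diag}[m_{1r}\pi_{1r}(1-\pi_{1r}),\dots,m_{nr}\pi_{nr}(1-\pi_{nr})]$. Note that $J_{com}(\alpha)$ is actually the complete-data Fisher information matrix associated only with the parameter vector $\alpha$ and the covariate matrix $v$, and $J_{com}(\beta_r)$ is actually the complete-data Fisher information matrix associated only with the parameter vector $\beta_r$ ($r=1,2,\dots,q$) and the covariate matrix $u$, since they depend on neither the observed responses nor the latent/missing data.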

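Now, the M-step is to separately calculate the MLEs of $\alpha$ and $\beta_r$ via two Fisher scoring algorithms as follows:

$$\alpha^{(t+1)}=\alpha^{(t)}+J_{com}^{-1}(\alpha^{(t)})\,v^\top\left(1-z-\omega(\alpha^{(t)})\right), \tag{16}$$

$$\beta_r^{(t+1)}=\beta_r^{(t)}+J_{com}^{-1}(\beta_r^{(t)})\,u^\top\left(x_r-m_r\pi_r(\beta_r^{(t)})\right),\qquad r=1,2,\dots,q. \tag{17}$$

The E-step is to replace the latent variables $z$ and $\{x_r\}_{r=1}^q$ in (16) and (17) by their conditional expectations:

$$E(Z\mid Y_{obs},\theta)=1-\mathrm{diag}\left(\frac{\omega}{a}\right)I_0 \tag{18}$$

and

$$E(X_r\mid Y_{obs},\theta)=y_r+\mathrm{diag}\left(\frac{\omega}{a}m_r\pi_r\right)I_0, \tag{19}$$

where $y_r=(y_{1r},\dots,y_{nr})^\top$, $I_0=(I\{y_1=0\},\dots,I\{y_n=0\})^\top$, $\mathrm{diag}(\omega/a)=\mathrm{diag}(\omega_1/a_1,\dots,\omega_n/a_n)$ and $\mathrm{diag}(\omega m_r\pi_r/a)=\mathrm{diag}(\omega_1m_{1r}\pi_{1r}/a_1,\dots,\omega_nm_{nr}\pi_{nr}/a_n)$, with

$$a_i=a_i(\omega_i,\pi_i)=\omega_i+(1-\omega_i)\prod_{r=1}^q(1-\pi_{ir})^{m_{ir}},\qquad i=1,2,\dots,n.$$

Note that (18) and (19) can be derived in the same way as (14) and (15). Now, let $\hat\alpha$ and $\hat\beta_r$, $r=1,\dots,q$, be the estimates of the parameters $\alpha$ and $\beta_r$, $r=1,2,\dots,q$, respectively. Then the asymptotic covariance matrices for $\hat\alpha$ and $\hat\beta_r$, $r=1,\dots,q$, can be obtained as $\widehat{\mathrm{cov}}(\hat\alpha)=J_{com}^{-1}(\hat\alpha)$ and $\widehat{\mathrm{cov}}(\hat\beta_r)=J_{com}^{-1}(\hat\beta_r)$, $r=1,2,\dots,q$, and thus the corresponding confidence intervals for the components of $\theta$ can be constructed by using the Wald-type method.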

3.3. Hypothesis testing in MZIB model without covariates

In what follows, based on likelihood methods, we derive the score test statistics $S_1$ and $S_3$ and the likelihood ratio test statistics $S_2$ and $S_4$. $S_1$ and $S_2$ are used to test for the presence of zero-inflation in the multivariate binomial model, and $S_3$ and $S_4$ are used to test the equality of probabilities for all components in the multivariate zero-inflated binomial model. Under the marginal model, the corresponding two-sided hypotheses are (i) $H_0:\gamma=0$ versus $H_a:\gamma\neq0$; (ii) $H_0:\pi_1=\cdots=\pi_q$ versus $H_a:\pi_r\neq\pi_s$ for at least one pair $r\neq s$.

3.3.1. Tests for zero-inflation in MZIB model

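We should first test for the presence of zero-inflation in the multivariate binomial data before the multivariate zero-inflated binomial model is used to fit such data. Based on the score test method, the test statistic for testing the hypotheses $H_0:\gamma=0$ versus $H_a:\gamma\neq0$ is given by

$$S_1=\frac{\left[\sum_{i=1}^n\left\{\prod_{r=1}^q m_{\cdot r}^{\,m_{ir}}(m_{\cdot r}-y_{\cdot r})^{-m_{ir}}I(y_i=0)-1\right\}\right]^2}{\sum_{i=1}^n\left\{\prod_{r=1}^q m_{\cdot r}^{\,m_{ir}}(m_{\cdot r}-y_{\cdot r})^{-m_{ir}}-1\right\}-\sum_{r=1}^q y_{\cdot r}m_{\cdot r}(m_{\cdot r}-y_{\cdot r})^{-1}}, \tag{20}$$

where $y_{\cdot r}=\sum_{i=1}^n y_{ir}$ and $m_{\cdot r}=\sum_{i=1}^n m_{ir}$. The details of the derivation of $S_1$ are given in A.3 of the supplemental file. Under the null hypothesis $H_0:\gamma=0$, the test statistic $S_1$ has approximately a $\chi^2$ distribution with one degree of freedom. The corresponding p-value is given by

$$p_1=\Pr(S_1>s_1\mid H_0)=\Pr\{\chi^2(1)>s_1\}, \tag{21}$$

where $s_1$ is the observed value of $S_1$. When $p_1<\alpha$, we reject the null hypothesis $H_0$ at the $\alpha$ level of significance. Otherwise, we fail to reject $H_0$.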
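Because the score statistic has a closed form, it is straightforward to compute. The sketch below evaluates it under the reading that the numerator is the squared score at γ = 0 and the denominator is the corresponding efficient information, both evaluated at the null estimates of the binomial probabilities (column totals of successes over column totals of denominators). The function names and the toy data are hypothetical, and the one-degree-of-freedom chi-squared tail probability is obtained from the standard identity Pr{χ²(1) > s} = erfc(√(s/2)).

```python
import math

def chi2_sf_1df(s):
    """Upper tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(s / 2.0))

def score_test_zero_inflation(ys, ms):
    """Score statistic for H0: gamma = 0 and its approximate p-value."""
    q = len(ys[0])
    y_col = [sum(y[r] for y in ys) for r in range(q)]
    m_col = [sum(m[r] for m in ms) for r in range(q)]
    # Under H0, 1 - pi_hat_r = (m_col[r] - y_col[r]) / m_col[r].
    one_minus = [(m_col[r] - y_col[r]) / m_col[r] for r in range(q)]
    # inv_p0[i] = prod_r (1 - pi_hat_r)^(-m_ir)
    inv_p0 = [math.prod(one_minus[r] ** (-m[r]) for r in range(q)) for m in ms]
    numer = sum((ip if not any(y) else 0.0) - 1.0
                for ip, y in zip(inv_p0, ys))
    denom = (sum(ip - 1.0 for ip in inv_p0)
             - sum(y_col[r] * m_col[r] / (m_col[r] - y_col[r])
                   for r in range(q)))
    s1 = numer**2 / denom
    return s1, chi2_sf_1df(s1)

# Toy data (made up): half of the 20 observations are all-zero vectors.
nonzero = [(2, 1), (1, 3), (3, 2), (1, 1), (2, 2),
           (1, 2), (2, 1), (3, 1), (1, 2), (2, 3)]
ys = [(0, 0)] * 10 + nonzero
ms = [(6, 10)] * 20
s1, p1 = score_test_zero_inflation(ys, ms)
print(s1, p1)
```

With this many all-zero vectors relative to what independent binomials would produce, the statistic is large and the null hypothesis of no zero-inflation is rejected at any conventional level.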

For the purpose of comparison, we also give the likelihood ratio test (LRT) for testing zero-inflation in the MZIB model. The LRT statistic has the following form:

$$S_2=-2\left\{\ell(0,\hat\pi_0\mid y_{obs})-\ell(\hat\gamma,\hat\pi\mid y_{obs})\right\},$$

where $\hat\pi_0=(y_{\cdot1}/m_{\cdot1},\dots,y_{\cdot q}/m_{\cdot q})^\top$, with $y_{\cdot r}=\sum_{i=1}^n y_{ir}$ and $m_{\cdot r}=\sum_{i=1}^n m_{ir}$, is the MLE of $\pi$ under the null hypothesis, and $(\hat\gamma,\hat\pi)$ is the unconstrained MLE of $(\gamma,\pi)$, which can be obtained via the Fisher scoring algorithm or the EM algorithm given in Sections 3.1.1 and 3.1.2.

Under $H_0$, the LRT statistic $S_2$ has approximately a chi-squared distribution with one degree of freedom, and the corresponding p-value can be computed as

$$p_2=\Pr(S_2>s_2\mid H_0)=\Pr\{\chi^2(1)>s_2\},$$

where $s_2$ is the observed value of $S_2$.

Note that the advantage of the score test is that the parameters are estimated only under the null hypothesis, not under the alternative, and thus the score test statistic $S_1$ has a closed form, which makes it easy to compute and apply. However, as one can see from the simulation results in Section 4, the score test has limitations in applications even when the dimension of the binomial random vector is moderate. Therefore, the score test is recommended for testing zero-inflation in the MZIB model only for lower dimensions.

3.3.2. Tests for the equality of probabilities in MZIB model

From Section 2, we know that there exists a correlation between any two components in the multivariate zero-inflated binomial model. Another question of interest for the multivariate binomial model is whether all or some of the components of the multivariate binomial distribution share the same binomial probability. Therefore, we develop an approach to testing the equality of probabilities for all or some of the components in the multivariate zero-inflated binomial model. The null and alternative hypotheses to be tested are

$$H_0:\pi_1=\cdots=\pi_q\quad\text{versus}\quad H_a:\pi_r\neq\pi_s\ \text{for at least one pair}\ r\neq s. \tag{22}$$

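As in Section 3.3.1, a score test statistic can be developed for testing the equality of probabilities for all components in the multivariate binomial variable. It has the following form:

$$S_3=U(\hat\gamma,\hat\pi)^\top I^{-1}(\hat\gamma,\hat\pi)\,U(\hat\gamma,\hat\pi),$$

where $\hat\gamma$ and $\hat\pi$ are the MLEs of $\gamma$ and $\pi$, respectively, under the null hypothesis $H_0:\pi_1=\cdots=\pi_q$. They can be obtained via the Fisher scoring algorithm from the following maximum likelihood equations under the null hypothesis, in which $\pi$ denotes the common probability, $y_{i\cdot}=\sum_{r=1}^q y_{ir}$ and $m_{i\cdot}=\sum_{r=1}^q m_{ir}$:

$$\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\gamma}=-\frac{n}{1+\gamma}+\sum_{i=1}^n\frac{I(y_i=0)}{\gamma+(1-\pi)^{m_{i\cdot}}}=0, \tag{23}$$

$$\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\pi}=\sum_{i=1}^n\left\{\frac{m_{i\cdot}\,\gamma\,I(y_i=0)}{(1-\pi)\left(\gamma+(1-\pi)^{m_{i\cdot}}\right)}+\frac{y_{i\cdot}-m_{i\cdot}\pi}{\pi(1-\pi)}\right\}=0 \tag{24}$$

or via the EM algorithm, which can be derived in a way similar to that given in Section 3.1.2.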

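Furthermore, the score function $U(\gamma,\pi)$ of $(\gamma,\pi)$ has the form

$$U(\gamma,\pi)=\left(\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\gamma},\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\pi_1},\dots,\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\pi_q}\right)^\top,$$

with

$$\frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\gamma}\bigg|_{\pi_r=\pi}=-\frac{n}{1+\gamma}+\sum_{i=1}^n\frac{I(y_i=0)}{\gamma+(1-\pi)^{m_{i\cdot}}},\qquad \frac{\partial\ell(\gamma,\pi\mid y_{obs})}{\partial\pi_r}\bigg|_{\pi_r=\pi}=\sum_{i=1}^n\left\{\frac{m_{ir}\,\gamma\,I(y_i=0)}{(1-\pi)\left(\gamma+(1-\pi)^{m_{i\cdot}}\right)}+\frac{y_{ir}-m_{ir}\pi}{\pi(1-\pi)}\right\}$$

for $r=1,\dots,q$, where $m_{i\cdot}=\sum_{r=1}^q m_{ir}$. The expected information matrix $I(\gamma,\pi)$ with parameters $\gamma$ and $\pi=(\pi_1,\dots,\pi_q)^\top$ is

$$I(\gamma,\pi)=E\left(-\nabla^2\ell(\theta\mid y_{obs})\right)=\begin{pmatrix}E\left(-\dfrac{\partial^2\ell}{\partial\gamma^2}\right) & E\left(-\dfrac{\partial^2\ell}{\partial\gamma\,\partial\pi_1}\right) & \cdots & E\left(-\dfrac{\partial^2\ell}{\partial\gamma\,\partial\pi_q}\right)\\ E\left(-\dfrac{\partial^2\ell}{\partial\pi_1\,\partial\gamma}\right) & E\left(-\dfrac{\partial^2\ell}{\partial\pi_1^2}\right) & \cdots & E\left(-\dfrac{\partial^2\ell}{\partial\pi_1\,\partial\pi_q}\right)\\ \vdots & \vdots & \ddots & \vdots\\ E\left(-\dfrac{\partial^2\ell}{\partial\pi_q\,\partial\gamma}\right) & E\left(-\dfrac{\partial^2\ell}{\partial\pi_q\,\partial\pi_1}\right) & \cdots & E\left(-\dfrac{\partial^2\ell}{\partial\pi_q^2}\right)\end{pmatrix},$$

where, under the null hypothesis $\pi_1=\pi_2=\cdots=\pi_q=\pi$,

$$\begin{aligned}
E\left(-\frac{\partial^2\ell}{\partial\gamma^2}\right)\bigg|_{\pi_r=\pi}&=\sum_{i=1}^n\frac{1-(1-\pi)^{m_{i\cdot}}}{\left(\gamma+(1-\pi)^{m_{i\cdot}}\right)(1+\gamma)^2},\\
E\left(-\frac{\partial^2\ell}{\partial\gamma\,\partial\pi_r}\right)\bigg|_{\pi_r=\pi}&=-\sum_{i=1}^n\frac{m_{ir}(1-\pi)^{m_{i\cdot}-1}}{\left(\gamma+(1-\pi)^{m_{i\cdot}}\right)(1+\gamma)},\\
E\left(-\frac{\partial^2\ell}{\partial\pi_r^2}\right)\bigg|_{\pi_r=\pi}&=\sum_{i=1}^n\left\{\frac{m_{ir}}{\pi(1-\pi)(1+\gamma)}-\frac{\gamma\,m_{ir}^2(1-\pi)^{m_{i\cdot}-2}}{\left(\gamma+(1-\pi)^{m_{i\cdot}}\right)(1+\gamma)}\right\}
\end{aligned}$$

and

$$E\left(-\frac{\partial^2\ell}{\partial\pi_r\,\partial\pi_s}\right)\bigg|_{\pi_r=\pi_s=\pi}=-\sum_{i=1}^n\frac{m_{ir}m_{is}(1-\pi)^{m_{i\cdot}-2}\,\gamma}{\left(\gamma+(1-\pi)^{m_{i\cdot}}\right)(1+\gamma)},\qquad r\neq s,$$

and $U(\hat\gamma,\hat\pi)$ and $I(\hat\gamma,\hat\pi)$ are the values of the score function $U(\gamma,\pi)$ and the expected information matrix $I(\gamma,\pi)$ at $\gamma=\hat\gamma$ and $\pi=\hat\pi 1_q$, respectively, that is, with every component of $\pi$ set to the common null estimate $\hat\pi$.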

Under the null hypothesis, the score statistic $S_3$ has approximately a $\chi^2$ distribution with $q-1$ degrees of freedom, and the corresponding p-value can be computed as

$$p_3=\Pr(S_3>s_3\mid H_0)=\Pr\{\chi^2(q-1)>s_3\},$$

where $s_3$ is the observed value of $S_3$. As in Section 3.3.1, the likelihood ratio method can also be used to test the equality of parameters in the MZIB model. The LRT statistic is

$$S_4=-2\left\{\ell(\hat\gamma_0,\hat\pi_0\mid y_{obs})-\ell(\hat\gamma,\hat\pi\mid y_{obs})\right\}, \tag{25}$$

where $\hat\gamma_0,\hat\pi_0$ are the estimates of the parameters $\gamma,\pi$ under the null hypothesis and $\hat\gamma,\hat\pi$ are the estimates of the parameters $\gamma,\pi$ under the alternative hypothesis. Although the MLEs of $\gamma$ and $\pi$ under both hypotheses do not have closed forms, $\hat\gamma_0,\hat\pi_0$ can be computed in the same way as in the derivation of $S_3$, and $\hat\gamma,\hat\pi$ can be calculated via the Fisher scoring algorithm or the EM algorithm given in Sections 3.1.1 and 3.1.2 under the alternative hypothesis. Further, under the null hypothesis, the LRT statistic $S_4$ in (25) follows approximately a $\chi^2$ distribution with $q-1$ degrees of freedom. The corresponding p-value is given by

$$p_4=\Pr(S_4>s_4\mid H_0)=\Pr\{\chi^2(q-1)>s_4\},$$

where $s_4$ is the observed value of $S_4$. Moreover, if the null hypothesis $H_0$ in (22) is rejected, then the following hypotheses can be tested:

$$H_0':\pi_{k_1}=\pi_{k_2}=\cdots=\pi_{k_{q'}}\quad\text{against}\quad H_1':\pi_{k_r}\neq\pi_{k_s}\ \text{for at least one pair}\ r\neq s \tag{26}$$

for the $k_1$th, $k_2$th, …, $k_{q'}$th components ($q'<q$). The likelihood ratio test statistic for testing the hypotheses in (26) is given by

$$S_4'=-2\left\{\ell(\hat\gamma',\hat\pi'\mid y_{obs})-\ell(\hat\gamma,\hat\pi\mid y_{obs})\right\}, \tag{27}$$

where $(\hat\gamma',\hat\pi')$ are the maximum likelihood estimates of $(\gamma,\pi)$ under $H_0':\pi_{k_1}=\pi_{k_2}=\cdots=\pi_{k_{q'}}$ and can be computed via the Fisher scoring algorithm or the EM algorithm. The estimates $(\hat\gamma,\hat\pi)$ are the unconstrained maximum likelihood estimates of $(\gamma,\pi)$ and can be computed via the same algorithms. Moreover, under the null hypothesis $H_0'$, the test statistic $S_4'$ has approximately a $\chi^2$ distribution with $q'-1$ degrees of freedom. The corresponding p-value is

$$p_4'=\Pr(S_4'>s_4'\mid H_0')=\Pr\{\chi^2(q'-1)>s_4'\},$$

where $s_4'$ is the observed value of $S_4'$.

4. Simulation study

In this section, a limited simulation study is carried out to evaluate the performance of the proposed statistical methods in Section 3 for the multivariate ZIB distribution. We first examine the accuracy of point estimates and confidence interval estimates for different parameter settings in the proposed multivariate ZIB models with/without covariates via simulation studies. Next, we establish the validity of four proposed test statistics under a finite sample situation. In terms of the nominal levels and powers, the performance of score test statistics and LRT statistics for the presence of zero-inflation and the equality of probabilities for all components in the multivariate ZIB models are investigated. All simulation studies on test methods are based on the multivariate zero-inflated binomial distribution without regressors to keep the model simple and the study more focused.

4.1. Accuracy of point estimates and interval estimates for MZIB model without covariates

Note that the proposed q-dimensional multivariate ZIB distribution has q+1 parameters. We expect that the proposed distribution can yield better data fitting without sacrificing statistical accuracy too much. To evaluate the accuracy of point estimates and confidence intervals for zero-inflated parameter ω and the probability parameters π1,,πq in the multivariate ZIB model without covariates, we consider five cases for the dimension: q = 2, 3, 4, 5 and 6. The sample size is chosen as n=50,100. Parameter configurations can be found in Table 1. First, the procedure for generating the random number {yi}i=1niidZIBq(ω,m,π) is given as follows:

Table 1.

Parameter configurations for q = 2, 3, 4, 5 and 6.

Scenarios q ω (π1,m1) (π2,m2) (π3,m3) (π4,m4) (π5,m5) (π6,m6)
1 2 0.10 (0.10, 10) (0.20, 6)        
2   0.20 (0.20, 6) (0.10, 10)        
3 3 0.10 (0.10, 8) (0.20, 6) (0.15, 10)      
4   0.20 (0.20, 6) (0.25, 10) (0.30, 8)      
5 4 0.10 (0.10, 10) (0.15, 6) (0.20, 12) (0.15, 8)    
6   0.20 (0.20, 8) (0.25, 10) (0.30, 12) (0.10, 6)    
7 5 0.10 (0.10, 6) (0.25, 12) (0.30, 8) (0.20, 9) (0.15, 11)  
8   0.20 (0.25, 10) (0.35, 12) (0.20, 8) (0.15, 9) (0.30, 11)  
9 6 0.10 (0.20, 10) (0.35, 6) (0.30, 12) (0.10, 8) (0.15, 9) (0.25, 11)
10   0.20 (0.15, 6) (0.25, 8) (0.20, 12) (0.45, 11) (0.10, 10) (0.35, 9)
  1. Generate z1, …, zn iid ∼ Bernoulli(1 − ω).

  2. Generate
    {x11, …, xn1} iid ∼ Binomial(m1, π1), …, {x1q, …, xnq} iid ∼ Binomial(mq, πq).
  3. Let xi = (xi1, …, xiq). Then yi = zi xi ∼ ZIBq(ω, m, π) for i = 1, …, n.
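The three steps above can be sketched in a few lines of Python. This is only an illustrative sampler using the standard library; the names `rbinom` and `sample_zib` are hypothetical, and the binomial draw is done by summing Bernoulli trials, which is adequate for the small denominators in Table 1.

```python
import random


def rbinom(m, p, rng):
    """One Binomial(m, p) draw as a sum of m Bernoulli(p) trials."""
    return sum(1 for _ in range(m) if rng.random() < p)


def sample_zib(n, omega, m, pi, rng=None):
    """Draw n vectors y_i = z_i * x_i with z_i ~ Bernoulli(1 - omega)
    and x_ir ~ Binomial(m_r, pi_r), independently across components."""
    rng = rng or random.Random()
    sample = []
    for _ in range(n):
        z = 1 if rng.random() < 1.0 - omega else 0              # step 1
        x = [rbinom(mr, pr, rng) for mr, pr in zip(m, pi)]     # step 2
        sample.append([z * xr for xr in x])                    # step 3
    return sample


# Scenario 2 of Table 1: q = 2, omega = 0.20, (pi, m) = (0.20, 6), (0.10, 10).
y = sample_zib(50, 0.20, [6, 10], [0.20, 0.10], random.Random(2021))
```

Because z multiplies the whole vector, a structural zero sets all q components to zero at once, which is what induces the positive correlation among components.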

Calculate the MLEs from the generated sample via the EM algorithm (12)–(15) and the 95% bootstrap confidence intervals, with G = 1000 repetitions, for the parameters ω, π1, π2, …, πq. Next, 1000 samples are independently generated and the corresponding 1000 EM MLEs and 1000 bootstrap confidence intervals of ω, π1, π2, …, πq are obtained. Table 2 reports the bias of the MLE, computed from the average of the 1000 estimates obtained via the EM algorithm (12)–(15), together with the average width and the coverage proportion (CP) of the 1000 bootstrap confidence intervals. As seen in Table 2, the biases are small, so the MLEs are very close to the corresponding true parameter values, and the coverage probabilities are all around 0.95, although the coverage probability for the zero-inflation parameter ω falls below the nominal level with n = 50 when ω = 0.10 and q = 4, 5 and 6 (between 0.865 and 0.894). We also conducted a simulation study for a moderate number of dimensions (e.g. q = 10); the results are not reported here. The estimates remain close to the true parameter values and the coverage proportions close to the nominal confidence coefficient, which demonstrates that the proposed EM algorithm performs very well even for multivariate binomial data of moderate dimension.
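To make the estimation step concrete, here is a minimal sketch of an EM iteration for the model without covariates, followed by a percentile bootstrap interval for ω. It is a textbook EM for this two-component mixture, treating the indicator z as missing data, and is not claimed to coincide line by line with equations (12)–(15). The function names, the starting values, the assumption of a common denominator vector m for all observations, and the choice of a nonparametric percentile bootstrap are all assumptions made for illustration.

```python
import math
import random


def em_mzib(y, m, tol=1e-8, max_iter=1000):
    """EM estimates (omega, pi) for y_i = z_i * x_i with common denominators m."""
    n, q = len(y), len(m)
    col_sums = [sum(row[r] for row in y) for r in range(q)]
    n_zero = sum(1 for row in y if not any(row))      # number of all-zero vectors
    omega = 0.5 * n_zero / n                          # assumed starting value
    pi = [min(max(col_sums[r] / (n * m[r]), 1e-6), 1 - 1e-6) for r in range(q)]
    for _ in range(max_iter):
        # E-step: posterior probability that an all-zero vector is a structural zero.
        p0 = math.prod((1 - pi[r]) ** m[r] for r in range(q))
        denom = omega + (1 - omega) * p0
        tau = omega / denom if denom > 0 else 0.0
        # M-step: weighted proportions from the expected complete-data likelihood.
        new_omega = n_zero * tau / n
        eff_n = max(n - n_zero * tau, 1e-12)          # expected count from binomial part
        new_pi = [col_sums[r] / (eff_n * m[r]) for r in range(q)]
        change = abs(new_omega - omega) + sum(abs(a - b) for a, b in zip(new_pi, pi))
        omega, pi = new_omega, new_pi
        if change < tol:
            break
    return omega, pi


def bootstrap_ci(y, m, B=1000, level=0.95, rng=None):
    """Percentile bootstrap interval for omega (resampling whole vectors)."""
    rng = rng or random.Random()
    n = len(y)
    omegas = sorted(
        em_mzib([y[rng.randrange(n)] for _ in range(n)], m)[0] for _ in range(B)
    )
    lo = omegas[int(math.floor((1 - level) / 2 * (B - 1)))]
    hi = omegas[int(math.ceil((1 + level) / 2 * (B - 1)))]
    return lo, hi
```

Only all-zero vectors carry uncertainty about z, so the E-step reduces to a single posterior probability shared by every zero vector; this is why each iteration is cheap even for larger q.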

Table 2.

Biases of MLEs, widths and coverage probabilities of bootstrap confidence intervals for parameters with the number of dimension q = 2, 3, 4, 5 and 6.

q n Parameter Bias Width CP Bias Width CP
      Scenario 1 Scenario 2
2 50 ω −0.0004 0.1979 0.941 −0.0027 0.2817 0.946
    π1 0.0004 0.0730 0.929 0.0014 0.1113 0.954
    π2 −0.0007 0.0800 0.952 −0.0001 0.0639 0.949
  100 ω −0.0012 0.1509 0.943 0.0004 0.2055 0.950
    π1 −0.0006 0.0520 0.945 −0.0003 0.0790 0.949
    π2 −0.0007 0.0570 0.959 0.0003 0.0446 0.954
      Scenario 3 Scenario 4
3 50 ω 0.0023 0.1735 0.925 −0.0005 0.2177 0.933
    π1 0.0001 0.0723 0.951 0.0001 0.0791 0.960
    π2 0.0014 0.0846 0.944 −0.0016 0.1103 0.953
    π3 −0.0001 0.0672 0.946 0.0002 0.1012 0.961
  100 ω −0.0003 0.1258 0.940 −0.0025 0.1557 0.942
    π1 −0.0002 0.0508 0.949 −0.0011 0.0556 0.958
    π2 −0.0002 0.0597 0.947 0.0007 0.0779 0.945
    π3 0.0001 0.0476 0.937 0.0003 0.0716 0.955
      Scenario 5 Scenario 6
4 50 ω −0.0030 0.1580 0.894 0.0023 0.2172 0.941
    π1 −0.0003 0.0716 0.963 −0.0009 0.0791 0.952
    π2 −0.0009 0.0738 0.951 −0.0010 0.1102 0.941
    π3 0.0006 0.0743 0.949 −0.0001 0.0825 0.956
    π4 0.0008 0.0603 0.950 0.0000 0.0659 0.936
  100 ω −0.0008 0.1170 0.940 −0.0012 0.1548 0.944
    π1 0.0005 0.0507 0.952 0.0003 0.0558 0.948
    π2 −0.0009 0.0523 0.946 0.0001 0.0779 0.948
    π3 0.0007 0.0526 0.955 −0.0010 0.0583 0.944
    π4 0.0005 0.0429 0.952 −0.0001 0.0467 0.953
      Scenario 7 Scenario 8
5 50 ω 0.0003 0.1574 0.865 0.0014 0.2176 0.944
    π1 0.0000 0.0714 0.945 0.0002 0.0856 0.952
    π2 0.0004 0.0734 0.953 −0.0005 0.0861 0.961
    π3 0.0000 0.0951 0.945 −0.0008 0.0883 0.951
    π4 0.0003 0.0781 0.950 0.0005 0.0745 0.956
    π5 −0.0005 0.0632 0.956 −0.0009 0.0864 0.949
  100 ω −0.0007 0.1151 0.926 −0.0022 0.1560 0.941
    π1 −0.0002 0.0507 0.954 0.0011 0.0604 0.951
    π2 −0.0005 0.0519 0.929 0.0004 0.0606 0.955
    π3 0.0001 0.0672 0.952 0.0004 0.0624 0.961
    π4 0.0003 0.0553 0.958 −0.0001 0.0523 0.961
    π5 −0.0006 0.0447 0.953 0.0009 0.0609 0.941
      Scenario 9 Scenario 10
6 50 ω −0.0009 0.1573 0.877 0.0036 0.2163 0.935
    π1 0.0007 0.0740 0.942 −0.0004 0.0704 0.942
    π2 −0.0006 0.1142 0.953 0.0009 0.1106 0.945
    π3 0.0010 0.0776 0.947 0.0000 0.0721 0.944
    π4 0.0005 0.0621 0.946 0.0002 0.1098 0.939
    π5 0.0012 0.0698 0.954 −0.0002 0.0622 0.954
    π6 0.0006 0.0766 0.950 −0.0001 0.0898 0.943
  100 ω −0.0002 0.1157 0.931 −0.0006 0.1552 0.936
    π1 0.0001 0.0525 0.947 −0.0005 0.0497 0.939
    π2 0.0002 0.0809 0.968 0.0000 0.0777 0.952
    π3 0.0000 0.0549 0.951 −0.0004 0.0509 0.951
    π4 0.0001 0.0439 0.943 −0.0013 0.0774 0.954
    π5 0.0001 0.0494 0.945 −0.0003 0.0440 0.941
    π6 −0.0008 0.0542 0.953 −0.0003 0.0634 0.952

4.2. Accuracy of point estimates and interval estimates for MZIB model with covariates

In this subsection, we perform a limited simulation study to investigate the performance of the proposed algorithm for the estimation of the regression parameters α and β1, …, βq in the multivariate ZIB model with covariates. The dimension is selected as q = 2, 3, 4 and 5. The covariates for the regression model are selected as u1 ∼ U[1,3] and u2 ∼ Bin(1, 0.5). The regression parameters are selected as α=(2.2,0.3,1.0), β1=(0.3,0.7,1.5), β2=(0.5,0.8,2.0), β3=(1.0,0.6,1.8), β4=(1.2,1.0,1.3) and β5=(1.4,1.2,0.7). Thus, the logistic models for the parameters (ω, π) and covariates u1, u2 with regression parameters α and βr (r = 1, 2, …, q) with q = 2, 3, 4, 5 are as follows:

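$$\mathbf{y} = (y_1, \ldots, y_q) \sim \mathrm{MZIB}(\omega, \boldsymbol{\pi}, \mathbf{m}), \quad \log\frac{\omega}{1-\omega} = \alpha_0 + \alpha_1 u_1 + \alpha_2 u_2, \quad \log\frac{\pi_r}{1-\pi_r} = \beta_{0r} + \beta_{1r} u_1 + \beta_{2r} u_2, \quad r = 1, 2, \ldots, q.$$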

Now, based on the above model, the random sample yi, i = 1, 2, …, n, can be generated as follows:

  1. Generate the covariates u1i ∼ U[1,3] and u2i ∼ Bin(1, 0.5) for i = 1, 2, …, n.

  2. Compute the zero-inflation parameters ωi and probability parameters πi = (π1i, …, πqi) for i = 1, 2, …, n from the logistic models.

  3. Generate the binomial denominators mi = (m1i, …, mqi) from 5 to 15.

  4. Generate the response yi from the multivariate ZIB distribution MZIB(mi, ωi, πi), i = 1, 2, …, n, using the procedure given in Section 4.1.
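The four steps above might be coded as follows. This is a sketch under stated assumptions: the function name `sample_mzib_regression` is hypothetical, the denominators are assumed to be drawn uniformly from the integers 5 to 15 (the text only says "from 5 to 15"), and the coefficient values in the usage line are placeholders chosen for illustration rather than the values used in the paper.

```python
import math
import random


def expit(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sample_mzib_regression(n, alpha, betas, rng=None, m_low=5, m_high=15):
    """Simulate n observations from the MZIB model with covariates (1, u1, u2)."""
    rng = rng or random.Random()
    data = []
    for _ in range(n):
        # Step 1: covariates u1 ~ U[1, 3] and u2 ~ Bin(1, 0.5), plus an intercept.
        u = (1.0, rng.uniform(1.0, 3.0), float(rng.random() < 0.5))
        # Step 2: parameters from the logistic models.
        omega_i = expit(sum(a * ui for a, ui in zip(alpha, u)))
        pi_i = [expit(sum(b * ui for b, ui in zip(beta, u))) for beta in betas]
        # Step 3: denominators (assumed uniform on the integers m_low..m_high).
        m_i = [rng.randint(m_low, m_high) for _ in betas]
        # Step 4: y_i = z_i * x_i as in Section 4.1.
        z = 1 if rng.random() < 1.0 - omega_i else 0
        y_i = [z * sum(1 for _ in range(mr) if rng.random() < pr)
               for mr, pr in zip(m_i, pi_i)]
        data.append({"u": u[1:], "m": m_i, "y": y_i})
    return data


# Placeholder coefficients (hypothetical), q = 2:
sim = sample_mzib_regression(100, (-2.0, 0.3, -1.0),
                             [(-0.3, -0.7, 1.5), (0.5, -0.8, 0.2)],
                             random.Random(11))
```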

The sample size is selected as n = 100, 300 and 500. Then, via the EM algorithm (16)–(19) and Wald-type methods, the MLEs of the parameters α and β1, …, βq, their MSEs, and the 95% confidence intervals and their widths can be calculated from the generated samples. Further, as for Table 2, with G = 1000 repetitions, the average biases and MSEs of the MLEs and the average widths and coverage probabilities of the confidence intervals for the parameters α and β1, …, βq are given in Table 3 for q = 2, 3. The simulation results for q = 4, 5 can be found in Table A1 of the supplemental file.

Table 3.

Biases and MSEs of MLEs, widths and coverage probabilities of confidence intervals for regression parameters with the dimension q = 2 and 3.

n   α10 α11 α12 β10 β11 β12 β20 β21 β22 β30 β31 β32
q = 2
100 Bias −0.2948 0.1716 −0.1231 −0.0021 0.0106 −0.0130 −0.0174 0.0224 −0.0089      
  CP 0.9600 0.9560 0.9740 0.9360 0.9360 0.9280 0.9130 0.9190 0.9501      
  Width 7.5060 4.7936 3.0683 1.7440 1.1360 0.6528 1.8381 1.2030 0.6989      
  MSE 1.9148 1.2229 0.7828 0.4449 0.2898 0.1665 0.4689 0.3067 0.1783      
300 Bias −0.1571 0.1396 −0.1526 −0.0006 0.0105 −0.0126 −0.0071 0.0128 −0.0092      
  CP 0.9380 0.9420 0.9480 0.9400 0.9370 0.9280 0.9480 0.9410 0.9440      
  Width 4.1000 2.6239 1.6894 0.9939 0.6475 0.3727 1.0482 0.6866 0.3986      
  MSE 1.0459 0.6694 0.4310 0.2536 0.1652 0.0951 0.2674 0.1752 0.1017      
500 Bias −0.0760 0.0972 −0.1253 −0.0020 0.0090 −0.0082 −0.0147 0.0190 −0.0120      
  CP 0.9510 0.9530 0.9440 0.9350 0.9400 0.9290 0.9320 0.9320 0.9290      
  Width 3.1301 2.0067 1.2769 0.7697 0.5017 0.2883 0.8105 0.5310 0.3081      
  MSE 0.7985 0.5119 0.3257 0.1963 0.1280 0.0735 0.2068 0.1355 0.0786      
q = 3
100 Bias −0.0921 0.0093 −0.0801 −0.0090 −0.0002 0.0094 0.0170 −0.0075 0.0048 0.0132 −0.0015 −0.0033
  CP 0.9660 0.9710 0.9690 0.9170 0.9340 0.9390 0.9440 0.9390 0.9310 0.9410 0.9350 0.9490
  Width 4.9825 2.2885 2.8878 1.2186 0.5883 0.6738 1.2612 0.6180 0.7050 1.2949 0.6150 0.7246
  MSE 1.2711 0.5838 0.7367 0.3109 0.1501 0.1719 0.3217 0.1577 0.1798 0.3303 0.1569 0.1848
300 Bias −0.1095 0.0497 −0.0442 0.0043 −0.0022 0.0035 0.0085 −0.0026 −0.0023 0.0142 −0.0045 −0.0047
  CP 0.9540 0.9530 0.9560 0.9330 0.9300 0.9260 0.9240 0.9300 0.9400 0.9310 0.9340 0.9350
  Width 2.7472 1.2556 1.5657 0.6942 0.3353 0.3853 0.7190 0.3521 0.4035 0.7353 0.3493 0.4147
  MSE 0.3509 0.1605 0.1993 0.0887 0.0428 0.0492 0.0919 0.0450 0.0515 0.0939 0.0446 0.0530
500 Bias −0.0439 0.0244 −0.0327 0.0028 0.0003 −0.0017 −0.0037 0.0021 0.0029 −0.0063 0.0042 −0.0012
  CP 0.9580 0.9560 0.9520 0.9430 0.9520 0.9470 0.9400 0.9370 0.9250 0.9340 0.9350 0.9260
  Width 2.0987 0.9619 1.1957 0.5362 0.2590 0.2976 0.5560 0.2724 0.3121 0.5679 0.2698 0.3202
  MSE 0.5354 0.2454 0.3050 0.1368 0.0661 0.0759 0.1418 0.0695 0.0796 0.1449 0.0688 0.0817

From the results given in Table 3 and in Table A1 of the supplemental file, one can see that the proposed EM algorithm performs well for the estimation of the regression parameters. The biases are small, so the MLEs are close to the corresponding true parameter values, and the coverage probabilities are all around 0.95, although the estimates of α are noticeably less accurate for q = 2 and n = 100.

4.3. Tests for zero-inflation in MZIB model

In this subsection, the performance of both the score test and the likelihood ratio test for testing zero inflation in the multivariate ZIB distribution is assessed by a simulation study. All the simulations in this subsection are performed with G = 10,000 replications, and covariates are not considered for simplicity. The dimensions q = 2, 3, 4 are considered for the multivariate zero-inflated binomial distribution, with sample sizes n = 50, 100, 200, 300, 400, 500. To assess the empirical levels and powers of the proposed test statistics, the zero-inflation parameter is set to ω = 0.0, 0.01, 0.05, 0.1, 0.2. Since we are not interested in inference on the binomial probabilities π and binomial denominators m, we first generate the two vectors π and m randomly. For a given set of parameters (n, ω, m, π), samples from the multivariate ZIB distribution ZIBq(ω, m, π) are generated by the same procedure as in Section 4.1, and the estimates of the parameters (ω, π) under the alternative hypothesis of the LRT are computed by the EM algorithm.

The simulation results are summarized in Table 4 and in Table A2 of the supplemental file. The results in Table 4 are computed using binomial probabilities π and denominators m randomly generated from 0.05 to 0.20 and from 5 to 15, respectively. The results in Table A2 of the supplemental file are computed using binomial probabilities π and denominators m randomly generated from 0.01 to 0.15 and from 5 to 20, respectively. These results compare the score test statistic S1 and the LRT statistic S2 side by side in terms of both empirical level and power. Both test statistics hold the nominal level α = 0.05 well. The empirical powers of both tests for detecting zero-inflation increase as the dimension of the multivariate ZIB distribution and the zero-inflation parameter increase. We also performed the simulation study for multivariate zero-inflated binomial distributions with five or more dimensions. Unlike the EM algorithm for parameter estimation, the tests do not perform well in this case: although both tests show great power for detecting zero-inflation even for a very small zero-inflation parameter (ω = 0.01), neither holds the nominal level well for multivariate zero-inflated proportional data with five or more dimensions. Moreover, both the score test and the likelihood ratio test sometimes require a fairly large sample to compute the inverse of the expected information matrix. Overall, there is not much difference in power between the score test and the likelihood ratio test.

Table 4.

Empirical powers of the score test and the likelihood ratio test for testing the zero-inflation parameter ω at nominal level α=0.05.

    Empirical powers
    ω=0 ω=0.01 ω=0.05 ω=0.10 ω=0.20
q n Score LRT Score LRT Score LRT Score LRT Score LRT
2 50 0.0528 0.0467 0.0872 0.1163 0.1346 0.1819 0.3313 0.4204 0.7578 0.8310
  100 0.0493 0.0428 0.1192 0.1667 0.2209 0.2962 0.5780 0.6705 0.9664 0.9818
  200 0.0471 0.0436 0.1846 0.2576 0.3832 0.4844 0.8567 0.9095 0.9977 1.000
  300 0.0516 0.0474 0.2430 0.3365 0.5185 0.6316 0.9571 0.9764 1.000 1.000
  400 0.0484 0.0449 0.3053 0.4109 0.6370 0.7346 0.9892 0.9948 1.000 1.000
  500 0.0497 0.0477 0.3627 0.4708 0.7311 0.8144 0.9975 0.9988 1.000 1.000
3 50 0.0419 0.0400 0.3173 0.3167 0.5387 0.5410 0.8851 0.8890 0.9966 0.9966
  100 0.0368 0.0407 0.4846 0.5068 0.7688 0.7847 0.9896 0.9890 1.000 1.000
  200 0.0434 0.0435 0.7154 0.7500 0.9516 0.9620 1.000 1.000 1.000 1.000
  300 0.0460 0.0452 0.8554 0.8801 0.9901 0.9927 1.000 1.000 1.000 1.000
  400 0.0503 0.0475 0.9258 0.9441 0.9985 0.9991 1.000 1.000 1.000 1.000
  500 0.0493 0.0447 0.9667 0.9758 0.9996 0.9997 1.000 1.000 1.000 1.000
4 50 0.0635 0.0214 0.6021 0.5143 0.8152 0.7636 0.9760 0.9684 1.000 1.000
  100 0.0512 0.0442 0.8393 0.8146 0.9665 0.9577 0.9999 0.9999 1.000 1.000
  200 0.0412 0.0376 0.9642 0.9618 0.9986 0.9984 1.000 1.000 1.000 1.000
  300 0.0421 0.0385 0.9921 0.9916 0.9998 0.9998 1.000 1.000 1.000 1.000
  400 0.0469 0.0471 0.9984 0.9986 1.000 1.000 1.000 1.000 1.000 1.000
  500 0.0423 0.0442 0.9999 0.9999 1.000 1.000 1.000 1.000 1.000 1.000

4.4. Test for equality of probabilities in MZIB model

In this subsection, the performance of both the score test and the likelihood ratio test for testing the equality π1 = ⋯ = πq of the multivariate zero-inflated binomial parameters is assessed by a simulation study. All the simulations in this subsection are performed with G = 10,000 replications, and covariates are not considered for simplicity. Only one zero-inflation parameter (ω = 0.3) and two dimensions (q = 2, 3) are considered for the multivariate zero-inflated binomial distribution, with sample sizes n = 50, 100, 150, 200, 250, 300, 350, 400, 450, 500. The vector of binomial denominators m is randomly generated. The vector of binomial probabilities is preassigned. For the two-dimensional case, the values of (π1, π2) are (0.3, 0.3), (0.5, 0.5), (0.8, 0.8) and (0.4, 0.5). For the three-dimensional case, the values of (π1, π2, π3) are (0.3, 0.3, 0.3), (0.5, 0.5, 0.5), (0.8, 0.8, 0.8) and (0.45, 0.5, 0.5). For a given set of parameters (n, ω, m, π), the multivariate zero-inflated binomial data are generated by a procedure similar to that given in Section 4.1.

The simulation results are summarized in Figure 1 and in Table A3 of the supplemental file. Figure 1 displays the comparisons of performance for empirical levels and empirical powers at the nominal level α=0.05 between the score test (solid line) and the likelihood ratio test (dotted line) for testing the equality of parameters π based on two-dimensional and three-dimensional zero-inflated binomial models. Similar conclusions can also be obtained from Table A3 in the supplemental file.

Figure 1.

Comparisons of performance for empirical levels and empirical powers between the score test (solid line) and the likelihood ratio test (dotted line) for testing the equality of probabilities in two- and three-dimensional binomial data.

From the simulation results, both the score test and the likelihood ratio test maintain the nominal level α=0.05 well. The powers of the two tests for detecting inequality of the multivariate zero-inflated binomial parameters π are very close in both the two-dimensional and three-dimensional simulations, regardless of sample size. Also, both tests show great power even for a very small difference in π. Overall, there is no appreciable difference between the score test and the likelihood ratio test in testing the equality of π in the multivariate zero-inflated binomial distribution.

5. A real example

In this section, we illustrate the application of the proposed multivariate zero-inflated binomial model to whitefly data. van Iersel et al. [11] studied the control of silver leaf whiteflies using a subirrigation system. The study was designed to determine the effectiveness of controlling silver leaf whiteflies on poinsettia with imidacloprid delivered by a subirrigation system. Imidacloprid is a resilient and powerful chemical (e.g. Natwick et al. [20]) that has low toxicity to mammals and is used to control silver leaf whiteflies on poinsettia. In the first week of the experiment, researchers placed m adult whiteflies (here, m is the binomial denominator, with range 6–15, mean = 9.5 and SD = 1.7) in clip-on leaf cages attached to one leaf per plant and then recorded the number of surviving whiteflies 2 days later, which is taken as the response variable. To measure reproductive inhibition, they removed the cages after obtaining the survival count but marked the position of each cage. In the following week, they again placed m adult whiteflies in clip-on leaf cages attached to one leaf on the same plant and recorded the number of surviving whiteflies. The study lasted for 12 consecutive weeks on 54 plants. Therefore, the data can be regarded as consisting of 12-dimensional binomial variables; that is, the observed data can be expressed as {(yi, mi), i = 1, 2, …, 54} with yi = (yi1, …, yi,12) and mi = (mi1, …, mi,12). However, although the design originally called for 648 observations in a balanced design, one observation was lost on each of plant 4 at week 2, plant 7 at week 4, plant 15 at week 11, plant 17 at week 3, plant 20 at week 11 and plant 27 at week 2, and two observations were lost on plant 18 at weeks 10 and 11, yielding a final data set with N = 640 observations. Further, an examination of the data set reveals an excessive number of zeros in each week. The detailed information is shown in Table A4 of the supplemental file. It can be seen that the percentage of zeros in this data set is greater than 50%. Also, Figure A1 in the supplemental file shows a 3D plot of the frequencies of the numbers yij of surviving whiteflies in the whitefly data.

Now the proposed multivariate zero-inflated binomial model can be used to analyse the whitefly data. First, a bivariate proportional data set is formed from week r and week s (r, s = 1, 2, …, 12 and r ≠ s), denoted by Drs = {(yir, yis; mir, mis), i = 1, 2, …, n}. Next, the data set Drs is analysed using the bivariate zero-inflated binomial model, and the MLEs ωˆ, πˆr and πˆs of the parameters ω, πr and πs are computed via the EM algorithm. Then, the estimated correlation coefficient ρˆrs(i) for each observation (yir, yis; mir, mis) in the bivariate binomial data can be computed from (3) with ω = ωˆ, πr = πˆr and πs = πˆs:

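$$\hat\rho_{rs}^{(i)} = \hat\omega \sqrt{\frac{m_{ir}\hat\pi_r \, m_{is}\hat\pi_s}{\left(\hat\omega m_{ir}\hat\pi_r + (1-\hat\pi_r)\right)\left(\hat\omega m_{is}\hat\pi_s + (1-\hat\pi_s)\right)}}, \quad r \neq s, \ i = 1, 2, \ldots, n.$$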

Finally, the estimated correlation coefficient for the bivariate zero-inflated binomial data set Drs is

$$\hat\rho_{rs} = \frac{1}{n}\sum_{i=1}^{n} \hat\rho_{rs}^{(i)}, \quad r, s = 1, 2, \ldots, 12,$$

which is the mean of the estimated correlation coefficients over all observations in the bivariate zero-inflated binomial data. The estimated correlation coefficients among the components of the 12-dimensional binomial variables are presented in Table 5. The results in Table 5 show positive correlations between every pair of the 12 binomial variables, which are induced by the zero-inflation in the bivariate binomial variables.
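The per-observation correlation and its average can be computed with a short helper. The sketch below assumes the form of the correlation that follows from the stochastic representation y = z x, namely Cov(yr, ys) = ω(1 − ω) mr πr ms πs and Var(yr) = (1 − ω) mr πr [(1 − πr) + ω mr πr]; the function names `zib_corr` and `mean_corr` are hypothetical.

```python
import math


def zib_corr(omega, m_r, pi_r, m_s, pi_s):
    """Correlation between components r and s of y = z * x for one observation."""
    a, b = m_r * pi_r, m_s * pi_s                     # binomial means
    denom = (omega * a + (1.0 - pi_r)) * (omega * b + (1.0 - pi_s))
    return omega * math.sqrt(a * b / denom)


def mean_corr(omega, pi_r, pi_s, m_r_list, m_s_list):
    """Average of the per-observation correlations over a bivariate data set."""
    vals = [zib_corr(omega, mr, pi_r, ms, pi_s)
            for mr, ms in zip(m_r_list, m_s_list)]
    return sum(vals) / len(vals)
```

The correlation is zero when ω = 0 and increases with ω, which is why positive estimated correlations go hand in hand with estimated zero-inflation.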

Table 5.

Estimates of correlation coefficients among 12 binomial variables.

Week 1 2 3 4 5 6 7 8 9 10 11 12
1 1.0000                      
2 0.1439 1.0000                    
3 0.1848 0.1295 1.0000                  
4 0.2931 0.1986 0.2232 1.0000                
5 0.2922 0.2585 0.2397 0.3906 1.0000              
6 0.2935 0.2246 0.2617 0.3937 0.4642 1.0000            
7 0.2307 0.1932 0.2323 0.3250 0.4132 0.3606 1.0000          
8 0.2488 0.1776 0.2340 0.3445 0.3591 0.3622 0.3505 1.0000        
9 0.2717 0.2386 0.2252 0.3703 0.4995 0.4645 0.3376 0.3396 1.0000      
10 0.2900 0.2534 0.2219 0.3519 0.3503 0.3537 0.3400 0.3235 0.3484 1.0000    
11 0.2523 0.2532 0.2046 0.3774 0.3883 0.4201 0.3076 0.3266 0.4128 0.3673 1.0000  
12 0.1739 0.0816 0.1080 0.2298 0.2640 0.2519 0.1875 0.1879 0.2459 0.1414 0.2265 1.0000

Next, since the proposed test statistics work well only for dimensions q ≤ 4, for the purpose of illustrating the proposed model the vector of 12 variables is partitioned into four vectors of three variables, giving four smaller multivariate proportional data sets for weeks 1–3, weeks 4–6, weeks 7–9 and weeks 10–12, denoted by D1, D2, D3 and D4, respectively. Owing to the unbalanced design, plants with missing observations are omitted, and the data sets D1, D2, D3 and D4 can be expressed as

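$$\begin{aligned}
D_1 &= \{(y_{i1}, y_{i2}, y_{i3}; m_{i1}, m_{i2}, m_{i3}),\ i = 1, 2, \ldots, 51\},\\
D_2 &= \{(y_{i4}, y_{i5}, y_{i6}; m_{i4}, m_{i5}, m_{i6}),\ i = 1, 2, \ldots, 53\},\\
D_3 &= \{(y_{i7}, y_{i8}, y_{i9}; m_{i7}, m_{i8}, m_{i9}),\ i = 1, 2, \ldots, 54\},\\
D_4 &= \{(y_{i,10}, y_{i,11}, y_{i,12}; m_{i,10}, m_{i,11}, m_{i,12}),\ i = 1, 2, \ldots, 51\}.
\end{aligned}$$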

The following three-dimensional zero-inflated binomial model is used to analyse the data:

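$$f(\mathbf{y}; \omega, \boldsymbol{\pi}) = \left[\omega + (1-\omega)(1-\pi_1)^{m_1}(1-\pi_2)^{m_2}(1-\pi_3)^{m_3}\right]^{I\{\mathbf{y} = \mathbf{0}\}} \times \left[(1-\omega)\binom{m_1}{y_1}\binom{m_2}{y_2}\binom{m_3}{y_3}\pi_1^{y_1}\pi_2^{y_2}\pi_3^{y_3}(1-\pi_1)^{m_1-y_1}(1-\pi_2)^{m_2-y_2}(1-\pi_3)^{m_3-y_3}\right]^{I\{\mathbf{y} \neq \mathbf{0}\}}.$$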
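The probability function above is straightforward to evaluate numerically, and doing so gives a quick sanity check that it sums to one over its support. The following sketch is written for general q (the three-dimensional case is q = 3); the names `mzib_pmf` and `mzib_loglik` are hypothetical, and the log-likelihood helper assumes each observation comes with its own denominator vector.

```python
import math
from itertools import product


def mzib_pmf(y, omega, m, pi):
    """Probability of the vector y under the multivariate ZIB model."""
    binom = math.prod(
        math.comb(mr, yr) * pr ** yr * (1 - pr) ** (mr - yr)
        for yr, mr, pr in zip(y, m, pi)
    )
    if not any(y):                       # y = 0: structural or sampling zero
        return omega + (1 - omega) * binom
    return (1 - omega) * binom           # y != 0: must come from the binomial part


def mzib_loglik(data, omega, pi):
    """Log-likelihood for data given as an iterable of (y_i, m_i) pairs."""
    return sum(math.log(mzib_pmf(y, omega, m, pi)) for y, m in data)


# Sanity check on a small support: the probabilities add up to one.
m, pi, omega = (2, 3, 2), (0.2, 0.5, 0.4), 0.3
total = sum(mzib_pmf(y, omega, m, pi)
            for y in product(*(range(mr + 1) for mr in m)))
```

A log-likelihood of this form is the ingredient needed for a likelihood ratio statistic, evaluated once at the constrained and once at the unconstrained estimates.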

Both the score test and the likelihood ratio test are applied to test for the existence of zero inflation and for the equality of the binomial probabilities of the three components in each data set. When zero-inflation is present, the MLEs of the zero-inflation parameter and the binomial probabilities are also computed using the EM algorithm for the four data sets D1, D2, D3 and D4. The results of the data analysis are given in Table 6.

Table 6.

The results for analysing four three-dimensional data sets of whitefly data by using MZIB model.

Data set D1 D2 D3 D4
Test ω=0 Score(p-value) 7639.8219(0.0000) 14453.2000(0.0000) 4457.1165(0.0000) 16838.4730(0.0000)
  LRT(p-value) 67.2820(0.0000) 288.1016(0.0000) 203.2252(0.0000) 114.8060(0.0000)
Test π1=π2=π3 Score(p-value) 11.9332(0.0026) 15.9083(0.0004) 6.7793(0.0337) 38.9267(0.0000)
  LRT(p-value) 11.8469(0.0027) 15.6080(0.0004) 6.7764(0.0338) 38.2862(0.0000)
MLE of Parameter (ω,π1) (0.0980, 0.2643) (0.3585, 0.4452) (0.2963, 0.3288) (0.1568, 0.2948)
  (π2,π3) (0.3694, 0.3000) (0.3046, 0.3249) (0.3443, 0.2610) (0.2473, 0.4455)

From Table 6, one can see that the values of the score test statistics and LRT statistics for testing the presence of zero-inflation are very large, and thus the p-values for these tests are very small (near zero). This shows that zero-inflation is present in the four three-dimensional proportional data sets and hence that the components of the multivariate proportional variables in each data set are positively correlated. Using the bootstrap method given in Section 3.1.2, the 95% confidence intervals of the zero-inflation parameter ω for the four data sets are computed as (0.0196, 0.1765), (0.2264, 0.4906), (0.1666, 0.4259) and (0.0588, 0.2545), respectively. These confidence intervals also confirm the existence of zero-inflation and of positive correlation among the components in the four induced data sets. Further, there is enough evidence that the binomial probabilities of the three components are not all equal in each of the data sets D1, D2, D3 and D4 at the significance level α = 0.05, since the p-values of all eight statistics for testing the equality of binomial probabilities are less than 0.05. However, the p-values of the score test and the LRT for weeks 7–9 are 0.03372 and 0.03377, respectively, so the hypothesis of equal binomial probabilities for weeks 7–9 is not rejected at the significance level α = 0.025. Also, the values of the score test and the LRT for testing the equality of probabilities are almost the same in each of the four data sets. In contrast, for testing the existence of zero-inflation, the values of the score test statistic are much larger than those of the likelihood ratio test statistic, which suggests that the score test is more sensitive to zero-inflation than the likelihood ratio test.

For the purpose of illustration, the original data are also partitioned into three data sets: D5 for weeks 1–4, D6 for weeks 5–8 and D7 for weeks 9–12. Each data set thus consists of four-dimensional binomial variables with four-dimensional binomial denominators. The four-dimensional ZIB model, with the corresponding test statistics and EM algorithm, is applied to these three data sets in the same way as for the four three-dimensional data sets. The results are given in Table 7 and are similar to those in Table 6. Since all values of the statistics for testing the existence of zero-inflation are very large, and the corresponding p-values are therefore near zero, the data sets D5, D6 and D7 exhibit zero-inflation, and there are positive correlations among the components in each of the three induced four-dimensional binomial data sets. Also, the p-values of the score test S3 and the LRT S4 for data sets D5 and D7 are less than 0.01, which indicates that the binomial probabilities of the four components are not all equal in D5 and D7 even at the significance level α = 0.01. However, the hypothesis of equal binomial probabilities for weeks 5–8 is not rejected, since the p-values of the score test S3 and the LRT S4 are 0.1157 and 0.1186, respectively. Further, the 95% bootstrap confidence intervals for the zero-inflation parameter ω in the three induced data sets D5, D6 and D7 are (0.0200, 0.2000), (0.1667, 0.3889) and (0.0392, 0.2157), respectively. This also shows the presence of zero-inflation and of positive correlation among the components in the three induced four-dimensional proportional data sets D5, D6 and D7.

Table 7.

The results for analysing three four-dimensional data sets of whitefly data by using MZIB model.

Data set D5 D6 D7
Test ω=0 Score(p-value) 139981.7580(0.0000) 34758.6047(0.0000) 26939.4613(0.0000)
  LRT(p-value) 100.3533(0.0000) 261.0577(0.0000) 100.1171(0.0000)
Test π1=π2=π3=π4 Score(p-value) 12.6343(0.0055) 5.9178(0.1157) 53.6393(0.0000)
  LRT(p-value) 12.5544(0.0057) 5.8597(0.1186) 51.3813(0.0000)
MLE of Parameter (ω,π1,π2) (0.1000, 0.2678, 0.3762) (0.2778, 0.2668, 0.2778) (0.1176, 0.2233, 0.2828)
  (π3,π4) (0.3035, 0.3186) (0.3168, 0.3378) (0.2329, 0.4240)

We also computed the values of the score tests and likelihood ratio tests for all possible data sets induced from the original data, such as the two-dimensional data sets for weeks 1–2, 2–3, …, 11–12, the three-dimensional data sets for weeks 1–3, 2–4, …, 10–12, and so on up to the 11-dimensional data sets for weeks 1–11 and 2–12 and the 12-dimensional data set for weeks 1–12 (the whole data). Although all tests indicate strong zero-inflation in these data sets, the test results may not be trustworthy because the proposed test statistics are not reliable for multivariate proportional data with five or more dimensions. However, we may use the bootstrap confidence interval to test for zero-inflation in multivariate binomial data of moderate dimension. We now partition the original data into two six-dimensional data sets D8 and D9 and analyse them using the proposed MZIB model. The results are presented in Table 8.

From the results in Table 8, the values of the score test statistic for testing the presence of zero-inflation are very large but unreliable. Note that, since the EM algorithm is based on the stochastic representation (1) of Definition 2.1, the value of the zero-inflation parameter is implicitly assumed to lie in the interval [0,1), and so the lower limits of the confidence intervals cannot be less than zero. Further, based on Definition 2.1, the hypotheses for testing the presence of zero-inflation should be the upper-tailed hypotheses H0: ω = 0 versus H1: ω > 0. From the EM algorithm, the 95% bootstrap confidence intervals for the zero-inflation parameter ω in the two induced six-dimensional data sets D8 and D9 are (0.0000, 0.1200) and (0.0000, 0.0980), and the 90% bootstrap confidence intervals are (0.0200, 0.1200) and (0.0000, 0.0784), respectively. Comparing zero with the lower limits of the confidence intervals, the null hypothesis is rejected in data set D8 at the level α = 0.05 (the lower limit of the 90% interval exceeds zero) but is not rejected at the level α = 0.025 (the lower limit of the 95% interval equals zero). In data set D9, however, the null hypothesis is not rejected at either level, α = 0.05 or 0.025. This means that there is no evidence for the presence of zero-inflation in data set D9 at α = 0.05. Furthermore, by using the bootstrap method, the p-values are approximately 0.045 in D8 and 0.126 in D9, which lead to the same conclusions as above.

Table 8.

The results for analysing two six-dimensional data sets of whitefly data by using MZIB model.

Data set D8 D9
Test ω=0 Score(p-value) 1092598.3867(0.0000) 368992.8598(0.0000)
  LRT(p-value) 76.0592(0.0000) 49.6176(0.0000)
Test π1==π6 Score(p-value) 32.3973(0.0000) 56.7644(0.0000)
  LRT(p-value) 31.9574(0.0000) 54.0472(0.0000)
MLE of Parameter (ω,π1,π2,π3) (0.0600, 0.2575, 0.3602, 0.2891) (0.0392, 0.2479, 0.2538, 0.2043)
  (π4,π5,π6) (0.3052, 0.2185, 0.2254) (0.2593, 0.2160, 0.3974)

6. Concluding remarks

A new model for multivariate proportional data, called the ‘multivariate zero-inflated binomial model’, has been proposed. The model introduces a common zero-inflation parameter for all components of the multivariate binomial variable, which automatically accounts for the correlation among the components. This model can also be regarded as an extension of the widely discussed univariate zero-inflated binomial model for proportional data. The Fisher scoring algorithm and the EM algorithm are derived for the computation of the parameter estimates in the proposed multivariate model with/without covariates. Score tests and likelihood ratio tests are also proposed to detect the existence of zero inflation and to test the equality of the binomial probabilities in the multivariate binomial model. The simulation results demonstrate that the proposed EM algorithm performs excellently in computing the MLEs of the parameters, even for a moderately large number of dimensions, in the MZIB model with/without covariates, and that the four test statistics maintain the nominal level well for small numbers of dimensions. However, the proposed test statistics do not work well if the multivariate binomial variable has five or more dimensions, which is a limitation of these tests. The whitefly data are used to demonstrate the proposed model and inferential methods for multivariate binomial data. The results indicate the existence of correlation and zero inflation in the subsets of the whitefly data, and that the binomial probabilities differ across components in most of these subsets.

However, it is very unlikely that all components of the binomial random vector are zero-inflated in the same way and/or by the same amount, as measured by the zero-inflation parameter, even for a moderate number of dimensions. The proposed model is therefore limited in its applicability. One solution is to introduce more zero-inflation parameters for the components of the multivariate zero-inflated binomial variable. Such a model has been proposed for multivariate count data with excessive zeros; it can be extended to analyse multivariate proportional data with excessive zeros, and the corresponding research can be carried out in the future. Moreover, the proposed score test has a shortcoming when the multivariate proportional data have a large number of dimensions. The main reason is that the denominator of the score test statistic is very small, which results in large variability of the statistic. In fact, for the same reason, the score test does not work well even for a univariate binomial variable if the binomial denominator is very large. We are considering a modification of the score test that may work well for multivariate binomial variables of moderate dimension. Furthermore, in addition to zero-inflation, over-dispersion or under-dispersion often exists in count/proportional data. Such dispersion should be investigated, and a multivariate zero-inflated beta-binomial (MZIBB) model could be used to fit such data. We plan to pursue this research in the future.

Supplementary Material

Supplemental File.pdf

Acknowledgments

The authors are very grateful to the editor, the associate editor and two referees for their careful reading and valuable comments, which have greatly improved this paper. The research of the first author is partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Funding Statement

The research of the first author is partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Guo-Liang Tian's research was fully supported by the National Natural Science Foundation of China (Grant No. 11771199).

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Alevizakos V. and Koukouvinos C., Monitoring of zero-inflated binomial processes with a DEWMA control chart, J. Appl. Stat. 2020. doi: 10.1080/02664763.2020.1761950.
  • 2. Alqawba M. and Diawara N., Copula-based Markov zero-inflated count time series models with application, J. Appl. Stat. 48 (2021), pp. 786–803. doi: 10.1080/02664763.2020.1748581.
  • 3. Broek V.J., A score test for zero-inflation in a Poisson distribution, Biometrics 51 (1995), pp. 738–743.
  • 4. Chandrasekarn B. and Balakrishnan N., Some properties and a characterization of trivariate and multivariate binomial distributions, Statistics 36 (2002), pp. 211–218.
  • 5. Cohen A.C., Estimation in Mixtures of Discrete Distributions, Statistical Pub. Society, Calcutta, 1963.
  • 6. Deng D. and Paul S.R., Score tests for zero-inflation in generalized linear models, Can. J. Stat. 28 (2000), pp. 563–570.
  • 7. Deng D. and Paul S.R., Score tests for zero-inflation and over-dispersion in generalized linear models, Stat. Sin. 15 (2005), pp. 257–276.
  • 8. Fang K.T., Kotz S. and Ng K.W., Symmetric Multivariate and Related Distributions, Chapman and Hall, New York & London, 1990.
  • 9. Hall D.B., Zero-inflated Poisson and binomial regression with random effects: A case study, Biometrics 56 (2000), pp. 1030–1039.
  • 10. Hall D.B. and Berenhaut K.S., Score tests for heterogeneity and over-dispersion in zero-inflated Poisson and binomial regression models, Can. J. Stat. 30 (2002), pp. 415–430.
  • 11. van Iersel M.W., Oetting R.D. and Hall D.B., Imidacloprid applications by subirrigation for control of silverleaf whitefly (Homoptera: Aleyrodidae) on poinsettia, J. Econ. Entomol. 93 (2000), pp. 813–819.
  • 12. Jansakul N. and Hinde J.P., Score tests for zero-inflated Poisson models, Comput. Stat. Data Anal. 40 (2002), pp. 75–96.
  • 13. Johnson N.L. and Kotz S., Distribution in Statistics: Discrete Distribution, John Wiley & Sons, New York, 1969.
  • 14. Krishnamoorthy A.S., Multivariate binomial and Poisson distributions, Sankhya Indian J. Stat. 13 (1951), pp. 117–124.
  • 15. Lambert D., Zero-inflated Poisson regression, with an application to defects in manufacturing, Technometrics 34 (1992), pp. 1–14.
  • 16. Lauritzen L., BS2 Statistical Inference, Lecture 4, University of Oxford, 2009.
  • 17. Li C.H., Lu J.C., Park J., Kim K., Brinkley P.A. and Peterson J.P., Multivariate zero-inflated Poisson models and their applications, Technometrics 41 (1999), pp. 29–38.
  • 18. Liu Y. and Tian G.L., Type I multivariate zero-inflated Poisson distribution with applications, Comput. Stat. Data Anal. 83 (2015), pp. 200–222.
  • 19. Min A. and Czado C., Testing for zero-modification in count regression models, Stat. Sin. 20 (2010), pp. 323–341.
  • 20. Natwick E.T., Palumbo J.C. and Engle C.E., Effects of imidacloprid on colonization of aphids and silverleaf whitefly and growth, yield and phytotoxicity in cauliflower, Southwest. Entomol. 21 (1996), pp. 283–292.
  • 21. Ridout M., Hinde J. and Demétrio C.G.B., A score test for testing a zero-inflated Poisson regression model against zero-inflated negative binomial alternatives, Biometrics 57 (2001), pp. 219–223.
  • 22. Song K.-S., Simultaneous statistical modelling of excess zeros, over/underdispersion, and multimodality with applications in hotel industry, J. Appl. Stat. 2020. doi: 10.1080/02664763.2020.1769577.
  • 23. Vieira A.M.C., Hinde J.P. and Demétrio C.G.B., Zero-inflated proportion data models applied to a biological control assay, J. Appl. Stat. 27 (2000), pp. 373–389.


Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
