Abstract
Count responses with structural zeros are very common in medical and psychosocial research, especially in alcohol and HIV research, and the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models are widely used for modeling such outcomes. However, as alcohol drinking outcomes such as days of drinking are counts within a given period, their distributions are bounded above by an upper limit (total days in the period) and thus inherently follow a binomial or, in the presence of structural zeros, a zero-inflated binomial (ZIB) distribution, rather than a Poisson or ZIP distribution. In this paper, we develop a new semiparametric approach for modeling ZIB-like count responses for cross-sectional as well as longitudinal data. We illustrate this approach with both simulated and real study data.
Keywords: Bounded count response, COMBINE Study, Distribution-free models, Generalized Estimating Equations, Structural zero, Zero-inflated binomial (ZIB)
1. Introduction
The issue of structural zeros has drawn a considerable amount of attention during the last decade [7, 8, 11, 20, 23, 26, 29]. Structural zeros refer to zero responses from those subjects whose count responses will always be zero, i.e., constant zeros, in contrast to random (or sampling) zeros that occur to subjects whose count response can be greater than zero, but appear to be zero due to sampling variability. In regression analysis, the zero-inflated Poisson (ZIP) has become a popular approach for addressing structural zeros for such count responses [7, 8, 11, 23, 29]. But for some count responses such as the number of days of alcohol drinking within a given period, the range of the variable is bounded above. For example, in the NIH-funded Combined Pharmacotherapies and Behavior Interventions (COMBINE) Study [2], a large randomized trial that combines pharmacotherapies and behavioral interventions, among the primary outcomes are days of drinking (DAD) and days of heavy drinking (DHD) of alcohol during a given period. These drinking outcomes measure the number of days when an individual consumes a certain amount of alcohol within a given period such as a week and thus are bounded from above by an upper limit (total days for the period). Although popular for modelling count responses, Poisson-based approaches are not appropriate for modeling such bounded binomial-based count responses, at least when the range is small. This naturally calls for zero-inflated binomial (ZIB) models.
Most available methods for ZIB follow the parametric approach [8, 26]. However, such an approach has only limited applications because of the strong assumption about the data distribution. Semi-parametric, or distribution-free, models are more robust to misspecification than parametric models. The method of estimating equations, assuming only the conditional mean response, is a popular semi-parametric alternative. For longitudinal studies, the generalized estimating equations (GEE), a generalized version of the method of estimating equations, is commonly used to address correlation among repeated responses. However, since ZIB is a mixture of two distributions, we will not be able to identify the model parameters by simply modeling the mean response [4, 7]. Hall and Zhang [7] developed an approach for zero-inflated Poisson and binomial data by integrating maximum likelihood with GEE to deal with correlated longitudinal responses. This “hybrid” approach by-passes the distribution assumptions about the correlation as in the other methods. However, as it still employs parametric models for the marginal distribution of the response, this approach is sensitive to deviations from the assumed marginal distributions. More importantly, their method does not deal with missing values, a common issue in longitudinal studies such as the COMBINE study. Dobbie and Welsh [5] also developed a GEE approach for zero-inflated count data. However, their approach models the mixture of zeros and truncated Poisson, rather than a mixture of structural zeros and Poisson. As a result, structural zeros are not distinguished from random zeros, failing to provide inference about the likelihood of structural zeros, which is of great interest in practice. Furthermore, like the approach by Hall and Zhang [7], their method does not address missing values either.
In this article, we develop a new semi-parametric approach for modeling zero-inflated count responses bounded above by an upper limit. The new approach not only can handle missing data, but is also more robust to deviations from the assumed marginal distributions. Compared to [7, 8], the new approach is less computationally intensive and can be used as a benchmark to evaluate the performance of approaches that have been used for analyzing such responses in alcohol studies, such as linear regression (applied to the transformed count response), or of existing alternatives such as the zero-inflated Poisson (ZIP) and Hall’s mixed-effects and marginal ZIB models [7, 8]. The remainder of this paper is organized as follows. In Section 2, we introduce the notation and ZIB-like models for cross-sectional outcomes. In Section 3, we generalize the development of Section 2 to longitudinal data analysis in the complete data case. The extension of the approach to missing data is discussed in Section 4. In Section 5, we assess the performance of the proposed models by simulation studies, while in Section 6, we apply the approach to drinking outcomes from the COMBINE Study. The paper concludes with a discussion in Section 7.
2. ZIB-like Models for Cross-sectional Data
Let yi denote a count response and xi a set of explanatory variables from the ith subject (1 ≤ i ≤ n). When yi | xi follows a binomial distribution, generalized linear models with a link function for binary responses are often applied. In other words, the probability of success (or failure) for each trial of the binomial, after a transformation, is a linear function of the explanatory variables xi. For example, if the popular logistic link function is used, we can assume
\[ y_i \mid x_i \sim \mathrm{Binomial}(m_i, p_i), \qquad \mathrm{logit}(p_i) = \log\frac{p_i}{1-p_i} = x_i^{\top}\beta, \tag{1} \]
where mi is the size of the binomial sample for the ith subject, β is the vector of parameters, and the logit transformation of the probability pi is modeled as a linear function of xi. In the presence of structural zeros, the model in (1) becomes inappropriate. First, conceptually, the binomial is not the correct distribution. Formally, it is straightforward to check that the variance of the response is larger than the variance expected under a binomial distribution. Thus, if a binomial model is applied, one may face overdispersion. More importantly, the binomial mean modeled in (1) is no longer the mean of the mixture distribution in the presence of structural zeros. Thus, although some ad hoc techniques, such as sandwich variance estimates, are available to correct for overdispersion, none corrects the flaw in the mean when (1) is used to model such count responses.
The zero-inflated binomial (ZIB) model acknowledges the mixture distribution in our context defined by the binary response and the indicator of structural zeros. For simplicity of notation, we use the logistic link function for modeling the binary response throughout the paper. However, the same considerations apply to other link functions for modeling binary outcomes such as probit and complementary log-log.
Let ZIB(mi, pi, ρi) denote a zero-inflated binomial distribution, with ρi denoting the probability of a structural zero and mi (pi) denoting the size (success probability) of the binomial component. Assuming that yi | xi follows a ZIB and that logit models are applied to both the structural zero and the binomial components, we may write the ZIB model as:
\[ y_i \mid x_i \sim \mathrm{ZIB}(m_i, p_i, \rho_i), \qquad \mathrm{logit}(\rho_i) = u_i^{\top}\beta_u, \qquad \mathrm{logit}(p_i) = v_i^{\top}\beta_v, \tag{2} \]
where ui and vi are two subsets of xi (not necessarily disjoint). The ZIB(mi, pi, ρi) in (2) has the following probability mass function:
\[ f(y_i \mid m_i, p_i, \rho_i) = \begin{cases} \rho_i + (1-\rho_i)(1-p_i)^{m_i}, & y_i = 0, \\ (1-\rho_i)\binom{m_i}{y_i} p_i^{y_i}(1-p_i)^{m_i-y_i}, & y_i = 1, \ldots, m_i, \end{cases} \tag{3} \]
where the binomial probability at 0, $(1-p_i)^{m_i}$, is replaced by $\rho_i + (1-\rho_i)(1-p_i)^{m_i}$ to account for the presence of structural zeros. Maximum likelihood may be applied for inference about the model parameters, $\beta = (\beta_u^{\top}, \beta_v^{\top})^{\top}$. However, parametric approaches are quite restrictive in the sense that the true distribution may deviate from the assumed model, invalidating the inference. For example, outcomes measuring cumulative incidence over a period of time from the same subject, such as alcohol consumption per day over a week, are generally correlated and as a result may not follow the binomial distribution. Thus, models based on weaker assumptions, such as semiparametric approaches that assume no exact distribution but only some aspects of the distribution such as the mean, provide robust inference for a wider class of data distributions.
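As a small numerical illustration of the ZIB probability function, the following Python sketch evaluates the mixture probabilities for one hypothetical setting (m = 7, p = 0.4, ρ = 0.2, values chosen only for illustration) and compares the ZIB variance with the variance of a plain binomial that has the same mean. The function name `zib_pmf` is ours, not part of any library.

```python
import numpy as np
from scipy.stats import binom

def zib_pmf(y, m, p, rho):
    """P(Y = y) for a zero-inflated binomial with size m, success prob p, structural-zero prob rho."""
    y = np.asarray(y)
    at_risk = (1.0 - rho) * binom.pmf(y, m, p)   # contribution of the binomial component
    return np.where(y == 0, rho + at_risk, at_risk)  # structural zeros add mass at 0 only

# Hypothetical parameter values, for illustration only.
m, p, rho = 7, 0.4, 0.2
support = np.arange(m + 1)
pmf = zib_pmf(support, m, p, rho)

zib_mean = np.sum(support * pmf)                  # analytically (1 - rho) * m * p
zib_var = np.sum(support**2 * pmf) - zib_mean**2

# A plain binomial matched to the same mean implies a smaller variance.
p_matched = zib_mean / m
binom_var_same_mean = m * p_matched * (1.0 - p_matched)

print(pmf.sum(), zib_mean, zib_var, binom_var_same_mean)
```

In this setting the ZIB variance exceeds the matched binomial variance, which is the overdispersion one would face when fitting a plain binomial model to data with structural zeros.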
The estimating equations (EE) based on conditional mean responses from the generalized linear models (GLM) are commonly used semiparametric alternatives to parametric models. However, existing EEs are mostly for modelling a single mean response and as such are not sufficient to identify the parameters of mixture distributions within the current context. Thus, we model both the probability of structural zero and the mean of a zero-truncated binomial:
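\[ E[I(y_i = 0) \mid x_i] = \rho_i + (1-\rho_i)(1-p_i)^{m_i}, \qquad E[y_i \mid y_i > 0, x_i] = \frac{m_i p_i}{1-(1-p_i)^{m_i}}, \qquad \mathrm{logit}(\rho_i) = u_i^{\top}\beta_u, \quad \mathrm{logit}(p_i) = v_i^{\top}\beta_v. \tag{4} \]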
The means, h1i (xi) and h2i (xi), are derived based on the ZIB model (2), and thus the ZIB in (2) implies (4). However, since (4) makes no assumption about the exact distribution, this ZIB-like model yields more robust inference. The mean h1i (xi) of a version of (2) truncated at 0 is used in (4) to enable inference based on observed data. Note that the use of the truncated binomial distribution in (4) for yi > 0 | xi bears some resemblance to the approach of [5], as the latter also used truncated Poisson distributions for the positive response. However, the essential difference is that their approach models the mixture of zeros and truncated (positive) count responses and hence does not distinguish structural from random zeros. In contrast, by modeling the structural zeros specifically, (4) is able to distinguish structural from random zeros. Note that the membership of the mixture components (zero or positive responses) in the approach of [5] is known, because it models the observed zeros, and thus it is really a hurdle model [24]. In (4), however, the membership of the mixture components (structural zeros or not) is unknown, and thus (4) truly represents a mixture model for a mixed population consisting of the at-risk (defined by random zero and positive responses) and non-risk (defined by structural zeros) subgroups.
The semiparametric, or distribution-free, ZIB-like model in (4) is a model for a 2-dimensional response vector, rather than a single response as in a classic GLM or EE model. Methods for restricted moment models may be applied for inference [1, 9, 18, 25]. Let I(·) denote the indicator function, $s_{1i} = I(y_i = 0) - E[I(y_i = 0) \mid x_i] = I(y_i = 0) - \rho_i - (1-\rho_i)(1-p_i)^{m_i}$, $s_{2i} = I(y_i > 0)\,(y_i - E[y_i \mid y_i > 0, x_i]) = I(y_i > 0)\,(y_i - m_i p_i/[1-(1-p_i)^{m_i}])$, and $S_i = (s_{1i}, s_{2i})$. We have
Lemma 2.1
The variance matrix of the response function Si is given by
Cov(s1i, s2i) = 0.
A proof of the lemma is given in the Web Appendix A.
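The zero-mean and zero-covariance properties of the two response functions can also be checked numerically. The sketch below enumerates the ZIB probabilities for one hypothetical parameter setting (m = 7, p = 0.4, ρ = 0.2, chosen only for illustration) and computes the first two moments of s1i and s2i exactly under the ZIB distribution. It is a sanity check under these assumed values, not a substitute for the proof.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical parameter values, for illustration only.
m, p, rho = 7, 0.4, 0.2
y = np.arange(m + 1)

# ZIB probabilities: binomial component scaled by (1 - rho), extra mass rho at zero.
pmf = (1.0 - rho) * binom.pmf(y, m, p)
pmf[0] += rho

pi0 = rho + (1.0 - rho) * (1.0 - p) ** m     # E[I(y = 0) | x]
mu_pos = m * p / (1.0 - (1.0 - p) ** m)      # E[y | y > 0, x], zero-truncated binomial mean

s1 = (y == 0) - pi0                          # first response function
s2 = (y > 0) * (y - mu_pos)                  # second response function

mean_s1 = np.sum(pmf * s1)
mean_s2 = np.sum(pmf * s2)
cov_s12 = np.sum(pmf * s1 * s2) - mean_s1 * mean_s2
var_s1 = np.sum(pmf * s1**2)                 # Bernoulli variance pi0 * (1 - pi0)
var_s2 = np.sum(pmf * s2**2)

print(mean_s1, mean_s2, cov_s12, var_s1, var_s2)
```

The covariance vanishes because s2i is nonzero only when yi > 0, where s1i is the constant −E[I(yi = 0) | xi], so the cross-product is proportional to the mean of s2i.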
Within the current context, the ZIB-like model in (4) can be estimated by solving the following EE:
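\[ \sum_{i=1}^{n} D_i V_i^{-1} S_i = 0, \tag{5} \]
where $D_i$ is the derivative matrix associated with the mean functions in (4) with respect to β, and $V_i = \mathrm{Var}(S_i)$ is given in the above lemma.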
It is straightforward to show that the EE in (5) is unbiased and thus provides asymptotically consistent and normally distributed estimates. We summarize the properties of the estimate in the following theorem for ease of reference.
Theorem 2.2
Under the assumption of model (4), the estimating equation (5) is unbiased, and hence the estimate β̂ obtained by solving (5) is consistent. Furthermore, as n → ∞ we have
A consistent estimate of the asymptotic variance Σ is readily constructed by substituting consistent estimates of the respective quantities defining Σ, i.e.,
where V̂i and D̂i are Vi and Di with β̂ substituted in place of β.
A proof of the theorem is given in the Web Appendix B.
Remarks
Note that Vi in (5) can be any invertible matrix function of xi, although the common choice Vi = Var(Si) provides the most efficient estimates when the data do follow the ZIB distribution. Regardless of the choice of Vi, inference based on (5) is valid as long as the specification in (4), an assumption weaker than that of ZIB, holds.
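As a rough numerical illustration of solving estimating equations built from the response functions s1i and s2i, the following Python sketch simulates cross-sectional ZIB data and solves a simplified, unweighted set of three equations for (βu, βv0, βv1). The simple weighting used here (stacking the averages of s1i, s2i and xi s2i), the intercept-only model for ρi, the sample size, the seed and all names are assumptions made for illustration. This is not the efficient choice Vi = Var(Si), and it is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import root
from scipy.special import expit

rng = np.random.default_rng(2024)
n, m = 20000, 7
beta_true = np.array([-1.2, -0.1, -0.3])   # (beta_u, beta_v0, beta_v1), illustrative values

# Simulate ZIB data: structural zeros with probability rho, otherwise Binomial(m, p_i).
x = rng.normal(1.0, 1.0, size=n)
rho = expit(beta_true[0])
p = expit(beta_true[1] + beta_true[2] * x)
structural = rng.random(n) < rho
y = np.where(structural, 0, rng.binomial(m, p))

def estimating_function(beta):
    """Sample averages of s1, s2 and x * s2 evaluated at beta (assumed simple weighting)."""
    rho_b = expit(beta[0])
    p_b = np.clip(expit(beta[1] + beta[2] * x), 1e-10, 1.0 - 1e-10)
    q_m = (1.0 - p_b) ** m
    s1 = (y == 0) - (rho_b + (1.0 - rho_b) * q_m)    # zero indicator minus its mean
    s2 = (y > 0) * (y - m * p_b / (1.0 - q_m))       # positive part minus truncated mean
    return np.array([s1.mean(), s2.mean(), (x * s2).mean()])

sol = root(estimating_function, x0=np.zeros(3))
beta_hat = sol.x
print(sol.success, np.round(beta_hat, 3))
```

Because the three equations are unbiased whenever the two conditional moments are correctly specified, the solution should be close to the generating values for a sample of this size, although it will generally be less efficient than the estimate based on Vi = Var(Si).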
3. ZIB-like Models for Longitudinal Data
Now consider longitudinal studies. Suppose there are l assessment times. For notational brevity, assume assessment times are fixed a priori. Let yit and xit be the outcome and the explanatory variables from time t and yi = (yi1, …, yil)⊤ and xi = (xi1, …, xil)⊤ be the outcome and the explanatory variables across the time points for the ith subject (1 ≤ i ≤ n). Both models discussed in the preceding section can be extended to longitudinal data. For the parametric ZIB in (2), the extension can be achieved through random effects, while the semiparametric ZIB-like model in (4) can be extended to longitudinal data by combining the marginal models for each of the assessment points as follows:
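\[ E[I(y_{it} = 0) \mid x_{it}] = \rho_{it} + (1-\rho_{it})(1-p_{it})^{m_{it}}, \qquad E[y_{it} \mid y_{it} > 0, x_{it}] = \frac{m_{it} p_{it}}{1-(1-p_{it})^{m_{it}}}, \qquad \mathrm{logit}(\rho_{it}) = u_{it}^{\top}\beta_u, \quad \mathrm{logit}(p_{it}) = v_{it}^{\top}\beta_v, \quad 1 \le t \le l, \tag{6} \]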
where mit is the size of the binomial at time t. Let Si = (Si1, …, Sil), and
The semiparametric ZIB-like model above is also a restricted moment model, with more complicated functional responses than the model in (4), and thus the general theory of restricted moment models applies. An important difference is that the variance-covariance structure of the response functions in the longitudinal case is more complicated. In fact, it may not be practical to model the true variance structure. Liang and Zeger [12] suggested the working correlation approach, which specifies a working correlation structure for the estimating equations. If the working correlation captures the true correlation, the estimates attain optimal efficiency. Even if it is misspecified, the estimates remain consistent.
More precisely, we define the generalized estimating equations in a form similar to (5), but with Si defined in (6) and with Di and Vi modified as follows. Let
(7) |
To define Vi, we assume the following working correlation model:
(8) |
The working correlation R(α) is often chosen to be independent or exchangeable for convenience, although other forms may be specified for a particular study. The form selected for R(α) does not affect the consistency of the estimates, only their efficiency. For the GEE model, one may need to estimate the parameters in the working correlation R(α) first. As with (5), the GEE estimates are consistent because the estimating equations are unbiased. The estimator has asymptotic properties analogous to those in the cross-sectional case, which we summarize in the following theorem.
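To make the two common choices of working correlation concrete, the sketch below builds an independent and an exchangeable R(α) and combines one of them with a set of marginal variances. It assumes the usual Liang and Zeger construction V = A^{1/2} R(α) A^{1/2}, applied to a single component of the response function across l = 3 visits. The variances, the value α = 0.3 and the function names are hypothetical.

```python
import numpy as np

def exchangeable_corr(dim, alpha):
    """Exchangeable working correlation: ones on the diagonal, alpha elsewhere."""
    return (1.0 - alpha) * np.eye(dim) + alpha * np.ones((dim, dim))

def working_covariance(variances, corr):
    """V = A^{1/2} R A^{1/2}, where A is the diagonal matrix of marginal variances."""
    sd = np.sqrt(np.asarray(variances, dtype=float))
    return corr * np.outer(sd, sd)

# Hypothetical marginal variances of one response function at l = 3 visits.
marginal_var = np.array([0.19, 0.21, 0.20])

R_ind = exchangeable_corr(3, 0.0)    # working independence is the special case alpha = 0
R_exch = exchangeable_corr(3, 0.3)   # exchangeable with alpha = 0.3
V_exch = working_covariance(marginal_var, R_exch)
print(V_exch)
```

Under working independence the matrix reduces to the diagonal matrix of marginal variances, which is the choice used in the simulation study.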
Theorem 3.1
Let β̂ denote the estimator of β obtained by solving the GEE above and α̂ the estimator of α. Under some mild regularity conditions, and provided that α̂ is √n-consistent, the estimator β̂ is consistent and is asymptotically normally distributed with zero mean and covariance matrix:
A consistent estimator of Σβ is given by , where
B̂i, D̂i, Ŝi and V̂i denote the corresponding quantities with β and α replaced by β̂ and α̂
Remarks
Hall and Zhang [7] considered an extension of ZIB to the longitudinal setting by modeling the marginal response based on (3) and the within-subject correlations using the GEE. Specifically, let rit be an indicator with the value 1 if the ith subject at time t is from the binomial component of ZIB and 0 otherwise, so that rit = 1 if yit > 0. Since structural and random zeros cannot be distinguished from each other, rit is unknown if yit = 0. By modeling the (marginal) relationship between yit and xit using (2), Hall and Zhang [7] used a set of estimating equations to account for correlations among the yit’s. Although similar to the GEE model in (5), their estimating equations cannot be solved directly since rit is unobserved for yit = 0. To address the latent nature of rit when yit = 0, they developed the Expectation-Solution (ES) algorithm, an expectation-maximization (EM)-type algorithm, with the E-step computing the expected value of rit and the S-step solving the resulting equations. While sound in theory, this approach is quite problematic in practice due to convergence issues with EM algorithms (see our simulation studies below). As in the cross-sectional case, our approach is based on weaker assumptions and thus is more robust than the ES method of [7]; instead of assuming the parametric ZIB for the marginal distribution, we only assume two specific conditional moments. Further, we extend our method to longitudinal data with missing values below.
Hall and Zhang [7] also mentioned an approach to model both the first- and second-moment to address the lack of information for identifying ZIB using GEE, akin to GEE II [19], and [29] applied the approach for modeling zero-inflated count data. Since such an approach imposes additional assumptions about the variance, it involves the 3rd- and 4th-order moments for inference, creating computational complexity and constraints on the variance of the response. Further, it is difficult to accommodate overdispersion in the binomial component of ZIB, a common occurrence in real studies such as the outcomes of DAD and DHD in COMBINE.
4. Inference under Missing Data
Missing data are inevitable in longitudinal studies. Patients often drop out of the study in clinical trials, producing missing values at subsequent visits. The GEE approach above may yield biased estimates if the missing values are simply excluded from the analysis, unless the missing data mechanism follows the very restrictive missing completely at random (MCAR) model [22, 24]. The MCAR assumption implies that the missingness is independent of any other variables (observed or otherwise). Unfortunately, missing data mechanisms in clinical studies often depend on some variables, and simply ignoring such a dependence structure, as in GEE, will in general yield biased estimates. As in the literature, we focus on missing values in the response and assume that the missingness follows a monotone missing data pattern (MMDP), i.e., if a missing response occurs at a time point, all subsequent responses after that time point are also missing [9, 14, 21].
In this section, we consider the missing at random (MAR) model, i.e., the missing data mechanism depends on the variables that are always observed. Within the context of longitudinal data discussed in the preceding section, we define a missing (or rather observed) data indicator for each ith subject as follows:
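\[ r_{it} = \begin{cases} 1, & \text{if } y_{it} \text{ is observed}, \\ 0, & \text{if } y_{it} \text{ is missing}, \end{cases} \]
where t = 1, 2, …, l. We assume no missing data at baseline t = 1 such that ri1 = 1 for all 1 ≤ i ≤ n. Under MAR, the missingness of yit is independent of yit given the observed history and covariates, i.e.,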
(9) |
The above condition allows one to integrate the inverse probability weighting (IPW) method with the GEE to provide valid inference for the ZIB-like model in the current context.
Let
(10) |
where diagt (Δit) denotes an l × l diagonal matrix with Δit on the tth diagonal.
We can estimate β using the following Weighted GEE (WGEE):
(11) |
where Si, Di and Vi are defined as in (6), (7) and (8). As in the case of GEE, we need to estimate α in the working correlation R(α) and substitute the estimate in place of α before the WGEE can be solved for β. Also, in practice, the weight function πit in (10) is unknown and needs to be modeled. In light of the assumption in (9), we can model the conditional probabilities using models for binary responses such as logistic models. Specifically, let pit = Pr (rit = 1 | ri(t−1) = 1, Hit−). If using logistic regression, we may model pit as:
(12) |
The parameters γ in (12) are estimated based on the subsample of subjects that were observed at least until time t. We estimate πit through the relationship $\pi_{it} = \prod_{s=2}^{t} p_{is}$.
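The step from the conditional probabilities pit to the weights can be sketched as follows. Under a monotone missing data pattern, the probability of still being observed at time t is the cumulative product of the conditional probabilities up to t, and the inverse probability weight is the observed-data indicator divided by that probability. The form Δit = rit/πit is the standard IPW weight and is assumed here. The toy indicators, the probabilities and the function name are hypothetical.

```python
import numpy as np

def ipw_weights(r, p_cond):
    """Inverse probability weights under monotone dropout.

    r      : (n, l) array of 0/1 observed-data indicators, first column all ones.
    p_cond : (n, l) array of Pr(r_it = 1 | r_i(t-1) = 1, history), first column all ones.
    """
    r = np.asarray(r, dtype=float)
    pi = np.cumprod(np.asarray(p_cond, dtype=float), axis=1)  # pi_it = product of p_is for s <= t
    return r / pi                                              # Delta_it = r_it / pi_it

# Three hypothetical subjects over l = 3 visits with monotone dropout.
r = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 0]])
p_cond = np.array([[1.0, 0.9, 0.8],
                   [1.0, 0.8, 0.5],
                   [1.0, 0.5, 0.5]])
delta = ipw_weights(r, p_cond)
print(delta)
```

Observed responses from subjects who were unlikely to remain in the study receive larger weights, while missing responses receive weight zero and so drop out of the weighted estimating equations.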
We may estimate γ using the following estimating equations:
(13) |
With the estimated πit, we can estimate β based on the following weighted generalized estimating equations:
(14) |
where Di, Vi and Si are defined as in (6), (7) and (8), and Δ̂i denotes Δi in (10) with estimated πit. Again, as in the complete data case, Vi may be a function of α if a working correlation model other than working independence is used. In this case, we must estimate α first and substitute the estimate for α in (14) before solving (14) for β.
The WGEE estimate β̂ based on (14) also has nice asymptotic properties summarized in the following theorem.
Theorem 4.1
Let β̂ denote the estimate of β obtained by solving the WGEE in (14) and α̂ denote some estimate of α. Under some mild regularity conditions, and provided that α̂ is √n-consistent, the estimator β̂ is consistent and is asymptotically normally distributed with zero mean and covariance matrix $\Sigma_\beta = B^{-1} E[\Sigma_U + \Phi] B^{-\top}$, where
A consistent estimator of Σβ is given by replacing ΣU, B and Φ with the following estimates for the corresponding components:
B̂i, D̂i, Ŝi and V̂i denote the corresponding quantities with β replaced by β̂.
When γ is estimated, the asymptotic variance contains an additional term to account for the variability in estimating γ. A proof of the theorem is sketched in the Web Appendix C.
5. Simulation Study
In this section, we assess the performance of the proposed approach under small to large sample sizes using simulation studies. All simulations were performed with a Monte Carlo size of 1,000. We examine the performance of the new approach for both the cross-sectional and longitudinal data. For space consideration, we only report results for longitudinal data with sample size n = 50, 200 and 1000.
For notational brevity, we considered a relatively simple pre-post longitudinal study design, with only one explanatory variable xi following a normal N (1, 1). The count response at the pre and post time points, yi = (yi1, yi2)⊤, was assumed to follow the marginal ZIB model:
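\[ y_{it} \mid x_i \sim \mathrm{ZIB}(m, p_i, \rho_i), \qquad \mathrm{logit}(\rho_i) = \beta_{u0}, \qquad \mathrm{logit}(p_i) = \beta_{v0} + \beta_{v1} x_i, \qquad t = 1, 2. \tag{15} \]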
The Copula method was used to simulate correlated data with the fixed marginal model (15) [6, 17, 28].
We first generated xi ~ N (1, 1), and then simulated the mixture membership by generating independent Bernoulli random variables ai with parameters ρi. If ai = 1, we set yi1 = yi2 = 0, i.e., the subject was a structural zero. Otherwise, the subject belonged to the at-risk group and yit was simulated according to yit ~ Binomial(m, pi), where m was set to 7, 15 and 30, corresponding to data collected weekly, semimonthly and monthly. We set βu0 = −1.2 to produce about 20% structural zeros.
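The data-generating steps above can be sketched in Python as follows. The copula family is not specified in the text, so a Gaussian copula is assumed here: correlated standard normals are mapped to uniforms and then to binomial quantiles. The latent correlation `copula_corr` is a tuning parameter of this sketch and is not identical to the correlation between the binomial responses. All function and variable names are hypothetical.

```python
import numpy as np
from scipy.special import expit
from scipy.stats import binom, norm

def simulate_pre_post_zib(n, m, beta_u0, beta_v0, beta_v1, copula_corr, rng):
    """Simulate pre-post ZIB responses with a Gaussian copula for the at-risk group."""
    x = rng.normal(1.0, 1.0, size=n)
    rho = expit(beta_u0)                          # probability of a structural zero
    p = expit(beta_v0 + beta_v1 * x)              # binomial success probability
    structural = rng.random(n) < rho              # a_i = 1 marks a structural zero

    # Gaussian copula: correlated normals -> uniforms -> binomial quantiles.
    cov = np.array([[1.0, copula_corr], [copula_corr, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    u = norm.cdf(z)
    y = binom.ppf(u, m, p[:, None]).astype(int)

    y[structural] = 0                             # structural zeros at both time points
    return x, y, structural

rng = np.random.default_rng(7)
x, y, structural = simulate_pre_post_zib(5000, 7, -1.2, -0.1, -0.3, 0.5, rng)
print(structural.mean(), (y == 0).mean(axis=0))
```

With βu0 = −1.2 the proportion of structural zeros is close to expit(−1.2) ≈ 0.23, and the observed proportion of zeros at each time point is slightly higher because the at-risk group also produces random zeros.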
Copula was used to generate correlated multivariate responses for the at-risk group [17]. Let λi be the correlation between the pre- and post- measures among the at-risk group. Then, it is readily checked that for the ZIB in (15), the correlation between yi1 and yi2 is
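\[ \mathrm{Corr}(y_{i1}, y_{i2} \mid x_i) = \frac{\lambda_i (1-p_i) + \rho_i m p_i}{(1-p_i) + \rho_i m p_i}. \tag{16} \]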
See Web Appendix D for the derivation.
We examined the performance of the approach for both the complete and missing data cases. For the missing data case, we simulated missing values following both MCAR and MAR to examine the impact of different missing mechanisms on the validity of the GEE inference. We assumed no missing data at baseline (t = 1) for the response and the covariate, so that missing values occurred only in yi2. To create missing data under MAR, we simulated the missingness at time t = 2 using the following logistic model:
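\[ \mathrm{logit}\{\Pr(r_{i2} = 1 \mid x_{i1}, y_{i1})\} = \gamma_0 + \gamma_x x_{i1} + \gamma_y y_{i1}, \tag{17} \]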
where γ0, γx and γy were constants, controlling the amount of missing data as well as the strength of dependence of missingness of yi2 on xi1 and yi1.
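A missingness mechanism of this kind can be sketched in Python as follows. The sketch assumes a logistic model for the probability that yi2 is observed, with linear predictor γ0 + γx xi1 + γy yi1, and calibrates γ0 by root finding so that the average probability of a missing response equals a target such as 20%. The values of γx and γy, the stand-in baseline outcomes and the function name are hypothetical. Setting γx = γy = 0 gives an MCAR mechanism.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit

def calibrate_gamma0(x, y1, gamma_x, gamma_y, target_missing=0.2):
    """Find gamma_0 so that the average missingness probability equals the target."""
    def gap(g0):
        p_obs = expit(g0 + gamma_x * x + gamma_y * y1)   # Pr(y_i2 observed)
        return (1.0 - p_obs).mean() - target_missing
    return brentq(gap, -20.0, 20.0)

rng = np.random.default_rng(11)
n = 2000
x = rng.normal(1.0, 1.0, size=n)
y1 = rng.binomial(7, expit(-0.1 - 0.3 * x))   # stand-in baseline outcomes, for illustration

gamma_x, gamma_y = 0.5, -0.3                  # hypothetical dependence strengths
gamma0 = calibrate_gamma0(x, y1, gamma_x, gamma_y)

p_obs = expit(gamma0 + gamma_x * x + gamma_y * y1)
r2 = rng.random(n) < p_obs                    # True where y_i2 is observed
print(gamma0, (~r2).mean())
```

Because the probability of being observed depends only on quantities measured at baseline, the mechanism is MAR, and the fitted probabilities can serve as the weights in the weighted estimating equations.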
5.1. Complete Data Case
We simulated data using different λis to assess the impact of the correlation between the repeated outcomes among the at-risk group. For space issues, we present the cases when λi = 0.001 and 0.5. When λi = 0.001, the pre and post outcomes are virtually independent for the at-risk group. Thus, the correlation between yi1 and yi2 for the whole sample, as indicated by (16), is due to the structural zeros. To keep the correlation between yi1 and yi2 in a reasonable range, we set βv0 = −0.1 and βv1 = −0.3 in our simulation study, where the corresponding correlations between yi1 and yi2 range from 0.52 to 0.82.
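The reported range of correlations can be checked with a short calculation. Writing yit = (1 − ai) bit, where ai is the structural zero indicator and bit ~ Binomial(m, pi) with correlation λi between bi1 and bi2, a direct variance decomposition gives Corr(yi1, yi2 | xi) = [λi (1 − pi) + ρi m pi] / [(1 − pi) + ρi m pi]. This is our own derivation and may differ in form from the one in Web Appendix D. The sketch below evaluates it at xi = 1, the covariate mean, which is an assumption about how the reported range was obtained.

```python
from scipy.special import expit

def zib_pair_correlation(m, p, rho, lam):
    """Correlation of two ZIB responses sharing the same structural-zero indicator."""
    q = 1.0 - p
    return (lam * q + rho * m * p) / (q + rho * m * p)

rho = expit(-1.2)                      # structural-zero probability with beta_u0 = -1.2
p_at_mean_x = expit(-0.1 - 0.3 * 1.0)  # success probability at x = 1

corr_by_m = {m: zib_pair_correlation(m, p_at_mean_x, rho, 0.001) for m in (7, 15, 30)}
for m, corr in corr_by_m.items():
    print(m, round(corr, 2))
```

Under these assumptions the correlation increases with m, from about 0.52 at m = 7 to about 0.82 at m = 30, in line with the range quoted above.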
We fitted the proposed ZIB-like model to the simulated data under the working independence model. Shown in Table 1 are the estimates of β and the empirical and sandwich standard errors from 1000 MC simulations in the complete data case for λi = 0.001. For comparison purposes, the simulation results from fitting the simulated data using the ZIB-ES method [7] are also presented. The results suggest that the proposed and the ZIB-ES methods provide very similar estimates, and both sets of estimates of β are quite close to their respective true values. Note that Hall and Zhang’s ZIB-ES requires much more computing time; the computing time used was on average about 10 times that of our proposed method for these simulation studies. This is because ES is an EM-type algorithm, and such algorithms are notorious for their slow convergence [10, 13, 15, 16, 27].
Table 1.
m | Sample Size | Parameter | GEE Est.: Mean (Emp., Asym.) | ZIB-ES Est.: Mean (Emp., Asym.)
---|---|---|---|---
7 | 50 | βu | −1.244 (0.354, 0.350) | −1.245 (0.355, 0.366)
7 | 50 | βv0 | −0.103 (0.129, 0.131) | −0.104 (0.129, 0.128)
7 | 50 | βv1 | −0.302 (0.101, 0.101) | −0.303 (0.102, 0.099)
7 | 200 | βu | −1.210 (0.177, 0.175) | −1.210 (0.175, 0.177)
7 | 200 | βv0 | −0.101 (0.065, 0.063) | −0.101 (0.063, 0.064)
7 | 200 | βv1 | −0.300 (0.050, 0.050) | −0.301 (0.051, 0.049)
7 | 1000 | βu | −1.200 (0.079, 0.079) | −1.200 (0.079, 0.079)
7 | 1000 | βv0 | −0.099 (0.027, 0.029) | −0.099 (0.027, 0.029)
7 | 1000 | βv1 | −0.301 (0.022, 0.022) | −0.301 (0.022, 0.022)
15 | 50 | βu | −1.235 (0.335, 0.335) | −1.235 (0.335, 0.344)
15 | 50 | βv0 | −0.101 (0.082, 0.085) | −0.101 (0.082, 0.083)
15 | 50 | βv1 | −0.301 (0.066, 0.063) | −0.301 (0.063, 0.061)
15 | 200 | βu | −1.208 (0.167, 0.168) | −1.208 (0.167, 0.169)
15 | 200 | βv0 | −0.100 (0.042, 0.042) | −0.100 (0.042, 0.042)
15 | 200 | βv1 | −0.300 (0.031, 0.032) | −0.300 (0.032, 0.031)
15 | 1000 | βu | −1.201 (0.076, 0.075) | −1.201 (0.076, 0.075)
15 | 1000 | βv0 | −0.099 (0.019, 0.019) | −0.099 (0.019, 0.019)
15 | 1000 | βv1 | −0.301 (0.014, 0.014) | −0.301 (0.014, 0.014)
30 | 50 | βu | −1.234 (0.334, 0.334) | −1.234 (0.334, 0.343)
30 | 50 | βv0 | −0.101 (0.058, 0.061) | −0.101 (0.058, 0.058)
30 | 50 | βv1 | −0.300 (0.044, 0.045) | −0.300 (0.044, 0.43)
30 | 200 | βu | −1.208 (0.167, 0.167) | −1.208 (0.167, 0.169)
30 | 200 | βv0 | −0.100 (0.030, 0.030) | −0.100 (0.030, 0.030)
30 | 200 | βv1 | −0.300 (0.022, 0.022) | −0.300 (0.022, 0.022)
30 | 1000 | βu | −1.201 (0.076, 0.075) | −1.201 (0.076, 0.075)
30 | 1000 | βv0 | −0.099 (0.013, 0.013) | −0.099 (0.013, 0.013)
30 | 1000 | βv1 | −0.301 (0.010, 0.010) | −0.301 (0.010, 0.010)
Simulation summary for GEE and ZIB-ES under complete data (λ = 0.001, βu = −1.2, βv0 = −0.1, βv1 = −0.3). Entries are means of the estimates, with empirical and asymptotic (sandwich) standard errors in parentheses.
The simulation study for the λi = 0.5 case led to similar conclusions and is not reported here to save space.
5.2. Missing Data Case
We used model (17) to generate missing values. We first consider the MCAR case. To generate missing data under MCAR, we set γx = γy = 0. The value of γ0 was selected so that there were about 20% missing values in yi2. Under MCAR, missing values can simply be ignored. Thus, we used listwise deletion to deal with missing values and simply applied GEE to the observed data. Shown in Table 2 are the estimates of β and the empirical and sandwich standard errors based on 1000 MC simulations for both the ZIB-like WGEE method and the ZIB-ES method for the case λi = 0.001. The performance of both methods is similar to that in the complete data case, i.e., they yield similar estimates, but the ZIB-ES approach again requires substantially more computing time.
Table 2.
m | Sample Size | Parameter | WGEE Est.: Mean (Emp., Asym.) | ZIB-ES Est.: Mean (Emp., Asym.)
---|---|---|---|---
7 | 50 | βu | −1.254 (0.370, 0.361) | −1.254 (0.367, 0.377)
7 | 50 | βv0 | −0.101 (0.140, 0.138) | −0.102 (0.138, 0.134)
7 | 50 | βv1 | −0.306 (0.111, 0.106) | −0.305 (0.110, 0.104)
7 | 200 | βu | −1.207 (0.182, 0.181) | −1.208 (0.181, 0.182)
7 | 200 | βv0 | −0.101 (0.067, 0.068) | −0.102 (0.066, 0.068)
7 | 200 | βv1 | −0.300 (0.053, 0.052) | −0.300 (0.053, 0.052)
7 | 1000 | βu | −1.201 (0.081, 0.081) | −1.201 (0.081, 0.081)
7 | 1000 | βv0 | −0.099 (0.029, 0.031) | −0.099 (0.029, 0.030)
7 | 1000 | βv1 | −0.301 (0.023, 0.023) | −0.301 (0.023, 0.023)
15 | 50 | βu | −1.254 (0.370, 0.361) | −1.241 (0.343, 0.353)
15 | 50 | βv0 | −0.101 (0.140, 0.138) | −0.100 (0.089, 0.087)
15 | 50 | βv1 | −0.306 (0.111, 0.106) | −0.302 (0.068, 0.065)
15 | 200 | βu | −1.207 (0.182, 0.181) | −1.206 (0.172, 0.173)
15 | 200 | βv0 | −0.101 (0.067, 0.068) | −0.100 (0.043, 0.044)
15 | 200 | βv1 | −0.300 (0.053, 0.052) | −0.300 (0.033, 0.033)
15 | 1000 | βu | −1.201 (0.081, 0.081) | −1.202 (0.078, 0.077)
15 | 1000 | βv0 | −0.099 (0.029, 0.031) | −0.099 (0.020, 0.020)
15 | 1000 | βv1 | −0.301 (0.023, 0.023) | −0.301 (0.015, 0.015)
30 | 50 | βu | −1.241 (0.344, 0.344) | −1.240 (0.342, 0.352)
30 | 50 | βv0 | −0.100 (0.064, 0.064) | −0.100 (0.063, 0.061)
30 | 50 | βv1 | −0.301 (0.048, 0.047) | −0.301 (0.047, 0.045)
30 | 200 | βu | −1.205 (0.172, 0.173) | −1.206 (0.171, 0.173)
30 | 200 | βv0 | −0.100 (0.031, 0.032) | −0.100 (0.031, 0.031)
30 | 200 | βv1 | −0.300 (0.023, 0.023) | −0.300 (0.023, 0.023)
30 | 1000 | βu | −1.201 (0.078, 0.077) | −1.202 (0.078, 0.077)
30 | 1000 | βv0 | −0.100 (0.014, 0.014) | −0.100 (0.014, 0.014)
30 | 1000 | βv1 | −0.301 (0.010, 0.010) | −0.301 (0.010, 0.010)
Simulation summary for WGEE and ZIB-ES under MCAR (λ = 0.001, βu = −1.2, βv0 = −0.1, βv1 = −0.3). Entries are means of the estimates, with empirical and asymptotic (sandwich) standard errors in parentheses.
Again, the simulation study for the λi = 0.5 case led to similar conclusions and is not reported here.
To simulate missing responses following MAR, we set in (17). Again, the value of γ0 was selected to create about 20% missing responses yi2 at post assessment. Unlike the MCAR case, we cannot simply ignore missing values in the MAR cases. We applied our weighted GEE approach to deal with the missing values. We used model (17) for the missing mechanism and the parameters were estimated from the simulated data. As an illustration that ZIB-ES cannot deal with missing values in the MAR cases, we also computed the ZIB-ES estimates ignoring missing values.
Shown in Tables 3 and 4 are the estimates for both the ZIB-like WGEE method and the ZIB-ES method for the cases λi = 0.001 and λi = 0.5, respectively. Our WGEE approach showed performance similar to that in the complete data and MCAR cases, demonstrating its capability to deal with missing values under the MAR assumption. In contrast, the ZIB-ES method fails to address the missing values. The bias in the estimates of the coefficient in the mixture component, βu, is apparent for both values of λ. For the count component, the ZIB-ES estimates of βv0 and βv1 are still good when λ = 0.001; this is expected because λ is so small that the pre and post outcomes are almost independent for the at-risk group. However, when λ = 0.5, obvious bias occurs in the ZIB-ES estimates of the intercept βv0, although the bias in the ZIB-ES estimates of βv1 is not obvious. Again, this is expected, and in fact a similar phenomenon occurs for binomial regression, i.e., when there are no structural zeros. Thus, the ZIB-ES method does not apply to MAR in general. Further, it continues to suffer from the computational burden, using about 10 times more computing time than our WGEE approach.
Table 3.
m | Sample Size | Parameter | WGEE Est.: Mean (Emp., Asym.) | ZIB-ES Est.: Mean (Emp., Asym.)
---|---|---|---|---
7 | 50 | βu | −1.246 (0.366, 0.358) | −1.162 (0.378, 0.374)
7 | 50 | βv0 | −0.105 (0.137, 0.139) | −0.099 (0.143, 0.133)
7 | 50 | βv1 | −0.302 (0.111, 0.109) | −0.309 (0.113, 0.104)
7 | 200 | βu | −1.209 (0.180, 0.179) | −1.121 (0.182, 0.180)
7 | 200 | βv0 | −0.101 (0.068, 0.069) | −0.097 (0.068, 0.067)
7 | 200 | βv1 | −0.301 (0.055, 0.054) | −0.303 (0.052, 0.052)
7 | 1000 | βu | −1.201 (0.082, 0.080) | −1.109 (0.080, 0.080)
7 | 1000 | βv0 | −0.098 (0.029, 0.031) | −0.099 (0.032, 0.030)
7 | 1000 | βv1 | −0.301 (0.024, 0.024) | −0.302 (0.023, 0.023)
15 | 50 | βu | −1.236 (0.342, 0.339) | −1.099 (0.347, 0.347)
15 | 50 | βv0 | −0.101 (0.094, 0.094) | −0.096 (0.093, 0.089)
15 | 50 | βv1 | −0.301 (0.071, 0.070) | −0.305 (0.070, 0.066)
15 | 200 | βu | −1.208 (0.171, 0.170) | −1.078 (0.171, 0.170)
15 | 200 | βv0 | −0.100 (0.047, 0.047) | −0.098 (0.045, 0.045)
15 | 200 | βv1 | −0.300 (0.036, 0.035) | −0.301 (0.033, 0.033)
15 | 1000 | βu | −1.201 (0.078, 0.076) | −1.071 (0.078, 0.076)
15 | 1000 | βv0 | −0.099 (0.020, 0.021) | −0.099 (0.022, 0.020)
15 | 1000 | βv1 | −0.301 (0.015, 0.016) | −0.301 (0.015, 0.015)
30 | 50 | βu | −1.231 (0.342, 0.341) | −1.084 (0.347, 0.345)
30 | 50 | βv0 | −0.102 (0.071, 0.071) | −0.099 (0.068, 0.064)
30 | 50 | βv1 | −0.298 (0.052, 0.051) | −0.303 (0.050, 0.046)
30 | 200 | βu | −1.207 (0.172, 0.171) | −1.068 (0.171, 0.170)
30 | 200 | βv0 | −0.102 (0.037, 0.036) | −0.098 (0.033, 0.033)
30 | 200 | βv1 | −0.299 (0.027, 0.026) | −0.301 (0.024, 0.024)
30 | 1000 | βu | −1.202 (0.079, 0.077) | −1.061 (0.078, 0.076)
30 | 1000 | βv0 | −0.100 (0.016, 0.017) | −0.100 (0.018, 0.015)
30 | 1000 | βv1 | −0.300 (0.012, 0.012) | −0.301 (0.011, 0.011)
Simulation summary for WGEE and ZIB-ES under MAR (λ = 0.001, βu = −1.2, βv0 = −0.1, βv1 = −0.3). Entries are means of the estimates, with empirical and asymptotic (sandwich) standard errors in parentheses.
Table 4.
m | Sample Size | Parameter | WGEE Est. Mean (Emp., Asym.) | ZIB-ES Est. Mean (Emp., Asym.)
---|---|---|---|---
7 | 50 | βu | −1.255 (0.369, 0.377) | −1.164 (0.371, 0.377)
7 | 50 | βv0 | −0.106 (0.165, 0.159) | −0.125 (0.160, 0.157)
7 | 50 | βv1 | −0.306 (0.130, 0.120) | −0.308 (0.125, 0.121)
7 | 200 | βu | −1.211 (0.180, 0.181) | −1.121 (0.181, 0.181)
7 | 200 | βv0 | −0.100 (0.080, 0.081) | −0.120 (0.078, 0.079)
7 | 200 | βv1 | −0.302 (0.063, 0.062) | −0.305 (0.061, 0.060)
7 | 1000 | βu | −1.201 (0.082, 0.080) | −1.111 (0.083, 0.080)
7 | 1000 | βv0 | −0.098 (0.035, 0.036) | −0.118 (0.034, 0.035)
7 | 1000 | βv1 | −0.302 (0.028, 0.028) | −0.306 (0.027, 0.027)
15 | 50 | βu | −1.236 (0.342, 0.349) | −1.107 (0.348, 0.347)
15 | 50 | βv0 | −0.104 (0.110, 0.106) | −0.124 (0.104, 0.103)
15 | 50 | βv1 | −0.300 (0.083, 0.078) | −0.300 (0.079, 0.077)
15 | 200 | βu | −1.208 (0.171, 0.171) | −1.079 (0.175, 0.170)
15 | 200 | βv0 | −0.101 (0.055, 0.055) | −0.124 (0.053, 0.052)
15 | 200 | βv1 | −0.300 (0.042, 0.041) | −0.301 (0.040, 0.039)
15 | 1000 | βu | −1.201 (0.078, 0.076) | −1.072 (0.079, 0.076)
15 | 1000 | βv0 | −0.099 (0.025, 0.025) | −0.123 (0.023, 0.024)
15 | 1000 | βv1 | −0.301 (0.018, 0.018) | −0.301 (0.017, 0.017)
30 | 50 | βu | −1.231 (0.342, 0.349) | −1.094 (0.348, 0.346)
30 | 50 | βv0 | −0.104 (0.085, 0.077) | −0.125 (0.075, 0.074)
30 | 50 | βv1 | −0.298 (0.061, 0.055) | −0.296 (0.056, 0.054)
30 | 200 | βu | −1.208 (0.171, 0.172) | −1.069 (0.174, 0.170)
30 | 200 | βv0 | −0.102 (0.043, 0.041) | −0.125 (0.038, 0.037)
30 | 200 | βv1 | −0.299 (0.031, 0.029) | −0.296 (0.028, 0.027)
30 | 1000 | βu | −1.202 (0.079, 0.077) | −1.063 (0.080, 0.076)
30 | 1000 | βv0 | −0.100 (0.020, 0.020) | −0.125 (0.017, 0.017)
30 | 1000 | βv1 | −0.300 (0.014, 0.014) | −0.296 (0.012, 0.012)
Simulation summary for WGEE and ZIB-ES under MAR with λ = 0.5, βu = −1.2, βv0 = −0.1, βv1 = −0.3
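The size of the ZIB-ES bias relative to its sampling variability can be read off Tables 3 and 4 with a small calculation. The sketch below uses only the n = 1000 rows copied from those tables; expressing the bias in units of the empirical standard error is an illustrative summary chosen here, not a quantity reported in the simulation study.

```python
# True parameter values used in the simulations (from the table captions).
TRUE = {"beta_u": -1.2, "beta_v0": -0.1}

# (mean estimate, empirical SE) at n = 1000, keyed by m.
# beta_u rows are from Table 3 (lambda = 0.001);
# beta_v0 rows are from Table 4 (lambda = 0.5).
wgee_beta_u = {7: (-1.201, 0.082), 15: (-1.201, 0.078), 30: (-1.202, 0.079)}
zibes_beta_u = {7: (-1.109, 0.080), 15: (-1.071, 0.078), 30: (-1.061, 0.078)}
wgee_beta_v0 = {7: (-0.098, 0.035), 15: (-0.099, 0.025), 30: (-0.100, 0.020)}
zibes_beta_v0 = {7: (-0.118, 0.034), 15: (-0.123, 0.023), 30: (-0.125, 0.017)}

def bias_summary(rows, truth):
    """Return {m: (bias, bias in units of the empirical SE)}."""
    return {m: (est - truth, (est - truth) / se) for m, (est, se) in rows.items()}

wgee_u = bias_summary(wgee_beta_u, TRUE["beta_u"])
zibes_u = bias_summary(zibes_beta_u, TRUE["beta_u"])
wgee_v0 = bias_summary(wgee_beta_v0, TRUE["beta_v0"])
zibes_v0 = bias_summary(zibes_beta_v0, TRUE["beta_v0"])

for label, summary in [("WGEE beta_u", wgee_u), ("ZIB-ES beta_u", zibes_u),
                       ("WGEE beta_v0", wgee_v0), ("ZIB-ES beta_v0", zibes_v0)]:
    for m in sorted(summary):
        bias, ratio = summary[m]
        print(f"{label:15s} m={m:2d} bias={bias:+.3f} bias/SE={ratio:+.2f}")
```

For the WGEE rows the bias is a small fraction of the standard error, whereas for ZIB-ES the bias in βu exceeds one standard error at every m, so larger samples would not remove it.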
6. Real Study Data
To illustrate the proposed approach with real study data, we applied it to the COMBINE study. This multi-site randomized clinical trial was conducted from 2001 to 2004 among subjects with alcohol dependence. The study was designed to compare two pharmacological treatments for alcoholism, naltrexone and acamprosate, alone and in combination with an intensive behavioral treatment, the combined behavioral intervention (CBI). In the study, 1383 subjects were randomly assigned to one of nine groups [2, 3]. Eight groups (n=1226) received medical management, a 9-session intervention focused on enhancing medication adherence and abstinence. Among these eight groups, four (n=619) also received CBI and the remaining four (n=607) did not. We refer to the four groups without CBI as the medical management (MM) group and the four groups with CBI as the combined MM and CBI treatment (Combined) group. The ninth group (n=157) received CBI alone (CBI), without pills or medical management, and hence serves as the control group. The subjects were assessed 9 times during the 16 weeks of treatment and at 26, 52, and 68 weeks after randomization, i.e., up to 1 year after treatment ended. We focused on one of the primary outcomes, DAD, which measures a subject’s days of drinking in the last 30 days.
The DAD outcome did not have any zeros at baseline, but had a preponderance of zeros after the treatments were initiated. Table 5 shows the average percentage of DAD over the 30-day period, i.e., the number of days of drinking divided by 30, and the percentage of zeros of DAD at each assessment for the three treatment groups. A substantial percentage of zeros was present during the 16 weeks of treatment (26% – 39%) and at the follow-up visits (20% – 29%), providing a strong indication that the interventions succeeded in increasing the number of alcohol abstainers in this study population.
Table 5.
Assessment time | CBI only: Mean (SD) | CBI only: Zeros (%) | MM: Mean (SD) | MM: Zeros (%) | Combined: Mean (SD) | Combined: Zeros (%)
---|---|---|---|---|---|---
Baseline | 0.76 (0.25) | 0 (0.00) | 0.74 (0.24) | 0 (0.00) | 0.75 (0.26) | 0 (0.00)
Treatment Period | | | | | |
Week 4 | 0.33 (0.35) | 41 (26.45) | 0.20 (0.26) | 213 (35.15) | 0.22 (0.29) | 225 (36.64)
Week 8 | 0.36 (0.37) | 39 (26.17) | 0.25 (0.32) | 202 (34.18) | 0.24 (0.31) | 205 (34.69)
Week 12 | 0.35 (0.36) | 42 (28.38) | 0.26 (0.33) | 222 (38.47) | 0.23 (0.31) | 215 (36.69)
Week 16 | 0.35 (0.37) | 45 (30.82) | 0.27 (0.33) | 201 (35.39) | 0.23 (0.32) | 225 (38.86)
Follow-up | | | | | |
Week 52 | 0.41 (0.37) | 28 (20.14) | 0.38 (0.37) | 132 (24.26) | 0.37 (0.37) | 149 (26.90)
Week 68 | 0.42 (0.40) | 36 (28.80) | 0.38 (0.38) | 136 (26.25) | 0.37 (0.37) | 150 (28.41)
Average percentage of DAD and the number (%) of zeros at each assessment for the COMBINE Study
We first model the outcome DAD at week 4 (yi1), week 8 (yi2), week 12 (yi3) and week 16 (yi4) as a function of the treatment conditions (xi1, xi2), the baseline DAD, yi0, and the sites, where xi1 is 1 for the treatment condition MM and 0 otherwise, and xi2 is 1 for the Combined condition and 0 otherwise. The sites were controlled for in the model because they differed significantly among the treatment conditions [2]. While random effect models are often used for such site effects given the number of sites, we chose the fixed effect approach here for two reasons. First, it is hard to specify the random effects for mixture outcomes such as ZIB if no prior information is available. Second, we focus on semiparametric approaches in this paper, but the available random effect models are parametric. Since there were 11 sites, 10 dummy variables (zi1, zi2, …, zi10) were created and included in the model.
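The covariate coding just described can be sketched as follows. The function name `covariate_row`, the site numbering 1 to 11 and the choice of site 11 as the reference level are assumptions made for illustration; the text does not say which site serves as the reference.

```python
def covariate_row(treatment, baseline_dad, site, n_sites=11, reference_site=11):
    """Build [1, x1, x2, y0, z1, ..., z10] for one subject.

    treatment    : "CBI" (reference), "MM" or "Combined"
    baseline_dad : days of drinking in the 30 days before baseline (y_i0)
    site         : integer site label in 1..n_sites (hypothetical numbering)
    """
    if treatment not in {"CBI", "MM", "Combined"}:
        raise ValueError(f"unknown treatment: {treatment!r}")
    if not 1 <= site <= n_sites:
        raise ValueError(f"site must be in 1..{n_sites}")
    x1 = 1 if treatment == "MM" else 0          # MM vs CBI only
    x2 = 1 if treatment == "Combined" else 0    # Combined vs CBI only
    # One dummy per non-reference site: 11 sites give 10 dummies.
    site_dummies = [1 if site == k else 0
                    for k in range(1, n_sites + 1) if k != reference_site]
    return [1, x1, x2, baseline_dad] + site_dummies

row = covariate_row("Combined", baseline_dad=24, site=3)
print(row)
```

The same row can serve both the structural zero component and the binomial component when the two linear predictors share covariates.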
We used the ZIB-like model in (6) with
(18) |
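One specification that would be consistent with the covariates described above and with the parameters reported in Table 6 is sketched below. The logit links, the site coefficients $\gamma_{uk}$ and $\gamma_{vk}$, and the inclusion of the site dummies in both components are assumptions made here for illustration; they should be checked against model (6).

```latex
\begin{aligned}
\operatorname{logit}(\rho_{it}) &= \beta_{u0} + \beta_{u1} x_{i1} + \beta_{u2} x_{i2}
  + \beta_{u3} y_{i0} + \sum_{k=1}^{10} \gamma_{uk} z_{ik}, \\
\operatorname{logit}(p_{it}) &= \beta_{v0} + \beta_{v1} x_{i1} + \beta_{v2} x_{i2}
  + \beta_{v3} y_{i0} + \sum_{k=1}^{10} \gamma_{vk} z_{ik},
  \qquad t = 1, \ldots, 4.
\end{aligned}
```

Under this reading, $\rho_{it}$ is the probability of a structural zero and $p_{it}$ is the daily drinking probability for an at-risk subject, matching the two panels of Table 6.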
We also compared the treatment effects over the whole study period by modeling DAD at two additional assessment times, week 52 (yi5) and week 68 (yi6), using the same model as in (18).
Among the 1,383 patients, 90 (6.5%) dropped out of the study by the end of the treatment period, and this number increased to 212 (15.3%) by the end of the study. We examined the missing data mechanism by logistic regression under the MAR and MMDP assumptions. Specifically, we fitted a logistic regression at each assessment time except baseline, with the missing indicator as the dependent variable and the treatment conditions, the DAD from the prior assessment time, and baseline demographic information (age and gender) as the covariates. The weights were estimated from these models and then used in the ZIB-like WGEE model.
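Under a monotone missing data pattern, per-visit logistic regressions of this kind are usually turned into weights by multiplying the fitted conditional probabilities of remaining in the study up to each visit. The sketch below assumes that standard construction; the function name `wgee_weights` and the probabilities in the example are made up for illustration and are not estimates from the COMBINE data.

```python
def wgee_weights(cond_prob_observed, observed):
    """Inverse probability weights for one subject under monotone dropout.

    cond_prob_observed[t] : fitted P(observed at visit t | observed at t-1)
    observed[t]           : True if the response at visit t was observed
    Returns r_t / pi_t with pi_t the running product of the conditional
    probabilities; visits after dropout receive weight 0.
    """
    weights = []
    cumulative = 1.0
    still_in = True
    for p, r in zip(cond_prob_observed, observed):
        still_in = still_in and r      # monotone: once out, always out
        cumulative *= p                # pi_t = p_1 * p_2 * ... * p_t
        weights.append(1.0 / cumulative if still_in else 0.0)
    return weights

# Hypothetical fitted probabilities for a completer and an early dropout.
completer = wgee_weights([0.98, 0.97, 0.99], [True, True, True])
dropout = wgee_weights([0.98, 0.97, 0.99], [True, False, False])
print(completer, dropout)
```

Subjects who resemble dropouts but stay in the study receive weights above 1, so they stand in for similar subjects whose responses are missing.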
Treating the CBI only group as the reference, the estimated treatment effects are presented in Table 6. The ten site estimates are omitted because of space limitations. For the treatment period, compared to the CBI group, both the MM and Combined groups had a larger proportion of subjects abstinent from alcohol drinking (the structural zero component), and among those who were not abstinent, subjects in these two groups were less likely to drink. Similar patterns were found for the whole study period, though the MM effect on the structural zero component had a borderline p-value of 0.075. Our estimates confirm the findings in [2], but our analysis is more comprehensive because it assesses the treatment effect both in moving subjects out of the at-risk group and in reducing alcohol use among at-risk subjects, whereas the models in [2] ignore the issue of structural zeros. It can also provide much welcomed information for targeting subjects who most need the intervention. Although the ZIB-ES method is not appropriate here, as illustrated in the simulation studies, we did apply it to the data for comparison purposes. However, the fitting of the ZIB-ES method did not converge.
Table 6.
Parameters | Treatment period only: Estimate | SE | p-value | Whole study period: Estimate | SE | p-value
---|---|---|---|---|---|---
Structural zero part (ρit) | | | | | |
βu0 | −0.784 | 0.273 | 0.004 | −0.701 | 0.259 | 0.007
βu1 (MM vs CBI only) | 0.383 | 0.169 | 0.024 | 0.274 | 0.154 | 0.075
βu2 (Combined vs CBI only) | 0.427 | 0.169 | 0.012 | 0.345 | 0.155 | 0.026
βu3 (baseline DAD) | −0.006 | 0.007 | 0.327 | −0.010 | 0.006 | 0.124
Binomial part (pit) | | | | | |
βv0 | −1.322 | 0.208 | < 0.0001 | −1.264 | 0.183 | < 0.0001
βv1 (MM vs CBI only) | −0.358 | 0.122 | 0.003 | −0.252 | 0.106 | 0.018
βv2 (Combined vs CBI only) | −0.480 | 0.122 | < 0.0001 | −0.347 | 0.107 | 0.001
βv3 (baseline DAD) | 0.059 | 0.006 | < 0.0001 | 0.060 | 0.005 | < 0.0001
Results of WGEE based on the ZIB-like model for the COMBINE Study. p-values are for H0 : β = 0.
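The treatment rows of Table 6 can be cross-checked with a short calculation. The sketch below assumes that the reported p-values come from two-sided Wald tests based on the normal distribution and that both components use logit links, so that exp(β) is an odds ratio; neither assumption is stated in the table itself. Because the estimates and standard errors are rounded to three decimals, agreement with the reported p-values is only approximate.

```python
import math

def wald_p_value(estimate, se):
    """Two-sided normal-based Wald p-value for H0: beta = 0."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2.0))

# (estimate, SE, reported p-value) for the structural zero part of Table 6.
rows = {
    ("treatment period", "MM vs CBI only"): (0.383, 0.169, 0.024),
    ("treatment period", "Combined vs CBI only"): (0.427, 0.169, 0.012),
    ("whole study", "MM vs CBI only"): (0.274, 0.154, 0.075),
    ("whole study", "Combined vs CBI only"): (0.345, 0.155, 0.026),
}

recomputed = {}
for key, (est, se, reported) in rows.items():
    p = wald_p_value(est, se)
    odds_ratio = math.exp(est)  # odds of a structural zero relative to CBI only
    recomputed[key] = (p, odds_ratio)
    print(f"{key[0]:17s} {key[1]:21s} OR={odds_ratio:.2f} "
          f"p={p:.3f} (reported {reported})")
```

Under these assumptions, an odds ratio above 1 in the structural zero part corresponds to higher odds of being an abstainer than in the CBI only group.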
7. Discussion
In medical and psychosocial research, we frequently encounter bounded count responses with a preponderance of zeros. Such variables often arise when subjects/patients experience a number of events/activities such as DAD over a period of time. As they are the sum of finitely many zeros and ones, it is more reasonable to model these outcomes using the zero-inflated binomial, rather than the zero-inflated Poisson. However, since they are sums of dependent Bernoulli variables, which induce overdispersion, they are not exactly (zero-inflated) binomial. By integrating the generalized estimating equation for dependent responses and the inverse probability weighting technique for missing values, we developed a distribution-free approach for modeling such overdispersed ZIB-like outcomes for both cross-sectional and longitudinal studies. Unlike standard log-linear models for count responses in the absence of structural zeros, the proposed ZIB-like model has a more complex bivariate response function, key to identifying the two latent subgroups of a mixed population consisting of the at- and non-risk subgroups. Our approach only models two mean responses, thereby providing more robust inference than parametric alternatives. Further studies are needed to extend the approach to more complex situations such as non-monotone missing data under MAR and non-parametric form of the mean response functions. Addressing these and other limitations will further facilitate building more accurate models for ZIB-like data.
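The overdispersion induced by dependent Bernoulli variables can be made concrete with a small sketch. For m exchangeable Bernoulli(p) variables with common pairwise correlation rho, the variance of their sum is m p (1 − p) [1 + (m − 1) rho], which exceeds the binomial variance whenever rho > 0. The beta-mixing simulation used to check this, and the values m = 30, p = 0.3 and rho = 0.2, are illustrative choices and not the data-generating model of the simulation study.

```python
import random

def exchangeable_sum_variance(m, p, rho):
    """Variance of a sum of m exchangeable Bernoulli(p) with correlation rho."""
    return m * p * (1.0 - p) * (1.0 + (m - 1) * rho)

def simulate_dependent_counts(m, p, rho, n, rng):
    # A subject-specific drinking probability drawn from a Beta distribution
    # makes the m daily indicators positively correlated (correlation rho).
    a = p * (1.0 - rho) / rho
    b = (1.0 - p) * (1.0 - rho) / rho
    counts = []
    for _ in range(n):
        p_i = rng.betavariate(a, b)
        counts.append(sum(rng.random() < p_i for _ in range(m)))
    return counts

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

m, p, rho = 30, 0.3, 0.2
binomial_var = m * p * (1.0 - p)
theoretical_var = exchangeable_sum_variance(m, p, rho)

rng = random.Random(7)
counts = simulate_dependent_counts(m, p, rho, n=5000, rng=rng)
empirical_var = sample_variance(counts)
print(binomial_var, theoretical_var, round(empirical_var, 2))
```

With rho = 0 the formula reduces to the binomial variance, which is why a strictly binomial or ZIB likelihood can understate the variability of outcomes such as DAD.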
Acknowledgements
The authors thank professors Xin Tu and Wan Tang for their constructive comments and suggestions.
Funding
The study was supported in part by National Institute on Drug Abuse grant R33DA027521, National Institute of General Medical Sciences grant R01GM108337, UR CTSI grants 8UL1TR000042-07 and 8UL1TR000042-09.
Appendix
See the Web-based Supplementary Materials.
References
- 1. Chamberlain G. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics. 1987;34:305–334.
- 2. COMBINE Study Research Group. Testing combined pharmacotherapies and behavioral interventions in alcohol dependence: rationale and methods. Alcoholism: Clinical and Experimental Research. 2003;27:1107–1122.
- 3. COMBINE Study Research Group. Combined pharmacotherapies and behavioral interventions for alcohol dependence – the COMBINE study: a randomized controlled trial. JAMA: The Journal of the American Medical Association. 2006;295:2003–2017. doi: 10.1001/jama.295.17.2003.
- 4. Crowder M. On linear and quadratic estimating functions. Biometrika. 1987;74:591–597.
- 5. Dobbie MJ, Welsh AH. Modelling correlated zero-inflated count data. Aust. N. Z. J. Stat. 2001;43:431–444.
- 6. Frees EW, Valdez E. Understanding relationships using copulas. North American Actuarial Journal. 1998:1–25.
- 7. Hall D, Zhang Z. Marginal models for zero inflated clustered data. Statistical Modelling. 2004;4:161–180.
- 8. Hall DB. Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics. 2000;56:1030–1039. doi: 10.1111/j.0006-341x.2000.01030.x.
- 9. Kowalski J, Tu XM. Modern applied U-statistics. Wiley Series in Probability and Statistics. Hoboken, NJ: Wiley-Interscience; 2008.
- 10. Kowalski J, et al. On the rate of convergence of the ECME algorithm for multiple regression models with t-distributed errors. Biometrika. 1997;84:269–281.
- 11. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34:1–14.
- 12. Liang K, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
- 13. Louis T. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological). 1982:226–233.
- 14. Ma Y, et al. Inference for kappas for longitudinal study data: Applications to sexual health research. Biometrics. 2008;64:781–789. doi: 10.1111/j.1541-0420.2007.00934.x.
- 15. Meilijson I. A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society. Series B (Methodological). 1989:127–138.
- 16. Meng X, Rubin D. Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association. 1991:899–909.
- 17. Nelsen RB. An introduction to copulas. Springer; 2006.
- 18. Newey WK. Adaptive estimation of regression models via moment restrictions. Journal of Econometrics. 1988;38:301–339.
- 19. Prentice RL, Zhao LP. Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics. 1991:825–839.
- 20. Ridout M, Hinde J, Demétrio CG. A score test for testing a zero-inflated Poisson regression model against zero-inflated negative binomial alternatives. Biometrics. 2001;57:219–223. doi: 10.1111/j.0006-341x.2001.00219.x.
- 21. Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association. 1995;90:122–129.
- 22. Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
- 23. Tang W, He H, Gunzler D. Kernel smoothing density estimation when group membership is subject to missing. Journal of Statistical Planning and Inference. 2012;142:685–694. doi: 10.1016/j.jspi.2011.09.009.
- 24. Tang W, He H, Tu X. Applied Categorical and Count Data Analysis. Chapman & Hall/CRC; 2012.
- 25. Tsiatis AA. Semiparametric Theory and Missing Data. Springer Series in Statistics. New York: Springer; 2006.
- 26. Vieira A, Hinde JP, Demétrio CG. Zero-inflated proportion data models applied to a biological control assay. Journal of Applied Statistics. 2000;27:373–389.
- 27. Wu C. On the convergence properties of the EM algorithm. The Annals of Statistics. 1983:95–103.
- 28. Yan JR. Package copula on CRAN, multivariate dependence with copula. 2009.
- 29. Yu Q, et al. Distribution-free models for longitudinal count responses with overdispersion and structural zeros. Statistics in Medicine. 2013;32:2390–2405. doi: 10.1002/sim.5691.