Summary
Overdispersion and structural zeros are two major manifestations of departure from the Poisson assumption when modeling count responses using Poisson loglinear regression. As noted in a large body of literature, ignoring such departures could yield bias and lead to wrong conclusions. Different approaches have been developed to tackle these two major problems. In this paper, we review available methods for dealing with overdispersion and structural zeros within a longitudinal data setting and propose a distribution-free modeling approach to address the limitations of these methods by utilizing a new class of functional response models (FRM). We illustrate our approach with both simulated and real study data.
Keywords: functional response models, monotone missing data pattern, negative binomial, zero-inflated Poisson, weighted generalized estimating equations
1 Introduction
Count (or frequency) responses such as number of heart attacks, days of hospitalization, suicide attempts or unprotected vaginal sex arise quite often in biomedical and psychosocial research. The Poisson distribution and, more generally, Poisson-based log-linear regression are widely used for modeling such data. However, heterogeneity in study populations, such as data clustering, often creates extra variability, which renders the Poisson distribution inappropriate for modeling count data in such instances. One approach for addressing this extra-Poisson variation, or overdispersion, is the popular negative binomial (NB) distribution. This modeling strategy, however, is rendered ineffective when the extra variability is caused by an excessive number of zeros above and beyond the number of zeros expected under the Poisson law. For example, when modeling behavioral outcomes such as the number of unprotected vaginal sex occasions over a period of time in HIV prevention research, the study population often contains a subgroup of individuals who are not at risk for such a behavior during the study period, in which case neither the Poisson nor the NB is able to accommodate such structural zeros. One popular approach for addressing such inflated zero counts is the zero-inflated Poisson (ZIP) model, which has been applied to a diverse range of studies(1-16). The inherent methodological problems with structural zeros have received a great deal of attention in the literature(4; 9; 10; 19; 17; 18).
When modeling count responses in the presence of overdispersion and structural zeros within a longitudinal data setting, one of the current strategies is to employ random effects within the context of the generalized linear mixed-effects model (GLMM) to account for correlated responses from repeated assessments over time(19). However, as it relies on parametric assumptions about both the random effects and the response for inference, such an approach lacks robustness when real study data depart from the assumed distributional models. Further, the random effects induce overdispersion into the marginal model at each assessment, giving rise to results and findings that can differ considerably from those of the marginal models(20; 21; 22). In addition, such an approach computes estimates using the expectation-maximization (EM) algorithm, which can be problematic since EM is notorious for its slow convergence and may converge to local rather than global maxima, making it difficult to apply such methods in routine analyses.
A popular alternative is to use the generalized estimating equations (GEE) to address correlated longitudinal responses. The GEE approach is widely used for modeling the mean response, or first-order moment. Unlike GLMM, model parameters have the same interpretations between the marginal and joint models across assessment times. In addition, as GEE models only the marginal mean of the response variable at each assessment time, it requires neither layer of parametric assumptions (on the random effects or on the response) and thereby provides consistent estimates regardless of the complexity of the correlation structure and the distribution of the response. GEE estimates are also much easier to compute than those based on the GLMM approach.
As the key difference between the standard (Poisson) log-linear model and other models for count responses such as ZIP lies in the variance, or the second-order moment, GEE does not apply directly to extending such models to a longitudinal data setting(23; 24; 25). Also, since ZIP is a mixture of two distributions, we will not be able to identify the model parameters by simply modeling the mean response(24; 26). One approach is to model the zero and positive outcomes separately using a truncated Poisson for the positive response and a logistic regression for the zero outcome(27). However, as the structural and sampling zeros are mixed into a single category, this approach is unable to identify the parameters for modeling the structural zeros, which is often of great interest in practice. For instance, with days of hospitalization as the outcome, this approach will only model those who are hospitalized, since at-risk subjects who happen not to be hospitalized are mixed with those who are healthy and not at risk for hospitalization. In many studies, it is of great importance to model the at- and non-risk subgroups separately. An alternative to address the identifiability issue is to include a modeling component for the variance and apply GEE to both the specified mean and variance(24; 25; 28; 29). However, none of these methods adequately addresses missing data, yielding biased inference if the missing data do not follow the missing completely at random (MCAR) assumption(30; 31).
In this paper, we propose an approach to overcome the aforementioned difficulties by utilizing a class of functional response models (FRM) and the popular weighted generalized estimating equations (WGEE). In Section 2, we first give a brief overview of the problems with overdispersed and zero-inflated count data and popular models for addressing them. We then introduce FRM and discuss its application to the current setting. In Section 3, we discuss inference for the FRM-based models under both complete and missing data. In Section 4, we illustrate the proposed models with real study data and investigate their performance using simulated data. In Section 5, we give our concluding remarks.
2 Functional Response Models for Count Response
We start with a brief review of existing approaches for addressing overdispersion and structural zeros.
2.1 Models for Overdispersion and Structural Zeros
Consider first a cross-sectional study with n subjects, and let yi denote a count response and xi a vector of explanatory variables. The popular Poisson log-linear regression, a member of the generalized linear model (GLM) family, models the conditional mean of yi given xi, μi = E(yi | xi), by applying the log function to link μi to the linear predictor xi⊤β:
$$y_i \mid \mathbf{x}_i \sim \text{i.d. } P(\mu_i), \quad \log(\mu_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta}, \quad 1 \le i \le n, \tag{1}$$
where i.d. denotes independently distributed and P(μ) the Poisson distribution with mean μ. Under (1), the conditional mean E(yi | xi) and variance Var(yi | xi) of yi given xi satisfy:
$$E(y_i \mid \mathbf{x}_i) = \mathrm{Var}(y_i \mid \mathbf{x}_i) = \mu_i. \tag{2}$$
As mentioned, the conditional variance Var(yi | xi) often exceeds the conditional mean μi in real study applications, making (1) inappropriate for modeling such count data. When overdispersion occurs, the standard errors of the parameter estimates from the Poisson model are artificially deflated, giving rise to inflated test statistics and false significant findings.
Overdispersion can often be empirically detected by goodness of fit statistics or even formally tested(25; 32; 33). When deemed present, overdispersion may be corrected post hoc by using robust variance estimates(25). An alternative is to use models that explicitly address this issue. For example, the popular negative binomial (NB) model allows the variance to exceed the mean:
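$$y_i \mid \mathbf{x}_i \sim \text{i.d. } NB(\mu_i, \alpha), \quad \log(\mu_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta}, \quad E(y_i \mid \mathbf{x}_i) = \mu_i, \quad \mathrm{Var}(y_i \mid \mu_i, \alpha) = \mu_i(1 + \alpha\mu_i). \tag{3}$$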
Unlike the Poisson, the NB has an extra parameter α to indicate the degree of overdispersion. As α → 0, Var(yi ∣ μi, α) → μi. Thus, unless α = 0, the variance of the NB is always larger than the mean, addressing overdispersion. Under the NB, we can check for overdispersion by testing the null H0 : α = 0. Note, however, that α = 0 is a boundary point of the parameter space α ≥ 0, so the maximum likelihood estimate (MLE) α̂ of α cannot be used directly for testing H0, and alternative score statistics must be used(33; 34; 35).
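The variance inflation described above can be checked by simulation. The following is a minimal sketch, assuming the common parameterization in which the NB arises as a gamma-mixed Poisson with mean μ and variance μ + αμ²; the values μ = 3 and α = 0.5, the seed, and the helper names are illustrative choices, not taken from the paper.

```python
import math
import random
import statistics

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def nb_draw(rng, mu, alpha):
    """Draw an NB count as a gamma-mixed Poisson: mean mu, variance mu + alpha * mu**2."""
    # Assumed parameterization: gamma shape 1/alpha and scale alpha*mu (mean mu).
    lam = rng.gammavariate(1.0 / alpha, alpha * mu)
    return poisson_draw(rng, lam)

rng = random.Random(2024)        # illustrative seed
mu, alpha, n = 3.0, 0.5, 20000   # illustrative values

pois = [poisson_draw(rng, mu) for _ in range(n)]
nb = [nb_draw(rng, mu, alpha) for _ in range(n)]

# Variance-to-mean ratio: about 1 for the Poisson, about 1 + alpha*mu for the NB.
pois_ratio = statistics.variance(pois) / statistics.mean(pois)
nb_ratio = statistics.variance(nb) / statistics.mean(nb)
print(f"Poisson variance/mean: {pois_ratio:.2f}")
print(f"NB variance/mean:      {nb_ratio:.2f} (theory: {1 + alpha * mu:.2f})")
```

A ratio well above 1 in real data is the informal signal of overdispersion that the formal tests cited above are designed to assess.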
Count responses in many biomedical and psychosocial studies are dominated by a preponderance of zeros that exceeds the expected frequency of the Poisson. Such excess or structural zeros not only cause overdispersion, but also affect the conditional mean, leading to biased estimates of model parameters. The zero-inflated Poisson (ZIP) model is a popular approach to address the twin effects of structural zeros on both the mean and variance.
Let ui and vi be two subsets of xi, which may overlap or even be identical, and thus need not form a partition of xi. The ZIP regression model is defined by:
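$$y_i \mid \mathbf{x}_i \sim \text{i.d. } ZIP(\mu_i, \rho_i), \quad \mathrm{logit}(\rho_i) = \mathbf{u}_i^{\top}\boldsymbol{\beta}_u, \quad \log(\mu_i) = \mathbf{v}_i^{\top}\boldsymbol{\beta}_v, \quad 1 \le i \le n, \tag{4}$$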
where ZIP(μ, ρ) denotes the ZIP distribution defined by:
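$$f_{ZIP}(y \mid \rho, \mu) = \rho f_0(y) + (1 - \rho) f_P(y \mid \mu), \quad y = 0, 1, \ldots, \tag{5}$$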
with f0(y) denoting a degenerate distribution centered at 0 and fP(y ∣ μ) the Poisson probability function with mean μ. In (5), the Poisson probability at 0, fP(0 ∣ μ), is replaced by ρf0(0) + (1 − ρ)fP(0 ∣ μ), with ρf0(0) = ρ, to account for structural zeros.
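To make the mixture concrete, the sketch below evaluates the ZIP probability function and checks numerically that its mean is (1 − ρ)μ and its variance is (1 − ρ)μ(1 + ρμ), both of which follow from the mixture definition. The values μ = 2 and ρ = 0.3 and the function names are illustrative assumptions.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu."""
    return math.exp(-mu) * mu**y / math.factorial(y)

def zip_pmf(y, mu, rho):
    """ZIP probability: rho * f0(y) + (1 - rho) * fP(y | mu), with f0 degenerate at 0."""
    point_mass = 1.0 if y == 0 else 0.0
    return rho * point_mass + (1.0 - rho) * poisson_pmf(y, mu)

mu, rho = 2.0, 0.3       # illustrative values
support = range(60)      # truncation point; the tail mass beyond it is negligible here

total = sum(zip_pmf(y, mu, rho) for y in support)
mean = sum(y * zip_pmf(y, mu, rho) for y in support)
second_moment = sum(y * y * zip_pmf(y, mu, rho) for y in support)
variance = second_moment - mean**2

print(f"P(y = 0): {zip_pmf(0, mu, rho):.4f} vs Poisson {poisson_pmf(0, mu):.4f}")
print(f"mean:     {mean:.4f} (theory {(1 - rho) * mu:.4f})")
print(f"variance: {variance:.4f} (theory {(1 - rho) * mu * (1 + rho * mu):.4f})")
```

Since the variance exceeds the mean whenever ρ > 0, structural zeros alone are enough to produce overdispersion relative to the Poisson.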
Consider these models within a longitudinal setting with m assessments, with yit, xit, uit and vit denoting the respective variables at time t (1 ≤ t ≤ m). We may model yit as a function of xit (or uit and vit for ZIP) by using either a parametric or distribution-free modeling approach. As mentioned, the former suffers from interpretational and computational problems. A popular distribution-free alternative with inference based on the generalized estimating equations (GEE) is to specify the conditional mean of yit given xit, which for count response has the following form,
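$$E(y_{it} \mid \mathbf{x}_{it}) = \mu_{it}, \quad \log(\mu_{it}) = \mathbf{x}_{it}^{\top}\boldsymbol{\beta}, \quad 1 \le t \le m, \quad 1 \le i \le n. \tag{6}$$
This mean-based specification, however, is not sufficient to distinguish the Poisson from the NB, as the two models only differ in the conditional variance Var(yit ∣ xit). The classic model specification also does not work for ZIP, since the conditional mean of yit given xit in this case is
$$E(y_{it} \mid \mathbf{x}_{it}) = (1 - \rho_{it})\mu_{it}, \quad \mathrm{logit}(\rho_{it}) = \mathbf{u}_{it}^{\top}\boldsymbol{\beta}_u, \quad \log(\mu_{it}) = \mathbf{v}_{it}^{\top}\boldsymbol{\beta}_v, \tag{7}$$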
which in general does not provide sufficient information to identify βu and βv.
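A small numerical illustration, with parameter values chosen here purely for illustration, shows why the mean alone cannot separate ρ from μ while the second moment can. It uses E(y) = (1 − ρ)μ and E(y²) = (1 − ρ)(μ + μ²), both of which follow from the ZIP mixture:

```latex
\begin{align*}
(\rho, \mu) = (0, 2):   &\quad E(y) = 2,                 & E(y^2) &= 2 + 2^2 = 6, \\
(\rho, \mu) = (0.5, 4): &\quad E(y) = 0.5 \times 4 = 2,  & E(y^2) &= 0.5\,(4 + 4^2) = 10.
\end{align*}
```

The two settings share the same mean but differ in the second moment, which is the extra information a moment-based model needs in order to identify both components.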
To help distinguish among the three models, one can augment the GEE by including the distinct form of the conditional variance Var(yit ∣ xit) for each model and use the resulting GEE II for inference(23; 24; 28; 29). However, this approach is ad hoc in the sense that GEE II is a method of inference primarily used for improving efficiency over GEE, rather than a formal model akin to (6), since the added response (or dependent variable) Var(yit ∣ xit) is a function of parameters(25). In addition, it does not effectively address missing data. Another approach is to model the zero and positive outcomes separately using a truncated Poisson for the positive response and a logistic regression for the zero outcome(27). However, this approach is unable to identify the parameters for modeling the structural zeros, which are often of greater interest in practice. Next we utilize a new class of regression models to address the limitations of the aforementioned approaches.
2.2 Functional Response Models
Consider a class of distribution-free regression models defined by:
$$E\left[\mathbf{f}(\mathbf{y}_{i_1}, \ldots, \mathbf{y}_{i_q}) \mid \mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_q}\right] = \mathbf{h}(\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_q}; \boldsymbol{\theta}), \quad (i_1, \ldots, i_q) \in C_q^n, \tag{8}$$
where yi = (yi1, …, yim)⊤ denotes the vector of responses from the ith subject, f some vector-valued function, h(θ) some vector-valued smooth function (e.g. with continuous derivatives up to the second order), θ a vector of parameters of interest, q some positive integer, and C_q^n the set of combinations of q distinct elements (i1, …, iq) from the integer set {1, …, n}. The functional response models (FRM) (8) extend the single-subject response in the classic GLM to a function of responses from multiple subjects. For example, by setting q = 1, we immediately obtain from (8) the class of distribution-free GLMs for longitudinal data with m assessments. With FRM, we can express a broader class of problems under a regression-like framework(25; 36; 37; 38; 39; 40). Below, we focus on the application of FRM within our setting for modeling count responses.
Consider first the simpler cross-sectional study setting. For the cross-sectional parametric ZIP in (4), let
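$$\mathbf{f}_i = (f_{1i}, f_{2i})^{\top} = (y_i, y_i^2)^{\top}, \quad \mathbf{h}_i = (h_{1i}, h_{2i})^{\top}, \quad h_{1i} = (1 - \rho_i)\mu_i, \quad h_{2i} = (1 - \rho_i)(\mu_i + \mu_i^2), \quad \mathrm{logit}(\rho_i) = \mathbf{u}_i^{\top}\boldsymbol{\beta}_u, \quad \log(\mu_i) = \mathbf{v}_i^{\top}\boldsymbol{\beta}_v, \tag{9}$$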
where ui(vi) denotes a subset of xi. Under (4), E(fi ∣ ui, vi) = hi(ui, vi). For NB, f (yi) is defined the same as for ZIP in (9), but with hi = (h1i, h2i)⊤ modified as follows:
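$$h_{1i} = \mu_i, \quad h_{2i} = \mu_i + (1 + \alpha)\mu_i^2, \quad \log(\mu_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta}. \tag{10}$$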
As a special case with α = 0, the FRM for NB reduces to a distribution-free Poisson with hi = (μi, μi + μi²)⊤. Note that under the FRM-based NB, we can allow α to be negative and thus α = 0 is no longer a boundary point. Thus, we can readily use the estimate of α to test the null H0 : α = 0 to determine whether the Poisson log-linear model is appropriate.
For longitudinal data, suppose that each subject is assessed m times, with yit and xit denoting the respective variables at time t (1 ≤ t ≤ m). Define the FRM-based ZIP model as follows:
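$$E(\mathbf{f}_{it} \mid \mathbf{u}_{it}, \mathbf{v}_{it}) = \mathbf{h}_{it}, \quad \mathbf{f}_{it} = (y_{it}, y_{it}^2)^{\top}, \quad \mathbf{h}_{it} = \left((1 - \rho_{it})\mu_{it},\; (1 - \rho_{it})(\mu_{it} + \mu_{it}^2)\right)^{\top}, \quad \mathrm{logit}(\rho_{it}) = \mathbf{u}_{it}^{\top}\boldsymbol{\beta}_u, \quad \log(\mu_{it}) = \mathbf{v}_{it}^{\top}\boldsymbol{\beta}_v, \quad 1 \le t \le m. \tag{11}$$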
Likewise, we obtain a longitudinal version of FRM-based NB by defining the same fit, but modifying hit as follows:
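$$\mathbf{h}_{it} = \left(\mu_{it},\; \mu_{it} + (1 + \alpha)\mu_{it}^2\right)^{\top}, \quad \log(\mu_{it}) = \mathbf{x}_{it}^{\top}\boldsymbol{\beta}, \quad 1 \le t \le m. \tag{12}$$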
Note that we have assumed a constant α for NB, though the model above readily accommodates a time-varying α. In many studies, clusters causing overdispersion such as those formed by the subjects sampled from a common habitat may not change over time during the study, and this assumption is reasonable.
Both the ZIP and NB models for longitudinal data in (11) and (12) yield the same first- and second-order moments as their respective cross-sectional versions in (9)-(10) at each time t (1 ≤ t ≤ m). Thus, unlike their GLMM-based parametric counterparts, estimates from the FRM-based ZIP and NB models for longitudinal data can be readily compared to their corresponding cross-sectional versions. These distribution-free models are also called semiparametric or moment-based in the literature(41; 42). We refer to these as distribution-free models throughout the text unless otherwise stated.
3 Distribution-free Inference
We first discuss inference for cross-sectional data, and then extend the considerations to the longitudinal setting.
3.1 Distribution-free Inference for Cross-sectional Data
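For the FRM-based ZIP model in (9), let θ = (βu⊤, βv⊤)⊤ and
$$D_i = \frac{\partial \mathbf{h}_i^{\top}}{\partial \boldsymbol{\theta}}, \quad V_i = \mathrm{Var}(\mathbf{f}_i \mid \mathbf{x}_i), \quad S_i = \mathbf{f}_i - \mathbf{h}_i. \tag{13}$$
We estimate θ by the following set of generalized estimating equations,
$$\mathbf{w}_n(\boldsymbol{\theta}) = \sum_{i=1}^{n} D_i V_i^{-1} S_i = \mathbf{0}. \tag{14}$$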
Given the ZIP model in (4), the elements of Vi in (13) are functions of the conditional moments of yi given xi up to the 4th order, which can be expressed in closed form (see Appendix A). Thus, the quantities Di, Vi and Si in (13) are readily evaluated. Note that (14) bears a close resemblance to the generalized estimating equations II (GEE II) for generalized linear models(25; 28; 29; 43).
By defining Di, Vi and Si the same way as in (13), but with θ = (β⊤, α)⊤ and hi defined in (10), the GEE in (14) can be used to obtain estimates of θ for NB as well.
Under (9), the GEE estimate θˆ of θ obtained as the solution to (14) is consistent and asymptotically normal (see Theorem 1 below):
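$$\sqrt{n}\,(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \to_d N(\mathbf{0}, \Sigma_\theta), \quad \Sigma_\theta = B^{-1} E\left(D_i V_i^{-1} S_i S_i^{\top} V_i^{-1} D_i^{\top}\right) B^{-\top}, \quad B = E\left(D_i V_i^{-1} D_i^{\top}\right), \tag{15}$$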
where →d denotes convergence in distribution(25). Unlike the MLE, the asymptotic results above do not require that yi (given ui and vi) follow the ZIP distribution in (4). If yi does follow such a parametric model, Σθ in (15) simplifies to Σθ = B−1, which is the model-based asymptotic variance.
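A consistent estimate of Σθ is obtained by substituting moment estimates in place of the respective parameters:
$$\hat{\Sigma}_\theta = \hat{B}^{-1} \left( \frac{1}{n} \sum_{i=1}^{n} \hat{D}_i \hat{V}_i^{-1} \hat{S}_i \hat{S}_i^{\top} \hat{V}_i^{-1} \hat{D}_i^{\top} \right) \hat{B}^{-\top}, \quad \hat{B} = \frac{1}{n} \sum_{i=1}^{n} \hat{D}_i \hat{V}_i^{-1} \hat{D}_i^{\top},$$
where B̂, D̂i, Ŝi and V̂i denote the corresponding quantities with θ replaced by θ̂. Our simulations indicate that the model-based asymptotic variance estimate B̂⁻¹ outperforms its sandwich alternative by yielding slightly more accurate type I error rates under the correct parametric model(44).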
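In the special case without covariates, every subject shares the same 2 × 2 matrices Di and Vi, and Di is invertible, so estimating equations of the form Σ Di Vi⁻¹(fi − hi) = 0 with fi = (yi, yi²)⊤ reduce to matching h(θ) to the first two sample moments. The sketch below illustrates only this special case, assuming the ZIP moments E(y) = (1 − ρ)μ and E(y²) = (1 − ρ)(μ + μ²); with covariates an iterative solver would be needed instead. All names, the seed, and the true values are illustrative.

```python
import math
import random

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def zip_draw(rng, mu, rho):
    """Structural zero with probability rho, otherwise a Poisson(mu) count."""
    return 0 if rng.random() < rho else poisson_draw(rng, mu)

def zip_moment_estimates(ys):
    """Solve h(theta) = sample mean of f_i = (y_i, y_i**2) for (rho, mu)."""
    n = len(ys)
    m1 = sum(ys) / n
    m2 = sum(y * y for y in ys) / n
    mu_hat = m2 / m1 - 1.0        # because h2 / h1 = 1 + mu
    rho_hat = 1.0 - m1 / mu_hat   # because h1 = (1 - rho) * mu
    return rho_hat, mu_hat

rng = random.Random(7)            # illustrative seed
true_rho, true_mu = 0.3, 3.0      # illustrative true values
sample = [zip_draw(rng, true_mu, true_rho) for _ in range(20000)]

rho_hat, mu_hat = zip_moment_estimates(sample)
beta_u0_hat = math.log(rho_hat / (1.0 - rho_hat))  # intercept on the logit scale
beta_0_hat = math.log(mu_hat)                      # intercept on the log scale
print(f"rho_hat = {rho_hat:.3f}, mu_hat = {mu_hat:.3f}")
print(f"beta_u0_hat = {beta_u0_hat:.3f}, beta_0_hat = {beta_0_hat:.3f}")
```

Note that the estimates rely only on the first two moments being correctly specified, which is the sense in which the approach is distribution-free.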
3.2 Distribution-free Inference for Longitudinal Data
We begin with inference under complete data and then extend the discussion to include missing data.
3.2.1 Inference under Complete Data
Let
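$$\mathbf{f}_i = (\mathbf{f}_{i1}^{\top}, \ldots, \mathbf{f}_{im}^{\top})^{\top}, \quad \mathbf{h}_i = (\mathbf{h}_{i1}^{\top}, \ldots, \mathbf{h}_{im}^{\top})^{\top}, \quad S_i = \mathbf{f}_i - \mathbf{h}_i, \quad D_i = \frac{\partial \mathbf{h}_i^{\top}}{\partial \boldsymbol{\theta}}, \tag{16}$$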
where fit and hit are defined by (11) for the ZIP or by (12) for the NB model. We again apply the GEE in (14), but with Di and Si revised to reflect the changed dimension, and Vi modified to reflect the correlation between the fit's over time:
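$$V_i = A_i^{1/2} R(\boldsymbol{\tau}) A_i^{1/2}, \quad A_i = \mathrm{diag}_t(A_{it}), \quad A_{it} = \mathrm{Var}(\mathbf{f}_{it} \mid \mathbf{x}_{it}), \tag{17}$$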
where R(τ) is a working correlation matrix among the components of fi parameterized by τ. As in the cross-sectional data case, Ait is readily computed. For R(τ), the popular choices are the working independence model (R(τ) = I2m) and the exchangeable correlation structure, in which a single unknown correlation ρ is shared across assessment times. Thus, τ is known for the working independence model, but unknown for the exchangeable correlation model with τ = ρ.
Note that since the GEE estimate may not be consistent under working correlation structures other than the independence model, especially in the presence of time-varying covariates(45), we focus on this model in what follows unless otherwise stated. With this choice of R(τ), the GEE is readily solved for θ. However, when the working correlation model used involves an unknown τ, an estimate must be substituted before the GEE is solved to obtain estimates of θ.
As in the cross-sectional data case, the GEE estimate has nice asymptotic properties summarized in Theorem 1 below. Since this is a special case of Theorem 2, its proof is omitted. Since Theorem 1 is stated for general working correlation models, it includes the condition for the estimate of τ to ensure such nice properties.
Theorem 1
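Let θ̂ denote the GEE estimate and let
$$B = E\left(D_i V_i^{-1} D_i^{\top}\right), \quad \Sigma_\theta = B^{-1} E\left(D_i V_i^{-1} S_i S_i^{\top} V_i^{-1} D_i^{\top}\right) B^{-\top}. \tag{18}$$
Under mild regularity conditions, θ̂ is consistent. Further, if τ̂ is √n-consistent, i.e., √n(τ̂ − τ) is bounded in probability(25), then θ̂ is asymptotically normal with the asymptotic variance Σθ. A consistent estimate of Σθ is given by:
$$\hat{\Sigma}_\theta = \hat{B}^{-1} \left( \frac{1}{n} \sum_{i=1}^{n} \hat{D}_i \hat{V}_i^{-1} \hat{S}_i \hat{S}_i^{\top} \hat{V}_i^{-1} \hat{D}_i^{\top} \right) \hat{B}^{-\top}, \quad \hat{B} = \frac{1}{n} \sum_{i=1}^{n} \hat{D}_i \hat{V}_i^{-1} \hat{D}_i^{\top}, \tag{19}$$
where B̂, D̂i, Ŝi and V̂i denote the corresponding quantities with θ replaced by θ̂.
Note that given the limited choices for the working correlation matrix R(τ), Vi = Var(fi ∣ xi) generally is not true in practice. Thus, unlike the cross-sectional data case, there is no model-based asymptotic variance.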
3.2.2 Inference under Missing Data
Missing data arise frequently in real studies. For mean-based distribution-free models such as the GLM, the weighted generalized estimating equations (WGEE) approach is the most popular for inference about model parameters. By integrating the inverse probability weighting (IPW) technique with the GEE, the WGEE ensures valid inference when the missing data follow the missing at random (MAR) model, a plausible and general missing data mechanism applicable to many studies in practice(25; 31; 41; 46; 47). We discuss below how to extend this IPW approach to the current FRM-based models for count responses.
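Within the context of longitudinal data discussed in the preceding section, we define a missing (or rather observed) data indicator for each subject as follows:
$$r_{it} = \begin{cases} 1 & \text{if } y_{it} \text{ is observed}, \\ 0 & \text{otherwise}, \end{cases} \qquad \mathbf{r}_i = (r_{i1}, \ldots, r_{im})^{\top}.$$
We assume no missing data at baseline t = 1 such that ri1 = 1 for all 1 ≤ i ≤ n. Let
$$\pi_{it} = \Pr(r_{it} = 1 \mid \mathbf{x}_i, \mathbf{y}_i), \quad \Delta_{it} = \frac{r_{it}}{\pi_{it}} I_2, \quad \Delta_i = \mathrm{diag}_t(\Delta_{it}). \tag{20}$$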
In most applications, the weight function πit is unknown and must be estimated. Under MCAR, ri is independent of xi and yi and thus πit = Pr(rit = 1) = πt. In this case, πt is a constant independent of xi and yi and is readily estimated by the sample moment: π̂t = (1/n) Σi rit, the observed proportion of subjects with data at time t.
Under MAR, πit becomes dependent on the observed xi and yi, making it difficult to model and estimate πit without imposing the monotone missing data pattern (MMDP) assumption because of the large number of missing data patterns(25; 37; 41). Under MMDP, yit (xit) is observed only if all yis (xis) prior to time t are observed (1 ≤ s ≤ t ≤ m).
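Let
$$H_{it} = \{\tilde{\mathbf{x}}_{it}, \tilde{\mathbf{y}}_{it}\}, \quad \tilde{\mathbf{x}}_{it} = (\mathbf{x}_{i1}^{\top}, \ldots, \mathbf{x}_{i(t-1)}^{\top})^{\top}, \quad \tilde{\mathbf{y}}_{it} = (y_{i1}, \ldots, y_{i(t-1)})^{\top}, \quad 2 \le t \le m,$$
where x̃it and ỹit contain the explanatory and response variables prior to time t, respectively. Under MAR we have:
$$\pi_{it} = \Pr(r_{it} = 1 \mid \mathbf{x}_i, \mathbf{y}_i) = \Pr(r_{it} = 1 \mid H_{it}).$$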
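Let pit = Pr(rit = 1 ∣ ri(t−1) = 1, Hit), the one-step transition probability for observing the response from time t − 1 to t. We can model pit using logistic regression:
$$\mathrm{logit}\left(p_{it}(\boldsymbol{\gamma}_t)\right) = \gamma_{0t} + \boldsymbol{\gamma}_{xt}^{\top}\tilde{\mathbf{x}}_{it} + \boldsymbol{\gamma}_{yt}^{\top}\tilde{\mathbf{y}}_{it}, \quad 2 \le t \le m, \tag{21}$$
where γt = (γ0t, γxt⊤, γyt⊤)⊤. Let γ = (γ2⊤, …, γm⊤)⊤. Then, under MMDP,
$$\pi_{it}(\boldsymbol{\gamma}) = \prod_{s=2}^{t} p_{is}(\boldsymbol{\gamma}_s), \quad 2 \le t \le m.$$
The above provides a relationship to estimate πit from the model for pit in (21).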
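The product relationship translates directly into code. The sketch below is a minimal illustration, assuming the one-step transition probabilities have already been fitted for a single subject; the function names and the numerical values are hypothetical.

```python
from itertools import accumulate
import operator

def observation_probabilities(transition_probs):
    """Return [pi_i1, ..., pi_im] with pi_i1 = 1 and pi_it = p_i2 * ... * p_it (MMDP)."""
    return list(accumulate([1.0] + list(transition_probs), operator.mul))

def ipw_weights(observed, transition_probs):
    """Inverse probability weights r_it / pi_it; unobserved visits receive weight 0."""
    pis = observation_probabilities(transition_probs)
    return [r / pi for r, pi in zip(observed, pis)]

# Hypothetical subject with m = 3 visits: fitted p_i2 = 0.9 and p_i3 = 0.8,
# observed at visits 1 and 2 and dropped out before visit 3.
fitted_transitions = [0.9, 0.8]
observed_flags = [1, 1, 0]

pis = observation_probabilities(fitted_transitions)
weights = ipw_weights(observed_flags, fitted_transitions)
print("pi_it:  ", [round(p, 4) for p in pis])
print("weights:", [round(w, 4) for w in weights])
```

Subjects who remain in the study despite a low probability of doing so receive larger weights, which is how the weighting compensates for similar subjects who dropped out.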
We may estimate γ using the following estimating equations:
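$$\sum_{i=1}^{n} \sum_{t=2}^{m} r_{i(t-1)} \left(r_{it} - p_{it}\right) \frac{\partial\, \mathrm{logit}(p_{it})}{\partial \boldsymbol{\gamma}} = \mathbf{0}. \tag{22}$$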
With estimated πit, we can estimate θ by generalizing the WGEE for mean-based response models to a WGEE for the current context as follows:
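$$\mathbf{w}_n(\boldsymbol{\theta}) = \sum_{i=1}^{n} D_i V_i^{-1} \hat{\Delta}_i S_i = \mathbf{0}, \tag{23}$$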
where Di, Vi and Si are defined the same as in the GEE in the complete data case, and Δ̂i denotes Δi in (20) with estimated πit. Also, as in the complete data case, Vi may be a function of τ if working dependence correlation models are used, in which case τ must be replaced with an estimate before (23) is used for inference about θ.
The WGEE estimate θˆ has nice asymptotic properties, as summarized by the theorem below (see Appendix B for a proof).
Theorem 2
Let θˆ denote the WGEE II estimate. Under mild regularity conditions,
θˆ is consistent.
If τ̂ is √n-consistent, θ̂ is asymptotically normal with asymptotic variance given by:
(24) |
A consistent estimate of Σθ is given by:
Note that the asymptotic variance in (24) contains a correction term B−1ΦB−⊤ to account for the sampling variability in the estimated γˆ.
3.2.3 Score Test
As Wald-type tests are typically anti-conservative(21; 48; 49), score statistics are often used as an alternative to reduce bias, especially in type I error rates for small to moderate samples. Within the current context, let θ = (θ(1)⊤, θ(2)⊤)⊤, with p and q denoting the dimension of θ(1) and θ(2), respectively. Consider testing the null H0 : θ(2) = θ(20), with θ(20) denoting a vector of known constants.
Under H0 : θ(2) = θ(20),
(26) |
Let θ̃(1) denote the estimate from solving the reduced WGEE:
(27) |
Set
(28) |
where q is the dimension of wn(2), B11 denotes the p × p submatrix, B12 the p × q submatrix, and B22 the q × q submatrix from the partitioned (p + q) × (p + q) matrix B. Then, under H0 : θ(2) = θ(20), the following score statistic has an asymptotic (central) χ² distribution with q degrees of freedom (see Appendix C for a proof):
(29) |
where Σ̃(2) = G̃Σ̃θG̃⊤ with G̃ and Σ̃θ denoting the corresponding quantities with θ replaced by θ̃.
4 Applications
We first investigate the performance of the approach with small to moderate sample sizes by simulation and then present a real data application. In all the examples, we set the statistical significance level at α = 0.05.
4.1 Simulation Study
For space considerations, we only report results from the ZIP model for longitudinal data with sample size n = 50, 100 and 200. All simulations were performed with a Monte Carlo sample of 1,000. We start with data simulations under complete data.
4.1.1 Complete Data Case
For notational brevity, we considered a relatively simple pre-post longitudinal study design, with only one explanatory variable xi following a normal distribution N(1,1), and simulated the bivariate count response, yi = (yi1, yi2)⊤, to satisfy the following marginal ZIP model:
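$$y_{it} \mid x_i \sim ZIP(\mu_i, \rho), \quad \mathrm{logit}(\rho) = \beta_{u0}, \quad \log(\mu_i) = \beta_0 + \beta_1 x_i, \quad t = 1, 2. \tag{30}$$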
We set βu0 = −1, β0 = β1 = 1. We first simulated xi from N(1, 1), and then conditional on xi, generated yit by using a copula approach(50; 51; 52). The copula method can generate correlated multivariate responses for any specified marginal distribution and correlation structure. For our simulation study, we set Corr(yi1, yi2 ∣ xi) = 0.5.
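One way to realize the copula step is with a Gaussian copula: draw correlated standard normals, map them to uniforms, and invert the ZIP distribution function. The sketch below is an assumption-laden illustration rather than the simulation code used in the study; it fixes μ instead of letting it depend on xi, and the latent correlation of 0.6 is a parameter of the copula, not the resulting correlation between the counts, which is generally smaller.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def zip_quantile(u, mu, rho, max_count=500):
    """Smallest y with ZIP cdf(y) >= u; Poisson masses are built up recursively."""
    pois = math.exp(-mu)                  # Poisson mass at 0
    cdf = rho + (1.0 - rho) * pois        # ZIP mass at 0
    y = 0
    while u > cdf and y < max_count:      # max_count guards against u numerically equal to 1
        y += 1
        pois *= mu / y                    # Poisson mass at y from the mass at y - 1
        cdf += (1.0 - rho) * pois
    return y

def correlated_zip_pair(rng, mu, rho, latent_corr):
    """Draw (y1, y2) with ZIP(mu, rho) margins linked by a Gaussian copula."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = latent_corr * z1 + math.sqrt(1.0 - latent_corr**2) * rng.gauss(0.0, 1.0)
    u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
    return zip_quantile(u1, mu, rho), zip_quantile(u2, mu, rho)

def pearson(xs, ys):
    """Sample Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(5)                    # illustrative seed
mu, rho, latent_corr = 3.0, 0.3, 0.6      # illustrative values
pairs = [correlated_zip_pair(rng, mu, rho, latent_corr) for _ in range(10000)]
y1 = [a for a, _ in pairs]
y2 = [b for _, b in pairs]

mean_y1 = sum(y1) / len(y1)
mean_y2 = sum(y2) / len(y2)
count_corr = pearson(y1, y2)
print(f"marginal means: {mean_y1:.3f}, {mean_y2:.3f} (theory {(1 - rho) * mu:.3f})")
print(f"correlation between counts: {count_corr:.3f}")
```

To hit a target correlation between the counts, such as the 0.5 used in the simulation study, the latent correlation would have to be calibrated, for example by a simple search over its value.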
To examine type I error rates, we considered the null, H0 : β1 = 1, and computed the Wald statistic, Qn = n(β̂1 − 1)²/σ̂²β1, where σ̂²β1 denotes the element of the estimated asymptotic variance Σ̂θ corresponding to β̂1. Let Qn(k) denote this statistic at the kth MC simulation (1 ≤ k ≤ 1000). The type I error rate for testing H0 was estimated by the proportion of rejections, α̂ = (1/1000) Σk I(Qn(k) ≥ q1,0.95), with q1,0.95 denoting the 95th percentile of a central χ² distribution with one degree of freedom.
Since Wald statistics are often anti-conservative, we also applied the score test in Section 3.2.3. Let θ = (θ(1)⊤, θ(2))⊤, where θ(1) = (βu0, β0)⊤ and θ(2) = β1. Under H0 : θ(2) = 1, the score statistic Ts(θ̃(1), 1) in (29) has an asymptotic χ² distribution with one degree of freedom. The type I error rate for testing H0 was again estimated by the proportion of rejections, α̂ = (1/1000) Σk I(Ts(k) ≥ q1,0.95), where Ts(k) denotes this statistic at the kth MC simulation (1 ≤ k ≤ 1000).
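The rejection-rate calculation can be sketched as follows. The critical value uses the fact that a χ² variate with one degree of freedom is the square of a standard normal; the test statistics here are stand-in draws from the null distribution rather than Wald or score statistics from fitted models, and more replications than the 1,000 used in the study are drawn only to make the illustration stable.

```python
import random
from statistics import NormalDist

def chi2_1_quantile(prob):
    """Quantile of the chi-square distribution with 1 df, via chi2_1 = Z**2."""
    return NormalDist().inv_cdf(0.5 + prob / 2.0) ** 2

def type_i_error(null_statistics, critical_value):
    """Proportion of Monte Carlo replications whose statistic reaches the critical value."""
    rejections = sum(1 for s in null_statistics if s >= critical_value)
    return rejections / len(null_statistics)

critical = chi2_1_quantile(0.95)          # the percentile q_{1,0.95}

# Stand-in statistics: squared standard normals, i.e. exact chi-square(1) draws.
rng = random.Random(11)                   # illustrative seed
stand_in_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(20000)]

rate = type_i_error(stand_in_stats, critical)
print(f"critical value: {critical:.4f}")
print(f"estimated type I error: {rate:.4f}")
```

With statistics from an actual simulation, an estimated rate noticeably above 0.05 would indicate an anti-conservative test.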
Shown in Table 1 are the estimates of θ, standard errors, and type I errors for the ZIP model in (30). For comparison purposes, we also included “Empirical” variance estimates and type I error rates based on such a variance estimate. The “Empirical” type I error rates were computed based on substituting Σθ with the Empirical variance estimate in the Wald test statistic. It is seen that type I error rates were a bit inflated for sample sizes 50 and 100 under the Wald test, but were closer to the nominal 0.05 under the “Score” and “Empirical” tests even for samples as small as n = 50.
Table 1.
Simulation summary for ZIP under complete data (βu0 = −1, β0 = 1, β1 = 1). Type I error rates are for testing H0 : β1 = 1.
Sample size | Parameter | Mean | Standard error (WGEE) | Standard error (Empirical) | Type I error, Wald (WGEE) | Type I error, Wald (Empirical) | Type I error, Score |
---|---|---|---|---|---|---|---|
50 | βu0 | −1.052 | 0.363 | 0.385 | | | |
50 | β0 | 1.000 | 0.090 | 0.100 | | | |
50 | β1 | 0.998 | 0.039 | 0.048 | 0.095 | 0.061 | 0.045 |
100 | βu0 | −1.021 | 0.252 | 0.256 | | | |
100 | β0 | 1.000 | 0.063 | 0.067 | | | |
100 | β1 | 0.999 | 0.027 | 0.031 | 0.076 | 0.052 | 0.054 |
200 | βu0 | −1.012 | 0.177 | 0.176 | | | |
200 | β0 | 0.999 | 0.044 | 0.046 | | | |
200 | β1 | 1.000 | 0.019 | 0.021 | 0.065 | 0.042 | 0.042 |
To compare our approach with GEE II, we also estimated the parameters using a program developed for such an alternative by Hall and Zhang (2004)(24). As noted earlier, their method modeled the conditional variance, rather than the second moment. In addition, they assumed working independence between the mean and variance. We obtain quite similar results (not shown), which may not be surprising, as such differences are likely to have minor impact on inference given the marginal ZIP model in (30).
4.1.2 Missing Data Case
Assuming no missing data at baseline t = 1, we simulated missing responses under MCAR and MAR with about 20% missing data at t = 2. By applying the discussion in Section 3.2 to the context of the pre-post design, we modeled the missingness at time t = 2 under MAR by:
We again considered the null H0: β1 = 1, and computed the Wald and score statistics and the associated type I error rates. The Wald statistic Qn is computed the same way as in the complete data case except that the estimate of θ is obtained from the WGEE in (23).
Shown in Tables 2 and 3 are the estimates of θ, standard errors, and type I errors for the ZIP model under MCAR and MAR, respectively. As in the complete data case, the score test performed well in correcting the upward bias in type I error rates of the Wald statistic in testing the null H0: β1 = 1, especially for sample sizes n = 50 and 100. For inference under MAR, the Wald statistic again yielded inflated type I error rates for testing the null, but the score test corrected the upward bias and maintained a type I error rate consistently near 0.05 across all sample sizes.
Table 2.
Simulation summary for ZIP under missing data following MCAR (βu0 = −1, β0 = 1, β1 = 1). Type I error rates are for testing H0 : β1 = 1.
Sample size | Parameter | Mean | Standard error (GEE) | Standard error (Empirical) | Type I error, Wald (GEE) | Type I error, Wald (Empirical) | Type I error, Score |
---|---|---|---|---|---|---|---|
50 | βu0 | −1.077 | 0.378 | 0.402 | | | |
50 | β0 | 0.991 | 0.112 | 0.120 | | | |
50 | β1 | 0.997 | 0.115 | 0.135 | 0.108 | 0.061 | 0.046 |
100 | βu0 | −1.026 | 0.257 | 0.258 | | | |
100 | β0 | 0.997 | 0.080 | 0.082 | | | |
100 | β1 | 0.998 | 0.082 | 0.088 | 0.075 | 0.057 | 0.044 |
200 | βu0 | −1.016 | 0.180 | 0.183 | | | |
200 | β0 | 0.998 | 0.057 | 0.055 | | | |
200 | β1 | 1.000 | 0.059 | 0.060 | 0.055 | 0.049 | 0.045 |
4.2 Real Study Data
To illustrate the approach with real study data, we applied it to a multi-center, NIDA-sponsored study entitled “HIV/STD Safer Sex Skills Groups For Men In Methadone Maintenance Or Drug-free Outpatient Treatment Programs,” known as CTN0018 within the Clinical Trials Network (CTN) studies. This study was designed to examine the effectiveness of a five-session motivational and skills training HIV/AIDS group intervention developed to reduce sexual risk behaviors in men, as compared to an HIV education only control condition. Unlike most community-based studies in which the HIV education provided is limited to information, this trial integrated a component to provide skill-training programs such as role plays to reduce sexual risk behaviors. The primary outcome of the study is the number of unprotected vaginal and anal sexual intercourse occasions (USO), which was assessed at baseline, 2 weeks, and 3 and 6 months(53; 54).
Out of 573 eligible subjects screened, 422 subjects completed assessment at baseline. Among these, 381 (91.27%) and 345 (60.2%) came for assessment at 3 and 6 months, respectively. Since 2 weeks was too short to observe a reasonably large USO, we limited our analysis to the period from baseline to the 3- and 6-month follow-up visits.
Shown in Table 4 are the mean USOs and percent of zero USO at baseline, 3 and 6 months for the two treatment groups. It is evident that there was a preponderance of zeros in the distribution of this outcome at each assessment time. Accordingly, we modeled the USO at 3 months (yi1) and 6 months (yi2) as a function of treatment condition, time and time by treatment interaction, controlling for baseline USO, yi0, using the FRM-based ZIP model in (11) with
$$\mathrm{logit}(\rho_{it}) = \beta_{u0} + \beta_{u1} x_i + \beta_{u2} y_{i0} + \beta_{u3}\,\mathrm{time}_t + \beta_{u4}\, x_i\,\mathrm{time}_t, \quad \log(\mu_{it}) = \beta_0 + \beta_1 x_i + \beta_2 y_{i0} + \beta_3\,\mathrm{time}_t + \beta_4\, x_i\,\mathrm{time}_t, \quad t = 1, 2, \tag{31}$$
where xi was an indicator with xi = 1 for the intervention and 0 otherwise.
Table 4.
Mean USO and number of zeros at each assessment time for CTN0018 study
Assessment | Intervention, mean (S.D.) | Without intervention, mean (S.D.) | Zeros, n (%) |
---|---|---|---|
Baseline | 21.46 (26.66) | 22.34 (27.77) | 65 (15.40) |
USO at 3 months | 15.71 (25.43) | 18.14 (27.21) | 125 (32.80) |
USO at 6 months | 15.05 (23.35) | 17.19 (25.89) | 132 (38.26) |
To account for potential response-dependent MAR missingness, we modeled the missingness under MMDP using logistic regression:
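$$\mathrm{logit}(p_{it}) = \gamma_{0t} + \gamma_{1t}\, y_{i(t-1)} + \gamma_{2t}\, x_i, \quad t = 1, 2. \tag{32}$$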
We assumed a Markov condition in (32) so that the missingness only depended on the most recent observed response.
Shown in Table 5 are the estimates of parameters from the logistic regression, their standard errors and corresponding p-values. The results show that the missingness at time t = 1 depended on the treatment assignment, while the missingness at time t = 2 depended on the observed response at time t = 1. In other words, the subjects in the intervention group were more likely to drop out than those in the control group at time t = 1, while those with smaller values of USO at t = 1 were more likely to drop out at t = 2. Based on these results, we proceeded with inference under MAR.
Table 5.
Estimates of logistic regression for modeling missingness for CTN0018 study
Assessment time | Predictor | Estimate | Standard error | P-value |
---|---|---|---|---|
t = 1 | Intercept | 2.777 | 0.319 | < 0.001 |
t = 1 | yi1 | −0.002 | 0.006 | 0.752 |
t = 1 | Intervention | −0.869 | 0.351 | 0.013 |
t = 2 | Intercept | 1.443 | 0.206 | < 0.001 |
t = 2 | yi2 | 0.019 | 0.007 | 0.007 |
t = 2 | Intervention | −0.325 | 0.257 | 0.206 |
Shown in Table 6 are the estimates of parameters of the ZIP model, their standard errors and associated p-values. As the interaction term involving time and intervention was significant in neither the logistic (ρit) nor the Poisson (μit) component of the model, we refit the model without this term, with the results from the revised model shown in Table 7.
Table 6.
Results of FRM-based ZIP model for CTN0018 study. P-values are for H0 : β = 0.
Parameter | Estimate | Standard error | P-value (Wald) | P-value (Score) |
---|---|---|---|---|
Log-linear part (μit) | ||||
β0 | 2.69 | 0.196 | < 0.001 | < 0.001 |
β1 (intervention) | −0.08 | 0.028 | < 0.001 | < 0.001 |
β2 (baseline USO) | 0.012 | 0.001 | < 0.001 | < 0.001 |
β3 (time) | −0.017 | 0.118 | 0.885 | 0.883 |
β4 (intervention*time) | −0.062 | 0.187 | 0.742 | 0.741 |
Logistic part (ρit) | ||||
βu0 | −0.52 | 0.354 | 0.142 | 0.140 |
βu1 (intervention) | 0.301 | 0.499 | 0.564 | 0.562 |
βu2 (baseline USO) | −0.017 | 0.004 | < 0.001 | < 0.001 |
βu3 (time) | 0.126 | 0.221 | 0.568 | 0.566 |
βu4 (intervention*time) | −0.121 | 0.314 | 0.701 | 0.700 |
Table 7.
Results from revised additive ZIP model for CTN0018 study. P-values are for H0 : β = 0.
Parameter | Estimate | Standard error | P-value (Wald) | P-value (Score) |
---|---|---|---|---|
Log-linear part (μit) | ||||
β0 | 2.90 | 0.021 | < 0.001 | < 0.001 |
β1 (intervention) | −0.09 | 0.025 | < 0.001 | < 0.001 |
β2 (baseline USO) | 0.012 | 0.0004 | < 0.001 | < 0.001 |
Logistic part (ρit) | ||||
βu0 | −0.68 | 0.144 | < 0.001 | < 0.001 |
βu1 (intervention) | 0.371 | 0.200 | 0.065 | 0.068 |
βu2 (baseline USO) | −0.015 | 0.004 | < 0.001 | < 0.001 |
For treatment effectiveness based on the results from the additive model, the logistic part of the model suggests that the intervention increased the likelihood of no risk for USO during the study, although this effect fell just short of significance at the 0.05 level (Wald p = 0.065, score p = 0.068), while the log-linear component shows that the intervention significantly reduced the mean frequency of USO for the at-risk subgroup. The ratio of the mean USO of the treated to that of the control condition is exp(−0.09) ≈ 0.91, suggesting a decrease of about 9% in USO for the treated subjects.
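The effect sizes for the intervention can be recomputed directly from the estimates and standard errors in Table 7. The sketch below is illustrative only: the 95% intervals are Wald-type intervals built with the normal quantile 1.96, which is an assumption on our part rather than a result reported in the tables, and the helper name is hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for an assumed Wald-type 95% interval


def exp_effect(estimate: float, std_err: float):
    """Return exp(estimate) with a Wald-type 95% interval on the same scale."""
    lower = math.exp(estimate - Z_95 * std_err)
    upper = math.exp(estimate + Z_95 * std_err)
    return math.exp(estimate), (lower, upper)


# Log-linear part, beta1 (intervention): ratio of mean USO, treated vs control.
rate_ratio, rate_ci = exp_effect(-0.09, 0.025)

# Logistic part, beta_u1 (intervention): odds ratio for being a structural zero.
odds_ratio, odds_ci = exp_effect(0.371, 0.200)

print(f"rate ratio = {rate_ratio:.3f}, "
      f"95% CI ({rate_ci[0]:.3f}, {rate_ci[1]:.3f})")
print(f"odds ratio = {odds_ratio:.3f}, "
      f"95% CI ({odds_ci[0]:.3f}, {odds_ci[1]:.3f})")
print(f"percent reduction in mean USO = {100 * (1 - rate_ratio):.1f}%")
```

The interval for the rate ratio excludes 1, whereas the interval for the odds ratio includes 1, in line with the p-values reported for the two intervention coefficients.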
Baseline USO also played a significant role. The logistic component indicates that lower baseline USO would significantly increase the likelihood of being at no risk for USO during the study period. The log-linear part of the model shows that higher baseline USO was significantly associated with higher USO during the study. The findings suggest that substance abuse treatment programs should consider offering motivational exercises and skills training to achieve greater reductions in risky sexual activities.
5 Discussion
Count responses are a common type of outcome in biomedical, psychosocial and related services research. We discussed two major manifestations of departure from the Poisson assumption, overdispersion and structural zeros, and reviewed existing methods for addressing these two important issues. In particular, we focused on the limitations of available approaches with respect to longitudinal data analysis and proposed an approach to systematically tackle these problems under a unified modeling framework.
We applied the proposed approach to a real study in HIV prevention, allowing us to address important methodological issues in a timely application. In addition, the results from the simulation study show that the proposed approach works well for longitudinal study data under both complete and missing data settings. Although inference is based on large-sample theory, the approach appears to provide valid inference for sample sizes as small as 50.
Table 3.
Simulation summary for ZIP under missing data following MAR (βu0 = −1, β0 = 1, β1 = 1)
Parameter | Mean | Standard error (WGEE) | Standard error (Empirical) | Type I error, Wald (WGEE) | Type I error, Wald (Empirical) | Type I error, Score |
---|---|---|---|---|---|---|
Sample size of 50 | | | | | | |
βu0 | −1.05 | 0.402 | 0.400 | | | |
β0 | 0.995 | 0.128 | 0.105 | | | |
β1 | 1.000 | 0.168 | 0.151 | | | |
H0 : β1 = 1 | | | | 0.094 | 0.062 | 0.052 |
Sample size of 100 | | | | | | |
βu0 | −1.02 | 0.253 | 0.261 | | | |
β0 | 1.001 | 0.064 | 0.066 | | | |
β1 | 0.998 | 0.088 | 0.080 | | | |
H0 : β1 = 1 | | | | 0.087 | 0.058 | 0.043 |
Sample size of 200 | | | | | | |
βu0 | −1.01 | 0.176 | 0.177 | | | |
β0 | 0.999 | 0.044 | 0.044 | | | |
β1 | 1.000 | 0.066 | 0.060 | | | |
H0 : β1 = 1 | | | | 0.055 | 0.056 | 0.051 |
Acknowledgments
This research was supported in part by NIH grant R21 DA027521-01. We want to thank two anonymous reviewers for very careful reviews of the manuscript, with constructive comments and edits that led to a significantly improved manuscript.
Appendix.
A
The variance Var(fi ∣ xi) for the cross-sectional data case is readily computed using the moments up to the 4th order under either the ZIP or the NB distribution. The first- and second-order moments for ZIP and NB are given in (9) and (10), while the 3rd- and 4th-order moments for the two models are given by:
(33) |
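As a numerical cross-check of the moment calculations for the ZIP case, the sketch below uses the standard raw moments of the Poisson distribution and the fact that a structural zero contributes nothing to any raw moment of order one or higher, so that E(y^k) under ZIP equals (1 − ρ) times the Poisson moment. This is not a reproduction of the expressions in (33); the function names, the parameter values, and the choice of fi = (yi, yi²) in the last step are assumptions made for illustration.

```python
import math


def poisson_raw_moments(mu: float):
    """Raw moments E(Y^k), k = 1..4, of a Poisson(mu) variable."""
    return (
        mu,
        mu + mu**2,
        mu + 3 * mu**2 + mu**3,
        mu + 7 * mu**2 + 6 * mu**3 + mu**4,
    )


def zip_raw_moments(rho: float, mu: float):
    """Raw moments of ZIP(rho, mu); rho is the structural-zero probability."""
    return tuple((1.0 - rho) * m for m in poisson_raw_moments(mu))


def zip_raw_moments_numeric(rho: float, mu: float, upper: int = 200):
    """Same moments by direct summation over the ZIP probability function."""
    moments = [0.0, 0.0, 0.0, 0.0]
    pmf = math.exp(-mu)  # Poisson P(Y = 0); zeros add nothing to the moments
    for k in range(1, upper):
        pmf *= mu / k  # Poisson P(Y = k) from P(Y = k - 1)
        for j in range(4):
            moments[j] += (k ** (j + 1)) * (1.0 - rho) * pmf
    return tuple(moments)


rho, mu = 0.3, 2.5  # hypothetical parameter values
closed = zip_raw_moments(rho, mu)
numeric = zip_raw_moments_numeric(rho, mu)

# Variance matrix of the hypothetical response f = (y, y^2).
m1, m2, m3, m4 = closed
var_f = [[m2 - m1**2, m3 - m1 * m2],
         [m3 - m1 * m2, m4 - m2**2]]

print("closed form:", closed)
print("numeric sum:", numeric)
print("Var(f):", var_f)
```

The last step shows why moments up to the 4th order are needed: the variance of a response involving y² depends on E(y³) and E(y⁴).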
B. Proof of Theorem 2
Let and πi = (πi1, …, πim1)⊤. Then, , with GiΔiSi = Gi(xi, θ, α)Δi(ri, πi, γ)Si(yi, xi, θ). It follows from the iterated conditional expectation that E(GiΔiSi) = E[GiSiE(Δi ∣ ri, yi, xi)]. By definition, Δi is an m × m block diagonal matrix with the tth block diagonal matrix given by , with Im denoting the m × m identity matrix. Since , it follows that E(GiΔiSi) = E(GiSi) = 0. Thus, the WGEE II is unbiased and the estimate θ̂ obtained as the solution to the equations is consistent.
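The unbiasedness argument rests on the inverse probability weights having conditional expectation one given the data. A small Monte Carlo sketch can illustrate why this matters. It is a toy setting and not the model of this paper: the responses are Gaussian rather than counts, there are only two assessments, the probability of observing the second response depends on the first through a logistic model with made-up coefficients, and the true probabilities are used as weights instead of estimates.

```python
import math
import random


def expit(z: float) -> float:
    """Inverse logit transform."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n: int, seed: int = 2024):
    """Compare complete-case and weighted estimates of E(y2), whose true value is 0."""
    rng = random.Random(seed)
    cc_sum = cc_n = 0.0   # complete-case (unweighted) accumulators
    wy_sum = w_sum = 0.0  # inverse-probability-weighted accumulators
    for _ in range(n):
        y1 = rng.gauss(0.0, 1.0)       # always observed
        y2 = y1 + rng.gauss(0.0, 1.0)  # possibly missing
        pi = expit(0.5 + y1)           # MAR: depends on observed y1 only
        observed = rng.random() < pi
        if observed:
            cc_sum += y2
            cc_n += 1.0
            wy_sum += y2 / pi
            w_sum += 1.0 / pi
    # The weighted mean solves sum_i (r_i / pi_i) * (y2_i - mu) = 0 for mu.
    return cc_sum / cc_n, wy_sum / w_sum


cc_mean, ipw_mean = simulate(20000)
print(f"complete-case mean of y2: {cc_mean:.3f}")
print(f"weighted mean of y2:      {ipw_mean:.3f}")
```

Because subjects with larger y1 are more likely to be observed, the unweighted complete-case mean is shifted away from zero, whereas the weighted estimating equation recovers a value close to the true mean.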
Let γ̂ be the solution to (22). By a Taylor expansion of the estimating equations in (22) and solving for γ̂ − γ, we obtain
(34) |
where op(1) denotes the stochastic o(1) (25). Also, by applying a Taylor series expansion to the WGEE II in (23), we have
(35) |
If α̂ is , it follows that
By substituting op(1) for in (35) and solving for (θ̂ − θ), we obtain
(36) |
It follows from (34) and (36) that
(37) |
Since
(38) |
where →p denotes convergence in probability, it follows from (37) and (38) that
(39) |
By applying the central limit theorem and Slutsky's theorem to (39) (25), θ̂ is asymptotically normal with the asymptotic variance given by Σθ in (24).
C. Asymptotic Normality of Score Statistic
First, assume no missing data. Then, By applying the law of large numbers,
(40) |
It follows from a Taylor's series expansion and (40) that
Thus,
(41) |
Similarly, since , we have:
(42) |
It follows from (41) and (42) that
By the central limit theorem,
(43) |
where G is defined in (28) and Σθ in (24).
In the presence of missing data, as defined in (28). By a similar argument, wn(2) (θ̃(1), θ(20)) has an asymptotic normal distribution, which implies that the score statistic Ts((θ̃(1), θ(2))) has the asymptotic distribution.
References
- 1. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34:1–14.
- 2. Crepon B, Duguet E. Research and development, competition and innovation: pseudo-maximum likelihood and simulated maximum likelihood methods applied to count data models with heterogeneity. Journal of Econometrics. 1997;79:355–378.
- 3. Miaou SP. The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accident Analysis & Prevention. 1994;26:471–482. doi: 10.1016/0001-4575(94)90038-8.
- 4. Welsh A, Cunningham RB, Donnelly CF, Lindenmayer DB. Modeling the abundance of rare species: statistical-models for counts with extra zeros. Ecological Modelling. 1996;88:297–308.
- 5. Faddy M. Stochastic models for analysis of species abundance data. In: Fletcher DJ, Kavalieris L, Manly BF, editors. Statistics in Ecology and Environmental Monitoring 2: Decision Making and Risk Assessment in Biology. University of Otago Press; 1998. pp. 33–40.
- 6. Gurmu S, Trivedi P. Excess zeros in count models for recreational trips. Journal of Business & Economic Statistics. 1996;14:469–477.
- 7. Gurmu S. Semi-parametric estimation of hurdle regression models with an application to Medicaid utilization. Journal of Applied Econometrics. 1997;12:225–242.
- 8. Shonkwiler J, Shaw W. Hurdle count-data models in recreation demand analysis. Journal of Agricultural and Resource Economics. 1996;21:210–219.
- 9. Hall DB. Zero-Inflated Poisson and binomial regression with random effects: A case study. Biometrics. 2000;56:1030–1039. doi: 10.1111/j.0006-341x.2000.01030.x.
- 10. Yau KW, Lee AH. Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Statistics in Medicine. 2001;20:2907–2920. doi: 10.1002/sim.860.
- 11. World Health Organization. Optimal duration of exclusive breastfeeding. Geneva: WHO; 2001.
- 12. Donath S, Amir LH. Rates of breastfeeding in Australia by State and socio-economic status: Evidence from the 1995 National Health Survey. Journal of Pediatrics and Child Health. 2000;36(2):164–168. doi: 10.1046/j.1440-1754.2000.00486.x.
- 13. Cheung YB. Zero-inflated models for regression analysis of count data: a study of growth and development. Statistics in Medicine. 2002;21:1461–1469. doi: 10.1002/sim.1088.
- 14. Wyman PA, Cross W, Brown HC, Yu Q, Tu XM. Intervention to strengthen emotional self-regulation in children with emerging mental health problems: Proximal impact on school behavior. Journal of Abnormal Child Psychology. doi: 10.1007/s10802-010-9398-x. in press.
- 15. Abma JC, Martinez GM, Mosher WD, Dawson BS. Teenagers in the United States: Sexual activity, contraceptive use, and child bearing. Vital Health Statistics. 2002;23(24).
- 16. Abe T, Martin I, Roche L. Clusters of Census Tracts with High Proportions of Men with Distant-Stage Prostate Cancer Incidence in New Jersey, 1995 to 1999. American Journal of Preventive Medicine. 2006;30(2):S60–S66. doi: 10.1016/j.amepre.2005.09.003.
- 17. Hur K, Hedeker D, Henderson W, Khuri S, Daley J. Modeling clustered count data with excess zeros in health care outcomes research. Health Services and Outcomes Research Methodology. 2002;3:5–20.
- 18. Lachenbruch PA. Analysis of data with excess zeros. Statistical Methods in Medical Research. 2002;11:297–302. doi: 10.1191/0962280202sm289ra.
- 19. Lee AH, Wang K, Scott JA, Yau KKW, McLachlan GJ. Multi-level zero-inflated Poisson regression modelling of correlated count data with excess zeros. Statistical Methods in Medical Research. 2006;15:47–61. doi: 10.1191/0962280206sm429oa.
- 20. Ritz J, Spiegelman D. Equivalence of conditional and marginal regression models for clustered and longitudinal data. Statistical Methods in Medical Research. 2004;13:309–323.
- 21. Zhang H, Xia Y, Chen R, Lu N, Tang W, Tu X. On Modeling Longitudinal Binomial Responses: Implications from Two Dueling Paradigms. Journal of Applied Statistics. 2011;38:2373–2390.
- 22. Zhang H, Tang W, Yu Q, Feng C, Gunzler D, Tu X. A New Look at the Difference between GEE and GLMM When Modeling Longitudinal Count Responses. Journal of Applied Statistics.
- 23. Estimating equations for mixed Poisson models. In: Estimating Equations. Oxford University Press; New York: 1991. pp. 35–46.
- 24. Hall DB, Zhang ZG. Marginal models for zero inflated clustered data. Statistical Modeling. 2004;4:161–180.
- 25. Kowalski J, Tu XM. Modern Applied U Statistics. Wiley; New York: 2007.
- 26. Crowder M. On linear and quadratic estimating functions. Biometrika. 1987;74:591–597.
- 27. Dobbie MJ, Welsh AH. Modeling correlated zero-inflated count data. Australian & New Zealand Journal of Statistics. 2001;43:431–444.
- 28. Prentice RL, Zhao LP. Estimating Equations for Parameters in Means and Covariances of Multivariate Discrete and Continuous Responses. Biometrics. 1991;47:825–839.
- 29. Liang KY, Zeger SL, Qaqish B. Multivariate regression analyses for categorical data. J R Statist Soc B. 1992;54:3–40.
- 30. Rubin DB. Inference and Missing Data. Biometrika. 1976;63:581–592.
- 31. Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: Wiley; 1987.
- 32. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Chapman and Hall; London: 1989.
- 33. Dean CB, Lawless JF. Tests for detecting overdispersion in Poisson regression models. J Amer Statist Assoc. 1989;84:467–472.
- 34. Cameron AC, Trivedi PK. Econometric models based on count data: Comparisons and applications of some estimators and tests. Journal of Applied Econometrics. 1986;1:29–53.
- 35. Lee LF. Specification test for Poisson regression models. International Economic Review. 1986;27:689–706.
- 36. Tu XM, Feng C, Kowalski J, Tang W, Wang H, Wan C, Ma Y. Correlation analysis for longitudinal data: Applications to HIV and psychosocial research. Statistics in Medicine. 2007;26:4116–4138. doi: 10.1002/sim.2857.
- 37. Ma Y, Tang W, Feng C, Tu XM. Inference for kappas for longitudinal study data: applications to sexual health research. Biometrics. 2008;64:781–789. doi: 10.1111/j.1541-0420.2007.00934.x.
- 38. Ma Y, Tang W, Yu Q, Tu XM. Modeling concordance correlation coefficient for longitudinal study data. Psychometrika. 2010;75:99–119.
- 39. Ma Y, Gonzalez Della Valle A, Zhang H, Tu XM. A U-statistics based approach for modeling Cronbach Coefficient Alpha within a longitudinal data setting. Statistics in Medicine. 2011;29(6):659–670. doi: 10.1002/sim.3853.
- 40. Yu Q, Tang W, Kowalski J, Tu XM. Multivariate U-Statistics: A Tutorial with applications. Wiley Interdisciplinary Reviews – Computational Statistics. 2011;3:457–471.
- 41. Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. JASA. 1995;90:106–121.
- 42. Cameron AC, Trivedi PK. Regression Analysis of Count Data. Cambridge Univ. Press; London: 1998.
- 43. Reboussin BA, Liang KY. An estimating equations approach for the LISCOMP Model. Psychometrika. 1998;63:165–182.
- 44. Yu Q. Distribution-free models for longitudinal count data. Ph.D. Thesis. Department of Biostatistics and Computational Biology, School of Medicine and Dentistry, University of Rochester; Rochester, New York: 2009.
- 45. Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics: Simulation and Computation. 1994;23:939–951.
- 46. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semi-parametric nonresponse models. Journal of the American Statistical Association. 1999;94(448):1096–1146.
- 47. Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
- 48. Rotnitzky A, Jewell NP. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika. 1990;77:485–497.
- 49. Pan W. On the robust variance estimator in generalized estimating equations. Biometrika. 2001;88:901–906.
- 50. Frees EW, Valdez EA. Understanding relationships using copulas. North American Actuarial Journal. 1998;2:1–25.
- 51. Nelsen RB. An Introduction to Copulas. Springer; New York: 2006.
- 52. Yan JR. Package copula on CRAN, multivariate dependence with copula. 2009. http://cran.r-project.org/web/packages/copula/index.html
- 53. Calsyn DA, Wells EA, Saxon AJ, Jackson R, Heiman JR. Sexual activity under the influence of drugs is common among methadone clients. In: Harris L, editor. Problems of Drug Dependence 1999. Vol. 315. National Institute on Drug Abuse; 2000. NIH Pub. No. 00-4773.
- 54. Calsyn DA, Hatch-Maillette M, Tross S, et al. Motivational and Skills Training HIV/Sexually Transmitted Infection Sexual Risk Reduction Groups for Men. Journal of Substance Abuse Treatment. 2009;37(2):138–150. doi: 10.1016/j.jsat.2008.11.008.