Distribution-free Models for Longitudinal Count Responses with Overdispersion and Structural zeros

Q Yu; R Chen; W Tang; H He; R Gallop; P Crits-Christoph; J Hu; XM Tu

doi:10.1002/sim.5691

Author manuscript; available in PMC: 2014 Jun 30.

Published in final edited form as: Stat Med. 2012 Dec 12;32(14):2390–2405. doi: 10.1002/sim.5691

PMCID: PMC3806502 NIHMSID: NIHMS499723 PMID: 23239019

Summary

Overdispersion and structural zeros are two major manifestations of departure from the Poisson assumption when modeling count responses using Poisson loglinear regression. As noted in a large body of literature, ignoring such departures could yield bias and lead to wrong conclusions. Different approaches have been developed to tackle these two major problems. In this paper, we review available methods for dealing with overdispersion and structural zeros within a longitudinal data setting and propose a distribution-free modeling approach to address the limitations of these methods by utilizing a new class of functional response models (FRM). We illustrate our approach with both simulated and real study data.

Keywords: functional response models, monotone missing data pattern, negative binomial, zero-inflated Poisson, weighted generalized estimating equations

1 Introduction

Count (or frequency) responses such as the number of heart attacks, days of hospitalization, suicide attempts or unprotected vaginal sex acts arise quite often in biomedical and psychosocial research. The Poisson distribution and, more generally, Poisson-based log-linear regression are widely used for modeling such data. However, heterogeneity in study populations, such as data clustering, often creates extra variability, which renders the Poisson distribution inappropriate for modeling count data in such instances. One approach for addressing this extra-Poisson variation, or overdispersion, is the popular negative binomial (NB) distribution. This modeling strategy, however, is rendered ineffective when the extra variability is caused by an excessive number of zeros above and beyond the number of zeros expected by the Poisson law. For example, when modeling behavioral outcomes such as the number of unprotected vaginal sex acts over a period of time in HIV prevention research, the study population often contains a subgroup of individuals who are not at risk for such a behavior during the study period, in which case neither the Poisson nor the NB is able to accommodate such structural zeros. One popular approach for addressing such inflated zero counts is the zero-inflated Poisson (ZIP) model, which has been applied to a diverse range of studies(1-16). The inherent methodological problems with structural zeros have received a great deal of attention in the literature(4; 9; 10; 19; 17; 18).

When modeling count responses in the presence of overdispersion and structural zeros within a longitudinal data setting, one of the current strategies is to employ random effects within the context of the generalized linear mixed-effects model (GLMM) to account for correlated responses from repeated assessments over time(19). However, as it relies on parametric assumptions about both the random effects and the response for inference, such an approach lacks robustness when real study data depart from the assumed distributional models. Further, the random effects induce overdispersion into the marginal model at each assessment, giving rise to results and findings quite different from those of the marginal models(20; 21; 22). In addition, such an approach computes estimates using the expectation-maximization (EM) algorithm, which can be problematic since EM is notorious for its slow convergence and may yield local rather than global maxima, making it difficult to apply such methods in routine analyses.

A popular alternative is to use the generalized estimating equations (GEE) to address correlated longitudinal responses. The GEE approach is widely used for modeling the mean response, or first-order moment. Unlike GLMM, model parameters have the same interpretation in the marginal and joint models across assessment times. In addition, as GEE models only the marginal mean of the response variable at each assessment time, it dispenses with both layers of distributional assumptions (on the random effects and on the response) and thereby provides consistent estimates regardless of the complexity of the correlation structure and the distribution of the response. GEE estimates are also much easier to compute than those based on the GLMM approach.

As the key difference between the standard (Poisson) log-linear model and other models for count responses such as ZIP lies in the variance, or the second-order moment, GEE does not apply directly to extending such models to a longitudinal data setting(23; 24; 25). Also, since ZIP is a mixture of two distributions, we will not be able to identify the model parameters by simply modeling the mean response(24; 26). One approach is to model the zero and positive outcomes separately using a truncated Poisson for the positive response and a logistic regression for the zero outcome(27). However, as the structural and sampling zeros are mixed into a single category, this approach is unable to identify the parameters for modeling the structural zeros, which is often of great interest in practice. For example, in the hospitalization example, this approach will only model those who are hospitalized, since the at-risk subgroup for hospitalization is mixed with those who are healthy and are not at risk for hospitalization. In many studies, it is of great importance to model the at- and non-risk subgroups separately. An alternative to address the identifiability issue is to include a modeling component for the variance and apply GEE to both the specified mean and variance(24; 25; 28; 29). However, none of these methods adequately addresses missing data, yielding biased inference if the missing data do not follow the missing completely at random (MCAR) assumption(30; 31).

In this paper, we propose an approach to overcome the aforementioned difficulties by utilizing a class of functional response models (FRM) and the popular weighted generalized estimating equations (WGEE). In Section 2, we first give a brief overview of the problems with overdispersed and zero-inflated count data and popular models for addressing them. We then introduce FRM and discuss its application to the current setting. In Section 3, we discuss inference for the FRM-based models under both complete and missing data. In Section 4, we illustrate the proposed models with real study data and investigate their performance using simulated data. In Section 5, we give our concluding remarks.

2 Functional Response Models for Count Response

We start with a brief review of existing approaches for addressing overdispersion and structural zeros.

2.1 Models for Overdispersion and Structural Zeros

Consider first a cross-sectional study with n subjects, and let y_i denote a count response and x_i a vector of explanatory variables. The popular Poisson log-linear regression, a member of the generalized linear model (GLM) family, models the conditional mean of y_i given x_i, μ_i = E(y_i | x_i), by applying the log function to link μ_i to the linear predictor $η_{i} = x_{i}^{⊤} β$ (32):

y_{i} ∣ x_{i} ~ i.d. P (μ_{i}), log (μ_{i}) = x_{i}^{⊤} β, 1 \leq i \leq n,

(1)

where i.d. denotes independently distributed and P(μ) the Poisson distribution with mean μ. Under (1), the conditional mean E(y_i | x_i) and variance Var(y_i | x_i) of y_i given x_i satisfy:

E (y_{i} ∣ x_{i}) = Var (y_{i} ∣ x_{i}) = μ_{i} .

(2)

As mentioned, the conditional variance Var(y_i | x_i) often exceeds the conditional mean μ_i in real study applications, making (1) inappropriate for modeling such count data. When overdispersion occurs, the standard error of the parameter estimate of the Poisson model is artificially deflated, giving rise to artificially inflated effect size estimates and falsely significant findings.

Overdispersion can often be empirically detected by goodness of fit statistics or even formally tested(25; 32; 33). When deemed present, overdispersion may be corrected post hoc by using robust variance estimates(25). An alternative is to use models that explicitly address this issue. For example, the popular negative binomial (NB) model allows the variance to exceed the mean:

E (y_{i} ∣ μ_{i}, α) = μ_{i}, Var (y_{i} ∣ μ_{i}, α) = μ_{i} (1 + α μ_{i}) .

(3)

Unlike the Poisson, the NB has an extra parameter α to indicate the degree of overdispersion. As α → 0, Var(y_i ∣ μ_i, α) → μ_i. Thus, unless α = 0, the variance of NB is always larger than the mean, addressing overdispersion. Under NB, we can check overdispersion by testing the null H₀ : α = 0. Note, however, that under H₀, α = 0 is a boundary point of α ≥ 0, so the maximum likelihood estimate (MLE) $\hat{α}$ of α cannot be used directly for testing H₀, and alternative score statistics must be used(33; 34; 35).
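To make the mean-variance relations in (2) and (3) concrete, the following minimal sketch simulates Poisson and NB counts with a common mean and compares the sample variances with μ and μ(1 + αμ). The values μ = 3 and α = 0.5, the seed, and the gamma-mixture construction of the NB are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(2012)   # arbitrary seed, for reproducibility only
mu, alpha, n = 3.0, 0.5, 200_000    # illustrative values (assumptions)

# Poisson counts: the variance equals the mean, as in (2).
y_pois = rng.poisson(mu, size=n)

# NB counts as a gamma mixture of Poissons: the gamma frailty has mean 1 and
# variance alpha, which yields Var(y) = mu * (1 + alpha * mu), as in (3).
frailty = rng.gamma(shape=1.0 / alpha, scale=alpha, size=n)
y_nb = rng.poisson(mu * frailty)

nb_var_theory = mu * (1.0 + alpha * mu)
print("Poisson: mean %.3f, variance %.3f" % (y_pois.mean(), y_pois.var()))
print("NB:      mean %.3f, variance %.3f (theory %.3f)"
      % (y_nb.mean(), y_nb.var(), nb_var_theory))
```

Fitting a Poisson model to data like `y_nb` would understate the variance by roughly the factor 1 + αμ, which is the source of the deflated standard errors described above.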

Count responses in many biomedical and psychosocial studies are dominated by a preponderance of zeros that exceeds the expected frequency of the Poisson. Such excess or structural zeros not only cause overdispersion, but also affect the conditional mean, leading to biased estimates of model parameters. The zero-inflated Poisson (ZIP) model is a popular approach to address the twin effects of structural zeros on both the mean and variance.

Let u_i and v_i be two subsets of x_i, which may overlap or even be identical, and thus need not form a partition of x_i. The ZIP regression model is defined by:

y_{i} ∣ x_{i} ~ i.d. ZIP (μ_{i}, ρ_{i}), logit (ρ_{i}) = u_{i}^{⊤} β_{u}, log (μ_{i}) = v_{i}^{⊤} β_{v}, 1 \leq i \leq n,

(4)

where ZIP(μ, ρ) denotes the ZIP distribution defined by:

f_{ZIP} (y ∣ μ, ρ) = \begin{cases} ρ f_{0} (0) + (1 - ρ) f_{P} (0 ∣ μ) & \text{if } y = 0, \\ (1 - ρ) f_{P} (y ∣ μ) & \text{if } y > 0, \end{cases}

(5)

with f₀(y) denoting a degenerate distribution centered at 0. In (5), the Poisson probability at 0, f_P (0 ∣ μ), is modified by ρf₀ (0) + (1 − ρ) f_P (0 ∣ μ) with ρf₀ (0) = ρ to account for structural zeros.
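As a minimal sketch of the ZIP distribution in (5), the code below evaluates the probability function and checks numerically that it sums to one and has mean (1 − ρ)μ. The function names and the values μ = 2, ρ = 0.3 are hypothetical choices made only for this illustration.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability f_P(y | mu) for an integer y >= 0."""
    return math.exp(-mu) * mu**y / math.factorial(y)

def zip_pmf(y, mu, rho):
    """ZIP probability in (5): a point mass at 0 mixed with a Poisson(mu)."""
    structural = rho if y == 0 else 0.0   # rho * f_0(y), with f_0 degenerate at 0
    return structural + (1.0 - rho) * poisson_pmf(y, mu)

mu, rho = 2.0, 0.3                        # illustrative values (assumptions)
support = range(60)                       # truncation point; the tail beyond it is negligible
total = sum(zip_pmf(y, mu, rho) for y in support)
mean = sum(y * zip_pmf(y, mu, rho) for y in support)

print("P(y = 0): Poisson %.4f, ZIP %.4f" % (poisson_pmf(0, mu), zip_pmf(0, mu, rho)))
print("total probability %.6f, mean %.4f" % (total, mean))
```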

Consider these models within a longitudinal setting with m assessments, with y_it, x_it, u_it and v_it denoting the respective variables at time t (1 ≤ t ≤ m). We may model y_it as a function of x_it (or u_it and v_it for ZIP) by using either a parametric or distribution-free modeling approach. As mentioned, the former suffers from interpretational and computational problems. A popular distribution-free alternative with inference based on the generalized estimating equations (GEE) is to specify the conditional mean of y_it given x_it, which for a count response has the following form,

E (y_{it} ∣ x_{it}) = exp (x_{it}^{⊤} β), 1 \leq t \leq m, 1 \leq i \leq n .

(6)

This mean-based specification, however, is not sufficient to distinguish the Poisson from the NB, as the two models only differ in the conditional variance Var(y_it ∣ x_it). The classic model specification also does not work for ZIP, since the conditional mean of y_it given x_it in this case is

E (y_{it} ∣ x_{it}) = (1 - ρ_{it}) μ_{it}, 1 \leq t \leq m, 1 \leq i \leq n,

(7)

which in general does not provide sufficient information to identify β_u and β_v.
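A small numerical illustration (the parameter values are chosen only for this example) shows why the mean in (7) alone cannot separate ρ from μ, whereas adding the second moment can. It uses E(y²) = (1 − ρ)(μ + μ²), which follows directly from (5):

```latex
% Two ZIP parameter pairs with the same mean but different second moments.
\begin{align*}
(\rho, \mu) = (0.5,\, 4):&\quad E(y) = (1 - 0.5)\cdot 4 = 2, & E(y^{2}) &= (1 - 0.5)(4 + 4^{2}) = 10,\\
(\rho, \mu) = (0,\, 2):&\quad E(y) = (1 - 0)\cdot 2 = 2, & E(y^{2}) &= (1 - 0)(2 + 2^{2}) = 6.
\end{align*}
```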

To help distinguish among the three models, one can augment the GEE by including the distinct form of the conditional variance Var(y_it ∣ x_it) for each model and use the resulting GEE II for inference(23; 24; 28; 29). However, this approach is ad hoc in the sense that GEE II is a method of inference primarily used for improving efficiency over GEE, rather than a formal model akin to (6), since the added response (or dependent variable) Var(y_it ∣ x_it) is a function of parameters(25). In addition, it does not effectively address missing data. Another approach is to model the zero and positive outcomes separately using a truncated Poisson for the positive response and a logistic regression for the zero outcome(27). However, this approach is unable to identify the parameters for modeling the structural zeros, which is often of greater interest in practice. Next we utilize a new class of regression models to address the limitations of the aforementioned approaches.

2.2 Functional Response Models

Consider a class of distribution-free regression models defined by:

E [f (y_{i_{1}}, \dots, y_{i_{q}}) ∣ x_{i_{1}}, \dots, x_{i_{q}}] = h (x_{i_{1}}, \dots, x_{i_{q}}; θ), (i_{1}, \dots, i_{q}) \in C_{q}^{n}, 1 \leq q, 1 \leq i \leq n,

(8)

where y_i = (y_i₁, …, y_im)^⊤ denotes the vector of responses from the ith subject, f some vector-valued function, h(θ) some vector-valued smooth function (e.g. with continuous derivatives up to the second order), θ a vector of parameters of interest, q some positive integer, and $C_{q}^{n}$ the set of $\binom{n}{q}$ combinations of q distinct elements (i₁, …, i_q) from the integer set {1, …, n}. The functional response models (FRM) in (8) extend the single-subject response in the classic GLM to a function of responses from multiple subjects. For example, by setting q = 1, we immediately obtain from (8) the class of distribution-free GLMs for longitudinal data with m assessments. With FRM, we can express a broader class of problems under a regression-like framework(25; 36; 37; 38; 39; 40). Below, we focus on the application of FRM within our setting for modeling count responses.

Consider first the simpler cross-sectional study setting. For the cross-sectional parametric ZIP in (4), let

f_{i} = f (y_{i}) = (f_{1i}, f_{2i})^{⊤}, f_{1i} = y_{i}, f_{2i} = y_{i}^{2}, h_{i} = h (u_{i}, v_{i}) = (h_{1i}, h_{2i})^{⊤}, h_{1i} = (1 - ρ_{i}) μ_{i}, h_{2i} = μ_{i} (1 - ρ_{i}) (1 + μ_{i}), logit (ρ_{i}) = u_{i}^{⊤} β_{u}, log (μ_{i}) = v_{i}^{⊤} β_{v},

(9)

where u_i (v_i) denotes a subset of x_i. Under (4), E(f_i ∣ u_i, v_i) = h_i(u_i, v_i). For NB, f(y_i) is defined the same as for ZIP in (9), but with h_i = (h_{1i}, h_{2i})^⊤ modified as follows:

h_{1 i} = μ_{i}, h_{2 i} = μ_{i} + α μ_{i}^{2} + μ_{i}^{2}, log (μ_{i}) = x_{i}^{⊤} β .

(10)

As a special case with α = 0, the FRM for NB reduces to a distribution-free Poisson with $μ_{i} = exp (x_{i}^{⊤} β)$ . Note that under the FRM-based NB, we can allow α to be negative and thus α = 0 is no longer a boundary point. Thus, we can readily use the estimate of α to test the null H₀ : α = 0 to determine whether the Poisson loglinear model is appropriate.
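The mean functions in (9) and (10) can be sketched as follows. The function names `h_zip` and `h_nb`, the small design matrix and the coefficient values are hypothetical; the point is only to show how the first two moments are built from the linear predictors, and that a negative α is admissible once no NB distribution is assumed.

```python
import numpy as np

def h_zip(u, v, beta_u, beta_v):
    """Mean function h_i in (9): E(y_i) and E(y_i^2) under the ZIP moments."""
    rho = 1.0 / (1.0 + np.exp(-(u @ beta_u)))   # logit(rho_i) = u_i' beta_u
    mu = np.exp(v @ beta_v)                      # log(mu_i)   = v_i' beta_v
    return np.column_stack([(1 - rho) * mu, (1 - rho) * mu * (1 + mu)])

def h_nb(x, beta, alpha):
    """Mean function h_i in (10): E(y_i) and E(y_i^2) under the NB moments."""
    mu = np.exp(x @ beta)
    return np.column_stack([mu, mu + alpha * mu**2 + mu**2])

# Hypothetical design: an intercept and one covariate for three subjects.
x = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])
beta = np.array([0.2, 0.4])

h_poisson = h_nb(x, beta, alpha=0.0)       # alpha = 0 gives the Poisson moments
h_under = h_nb(x, beta, alpha=-0.05)       # a negative alpha is allowed here
implied_var = h_under[:, 1] - h_under[:, 0] ** 2   # equals mu * (1 + alpha * mu)
print(h_poisson)
print(implied_var)
```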

For longitudinal data, suppose that each subject is assessed m times, with y_it and x_it denoting the respective variables at time t (1 ≤ t ≤ m). Define the FRM-based ZIP model as follows:

f_{it} = f (y_{it}) = (f_{1it}, f_{2it})^{⊤} = (y_{it}, y_{it}^{2})^{⊤}, h_{it} = (h_{1it}, h_{2it})^{⊤}, h_{1it} = (1 - ρ_{it}) μ_{it}, h_{2it} = μ_{it} (1 - ρ_{it}) (1 + μ_{it}), logit (ρ_{it}) = u_{it}^{⊤} β_{u}, log (μ_{it}) = v_{it}^{⊤} β_{v}, 1 \leq t \leq m, 1 \leq i \leq n .

(11)

Likewise, we obtain a longitudinal version of FRM-based NB by defining the same f_it, but modifying h_it as follows:

h_{1 it} = μ_{it}, h_{2 it} = μ_{it} + α μ_{it}^{2} + μ_{it}^{2}, log (μ_{it}) = x_{it}^{⊤} β, 1 \leq t \leq m, 1 \leq i \leq n .

(12)

Note that we have assumed a constant α for NB, though the model above readily accommodates a time-varying α. In many studies, clusters causing overdispersion such as those formed by the subjects sampled from a common habitat may not change over time during the study, and this assumption is reasonable.

Both the ZIP and NB models for longitudinal data in (11) and (12) yield the same first- and second-order moments as their respective cross-sectional versions in (9)-(10) at each time t (1 ≤ t ≤ m). Thus, unlike their GLMM-based parametric counterparts, estimates from the FRM-based ZIP and NB models for longitudinal data can be readily compared to their corresponding cross-sectional versions. These distribution-free models are also called semiparametric or moment-based in the literature(41; 42). We refer to these as distribution-free models throughout the text unless otherwise stated.

3 Distribution-free Inference

We first discuss inference for cross-sectional data, and then extend the considerations to the longitudinal setting.

3.1 Distribution-free Inference for Cross-sectional Data

For the FRM-based ZIP model in (9), let $θ = {(β_{u}^{⊤}, β_{v}^{⊤})}^{⊤}$ and

D_{i} = \frac{\partial}{\partial θ} h_{i}, S_{i} = f_{i} - h_{i}, V_{i} = \begin{pmatrix} Var (f_{1i} ∣ u_{i}, v_{i}) & Cov (f_{1i}, f_{2i} ∣ u_{i}, v_{i}) \\ Cov (f_{1i}, f_{2i} ∣ u_{i}, v_{i}) & Var (f_{2i} ∣ u_{i}, v_{i}) \end{pmatrix} .

(13)

We estimate θ by the following set of generalized estimating equations,

w_{n} (θ) = \sum_{i = 1}^{n} D_{i} V_{i}^{- 1} S_{i} = 0 .

(14)

Given the ZIP model in (4), the elements of V_i in (13) are functions of the conditional moments of y_i given x_i up to the 4th order, which can be expressed in closed form (see Appendix A). Thus, the quantities D_i, V_i and S_i in (13) are readily evaluated. Note that (14) bears a close resemblance to the generalized estimating equations II (GEE II) for generalized linear models(25; 28; 29; 43).

By defining D_i, V_i and S_i the same way as in (13), but with θ = (β^⊤, α)^⊤ and h_i defined in (10), the GEE in (14) can be used to obtain estimates of θ for NB as well.

Under (9), the GEE estimate θˆ of θ obtained as the solution to (14) is consistent and asymptotically normal (see Theorem 1 below):

\sqrt{n} (\hat{θ} - θ) \to_{d} N (0, Σ_{θ}), B = E (D_{i} V_{i}^{- 1} D_{i}^{⊤}), Σ_{θ} = B^{- 1} E (D_{i} V_{i}^{- 1} S_{i} S_{i}^{⊤} V_{i}^{- 1} D_{i}^{⊤}) B^{- ⊤},

(15)

where →_d denotes convergence in distribution(25). Unlike the MLE, the asymptotic results above do not require that y_i (given u_i and v_i) follow the ZIP distribution in (4). If y_i does follow such a parametric model, Σ_θ in (15) simplifies to Σ_θ = B⁻¹, which is the model-based asymptotic variance.

A consistent estimate of Σ_θ is obtained by substituting moment estimates in place of the respective parameters:

\hat{Σ}_{θ} = \hat{B}^{- 1} (\frac{1}{n - 1} \sum_{i = 1}^{n} \hat{D}_{i} \hat{V}_{i}^{- 1} \hat{S}_{i} \hat{S}_{i}^{⊤} \hat{V}_{i}^{- 1} \hat{D}_{i}^{⊤}) \hat{B}^{- ⊤}, \hat{B} = \frac{1}{n - 1} \sum_{i = 1}^{n} \hat{D}_{i} \hat{V}_{i}^{- 1} \hat{D}_{i}^{⊤},

where $\hat{D}_{i}$, $\hat{S}_{i}$ and $\hat{V}_{i}$ denote the corresponding quantities with θ replaced by $\hat{θ}$. Our simulations indicate that the model-based asymptotic variance estimate $\hat{B}^{- 1}$ outperforms its sandwich alternative by yielding slightly more accurate type I error rates under the correct parametric model(44).
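The estimating equations (14) and the sandwich estimate above can be sketched for a simple cross-sectional ZIP model with an intercept-only logit for ρ and one covariate in log(μ). This is a minimal illustration under stated assumptions, not the authors' program: the function names, the simulated design, the moment-based starting values and the crude step damping are all choices made here. The fourth-order moments needed for V_i use the fact that the raw moments of the ZIP are (1 − ρ) times the Poisson raw moments.

```python
import numpy as np

def zip_moments(mu, rho):
    """Raw moments E(y^k), k = 1..4, of ZIP(mu, rho): (1 - rho) times the Poisson ones."""
    return [(1 - rho) * m for m in (mu,
                                    mu + mu**2,
                                    mu + 3 * mu**2 + mu**3,
                                    mu + 7 * mu**2 + 6 * mu**3 + mu**4)]

def gee_pieces(theta, y, x):
    """Return D_i (n x 3 x 2), V_i^{-1} (n x 2 x 2) and S_i (n x 2) from (13)."""
    rho = 1.0 / (1.0 + np.exp(-theta[0]))          # logit(rho) = beta_u0
    mu = np.exp(theta[1] + theta[2] * x)           # log(mu_i) = beta_0 + beta_1 x_i
    m1, m2, m3, m4 = zip_moments(mu, rho)
    S = np.column_stack([y - m1, y**2 - m2])
    # Derivatives of h_i = (m1, m2) with respect to (beta_u0, beta_0, beta_1).
    d_rho = -rho * np.column_stack([m1, m2])                    # d h / d beta_u0
    d_eta = (1 - rho) * np.column_stack([mu, mu + 2 * mu**2])   # d h / d log(mu)
    D = np.stack([d_rho, d_eta, d_eta * x[:, None]], axis=1)
    # V_i: covariance matrix of (y, y^2), inverted in closed form.
    v11, v12, v22 = m2 - m1**2, m3 - m1 * m2, m4 - m2**2
    det = v11 * v22 - v12**2
    Vinv = np.empty((len(y), 2, 2))
    Vinv[:, 0, 0], Vinv[:, 1, 1] = v22 / det, v11 / det
    Vinv[:, 0, 1] = Vinv[:, 1, 0] = -v12 / det
    return D, Vinv, S

def fit_gee_zip(y, x, n_iter=100, tol=1e-10):
    """Solve (14) by Fisher scoring; return theta-hat and its sandwich variance."""
    n = len(y)
    # Moment-based starting values (a convenience that presumes visible zero inflation).
    m1, m2 = y.mean(), (y**2).mean()
    mu0 = m2 / m1 - 1.0
    rho0 = 1.0 - m1 / mu0
    theta = np.array([np.log(rho0 / (1 - rho0)), np.log(mu0), 0.0])
    for _ in range(n_iter):
        D, Vinv, S = gee_pieces(theta, y, x)
        score = np.einsum("ipj,ijk,ik->p", D, Vinv, S)      # sum of D_i V_i^{-1} S_i
        B = np.einsum("ipj,ijk,iqk->pq", D, Vinv, D)        # sum of D_i V_i^{-1} D_i'
        step = np.clip(np.linalg.solve(B, score), -1.0, 1.0)   # crude damping
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    D, Vinv, S = gee_pieces(theta, y, x)
    U = np.einsum("ipj,ijk,ik->ip", D, Vinv, S)             # per-subject contributions
    Binv = np.linalg.inv(np.einsum("ipj,ijk,iqk->pq", D, Vinv, D) / (n - 1))
    sigma_hat = Binv @ (U.T @ U / (n - 1)) @ Binv.T
    return theta, sigma_hat / n                              # variance of theta-hat itself

# Simulated example with illustrative parameter values (assumptions).
rng = np.random.default_rng(0)
n, true_theta = 5000, np.array([-1.0, 1.0, 0.5])
x = rng.uniform(0.0, 1.0, size=n)
rho = 1.0 / (1.0 + np.exp(-true_theta[0]))
y = np.where(rng.random(n) < rho, 0, rng.poisson(np.exp(true_theta[1] + true_theta[2] * x)))
theta_hat, var_hat = fit_gee_zip(y, x)
print(theta_hat, np.sqrt(np.diag(var_hat)))
```

Replacing the sandwich middle term by the identity-weighted form, that is using the inverse of the averaged `B` alone, would give the model-based variance discussed in the text.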

3.2 Distribution-free Inference for Longitudinal Data

We begin with inference under complete data and then extend the discussion to include missing data.

3.2.1 Inference under Complete Data

Let

f_{i} = {(f_{i 1}^{⊤}, f_{i 2}^{⊤}, \dots, f_{im}^{⊤})}^{⊤}, h_{i} = {(h_{i 1}^{⊤}, h_{i 2}^{⊤}, \dots, h_{im}^{⊤})}^{⊤}, D_{i} = \frac{\partial}{\partial θ} h_{i}, S_{i} = f_{i} - h_{i},

(16)

where f_it and h_it are defined by (11) for the ZIP or by (12) for the NB model. We again apply the GEE in (14), but with D_i and S_i revised to reflect the changed dimension, and V_i modified to reflect the correlation between the f_it's over time:

V_{i} = A_{i}^{\frac{1}{2}} R (τ) A_{i}^{\frac{1}{2}}, A_{i} = {diag}_{t} (A_{it}), A_{it} = \begin{pmatrix} Var (f_{1it} ∣ x_{it}) & Cov (f_{1it}, f_{2it} ∣ x_{it}) \\ Cov (f_{1it}, f_{2it} ∣ x_{it}) & Var (f_{2it} ∣ x_{it}) \end{pmatrix}, 1 \leq t \leq m, 1 \leq i \leq n,

(17)

where R(τ) is a working correlation matrix among the components of f_i parameterized by τ. As in the cross-sectional data case, A_it is readily computed. For R(τ), the popular choices are the working independence model (R(τ) = I_{2m}) and the exchangeable correlation structure given by:

R (ρ) = \begin{pmatrix} I_{2} & ρ J_{2} & \cdots & ρ J_{2} \\ ρ J_{2} & I_{2} & \cdots & ρ J_{2} \\ \vdots & \vdots & \ddots & \vdots \\ ρ J_{2} & ρ J_{2} & \cdots & I_{2} \end{pmatrix}, I_{2} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, J_{2} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, 0 < ρ < 1.

Thus, τ is known for the working independence model, but unknown for the exchangeable correlation model with τ = ρ.
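The two working correlation choices can be assembled with Kronecker products, as in the following minimal sketch. The function name `exchangeable_R` and the values m = 3 and ρ = 0.3 are hypothetical and serve only to display the block layout.

```python
import numpy as np

def exchangeable_R(m, rho):
    """Working correlation R(rho): I_2 on the diagonal blocks, rho * J_2 elsewhere."""
    I2, J2 = np.eye(2), np.ones((2, 2))
    off_diag = np.ones((m, m)) - np.eye(m)         # 1 for s != t, 0 for s == t
    return np.kron(np.eye(m), I2) + rho * np.kron(off_diag, J2)

R_indep = np.eye(2 * 3)                 # working independence for m = 3
R_exch = exchangeable_R(m=3, rho=0.3)   # rho = 0.3 is an arbitrary illustration
print(R_exch)
```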

Note that since the GEE estimate may not be consistent under working correlation structures other than the independence model, especially in the presence of time-varying covariates(45), we focus on this model in what follows unless otherwise stated. With this choice of R(τ), the GEE is readily solved for θ. However, when the working correlation model used involves an unknown τ, an estimate must be substituted before the GEE is solved to obtain estimates of θ.

As in the cross-sectional data case, the GEE estimate has nice asymptotic properties summarized in Theorem 1 below. Since this is a special case of Theorem 2, its proof is omitted. Since Theorem 1 is stated for general working correlation models, it includes the condition for the estimate of τ to ensure such nice properties.

Theorem 1

Let θˆ denote the GEE estimate and let

B = E (D_{i} V_{i}^{- 1} D_{i}^{⊤}), \sum_{θ} = B^{- 1} E (D_{i} V_{i}^{- 1} S_{i} S_{i}^{⊤} V_{i}^{- 1} D_{i}^{⊤}) B^{- ⊤} .

(18)

Under mild regularity conditions, $\hat{θ}$ is consistent. Further, if $\hat{τ}$ is $\sqrt{n}$-consistent, i.e., $\sqrt{n} (\hat{τ} - τ)$ is bounded in probability(25), then $\hat{θ}$ is asymptotically normal with the asymptotic variance Σ_θ. A consistent estimate of Σ_θ is given by:

\hat{Σ}_{θ} = \hat{B}^{- 1} (\frac{1}{n} \sum_{i = 1}^{n} \hat{D}_{i} \hat{V}_{i}^{- 1} \hat{S}_{i} \hat{S}_{i}^{⊤} \hat{V}_{i}^{- 1} \hat{D}_{i}^{⊤}) \hat{B}^{- ⊤}, \hat{B} = \frac{1}{n} \sum_{i = 1}^{n} \hat{D}_{i} \hat{V}_{i}^{- 1} \hat{D}_{i}^{⊤},

(19)

where $\hat{D}_{i}$, $\hat{S}_{i}$ and $\hat{V}_{i}$ denote the corresponding quantities with θ replaced by $\hat{θ}$.

Note that given the limited choices for the working correlation matrix R(τ), $E (S_{i} S_{i}^{⊤} ∣ x_{i}) = V_{i}$ generally is not true in practice. Thus, unlike the cross-sectional data case, there is no model-based asymptotic variance.

3.2.2 Inference under Missing Data

Missing data arise frequently in real studies. For mean-based distribution-free models such as the GLM, the weighted generalized estimating equations (WGEE) approach is the most popular method for inference about model parameters. By integrating the inverse probability weighting (IPW) technique with the GEE, the WGEE ensures valid inference when the missing data follow the missing at random (MAR) model, a plausible and general missing data mechanism applicable to many studies in practice(25; 31; 41; 46; 47). We discuss below how to extend this IPW approach to the current FRM-based models for count responses.

Within the context of longitudinal data discussed in the preceding section, we define a missing (or rather observed) data indicator for each subject as follows:

r_{it} = \begin{cases} 1 & \text{if } y_{it} \text{ is observed}, \\ 0 & \text{if } y_{it} \text{ is missing}, \end{cases} r_{i} = (r_{i1}, \dots, r_{im})^{⊤}, 1 \leq i \leq n .

We assume no missing data at baseline t = 1 such that r_i₁ = 1 for all 1 ≤ i ≤ n. Let

π_{it} = Pr (r_{it} = 1 ∣ x_{i}, y_{i}), Δ_{it} = \frac{r_{it}}{π_{it}}, Δ_{i} = {diag}_{t} (Δ_{it}) .

(20)

In most applications, the weight function π_it is unknown and must be estimated. Under MCAR, r_i is independent of x_i and y_i and thus π_it = Pr (r_it = 1) = π_t. In this case, π_t is a constant independent of x_i and y_i and is readily estimated by the sample moment: ${\hat{π}}_{t} = \frac{1}{n} \sum_{i = 1}^{n} r_{it} (2 \leq t \leq m)$ .

Under MAR, π_it becomes dependent on the observed x_i and y_i, making it difficult to model and estimate π_it without imposing the monotone missing data pattern (MMDP) assumption because of the large number of possible missing data patterns(25; 37; 41). Under MMDP, y_it (x_it) is observed only if all y_is (x_is) prior to time t are observed (1 ≤ s < t ≤ m).

Let

H_{it} = \{ \tilde{x}_{it}, \tilde{y}_{it} \}, \tilde{x}_{it} = (x_{i1}^{⊤}, \dots, x_{i (t - 1)}^{⊤})^{⊤}, \tilde{y}_{it} = (y_{i1}, \dots, y_{i (t - 1)})^{⊤}, 2 \leq t \leq m,

where x̃_it and ỹ_it contain the explanatory and response variables prior to time t, respectively. Under MAR we have:

π_{it} = Pr (r_{it} = 1 ∣ x_{i}, y_{i}) = Pr (r_{it} = 1 ∣ H_{it}), 2 \leq t \leq m .

Let $p_{it} = Pr (r_{it} = 1 ∣ r_{i (t - 1)} = 1, H_{it})$, the one-step transition probability for observing the response from time t − 1 to t. We can model p_it using logistic regression:

logit (p_{it} (γ_{t})) = γ_{0 t} + γ_{x t}^{⊤} {\tilde{x}}_{it} + γ_{y t}^{⊤} {\tilde{y}}_{it}, 2 \leq t \leq m,

(21)

where $γ_{t} = {(γ_{0 t}, γ_{x t}^{⊤}, γ_{y t}^{⊤})}^{⊤}$ . Let $γ = {(γ_{2}^{⊤}, \dots, γ_{m}^{⊤})}^{⊤}$ . Then, under MMDP,

π_{it} (γ) = p_{it} Pr (r_{i (t - 1)} = 1 ∣ H_{i (t - 1)}) = \prod_{s = 2}^{t} p_{i s} (γ_{s}), 2 \leq t \leq m, 1 \leq i \leq n .

The above provides a relationship to estimate π_it from the model for p_it in (21).
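The product relationship above translates directly into a cumulative product over assessments. The following minimal sketch computes π_it and the weights Δ_it = r_it / π_it from a matrix of transition probabilities; the function name `ipw_weights` and the numbers are hypothetical, and in practice the p_it would come from a fitted model such as (21).

```python
import numpy as np

def ipw_weights(r, p):
    """Return pi_it and Delta_it = r_it / pi_it under a monotone missing pattern.

    r : (n, m) array of observed-data indicators, with r[:, 0] = 1.
    p : (n, m) array of one-step transition probabilities p_it, with p[:, 0] = 1
        because the baseline assessment is always observed.
    """
    pi = np.cumprod(p, axis=1)          # pi_it = p_i2 * ... * p_it
    return pi, r / pi

# Hypothetical inputs for two subjects and m = 3 assessments.
r = np.array([[1, 1, 1],
              [1, 1, 0]])               # the second subject drops out before t = 3
p = np.array([[1.0, 0.9, 0.8],
              [1.0, 0.6, 0.5]])
pi, delta = ipw_weights(r, p)
print(pi)      # second column equals p_i2, third equals p_i2 * p_i3
print(delta)   # missing assessments receive weight 0
```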

We may estimate γ using the following estimating equations:

Q_{n} (γ) = \sum_{i = 1}^{n} (Q_{i2}^{⊤}, \dots, Q_{im}^{⊤})^{⊤} = 0, Q_{it} = \frac{\partial}{\partial γ_{t}} \{ r_{i (t - 1)} [r_{it} log (p_{it}) + (1 - r_{it}) log (1 - p_{it})] \}, 2 \leq t \leq m, 1 \leq i \leq n .

(22)
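For a single time t, solving the estimating equations for γ_t amounts to a logistic regression of r_it on the observed history among subjects still under observation at t − 1, since the score of the Bernoulli log-likelihood for such a subject is (r_it − p_it) times the covariate vector. The sketch below uses Newton-Raphson; the function name, the simulated history (an intercept and one earlier count) and the parameter values are assumptions made for illustration.

```python
import numpy as np

def fit_transition_model(r_prev, r_curr, Z, n_iter=25):
    """Newton-Raphson fit of logit(p_it) = Z gamma_t among subjects with r_{i(t-1)} = 1."""
    at_risk = r_prev == 1
    Z, r = Z[at_risk], r_curr[at_risk].astype(float)
    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ gamma)))
        score = Z.T @ (r - p)                       # summed score contributions
        info = Z.T @ (Z * (p * (1 - p))[:, None])   # minus the derivative of the score
        gamma = gamma + np.linalg.solve(info, score)
    return gamma

# Simulated example (illustrative values, not from the paper).
rng = np.random.default_rng(1)
n, gamma_true = 4000, np.array([1.0, -0.3])
y1 = rng.poisson(2.0, size=n)                       # an earlier response in the history
Z = np.column_stack([np.ones(n), y1])
r_prev = (rng.random(n) < 0.9).astype(int)          # observed at the previous assessment
p_true = 1.0 / (1.0 + np.exp(-(Z @ gamma_true)))
r_curr = r_prev * (rng.random(n) < p_true)          # monotone pattern: dropouts stay out
gamma_hat = fit_transition_model(r_prev, r_curr, Z)
print(gamma_hat)
```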

With estimated π_it, we can estimate θ by generalizing the WGEE for mean-based response models to a WGEE for the current context as follows:

w_{n} (θ) = \sum_{i = 1}^{n} w_{ni} = \sum_{i = 1}^{n} D_{i} V_{i}^{- 1} \hat{Δ}_{i} S_{i} = \sum_{i = 1}^{n} D_{i} V_{i}^{- 1} \hat{Δ}_{i} (f_{i} - h_{i}) = 0,

(23)

where D_i, V_i and S_i are defined the same as in the GEE in the complete data case, and $\hat{Δ}_{i}$ denotes Δ_i in (20) with estimated π_it. Also, as in the complete data case, V_i may be a function of τ if working dependence correlation models are used, in which case τ must be replaced with an estimate before (23) is used for inference about θ.

The WGEE estimate θˆ has nice asymptotic properties, as summarized by the theorem below (see Appendix B for a proof).

Theorem 2

Let θˆ denote the WGEE II estimate. Under mild regularity conditions,

θˆ is consistent.
If $\hat{τ}$ is $\sqrt{n}$-consistent, $\hat{θ}$ is asymptotically normal with asymptotic variance given by:

Σ_{θ} = B^{- 1} (Σ_{U} + Φ) B^{- ⊤}, Σ_{U} = E (D_{i} V_{i}^{- 1} Δ_{i} S_{i} S_{i}^{⊤} Δ_{i} V_{i}^{- 1} D_{i}^{⊤}), B = E (D_{i} V_{i}^{- 1} Δ_{i} D_{i}^{⊤}), G = E (D_{i} V_{i}^{- 1} Δ_{i} S_{i} Q_{ni}^{⊤} H^{- ⊤} C^{⊤}), C = E {[\frac{\partial}{\partial γ} (D_{i} V_{i}^{- 1} Δ_{i} S_{i})]}^{⊤}, H = E {(\frac{\partial}{\partial γ} Q_{ni})}^{⊤}, Φ = C H^{- ⊤} C^{⊤} - G - G^{⊤} .

(24)

A consistent estimate of Σ_θ is given by:

\hat{Σ}_{θ} = \hat{B}^{- 1} (\hat{Σ}_{U} + \hat{Φ}) \hat{B}^{- ⊤}, \hat{Σ}_{U} = \frac{1}{n} \sum_{i = 1}^{n} \hat{D}_{i} \hat{V}_{i}^{- 1} \hat{Δ}_{i} \hat{S}_{i} \hat{S}_{i}^{⊤} \hat{Δ}_{i} \hat{V}_{i}^{- 1} \hat{D}_{i}^{⊤}, \hat{B} = \frac{1}{n} \sum_{i = 1}^{n} \hat{D}_{i} \hat{V}_{i}^{- 1} \hat{Δ}_{i} \hat{D}_{i}^{⊤}, \hat{G} = (\frac{1}{n} \sum_{i = 1}^{n} \hat{D}_{i} \hat{V}_{i}^{- 1} \hat{Δ}_{i} \hat{S}_{i} \hat{Q}_{ni}^{⊤}) \hat{H}^{- ⊤} \hat{C}^{⊤}, \hat{Φ} = \hat{C} \hat{H}^{- ⊤} \hat{C}^{⊤} - \hat{G} - \hat{G}^{⊤},

where $\hat{C}$ and $\hat{H}$ denote the sample analogues of C and H in (24).

Note that the asymptotic variance in (24) contains a correction term B⁻¹ΦB^−⊤ to account for the sampling variability in the estimated γˆ.

3.2.3 Score Test

As Wald-type tests are typically anti-conservative(21; 48; 49), score statistics are often used as an alternative to reduce bias, especially in type I error rates for small to moderate samples. Within the current context, let $θ = {(θ_{(1)}^{⊤}, θ_{(2)}^{⊤})}^{⊤}$ , with p and q denoting the dimension of θ₍₁₎ and θ₍₂₎, respectively. Consider testing the null H₀ : θ₍₂₎ = θ₍₂₀₎, with θ₍₂₀₎ denoting a vector of known constants.

Under H₀ : θ₍₂₎ = θ₍₂₀₎,

D_{i} = \begin{pmatrix} \frac{\partial h_{i} (θ)}{\partial θ_{(1)}} \\ \frac{\partial h_{i} (θ)}{\partial θ_{(2)}} \end{pmatrix} = \begin{pmatrix} D_{i (1)} \\ D_{i (2)} \end{pmatrix}, D_{i} V_{i}^{- 1} \hat{Δ}_{i} S_{i} = \begin{pmatrix} D_{i (1)} V_{i}^{- 1} \hat{Δ}_{i} S_{i} \\ D_{i (2)} V_{i}^{- 1} \hat{Δ}_{i} S_{i} \end{pmatrix}, w_{n} (θ) = \begin{pmatrix} w_{n (1)} (θ) \\ w_{n (2)} (θ) \end{pmatrix} = \frac{1}{n} \sum_{i = 1}^{n} D_{i} V_{i}^{- 1} \hat{Δ}_{i} S_{i} = \frac{1}{n} \begin{pmatrix} \sum_{i = 1}^{n} D_{i (1)} V_{i}^{- 1} \hat{Δ}_{i} S_{i} \\ \sum_{i = 1}^{n} D_{i (2)} V_{i}^{- 1} \hat{Δ}_{i} S_{i} \end{pmatrix} .

(26)

Let θ̃₍₁₎ denote the estimate from solving the reduced WGEE:

w_{n (1)} (θ_{(1)}, θ_{(20)}) = \frac{1}{n} \sum_{i = 1}^{n} D_{i (1)} V_{i}^{- 1} {\hat{Δ}}_{i} S_{i} = 0 .

(27)

Set

\tilde{θ} = \begin{pmatrix} \tilde{θ}_{(1)} \\ θ_{(20)} \end{pmatrix}, B = E (D_{i} V_{i}^{- 1} Δ_{i} D_{i}^{⊤}) = \begin{pmatrix} B_{11} & B_{12} \\ B_{12}^{⊤} & B_{22} \end{pmatrix}, G = (- B_{12}^{⊤} B_{11}^{- 1}, I_{q}),

(28)

where q is the dimension of w_n₍₂₎, B₁₁ denotes the p × p submatrix, B₁₂ the p × q submatrix, and B₂₂ the q × q submatrix from the partitioned (p + q) × (p + q) matrix B. Then, under H₀ : θ₍₂₎ = θ₍₂₀₎, the following score statistic has an asymptotic (central) $χ_{q}^{2}$ distribution with q degrees of freedom (see Appendix C for a proof):

T_{s} (\tilde{θ}_{(1)}, θ_{(20)}) = n \tilde{w}_{n (2)}^{⊤} (\tilde{θ}_{(1)}, θ_{(20)}) \tilde{Σ}_{(2)}^{- 1} (\tilde{θ}_{(1)}, θ_{(20)}) \tilde{w}_{n (2)} (\tilde{θ}_{(1)}, θ_{(20)}) \to_{d} χ_{q}^{2},

(29)

where Σ̃₍₂₎ = G̃Σ̃_θG̃^⊤ with G̃ and Σ̃_θ denoting the corresponding quantities with θ replaced by θ̃.

4 Applications

We first investigate the performance of the approach with small to moderate sample sizes by simulation and then present a real data application. In all the examples, we set the statistical significance level at α = 0.05.

4.1 Simulation Study

For space considerations, we only report results from the ZIP model for longitudinal data with sample size n = 50, 100 and 200. All simulations were performed with a Monte Carlo sample of 1,000. We start with data simulations under complete data.

4.1.1 Complete Data Case

For notational brevity, we considered a relatively simple pre-post longitudinal study design, with only one explanatory variable x_i following a normal distribution N(1,1), and simulated the bivariate count response, y_i = (y_i₁, y_i₂)^⊤, to satisfy the following marginal ZIP model:

y_{it} ∣ x_{i} ~ ZIP (μ_{i}, ρ_{i}), logit (ρ_{i}) = β_{u 0}, log (μ_{i}) = β_{0} + x_{i} β_{1}, 1 \leq t \leq 2, 1 \leq i \leq n .

(30)

We set β_u₀ = −1, β₀ = β₁ = 1. We first simulated x_i from N(1, 1), and then conditional on x_i, generated y_it by using a copula approach(50; 51; 52). The copula method can generate correlated multivariate responses for any specified marginal distribution and correlation structure. For our simulation study, we set Corr(y_i₁, y_i₂ ∣ x_i) = 0.5.
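A Gaussian copula is one way to generate such correlated ZIP responses; the text does not say which copula was used, so the sketch below is an assumption. It draws correlated normal variates, maps them to uniforms, and inverts the ZIP distribution function. For simplicity it holds μ and ρ fixed (that is, it conditions on a single covariate value), and it specifies the correlation of the latent normals, which is not the same as the correlation of the resulting counts.

```python
import math
import numpy as np

def zip_cdf_table(mu, rho, k_max=200):
    """ZIP(mu, rho) cumulative probabilities F(0), ..., F(k_max)."""
    pmf = np.empty(k_max + 1)
    pmf[0] = math.exp(-mu)
    for k in range(1, k_max + 1):
        pmf[k] = pmf[k - 1] * mu / k               # Poisson recurrence
    return rho + (1.0 - rho) * np.cumsum(pmf)

def simulate_zip_pair(n, mu, rho, latent_corr, rng):
    """Draw n pairs (y_1, y_2) with ZIP(mu, rho) margins via a Gaussian copula."""
    cov = np.array([[1.0, latent_corr], [latent_corr, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))   # standard normal CDF
    # Inverse-CDF transform: smallest k with F(k) >= u.
    return np.searchsorted(zip_cdf_table(mu, rho), u)

rng = np.random.default_rng(3)
y_pair = simulate_zip_pair(20_000, mu=3.0, rho=0.27, latent_corr=0.6, rng=rng)
print("marginal means:", y_pair.mean(axis=0))        # target (1 - rho) * mu = 2.19
print("count correlation:", np.corrcoef(y_pair.T)[0, 1])
```

To hit a target response correlation such as 0.5, the latent correlation would have to be tuned, for example by a one-dimensional search over `latent_corr`.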

To examine type I error rates, we considered the null, H₀ : β₁ = 1, and computed the Wald statistic, $Q_{n} = n {({\hat{σ}}_{β 1}^{2})}^{- 1} {({\hat{β}}_{1} - 1)}^{2}$ , where ${\hat{σ}}_{β 1}^{2}$ denotes the element of the estimated asymptotic variance $\hat{Σ}_{θ}$ corresponding to $\hat{β}_{1}$. Let $Q_{n}^{(k)}$ denote this statistic at the kth MC simulation (1 ≤ k ≤ 1000). The type I error rate for testing H₀ was estimated by: $\hat{α} = \frac{1}{1000} \sum_{k = 1}^{1000} I \{ Q_{n}^{(k)} \geq q_{1, 0.95} \}$ , with I{·} denoting the indicator function and q_1,0.95 the 95th percentile of a central $χ_{1}^{2}$ distribution.

Since Wald statistics are often anti-conservative, we also applied the score test in Section 3.2. Let $θ = {(θ_{(1)}^{⊤}, θ_{(2)})}^{⊤}$ , where θ₍₁₎ = (β_u₀, β₀)^⊤ and θ₍₂₎ = β₁. Under H₀, θ₍₂₎ = 1, the score statistic T_s (θ̃₍₁₎, 1) in (29) has an asymptotic $χ_{1}^{2}$ distribution. The type I error rate for testing H₀ was again estimated by: $\hat{α} = \frac{1}{1000} \sum_{k = 1}^{1000} I_{{T_{s}^{(k)} \geq q_{1, 0.95}}}$ , where $T_{s}^{(k)}$ denotes this statistic at the kth MC simulation (1 ≤ k ≤ 1000).

Shown in Table 1 are the estimates of θ, standard errors, and type I errors for the ZIP model in (30). For comparison purposes, we also included “Empirical” variance estimates and type I error rates based on such a variance estimate. The “Empirical” type I error rates were computed based on substituting Σ_θ with the Empirical variance estimate in the Wald test statistic. It is seen that type I error rates were a bit inflated for sample sizes 50 and 100 under the Wald test, but were closer to the nominal 0.05 under the “Score” and “Empirical” tests even for samples as small as n = 50.

Table 1.

GEE estimates of parameters, standard errors, and type I error rates based on Wald and score tests, along with empirical standard errors and type I error rates for ZIP under complete data from 1,000 MC simulations.

Simulation summary for ZIP under complete data
β_u₀ = −1, β₀ = 1, β₁ = 1
Sample size	Parameter	Mean	SE (WGEE)	SE (Empirical)	Type I error, Wald (WGEE)	Type I error, Wald (Empirical)	Type I error, Score
50	β_u₀	−1.052	0.363	0.385
50	β₀	1.000	0.090	0.100
50	β₁	0.998	0.039	0.048	0.095	0.061	0.045
100	β_u₀	−1.021	0.252	0.256
100	β₀	1.000	0.063	0.067
100	β₁	0.999	0.027	0.031	0.076	0.052	0.054
200	β_u₀	−1.012	0.177	0.176
200	β₀	0.999	0.044	0.046
200	β₁	1.000	0.019	0.021	0.065	0.042	0.042


To compare our approach with GEE II, we also estimated the parameters using a program developed for such an alternative by Hall and Zhang (2004)(24). As noted earlier, their method modeled the conditional variance, rather than the second moment. In addition, they assumed working independence between the mean and variance. We obtained quite similar results (not shown), which may not be surprising, as such differences are likely to have only a minor impact on inference given the marginal ZIP model in (30).

4.1.2 Missing Data Case

Assuming no missing data at baseline t = 1, we simulated missing responses under MCAR and MAR with about 20% missing data at t = 2. By applying the discussion in Section 3.2 to the context of the pre-post design, we modeled the missingness at time t = 2 under MAR by:

logit (π_{i 2} (γ)) = γ_{0} + γ_{x}^{⊤} x_{i 1} + γ_{y}^{⊤} y_{i 1}, γ_{x} = γ_{y} = \frac{1}{2} .

We again considered the null H₀: β₁ = 1, and computed the Wald and score statistics and the associated type I error rates. The Wald statistic Q_n is computed the same way as in the complete data case except that the estimate of θ is obtained from the WGEE in (23).

Shown in Tables 2 and 3 are the estimates of θ, standard errors, and type I errors for the ZIP model under MCAR and MAR, respectively. As in the complete data case, under MCAR the score test corrected the upward bias in the type I error rates of the Wald statistic for testing the null H₀: β₁ = 1, especially for the sample sizes n = 50 and 100. Under MAR, the Wald statistic again yielded inflated type I error rates for testing the null, whereas the score test corrected the upward bias and maintained a type I error rate consistently near 0.05 across all sample sizes.

Table 2.

WGEE estimates of parameters, standard errors, and type I error rates based on Wald and Score tests, along with empirical standard errors and type I error rates for ZIP under MCAR from 1,000 MC simulations.

Simulation summary for ZIP under missing data following MCAR
β_u₀ = −1, β₀ = 1, β₁ = 1
Sample size	Parameter	Mean	SE (GEE)	SE (Empirical)	Type I error, Wald (GEE)	Type I error, Wald (Empirical)	Type I error, Score
50	β_u₀	−1.077	0.378	0.402
50	β₀	0.991	0.112	0.120
50	β₁	0.997	0.115	0.135	0.108	0.061	0.046
100	β_u₀	−1.026	0.257	0.258
100	β₀	0.997	0.080	0.082
100	β₁	0.998	0.082	0.088	0.075	0.057	0.044
200	β_u₀	−1.016	0.180	0.183
200	β₀	0.998	0.057	0.055
200	β₁	1.000	0.059	0.060	0.055	0.049	0.045


4.2 Real Study Data

To illustrate the approach with real study data, we applied it to a multi-center, NIDA-sponsored study entitled “HIV/STD Safer Sex Skills Groups For Men In Methadone Maintenance Or Drug-free Outpatient Treatment Programs,” known as CTN0018 within the Clinical Trials Network (CTN) studies. This study was designed to examine the effectiveness of a 5-session motivational and skills training HIV/AIDS group intervention developed to reduce sexual risk behaviors in men, as compared to an HIV education only control condition. Unlike most community-based studies in which the HIV education provided is limited to information, this trial integrated a skills-training component, such as role plays, to reduce sexual risk behaviors. The primary outcome of the study is the number of unprotected vaginal and anal sexual intercourse occasions (USO), which was assessed at baseline, 2 weeks, 3- and 6-months(53; 54).

Out of 573 eligible subjects screened, 422 subjects completed assessment at baseline. Among these, 381 (91.27%) and 345 (60.2%) came for assessment at 3- and 6-months, respectively. Since 2 weeks was too short to observe a reasonably large USO, we limited our analysis to the period from baseline to the 3- and 6-month follow-up visits.

Shown in Table 4 are the mean USOs and percent of zero USO at baseline, 3- and 6-months for the two treatment groups. It is evident that there was a preponderance of zeros in the distribution of this outcome at each assessment time. Accordingly, we modeled the USO at 3-month (y_i₁) and 6-month (y_i₂) as a function of treatment condition, time and time by treatment interaction, controlling for baseline USO, y_i₀, using the FRM-based ZIP model in (11) with

logit (ρ_{it}) = β_{u 0} + β_{u 1} x_{i} + β_{u 2} y_{i 0} + β_{u 3} t + β_{u 4} t \cdot x_{i}, log (μ_{it}) = β_{0} + β_{1} x_{i} + β_{2} y_{i 0} + β_{3} t + β_{4} t \cdot x_{i}, 1 \leq t \leq 2, 1 \leq i \leq n,

(31)

where x_i was an indicator with x_i = 1 for the intervention and 0 otherwise.

Table 4.

Comparison of mean USO and percent of zero USO between the two treatment groups at baseline, 3- and 6-month follow-up for the CTN0018 Study.

Mean USO and number of zeros at each assessment time for CTN0018 study
Assessment	Intervention, mean (S.D.)	Without intervention, mean (S.D.)	Zeros, n (%)
Baseline	21.46 (26.66)	22.34 (27.77)	65 (15.40)
USO at 3 months	15.71 (25.43)	18.14 (27.21)	125 (32.80)
USO at 6 months	15.05 (23.35)	17.19 (25.89)	132 (38.26)

To account for potential response-dependent MAR missingness, we modeled the missingness under MMDP using logistic regression:

logit (p_{it} (γ_{t})) = γ_{0 t} + γ_{x t} x_{i} + γ_{y t} y_{i (t - 1)}, 1 \leq t \leq 2, 1 \leq i \leq n .

(32)

We assumed a Markov condition in (32) so that the missingness only depended on the most recent observed response.
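To make the role of (32) concrete, the sketch below turns one-step probabilities p_it into inverse probability weights r_it/π_it. It assumes the usual monotone-dropout construction π_it = p_i1 × … × p_it, and it plugs in the point estimates from Table 5, reading the response row at each time as the most recent observed response y_i(t−1) as in (32). The function `mmdp_weights` and the two example subjects are hypothetical.

```python
import numpy as np


def expit(z):
    return 1.0 / (1.0 + np.exp(-z))


def mmdp_weights(x, y_prev, r, gamma):
    """
    Inverse probability weights r_it / pi_it under a monotone missing data pattern.

    x      : (n,) treatment indicator
    y_prev : (n, m) column t holds y_{i(t-1)}; NaN where that response is unobserved
    r      : (n, m) observation indicators, monotone in t
    gamma  : (m, 3) rows (gamma_0t, gamma_xt, gamma_yt) from model (32)
    """
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    # Unobserved predictors only occur after dropout, where r_it = 0 and the weight is 0 anyway.
    y_prev = np.nan_to_num(np.asarray(y_prev, dtype=float), nan=0.0)
    eta = gamma[:, 0] + np.outer(x, gamma[:, 1]) + y_prev * gamma[:, 2]
    p = expit(eta)                    # one-step probabilities p_it
    pi = np.cumprod(p, axis=1)        # assumed: pi_it = p_i1 * ... * p_it
    return r / pi


# Table 5 point estimates, ordered as (intercept, intervention, response).
gamma_table5 = [[2.777, -0.869, -0.002],
                [1.443, -0.325, 0.019]]

# Two hypothetical subjects: a completer in the intervention arm and a control-arm dropout.
x = [1, 0]
y_prev = [[20.0, 10.0], [5.0, np.nan]]
r = [[1, 1], [0, 0]]
w = mmdp_weights(x, y_prev, r, gamma_table5)
```

Subjects who remain in the study receive weights larger than one, so that they also represent comparable subjects who dropped out.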

Shown in Table 5 are the estimates of parameters from the logistic regression, their standard errors and corresponding p-values. The results show that the missingness at time t = 1 depended on the treatment assignment, while the missingness at time t = 2 depended on the observed response at time t = 1. In other words, the subjects in the intervention group were more likely to drop out than those in the control group at time t = 1, while those with smaller values of USO at t = 1 were more likely to drop out at t = 2. Based on these results, we proceeded with inference under MAR.

Table 5.

Estimates of logistic regression for modeling missingness under MAR and MMDP for CTN0018 Study.

Estimates of logistic regression for modeling missingness for CTN0018 study
Assessment time	Predictor	Estimate	Standard error	P-value
t = 1	Intercept	2.777	0.319	< 0.001
t = 1	y_i₁	−0.002	0.006	0.752
t = 1	intervention	−0.869	0.351	0.013
t = 2	Intercept	1.443	0.206	< 0.001
t = 2	y_i₂	0.019	0.007	0.007
t = 2	intervention	−0.325	0.257	0.206


Shown in Table 6 are the estimates of parameters of the ZIP model, their standard errors and associated p-values. As the interaction term involving time and intervention was significant in neither the logistic (ρ_it) nor the Poisson (μ_it) component of the model, we refit the model without this term, with the results from the revised model shown in Table 7.

Table 6.

WGEE estimates of parameters, standard errors, and p-values from FRM-based ZIP model with treatment by time interaction based on Wald and score tests under MAR and MMDP for the CTN0018 Study.

Results of FRM-based ZIP model for CTN0018 study
Model part	Parameter	Estimate	Standard error	Wald p-value (H₀ : β = 0)	Score p-value (H₀ : β = 0)
Log-linear (μ_it)	β₀	2.69	0.196	< 0.001	< 0.001
Log-linear (μ_it)	β₁ (intervention)	−0.08	0.028	< 0.001	< 0.001
Log-linear (μ_it)	β₂ (baseline USO)	0.012	0.001	< 0.001	< 0.001
Log-linear (μ_it)	β₃ (time)	−0.017	0.118	0.885	0.883
Log-linear (μ_it)	β₄ (intervention*time)	−0.062	0.187	0.742	0.741
Logistic (ρ_it)	β_u₀	−0.52	0.354	0.142	0.140
Logistic (ρ_it)	β_u₁ (intervention)	0.301	0.499	0.564	0.562
Logistic (ρ_it)	β_u₂ (baseline USO)	−0.017	0.004	< 0.001	< 0.001
Logistic (ρ_it)	β_u₃ (time)	0.126	0.221	0.568	0.566
Logistic (ρ_it)	β_u₄ (intervention*time)	−0.121	0.314	0.701	0.700


Table 7.

WGEE estimates of parameters, standard errors, and p-values from revised additive ZIP model based on Wald and score tests under MAR and MMDP for the CTN0018 Study.

Results from revised additive ZIP model for CTN0018 study
Model part	Parameter	Estimate	Standard error	Wald p-value (H₀ : β = 0)	Score p-value (H₀ : β = 0)
Log-linear (μ_it)	β₀	2.90	0.021	< 0.001	< 0.001
Log-linear (μ_it)	β₁ (intervention)	−0.09	0.025	< 0.001	< 0.001
Log-linear (μ_it)	β₂ (baseline USO)	0.012	0.0004	< 0.001	< 0.001
Logistic (ρ_it)	β_u₀	−0.68	0.144	< 0.001	< 0.001
Logistic (ρ_it)	β_u₁ (intervention)	0.371	0.200	0.065	0.068
Logistic (ρ_it)	β_u₂ (baseline USO)	−0.015	0.004	< 0.001	< 0.001


For treatment effectiveness based on the results from the additive model, the logistic part of the model indicates that the intervention increased the likelihood of no risk for USO during the study, while the log-linear component shows that the intervention also significantly reduced the mean frequency of USO for the at-risk subgroup. The ratio of the mean USO of the treated to that of the control condition is exp(−0.09) ≈ 0.91, suggesting a decrease of about 9% in USO for the treated subjects.
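The effect sizes quoted from Table 7 follow from exponentiating the coefficients. The short Python calculation below is only an arithmetic illustration; the variable names are hypothetical, and the odds ratio for the logistic part is a derived quantity not reported in the tables.

```python
import math

beta1 = -0.09      # intervention effect in the log-linear part (Table 7)
beta_u1 = 0.371    # intervention effect in the logistic part (Table 7)

mean_ratio = math.exp(beta1)                # ratio of mean USO, treated vs. control, at-risk subgroup
pct_change = 100.0 * (mean_ratio - 1.0)     # percent change in mean USO
odds_ratio = math.exp(beta_u1)              # odds ratio of being a structural zero, treated vs. control

print(f"mean ratio = {mean_ratio:.3f}, percent change = {pct_change:.1f}%")
print(f"odds ratio for structural zero = {odds_ratio:.3f}")
```

The mean ratio evaluates to about 0.914, a reduction of a little under 9%, and the odds of being in the not-at-risk group are about 1.45 times higher under the intervention.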

Baseline USO also played a significant role. The logistic component indicates that lower baseline USO would significantly increase the likelihood of being at no risk for USO during the study period. The log-linear part of the model shows that higher baseline USO was significantly associated with higher USO during the study. The findings suggest that substance abuse treatment programs should consider offering motivational exercises and skills training to achieve greater reductions in risky sexual activities.

5 Discussion

Count responses are a common type of outcome in biomedical, psychosocial and related services research. We discussed two major manifestations of departure from the Poisson assumption, overdispersion and structural zeros, and reviewed existing methods for addressing these two important issues. In particular, we focused on the limitations of available approaches with respect to longitudinal data analysis and proposed an approach to systematically tackle these problems under a unified modeling framework.

We applied the proposed approach to a real study in HIV prevention, allowing us to address important methodological issues in a timely application. In addition, the results from the simulation study show that the proposed approach works well for longitudinal study data under both complete and missing data settings. Although inference is derived based on large samples, the approach seems to provide valid inference for sample sizes as small as 50.

Table 3.

WGEE estimates of parameters, standard errors, and type I error rates based on Wald and Score tests, along with empirical standard errors and type I error rates for ZIP under MAR from 1,000 MC simulations.

Simulation summary for ZIP under missing data following MAR
β_u₀ = −1, β₀ = 1, β₁ = 1
Sample size	Parameter	Mean	SE (WGEE)	SE (Empirical)	Type I error, Wald (WGEE)	Type I error, Wald (Empirical)	Type I error, Score
50	β_u₀	−1.05	0.402	0.400
50	β₀	0.995	0.128	0.105
50	β₁	1.000	0.168	0.151	0.094	0.062	0.052
100	β_u₀	−1.02	0.253	0.261
100	β₀	1.001	0.064	0.066
100	β₁	0.998	0.088	0.080	0.087	0.058	0.043
200	β_u₀	−1.01	0.176	0.177
200	β₀	0.999	0.044	0.044
200	β₁	1.000	0.066	0.060	0.055	0.056	0.051


Acknowledgments

This research was supported in part by NIH grant R21 DA027521-01. We want to thank two anonymous reviewers for very careful reviews of the manuscript, with constructive comments and edits that led to a significantly improved manuscript.

Appendix.

A

The variance Var(f_i ∣ x_i) for the cross-sectional data case is readily computed using the moments up to the 4th order under either the ZIP or the NB distribution. The first- and second-order moments for ZIP and NB are given in (9) and (10), while the 3rd and 4th order moments for the two models are given by:

\begin{matrix} ZIP : & E (y^{3} ∣ x) = (1 - ρ) μ (1 + 3 μ + μ^{2}), & E (y^{4} ∣ x) = (1 - ρ) μ (1 + 7 μ + 6 μ^{2} + μ^{3}), \\ NB : & E (y^{3} ∣ x) = μ + 3 (α + 1) μ^{2} + (α + 1) (2 α + 1) μ^{3}, & E (y^{4} ∣ x) = μ + 7 (α + 1) μ^{2} + 6 (α + 1) (2 α + 1) μ^{3} + (α + 1) (2 α + 1) (3 α + 1) μ^{4} . \end{matrix}

(33)
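The moment expressions in (33) can be checked numerically by summing y^k against the probability mass function. In the sketch below, the expressions involving α are compared with a negative binomial distribution with mean μ and variance μ + αμ², and those involving ρ with a zero-inflated Poisson; this parameterization of the NB and all function names are assumptions made for the illustration.

```python
import numpy as np
from scipy import stats


def raw_moment(pmf, k, upper=2000):
    """E(y^k) by direct summation of y^k * P(y) over 0..upper."""
    y = np.arange(upper + 1)
    return float(np.sum(y.astype(float) ** k * pmf(y)))


def nb_moments(mu, alpha):
    """3rd and 4th raw moments written in terms of the dispersion parameter alpha."""
    a1, a2, a3 = alpha + 1, 2 * alpha + 1, 3 * alpha + 1
    m3 = mu + 3 * a1 * mu**2 + a1 * a2 * mu**3
    m4 = mu + 7 * a1 * mu**2 + 6 * a1 * a2 * mu**3 + a1 * a2 * a3 * mu**4
    return m3, m4


def zip_moments(mu, rho):
    """3rd and 4th raw moments written in terms of the structural-zero probability rho."""
    m3 = (1 - rho) * mu * (1 + 3 * mu + mu**2)
    m4 = (1 - rho) * mu * (1 + 7 * mu + 6 * mu**2 + mu**3)
    return m3, m4


mu, alpha, rho = 2.5, 0.7, 0.3

# NB with mean mu and variance mu + alpha * mu^2 in scipy's (n, p) parameterization.
nb = stats.nbinom(n=1.0 / alpha, p=1.0 / (1.0 + alpha * mu))
nb_numeric = (raw_moment(nb.pmf, 3), raw_moment(nb.pmf, 4))

# ZIP: point mass rho at zero mixed with Poisson(mu).
zip_pmf = lambda y: rho * (y == 0) + (1 - rho) * stats.poisson.pmf(y, mu)
zip_numeric = (raw_moment(zip_pmf, 3), raw_moment(zip_pmf, 4))
```

Comparing `nb_numeric` with `nb_moments(mu, alpha)` and `zip_numeric` with `zip_moments(mu, rho)` for a few parameter values gives a quick check on which closed form belongs to which distribution.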

B. Proof of Theorem 2

Let $G_{i} = D_{i} V_{i}^{- 1}$ and π_i = (π_i₁, …, π_im₁)^⊤. Then, $w_{n} = \frac{1}{n} \sum_{i = 1}^{n} G_{i} Δ_{i} S_{i}$ , with G_iΔ_iS_i = G_i(x_i, θ, α)Δ_i(r_i, π_i, γ)S_i(y_i, x_i,θ). It follows from the iterated conditional expectation that E(G_iΔ_iS_i) = E [G_iS_iE(Δ_i ∣ r_i, y_i, x_i)]. By definition, Δ_i is an m × m block diagonal matrix with the tth diagonal block given by $\frac{r_{it}}{π_{it}} I_{2} (1 \leq t \leq m)$ , with I_2 denoting the 2 × 2 identity matrix. Since $E (\frac{r_{it}}{π_{it}} I_{2} ∣ r_{i}, y_{i}, x_{i}) = I_{2}$ , it follows that E(G_iΔ_iS_i) = E(G_iS_i) = 0. Thus, the WGEE II is unbiased and the estimate θˆ obtained as the solution to the equations is consistent.

Let γˆ be the solution to (22). By a Taylor expansion of the estimating equations in (22) and solving for γˆ−γ, we obtain

\sqrt{n} (\hat{γ} - γ) = - H^{- 1} \frac{\sqrt{n}}{n} \sum_{i = 1}^{n} Q_{ni} + o_{p} (1),

(34)

where o_p(1) denotes the stochastic o(1)(25). Also, by applying a Taylor series expansion to the WGEE II in (23), we have

\sqrt{n} w_{n} = - {(\frac{\partial}{\partial θ} w_{n})}^{⊤} \sqrt{n} (\hat{θ} - θ) - {(\frac{\partial}{\partial α} w_{n})}^{⊤} \sqrt{n} (\hat{α} - α) - {(\frac{\partial}{\partial γ} w_{n})}^{⊤} \sqrt{n} (\hat{γ} - γ) + o_{p} (1) .

(35)

If αˆ is $\sqrt{n}$-consistent, it follows that

{(\frac{\partial}{\partial α} w_{n} (θ, α))}^{⊤} \sqrt{n} (\hat{α} - α) = o_{p} (1) \sqrt{n} (\hat{α} - α) = o_{p} (1) .

By substituting o_p(1) for ${(\frac{\partial}{\partial α} w_{n} (θ, α))}^{⊤} \sqrt{n} (\hat{α} - α)$ in (35) and solving for $\sqrt{n} (\hat{θ} - θ)$ , we obtain

\sqrt{n} (\hat{θ} - θ) = {(- \frac{\partial}{\partial θ} w_{n})}^{- ⊤} \sqrt{n} [w_{n} + C (\hat{γ} - γ)] + o_{p} (1) .

(36)

It follows from (34) and (36) that

\sqrt{n} (\hat{θ} - θ) = {(- \frac{\partial}{\partial θ} w_{n})}^{- ⊤} \frac{\sqrt{n}}{n} \sum_{i = 1}^{n} (w_{ni} - {CH}^{- 1} Q_{ni}) + o_{p} (1) .

(37)

Since

\frac{\partial}{\partial θ} w_{n} = \frac{1}{n} \sum_{i = 1}^{n} (\frac{\partial}{\partial θ} Δ_{i} S_{i}) G_{i}^{⊤} + o_{p} (1) = \frac{1}{n} \sum_{i = 1}^{n} D_{i} Δ_{i} G_{i}^{⊤} + o_{p} (1) \to_{p} - B^{⊤} ,

(38)

where →_p denotes convergence in probability, it follows from (37) and (38) that

\sqrt{n} (\hat{θ} - θ) = - B^{- ⊤} \frac{\sqrt{n}}{n} \sum_{i = 1}^{n} (w_{ni} - {CH}^{- 1} Q_{ni}) + o_{p} (1) .

(39)

By applying the central limit theorem and Slutsky's theorem to (39)(25), θˆ is asymptotically normal with the asymptotic variance given by Σ_θ in (24).

C. Asymptotic Normality of Score Statistic

First, assume no missing data. Then, $B = E (D_{i} V_{i}^{- 1} D_{i})$ . By applying the law of large numbers,

\frac{\partial}{\partial θ} w_{n} (θ) = (\begin{matrix} \frac{\partial}{\partial θ_{(1)}} w_{n (1)} (θ) & \frac{\partial}{\partial θ_{(1)}} w_{n (2)} (θ) \\ \frac{\partial}{\partial θ_{(2)}} w_{n (1)} (θ) & \frac{\partial}{\partial θ_{(2)}} w_{n (2)} (θ) \end{matrix}) \to_{p} B = E (D_{i} V_{i}^{- 1} D_{i}) = (\begin{matrix} B_{11} & B_{12} \\ B_{12}^{⊤} & B_{22} \end{matrix}) .

(40)

It follows from a Taylor's series expansion and (40) that

0 = w_{n (1)} ({\tilde{θ}}_{(1)}, θ_{(20)}) = w_{n (1)} (θ) - B_{11}^{⊤} ({\tilde{θ}}_{(1)} - θ_{(1)}) + o_{p} (n^{- \frac{1}{2}}) .

Thus,

{\tilde{θ}}_{(1)} - θ_{(1)} = B_{11}^{- 1} w_{n (1)} (θ) + o_{p} (n^{- \frac{1}{2}}) .

(41)

Similarly, since $B_{12}^{⊤} = B_{21}$ , we have:

\begin{matrix} w_{n (2)} ({\tilde{θ}}_{(1)}, θ_{(20)}) = w_{n (2)} (θ) - (\frac{\partial^{⊤}}{\partial θ_{(1)}} w_{n (2)}) ({\tilde{θ}}_{(1)} - θ_{(1)}) + o_{p} (n^{- \frac{1}{2}}) \\ = w_{n (2)} (θ) - B_{21} ({\tilde{θ}}_{(1)} - θ_{(1)}) + o_{p} (n^{- \frac{1}{2}}) . \end{matrix}

(42)

It follows from (41) and (42) that

\begin{matrix} w_{n (2)} ({\tilde{θ}}_{(1)}, θ_{(20)}) = w_{n (2)} (θ) - B_{21} [B_{11}^{- 1} w_{n (1)} (θ) + o_{p} (n^{- \frac{1}{2}})] + o_{p} (n^{- \frac{1}{2}}) \\ = G w_{n} (θ) + o_{p} (n^{- \frac{1}{2}}) . \end{matrix}

By the central limit theorem,

\sqrt{n} w_{n (2)} ({\tilde{θ}}_{(1)}, θ_{(20)}) = \sqrt{n} G w_{n} (θ) + o_{p} (1) \to_{d} N (0, Σ_{(2)} = G Σ_{θ} G^{⊤}) ,

(43)

where G is defined in (28) and Σ_θ in (24).

In the presence of missing data, $B = E (D_{i} V_{i}^{- 1} Δ_{i} D_{i})$ as defined in (28). By a similar argument, w_n₍₂₎ (θ̃₍₁₎, θ₍₂₀₎) has an asymptotic normal distribution, which implies that the score statistic T_s(θ̃₍₁₎, θ₍₂₀₎) has an asymptotic $χ_{q}^{2}$ distribution.

References

1. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34:1–14.
2. Crepon B, Duguet E. Research and development, competition and innovation: pseudo-maximum likelihood and simulated maximum likelihood methods applied to count data models with heterogeneity. Journal of Econometrics. 1997;79:355–378.
3. Miaou SP. The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accident Analysis & Prevention. 1994;26:471–482. doi: 10.1016/0001-4575(94)90038-8.
4. Welsh A, Cunningham RB, Donnelly CF, Lindenmayer DB. Modeling the abundance of rare species: statistical-models for counts with extra zeros. Ecological Modelling. 1996;88:297–308.
5. Faddy M. Stochastic models for analysis of species abundance data. In: Fletcher DJ, Kavalieris L, Manly BF, editors. Statistics in Ecology and Environmental Monitoring 2: Decision Making and Risk Assessment in Biology. University of Otago Press; 1998. pp. 33–40.
6. Gurmu S, Trivedi P. Excess zeros in count models for recreational trips. Journal of Business & Economic Statistics. 1996;14:469–477.
7. Gurmu S. Semi-parametric estimation of hurdle regression models with an application to Medicaid utilization. Journal of Applied Econometrics. 1997;12:225–242.
8. Shonkwiler J, Shaw W. Hurdle count-data models in recreation demand analysis. Journal of Agricultural and Resource Economics. 1996;21:210–219.
9. Hall DB. Zero-Inflated Poisson and binomial regression with random effects: A case study. Biometrics. 2000;56:1030–1039. doi: 10.1111/j.0006-341x.2000.01030.x.
10. Yau KW, Lee AH. Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Statistics in Medicine. 2001;20:2907–2920. doi: 10.1002/sim.860.
11. World Health Organization. Optimal duration of exclusive breastfeeding. Geneva: WHO; 2001.
12. Donath S, Amir LH. Rates of breastfeeding in Australia by State and socio-economic status: Evidence from the 1995 National Health Survey. Journal of Pediatrics and Child Health. 2000;36(2):164–168. doi: 10.1046/j.1440-1754.2000.00486.x.
13. Cheung YB. Zero-inflated models for regression analysis of count data: a study of growth and development. Statistics in Medicine. 2002;21:1461–1469. doi: 10.1002/sim.1088.
14. Wyman PA, Cross W, Brown HC, Yu Q, Tu XM. Intervention to strengthen emotional self-regulation in children with emerging mental health problems: Proximal impact on school behavior. Journal of Abnormal Child Psychology. doi: 10.1007/s10802-010-9398-x. in press.
15. Abma JC, Martinez GM, Mosher WD, Dawson BS. Teenagers in the United States: Sexual activity, contraceptive use, and child bearing. Vital Health Statistics. 2002;23(24).
16. Abe T, Martin I, Roche L. Clusters of Census Tracts with High Proportions of Men with Distant-Stage Prostate Cancer Incidence in New Jersey, 1995 to 1999. American Journal of Preventive Medicine. 2006;30(2):S60–S66. doi: 10.1016/j.amepre.2005.09.003.
17. Hur K, Hedeker D, Henderson W, Khuri S, Daley J. Modeling clustered count data with excess zeros in health care outcomes research. Health Services and Outcomes Research Methodology. 2002;3:5–20.
18. Lachenbruch PA. Analysis of data with excess zeros. Statistical Methods in Medical Research. 2002;11:297–302. doi: 10.1191/0962280202sm289ra.
19. Lee AH, Wang K, Scott JA, Yau KKW, McLachlan GJ. Multi-level zero-inflated Poisson regression modelling of correlated count data with excess zeros. Statistical Methods in Medical Research. 2006;15:47–61. doi: 10.1191/0962280206sm429oa.
20. Ritz J, Spiegelman D. Equivalence of conditional and marginal regression models for clustered and longitudinal data. Statistical Methods in Medical Research. 2004;13:309–323.
21. Zhang H, Xia Y, Chen R, Lu N, Tang W, Tu X. On Modeling Longitudinal Binomial Responses: Implications from Two Dueling Paradigms. Journal of Applied Statistics. 2011;38:2373–2390.
22. Zhang H, Tang W, Yu Q, Feng C, Gunzler D, Tu X. A New Look at the Difference between GEE and GLMM When Modeling Longitudinal Count Responses. Journal of Applied Statistics.
23. Estimating Equations. Oxford University Press; New York: 1991. Estimating equations for mixed Poisson models; pp. 35–46.
24. Hall DB, Zhang ZG. Marginal models for zero inflated clustered data. Statistical Modeling. 2004;4:161–180.
25. Kowalski J, Tu XM. Modern Applied U Statistics. Wiley; New York: 2007.
26. Crowder M. On linear and quadratic estimating functions. Biometrika. 1987;74:591–97.
27. Dobbie MJ, Welsh AH. Modeling correlated zero-inflated count data. Australian & New Zealand Journal of Statistics. 2001;43:431–444.
28. Prentice RL, Zhao LP. Estimating Equations for Parameters in Means and Covariances of Multivariate Discrete and Continuous Responses. Biometrics. 1991;47:825–839.
29. Liang KY, Zeger SL, Qaqish B. Multivariate regression analyses for categorical data. J R Statist Soc B. 1992;54:3–40.
30. Rubin DB. Inference and Missing Data. Biometrika. 1976;63:581–592.
31. Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: Wiley; 1987.
32. McCullagh P, Nelder JA. Generalized Linear Models. 2nd. Chapman and Hall; London: 1989.
33. Dean CB, Lawless JF. Tests for detecting overdispersion in Poisson regression models. J Amer Statist Assoc. 1989;84:467–472.
34. Cameron AC, Trivedi PK. Econometric models based on count data: Comparisons and applications of some estimators and tests. Journal of Applied Econometrics. 1986;1:29–53.
35. Lee LF. Specification test for Poisson regression models. International Economic Review. 1986;27:689–706.
36. Tu XM, Feng C, Kowalski J, Tang W, Wang H, Wan C, Ma Y. Correlation analysis for longitudinal data: Applications to HIV and psychosocial research. Statistics in Medicine. 2007;26:4116–4138. doi: 10.1002/sim.2857.
37. Ma Y, Tang W, Feng C, Tu XM. Inference for kappas for longitudinal study data: applications to sexual health research. Biometrics. 2008;64:781–789. doi: 10.1111/j.1541-0420.2007.00934.x.
38. Ma Y, Tang W, Yu Q, Tu XM. Modeling concordance correlation coefficient for longitudinal study data. Psychometrika. 2010;75:99–119.
39. Ma Y, Gonzalez Della Valle A, Zhang H, Tu XM. A U-statistics based approach for modeling Cronbach Coefficient Alpha within a longitudinal data setting. Statistics in Medicine. 2011;29(6):659–670. doi: 10.1002/sim.3853.
40. Yu Q, Tang W, Kowalski J, Tu XM. Multivariate U-Statistics: A Tutorial with applications. Wiley Interdisciplinary Reviews – Computational Statistics. 2011;3:457–471.
41. Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. JASA. 1995;90:106–121.
42. Cameron AC, Trivedi PK. Regression analysis of count data. Cambridge Univ. Press; London: 1998.
43. Reboussin BA, Liang KY. An estimating equations approach for the LISCOMP Model. Psychometrika. 1998;63:165–182.
44. Yu Q. Distribution-free models for longitudinal count data. Ph.D. Thesis. Department of Biostatistics and Computational Biology, School of Medicine and Dentistry, University of Rochester; Rochester, New York: 2009.
45. Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics: Simulation and Computation. 1994;23:939–951.
46. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semi-parametric nonresponse models. Journal of the American Statistical Association. 1999;94(448):1096–1146.
47. Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
48. Rotnitzky A, Jewell NP. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika. 1990;77:485–497.
49. Pan W. On the robust variance estimator in generalized estimating equations. Biometrika. 2001;88:901–906.
50. Frees EW, Valdez EA. Understanding relationships using copulas. North American Actuarial Journal. 1998;2:1–25.
51. Nelsen RB. An introduction to Copulas. Springer; New York: 2006.
52. Yan JR. Package copula on CRAN, multivariate dependence with copula. 2009. http://cran.r-project.org/web/packages/copula/index.html.
53. Calsyn DA, Wells EA, Saxon AJ, Jackson R, Heiman JR. Sexual activity under the influence of drugs is common among methadone clients. In: Harris L, editor. Problems of Drug Dependence 1999. Vol. 315. National Institute on Drug Abuse; 2000. NIH Pub. No. 00-4773.
54. Calsyn DA, Hatch-Maillette M, Tross S, et al. Motivational and Skills Training HIV/Sexually Transmitted Infection Sexual Risk Reduction Groups for Men. Journal of Substance Abuse Treatment. 2009;37(2):138–150. doi: 10.1016/j.jsat.2008.11.008.

[R1] 1.Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34:1–14. [Google Scholar]

[R2] 2.Crepon B, Duguet E. Research and development, competition and innovation — pseudo-maximum likelihood and simulated maximum likelihood methods applied to count data models with heterogeneity. Journal of Econometrics. 1997;79:355–378. [Google Scholar]

[R3] 3.Miaou SP. The relationship between truck accidents and geometric design of road sections — Poisson versus negative binomial regressions. Accident Analysis & Prevention. 1994;26:471–482. doi: 10.1016/0001-4575(94)90038-8. [DOI] [PubMed] [Google Scholar]

[R4] 4.Welsh A, Cunningham RB, Donnelly CF, Lindenmayer DB. Modeling the abundance of rare species: statistical-models for counts with extra zeros. Ecological Modelling. 1996;88:297–308. [Google Scholar]

[R5] 5.Faddy M. Stochastic models for analysis of species abundance data. In: Fletcher DJ, Kavalieris L, Manly BF, editors. Statistics in Ecology and Environmental Monitoring 2: Decision Making and Risk Assessment in Biology. University of Otago Press; 1998. pp. 33–40. [Google Scholar]

[R6] 6.Gurmu S, Trivedi P. Excess zeros in count models for recreational trips. Journal of Business & Economic Statistics. 1996;14:469–477. [Google Scholar]

[R7] 7.Gurmu S. Semi-parametric estimation of hurdle regression models with an application to Medicaid utilization. Journal of Applied Econometrics. 1997;12:225–242. [Google Scholar]

[R8] 8.Shonkwiler J, Shaw W. Hurdle count-data models in recreation demand analysis. Journal of Agricultural and Resource Economics. 1996;21:210–219. [Google Scholar]

[R9] 9.Hall DB. Zero-Inflated Poisson and binomial regression with random effects: A case study. Biometrics. 2000;56:1030–1039. doi: 10.1111/j.0006-341x.2000.01030.x. [DOI] [PubMed] [Google Scholar]

[R10] 10.Yau KW, Lee AH. Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Statistics in Medicine. 2001;20:2907–2920. doi: 10.1002/sim.860. [DOI] [PubMed] [Google Scholar]

[R11] 11.World Health Organization. Optimal duration of exclusive breastfeeding. Geneva: WHO; 2001. [Google Scholar]

[R12] 12.Donath S, Amir LH. Rates of breastfeeding in Australia by State and socio-economic status: Evidence from the 1995 National Health Survey. Journal of Pediatrics and Child Health. 2000;36(2):164–168. doi: 10.1046/j.1440-1754.2000.00486.x. [DOI] [PubMed] [Google Scholar]

[R13] 13.Cheung YB. Zero-infated models for regression analysis of count study of growth and development. Statistics in Medicine. 2002;21:1461–1469. doi: 10.1002/sim.1088. [DOI] [PubMed] [Google Scholar]

[R14] 14.Wyman PA, Cross W, Brown HC, Yu Q, Tu XM. Intervention to strengthen emotional self-regulation in children with emerging mental health problems: Proximal impact on school behavior. Journal of Abnormal Child Psychology. doi: 10.1007/s10802-010-9398-x. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15.Abma JC, Martinez GM, Mosher WD, Dawson BS. Teenagers in the United States: Sexual activity, contraceptive use, and child bearing. Vital Health Statistics. 2002;23(24) [PubMed] [Google Scholar]

[R16] 16.Abe T, Martin I, Roche L. Clusters of Census Tracts with High Proportions of Men with Distant-Stage Prostate Cancer Incidence in New Jersey, 1995 to 1999. American Journal of Preventive Medicine. 2006;30(2):S60–S66. doi: 10.1016/j.amepre.2005.09.003. [DOI] [PubMed] [Google Scholar]

[R17] 17.Hur K, Hedeker D, Henderson W, Khuri S, Daley J. Modeling clustered count data with excess zeros in health care outcomes research. Health Services and Outcomes Research Methodology. 2002;3:5–20. [Google Scholar]

[R18] 18.Lachenbruch PA. Analysis of data with excess zeros. Statistical Methods in Medical Research. 2002;11:297–302. doi: 10.1191/0962280202sm289ra. [DOI] [PubMed] [Google Scholar]

[R19] 19.Lee AH, Wang K, Scott JA, Yau KKW, McLachlan GJ. Multi-level zero-inflated Poisson regression modelling of correlated count data with excess zeros. Statistical Methods in Medical Research. 2006;15:47–61. doi: 10.1191/0962280206sm429oa.

[R20] 20.Ritz J, Spiegelman D. Equivalence of conditional and marginal regression models for clustered and longitudinal data. Statistical Methods in Medical Research. 2004;13:309–323.

[R21] 21.Zhang H, Xia Y, Chen R, Lu N, Tang W, Tu X. On Modeling Longitudinal Binomial Responses: Implications from Two Dueling Paradigms. Journal of Applied Statistics. 2011;38:2373–2390.

[R22] 22.Zhang H, Tang W, Yu Q, Feng C, Gunzler D, Tu X. A New Look at the Difference between GEE and GLMM When Modeling Longitudinal Count Responses. Journal of Applied Statistics.

[R23] 23.Estimating Equations. Oxford University Press; New York: 1991. Estimating equations for mixed Poisson models; pp. 35–46.

[R24] 24.Hall DB, Zhang ZG. Marginal models for zero inflated clustered data. Statistical Modeling. 2004;4:161–180.

[R25] 25.Kowalski J, Tu XM. Modern Applied U Statistics. Wiley; New York: 2007.

[R26] 26.Crowder M. On linear and quadratic estimating functions. Biometrika. 1987;74:591–597.

[R27] 27.Dobbie MJ, Welsh AH. Modeling correlated zero-inflated count data. Australian & New Zealand Journal of Statistics. 2001;43:431–444.

[R28] 28.Prentice RL, Zhao LP. Estimating Equations for Parameters in Means and Covariances of Multivariate Discrete and Continuous Responses. Biometrics. 1991;47:825–839.

[R29] 29.Liang KY, Zeger SL, Qaqish B. Multivariate regression analyses for categorical data. J R Statist Soc B. 1992;54:3–40.

[R30] 30.Rubin DB. Inference and Missing Data. Biometrika. 1976;63:581–592.

[R31] 31.Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: Wiley; 1987.

[R32] 32.McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Chapman and Hall; London: 1989.

[R33] 33.Dean CB, Lawless JF. Tests for detecting overdispersion in Poisson regression models. J Amer Statist Assoc. 1989;84:467–472.

[R34] 34.Cameron AC, Trivedi PK. Econometric models based on count data: Comparisons and applications of some estimators and tests. Journal of Applied Econometrics. 1986;1:29–53.

[R35] 35.Lee LF. Specification test for Poisson regression models. International Economic Review. 1986;27:689–706.

[R36] 36.Tu XM, Feng C, Kowalski J, Tang W, Wang H, Wan C, Ma Y. Correlation analysis for longitudinal data: Applications to HIV and psychosocial research. Statistics in Medicine. 2007;26:4116–4138. doi: 10.1002/sim.2857.

[R37] 37.Ma Y, Tang W, Feng C, Tu XM. Inference for kappas for longitudinal study data: applications to sexual health research. Biometrics. 2008;64:781–789. doi: 10.1111/j.1541-0420.2007.00934.x.

[R38] 38.Ma Y, Tang W, Yu Q, Tu XM. Modeling concordance correlation coefficient for longitudinal study data. Psychometrika. 2010;75:99–119.

[R39] 39.Ma Y, Gonzalez Della Valle A, Zhang H, Tu XM. A U-statistics based approach for modeling Cronbach Coefficient Alpha within a longitudinal data setting. Statistics in Medicine. 2011;29(6):659–670. doi: 10.1002/sim.3853.

[R40] 40.Yu Q, Tang W, Kowalski J, Tu XM. Multivariate U-Statistics: A Tutorial with applications. Wiley Interdisciplinary Reviews – Computational Statistics. 2011;3:457–471.

[R41] 41.Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association. 1995;90:106–121.

[R42] 42.Cameron AC, Trivedi PK. Regression analysis of count data. Cambridge Univ. Press; London: 1998.

[R43] 43.Reboussin BA, Liang KY. An estimating equations approach for the LISCOMP Model. Psychometrika. 1998;63:165–182.

[R44] 44.Yu Q. Distribution-free models for longitudinal count data. Ph.D. Thesis. Department of Biostatistics and Computational Biology, School of Medicine and Dentistry, University of Rochester; Rochester, New York: 2009.

[R45] 45.Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics: Simulation and Computation. 1994;23:939–951.

[R46] 46.Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semi-parametric nonresponse models. Journal of the American Statistical Association. 1999;94(448):1096–1146.

[R47] 47.Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.

[R48] 48.Rotnitzky A, Jewell NP. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika. 1990;77:485–497.

[R49] 49.Pan W. On the robust variance estimator in generalized estimating equations. Biometrika. 2001;88:901–906.

[R50] 50.Frees EW, Valdez EA. Understanding relationships using copulas. North American Actuarial Journal. 1998;2:1–25.

[R51] 51.Nelsen RB. An Introduction to Copulas. Springer; New York: 2006.

[R52] 52.Yan JR. Package copula on CRAN, multivariate dependence with copula. 2009. http://cran.r-project.org/web/packages/copula/index.html

[R53] 53.Calsyn DA, Wells EA, Saxon AJ, Jackson R, Heiman JR. Sexual activity under the influence of drugs is common among methadone clients. In: Harris L, editor. Problems of Drug Dependence 1999. Vol. 315. National Institute on Drug Abuse; 2000. NIH Pub. No. 00-4773.

[R54] 54.Calsyn DA, Hatch-Maillette M, Tross S, et al. Motivational and Skills Training HIV/Sexually Transmitted Infection Sexual Risk Reduction Groups for Men. Journal of Substance Abuse Treatment. 2009;37(2):138–150. doi: 10.1016/j.jsat.2008.11.008.
