Author manuscript; available in PMC: 2014 Jun 30.
Published in final edited form as: Stat Med. 2012 Dec 12;32(14):2390–2405. doi: 10.1002/sim.5691

Distribution-free Models for Longitudinal Count Responses with Overdispersion and Structural zeros

Q Yu 1, R Chen 1, W Tang 1, H He 1,2, R Gallop 3, P Crits-Christoph 3, J Hu 1, XM Tu 1,2
PMCID: PMC3806502  NIHMSID: NIHMS499723  PMID: 23239019

Summary

Overdispersion and structural zeros are two major manifestations of departure from the Poisson assumption when modeling count responses using Poisson loglinear regression. As noted in a large body of literature, ignoring such departures could yield bias and lead to wrong conclusions. Different approaches have been developed to tackle these two major problems. In this paper, we review available methods for dealing with overdispersion and structural zeros within a longitudinal data setting and propose a distribution-free modeling approach to address the limitations of these methods by utilizing a new class of functional response models (FRM). We illustrate our approach with both simulated and real study data.

Keywords: functional response models, monotone missing data pattern, negative binomial, zero-inflated Poisson, weighted generalized estimating equations

1 Introduction

Count (or frequency) responses such as number of heart attacks, days of hospitalization, suicide attempts or unprotected vaginal sex acts arise quite often in biomedical and psychosocial research. The Poisson distribution and, more generally, Poisson-based log-linear regression are widely used for modeling such data. However, heterogeneity in study populations such as data clustering often creates extra variability, which renders the Poisson distribution inappropriate for modeling count data in such instances. One approach for addressing this extra-Poisson variation, or overdispersion, is the popular negative binomial (NB) distribution. This modeling strategy, however, is rendered ineffective when the extra variability is caused by an excessive number of zeros above and beyond the number of zeros expected under the Poisson law. For example, when modeling behavioral outcomes such as the number of unprotected vaginal sex acts over a period of time in HIV prevention research, the study population often contains a subgroup of individuals who are not at risk for such a behavior during the study period, in which case neither the Poisson nor the NB is able to accommodate such structural zeros. One popular approach for addressing such inflated zero counts is the zero-inflated Poisson (ZIP) model, which has been applied to a diverse range of studies(1-16). The inherent methodological problems with structural zeros have received a great deal of attention in the literature(4; 9; 10; 19; 17; 18).

When modeling count responses in the presence of overdispersion and structural zeros within a longitudinal data setting, one of the current strategies is to employ random effects within the context of the generalized linear mixed-effects model (GLMM) to account for correlated responses from repeated assessments over time(19). However, as it relies on parametric assumptions about both the random effects and the response for inference, such an approach lacks robustness when real study data depart from the assumed distributional models. Further, the random effects induce overdispersion into the marginal model at each assessment, giving rise to quite different results and findings than those from the marginal models(20; 21; 22). In addition, such an approach computes estimates using the expectation-maximization (EM) algorithm, which can be problematic since EM is known for its slow convergence and may yield local rather than global maxima, making it difficult to apply such methods in routine analyses.

A popular alternative is to use the generalized estimating equations (GEE) to address correlated longitudinal responses. The GEE approach is widely used for modeling the mean response, or first-order moment. Unlike under the GLMM, model parameters have the same interpretation in the marginal and joint models across assessment times. In addition, as GEE models only the marginal mean of the response variable at each assessment time, it requires neither of the two layers of parametric assumptions imposed by the GLMM (on the random effects and on the response) and thereby provides consistent estimates regardless of the complexity of the correlation structure and the distribution of the response. GEE estimates are also much easier to compute than those based on the GLMM approach.

As the key difference between the standard (Poisson) log-linear model and other models for count responses such as ZIP lies in the variance, or the second-order moment, GEE does not apply directly to extending such models to a longitudinal data setting(23; 24; 25). Also, since ZIP is a mixture of two distributions, we will not be able to identify the model parameters by simply modeling the mean response(24; 26). One approach is to model the zero and positive outcomes separately using a truncated Poisson for the positive response and a logistic regression for the zero outcome(27). However, as the structural and sampling zeros are mixed into a single category, this approach is unable to identify the parameters for modeling the structural zeros, which is often of great interest in practice. For example, in the hospitalization example, this approach will only model those who are hospitalized, since the at-risk subgroup for hospitalization is mixed with those who are healthy and are not at risk for hospitalization. In many studies, it is of great importance to model the at- and non-risk subgroups separately. An alternative to address the identifiability issue is to include a modeling component for the variance and apply GEE to both the specified mean and variance(24; 25; 28; 29). However, none of these methods adequately addresses missing data, yielding biased inference if the missing data do not follow the missing completely at random (MCAR) assumption(30; 31).

In this paper, we propose an approach to overcome the aforementioned difficulties by utilizing a class of functional response models (FRM) and the popular weighted generalized estimating equations (WGEE). In Section 2, we first give a brief overview of the problems with overdispersed and zero-inflated count data and popular models for addressing them. We then introduce FRM and discuss its application to the current setting. In Section 3, we discuss inference for the FRM-based models under both complete and missing data. In Section 4, we illustrate the proposed models with real study data and investigate their performance using simulated data. In Section 5, we give our concluding remarks.

2 Functional Response Models for Count Response

We start with a brief review of existing approaches for addressing overdispersion and structural zeros.

2.1 Models for Overdispersion and Structural Zeros

Consider first a cross-sectional study with n subjects, and let yi denote a count response and xi a vector of explanatory variables. The popular Poisson log-linear regression, a member of the generalized linear model (GLM) family, models the conditional mean of yi given xi, μi = E(yi | xi), by applying the log function to link μi to the linear predictor ηi = xi⊤β(32):

$$y_i \mid x_i \sim \text{i.d. } P(\mu_i), \quad \log(\mu_i) = x_i^{\top}\beta, \quad 1 \le i \le n, \tag{1}$$

where i.d. denotes independently distributed and P(μ) the Poisson distribution with mean μ. Under (1), the conditional mean E(yi | xi) and variance Var(yi | xi) of yi given xi satisfy:

$$E(y_i \mid x_i) = \operatorname{Var}(y_i \mid x_i) = \mu_i. \tag{2}$$

As mentioned, the conditional variance Var(yi | xi) often exceeds the conditional mean μi in real study applications, making (1) inappropriate for modeling such count data. When overdispersion occurs, the standard error of the parameter estimate of the Poisson model is artificially deflated, giving rise to artificially inflated effect size estimates and false significant findings.

Overdispersion can often be empirically detected by goodness of fit statistics or even formally tested(25; 32; 33). When deemed present, overdispersion may be corrected post hoc by using robust variance estimates(25). An alternative is to use models that explicitly address this issue. For example, the popular negative binomial (NB) model allows the variance to exceed the mean:

$$E(y_i \mid \mu_i, \alpha) = \mu_i, \quad \operatorname{Var}(y_i \mid \mu_i, \alpha) = \mu_i(1 + \alpha\mu_i). \tag{3}$$

Unlike the Poisson, the NB has an extra parameter α to indicate the degree of overdispersion. As α → 0, Var(yi | μi, α) → μi. Thus, unless α = 0, the variance of the NB is always larger than the mean, addressing overdispersion. Under the NB, we can check for overdispersion by testing the null H0 : α = 0. Note, however, that under H0, α = 0 is a boundary point of the parameter space α ≥ 0, so the maximum likelihood estimate (MLE) αˆ of α cannot be used directly for testing H0, and alternative score statistics must be used(33; 34; 35).
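As a minimal illustrative sketch (not part of the original analysis), the variance function in (3) and a crude moment-based estimate of α can be computed as follows. The sketch ignores covariates and treats the counts as a single homogeneous sample; the function names and the toy data are hypothetical.

```python
from statistics import mean, variance


def nb_variance(mu, alpha):
    """Variance implied by (3): mu * (1 + alpha * mu)."""
    return mu * (1.0 + alpha * mu)


def moment_alpha(counts):
    """Crude moment estimate of alpha obtained by solving
    s^2 = ybar * (1 + alpha * ybar) for alpha (no covariates)."""
    ybar = mean(counts)
    s2 = variance(counts)  # sample variance with n - 1 denominator
    return (s2 - ybar) / ybar ** 2


counts = [0, 2, 4, 6, 8]  # hypothetical toy data
alpha_hat = moment_alpha(counts)
print(alpha_hat, nb_variance(mean(counts), alpha_hat))
```

A positive value of the estimate suggests a variance in excess of the mean; a formal test would still require the score-type statistics cited above.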

Count responses in many biomedical and psychosocial studies are dominated by a preponderance of zeros that exceeds the expected frequency of the Poisson. Such excess or structural zeros not only cause overdispersion, but also affect the conditional mean, leading to biased estimates of model parameters. The zero-inflated Poisson (ZIP) model is a popular approach to address the twin effects of structural zeros on both the mean and variance.

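Let ui and vi be two subsets of xi, which may overlap or even be identical, and thus need not form a partition of xi. The ZIP regression model is defined by:

$$y_i \mid x_i \sim \text{i.d. } \mathrm{ZIP}(\mu_i, \rho_i), \quad \operatorname{logit}(\rho_i) = u_i^{\top}\beta_u, \quad \log(\mu_i) = v_i^{\top}\beta_v, \quad 1 \le i \le n, \tag{4}$$

where ZIP(μ, ρ) denotes the ZIP distribution defined by:

$$f_{\mathrm{ZIP}}(y \mid \mu, \rho) = \begin{cases} \rho f_0(0) + (1-\rho) f_P(0 \mid \mu) & \text{if } y = 0 \\ (1-\rho) f_P(y \mid \mu) & \text{if } y > 0 \end{cases} \tag{5}$$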

with f0(y) denoting a degenerate distribution centered at 0. In (5), the Poisson probability at 0, fP (0 ∣ μ), is modified by ρf0 (0) + (1 − ρ) fP (0 ∣ μ) with ρf0 (0) = ρ to account for structural zeros.
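The mixture in (5) can be written directly as a small function. The following is a sketch under the assumption of fixed, hypothetical values of μ and ρ (no covariates); it checks numerically that the probabilities sum to one and that the mean equals (1 − ρ)μ.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability f_P(y | mu)."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


def zip_pmf(y, mu, rho):
    """ZIP probability in (5): a point mass rho at 0 mixed with Poisson(mu)."""
    if y == 0:
        return rho + (1.0 - rho) * poisson_pmf(0, mu)
    return (1.0 - rho) * poisson_pmf(y, mu)


mu, rho = 2.0, 0.3  # hypothetical values chosen for illustration
probs = [zip_pmf(y, mu, rho) for y in range(60)]  # tail beyond 59 is negligible
zip_mean = sum(y * p for y, p in enumerate(probs))
print(sum(probs), zip_mean, zip_pmf(0, mu, rho), poisson_pmf(0, mu))
```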

Consider these models within a longitudinal setting with m assessments, with yit, xit, uit and vit denoting the respective variables at time t (1 ≤ t ≤ m). We may model yit as a function of xit (or uit and vit for ZIP) by using either a parametric or distribution-free modeling approach. As mentioned, the former suffers from interpretational and computational problems. A popular distribution-free alternative with inference based on the generalized estimating equations (GEE) is to specify the conditional mean of yit given xit, which for count response has the following form,

$$E(y_{it} \mid x_{it}) = \exp(x_{it}^{\top}\beta), \quad 1 \le t \le m, \quad 1 \le i \le n. \tag{6}$$

This mean-based specification, however, is not sufficient to distinguish the Poisson from the NB, as the two models differ only in the conditional variance Var(yit | xit). The classic model specification also does not work for ZIP, since the conditional mean of yit given xit in this case is

$$E(y_{it} \mid x_{it}) = (1-\rho_{it})\mu_{it}, \quad 1 \le t \le m, \quad 1 \le i \le n, \tag{7}$$

which in general does not provide sufficient information to identify βu and βv.
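A small numerical illustration of this identifiability problem, using hypothetical parameter values and no covariates: two different (μ, ρ) pairs give the same mean (1 − ρ)μ, yet they differ in the second moment E(y²) = (1 − ρ)μ(1 + μ), a standard property of the ZIP distribution.

```python
def zip_moments(mu, rho):
    """First and second raw moments of ZIP(mu, rho)."""
    first = (1.0 - rho) * mu
    second = (1.0 - rho) * mu * (1.0 + mu)
    return first, second


a = zip_moments(mu=4.0, rho=0.5)  # half of the population are structural zeros
b = zip_moments(mu=2.0, rho=0.0)  # ordinary Poisson with mean 2
print(a, b)  # the first moments agree, the second moments do not
```

The mean alone cannot separate the two configurations, whereas the pair of moments can.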

To help distinguish among the three models, one can augment the GEE by including the distinct form of the conditional variance Var(yit | xit) for each model and use the resulting GEE II for inference(23; 24; 28; 29). However, this approach is ad hoc in the sense that GEE II is a method of inference primarily used for improving efficiency over GEE, rather than a formal model akin to (6), since the added response (or dependent variable) Var(yit | xit) is a function of parameters(25). In addition, it does not effectively address missing data. Another approach is to model the zero and positive outcomes separately using a truncated Poisson for the positive response and a logistic regression for the zero outcome(27). However, this approach is unable to identify the parameters for modeling the structural zeros, which is often of greater interest in practice. Next we utilize a new class of regression models to address the limitations of the aforementioned approaches.

2.2 Functional Response Models

Consider a class of distribution-free regression models defined by:

$$E[f(y_{i_1}, \ldots, y_{i_q}) \mid x_{i_1}, \ldots, x_{i_q}] = h(x_{i_1}, \ldots, x_{i_q}; \theta), \quad (i_1, \ldots, i_q) \in C_q^n, \quad 1 \le q, \quad 1 \le i \le n, \tag{8}$$

where yi = (yi1, …, yim)⊤ denotes the vector of responses from the ith subject, f some vector-valued function, h(θ) some vector-valued smooth function (e.g. with continuous derivatives up to the second order), θ a vector of parameters of interest, q some positive integer, and Cqn the set of all (n choose q) combinations of q distinct elements (i1, …, iq) from the integer set {1, …, n}. The functional response models (FRM) (8) extend the single-subject response in the classic GLM to a function of responses from multiple subjects. For example, by setting q = 1, we immediately obtain from (8) the class of distribution-free GLMs for longitudinal data with m assessments. With FRM, we can express a broader class of problems under a regression-like framework(25; 36; 37; 38; 39; 40). Below, we focus on the application of FRM within our setting for modeling count responses.

Consider first the simpler cross-sectional study setting. For the cross-sectional parametric ZIP in (4), let

$$\begin{aligned} f_i &= f(y_i) = (f_{1i}, f_{2i})^{\top}, \quad h_i = h(u_i, v_i) = (h_{1i}, h_{2i})^{\top}, \\ f_{1i} &= y_i, \quad f_{2i} = y_i^2, \quad h_{1i} = (1-\rho_i)\mu_i, \quad h_{2i} = \mu_i(1-\rho_i)(1+\mu_i), \\ \operatorname{logit}(\rho_i) &= u_i^{\top}\beta_u, \quad \log(\mu_i) = v_i^{\top}\beta_v, \end{aligned} \tag{9}$$

where ui (vi) denotes a subset of xi. Under (4), E(fi | ui, vi) = hi(ui, vi). For NB, f(yi) is defined the same as for ZIP in (9), but with hi = (h1i, h2i)⊤ modified as follows:

$$h_{1i} = \mu_i, \quad h_{2i} = \mu_i + \alpha\mu_i^2 + \mu_i^2, \quad \log(\mu_i) = x_i^{\top}\beta. \tag{10}$$

As a special case with α = 0, the FRM for NB reduces to a distribution-free Poisson with μi = exp(xi⊤β). Note that under the FRM-based NB, we can allow α to be negative and thus α = 0 is no longer a boundary point. Thus, we can readily use the estimate of α to test the null H0 : α = 0 to determine whether the Poisson loglinear model is appropriate.
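For completeness, the second-moment components h2i in (9) and (10) can be verified in a few lines. The derivation below is a sketch that uses only the ZIP mixture in (5), the Poisson second moment μ + μ², and the NB mean and variance in (3).

```latex
\begin{align*}
\text{ZIP:}\quad E(y_i^2 \mid u_i, v_i)
  &= \rho_i \cdot 0^2 + (1-\rho_i)\,E_P(y_i^2 \mid \mu_i)
   = (1-\rho_i)(\mu_i + \mu_i^2)
   = \mu_i(1-\rho_i)(1+\mu_i), \\
\text{NB:}\quad E(y_i^2 \mid x_i)
  &= \operatorname{Var}(y_i \mid x_i) + \{E(y_i \mid x_i)\}^2
   = \mu_i(1+\alpha\mu_i) + \mu_i^2
   = \mu_i + \alpha\mu_i^2 + \mu_i^2 .
\end{align*}
```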

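For longitudinal data, suppose that each subject is assessed m times, with yit and xit denoting the respective variables at time t (1 ≤ t ≤ m). Define the FRM-based ZIP model as follows:

$$\begin{aligned} f_{it} &= f(y_{it}) = (f_{1it}, f_{2it})^{\top} = (y_{it}, y_{it}^2)^{\top}, \quad h_{it} = (h_{1it}, h_{2it})^{\top}, \\ h_{1it} &= (1-\rho_{it})\mu_{it}, \quad h_{2it} = \mu_{it}(1-\rho_{it})(1+\mu_{it}), \\ \operatorname{logit}(\rho_{it}) &= u_{it}^{\top}\beta_u, \quad \log(\mu_{it}) = v_{it}^{\top}\beta_v, \quad 1 \le t \le m. \end{aligned} \tag{11}$$

Likewise, we obtain a longitudinal version of FRM-based NB by defining the same fit, but modifying hit as follows:

$$h_{1it} = \mu_{it}, \quad h_{2it} = \mu_{it} + \alpha\mu_{it}^2 + \mu_{it}^2, \quad \log(\mu_{it}) = x_{it}^{\top}\beta, \quad 1 \le t \le m, \quad 1 \le i \le n. \tag{12}$$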

Note that we have assumed a constant α for NB, though the model above readily accommodates a time-varying α. In many studies, clusters causing overdispersion such as those formed by the subjects sampled from a common habitat may not change over time during the study, and this assumption is reasonable.

Both the ZIP and NB models for longitudinal data in (11) and (12) yield the same first- and second-order moments as their respective cross-sectional versions in (9)-(10) at each time t (1 ≤ t ≤ m). Thus, unlike their GLMM-based parametric counterparts, estimates from the FRM-based ZIP and NB models for longitudinal data can be readily compared to their corresponding cross-sectional versions. These distribution-free models are also called semiparametric or moment-based in the literature(41; 42). We refer to these as distribution-free models throughout the text unless otherwise stated.

3 Distribution-free Inference

We first discuss inference for cross-sectional data, and then extend the considerations to the longitudinal setting.

3.1 Distribution-free Inference for Cross-sectional Data

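For the FRM-based ZIP model in (9), let θ = (βu⊤, βv⊤)⊤ and

$$D_i = \frac{\partial}{\partial\theta} h_i, \quad S_i = f_i - h_i, \quad V_i = \begin{pmatrix} \operatorname{Var}(f_{1i} \mid u_i, v_i) & \operatorname{Cov}(f_{1i}, f_{2i} \mid u_i, v_i) \\ \operatorname{Cov}(f_{1i}, f_{2i} \mid u_i, v_i) & \operatorname{Var}(f_{2i} \mid u_i, v_i) \end{pmatrix}. \tag{13}$$

We estimate θ by the following set of generalized estimating equations,

$$w_n(\theta) = \sum_{i=1}^{n} D_i V_i^{-1} S_i = 0. \tag{14}$$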

Given the ZIP model in (4), the elements of Vi in (13) are functions of the conditional moments of yi given xi up to the 4th order, which can be expressed in closed form (see Appendix A). Thus, the quantities Di, Vi and Si in (13) are readily evaluated. Note that (14) bears a close resemblance to the generalized estimating equations II (GEE II) for generalized linear models(25; 28; 29; 43).
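To make the estimating equations concrete, consider the special case of an intercept-only ZIP model (an assumption made here purely for illustration). With no covariates, Di and Vi are the same for every subject, so, provided these 2 × 2 matrices are nonsingular, (14) reduces to matching the first two sample moments of yi to h1 = (1 − ρ)μ and h2 = μ(1 − ρ)(1 + μ), which can be solved in closed form. The function name and toy data below are hypothetical.

```python
import math


def zip_frm_intercept_only(counts):
    """Solve the intercept-only version of (14) by matching the first two
    sample moments to h1 = (1 - rho) * mu and h2 = (1 - rho) * mu * (1 + mu)."""
    n = len(counts)
    m1 = sum(counts) / n
    m2 = sum(y * y for y in counts) / n
    mu = m2 / m1 - 1.0   # because h2 / h1 = 1 + mu
    rho = 1.0 - m1 / mu  # because h1 = (1 - rho) * mu
    if not 0.0 < rho < 1.0:
        raise ValueError("sample moments are incompatible with 0 < rho < 1")
    return {
        "mu": mu,
        "rho": rho,
        "beta_v0": math.log(mu),                # log link for mu
        "beta_u0": math.log(rho / (1.0 - rho)),  # logit link for rho
    }


est = zip_frm_intercept_only([0, 0, 4, 4])  # hypothetical toy data
print(est)
```

With covariates no closed form is available in general, and (14) would instead be solved iteratively, for example by Newton-type updates.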

By defining Di, Vi and Si the same way as in (13), but with θ = (β, α) and hi defined in (10), the GEE in (14) can be used to obtain estimates of θ for NB as well.

Under (9), the GEE estimate θˆ of θ obtained as the solution to (14) is consistent and asymptotically normal (see Theorem 1 below):

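$$\sqrt{n}(\hat\theta - \theta) \to_d N(0, \Sigma_\theta), \quad B = E(D_i V_i^{-1} D_i^{\top}), \quad \Sigma_\theta = B^{-1} E(D_i V_i^{-1} S_i S_i^{\top} V_i^{-1} D_i^{\top}) B^{-\top}, \tag{15}$$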

where →d denotes convergence in distribution(25). Unlike the MLE, the asymptotic results above do not require that yi (given ui and vi) follow the ZIP distribution in (4). If yi does follow such a parametric model, Σθ in (15) simplifies to Σθ = B−1, which is the model-based asymptotic variance.

A consistent estimate of Σθ is obtained by substituting moment estimates in place of the respective parameters:

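$$\hat\Sigma_\theta = \hat B^{-1}\left(\frac{1}{n-1}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat S_i \hat S_i^{\top} \hat V_i^{-1} \hat D_i^{\top}\right)\hat B^{-\top}, \quad \hat B = \frac{1}{n-1}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat D_i^{\top},$$

where the hatted quantities (D̂i, V̂i, Ŝi) denote the corresponding quantities with θ replaced by θˆ. Our simulations indicate that the model-based asymptotic variance estimate outperforms its sandwich alternative by yielding slightly more accurate type I error rates under the correct parametric model(44).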
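The difference between the sandwich and model-based forms of the asymptotic variance in (15) can be seen in a deliberately simplified stand-in: an intercept-only log-linear model for the mean alone, with working variance equal to the mean. This is not the bivariate FRM above; it is a scalar sketch in which D = V = μ, and the sums are divided by n (the choice between n and n − 1 does not matter asymptotically). All names and data are hypothetical.

```python
import math


def loglinear_intercept_variances(counts):
    """Intercept-only model E(y) = exp(beta) with working variance V = mu.
    Returns beta_hat and the sandwich and model-based asymptotic variances
    of sqrt(n) * (beta_hat - beta)."""
    n = len(counts)
    mu = sum(counts) / n  # the GEE solution satisfies exp(beta_hat) = ybar
    bread = mu            # B = D * V^{-1} * D with D = V = mu
    meat = sum((y - mu) ** 2 for y in counts) / n  # average of (D V^{-1} S)^2
    sandwich = meat / bread ** 2
    model_based = 1.0 / bread
    return math.log(mu), sandwich, model_based


beta_hat, sandwich, model_based = loglinear_intercept_variances([0, 2, 4, 6, 8])
print(beta_hat, sandwich, model_based)
```

For these overdispersed toy counts the sandwich variance is twice the model-based one, which is the sense in which Poisson-based standard errors are deflated under overdispersion.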

3.2 Distribution-free Inference for Longitudinal Data

We begin with inference under complete data and then extend the discussion to include missing data.

3.2.1 Inference under Complete Data

Let

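$$f_i = (f_{i1}^{\top}, f_{i2}^{\top}, \ldots, f_{im}^{\top})^{\top}, \quad h_i = (h_{i1}^{\top}, h_{i2}^{\top}, \ldots, h_{im}^{\top})^{\top}, \quad D_i = \frac{\partial}{\partial\theta} h_i, \quad S_i = f_i - h_i, \tag{16}$$

where fit and hit are defined by (11) for the ZIP or by (12) for the NB model. We again apply the GEE in (14), but with Di and Si revised to reflect the changed dimension, and Vi modified to reflect the correlation between the fit's over time:

$$V_i = A_i^{1/2} R(\tau) A_i^{1/2}, \quad A_i = \operatorname{diag}_t(A_{it}), \quad A_{it} = \operatorname{diag}_t\begin{pmatrix} \operatorname{Var}(f_{1it} \mid x_{it}) & \operatorname{Cov}(f_{1it}, f_{2it} \mid x_{it}) \\ \operatorname{Cov}(f_{1it}, f_{2it} \mid x_{it}) & \operatorname{Var}(f_{2it} \mid x_{it}) \end{pmatrix}, \quad 1 \le i \le n, \quad 1 \le t \le m, \tag{17}$$

where R(τ) is a working correlation matrix among the components of fi parameterized by τ. As in the cross-sectional data case, Ait is readily computed. For R(τ), the popular choices are the working independence model (R(τ) = I2m, the 2m × 2m identity matrix) and the exchangeable correlation structure given by:

$$R(\rho) = \begin{pmatrix} I_2 & \rho J_2 & \cdots & \rho J_2 \\ \rho J_2 & I_2 & \cdots & \rho J_2 \\ \vdots & \vdots & \ddots & \vdots \\ \rho J_2 & \rho J_2 & \cdots & I_2 \end{pmatrix}, \quad I_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad J_2 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad 0 < \rho < 1.$$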

Thus, τ is known for the working independence model, but unknown for the exchangeable correlation model with τ = ρ.
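As an illustrative sketch, the exchangeable working correlation matrix can be assembled from its 2 × 2 blocks as follows, with I2 on the block diagonal and ρJ2 elsewhere. The function name is hypothetical and the matrix is stored as a plain list of lists.

```python
def exchangeable_working_corr(m, rho):
    """Build the 2m x 2m exchangeable working correlation matrix:
    I_2 blocks on the diagonal and rho * J_2 blocks off the diagonal."""
    size = 2 * m
    R = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            if r // 2 == c // 2:      # same assessment time: I_2 block
                R[r][c] = 1.0 if r == c else 0.0
            else:                     # different assessment times: rho * J_2
                R[r][c] = rho
    return R


R = exchangeable_working_corr(m=2, rho=0.3)
for row in R:
    print(row)
```

Setting rho = 0 recovers the working independence model.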

Note that since the GEE estimate may not be consistent under working correlation structures other than the independence model, especially in the presence of time-varying covariates(45), we focus on this model in what follows unless otherwise stated. With this choice of R(τ), the GEE is readily solved for θ. However, when the working correlation model used involves an unknown τ, an estimate must be substituted before the GEE is solved to obtain estimates of θ.

As in the cross-sectional data case, the GEE estimate has nice asymptotic properties summarized in Theorem 1 below. Since this is a special case of Theorem 2, its proof is omitted. Since Theorem 1 is stated for general working correlation models, it includes the condition for the estimate of τ to ensure such nice properties.

Theorem 1

Let θˆ denote the GEE estimate and let

$$B = E(D_i V_i^{-1} D_i^{\top}), \quad \Sigma_\theta = B^{-1} E(D_i V_i^{-1} S_i S_i^{\top} V_i^{-1} D_i^{\top}) B^{-\top}. \tag{18}$$

Under mild regularity conditions, θˆ is consistent. Further, if τˆ is √n-consistent, i.e., √n(τˆ − τ) is bounded in probability(25), then θˆ is asymptotically normal with the asymptotic variance Σθ. A consistent estimate of Σθ is given by:

$$\hat\Sigma_\theta = \hat B^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat S_i \hat S_i^{\top} \hat V_i^{-1} \hat D_i^{\top}\right)\hat B^{-\top}, \quad \hat B = \frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat D_i^{\top}, \tag{19}$$

where the hatted quantities (D̂i, V̂i, Ŝi, etc.) denote the corresponding quantities with θ replaced by θˆ.

Note that given the limited choices for the working correlation matrix R(τ), E(SiSi⊤ | xi) = Vi generally is not true in practice. Thus, unlike the cross-sectional data case, there is no model-based asymptotic variance.

3.2.2 Inference under Missing Data

Missing data arise frequently in real studies. For mean-based distribution-free models such as the GLM, the weighted generalized estimating equations (WGEE) is the most popular approach for inference about model parameters. By integrating the inverse probability weighting (IPW) technique with the GEE, the WGEE ensures valid inference when the missing data follow the missing at random (MAR) model, a plausible and general missing data mechanism applicable to many studies in practice(25; 31; 41; 46; 47). We discuss below how to extend this IPW approach to the current FRM-based models for count responses.

Within the context of longitudinal data discussed in the preceding section, we define a missing (or rather observed) data indicator for each subject as follows:

$$r_{it} = \begin{cases} 1 & \text{if } y_{it} \text{ is observed} \\ 0 & \text{if } y_{it} \text{ is missing} \end{cases}, \quad r_i = (r_{i1}, \ldots, r_{im})^{\top}, \quad 1 \le i \le n.$$

We assume no missing data at baseline t = 1 such that ri1 = 1 for all 1 ≤ i ≤ n. Let

$$\pi_{it} = \Pr(r_{it} = 1 \mid x_i, y_i), \quad \Delta_{it} = \frac{r_{it}}{\pi_{it}}, \quad \Delta_i = \operatorname{diag}_t(\Delta_{it}). \tag{20}$$

In most applications, the weight function πit is unknown and must be estimated. Under MCAR, ri is independent of xi and yi and thus πit = Pr (rit = 1) = πt. In this case, πt is a constant independent of xi and yi and is readily estimated by the sample moment π̂t = (1/n) ∑i rit, with the sum taken over i = 1, …, n (2 ≤ t ≤ m).

Under MAR, πit becomes dependent on the observed xi and yi, making it difficult to model and estimate πit without imposing the monotone missing data pattern (MMDP) assumption because of the large number of missing data patterns(25; 37; 41). Under MMDP, yit (xit) is observed only if all yis (xis) prior to time t are observed (1 ≤ s < t ≤ m).

Let

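$$H_{it} = \{\tilde{x}_{it}, \tilde{y}_{it};\ 2 \le t \le m\}, \quad \tilde{x}_{it} = (x_{i1}^{\top}, \ldots, x_{i(t-1)}^{\top})^{\top}, \quad \tilde{y}_{it} = (y_{i1}, \ldots, y_{i(t-1)})^{\top},$$

where x̃it and ỹit contain the explanatory and response variables prior to time t, respectively. Under MAR we have:

$$\pi_{it} = \Pr(r_{it} = 1 \mid x_i, y_i) = \Pr(r_{it} = 1 \mid H_{it}), \quad 2 \le t \le m.$$

Let pit = Pr (rit = 1 ∣ ri(t−1) = 1, Hit), the one-step transition probability for observing the response from time t − 1 to t. We can model pit using logistic regression:

$$\operatorname{logit}(p_{it}(\gamma_t)) = \gamma_{0t} + \gamma_{xt}^{\top}\tilde{x}_{it} + \gamma_{yt}^{\top}\tilde{y}_{it}, \quad 2 \le t \le m, \tag{21}$$

where γt = (γ0t, γxt⊤, γyt⊤)⊤. Let γ = (γ2⊤, …, γm⊤)⊤. Then, under MMDP,

$$\pi_{it}(\gamma) = p_{it}\Pr(r_{i(t-1)} = 1 \mid H_{i(t-1)}) = \prod_{s=2}^{t} p_{is}(\gamma_s), \quad 2 \le t \le m, \quad 1 \le i \le n.$$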

The above provides a relationship to estimate πit from the model for pit in (21).
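A minimal sketch of this relationship for a single subject, assuming the one-step transition probabilities pit have already been obtained from a fitted model of the form (21): the probabilities πit are cumulative products of the pis, and the inverse probability weights are Δit = rit / πit. The function names and numerical values are hypothetical.

```python
def observation_probabilities(p):
    """Given transition probabilities p = [p_i2, ..., p_im], return
    [pi_i1, ..., pi_im] with pi_i1 = 1 (no missing data at baseline) and
    pi_it equal to the product of p_is over s = 2, ..., t."""
    pi = [1.0]
    for p_t in p:
        pi.append(pi[-1] * p_t)
    return pi


def ipw_weights(r, pi):
    """Inverse probability weights Delta_it = r_it / pi_it."""
    return [r_t / pi_t for r_t, pi_t in zip(r, pi)]


pi = observation_probabilities([0.9, 0.8])  # hypothetical p_i2 and p_i3
weights = ipw_weights([1, 1, 0], pi)        # subject drops out before t = 3
print(pi, weights)
```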

We may estimate γ using the following estimating equations:

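$$Q_n(\gamma) = \sum_{i=1}^{n} (Q_{i2}^{\top}, \ldots, Q_{im}^{\top})^{\top} = 0, \quad Q_{it} = \frac{\partial}{\partial\gamma_t}\left\{ r_{i(t-1)}\left[ r_{it}\log(p_{it}) + (1 - r_{it})\log(1 - p_{it}) \right] \right\}, \quad 2 \le t \le m, \quad 1 \le i \le n. \tag{22}$$

With estimated πit, we can estimate θ by generalizing the WGEE for mean-based response models to a WGEE for the current context as follows:

$$w_n(\theta) = \sum_{i=1}^{n} w_{ni} = \sum_{i=1}^{n} D_i V_i^{-1} \hat\Delta_i S_i = \sum_{i=1}^{n} D_i V_i^{-1} \hat\Delta_i (f_i - h_i) = 0, \tag{23}$$

where Di, Vi and Si are defined the same as in the GEE in the complete data case, and Δˆi denotes Δi in (20) with estimated πit. Also, as in the complete data case, Vi may be a function of τ if working dependence correlation models are used, in which case τ must be replaced with an estimate before (23) is used for inference about θ.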
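The effect of the weights in (23) is easiest to see in a stripped-down case, assumed here only for illustration: a single follow-up assessment and an intercept-only mean μ. The weighted equation ∑i Δi (yi − μ) = 0 then has the closed-form solution μ̂ = ∑i Δi yi / ∑i Δi. The names and toy data are hypothetical, and the weights are taken as known rather than estimated.

```python
def ipw_mean(y, r, pi):
    """Solve sum_i (r_i / pi_i) * (y_i - mu) = 0 for mu.
    y[i] may be None when r[i] == 0 (response missing)."""
    num = sum(y_i / pi_i for y_i, r_i, pi_i in zip(y, r, pi) if r_i == 1)
    den = sum(r_i / pi_i for r_i, pi_i in zip(r, pi))
    return num / den


y = [1, 2, None, 4]        # hypothetical follow-up counts, third one missing
r = [1, 1, 0, 1]           # observed-data indicators
pi = [0.5, 0.5, 0.5, 1.0]  # hypothetical probabilities of being observed
mu_ipw = ipw_mean(y, r, pi)
mu_complete_case = sum(v for v in y if v is not None) / sum(r)
print(mu_ipw, mu_complete_case)
```

Subjects who were unlikely to be observed receive larger weights, which is why the weighted estimate differs from the complete-case average.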

The WGEE estimate θˆ has nice asymptotic properties, as summarized by the theorem below (see Appendix B for a proof).

Theorem 2

Let θˆ denote the WGEE II estimate. Under mild regularity conditions,

  1. θˆ is consistent.

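  2. If τˆ is √n-consistent, θˆ is asymptotically normal with asymptotic variance given by:

$$\begin{aligned} \Sigma_\theta &= B^{-1}(U + \Phi)B^{-\top}, \quad U = E(D_i V_i^{-1}\Delta_i S_i S_i^{\top}\Delta_i V_i^{-1} D_i^{\top}), \quad B = E(D_i V_i^{-1}\Delta_i D_i^{\top}), \\ G &= E(D_i V_i^{-1}\Delta_i S_i Q_{ni}^{\top} H^{-\top} C^{\top}), \quad C = E\left[\frac{\partial}{\partial\gamma^{\top}}(D_i V_i^{-1}\Delta_i S_i)\right], \quad H = E\left(\frac{\partial}{\partial\gamma^{\top}} Q_{ni}\right), \\ \Phi &= C H^{-\top} C^{\top} - G - G^{\top}. \end{aligned} \tag{24}$$

A consistent estimate of Σθ is given by:

$$\begin{aligned} \hat\Sigma_\theta &= \hat B^{-1}(\hat U + \hat\Phi)\hat B^{-\top}, \quad \hat U = \frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1}\hat\Delta_i \hat S_i \hat S_i^{\top}\hat\Delta_i \hat V_i^{-1} \hat D_i^{\top}, \quad \hat B = \frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1}\hat\Delta_i \hat D_i^{\top}, \\ \hat G &= \left(\frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1}\hat\Delta_i \hat S_i \hat Q_{ni}^{\top}\right)\hat H^{-\top}\hat C^{\top}, \quad \hat\Phi = \hat C \hat H^{-\top}\hat C^{\top} - \hat G - \hat G^{\top}, \end{aligned}$$

where Ĉ and Ĥ denote the sample-moment counterparts of C and H in (24), and all hatted quantities are evaluated at the estimates θˆ and γˆ.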

Note that the asymptotic variance in (24) contains a correction term B−1ΦB−⊤ to account for the sampling variability in the estimated γˆ.

3.2.3 Score Test

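As Wald-type tests are typically anti-conservative(21; 48; 49), score statistics are often used as an alternative to reduce bias, especially in type I error rates for small to moderate samples. Within the current context, let θ = (θ(1)⊤, θ(2)⊤)⊤, with p and q denoting the dimension of θ(1) and θ(2), respectively. Consider testing the null H0 : θ(2) = θ(20), with θ(20) denoting a vector of known constants.

Under H0 : θ(2) = θ(20),

$$\begin{aligned} D_i &= \begin{pmatrix} \dfrac{\partial h_i(\theta)}{\partial\theta_{(1)}} \\[2ex] \dfrac{\partial h_i(\theta)}{\partial\theta_{(2)}} \end{pmatrix} = \begin{pmatrix} D_{i(1)} \\ D_{i(2)} \end{pmatrix}, \quad D_i V_i^{-1}\hat\Delta_i S_i = \begin{pmatrix} D_{i(1)} V_i^{-1}\hat\Delta_i S_i \\ D_{i(2)} V_i^{-1}\hat\Delta_i S_i \end{pmatrix}, \\ w_n(\theta) &= \begin{pmatrix} w_n^{(1)}(\theta) \\ w_n^{(2)}(\theta) \end{pmatrix} = \frac{1}{n}\sum_{i=1}^{n} D_i V_i^{-1}\hat\Delta_i S_i = \frac{1}{n}\begin{pmatrix} \sum_{i=1}^{n} D_{i(1)} V_i^{-1}\hat\Delta_i S_i \\ \sum_{i=1}^{n} D_{i(2)} V_i^{-1}\hat\Delta_i S_i \end{pmatrix}. \end{aligned} \tag{26}$$

Let θ̃(1) denote the estimate from solving the reduced WGEE:

$$w_n^{(1)}(\theta_{(1)}, \theta_{(20)}) = \frac{1}{n}\sum_{i=1}^{n} D_{i(1)} V_i^{-1}\hat\Delta_i S_i = 0. \tag{27}$$

Set

$$\tilde\theta = \begin{pmatrix} \tilde\theta_{(1)} \\ \theta_{(20)} \end{pmatrix}, \quad B = E(D_i V_i^{-1}\Delta_i D_i^{\top}) = \begin{pmatrix} B_{11} & B_{12} \\ B_{12}^{\top} & B_{22} \end{pmatrix}, \quad G = (-B_{21}B_{11}^{-1}, I_q), \tag{28}$$

where q is the dimension of wn(2), B11 denotes the p × p submatrix, B12 the p × q submatrix (with B21 = B12⊤), and B22 the q × q submatrix from the partitioned (p + q) × (p + q) matrix B. Then, under H0 : θ(2) = θ(20), the following score statistic has an asymptotic (central) χ² distribution with q degrees of freedom (see Appendix C for a proof):

$$T_s(\tilde\theta_{(1)}, \theta_{(20)}) = n\, w_n^{(2)\top}(\tilde\theta_{(1)}, \theta_{(20)})\, \tilde\Sigma_{(2)}^{-1}(\tilde\theta_{(1)}, \theta_{(20)})\, w_n^{(2)}(\tilde\theta_{(1)}, \theta_{(20)}) \to_d \chi_q^2, \tag{29}$$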

where Σ̃(2) = Σ̃θ with and Σ̃θ denoting the corresponding quantities with θ replaced by θ̃.

4 Applications

We first investigate the performance of the approach with small to moderate sample sizes by simulation and then present a real data application. In all the examples, we set the statistical significance level at α = 0.05.

4.1 Simulation Study

For space considerations, we only report results from the ZIP model for longitudinal data with sample size n = 50, 100 and 200. All simulations were performed with a Monte Carlo (MC) sample size of 1,000. We start with data simulations under complete data.

4.1.1 Complete Data Case

For notational brevity, we considered a relatively simple pre-post longitudinal study design, with only one explanatory variable xi following a normal distribution N(1,1), and simulated the bivariate count response, yi = (yi1, yi2), to satisfy the following marginal ZIP model:

$$y_{it} \mid x_i \sim \mathrm{ZIP}(\mu_i, \rho_i), \quad \operatorname{logit}(\rho_i) = \beta_{u0}, \quad \log(\mu_i) = \beta_0 + x_i\beta_1, \quad 1 \le t \le 2, \quad 1 \le i \le n. \tag{30}$$

We set βu0 = −1, β0 = β1 = 1. We first simulated xi from N(1, 1), and then conditional on xi, generated yit by using a copula approach(50; 51; 52). The copula method can generate correlated multivariate responses for any specified marginal distribution and correlation structure. For our simulation study, we set Corr(yi1, yi2 | xi) = 0.5.
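The following is a rough sketch of how a correlated pair of ZIP counts might be generated with a Gaussian copula. The choice of a Gaussian copula is an assumption made here for illustration; the text does not state which copula was used. The latent normal correlation passed to the function is not the same as Corr(yi1, yi2 | xi), so hitting a target of 0.5 for the counts would require an additional calibration step that is omitted. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def zip_quantile(u, mu, rho):
    """Smallest y such that the ZIP(mu, rho) cdf at y is at least u."""
    u = min(u, 1.0 - 1e-12)  # guard against u == 1.0 from rounding
    y = 0
    pois = math.exp(-mu)     # Poisson probability at 0
    cdf = rho + (1.0 - rho) * pois
    while cdf < u:
        y += 1
        pois *= mu / y       # Poisson recursion: f(y) = f(y - 1) * mu / y
        cdf += (1.0 - rho) * pois
    return y


def simulate_pair(mu, rho, latent_corr, rng):
    """Draw (y1, y2) with ZIP(mu, rho) margins via a Gaussian copula."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = latent_corr * z1 + math.sqrt(1.0 - latent_corr ** 2) * rng.gauss(0.0, 1.0)
    std_normal = NormalDist()
    u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
    return zip_quantile(u1, mu, rho), zip_quantile(u2, mu, rho)


rng = random.Random(2012)          # arbitrary seed
rho = 1.0 / (1.0 + math.exp(1.0))  # logit(rho) = beta_u0 = -1
x = rng.gauss(1.0, 1.0)            # covariate drawn from N(1, 1)
mu = math.exp(1.0 + 1.0 * x)       # log(mu) = beta_0 + beta_1 * x with both equal to 1
print(simulate_pair(mu, rho, 0.5, rng))
```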

To examine type I error rates, we considered the null, H0 : β1 = 1, and computed the Wald statistic,

$$Q_n = n(\hat\sigma^2_{\beta_1})^{-1}(\hat\beta_1 - 1)^2,$$

where σ̂²β1 denotes the element of the estimated asymptotic variance Σ̂θ corresponding to β̂1. Let Qn(k) denote this statistic at the kth MC simulation (1 ≤ k ≤ 1000). The type I error rate for testing H0 was estimated by

$$\hat\alpha = \frac{1}{1000}\sum_{k=1}^{1000} I\{Q_n^{(k)} \ge q_{1,0.95}\},$$

with q1,0.95 denoting the 95th percentile of a central χ² distribution with one degree of freedom.

Since Wald statistics are often anti-conservative, we also applied the score test in Section 3.2.3. Let θ = (θ(1)⊤, θ(2))⊤, where θ(1) = (βu0, β0)⊤ and θ(2) = β1. Under H0 : θ(2) = 1, the score statistic Ts(θ̃(1), 1) in (29) has an asymptotic χ² distribution with one degree of freedom. The type I error rate for testing H0 was again estimated by

$$\hat\alpha = \frac{1}{1000}\sum_{k=1}^{1000} I\{T_s^{(k)} \ge q_{1,0.95}\},$$

where Ts(k) denotes this statistic at the kth MC simulation (1 ≤ k ≤ 1000).
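The empirical type I error rate described above is simply the proportion of simulated test statistics that reach the critical value. A short sketch follows, with hypothetical function names and toy statistics; it also reports the Monte Carlo standard error of an estimated rejection rate, which helps in judging how far an estimate may drift from the nominal 0.05 by chance.

```python
import math

CHI2_1_Q95 = 3.841458820694124  # 95th percentile of chi-square with 1 df


def empirical_type1_error(test_stats, critical_value=CHI2_1_Q95):
    """Proportion of simulated statistics at or above the critical value."""
    rejections = sum(1 for s in test_stats if s >= critical_value)
    return rejections / len(test_stats)


def mc_standard_error(rate, n_sims):
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_sims)


toy_stats = [0.5, 4.0, 3.0, 5.0]  # hypothetical statistics from 4 replicates
print(empirical_type1_error(toy_stats), mc_standard_error(0.05, 1000))
```

With 1,000 replicates and a true rate of 0.05, the Monte Carlo standard error is about 0.007.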

Shown in Table 1 are the estimates of θ, standard errors, and type I errors for the ZIP model in (30). For comparison purposes, we also included “Empirical” variance estimates and type I error rates based on such a variance estimate. The “Empirical” type I error rates were computed based on substituting Σθ with the Empirical variance estimate in the Wald test statistic. It is seen that type I error rates were a bit inflated for sample sizes 50 and 100 under the Wald test, but were closer to the nominal 0.05 under the “Score” and “Empirical” tests even for samples as small as n = 50.

Table 1.

GEE estimates of parameters, standard errors, and type I error rates based on Wald and score tests, along with empirical standard errors and type I error rates for ZIP under complete data from 1,000 MC simulations.

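Simulation summary for ZIP under complete data (βu0 = −1, β0 = 1, β1 = 1). Type I error rates are for testing H0 : β1 = 1.

| Sample size | Parameter | Mean | SE (GEE) | SE (Empirical) | Type I error: Wald (GEE) | Type I error: Wald (Empirical) | Type I error: Score |
|---|---|---|---|---|---|---|---|
| 50 | βu0 | −1.052 | 0.363 | 0.385 | | | |
| 50 | β0 | 1.000 | 0.090 | 0.100 | | | |
| 50 | β1 | 0.998 | 0.039 | 0.048 | 0.095 | 0.061 | 0.045 |
| 100 | βu0 | −1.021 | 0.252 | 0.256 | | | |
| 100 | β0 | 1.000 | 0.063 | 0.067 | | | |
| 100 | β1 | 0.999 | 0.027 | 0.031 | 0.076 | 0.052 | 0.054 |
| 200 | βu0 | −1.012 | 0.177 | 0.176 | | | |
| 200 | β0 | 0.999 | 0.044 | 0.046 | | | |
| 200 | β1 | 1.000 | 0.019 | 0.021 | 0.065 | 0.042 | 0.042 |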

To compare our approach with GEE II, we also estimated the parameters using a program developed for such an alternative by Hall and Zhang (2004)(24). As noted earlier, their method modeled the conditional variance, rather than the second moment. In addition, they assumed working independence between the mean and variance. We obtained quite similar results (not shown), which may not be surprising, as such differences are likely to have minor impact on inference given the marginal ZIP model in (30).

4.1.2 Missing Data Case

Assuming no missing data at baseline t = 1, we simulated missing responses under MCAR and MAR with about 20% missing data at t = 2. By applying the discussion in Section 3.2 to the context of the pre-post design, we modeled the missingness at time t = 2 under MAR by:

$$\operatorname{logit}(\pi_{i2}(\gamma)) = \gamma_0 + \gamma_x x_{i1} + \gamma_y y_{i1}, \quad \gamma_x = \gamma_y = \frac{1}{2}.$$

We again considered the null H0: β1 = 1, and computed the Wald and score statistics and the associated type I error rates. The Wald statistic Qn is computed the same way as in the complete data case except that the estimate of θ is obtained from the WGEE in (23).

Shown in Tables 2 and 3 are the estimates of θ, standard errors, and type I errors for the ZIP model under MCAR and MAR, respectively. As in the complete data case, the score test again performed well in correcting the upward bias in type I error rates of the Wald statistic in testing the null H0: β1 = 1, especially for the sample sizes n = 50 and 100. For inference under MAR, the Wald statistic again yielded inflated type I error rates for testing the null, but the score test corrected the upward bias and maintained a type I error rate consistently near 0.05 across all sample sizes.

Table 2.

WGEE estimates of parameters, standard errors, and type I error rates based on Wald and Score tests, along with empirical standard errors and type I error rates for ZIP under MCAR from 1,000 MC simulations.

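Simulation summary for ZIP under missing data following MCAR (βu0 = −1, β0 = 1, β1 = 1). Type I error rates are for testing H0 : β1 = 1.

| Sample size | Parameter | Mean | SE (WGEE) | SE (Empirical) | Type I error: Wald (WGEE) | Type I error: Wald (Empirical) | Type I error: Score |
|---|---|---|---|---|---|---|---|
| 50 | βu0 | −1.077 | 0.378 | 0.402 | | | |
| 50 | β0 | 0.991 | 0.112 | 0.120 | | | |
| 50 | β1 | 0.997 | 0.115 | 0.135 | 0.108 | 0.061 | 0.046 |
| 100 | βu0 | −1.026 | 0.257 | 0.258 | | | |
| 100 | β0 | 0.997 | 0.080 | 0.082 | | | |
| 100 | β1 | 0.998 | 0.082 | 0.088 | 0.075 | 0.057 | 0.044 |
| 200 | βu0 | −1.016 | 0.180 | 0.183 | | | |
| 200 | β0 | 0.998 | 0.057 | 0.055 | | | |
| 200 | β1 | 1.000 | 0.059 | 0.060 | 0.055 | 0.049 | 0.045 |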

4.2 Real Study Data

To illustrate the approach with real study data, we applied it to a multi-center, NIDA-sponsored study entitled “HIV/STD Safer Sex Skills Groups For Men In Methadone Maintenance Or Drug-free Outpatient Treatment Programs,” known as CTN0018 within the Clinical Trials Network (CTN) studies. This study was designed to examine the effectiveness of 5-session motivational and skills training HIV/AIDS group interventions developed to reduce sexual risk behaviors in men, as compared to an HIV education only control condition. Unlike most community-based studies in which the HIV education provided is limited to information, this trial integrated a component to provide skill-training programs such as role plays to reduce sex risk behaviors. The primary outcome of the study is the number of unprotected vaginal and anal sexual intercourse occasions (USO), which was assessed at baseline, 2 weeks, 3- and 6-months(53; 54).

Out of 573 eligible subjects screened, 422 subjects completed assessment at baseline. Among these, 381 (91.27%) and 345 (60.2 %) came for assessment at 3- and 6-months. Since 2 weeks was too short to observe a reasonably large USO, we limited our analysis to the period from baseline to 3- and 6-months follow-up visits.

Shown in Table 4 are the mean USOs and percent of zero USO at baseline, 3- and 6-months for the two treatment groups. It is evident that there was a preponderance of zeros in the distribution of this outcome at each assessment time.

Table 4.

Comparison of mean USO and percent of zero USO between the two treatment groups at baseline, 3- and 6-month follow-up for the CTN0018 Study.

Mean USO and number of zeros at each assessment time for CTN0018 study

Assessment time    Intervention (S.D.)    Without intervention (S.D.)    Zeros (%)
Baseline           21.46 (26.66)          22.34 (27.77)                  65 (15.40)
USO at 3 months    15.71 (25.43)          18.14 (27.21)                  125 (32.80)
USO at 6 months    15.05 (23.35)          17.19 (25.89)                  132 (38.26)

Accordingly, we modeled the USO at 3-month (yi1) and 6-month (yi2) as a function of treatment condition, time and time by treatment interaction, controlling for baseline USO, yi0, using the FRM-based ZIP model in (11) with

$$
\begin{aligned}
\operatorname{logit}(\rho_{it}) &= \beta_{u0} + \beta_{u1} x_i + \beta_{u2} y_{i0} + \beta_{u3} t + \beta_{u4} t x_i, \\
\log(\mu_{it}) &= \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta}_1 = \beta_0 + \beta_1 x_i + \beta_2 y_{i0} + \beta_3 t + \beta_4 t x_i, \quad 1 \le t \le 2, \; 1 \le i \le n,
\end{aligned}
\tag{31}
$$

where xi was an indicator with xi = 1 for the intervention and 0 otherwise.
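As a minimal sketch of how the two linear predictors in (31) map to the ZIP components, the Python function below evaluates ρit (probability of a structural zero) and μit (Poisson mean of the at-risk group) for one subject-visit, and also returns the marginal mean (1 − ρit)μit of a ZIP response. The function name, the argument layout and the coefficient values in the usage lines are hypothetical and chosen only for illustration; they are not the fitted estimates.

```python
import math

def zip_components(x, y0, t, beta_u, beta):
    """Evaluate the two parts of model (31) for one subject-visit.

    x      : 1 for intervention, 0 otherwise
    y0     : baseline USO
    t      : visit index (1 or 2)
    beta_u : coefficients of the logistic part, ordered as
             (intercept, x, y0, t, t*x)
    beta   : coefficients of the log-linear part, same ordering
    """
    design = (1.0, x, y0, t, t * x)
    eta_rho = sum(b * d for b, d in zip(beta_u, design))
    eta_mu = sum(b * d for b, d in zip(beta, design))
    rho = 1.0 / (1.0 + math.exp(-eta_rho))  # inverse logit
    mu = math.exp(eta_mu)                   # inverse log link
    marginal_mean = (1.0 - rho) * mu        # mean of a ZIP response
    return rho, mu, marginal_mean

# Hypothetical coefficients, for illustration only.
beta_u_demo = (-0.5, 0.3, -0.02, 0.1, -0.1)
beta_demo = (2.7, -0.1, 0.01, 0.0, 0.0)
rho_demo, mu_demo, mean_demo = zip_components(1, 20, 1, beta_u_demo, beta_demo)
```

Keeping the two linear predictors separate mirrors the structure of the model: one set of coefficients governs membership in the not-at-risk subgroup, the other governs the count among those at risk.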

To account for potential response-dependent MAR missingness, we modeled the missingness under MMDP using logistic regression:

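$$
\operatorname{logit}\bigl(p_{it}(\gamma_t)\bigr) = \gamma_{0t} + \gamma_{xt} x_i + \gamma_{yt} y_{i(t-1)}, \quad 1 \le t \le 2, \; 1 \le i \le n. \tag{32}
$$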

We assumed a Markov condition in (32) so that the missingness only depended on the most recent observed response.

Shown in Table 5 are the estimates of parameters from the logistic regression, their standard errors and corresponding p-values. The results show that the missingness at time t = 1 depended on the treatment assignment, while the missingness at time t = 2 depended on the observed response at time t = 1. In other words, the subjects in the intervention group were more likely to drop out than those in the control group at time t = 1, while those with smaller values of USO at t = 1 were more likely to drop out at t = 2. Based on these results, we proceeded with inference under MAR.

Table 5.

Estimates of logistic regression for modeling missingness under MAR and MMDP for CTN0018 Study.

Estimates of logistic regression for modeling missingness for CTN0018 study

Predictors       Estimates    Standard errors    P-values
Assessment time t = 1
Intercept         2.777       0.319              < 0.001
yi1              −0.002       0.006              0.752
intervention     −0.869       0.351              0.013
Assessment time t = 2
Intercept         1.443       0.206              < 0.001
yi2               0.019       0.007              0.007
intervention     −0.325       0.257              0.206
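The following Python sketch shows one way the fitted missingness model could be turned into inverse probability weights. It assumes, in line with the interpretation given in the text, that (32) models the probability of being observed at visit t given observation at visit t − 1, and that under a monotone pattern the cumulative probability of being observed at visit t is the product of these one-step probabilities. That product form is a common construction for monotone dropout and is an assumption here; the paper's own definition of the weights is given in an earlier section. The function names are hypothetical, and the coefficients are the rounded estimates from Table 5.

```python
import math

# Rounded estimates from Table 5, ordered as
# (intercept, intervention, most recent observed response).
GAMMA = {
    1: (2.777, -0.869, -0.002),
    2: (1.443, -0.325, 0.019),
}

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def observation_probabilities(x, y0, y1):
    """Cumulative probabilities of being observed at t = 1 and t = 2.

    Assumes a monotone pattern, so that the probability of being
    observed at t = 2 is the product of the two one-step probabilities.
    """
    g0, gx, gy = GAMMA[1]
    p1 = expit(g0 + gx * x + gy * y0)
    g0, gx, gy = GAMMA[2]
    p2 = expit(g0 + gx * x + gy * y1)
    return p1, p1 * p2

def ipw_weights(r1, r2, x, y0, y1=None):
    """Weights r_it / pi_it for the two follow-up visits.

    r1, r2 are observation indicators (1 = observed). If the subject
    is missing at t = 1, both weights are zero and y1 is not needed.
    """
    if r1 == 0:
        return 0.0, 0.0
    pi1, pi2 = observation_probabilities(x, y0, y1)
    return r1 / pi1, r2 / pi2

# Example: an intervention subject with baseline USO 20 and USO 10 at
# 3 months who was observed at both visits.
w1, w2 = ipw_weights(1, 1, x=1, y0=20, y1=10)
```

Observed subjects who resemble those likely to drop out receive weights larger than one, so they stand in for similar subjects who are missing.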

Shown in Table 6 are the estimates of parameters of the ZIP model, their standard errors and associated p-values. As the interaction term involving time and intervention was significant in neither the logistic (ρit) nor the Poisson (μit) component of the model, we refit the model without this term, with the results from the revised model shown in Table 7.

Table 6.

WGEE estimates of parameters, standard errors, and p-values from FRM-based ZIP model with treatment by time interaction based on Wald and score tests under MAR and MMDP for the CTN0018 Study.

Results of FRM-based ZIP model for CTN0018 study

                                                        P-value for H0 : β = 0
Parameter                  Estimate    Standard errors    Wald       Score
Log-linear part (μit)
β0                          2.69       0.196              < 0.001    < 0.001
β1 (intervention)          −0.08       0.028              < 0.001    < 0.001
β2 (baseline USO)           0.012      0.001              < 0.001    < 0.001
β3 (time)                  −0.017      0.118              0.885      0.883
β4 (intervention*time)     −0.062      0.187              0.742      0.741
Logistic part (ρit)
βu0                        −0.52       0.354              0.142      0.140
βu1 (intervention)          0.301      0.499              0.564      0.562
βu2 (baseline USO)         −0.017      0.004              < 0.001    < 0.001
βu3 (time)                  0.126      0.221              0.568      0.566
βu4 (intervention*time)    −0.121      0.314              0.701      0.700
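As a rough check on how a Wald p-value relates to an estimate and its standard error, the Python sketch below computes the two-sided p-value from the standard normal reference distribution, which for a single parameter is equivalent to a chi-square test with one degree of freedom. The function name is hypothetical. Because the inputs are the rounded values printed in Table 6, the results agree with the tabulated Wald p-values only approximately.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided Wald p-value for H0: parameter = 0.

    Uses p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)), with z the
    ratio of the estimate to its standard error.
    """
    z = estimate / std_error
    return z, math.erfc(abs(z) / math.sqrt(2.0))

# Rounded (estimate, standard error) pairs taken from Table 6.
rows = {
    "beta3 (time)": (-0.017, 0.118),
    "beta4 (intervention*time)": (-0.062, 0.187),
    "beta_u0": (-0.52, 0.354),
}
p_values = {name: wald_p_value(est, se)[1] for name, (est, se) in rows.items()}
```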

Table 7.

WGEE estimates of parameters, standard errors, and p-values from revised additive ZIP model based on Wald and score tests under MAR and MMDP for the CTN0018 Study.

Results from revised additive ZIP model for CTN0018 study

                                                    P-value for H0 : β = 0
Parameter              Estimate    Standard errors    Wald       Score
Log-linear part (μit)
β0                      2.90       0.021              < 0.001    < 0.001
β1 (intervention)      −0.09       0.025              < 0.001    < 0.001
β2 (baseline USO)       0.012      0.0004             < 0.001    < 0.001
Logistic part (ρit)
βu0                    −0.68       0.144              < 0.001    < 0.001
βu1 (intervention)      0.371      0.200              0.065      0.068
βu2 (baseline USO)     −0.015      0.004              < 0.001    < 0.001

For treatment effectiveness based on the results from the additive model, the logistic part of the model suggests that the intervention increased the likelihood of no risk for USO during the study, although this effect did not reach significance at the 0.05 level (Wald p = 0.065, score p = 0.068; Table 7), while the log-linear component shows that the intervention significantly reduced the mean frequency of USO for the at-risk subgroup. The ratio of the mean USO of the treated to that of the control condition is exp(−0.09) ≈ 0.91, suggesting a decrease of about 9% in USO for the treated subjects.
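The effect sizes quoted above follow directly from exponentiating the Table 7 coefficients. The short Python sketch below reproduces the mean ratio for the log-linear part and the odds ratio for the logistic part, and adds approximate 95% Wald intervals. The intervals are an illustration based on the rounded estimates and standard errors and are not reported in the paper; the function name is hypothetical.

```python
import math

def exp_effect(estimate, std_error, z=1.96):
    """Exponentiated coefficient with an approximate 95% Wald interval."""
    point = math.exp(estimate)
    lower = math.exp(estimate - z * std_error)
    upper = math.exp(estimate + z * std_error)
    return point, lower, upper

# Log-linear part, beta1 (intervention): ratio of mean USO, treated vs. control.
mean_ratio, mr_lo, mr_hi = exp_effect(-0.09, 0.025)

# Logistic part, beta_u1 (intervention): odds ratio of being not at risk.
odds_ratio, or_lo, or_hi = exp_effect(0.371, 0.200)

percent_reduction = 100.0 * (1.0 - mean_ratio)
```

With these inputs the mean ratio is about 0.91, a reduction of roughly 9%, and the interval for the odds ratio includes 1, which matches the non-significant p-values for βu1 in Table 7.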

Baseline USO also played a significant role. The logistic component indicates that lower baseline USO would significantly increase the likelihood of being at no risk for USO during the study period. The log-linear part of the model shows that higher baseline USO was significantly associated with higher USO during the study. The findings suggest that substance abuse treatment programs should consider offering motivational exercises and skills training to achieve greater reductions in risky sexual activities.

5 Discussion

Count responses are a common type of outcome in biomedical, psychosocial and related services research. We discussed two major manifestations of departure from the Poisson assumption, overdispersion and structural zeros, and reviewed existing methods for addressing these two important issues. In particular, we focused on the limitations of available approaches with respect to longitudinal data analysis and proposed an approach to systematically tackle these problems under a unified modeling framework.

We applied the proposed approach to a real study in HIV prevention, allowing us to address important methodological issues in a timely application. In addition, the results from the simulation study show that the proposed approach works well for longitudinal study data under both complete and missing data settings. Although inference is derived based on large samples, the approach seems to provide valid inference for sample sizes as small as 50.

Table 3.

WGEE estimates of parameters, standard errors, and type I error rates based on Wald and Score tests, along with empirical standard errors and type I error rates for ZIP under MAR from 1,000 MC simulations.

Simulation summary for ZIP under missing data following MAR
βu0 = −1, β0 = 1, β1 = 1

                       Standard errors         Type I error for H0 : β1 = 1
Parameter    Mean      WGEE      Empirical     Wald                   Score
                                               WGEE      Empirical
Sample size of 50
βu0          −1.05     0.402     0.400
β0            0.995    0.128     0.105
β1            1.000    0.168     0.151
                                               0.094     0.062        0.052
Sample size of 100
βu0          −1.02     0.253     0.261
β0            1.001    0.064     0.066
β1            0.998    0.088     0.080
                                               0.087     0.058        0.043
Sample size of 200
βu0          −1.01     0.176     0.177
β0            0.999    0.044     0.044
β1            1.000    0.066     0.060
                                               0.055     0.056        0.051

Acknowledgments

This research was supported in part by NIH grant R21 DA027521-01. We want to thank two anonymous reviewers for very careful reviews of the manuscript, with constructive comments and edits that led to a significantly improved manuscript.

Appendix.

A

The variance $\operatorname{Var}(f_i \mid x_i)$ for the cross-sectional data case is readily computed using the moments up to the 4th order under either the ZIP or NB distribution. The first two moments for ZIP and NB are given in (9) and (10), while the 3rd and 4th order moments for the two models are given by:

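$$
\begin{aligned}
\text{NB:}\quad E(y^4) &= \mu + 7(\alpha+1)\mu^2 + 6(\alpha+1)(2\alpha+1)\mu^3 + (\alpha+1)(2\alpha+1)(3\alpha+1)\mu^4, \\
E(y^3) &= \mu + 3(\alpha+1)\mu^2 + (\alpha+1)(2\alpha+1)\mu^3, \\
\text{ZIP:}\quad E(y^4 \mid x) &= (1-\rho)\mu\,(1 + 7\mu + 6\mu^2 + \mu^3), \\
E(y^3 \mid x) &= (1-\rho)\mu\,(1 + 3\mu + \mu^2),
\end{aligned}
\tag{33}
$$

where the expressions involving the dispersion parameter α are the moments of the NB and those involving the structural-zero probability ρ are the moments of the ZIP.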
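The closed forms in (33) can be checked numerically. The Python sketch below sums y^k against the probability mass function directly. It assumes the usual parameterizations: a ZIP with structural-zero probability ρ and Poisson mean μ, and a negative binomial with mean μ and variance μ + αμ² (size 1/α). Under these assumptions, the expressions in α are the raw moments of the NB and the expressions in ρ are those of the ZIP. The function names and the parameter values in the usage lines are hypothetical.

```python
import math

def zip_raw_moment(k, mu, rho, upper=200):
    """k-th raw moment of a ZIP(rho, mu) variable by direct summation."""
    total = 0.0
    for y in range(1, upper):
        log_pmf = -mu + y * math.log(mu) - math.lgamma(y + 1)
        total += (y ** k) * math.exp(log_pmf)
    return (1.0 - rho) * total  # the structural zeros contribute nothing

def nb_raw_moment(k, mu, alpha, upper=2000):
    """k-th raw moment of an NB with mean mu and variance mu + alpha*mu^2."""
    r = 1.0 / alpha      # size parameter
    p = r / (r + mu)     # success probability
    total = 0.0
    for y in range(1, upper):
        log_pmf = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
                   + r * math.log(p) + y * math.log(1.0 - p))
        total += (y ** k) * math.exp(log_pmf)
    return total

def zip_closed(k, mu, rho):
    """Closed-form 3rd and 4th ZIP moments from (33)."""
    if k == 3:
        return (1 - rho) * mu * (1 + 3 * mu + mu ** 2)
    return (1 - rho) * mu * (1 + 7 * mu + 6 * mu ** 2 + mu ** 3)

def nb_closed(k, mu, alpha):
    """Closed-form 3rd and 4th NB moments from (33)."""
    a1, a2, a3 = alpha + 1, 2 * alpha + 1, 3 * alpha + 1
    if k == 3:
        return mu + 3 * a1 * mu ** 2 + a1 * a2 * mu ** 3
    return mu + 7 * a1 * mu ** 2 + 6 * a1 * a2 * mu ** 3 + a1 * a2 * a3 * mu ** 4

# Hypothetical parameter values for the check.
mu_demo, rho_demo, alpha_demo = 2.0, 0.3, 0.5
checks = {
    ("ZIP", k): (zip_raw_moment(k, mu_demo, rho_demo), zip_closed(k, mu_demo, rho_demo))
    for k in (3, 4)
}
checks.update({
    ("NB", k): (nb_raw_moment(k, mu_demo, alpha_demo), nb_closed(k, mu_demo, alpha_demo))
    for k in (3, 4)
})
```

Each entry of `checks` pairs the numerically summed moment with its closed form; the two should agree up to floating-point error when the truncation point is far in the tail.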

B. Proof of Theorem 2

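Let $G_i = D_i V_i^{-1}$ and $\pi_i = (\pi_{i1}, \ldots, \pi_{im})^{\top}$. Then, $w_n = \frac{1}{n}\sum_{i=1}^{n} G_i \Delta_i S_i$, with $G_i \Delta_i S_i = G_i(x_i, \theta, \alpha)\,\Delta_i(r_i, \pi_i, \gamma)\,S_i(y_i, x_i, \theta)$. It follows from the iterated conditional expectation that $E(G_i \Delta_i S_i) = E\left[G_i S_i\, E(\Delta_i \mid r_i, y_i, x_i)\right]$. By definition, $\Delta_i$ is an $m \times m$ block diagonal matrix with the $t$th diagonal block given by $\frac{r_{it}}{\pi_{it}} I_2$ $(1 \le t \le m)$, with $I_m$ denoting the $m \times m$ identity matrix. Since $E\left(\frac{r_{it}}{\pi_{it}} I_2 \mid r_i, y_i, x_i\right) = I_2$, it follows that $E(G_i \Delta_i S_i) = E(G_i S_i) = 0$. Thus, the WGEE II is unbiased and the estimate $\hat\theta$ obtained as the solution to the equations is consistent.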

Let $\hat\gamma$ be the solution to (22). By a Taylor expansion of the estimating equations in (22) and solving for $\hat\gamma - \gamma$, we obtain

$$
\sqrt{n}(\hat\gamma - \gamma) = -H^{-1}\frac{\sqrt{n}}{n}\sum_{i=1}^{n} Q_{ni} + o_p(1), \tag{34}
$$

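where $o_p(1)$ denotes the stochastic $o(1)$ (25). Also, by applying a Taylor series expansion to the WGEE II in (23), we have

$$
\sqrt{n}\,w_n = -\left(\frac{\partial}{\partial\theta} w_n\right)^{\top}\sqrt{n}(\hat\theta - \theta) - \left(\frac{\partial}{\partial\alpha} w_n\right)^{\top}\sqrt{n}(\hat\alpha - \alpha) - \left(\frac{\partial}{\partial\gamma} w_n\right)^{\top}\sqrt{n}(\hat\gamma - \gamma) + o_p(1). \tag{35}
$$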

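If $\hat\alpha$ is $\sqrt{n}$-consistent, it follows that

$$
\left(\frac{\partial}{\partial\alpha} w_n(\theta, \alpha)\right)^{\top}\sqrt{n}(\hat\alpha - \alpha) = o_p(1)\,\sqrt{n}(\hat\alpha - \alpha) = o_p(1).
$$

By substituting $o_p(1)$ for $\left(\frac{\partial}{\partial\alpha} w_n(\theta, \alpha)\right)^{\top}\sqrt{n}(\hat\alpha - \alpha)$ in (35) and solving for $\sqrt{n}(\hat\theta - \theta)$, we obtain

$$
\sqrt{n}(\hat\theta - \theta) = \left(-\frac{\partial}{\partial\theta} w_n\right)^{-\top}\sqrt{n}\left[w_n + C(\hat\gamma - \gamma)\right] + o_p(1). \tag{36}
$$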

It follows from (34) and (36) that

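$$
\sqrt{n}(\hat\theta - \theta) = \left(-\frac{\partial}{\partial\theta} w_n\right)^{-\top}\frac{\sqrt{n}}{n}\sum_{i=1}^{n}\left(w_{ni} - C H^{-1} Q_{ni}\right) + o_p(1). \tag{37}
$$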

Since

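$$
\frac{\partial}{\partial\theta} w_n = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial}{\partial\theta}\Delta_i S_i\right) G_i^{\top} + o_p(1) = \frac{1}{n}\sum_{i=1}^{n} D_i \Delta_i G_i^{\top} + o_p(1) \;\to_p\; -B, \tag{38}
$$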

where →p denotes convergence in probability, it follows from (37) and (38) that

$$
\sqrt{n}(\hat\theta - \theta) = -B^{-\top}\frac{\sqrt{n}}{n}\sum_{i=1}^{n}\left(w_{ni} - C H^{-1} Q_{ni}\right) + o_p(1). \tag{39}
$$

By applying the central limit theorem and Slutsky's theorem to (39) (25), $\hat\theta$ is asymptotically normal with the asymptotic variance given by $\Sigma_\theta$ in (24).

C. Asymptotic Normality of Score Statistic

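First, assume no missing data. Then, $B = E(D_i V_i^{-1} D_i^{\top})$. By applying the law of large numbers,

$$
\frac{\partial}{\partial\theta} w_n(\theta) =
\begin{pmatrix}
\frac{\partial}{\partial\theta^{(1)}} w_n^{(1)}(\theta) & \frac{\partial}{\partial\theta^{(1)}} w_n^{(2)}(\theta) \\
\frac{\partial}{\partial\theta^{(2)}} w_n^{(1)}(\theta) & \frac{\partial}{\partial\theta^{(2)}} w_n^{(2)}(\theta)
\end{pmatrix}
\;\to_p\; B = E(D_i V_i^{-1} D_i^{\top}) =
\begin{pmatrix}
B_{11} & B_{12} \\
B_{12}^{\top} & B_{22}
\end{pmatrix}. \tag{40}
$$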

It follows from a Taylor's series expansion and (40) that

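$$
0 = w_n^{(1)}\bigl(\tilde\theta^{(1)}, \theta_0^{(2)}\bigr) = w_n^{(1)}(\theta) - B_{11}^{\top}\bigl(\tilde\theta^{(1)} - \theta^{(1)}\bigr) + o_p\bigl(n^{-1/2}\bigr).
$$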

Thus,

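$$
\tilde\theta^{(1)} - \theta^{(1)} = B_{11}^{-1} w_n^{(1)}(\theta) + o_p\bigl(n^{-1/2}\bigr). \tag{41}
$$

Similarly, since $B_{12}^{\top} = B_{21}$, we have:

$$
\begin{aligned}
w_n^{(2)}\bigl(\tilde\theta^{(1)}, \theta_0^{(2)}\bigr)
&= w_n^{(2)}(\theta) - \left(\frac{\partial}{\partial\theta^{(1)}} w_n^{(2)}\right)^{\top}\bigl(\tilde\theta^{(1)} - \theta^{(1)}\bigr) + o_p\bigl(n^{-1/2}\bigr) \\
&= w_n^{(2)}(\theta) - B_{21}\bigl(\tilde\theta^{(1)} - \theta^{(1)}\bigr) + o_p\bigl(n^{-1/2}\bigr).
\end{aligned}
\tag{42}
$$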

It follows from (41) and (42) that

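$$
\begin{aligned}
w_n^{(2)}\bigl(\tilde\theta^{(1)}, \theta_0^{(2)}\bigr)
&= w_n^{(2)}(\theta) - B_{21}\left[B_{11}^{-1} w_n^{(1)}(\theta) + o_p\bigl(n^{-1/2}\bigr)\right] + o_p\bigl(n^{-1/2}\bigr) \\
&= G\,w_n(\theta) + o_p\bigl(n^{-1/2}\bigr).
\end{aligned}
$$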

By the central limit theorem,

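$$
\sqrt{n}\,w_n^{(2)}\bigl(\tilde\theta^{(1)}, \theta_0^{(2)}\bigr) = \sqrt{n}\,G\,w_n(\theta) + o_p(1) \;\to_d\; N\bigl(0, \Sigma^{(2)} = G\,\Sigma_\theta\,G^{\top}\bigr), \tag{43}
$$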

where G is defined in (28) and Σθ in (24).

In the presence of missing data, $B = E(D_i V_i^{-1} \Delta_i D_i^{\top})$ as defined in (28). By a similar argument, $w_n^{(2)}\bigl(\tilde\theta^{(1)}, \theta_0^{(2)}\bigr)$ has an asymptotic normal distribution, which implies that the score statistic $T_s\bigl(\tilde\theta^{(1)}, \theta_0^{(2)}\bigr)$ has the asymptotic $\chi_q^2$ distribution.

References

  • 1. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34:1–14.
  • 2. Crepon B, Duguet E. Research and development, competition and innovation: pseudo-maximum likelihood and simulated maximum likelihood methods applied to count data models with heterogeneity. Journal of Econometrics. 1997;79:355–378.
  • 3. Miaou SP. The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accident Analysis & Prevention. 1994;26:471–482. doi: 10.1016/0001-4575(94)90038-8.
  • 4. Welsh A, Cunningham RB, Donnelly CF, Lindenmayer DB. Modeling the abundance of rare species: statistical-models for counts with extra zeros. Ecological Modelling. 1996;88:297–308.
  • 5. Faddy M. Stochastic models for analysis of species abundance data. In: Fletcher DJ, Kavalieris L, Manly BF, editors. Statistics in Ecology and Environmental Monitoring 2: Decision Making and Risk Assessment in Biology. University of Otago Press; 1998. pp. 33–40.
  • 6. Gurmu S, Trivedi P. Excess zeros in count models for recreational trips. Journal of Business & Economic Statistics. 1996;14:469–477.
  • 7. Gurmu S. Semi-parametric estimation of hurdle regression models with an application to Medicaid utilization. Journal of Applied Econometrics. 1997;12:225–242.
  • 8. Shonkwiler J, Shaw W. Hurdle count-data models in recreation demand analysis. Journal of Agricultural and Resource Economics. 1996;21:210–219.
  • 9. Hall DB. Zero-Inflated Poisson and binomial regression with random effects: A case study. Biometrics. 2000;56:1030–1039. doi: 10.1111/j.0006-341x.2000.01030.x.
  • 10. Yau KW, Lee AH. Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Statistics in Medicine. 2001;20:2907–2920. doi: 10.1002/sim.860.
  • 11. World Health Organization. Optimal duration of exclusive breastfeeding. Geneva: WHO; 2001.
  • 12. Donath S, Amir LH. Rates of breastfeeding in Australia by State and socio-economic status: Evidence from the 1995 National Health Survey. Journal of Paediatrics and Child Health. 2000;36(2):164–168. doi: 10.1046/j.1440-1754.2000.00486.x.
  • 13. Cheung YB. Zero-inflated models for regression analysis of count data: a study of growth and development. Statistics in Medicine. 2002;21:1461–1469. doi: 10.1002/sim.1088.
  • 14. Wyman PA, Cross W, Brown HC, Yu Q, Tu XM. Intervention to strengthen emotional self-regulation in children with emerging mental health problems: Proximal impact on school behavior. Journal of Abnormal Child Psychology. In press. doi: 10.1007/s10802-010-9398-x.
  • 15. Abma JC, Martinez GM, Mosher WD, Dawson BS. Teenagers in the United States: Sexual activity, contraceptive use, and child bearing. Vital Health Statistics. 2002;23(24).
  • 16. Abe T, Martin I, Roche L. Clusters of Census Tracts with High Proportions of Men with Distant-Stage Prostate Cancer Incidence in New Jersey, 1995 to 1999. American Journal of Preventive Medicine. 2006;30(2):S60–S66. doi: 10.1016/j.amepre.2005.09.003.
  • 17. Hur K, Hedeker D, Henderson W, Khuri S, Daley J. Modeling clustered count data with excess zeros in health care outcomes research. Health Services and Outcomes Research Methodology. 2002;3:5–20.
  • 18. Lachenbruch PA. Analysis of data with excess zeros. Statistical Methods in Medical Research. 2002;11:297–302. doi: 10.1191/0962280202sm289ra.
  • 19. Lee AH, Wang K, Scott JA, Yau KKW, McLachlan GJ. Multi-level zero-inflated Poisson regression modelling of correlated count data with excess zeros. Statistical Methods in Medical Research. 2006;15:47–61. doi: 10.1191/0962280206sm429oa.
  • 20. Ritz J, Spiegelman D. Equivalence of conditional and marginal regression models for clustered and longitudinal data. Statistical Methods in Medical Research. 2004;13:309–323.
  • 21. Zhang H, Xia Y, Chen R, Lu N, Tang W, Tu X. On Modeling Longitudinal Binomial Responses: Implications from Two Dueling Paradigms. Journal of Applied Statistics. 2011;38:2373–2390.
  • 22. Zhang H, Tang W, Yu Q, Feng C, Gunzler D, Tu X. A New Look at the Difference between GEE and GLMM When Modeling Longitudinal Count Responses. Journal of Applied Statistics.
  • 23. Estimating equations for mixed Poisson models. In: Estimating Equations. Oxford University Press; New York: 1991. pp. 35–46.
  • 24. Hall DB, Zhang ZG. Marginal models for zero inflated clustered data. Statistical Modeling. 2004;4:161–180.
  • 25. Kowalski J, Tu XM. Modern Applied U Statistics. Wiley; New York: 2007.
  • 26. Crowder M. On linear and quadratic estimating functions. Biometrika. 1987;74:591–97.
  • 27. Dobbie MJ, Welsh AH. Modeling correlated zero-inflated count data. Australian & New Zealand Journal of Statistics. 2001;43:431–444.
  • 28. Prentice RL, Zhao LP. Estimating Equations for Parameters in Means and Covariances of Multivariate Discrete and Continuous Responses. Biometrics. 1991;47:825–839.
  • 29. Liang KY, Zeger SL, Qaqish B. Multivariate regression analyses for categorical data. J R Statist Soc B. 1992;54:3–40.
  • 30. Rubin DB. Inference and Missing Data. Biometrika. 1976;63:581–592.
  • 31. Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: Wiley; 1987.
  • 32. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Chapman and Hall; London: 1989.
  • 33. Dean CB, Lawless JF. Tests for detecting overdispersion in Poisson regression models. J Amer Statist Assoc. 1989;84:467–472.
  • 34. Cameron AC, Trivedi PK. Econometric models based on count data: Comparisons and applications of some estimators and tests. Journal of Applied Econometrics. 1986;1:29–53.
  • 35. Lee LF. Specification test for Poisson regression models. International Economic Review. 1986;27:689–706.
  • 36. Tu XM, Feng C, Kowalski J, Tang W, Wang H, Wan C, Ma Y. Correlation analysis for longitudinal data: Applications to HIV and psychosocial research. Statistics in Medicine. 2007;26:4116–4138. doi: 10.1002/sim.2857.
  • 37. Ma Y, Tang W, Feng C, Tu XM. Inference for kappas for longitudinal study data: applications to sexual health research. Biometrics. 2008;64:781–789. doi: 10.1111/j.1541-0420.2007.00934.x.
  • 38. Ma Y, Tang W, Yu Q, Tu XM. Modeling concordance correlation coefficient for longitudinal study data. Psychometrika. 2010;75:99–119.
  • 39. Ma Y, Gonzalez Della Valle A, Zhang H, Tu XM. A U-statistics based approach for modeling Cronbach Coefficient Alpha within a longitudinal data setting. Statistics in Medicine. 2011;29(6):659–670. doi: 10.1002/sim.3853.
  • 40. Yu Q, Tang W, Kowalski J, Tu XM. Multivariate U-Statistics: A Tutorial with applications. Wiley Interdisciplinary Reviews – Computational Statistics. 2011;3:457–471.
  • 41. Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. JASA. 1995;90:106–121.
  • 42. Cameron AC, Trivedi PK. Regression analysis of count data. Cambridge Univ. Press; London: 1998.
  • 43. Reboussin BA, Liang KY. An estimating equations approach for the LISCOMP Model. Psychometrika. 1998;63:165–182.
  • 44. Yu Q. Distribution-free models for longitudinal count data. Ph.D. Thesis. Department of Biostatistics and Computational Biology, School of Medicine and Dentistry, University of Rochester; Rochester, New York: 2009.
  • 45. Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics: Simulation and Computation. 1994;23:939–951.
  • 46. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semi-parametric nonresponse models. Journal of the American Statistical Association. 1999;94(448):1096–1146.
  • 47. Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
  • 48. Rotnitzky A, Jewell NP. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika. 1990;77:485–497.
  • 49. Pan W. On the robust variance estimator in generalized estimating equations. Biometrika. 2001;88:901–906.
  • 50. Frees EW, Valdez EA. Understanding relationships using copulas. North American Actuarial Journal. 1998;2:1–25.
  • 51. Nelsen RB. An introduction to Copulas. Springer; New York: 2006.
  • 52. Yan JR. Package copula on CRAN, multivariate dependence with copula. 2009. http://cran.r-project.org/web/packages/copula/index.html
  • 53. Calsyn DA, Wells EA, Saxon AJ, Jackson R, Heiman JR. Sexual activity under the influence of drugs is common among methadone clients. In: Harris L, editor. Problems of Drug Dependence 1999. Vol. 315. National Institute on Drug Abuse; 2000. NIH Pub. No. 00-4773.
  • 54. Calsyn DA, Hatch-Maillette M, Tross S, et al. Motivational and Skills Training HIV/Sexually Transmitted Infection Sexual Risk Reduction Groups for Men. Journal of Substance Abuse Treatment. 2009;37(2):138–150. doi: 10.1016/j.jsat.2008.11.008.
