Author manuscript; available in PMC: 2020 Sep 18.
Published in final edited form as: J Appl Stat. 2019 May 22;46(16):2862–2883. doi: 10.1080/02664763.2019.1620705

A semiparametric marginalized zero-inflated model for analyzing healthcare utilization panel data with missingness

Tian Chen a, Hui Zhang b, Bo Zhang c
PMCID: PMC7500577  NIHMSID: NIHMS1534808  PMID: 32952258

Abstract

Zero-inflated count outcomes arise quite often in research and practice. Parametric models such as the zero-inflated Poisson and zero-inflated negative binomial are widely used to model such responses. However, interpretations of those models focus on the at-risk subpopulation of a two-component population mixture and fail to provide direct inference about marginal effects for the overall population. Recently, new approaches have been proposed to facilitate such marginal inferences for count responses with excess zeros. However, they are likelihood based and impose strong assumptions on data distributions. In this paper, we propose a new distribution-free, or semiparametric, alternative to provide robust inference for marginal effects when population mixtures are defined by zero-inflated count outcomes. The proposed method also applies to longitudinal studies with missing data following the general missing at random mechanism. The proposed approach is illustrated with both simulated and real study data.

Keywords: Functional response models, zero-inflated Poisson, zero-inflated negative binomial, marginalized ZIP, marginalized ZINB, missing data

1. Introduction

In health economics and health services research, a commonly used primary outcome is the number of times health care facilities are used, such as visits to primary care doctors and days of hospitalization. Such count responses are typically modeled by the Poisson or negative binomial (NB) distributions. However, if a proportion of study subjects is not at risk for the disease or for use of any health care facilities during the study period, their zeros are mixed with the zeros expected under either the Poisson or NB distribution, leading to excess zeros, or zero-inflated count responses. To account for these excess zeros, zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models are widely applied to count data with many zeros. ZIP models consist of two components: one models the count from an at-risk subpopulation following a Poisson distribution, and the other models the excess zeros for the non-risk subpopulation using a logistic regression. The interpretation of parameters depends on the component. For the Poisson component, with a log link, exponentiating a regression coefficient gives the multiplicative effect of the corresponding covariate on the mean count for the at-risk subpopulation. For the logistic component, exponentiating a coefficient gives the odds ratio for latent class membership, that is, for the odds of belonging to the non-risk class versus the at-risk class. A natural extension of the ZIP model is the ZINB model, in which the Poisson distribution is replaced by the NB and the interpretation of the estimates remains the same. When the primary interest lies in the overall effect of predictors on the count response, that is, the effect of predictors in the overall mixture population, neither ZIP nor ZINB is satisfactory: they do not offer direct marginal inference for the overall population and only assess the effect of predictors on the count outcome in the at-risk subpopulation.

For cross-sectional count data with a preponderance of zeros, Bohning et al. [2] and Albert et al. [1] proposed post-modeling methods to make marginal inference on the effect of a covariate on the overall mean of a zero-inflated count response, whose realizations can be both zeros and nonzeros. Long et al. [17] developed the marginalized ZIP (MZIP) models by directly modeling the overall mean of count data with excess zeros. Unlike the conventional ZIP, where the linear predictor is linked to the mean of the Poisson component, the linear predictor in MZIP is linked to the overall mean of the zero-inflated count response. Preisser et al. [21] introduced the marginalized ZINB (MZINB) models with an identical rationale, aiming at direct marginal inference for count data with clumping at zero. Here, direct marginal inference entails simplified inference on average marginal and incremental effects of a covariate in these models [16]. In particular, the semi-elasticity [30] of a covariate can be expressed as a regression coefficient under a certain convenient link function. For panel count data with clumping at zero, Long et al. [13] explored an extension of the MZIP models to marginalized random-effects zero-inflated Poisson models. Kassahun et al. [11] presented both marginalized random-effects hurdle models and marginalized random-effects zero-inflated models for correlated and overdispersed count data with excess zero observations. However, the models proposed by Long et al. [13] and Kassahun et al. [11] are parametric models with inference based upon the maximum likelihood approach. Although the inference is asymptotically more efficient if the models are correctly specified, such likelihood-based methods impose strong distributional assumptions, and their inference is generally sensitive to deviations from the model specification. Wang et al. [29] recently proposed a robust semiparametric method to provide inference on the overall means of multiple non-negative distributions with excess zero observations based on the empirical likelihood ratio test. Their method focuses on testing mean equality in the cross-sectional case. For panel count data with excess zeros, there has been a growing desire to develop semiparametric models that do not depend on distributional assumptions and can also provide direct marginal inference on the overall mean.

This article is devoted to developing semiparametric marginalized zero-inflated count models for direct inference on the marginal effects of covariates for panel data. The approach is robust to the underlying distribution and also provides valid inference for data with missing values. To this end, we utilize recent extensions of the generalized estimating equations (GEE), a popular robust alternative to parametric models for panel data. By modeling only the mean response, GEE provides valid inference for a wider class of data distributions. Unfortunately, conventional GEE does not apply to the current context, because the mean response of the count outcome alone does not provide sufficient information to identify all parameters for mixture models like those for zero-inflated count responses. To address this identifiability issue, we take advantage of the functional response models (FRM), a rich class of semiparametric models that not only includes conventional GEE and weighted GEE (WGEE) as its members, but also offers a framework for developing new regression models beyond the confines of the current regression paradigm. Unlike GEE, which only models the mean responses, the FRM is capable of modeling non-linear functions of responses and even between-subject attributes and has been applied to a range of methodological issues, including zero-inflated models [7, 26, 33, 35], extensions of the Mann-Whitney-Wilcoxon rank sum test to longitudinal and causal inference settings [3, 32], reliability coefficients [18], social network analysis [19], and causal inference for multi-layered intervention studies [31]. For panel data, FRM also provides valid inference under the missing at random (MAR) assumption, the most popular missing data mechanism in practice [6, 7, 9, 31-33].

This article is organized as follows: Section 2 reviews ZIP, ZINB, MZIP and MZINB models; Section 3 details development of the proposed methods, inference and interpretation of model parameters; Section 4 illustrates the proposed approach with simulated study data; Section 5 presents an empirical analysis of German Socioeconomic Panel data with the proposed method, and Section 6 gives our concluding remarks.

2. Zero-Inflated Count Responses

We start with a brief review of parametric ZIP and ZINB regression models.

2.1. ZIP and ZINB Models for Structural Zeros

Both ZIP and ZINB model a two-component mixture: the first component models membership in the non-risk group, whose responses are constant (structural) zeros, while the second focuses on the at-risk group, whose count responses follow the Poisson or NB distribution. Specifically, let Yi be the random variable of a zero-inflated count response. The distribution of Yi is defined by

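$$
P(Y_i = y_i) =
\begin{cases}
\rho_i + (1-\rho_i)\, f(0 \mid \gamma_i), & \text{if } y_i = 0, \\
(1-\rho_i)\, f(y_i \mid \gamma_i), & \text{if } y_i = 1, 2, \ldots,
\end{cases}
$$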

where f(yi | γi) is either the Poisson probability function with γi = ηi, where ηi is the mean of the Poisson distribution, or the negative binomial probability function with γi = (ηi, τ), where τ is the overdispersion parameter. ρi models the excess zeros and describes the probability of being in the non-risk group, whose members have zero probability of a count greater than 0. Therefore, zero observations include both structural (fixed) zeros from the non-risk group and random zeros from the at-risk group, with unknown group membership. Interested readers may refer to Appendix A for a more detailed introduction to ZIP and ZINB models.
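The mixture distribution above can be checked numerically. The following is a minimal Python sketch for the Poisson case only; the function names `poisson_pmf` and `zip_pmf` and the values of `rho` and `eta` are illustrative assumptions, not part of the original analysis.

```python
import math

def poisson_pmf(y, eta):
    """Poisson probability P(Y = y) for mean eta > 0, computed on the log scale."""
    return math.exp(y * math.log(eta) - eta - math.lgamma(y + 1))

def zip_pmf(y, rho, eta):
    """Zero-inflated Poisson probability: a structural zero with
    probability rho, otherwise a Poisson(eta) count."""
    base = (1.0 - rho) * poisson_pmf(y, eta)
    return rho + base if y == 0 else base

# Illustrative values (assumptions, not taken from the paper).
rho, eta = 0.3, 2.0
p_zero = zip_pmf(0, rho, eta)                            # structural + random zeros
total = sum(zip_pmf(y, rho, eta) for y in range(200))    # should be close to 1
mean = sum(y * zip_pmf(y, rho, eta) for y in range(200)) # close to (1 - rho) * eta
```

The zero probability combines both sources of zeros, and the overall mean works out to (1 − ρ)η, which is the quantity the marginalized models in the next subsection target directly.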

The traditional ZIP or ZINB usually uses the logit link to model ρi and the log link to model ηi:

$$
\operatorname{logit}(\rho_i) = u_i^{\top}\beta_u, \qquad \log(\eta_i) = v_i^{\top}\alpha_v, \qquad i = 1, \ldots, n, \tag{1}
$$

where ui and vi are vectors of explanatory variables for the i-th individual for the excess-zero and Poisson (or negative binomial) components, respectively, which may overlap or even be identical.

2.2. Marginalized Zero-inflated Poisson and Negative Binomial Models

Define μi = E(Yi|vi), the mean of the count response, which includes zero (structural or random) and non-zero counts, given covariates vi. The MZIP model still assumes two components: a Poisson or NB distribution for the count responses and a Bernoulli process for extra zeros. However, instead of specifying a linear model for the logarithm of ηi, the mean of the Poisson or NB component as in Equation (1), the MZIP/MZINB assumes that the logarithm of the mean of the mixture model, log(μi), is directly associated with the linear predictor:

$$
\operatorname{logit}(\rho_i) = u_i^{\top}\beta_u \quad \text{and} \quad \log(\mu_i) = v_i^{\top}\beta_v,
$$

where μi = ηi(1−ρi). While βu still describes the effect of covariates on the excess zeros as in the traditional ZIP or ZINB model, βv now represents the effect of covariates on the count outcome in the overall population, rather than the effect for at-risk subpopulation.
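To make the difference in interpretation concrete, here is a small Python sketch of the mapping from (βu, βv) to ρi, μi and ηi under the marginalized parameterization. The helper `mzip_components` and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def mzip_components(u, v, beta_u, beta_v):
    """Return (rho, mu, eta) for one subject under the marginalized model:
    logit(rho) = u'beta_u, log(mu) = v'beta_v, eta = mu / (1 - rho)."""
    rho = expit(sum(a * b for a, b in zip(u, beta_u)))
    mu = math.exp(sum(a * b for a, b in zip(v, beta_v)))
    eta = mu / (1.0 - rho)
    return rho, mu, eta

# Hypothetical coefficients: intercept plus one binary covariate.
beta_u = [-0.5, 0.3]
beta_v = [1.0, 0.2]
_, mu0, eta0 = mzip_components([1, 0], [1, 0], beta_u, beta_v)
_, mu1, eta1 = mzip_components([1, 1], [1, 1], beta_u, beta_v)

ratio_overall = mu1 / mu0    # equals exp(beta_v[1]) by construction
ratio_at_risk = eta1 / eta0  # also depends on beta_u, so it generally differs
```

In this sketch the ratio of overall means is exactly exp(βv1), whereas the ratio of at-risk means also involves the zero-inflation coefficients, which is why a coefficient from a conventional ZIP fit cannot be read as an overall-population effect.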

As mentioned in the Introduction, GEE fails to identify both parameter vectors βu and βv because modeling only the mean response of the count outcome does not provide sufficient information [7, 26]. Next, we discuss an approach that extends GEE to address the identifiability issue and provides robust inference for a broader class of overdispersion through the framework of functional response models.

3. Functional Response Models for Marginalized Zero-inflated Count Responses

Since cross-sectional data is a special case of longitudinal data with only 1 time point, we will discuss the proposed methodology under the more general longitudinal setting.

3.1. Functional Response Models for Marginalized Zero-inflated Count Responses

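Suppose we have m time points and at each time t we model the mean of (I(yit = 0), yit). We introduce a second response, I(yit = 0), to address the identifiability issue; it is a binary variable that equals 1 if the count response yit = 0 and 0 otherwise. We still use the logit link to model the proportion of structural zeros and the log link to model the overall mean of the count response. Therefore, we have

$$
\begin{aligned}
f_{it} &= f(y_{it}) = (f_{1it}, f_{2it})^{\top}, \qquad f_{1it} = I(y_{it} = 0), \qquad f_{2it} = y_{it}, \\
\operatorname{logit}(\rho_{it}) &= u_{it}^{\top}\beta_u, \qquad \log(\mu_{it}) = v_{it}^{\top}\beta_v, \\
h_{it} &= E(f_{it} \mid v_{it}, u_{it}) = (h_{1it}, h_{2it})^{\top}, \\
h_{1it} &= \operatorname{logit}^{-1}(u_{it}^{\top}\beta_u) + \frac{\exp\{-\exp(v_{it}^{\top}\beta_v)\,[1 + \exp(u_{it}^{\top}\beta_u)]\}}{1 + \exp(u_{it}^{\top}\beta_u)}, \\
h_{2it} &= \exp(v_{it}^{\top}\beta_v).
\end{aligned}
\tag{2}
$$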

Note that although similar in appearance, Equation (2) is not a conventional GEE, since the response is not just yit, but a function of both yit and the nonlinear I(yit = 0). It is a member of the class of functional response models (FRM),

$$
E\left[f(y_{i_1}, \ldots, y_{i_q}; \theta) \mid x_{i_1}, \ldots, x_{i_q}\right] = h(x_{i_1}, \ldots, x_{i_q}; \theta), \qquad (i_1, \ldots, i_q) \in C_q^n, \quad 1 \le q, \quad i = 1, \ldots, n,
$$

where f(·) denotes some vector-valued function, h(·) some vector-valued smooth function (e.g., with continuous derivatives up to the second order), θ a vector of parameters of interest, q some positive integer, and $C_q^n$ the set of $\binom{n}{q}$ combinations of q distinct elements (i1, ..., iq) from the integer set {1, ..., n}.
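The first component of hit in Equation (2) can be traced back to the mixture structure in a few steps. The following worked derivation assumes, as the form of h1it suggests, that random zeros in the at-risk group occur with the Poisson probability exp(−ηit); it is offered as a reading aid rather than as part of the original development.

```latex
\begin{aligned}
h_{1it} &= P(y_{it} = 0 \mid u_{it}, v_{it}) = \rho_{it} + (1 - \rho_{it})\, e^{-\eta_{it}}, \\
1 - \rho_{it} &= \frac{1}{1 + \exp(u_{it}^{\top}\beta_u)}, \qquad
\eta_{it} = \frac{\mu_{it}}{1 - \rho_{it}} = \exp(v_{it}^{\top}\beta_v)\,\{1 + \exp(u_{it}^{\top}\beta_u)\}, \\
h_{1it} &= \operatorname{logit}^{-1}(u_{it}^{\top}\beta_u)
  + \frac{\exp\left[-\exp(v_{it}^{\top}\beta_v)\,\{1 + \exp(u_{it}^{\top}\beta_u)\}\right]}{1 + \exp(u_{it}^{\top}\beta_u)}.
\end{aligned}
```

The second component needs no derivation, since h2it = μit = exp(vit⊤βv) is the marginal mean by definition.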

3.2. Inference

3.2.1. Complete Data

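For the FRM in Equation (2), let β = (βu⊤, βv⊤)⊤ denote the vector of all parameters and

$$
\begin{aligned}
f_i &= (f_{i1}^{\top}, \ldots, f_{im}^{\top})^{\top}, \qquad h_i = (h_{i1}^{\top}, \ldots, h_{im}^{\top})^{\top}, \qquad D_i = \frac{\partial}{\partial\beta} h_i, \qquad S_i = f_i - h_i, \\
A_i &= \operatorname{diag}_t(A_{it}), \qquad V_i = A_i^{1/2} R(\alpha) A_i^{1/2}, \\
A_{it} &= \begin{pmatrix}
\operatorname{Var}(f_{1it} \mid x_{it}) & \operatorname{Cov}(f_{1it}, f_{2it} \mid x_{it}) \\
\operatorname{Cov}(f_{1it}, f_{2it} \mid x_{it}) & \operatorname{Var}(f_{2it} \mid x_{it})
\end{pmatrix},
\end{aligned}
\tag{3}
$$

where R(α) is a working correlation matrix parameterized by α. We then estimate β by solving the following set of equations:

$$
w_n(\beta) = \sum_{i=1}^{n} D_i V_i^{-1} S_i = 0. \tag{4}
$$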

The above extends the GEE beyond linear [14] or quadratic responses [22] to general functions of responses of the FRM.
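A special case may help show why adding the zero indicator restores identifiability. For cross-sectional data (m = 1) with intercepts only, Di and Vi are the same for every subject, so, provided Di Vi⁻¹ is nonsingular, Equation (4) reduces to matching the sample mean and the sample zero fraction to h2 and h1. The Python sketch below solves these two moment equations by bisection; the function names and the toy data are assumptions for illustration and do not reproduce the general estimation procedure.

```python
import math

def zero_prob(rho, mu):
    """h1 = P(y = 0) for overall mean mu and structural-zero probability rho."""
    return rho + (1.0 - rho) * math.exp(-mu / (1.0 - rho))

def fit_intercept_only(y):
    """Match the sample mean and sample zero fraction (m = 1, no covariates).
    Returns (beta_u0, beta_v0) on the logit and log scales."""
    n = len(y)
    mu_hat = sum(y) / n
    p0_hat = sum(1 for v in y if v == 0) / n
    if mu_hat <= 0 or p0_hat <= math.exp(-mu_hat):
        raise ValueError("no zero inflation detectable from these moments")
    lo, hi = 0.0, 1.0 - 1e-12
    for _ in range(200):  # bisection: zero_prob is increasing in rho for fixed mu
        mid = 0.5 * (lo + hi)
        if zero_prob(mid, mu_hat) < p0_hat:
            lo = mid
        else:
            hi = mid
    rho_hat = 0.5 * (lo + hi)
    return math.log(rho_hat / (1.0 - rho_hat)), math.log(mu_hat)

# Toy data (hypothetical): six zeros and four positive counts.
beta_u0, beta_v0 = fit_intercept_only([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])
```

The sample mean alone pins down only βv0; the zero fraction supplies the second equation needed for βu0, which is the role played by f1it in the general model.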

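Under Equation (2), the estimate β̂ of β obtained as the solution to Equation (4) is consistent and asymptotically normal. See Appendix B.1 for a sketch of the proof.

Theorem 1: Assume $E(D_i V_i^{-1} S_i S_i^{\top} V_i^{-1} D_i^{\top}) < \infty$, let $B = E(D_i V_i^{-1} D_i^{\top})$, and assume that $B^{-1}$ exists. Then for any fixed m, as n → ∞ we have

$$
\sqrt{n}\,(\hat{\beta} - \beta) \to_d N(0, \Sigma_\beta), \qquad \text{where } \Sigma_\beta = B^{-1} E(D_i V_i^{-1} S_i S_i^{\top} V_i^{-1} D_i^{\top}) B^{-1},
$$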

where →d denotes convergence in distribution. Unlike the MLE, the asymptotic results only require the assumptions in Equation (2), which guarantee the unbiasedness of the estimating equations in Equation (4), the condition that ensures consistency of the estimates [12]. Thus, yit given uit and vit does not need to follow the ZIP distribution.

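A consistent estimate of Σβ is obtained by substituting moment estimates in place of the respective parameters:

$$
\hat{\Sigma}_\beta = \hat{B}^{-1} \left( \frac{1}{n-1} \sum_{i=1}^{n} \hat{D}_i \hat{V}_i^{-1} \hat{S}_i \hat{S}_i^{\top} \hat{V}_i^{-1} \hat{D}_i^{\top} \right) \hat{B}^{-1}, \qquad
\hat{B} = \frac{1}{n-1} \sum_{i=1}^{n} \hat{D}_i \hat{V}_i^{-1} \hat{D}_i^{\top},
$$

where D̂i, Ŝi and V̂i denote the corresponding quantities with β replaced by β̂.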

3.2.2. Missing Data

Missing data is ubiquitous in longitudinal studies, regardless of how well a study is planned and executed. The estimating equations in Equation (4) generally yield biased estimates under the missing at random (MAR) mechanism [15, 24, 28]. We adapt the inverse probability weighting (IPW) technique to the current context to address this issue.

Define a missing (or rather observed) data indicator for the ith subject as follows:

$$
r_{it} = \begin{cases} 1 & \text{if } y_{it} \text{ is observed}, \\ 0 & \text{if } y_{it} \text{ is missing}, \end{cases} \qquad r_i = (r_{i1}, \ldots, r_{im})^{\top}. \tag{5}
$$

We assume no missing data at baseline t = 1, so that ri1 = 1 for all i = 1, . . . , n. Under MAR, πit, the probability of observing the outcome at time t for the ith subject, depends on the observed covariates, past outcomes and other ancillary variables that are not part of the regression model in Equation (2). For notational brevity and without loss of generality, we assume that πit depends only on observed xit and yit. We also assume monotone missing data patterns (MMDP), as it is difficult to model and estimate πit otherwise [7, 12, 24]. Under MMDP, yit (xit) is observed only if all yis (xis) prior to time t are observed (1 ≤ s < t ≤ m).

Let

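$$
H_{it} = \{x_{it}^{-}, y_{it}^{-};\ t = 2, \ldots, m\}, \qquad x_{it}^{-} = (x_{i1}^{\top}, \ldots, x_{i,t-1}^{\top})^{\top}, \qquad y_{it}^{-} = (y_{i1}, \ldots, y_{i,t-1})^{\top},
$$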

which represents the history of all data information prior to time t (t = 2, . . . , m) for i = 1, . . . , n.

For each t = 2, . . . , m, let

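$$
\pi_{it} = \Pr(r_{it} = 1 \mid H_{it}, x_{it}, y_{it}), \qquad \Delta_{it} = \frac{r_{it}}{\pi_{it}} I_2, \qquad \Delta_i = \operatorname{diag}_t(\Delta_{it}),
$$

where I2 denotes the 2 × 2 identity matrix. If the weight function πit is known, then consider the weighted estimating equations given by

$$
U_n(\beta) = \sum_{i=1}^{n} U_{ni}(\beta) = \sum_{i=1}^{n} D_i V_i^{-1} \Delta_i S_i = 0. \tag{6}
$$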

As in the case of unweighted estimating equations, estimates of β from solving the weighted estimating equations in (6) are consistent and asymptotically normal.

However, in most applications, the weight function πit is unknown. Below we give a brief discussion on how to estimate πit.

Under the missing completely at random (MCAR) mechanism, rit is independent of both xit and yit and thus πit = Pr(rit = 1) = πt. In this case, πit is a constant independent of xit and yit and is readily estimated by the sample moment $\hat{\pi}_{it} = \frac{1}{n}\sum_{i=1}^{n} r_{it}$ (t = 2, ..., m). Under the more general MAR, rit is dependent on yit, but becomes independent of yit given Hit, i.e.,

$$
\pi_{it} = \Pr(r_{it} = 1 \mid H_{it}, x_{it}, y_{it}) = \Pr(r_{it} = 1 \mid H_{it}), \qquad t = 2, \ldots, m, \quad i = 1, \ldots, n.
$$

To estimate πit, let pit = Pr (rit = 1 | ri(t−1) = 1, Hit) denote the one-step transition probability for observing the response from time t − 1 to t. The pit’s are related to the πit’s as follows,

$$
\pi_{it} = p_{it} \Pr(r_{i(t-1)} = 1 \mid H_{i(t-1)}) = \prod_{s=2}^{t} p_{is}, \qquad t = 2, \ldots, m, \quad i = 1, \ldots, n. \tag{7}
$$

By modeling pis (2 ≤ s ≤ t ≤ m), we will be able to estimate πit. We model pit using the logistic regression

$$
\operatorname{logit}(p_{it}) = \xi_{0t} + \xi_{xt}^{\top} x_{it}^{-} + \xi_{yt}^{\top} y_{it}^{-}, \qquad t = 2, \ldots, m, \tag{8}
$$

where ξt = (ξ0t, ξxt⊤, ξyt⊤)⊤ and ξ = (ξ2⊤, ..., ξm⊤)⊤ are parameters associated with the logistic regressions. Under MMDP, the above logistic regression is well-defined. For example, starting with t = 2, xi2− = xi1 and yi2− = yi1 are observed at baseline. Thus, the logistic regression in (8) models ri2 as a function of xi1 and yi1 at baseline. For t = 3, xi3− = (xi1⊤, xi2⊤)⊤ and yi3− = (yi1, yi2)⊤ are observed for those with observed data at t = 2, i.e., ri2 = 1, and the logistic regression models ri3 as a function of observed xi1, xi2, yi1 and yi2 for the subgroup with observed data at time t = 2. Thus, we estimate ξ, and hence πit in (7), using the m − 1 logistic regressions in Equation (8).
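The relation between the one-step transition probabilities and the inverse probability weights can be sketched in a few lines of Python. The function `ipw_weights` is a hypothetical helper; it takes already-estimated transition probabilities as input and assigns weight 1 at baseline, which is an assumption that follows from ri1 = 1 for all subjects.

```python
def ipw_weights(r, p):
    """Inverse probability weights r_it / pi_it for one subject.

    r: observation indicators (r_i1, ..., r_im) under a monotone pattern
    p: one-step transition probabilities (p_i2, ..., p_im)
    """
    weights = [1.0]  # baseline is always observed (assumed weight 1)
    pi = 1.0
    for r_t, p_t in zip(r[1:], p):
        pi *= p_t                 # Equation (7): pi_it is the product of p_is, s = 2..t
        weights.append(r_t / pi)  # zero once the subject has dropped out
    return weights

# A subject observed at t = 1, 2 and missing at t = 3 (illustrative values).
example = ipw_weights([1, 1, 0], [0.8, 0.5])
```

Observed visits that were unlikely to be observed receive larger weights, while missing visits receive weight zero and drop out of the estimating equations.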

By substituting an estimate ξ̂ of ξ and an estimate α̂ of α, Equation (6) is readily solved for β. If α̂ is √n-consistent and ξ̂ is asymptotically normal, the estimate β̂ is consistent and asymptotically normal as well [9, 33]. However, the asymptotic variance is quite complex to estimate, since it must account for the sampling variability of ξ̂. To facilitate the calculation of the asymptotic variance, we may extend the FRM in Equation (2) to include the model for rit.

Let

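$$
\begin{aligned}
f_i &= (f_{i1}^{\top}, \ldots, f_{im}^{\top}, r_{i2}, \ldots, r_{im})^{\top}, \qquad h_i = (h_{i1}^{\top}, \ldots, h_{im}^{\top}, p_{i2}, \ldots, p_{im})^{\top}, \\
\theta &= (\beta^{\top}, \xi^{\top})^{\top}, \qquad \beta = (\beta_u^{\top}, \beta_v^{\top})^{\top}, \qquad i = 1, \ldots, n, \quad t = 1, \ldots, m,
\end{aligned}
$$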

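where fit and hit are defined in Equation (2), pit is given in Equation (8) and rit is defined in Equation (5). This extended version of Equation (2) jointly models the response fit = (I(yit = 0), yit)⊤ of primary interest and the response rit under MAR. The weighted estimating equations are again of the form in Equation (6), but with Di, Vi and Δi redefined to provide joint inference about both β and ξ:

$$
\begin{aligned}
D_i &= \frac{\partial}{\partial\theta} h_i, \qquad S_i = f_i - h_i, \qquad V_{i11} = A_i^{1/2} R(\alpha) A_i^{1/2}, \qquad \Delta_{i11t} = r_{it} \Big( \prod_{s=2}^{t} p_{is} \Big)^{-1} I_2, \\
\Delta_{i11} &= \begin{pmatrix} \Delta_{i112} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \Delta_{i11m} \end{pmatrix}, \qquad
V_{i22} = \begin{pmatrix} p_{i2}(1 - p_{i2}) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & p_{im}(1 - p_{im}) \end{pmatrix}, \qquad
\Delta_{i22} = \begin{pmatrix} r_{i1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & r_{i(m-1)} \end{pmatrix}, \\
V_i &= \begin{pmatrix} V_{i11} & 0 \\ 0 & V_{i22} \end{pmatrix}, \qquad
\Delta_i = \begin{pmatrix} \Delta_{i11} & 0 \\ 0 & \Delta_{i22} \end{pmatrix},
\end{aligned}
\tag{9}
$$

where Ai is defined in Equation (3). By integrating the weight function Δi into Equation (4), we obtain the following estimating equations for inference about β and ξ simultaneously:

$$
U_n(\theta) = U_n(\beta, \xi) = \sum_{i=1}^{n} U_{ni}(\beta, \xi) = \sum_{i=1}^{n} D_i V_i^{-1} \Delta_i S_i = 0. \tag{10}
$$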

Unlike Equation (6), the WGEE defined by the redefined Di, Vi, Δi and Si in Equation (9) yields estimates of both β and ξ. The asymptotic variance of the estimate β̂ of β is then readily obtained as the corresponding block of the asymptotic variance of the WGEE estimate θ̂ that solves Equation (10).

The following theorem shows that, as in the case of complete data, a consistent estimate of the asymptotic variance is again readily constructed. A proof is sketched in the Appendix B.2.

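Theorem 2: Assume $\Sigma_U = E(D_i V_i^{-1} \Delta_i S_i S_i^{\top} \Delta_i V_i^{-1} D_i^{\top}) < \infty$, let $B = E(D_i V_i^{-1} \Delta_i D_i^{\top})$, and assume that $B^{-1}$ exists. Then θ̂ obtained as the solution to (10) is consistent and asymptotically normal as n → ∞ with m fixed, i.e.,

$$
\sqrt{n}\,(\hat{\theta} - \theta) \to_d N(0, \Sigma_\theta),
$$

where $\Sigma_\theta = B^{-1} \Sigma_U B^{-1}$.

As in Theorem 1, a consistent estimate of the asymptotic variance Σθ is readily obtained by substituting moment estimates in place of the respective parameters:

$$
\hat{\Sigma}_\theta = \hat{B}^{-1} \left( \frac{1}{n-1} \sum_{i=1}^{n} \hat{D}_i \hat{V}_i^{-1} \hat{\Delta}_i \hat{S}_i \hat{S}_i^{\top} \hat{\Delta}_i \hat{V}_i^{-1} \hat{D}_i^{\top} \right) \hat{B}^{-1}, \qquad
\hat{B} = \frac{1}{n-1} \sum_{i=1}^{n} \hat{D}_i \hat{V}_i^{-1} \hat{\Delta}_i \hat{D}_i^{\top},
$$

where D̂i, Ŝi, V̂i and Δ̂i denote the corresponding quantities with θ replaced by θ̂.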

4. Simulation Studies

In this section, we conduct a series of simulation studies to investigate the performance of the proposed marginalized regression model for zero-inflated count responses. We only report results for longitudinal data, which includes cross-sectional data as a special case with no missing values. All simulations use a Monte Carlo sample size of M = 1000 with sample sizes n = 300, 500 and 1000, and the significance level is set at 0.05.

4.1. Marginalized zero-inflated Poisson (MZIP) model

We first simulate data from MZIP and consider a longitudinal study with three assessments (m = 3). Let yit be the count response for the ith participant at time t. Explanatory variables from the MZIP are set equal to each other and all are time invariant, that is, uit = vit =ui. Let μit = E (yit | uit, vit). Data are generated from the following MZIP model

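$$
\begin{aligned}
\operatorname{logit}(\rho_{it}) &= u_{it}^{\top}\beta_u = \beta_{u0} + u_{1it}\beta_{u1} + u_{2it}\beta_{u2}, \\
\log(\mu_{it}) &= v_{it}^{\top}\beta_v = \beta_{v0} + v_{1it}\beta_{v1} + v_{2it}\beta_{v2}, \qquad 1 \le t \le 3, \quad i = 1, \ldots, n,
\end{aligned}
$$

where ukit = vkit = uki ~ N(1, 1), k = 1, 2. Structural (excess) zeros are generated through the Bernoulli(ρit) and random counts are generated through a Poisson with mean ηit = μit/(1 − ρit). Conditioning on uit and vit, correlated yit's are generated using the copula approach, with an exchangeable correlation of 0.5 [20]. We set βu and βv as

$$
\beta_u = (\beta_{u0}, \beta_{u1}, \beta_{u2})^{\top} = (-0.5, -0.5, 0.3)^{\top}, \qquad \beta_v = (\beta_{v0}, \beta_{v1}, \beta_{v2})^{\top} = (3, -0.2, -0.4)^{\top}.
$$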

This model generates about 38.7% zeros (structural plus random).
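The data-generating mechanism can be sketched in Python as follows. This is a simplified illustration, not the simulation code used in the study: it draws a single time point with independent responses and omits the copula step, so it is not expected to reproduce the reported figures exactly. The function names and the Poisson sampler (Knuth's multiplication method) are assumptions.

```python
import math
import random

def rpois(rng, lam):
    """Draw a Poisson(lam) variate by Knuth's multiplication method
    (adequate for the moderate means that arise here)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_mzip(n, beta_u, beta_v, seed=0):
    """Simulate n independent MZIP responses at one time point.
    Returns the responses and their true marginal means."""
    rng = random.Random(seed)
    ys, mus = [], []
    for _ in range(n):
        u1, u2 = rng.gauss(1, 1), rng.gauss(1, 1)  # covariates ~ N(1, 1)
        lin_u = beta_u[0] + beta_u[1] * u1 + beta_u[2] * u2
        rho = 1.0 / (1.0 + math.exp(-lin_u))       # structural-zero probability
        mu = math.exp(beta_v[0] + beta_v[1] * u1 + beta_v[2] * u2)
        eta = mu / (1.0 - rho)                     # at-risk Poisson mean
        y = 0 if rng.random() < rho else rpois(rng, eta)
        ys.append(y)
        mus.append(mu)
    return ys, mus

ys, mus = simulate_mzip(5000, (-0.5, -0.5, 0.3), (3.0, -0.2, -0.4))
zero_fraction = sum(y == 0 for y in ys) / len(ys)
```

Comparing the sample mean of `ys` with the average of `mus` gives a quick check that the at-risk mean ηit was scaled correctly to deliver the intended marginal mean.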

Partial results are presented in Figure 1. Since the parameters associated with the marginal mean component (βv) are of greatest interest, we only display the estimation bias (i.e., estimated minus true value) for βv2 using boxplots. A full report of the simulation results, with additional information (e.g., empirical standard errors and asymptotic standard errors) for all parameters, can be found in Appendix C.

Figure 1.


Boxplot of estimation bias for βv2 when data is generated from MZIP: left 3 (white) boxplots are based on longitudinal data without missing values and the right 3 (gray) boxplots are based on the longitudinal data with 17% and 33% missing values at t = 2 and 3, respectively. Coverage probability is shown on the top of the figure.

In Figure 1, the left three white boxplots are based on data generated from the MZIP model described above without missing values (complete case) under sample sizes of 300, 500 and 1000, respectively. As can be seen from the figure, the estimated parameters were quite close to the true values and the variability decreased as the sample size increased. The coverage probabilities (CP) of 95% Wald-type confidence intervals, which are constructed from the parameter and asymptotic variance estimates, are shown at the top of the figure. They moved closer to the nominal 95% level as the sample size increased.

To evaluate the performance under missing data following the MAR, we assume no missing data at baseline t = 1 and model the missingness at t = 2 and t = 3 as

$$
\operatorname{logit}(p_{i2}) = 0.9 + 0.5 \log(y_{i1} + 1.5), \qquad \operatorname{logit}(p_{i3} \mid y_{i2} \text{ is observed}) = 0.2 + 1.5 \log(y_{i2} + 1.5). \tag{11}
$$

In the missing data model, we assume that missingness depends on the previously observed value. The above setting generates about 17% and 33% missing data at t = 2 and t = 3, respectively.
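A monotone dropout mechanism of this kind can be sketched in Python as below. The function `monotone_dropout` is a hypothetical helper; its default coefficients follow Equation (11) as printed above and can be overridden, and the example counts are arbitrary.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def monotone_dropout(y1, y2, rng, a2=0.9, b2=0.5, a3=0.2, b3=1.5):
    """Draw observation indicators (r1, r2, r3) for one subject.

    The response at t = 2 is observed with probability depending on y1;
    the response at t = 3 can only be observed if t = 2 was observed,
    with probability depending on y2 (monotone missingness under MAR).
    """
    p2 = expit(a2 + b2 * math.log(y1 + 1.5))
    if rng.random() >= p2:
        return (1, 0, 0)   # dropped out at t = 2, so t = 3 is missing too
    p3 = expit(a3 + b3 * math.log(y2 + 1.5))
    r3 = 1 if rng.random() < p3 else 0
    return (1, 1, r3)

rng = random.Random(2024)
pattern = monotone_dropout(y1=4, y2=0, rng=rng)
```

Because the second draw is made only for subjects still under observation, the pattern (1, 0, 1) can never occur, which is what the MMDP assumption requires.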

The right three boxplots in Figure 1 show the results under the missing data case for βv2. As in the complete data case, the approach worked very well in the presence of missing data: the estimates were all close to the true values and the variability became smaller with increasing sample size and the coverage probabilities of 95% Wald-type confidence intervals were near the nominal 95% level when sample size was large. Our proposed method provides valid inference for MZIP models with missing values.

4.2. Overdispersion under marginalized zero-inflated negative binomial (MZINB) model

In this simulation, we assess the performance of our proposed method in the presence of an overdispersed count response by replacing MZIP with MZINB, where structural (excess) zeros are still generated through the same Bernoulli(ρit) with ρit = logit−1(uit⊤βu), but random counts are generated through the NB with mean ηit = μit/(1 − ρit) and dispersion parameter τ = 3. We assume the same parameter values as in the MZIP case: βu = (−0.5, −0.5, 0.3)⊤, βv = (3, −0.2, −0.4)⊤. Because ZINB converges to ZIP as τ → ∞, selecting a relatively small τ such as τ = 3 allows us to better assess the performance of our proposed approach under overdispersion.

The left three boxplots in Figure 2 show the performance of the proposed method under complete data generated from the MZINB model, together with the coverage probability of 95% Wald-type confidence intervals. As seen, our proposed method still worked quite well when data were generated from MZINB: the estimates of βv2 were very close to the true value and the coverage probabilities of 95% Wald-type confidence intervals were near the nominal 95% level when the sample size was large. The proposed method seems to remain robust against overdispersion.

Figure 2.


Boxplot of estimation bias for βv2 when data is generated from MZINB: left 3 (white) boxplots are based on longitudinal data without missing values and the right 3 (gray) boxplots are based on the longitudinal data with 17% and 33% missing values at t = 2 and 3, respectively. Coverage probability is shown on the top of the figure.

The results for the missing data case are shown in the right panel of Figure 2 under the same set-up in Equation (11) as for MZIP. The approach again performed very well in the presence of missing data: the estimates were close to the true value, and the coverage probabilities of 95% Wald-type confidence intervals were all reasonably near the nominal 95% level. A full report of the simulation results, with additional information (e.g., empirical standard errors and asymptotic standard errors) for all parameters, can be found in Appendix C. Overall, our proposed method performs quite well with or without missing data and is robust to the underlying distribution.

5. Empirical Analysis of German Socioeconomic Panel Data

We now apply the proposed approach to the first twelve waves (1984-1995) of German Socioeconomic Panel (GSOEP) data, which surveys a representative sample of East and West German households [23]. GSOEP provided comprehensive information on the utilization of health care facilities, personal characteristics (e.g., gender, age) and the insurance schemes under which individuals are covered.

In Germany, individuals whose annual income is below a threshold are required to join a public health insurance fund. Civil servants and the self-employed, as well as persons with earnings above the threshold, can choose to voluntarily join one of the public health insurance funds or a private insurance. People already covered by public insurance are allowed to purchase add-on insurance to cover extra costs such as double rooms in hospitals or a choice of physicians. The primary interest is to evaluate how insurance type affects overall health care utilization, which is measured by the number of physician visits in the 3 months immediately before the survey. To answer the research question, two binary explanatory variables are of greatest interest. The first is public, which describes whether the individual is covered by the public sector (1) or privately insured (0). The second is add-on, which describes whether the individual purchased add-on insurance (1) or not (0). Specifically, we want to examine whether choosing public insurance changes overall health care utilization compared with private insurance. The same comparison is made between individuals with and without add-on insurance.

The GSOEP data include a total of 7293 individual families, with the number of observed visits varying from one to seven, resulting in a total of 27326 observations. For illustration purposes, we only consider the first 3 years for each subject, which gives us 4079 subjects at year 2 and 3612 subjects at year 3. The number of physician visits showed a high frequency of zeros in each year, i.e., 42%, 41%, 38%, 38%, 35%, 32%, 34%, respectively, with an overall proportion of zero visits of 37.09%. Figure 3 shows the histogram of the GSOEP study data.

Figure 3.


Histogram of the GSOEP study data

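We start by examining the missing data mechanism through the logistic regression

$$
\operatorname{logit}(p_{it}(\xi_t)) = \xi_{0t} + \xi_{1t}\,\text{Female}_i + \xi_{2t}\,\text{Age}_{it} + \xi_{3t} \log(y_{i(t-1)} + 1), \qquad 2 \le t \le 3, \tag{12}
$$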

where pit(ξt) = P(rit = 1 | ri(t−1) = 1, Femalei, Ageit, yi(t−1)) is the probability of observing the response at time t given that the response at time t − 1 is also observed. We assumed a Markov condition in Equation (12), so that the missingness depended only on the most recent observed response.

The results of the logistic regression (in Appendix D) show that the missingness at time t = 2 depended on gender, age and the observed response yi1 at baseline, while at time t = 3 the missingness also depended on gender and age but not on the prior response yi2 at time t = 2. The results indicate that female and younger subjects are less likely to have missing visits. For t = 2, people with more doctor visits are less likely to have missing values than those with fewer doctor visits. The missing completely at random mechanism is therefore rejected and we proceed under the MAR mechanism. Since the prior response is not significant when modeling missingness at t = 3, we keep it when modeling missingness at t = 2 but remove it at t = 3.

For the marginalized effects of the public indicator and the add-on indicator, we model the number of doctor visits as a function of these two indicators, controlling for gender, age, health satisfaction (from 0 = bad to 10 = well) and year (treated as categorical, with 1984 as the reference level), that is,

uit=vit={female, age, health satisfaction, year, public, add-on}

As noted earlier, we model the missingness at t = 2(3) as a function of observed response at t = 1(2) using logistic regression specified in Equation (12). We apply the proposed approach to make inference about the marginalized effect.

Shown in Table 1 are the estimated parameters (Est.), standard errors (S.E.), and p-values of the explanatory variables. The upper panel of Table 1 lists the estimates of the marginal mean component. The estimated model is

$$
\begin{aligned}
\log(\hat{\mu}_{it}) ={}& 1.848 + 0.305\,\text{Female}_{it} + 0.007\,\text{Age}_{it} - 0.219\,\text{Health satisfaction}_{it} + 0.190\,\text{Public}_{it} + 0.037\,\text{Addon}_{it} \\
& - 0.013\,\text{Year1985}_{it} + 0.121\,\text{Year1986}_{it} + 0.051\,\text{Year1987}_{it} - 0.005\,\text{Year1988}_{it} - 0.114\,\text{Year1991}_{it} + 0.210\,\text{Year1994}_{it}.
\end{aligned}
$$

Table 1.

Fitted parameters for the analysis of the effect of the public insurance and other covariates on the number of doctor’s visit based on German Socioeconomic Panel (GSOEP) data.

Variable      Est.      S.E.     p-value

Marginalized Mean Component
Intercept     1.848     0.113    < 0.0001
Female        0.305     0.039    < 0.0001
Age           0.007     0.002    < 0.0001
Health        −0.219    0.007    < 0.0001
Public        0.190     0.050    0.0001
Addon         0.037     0.146    0.800
Year1985      −0.013    0.045    0.774
Year1986      0.121     0.043    0.005
Year1987      0.051     0.069    0.462
Year1988      −0.005    0.071    0.939
Year1991      −0.114    0.081    0.159
Year1994      0.210     0.089    0.018

Zero-inflation Component
Intercept     −1.452    0.200    < 0.0001
Female        −0.559    0.058    < 0.0001
Age           −0.007    0.003    0.0198
Health        0.243     0.015    < 0.0001
Public        −0.223    0.077    0.0037
Addon         −0.499    0.267    0.0617
Year1985      0.050     0.070    0.476
Year1986      −0.089    0.071    0.205
Year1987      −0.020    0.116    0.861
Year1988      −0.339    0.126    0.007
Year1991      −0.489    0.153    0.001
Year1994      −0.253    0.146    0.082

As can be seen, the estimated marginalized effect of the public indicator is exp(0.190) = 1.209 with 95% Wald-type confidence interval (1.097, 1.333). Thus, with all other covariates held fixed, the expected number of doctor visits for an individual covered by the public sector of the German health insurance system is approximately 20.9% higher than for one covered by the private sector. That is, individuals covered by private insurance are expected to have fewer doctor visits. This comparison between public and private insurance holds for the overall population, not just for individuals in the at-risk group as in ZIP or ZINB. While the number of doctor visits is significantly affected by the health insurance type, it is not significantly affected by the add-on indicator: the expected number of doctor visits for individuals with add-on insurance is 3.8% (exp(0.037) − 1) higher than for those without, but the difference is not significant (p-value = 0.800 with 95% Wald-type confidence interval (0.780, 1.382)). We also noticed that both gender and age had a significant effect on health care utilization, with female and older individuals having more doctor visits. Health satisfaction had a significant negative effect: higher health satisfaction was associated with fewer doctor visits. Year is treated as a categorical variable with 1984 as the reference level, so all year coefficients are comparisons with 1984. Coefficients for 1989, 1990 and 1992-1995 are not available because health care utilization data are lacking for these years. As can be seen, 1986 and 1994 had significantly more physician visits than 1984.
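The ratios, Wald-type intervals and p-values quoted above can be recomputed from the estimates and standard errors in Table 1. The Python sketch below does this with a normal approximation; `wald_summary` is a hypothetical helper, and because its inputs are rounded to three decimals the results may differ from the reported values in the last digit.

```python
import math

def wald_summary(est, se, z=1.96):
    """Exponentiated estimate, 95% Wald-type interval and two-sided
    normal p-value for a coefficient on the log (or logit) scale."""
    ratio = math.exp(est)
    lower = math.exp(est - z * se)
    upper = math.exp(est + z * se)
    p_value = math.erfc(abs(est / se) / math.sqrt(2.0))  # 2 * (1 - Phi(|z|))
    return ratio, lower, upper, p_value

# Estimates and standard errors taken from the upper panel of Table 1.
public = wald_summary(0.190, 0.050)
addon = wald_summary(0.037, 0.146)
```

The same function applies to the zero-inflation component, where the exponentiated coefficient is an odds ratio for being an excess zero rather than a ratio of means.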

The lower panel of Table 1 lists the estimates of the logistic regression model for the probability of an excess zero, i.e., the zero-inflation component. Unlike the estimates in the marginal mean component, whose effects refer to the overall population rather than the at-risk group as in the ZIP or ZINB, the estimates in the logistic regression part are interpreted in the same way as in the traditional ZIP or ZINB model. For example, the estimated odds ratio of an excess zero for the public indicator is exp(−0.223) = 0.800 with 95% Wald-type confidence interval (0.688, 0.930), suggesting that the odds of being an excess zero, that is, of belonging to the non-risk group that does not utilize health care, are about 20% lower for individuals with public health insurance than for individuals covered by private insurance.
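The odds ratio and Wald-type interval quoted above can be recomputed from the reported estimate and standard error of the public indicator (−0.223 and 0.077). The following is a minimal sketch using only the Python standard library; the function name `wald_summary` is a hypothetical helper introduced here for illustration and is not part of the authors' software.

```python
import math
from statistics import NormalDist


def wald_summary(estimate, std_error, level=0.95):
    """Exponentiated estimate, Wald-type CI and two-sided p-value
    for a coefficient on the log or logit scale."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower = math.exp(estimate - z_crit * std_error)
    upper = math.exp(estimate + z_crit * std_error)
    z_stat = estimate / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return math.exp(estimate), (lower, upper), p_value


# Public indicator in the zero-inflation component (lower panel of Table 1).
ratio, (lo, hi), p = wald_summary(-0.223, 0.077)
print(f"OR = {ratio:.3f}, 95% CI = ({lo:.3f}, {hi:.3f}), p = {p:.4f}")
```

The same helper applies to the coefficients of the marginal mean component, where the exponentiated estimate is read as a ratio of expected counts rather than an odds ratio.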

6. Discussion

We proposed a new alternative for modeling marginalized effects for zero-inflated count responses that addresses limitations of available parametric methods. By taking advantage of the FRM, this approach provides joint inference about all parameters in the main module (consisting of a log-linear and a logistic regression) and the missing data module of the model. When implementing IPW in the context of GEE, a common approach is to fit the missing data module first and then use the inverse of the estimated probabilities as weights to make inference about the main module. Such a two-step procedure, although convenient to perform, may seriously underestimate standard errors of the estimated parameters for the main module, since it ignores the sampling variability of the estimated weights. Although it is possible to account for such variability in the inference for the main module, the procedure is quite complex [3]. In comparison, the proposed joint inference is much more straightforward to perform, and the resulting estimators remain consistent and asymptotically normal.
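To make the two-step IPW idea concrete, the toy sketch below estimates a marginal mean when the response is missing at random given an always-observed binary covariate. It is only an illustration under stated assumptions: the data are simulated with a Gaussian response for simplicity (not counts), the observation probabilities are estimated by group frequencies rather than a logistic regression, and all function names are hypothetical. It shows the point estimate only; it does not implement the joint FRM inference proposed in this paper, nor any variance correction.

```python
import random


def simulate_mar(n, seed=2019):
    """Return (x, y, observed) triples; y is missing with a probability
    that depends only on the always-observed covariate x (MAR)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.randint(0, 1)
        y = rng.gauss(1.0 + 2.0 * x, 1.0)   # E(y) = 0.5 * 1 + 0.5 * 3 = 2
        p_obs = 0.9 if x == 0 else 0.4      # larger responses are observed less often
        rows.append((x, y, rng.random() < p_obs))
    return rows


def complete_case_mean(rows):
    """Naive mean over observed responses only (biased under MAR)."""
    obs = [y for _, y, r in rows if r]
    return sum(obs) / len(obs)


def ipw_mean(rows):
    """Two-step IPW estimate of the marginal mean."""
    # Step 1 (missing data module): estimate P(observed | x) within each x group.
    p_hat = {}
    for g in (0, 1):
        flags = [r for x, _, r in rows if x == g]
        p_hat[g] = sum(flags) / len(flags)
    # Step 2 (main module): weight each observed response by 1 / p_hat.
    num = sum(y / p_hat[x] for x, y, r in rows if r)
    den = sum(1.0 / p_hat[x] for x, _, r in rows if r)
    return num / den


rows = simulate_mar(20000)
print("complete-case mean:", round(complete_case_mean(rows), 3))
print("IPW mean:          ", round(ipw_mean(rows), 3))
```

In this set-up the complete-case mean is pulled toward the group that is observed more often, while the weighted mean should be close to the true value of 2. A standard error computed in step 2 as if the weights were known would ignore the variability from step 1, which is the issue the joint inference is designed to avoid.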

Results from our simulation studies show that the approach performs well for longitudinal data under MAR, the most common missing data mechanism arising in practice, even when the missing rates are as high as 33%. The approach not only addresses a major limitation of current methods, namely their inability to handle missing data, but also offers several other advantages. First, compared to fully parametric models, the proposed model requires no specification of random effects to account for correlated responses. Distributional assumptions such as normality are difficult to validate, especially in the presence of missing data. Second, parametric models for longitudinal data such as the popular generalized linear mixed-effects models (GLMM) suffer from serious computational issues when applied to non-continuous responses, and even major software packages such as SAS continue to have trouble providing reliable inference for GLMM with binary responses [4, 5, 10, 34]. Inference for more complex, random-effect based models for longitudinal zero-inflated count responses is likely to be even more problematic.

In the simulation studies, the covariates $u_{it}$ and $v_{it}$ were set to be the same and time-invariant for convenience. The proposed method can handle time-varying covariates as well. In practice, one can use either expert knowledge or modern variable selection techniques, such as LASSO [27] or SCAD [8], to identify the appropriate set of covariates for each module. Chen et al. [7] proposed SCAD-based variable selection methods for zero-inflated count responses, which achieve simultaneous selection and estimation for all three components and can be applied to the current setting.

The hurdle model [25] is another popular approach to addressing excess zeros and overdispersion when modeling count responses, where a logit regression model is used to model the zero response and a truncated version of the log-linear model such as Poisson or Negative Binomial is used to model the positive counts. The FRM can also be applied to facilitate joint inference when modeling the zero and positive counts using GEE-like marginal models and this will be the focus of our future work.

Acknowledgments

Funding

Dr. Bo Zhang’s research was partially supported by the National Institutes of Health grant U24 AA026968 and the University of Massachusetts Center for Clinical and Translational Science grants UL1TR001453, TL1TR01454, and KL2TR01455.

Appendix A. Introduction of ZIP and ZINB

The zero-inflated Poisson regression model is defined by

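$$y_i \mid x_i \sim \text{i.d. } \mathrm{ZIP}(\rho_i, \eta_i), \quad \operatorname{logit}(\rho_i) = u_i^{\top}\beta_u, \quad \log(\eta_i) = v_i^{\top}\beta_v, \quad i = 1, \ldots, n, \tag{A1}$$

where $\operatorname{logit}(\rho) = \log\left(\frac{\rho}{1-\rho}\right)$ denotes the logit link and $\mathrm{ZIP}(\rho, \eta)$ denotes the ZIP distribution defined by

$$f_{\mathrm{ZIP}}(y \mid \rho, \eta) = \begin{cases} \rho f_0(0) + (1-\rho) f_P(0 \mid \eta) & \text{if } y = 0, \\ (1-\rho) f_P(y \mid \eta) & \text{if } y = 1, 2, \ldots, \end{cases} \tag{A2}$$

with $f_0(y)$ denoting a degenerate distribution centered at 0 and $f_P(y \mid \eta)$ the probability mass function of Poisson($\eta$) with mean $\eta$. In (A2), the Poisson probability at 0, $f_P(0 \mid \eta)$, is replaced by $\rho f_0(0) + (1-\rho) f_P(0 \mid \eta)$, with $\rho f_0(0) = \rho$, to account for structural zeros.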

Under ZIP(ρ, η), the mean and variance of the Poisson component are both equal to η. In many studies, the count response from the at-risk group may be overdispersed, causing invalid inference when the ZIP is applied to such data. By mixing the zero-centered degenerate distribution with the NB, the ZINB is a popular alternative that addresses this key limitation of the ZIP. By replacing the Poisson with the NB in (A1), we obtain the ZINB

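$$y_i \mid x_i \sim \text{i.d. } \mathrm{ZINB}(\rho_i, \eta_i, \tau), \quad \operatorname{logit}(\rho_i) = u_i^{\top}\beta_u, \quad \log(\eta_i) = v_i^{\top}\beta_v, \quad i = 1, \ldots, n,$$

where $\mathrm{ZINB}(\rho, \eta, \tau)$ is the ZINB distribution defined by

$$f_{\mathrm{ZINB}}(y \mid \rho, \eta, \tau) = \begin{cases} \rho f_0(0) + (1-\rho) f_{NB}(0 \mid \eta, \tau) & \text{if } y = 0, \\ (1-\rho) f_{NB}(y \mid \eta, \tau) & \text{if } y = 1, 2, \ldots, \end{cases}$$

with $f_{NB}(y \mid \eta, \tau)$ denoting the probability mass function of the NB distribution,

$$f_{NB}(y \mid \eta, \tau) = \frac{\Gamma(y+\tau)}{y!\,\Gamma(\tau)} \left(\frac{\eta/\tau}{1+\eta/\tau}\right)^{y} \left(\frac{1}{1+\eta/\tau}\right)^{\tau}, \quad \tau > 0, \quad y = 0, 1, 2, \ldots$$

As in the ZIP, ρ is the mixing probability of the ZINB mixture. Therefore, the observed zeros comprise both structural zeros from the non-risk group and random zeros from the at-risk group, and the group membership of any individual zero is unknown.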
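As a numerical check on the two mixture densities of this appendix, the sketch below evaluates them with the Python standard library, taking the NB to be parameterized by its mean η and size τ, and writing `rho` for the mixing probability. The function names and the parameter values (0.3, 2.0, 1.5) are illustrative assumptions, not quantities from the paper. The check confirms that each density sums to one and that the overall mean is (1 − ρ)η, which is the marginal mean targeted by the marginalized models.

```python
import math


def poisson_pmf(y, eta):
    """Poisson probability mass function with mean eta."""
    return math.exp(-eta + y * math.log(eta) - math.lgamma(y + 1))


def nb_pmf(y, eta, tau):
    """Negative binomial probability mass function with mean eta and size tau."""
    q = (eta / tau) / (1.0 + eta / tau)
    log_p = (math.lgamma(y + tau) - math.lgamma(y + 1) - math.lgamma(tau)
             + y * math.log(q) + tau * math.log(1.0 - q))
    return math.exp(log_p)


def zero_inflated_pmf(y, rho, base_pmf):
    """Mix a point mass at 0 (weight rho) with a count pmf (weight 1 - rho)."""
    structural = rho if y == 0 else 0.0
    return structural + (1.0 - rho) * base_pmf(y)


rho, eta, tau = 0.3, 2.0, 1.5   # illustrative values only


def zip_pmf(y):
    return zero_inflated_pmf(y, rho, lambda k: poisson_pmf(k, eta))


def zinb_pmf(y):
    return zero_inflated_pmf(y, rho, lambda k: nb_pmf(k, eta, tau))


for name, pmf in (("ZIP", zip_pmf), ("ZINB", zinb_pmf)):
    total = sum(pmf(y) for y in range(400))
    mean = sum(y * pmf(y) for y in range(400))
    print(f"{name}: total = {total:.6f}, mean = {mean:.6f}, (1 - rho) * eta = {(1 - rho) * eta:.6f}")
```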

Appendix B. Proof of Theorems

B.1. Proof of Theorem 1

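Consider the normalized estimating equations $\frac{1}{n} w_n = \frac{1}{n}\sum_{i=1}^{n} D_i V_i^{-1} S_i$; for notational brevity we continue to denote the normalized estimating equations by $w_n$. It follows from the iterated conditional expectation that $E(D_i V_i^{-1} S_i) = E[D_i V_i^{-1} E(S_i \mid x_i)] = 0$. Thus, the GEE is unbiased and the estimate $\hat{\beta}$ obtained as the solution to the equations is consistent [12].

By applying a Taylor series expansion to the GEE in (4), we have

$$w_n(\hat{\beta}) = w_n(\beta) + \left(\frac{\partial}{\partial \beta} w_n(\beta)\right)(\hat{\beta} - \beta) + o_p(1),$$

where $o_p(1)$ denotes the stochastic $o(1)$. Since $\hat{\beta}$ is the solution of $w_n = 0$, we have $w_n(\hat{\beta}) = 0$. Thus the above reduces to

$$0 = w_n(\beta) + \left(\frac{\partial}{\partial \beta} w_n(\beta)\right)(\hat{\beta} - \beta) + o_p(1).$$

Thus,

$$\sqrt{n}\, w_n = \left(-\frac{\partial}{\partial \beta} w_n\right)\sqrt{n}(\hat{\beta} - \beta) + o_p(1).$$

Solving for $\sqrt{n}(\hat{\beta} - \beta)$ yields

$$\sqrt{n}(\hat{\beta} - \beta) = \left(-\frac{\partial}{\partial \beta} w_n\right)^{-1}\sqrt{n}\, w_n + o_p(1).$$

According to the law of large numbers,

$$-\frac{\partial}{\partial \beta} w_n = -\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial}{\partial \beta} S_i\right) V_i^{-1} D_i^{\top} = \frac{1}{n}\sum_{i=1}^{n} D_i V_i^{-1} D_i^{\top} \to_p E\left(D_i V_i^{-1} D_i^{\top}\right) = B, \tag{B1}$$

where $\to_p$ denotes convergence in probability.

By applying the central limit theorem, we have

$$\sqrt{n}\, w_n = \frac{\sqrt{n}}{n}\sum_{i=1}^{n} D_i V_i^{-1} S_i \to_d N\left(0, E\left(D_i V_i^{-1} S_i S_i^{\top} V_i^{-1} D_i^{\top}\right)\right). \tag{B2}$$

Applying Slutsky’s theorem to (B1) and (B2), $\hat{\beta}$ is asymptotically normal with the asymptotic variance given by $\Sigma_\beta$ in Theorem 1, i.e.,

$$\sqrt{n}(\hat{\beta} - \beta) \to_d N(0, \Sigma_\beta), \quad B = E\left(D_i V_i^{-1} D_i^{\top}\right), \quad \Sigma_\beta = B^{-1} E\left(D_i V_i^{-1} S_i S_i^{\top} V_i^{-1} D_i^{\top}\right) B^{-1}.$$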
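The sandwich form of the asymptotic variance in Theorem 1 can be illustrated with a deliberately small example. The sketch below is a toy, not the authors' implementation: it assumes a single binary covariate, no intercept, a scalar β, a log link for the marginal mean, and a working variance equal to the mean, so that each summand of the estimating equation reduces to x_i(y_i − exp(βx_i)). Responses are simulated as zero-inflated Poisson counts whose marginal mean is exp(βx); all names (`rpois`, `simulate`, `fit_with_sandwich`) and parameter values are hypothetical.

```python
import math
import random


def rpois(rng, lam):
    """Poisson draw by Knuth's multiplication method (adequate for small means)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(n, beta, rho, seed=7):
    """Zero-inflated Poisson counts with marginal mean exp(beta * x)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.randint(0, 1)
        eta = math.exp(beta * x) / (1.0 - rho)   # at-risk mean, so (1 - rho) * eta = exp(beta * x)
        y = 0 if rng.random() < rho else rpois(rng, eta)
        data.append((x, y))
    return data


def fit_with_sandwich(data, max_iter=50, tol=1e-12):
    """Solve the scalar estimating equation by Newton's method and
    return the estimate with sandwich and model-based standard errors."""
    n = len(data)
    beta = 0.0
    for _ in range(max_iter):
        w = sum(x * (y - math.exp(beta * x)) for x, y in data) / n
        bread = sum(x * x * math.exp(beta * x) for x, _ in data) / n
        step = w / bread                 # Newton step, since dw/dbeta = -bread
        beta += step
        if abs(step) < tol:
            break
    bread = sum(x * x * math.exp(beta * x) for x, _ in data) / n              # estimate of B
    meat = sum((x * (y - math.exp(beta * x))) ** 2 for x, y in data) / n      # estimate of the middle term
    sandwich_se = math.sqrt(meat / bread ** 2 / n)
    model_se = math.sqrt(1.0 / bread / n)   # valid only if Var(y | x) equals the mean
    return beta, sandwich_se, model_se


beta_hat, sandwich_se, model_se = fit_with_sandwich(simulate(5000, beta=0.5, rho=0.3))
print(f"beta_hat = {beta_hat:.3f}, sandwich SE = {sandwich_se:.4f}, model-based SE = {model_se:.4f}")
```

Because the zero-inflated responses are overdispersed relative to the working Poisson variance, the sandwich standard error should exceed the model-based one here, while the point estimate stays close to the value of β used in the simulation.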

B.2. Proof of Theorem 2

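Since we jointly model the excess zeros, the marginal mean count and the missingness, the proof is almost the same as that of Theorem 1.

Again, consider the normalized $U_n = \frac{1}{n}\sum_{i=1}^{n} D_i V_i^{-1} \Delta_i S_i$. Let $\theta = (\beta^{\top}, \xi^{\top})^{\top}$. By applying a Taylor series expansion to the WGEE in (10), we have

$$\sqrt{n}(\hat{\theta} - \theta) = \left(-\frac{\partial}{\partial \theta} U_n\right)^{-1}\sqrt{n}\, U_n + o_p(1).$$

By the law of large numbers,

$$-\frac{\partial}{\partial \theta} U_n = -\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial}{\partial \theta} S_i\right) V_i^{-1} \Delta_i D_i^{\top} = \frac{1}{n}\sum_{i=1}^{n} D_i V_i^{-1} \Delta_i D_i^{\top} \to_p E\left(D_i V_i^{-1} \Delta_i D_i^{\top}\right),$$

where $\to_p$ denotes convergence in probability. By applying the central limit theorem, we have

$$\sqrt{n}\, U_n = \frac{\sqrt{n}}{n}\sum_{i=1}^{n} D_i V_i^{-1} \Delta_i S_i \to_d N\left(0, E\left(D_i V_i^{-1} \Delta_i S_i S_i^{\top} \Delta_i V_i^{-1} D_i^{\top}\right)\right).$$

Define $B = E\left(D_i V_i^{-1} \Delta_i D_i^{\top}\right)$; then

$$\sqrt{n}(\hat{\theta} - \theta) = B^{-1}\frac{\sqrt{n}}{n}\sum_{i=1}^{n} D_i V_i^{-1} \Delta_i S_i + o_p(1).$$

By applying Slutsky’s theorem to the above equation, $\hat{\theta}$ is asymptotically normal with the asymptotic variance given by $\Sigma_\theta$ in Theorem 2, i.e.,

$$\sqrt{n}(\hat{\theta} - \theta) \to_d N(0, \Sigma_\theta), \quad \text{where } \Sigma_\theta = B^{-1} E\left(D_i V_i^{-1} \Delta_i S_i S_i^{\top} \Delta_i V_i^{-1} D_i^{\top}\right) B^{-1}.$$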

Appendix C. Tables for Simulation Results

C.1. Simulation results under MZIP

Let $\beta = (\beta_u^{\top}, \beta_v^{\top})^{\top}$ denote the vector of true parameter values and $\hat{\beta} = (\hat{\beta}_u^{\top}, \hat{\beta}_v^{\top})^{\top}$ the corresponding estimates using the proposed method. The left panel of Table C1 shows the averaged estimates (Est), the percent relative median bias, defined as $100\% \times \mathrm{median}\{(\hat{\beta}_j - \beta_j)/\beta_j\}$ (Bias), the average of the asymptotic standard errors over the 1000 MC replicates (Asym.SE), the empirical standard errors based on the 1000 sets of parameter estimates (Emp.SE) and the coverage probability (CP) of 95% Wald-type confidence intervals constructed from the asymptotic standard errors, for study sample sizes n = 300, 500 and 1000. To save space, we report results only for the slopes.
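The summary measures reported in Tables C1 and C2 can be computed from a set of Monte Carlo replicates as in the following sketch. It assumes that each replicate supplies a point estimate and an asymptotic standard error for one parameter; the function name `summarize_replicates` and the three-replicate example are hypothetical and serve only to make the definitions concrete.

```python
from statistics import NormalDist, mean, median, stdev


def summarize_replicates(estimates, std_errors, truth, level=0.95):
    """Est, percent relative median bias, Asym.SE, Emp.SE and CP for one parameter."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # A Wald-type interval covers the truth when |estimate - truth| <= z * SE.
    covered = [abs(est - truth) <= z * se for est, se in zip(estimates, std_errors)]
    return {
        "Est": mean(estimates),
        "Bias": 100.0 * median([(est - truth) / truth for est in estimates]),
        "Asym.SE": mean(std_errors),
        "Emp.SE": stdev(estimates),
        "CP": 100.0 * sum(covered) / len(covered),
    }


# Three made-up replicates for a parameter whose true value is -0.2.
print(summarize_replicates([-0.23, -0.21, -0.19], [0.01, 0.01, 0.01], truth=-0.2))
```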

As can be seen from the left panel of Table C1, the estimated parameters were quite close to the true values, albeit with a small downward bias when n = 1000, with the percent relative median bias ranging from 0.035% to 0.360%. The asymptotic standard errors also matched their empirical counterparts quite well. The coverage probabilities of the 95% Wald-type confidence intervals were all near the nominal 95% level for the marginalized mean component when the sample size was relatively large. For the zero-inflation component, the coverage probability moved closer to the nominal level as the sample size increased. The zero-inflation component (i.e., the logistic regression component) performed less well, owing to the nature of binary outcomes.

C.2. Simulation results under MZINB

In this simulation study, we assess the performance of our proposed method under MZINB models. The left panel of Table C2 shows the performance of the proposed method under complete data, including the percent relative median bias, the average of the asymptotic standard errors over the 1000 MC replicates, the empirical standard errors based on the 1000 sets of parameter estimates and the coverage probability of 95% Wald-type confidence intervals. As seen, our proposed method still worked quite well when data were generated from the MZINB: the estimated parameters were very close to the true values, with the percent relative median bias ranging from 0.1% to 0.8%. The asymptotic standard errors also matched their empirical counterparts quite well and the coverage probabilities of the 95% Wald-type confidence intervals were all near the nominal 95% level. The proposed method thus appears to remain robust against overdispersion.

The results for the missing data case are shown in the right panel of Table C2 under the same set-up as in (11) for the MZIP. The approach again performed very well in the presence of missing data: the estimates, for both the main and missing data modules, were all close to the true values, with the percent relative median bias all less than 1%, and the coverage probabilities of the 95% Wald-type confidence intervals were all reasonably near the nominal 95% level. All estimated parameters in Table C2 had higher standard errors than their counterparts in Table C1 under the MZIP, for both the log-linear and logistic components, owing to the increased variability of the NB. The missing data module, however, still had standard errors comparable to its counterparts under the MZIP; that is, extra variability in the main module does not have much influence on the efficiency of the missing data module. Overall, our proposed method performs quite well with or without missing data and is robust to the underlying distribution.

Table C1.

Simulation results under MZIP. Based on 1000 simulation replicates. Est=Avg of the estimates; Bias=median of the percentage bias; Asym.SE=Asymptotical standard error; Emp.SE=Empirical standard error; CP=Coverage Probability

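| Parameter | n | Complete: Est | Complete: Bias | Complete: Asym.SE | Complete: Emp.SE | Complete: CP | Incomplete: Est | Incomplete: Bias | Incomplete: Asym.SE | Incomplete: Emp.SE | Incomplete: CP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Marginalized Mean Component | | | | | | | | | | | |
| βv1 = −0.2 | 300 | −0.200 | 0.127% | 0.013 | 0.014 | 94.2% | −0.200 | −0.218% | 0.014 | 0.014 | 93.3% |
| | 500 | −0.200 | −0.198% | 0.010 | 0.011 | 94.2% | −0.200 | −0.217% | 0.011 | 0.011 | 93.6% |
| | 1000 | −0.200 | −0.190% | 0.007 | 0.008 | 93.4% | −0.200 | −0.214% | 0.007 | 0.008 | 93.8% |
| βv2 = −0.4 | 300 | −0.400 | 0.007% | 0.013 | 0.014 | 93.1% | −0.400 | −0.084% | 0.014 | 0.014 | 94.7% |
| | 500 | −0.400 | −0.040% | 0.010 | 0.010 | 94.2% | −0.400 | −0.026% | 0.011 | 0.011 | 92.7% |
| | 1000 | −0.400 | −0.082% | 0.007 | 0.007 | 94.9% | −0.400 | 0.016% | 0.007 | 0.008 | 94.1% |
| Zero-Inflation Component | | | | | | | | | | | |
| βu1 = −0.5 | 300 | −0.504 | 0.515% | 0.040 | 0.042 | 93.0% | −0.505 | 0.568% | 0.043 | 0.044 | 94.5% |
| | 500 | −0.502 | 0.186% | 0.031 | 0.033 | 93.3% | −0.502 | 0.263% | 0.034 | 0.033 | 95.5% |
| | 1000 | −0.500 | 0.063% | 0.022 | 0.024 | 93.6% | −0.500 | −0.090% | 0.024 | 0.025 | 93.0% |
| βu2 = 0.3 | 300 | 0.300 | −0.943% | 0.035 | 0.038 | 91.9% | 0.300 | −0.182% | 0.037 | 0.039 | 92.7% |
| | 500 | 0.301 | 0.462% | 0.027 | 0.029 | 91.8% | 0.300 | −0.406% | 0.029 | 0.029 | 93.3% |
| | 1000 | 0.301 | 0.325% | 0.019 | 0.020 | 94.1% | 0.299 | −0.439% | 0.020 | 0.021 | 94.3% |
| Missing Component | | | | | | | | | | | |
| ξ1 = 0.5 | 300 | | | | | | 0.506 | −0.276% | 0.117 | 0.121 | 95.6% |
| | 500 | | | | | | 0.506 | 0.993% | 0.090 | 0.093 | 95.6% |
| | 1000 | | | | | | 0.500 | −0.249% | 0.063 | 0.065 | 94.9% |
| ξ2 = 1.5 | 300 | | | | | | 1.550 | 0.367% | 0.249 | 0.244 | 96.9% |
| | 500 | | | | | | 1.538 | 1.053% | 0.192 | 0.204 | 95.7% |
| | 1000 | | | | | | 1.523 | 0.621% | 0.131 | 0.139 | 95.3% |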

Table C2.

Simulation results under MZINB. Based on 1000 simulation replicates. Est=Avg of the estimates, Bias=median of the percentage bias; Asym.SE=Asymptotical standard error; Emp.SE = Empirical standard error; CP=Coverage Probability

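| Parameter | n | Complete: Est | Complete: Bias | Complete: Asym.SE | Complete: Emp.SE | Complete: CP | Incomplete: Est | Incomplete: Bias | Incomplete: Asym.SE | Incomplete: Emp.SE | Incomplete: CP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Marginalized Mean Component | | | | | | | | | | | |
| βv1 = −0.2 | 300 | −0.202 | 0.100% | 0.082 | 0.084 | 94.1% | −0.199 | −2.357% | 0.084 | 0.086 | 93.4% |
| | 500 | −0.199 | −0.273% | 0.064 | 0.065 | 94.4% | −0.200 | −0.746% | 0.065 | 0.068 | 93.9% |
| | 1000 | −0.200 | −0.261% | 0.046 | 0.045 | 95.2% | −0.198 | −1.328% | 0.047 | 0.048 | 93.9% |
| βv2 = −0.4 | 300 | −0.402 | −0.556% | 0.082 | 0.086 | 93.3% | −0.400 | −0.208% | 0.084 | 0.091 | 92.7% |
| | 500 | −0.399 | −0.013% | 0.065 | 0.066 | 94.1% | −0.397 | −0.538% | 0.066 | 0.069 | 94.0% |
| | 1000 | −0.400 | 0.372% | 0.046 | 0.047 | 94.6% | −0.399 | 0.112% | 0.047 | 0.048 | 94.8% |
| Zero-Inflation Component | | | | | | | | | | | |
| βu1 = −0.5 | 300 | −0.499 | −1.279% | 0.094 | 0.107 | 88.6% | −0.504 | 0.591% | 0.097 | 0.107 | 91.0% |
| | 500 | −0.501 | −0.410% | 0.075 | 0.084 | 89.4% | −0.505 | 0.593% | 0.079 | 0.086 | 90.7% |
| | 1000 | −0.499 | −0.894% | 0.056 | 0.062 | 89.6% | −0.501 | 0.095% | 0.058 | 0.062 | 93.7% |
| βu2 = 0.3 | 300 | 0.303 | −0.203% | 0.091 | 0.101 | 91.9% | 0.294 | −3.365% | 0.092 | 0.103 | 90.0% |
| | 500 | 0.300 | −0.981% | 0.071 | 0.077 | 92.9% | 0.300 | −1.319% | 0.074 | 0.081 | 92.9% |
| | 1000 | 0.301 | 0.884% | 0.053 | 0.052 | 95.0% | 0.298 | −1.853% | 0.053 | 0.056 | 93.2% |
| Missing Component | | | | | | | | | | | |
| ξ1 = 0.5 | 300 | | | | | | 0.509 | 1.418% | 0.112 | 0.116 | 94.9% |
| | 500 | | | | | | 0.504 | 0.402% | 0.086 | 0.085 | 95.3% |
| | 1000 | | | | | | 0.500 | −0.089% | 0.060 | 0.060 | 95.1% |
| ξ2 = 1.5 | 300 | | | | | | 1.562 | 1.683% | 0.272 | 0.272 | 95.0% |
| | 500 | | | | | | 1.558 | 1.065% | 0.212 | 0.242 | 95.0% |
| | 1000 | | | | | | 1.525 | 0.343% | 0.143 | 0.158 | 95.7% |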

Appendix D. Results of logistic regressions

Shown in Table D1 are the estimates of parameters from the two logistic regression models, along with their standard errors and corresponding p-values.

Table D1.

Estimates of logistic regression model for missingness at 3- and 6-months follow-up under MAR and MMDP.

| Predictors | Est. | S.E. | p-value |
| --- | --- | --- | --- |
| Assessment time t = 2 | | | |
| Intercept | −1.193 | 0.087 | < 0.001 |
| Female | −0.228 | 0.049 | < 0.001 |
| Age | 0.041 | 0.002 | < 0.001 |
| Prior response ($y_{i1}$) | −0.079 | 0.027 | 0.004 |
| Assessment time t = 3 | | | |
| Intercept | 0.488 | 0.138 | < 0.001 |
| Female | −0.186 | 0.073 | 0.011 |
| Age | 0.015 | 0.003 | < 0.001 |
| Prior response ($y_{i2}$) | 4.26 × 10^−6 | 0.04 | 1.000 |

References

  • [1] Albert JM, Wang W, and Nelson S, Estimating overall exposure effects for zero-inflated regression models with application to dental caries, Statistical Methods in Medical Research 23 (2014), pp. 257–278.
  • [2] Böhning D, Dietz E, Schlattmann P, Mendonca L, and Kirchner U, The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology, Journal of the Royal Statistical Society: Series A (Statistics in Society) 162 (1999), pp. 195–209.
  • [3] Chen R, Chen T, Lu N, Zhang H, Wu P, Feng C, and Tu X, Extending the Mann–Whitney–Wilcoxon rank sum test to longitudinal regression analysis, Journal of Applied Statistics 41 (2014), pp. 2658–2675.
  • [4] Chen T, Knox K, Arora J, Tang W, Kowalski J, and Tu X, Power analysis for clustered non-continuous responses in multicenter trials, Journal of Applied Statistics (2015), pp. 1–17.
  • [5] Chen T, Lu N, Arora J, Katz I, Bossarte R, He H, Xia Y, Zhang H, and Tu X, Power analysis for cluster randomized trials with binary outcomes modeled by generalized linear mixed-effects models, Journal of Applied Statistics (2015), pp. 1–15.
  • [6] Chen T, Kowalski J, Chen R, Wu P, Zhang H, Feng C, and Tu XM, Rank-preserving regression: a more robust rank regression model against outliers, Statistics in Medicine 35 (2016), pp. 3333–3346.
  • [7] Chen T, Wu P, Tang W, Zhang H, Feng C, Kowalski J, and Tu XM, Variable selection for distribution-free models for longitudinal zero-inflated count responses, Statistics in Medicine (2016).
  • [8] Fan J and Li R, Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association 96 (2001), pp. 1348–1360.
  • [9] Gunzler D, Tang W, Lu N, Wu P, and Tu X, A class of distribution-free models for longitudinal mediation analysis, Psychometrika 79 (2014), pp. 543–568.
  • [10] Huang L, Tang L, Zhang B, Zhang Z, and Zhang H, Comparison of different computational implementations on fitting generalized linear mixed-effects models for repeated count measures, Journal of Statistical Computation and Simulation 86 (2016), pp. 2392–2404.
  • [11] Kassahun W, Neyens T, Molenberghs G, Faes C, and Verbeke G, Marginalized multilevel hurdle and zero-inflated models for overdispersed and correlated count data with excess zeros, Statistics in Medicine 33 (2014), pp. 4402–4419.
  • [12] Kowalski J and Tu XM, Modern Applied U-Statistics, Vol. 714, John Wiley & Sons, 2008.
  • [13] Leann Long D, Preisser JS, Herring AH, and Golin CE, A marginalized zero-inflated Poisson regression model with random effects, Journal of the Royal Statistical Society: Series C (Applied Statistics) 64 (2015), pp. 815–830.
  • [14] Liang KY, Zeger SL, and Qaqish B, Multivariate regression analyses for categorical data, Journal of the Royal Statistical Society: Series B (Methodological) (1992), pp. 3–40.
  • [15] Little RJA and Rubin DB, Statistical Analysis with Missing Data, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley & Sons Inc., New York, 1987.
  • [16] Liu X, Zhang B, Tang L, Zhang Z, Zhang N, Allison JJ, Srivastava DK, and Zhang H, Are marginalized two-part models superior to non-marginalized two-part models for count data with excess zeroes? Estimation of marginal effects, model misspecification, and model selection, Health Services and Outcomes Research Methodology (2018), pp. 1–40.
  • [17] Long DL, Preisser JS, Herring AH, and Golin CE, A marginalized zero-inflated Poisson regression model with overall exposure effects, Statistics in Medicine 33 (2014), pp. 5151–5165.
  • [18] Lu N, Chen T, Wu P, Gunzler D, Zhang H, He H, and Tu X, Functional response models for intraclass correlation coefficients, Journal of Applied Statistics 41 (2014), pp. 2539–2556.
  • [19] Lu N, White A, Wu P, He H, Hu J, Feng C, and Tu X, Social network endogeneity and its implications for statistical and causal inferences, in Social Networking: Recent Trends, Emerging Issues and Future Outlook, edited by Lu N, White AM and Tu XM, Nova Science, 2013.
  • [20] Nelsen RB, An Introduction to Copulas, Springer Science & Business Media, 2007.
  • [21] Preisser JS, Das K, Long DL, and Divaris K, Marginalized zero-inflated negative binomial regression with application to dental caries, Statistics in Medicine 35 (2016), pp. 1722–1735.
  • [22] Prentice RL and Zhao LP, Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses, Biometrics 47 (1991), pp. 825–839.
  • [23] Riphahn RT, Wambach A, and Million A, Incentive effects in the demand for health care: a bivariate panel count data estimation, Journal of Applied Econometrics 18 (2003), pp. 387–405.
  • [24] Robins JM, Rotnitzky A, and Zhao LP, Analysis of semiparametric regression models for repeated outcomes in the presence of missing data, Journal of the American Statistical Association 90 (1995), pp. 106–121.
  • [25] Rose CE, Martin SW, Wannemuehler KA, and Plikaytis BD, On the use of zero-inflated and hurdle models for modeling vaccine adverse event count data, Journal of Biopharmaceutical Statistics 16 (2006), pp. 463–481.
  • [26] Tang W, Lu N, Chen T, Wang W, Gunzler DD, Han Y, and Tu XM, On performance of parametric and distribution-free models for zero-inflated and over-dispersed count responses, Statistics in Medicine 34 (2015), pp. 3235–3245.
  • [27] Tibshirani R, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B 58 (1996), pp. 267–288.
  • [28] Tsiatis AA, Semiparametric Theory and Missing Data, Springer Series in Statistics, Springer, New York, 2006.
  • [29] Wang C, Marriott P, and Li P, Semiparametric inference on the means of multiple nonnegative distributions with excess zero observations, Journal of Multivariate Analysis 166 (2018), pp. 182–197.
  • [30] Wooldridge JM, Econometric Analysis of Cross Section and Panel Data, MIT Press, 2010.
  • [31] Wu P, Gunzler D, Lu N, Chen T, Wymen P, and Tu X, Causal inference for community-based multi-layered intervention study, Statistics in Medicine 33 (2014), pp. 3905–3918.
  • [32] Wu P, Han Y, Chen T, and Tu X, Causal inference for Mann–Whitney–Wilcoxon rank sum and other nonparametric statistics, Statistics in Medicine 33 (2014), pp. 1261–1271.
  • [33] Yu Q, Chen R, Tang W, He H, Gallop R, Crits-Christoph P, Hu J, and Tu X, Distribution-free models for longitudinal count responses with overdispersion and structural zeros, Statistics in Medicine 32 (2013), pp. 2390–2405.
  • [34] Zhang H, Lu N, Feng C, Thurston SW, Xia Y, Zhu L, and Tu XM, On fitting generalized linear mixed-effects models for binary responses using different statistical packages, Statistics in Medicine 30 (2011), pp. 2562–2572.
  • [35] Zhang H, Tang L, Kong Y, Chen T, Liu X, Zhang Z, and Zhang B, Distribution-free models for latent mixed population responses in a longitudinal setting with missing data, Statistical Methods in Medical Research (2018).
