Summary
Zero-inflated regression models have emerged as a popular tool within the parametric framework to characterize count data with excess zeros. Despite their increasing popularity, much of the literature on real applications of these models has centered around the latent class formulation where the mean response of the so-called at-risk or susceptible population and the susceptibility probability are both related to covariates. While this formulation in some instances provides an interesting representation of the data, it often fails to produce easily interpretable covariate effects on the overall mean response. In this paper, we propose two approaches that circumvent this limitation. The first approach consists of estimating the effect of covariates on the overall mean from the assumed latent class models, while the second approach formulates a model that directly relates the overall mean to covariates. Our results are illustrated by extensive numerical simulations and an application to an oral health study on low income African-American children, where the overall mean model is used to evaluate the effect of sugar consumption on caries indices.
Keywords: Caries research, Latent class models, Marginal mean, Overall covariate effects, Overdispersion, Zero-inflated data
1. Introduction
Zero-inflated (ZI) regression models, which view data as being generated from a mixture of a point mass at zero and a non-degenerate distribution, have become a popular and interesting tool within the parametric framework to analyze count data with excessive zeros. Well known applications of these models include the works of Mullahy (1986), Farewell and Sprott (1988), Lambert (1992), Ridout et al. (1998), Böhning et al. (1999), Hall (2000), Gilthorpe et al. (2009) and references therein. Despite their increasing popularity, most of the generic applications of ZI models in statistical practice focus primarily on regression models that relate the mean response of the so-called at-risk or susceptible population and the susceptibility probability to covariates. Although this latent class formulation in some settings provides a versatile and useful representation of the data, the implied parameterization may fail to provide a clear answer to the question of evaluating the covariate effects on the marginal mean response. By marginal mean, we refer to the overall mean obtained by averaging the latent mean response across the distribution of the susceptibility status, regardless of covariates. This mean, even when there is a sizeable frequency of zeros in the data, is often the target of inference in many clinical trials and observational studies especially when non-susceptibility is scientifically obscured or implausible. And for this reason, many analyses involving ZI models tend to misinterpret the covariate effects on the mean of the susceptible population as effects on the overall mean (Preisser et al., 2012).
The work proposed in this paper is motivated by data generated from a unique oral health study on low-income inner city African-American children under the age of six and their main caregivers residing in Detroit, Michigan. Our primary interest for these data is to evaluate the effect of sugar intake on caries indices in primary dentition, possibly adjusting for important confounders such as age. The dental caries literature has consistently indicated that sugar consumption remains an important modifiable risk factor for dental caries prevention, although its effect is not as strong as it used to be in the pre-fluoride era (Burt and Pai, 2001; Tellez et al., 2006; and Anderson et al., 2009). Evaluating this effect for medically underserved children, who are prone to extensive caries and excessive sugar consumption, would be helpful in formulating a tailored dental caries prevention policy. Because caries susceptibility may not be fully observed, restricting this evaluation to children presumed susceptible to caries would be hard to interpret from a policy formulation standpoint.
The literature has been fairly silent about the usefulness of ZI regression models in evaluating the overall covariate effects on the marginal mean response. An important contribution was recently made by Albert et al. (2011) who proposed, in the context of binary exposures, the so-called average predicted value (APV) by integrating out the confounding variables from the model-predicted response for each subject. This approach is interesting but has some limitations. Although the APV operates on the marginal mean, which is often of primary interest, the numerical integration involved can be computationally very intensive for moderate to high dimensional confounding variables, even when these confounders do not interact with the exposure variable. Most importantly, its extension to continuous exposures, such as sugar intake in our motivating data example, is not trivial. To address these limitations, these authors also proposed an approach for evaluating the exposure effect on the marginal mean by relating the susceptibility probability to covariates using the log link function. This approach, which they referred to as the log-log approach, has the key advantage that it provides a direct interpretation of the effects of covariates on the marginal mean response. A limitation, however, is that the log link may lead to unstable computations and inconsistent estimates, owing to the obvious constraints imposed on probabilities.
In this article, we propose two strategies for evaluating the effect of covariates on the marginal mean response, which merely use the heterogeneity implied by the ZI models as a device to account for extra zeros in the data. The first approach derives the covariate effect on the overall mean response from the estimates of the models relating the latent mean response and the class membership probability to covariates. The second strategy, which we refer to as the direct approach, consists of formulating a regression model that relates the marginal mean response to covariates. Under this model formulation, the regression model relating the latent mean to covariates is implied by the assumed models for the class membership probability and the marginal mean. This second approach generalizes the marginalized zero-inflated Poisson regression model recently proposed by Long et al. (2014) to any count data with excess zeros. Although this extension may appear conceptually modest, the estimation may require additional programming efforts beyond those encountered in ZI Poisson models.
In Section 2, we give a brief description of ZI models and give details on the derived and the direct methods to estimate the overall effects of covariates on the marginal mean response. We conduct simulation studies to evaluate the finite sample performance of these methods and illustrate their practical utility using data from the Detroit oral health study in Section 3. We conclude with some remarks and discussions in Section 4.
2. The method
Suppose we randomly select a sample of n independent subjects with response counts Yi, i = 1, …, n, from a population which can be well represented by a ZI model. Under this model, the population is viewed as a mixture of susceptible and non-susceptible subjects, but the susceptibility status Si, taking value 1 if subject i is susceptible to the event of interest and 0 otherwise, is not fully observed. For each subject i, this heterogeneity manifests itself through the probability mass function (pmf) of Yi, given a covariate vector Wi,
$$\Pr(Y_i = y \mid W_i) = (1 - \pi_i)\,\delta_{0y} + \pi_i\, f_i(y \mid W_i).$$
Here y is an observable count, πi = Pr(Si = 1|Wi) is the susceptibility or at-risk probability, δ0y is the Kronecker delta taking value 1 if y = 0 and 0 otherwise, and fi(y|Wi) = Pr(Yi = y|Si = 1; Wi) is the pmf for a susceptible subject indexed, possibly, by a finite dimensional parameter. Letting μi = E(Yi|Si = 1; Wi) be the mean response for a susceptible subject, the marginal mean response E(Yi|Wi) is obtained by averaging the latent mean response E(Yi|Si; Wi) = Siμi over the distribution of Si, yielding E(Yi|Wi) = πiμi.
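As a minimal numerical sketch of the mixture pmf and of the identity E(Yi|Wi) = πiμi, the Python snippet below takes fi to be a negative binomial pmf with mean μ and variance μ + κμ². This parameterization of κ, the function names, and the numerical values are illustrative assumptions and are not taken from the paper.

```python
import math

def nb_pmf(y, mu, kappa):
    """Negative binomial pmf with mean mu and variance mu + kappa * mu**2
    (assumed parameterization of the dispersion kappa)."""
    r = 1.0 / kappa                      # "size" parameter
    p = r / (r + mu)                     # success probability
    log_pmf = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
               + r * math.log(p) + y * math.log1p(-p))
    return math.exp(log_pmf)

def zi_pmf(y, pi, mu, kappa):
    """Zero-inflated pmf: (1 - pi) * 1{y = 0} + pi * f(y)."""
    point_mass = (1.0 - pi) if y == 0 else 0.0
    return point_mass + pi * nb_pmf(y, mu, kappa)

# Hypothetical values of the at-risk probability, latent mean and dispersion
pi, mu, kappa = 0.8, 3.0, 0.5
support = range(0, 2000)                 # truncation of the infinite support
total_mass = sum(zi_pmf(y, pi, mu, kappa) for y in support)
marginal_mean = sum(y * zi_pmf(y, pi, mu, kappa) for y in support)

print(total_mass)      # should be numerically 1
print(marginal_mean)   # should be numerically pi * mu
```

The truncated sums illustrate that the point mass at zero leaves the total mass at one while scaling the mean of the susceptible class by πi.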
In real applications of ZI models, μi and πi are related to, potentially different, subsets of Wi through regression models coupled with conventional link functions (Lambert, 1992 and Gilthorpe et al., 2009). Parameter estimates from these regression models often have the so-called latent class interpretation, and in some settings can also be interpretable vis-a-vis the marginal mean. This is especially true when πi is constant or varies with covariates that are not of interest. In general, however, the interpretation of covariate effects on the marginal mean using individual regression models for μi and πi is often not trivial, especially when these terms contain the exposure of interest. This limitation may preclude direct use of the latent class model formulation in practical settings.
In this paper, we aim at evaluating the effects of covariates on the overall mean E(Yi|Wi), representing the target of inference. We propose two strategies, which rely on the trivial relation E(Yi|Wi) = πiμi, to achieve this aim. The first approach provides estimates of covariate effects on the overall mean from individual estimates of μi and πi under the latent model formulation. In contrast, the second approach directly models the marginal mean E(Yi|Wi) and the mixing weight πi as functions of known covariates that are linear on the scale of the link functions, whereas the regression model relating the latent mean μi to covariates is not directly specified but is implied by the trivial relation E(Yi|Wi) = πiμi. The core of estimation for these methods is based on the following joint pmf for observed outcomes y = (y1, …, yn) given covariates w = (w1, …, wn)
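$$\Pr(Y = y \mid w) = \prod_{i=1}^{n} \left\{ (1 - \pi_i)\,\delta_{0 y_i} + \pi_i\, f_i(y_i \mid w_i) \right\}. \tag{1}$$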
With a proper specification of fi(yi|wi), the estimation then proceeds by maximizing, preferably on the log scale, this joint pmf viewed as a function of finite dimensional parameters.
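The log of this joint pmf can be evaluated in a few lines. The sketch below is a hedged illustration only: it takes fi to be Poisson for brevity (the paper allows any count distribution), and the function names and toy data are hypothetical.

```python
import math

def poisson_logpmf(y, mu):
    """Log pmf of a Poisson count (used here only as a simple choice of f_i)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def zi_loglik(y, pi, log_f):
    """Log joint pmf: sum_i log{(1 - pi_i) * 1{y_i = 0} + pi_i * f_i(y_i)}."""
    total = 0.0
    for yi, pii, log_fi in zip(y, pi, log_f):
        if yi == 0:
            # zeros come from either the point mass or the at-risk class
            total += math.log((1.0 - pii) + pii * math.exp(log_fi))
        else:
            # the point mass contributes nothing when y_i > 0
            total += math.log(pii) + log_fi
    return total

# Toy data with constant (hypothetical) pi and mu
y = [0, 0, 3, 1, 0, 5]
pi = [0.7] * len(y)
mu = [2.0] * len(y)
log_f = [poisson_logpmf(yi, mi) for yi, mi in zip(y, mu)]
ll = zi_loglik(y, pi, log_f)
print(ll)
```

In practice πi and the parameters of fi would be functions of regression coefficients, and the negative of `zi_loglik` would be handed to a numerical optimizer.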
2.1 Derived marginal models
Suppose that the following regression models relating the latent mean and the at-risk probability to covariates using standard link functions are entertained,
$$\log\{\mu_i\} = \alpha' V_i, \qquad \operatorname{logit}\{\pi_i\} = \gamma' Z_i, \tag{2}$$
where Vi = (1, v1i, …, vr−1,i)′ and Zi = (1, z1i, …, zq−1,i)′ are respectively an r × 1 vector and a q × 1 vector of covariates, both subsets of Wi, and α and γ are the associated vectors of unknown regression coefficients. Vectors Vi and Zi may share common components and, because they are subsets of Wi, the basic regression models in (2) assume that the effects of some components of Wi on these latent quantities may be zero. The log and logit link functions are assumed here, but any monotone link function should be applicable in principle.
Assume that the maximum likelihood estimates (MLEs) α̂ and γ̂ of α and γ are obtained by maximizing the joint pmf in (1). To derive the overall effect of covariates on the marginal mean response from these estimates of the conventional models in (2), suppose for example that there exists an unspecified column vector Xi of dimension p related to the marginal mean as follows,
$$\log\{E(Y_i \mid W_i)\} = \beta' X_i, \tag{3}$$
where β is the unknown parameter vector. Unlike the model formulation in (2) where Vi and Zi are directly observable, the specification of Xi under the mean regression model in (3) is somewhat dictated by the working regression models for μi and πi. This requires Xi to be expressed in terms of covariates Vi and Zi, at least approximately. A simple algebraic calculation shows that the relation between the marginal mean E(Yi|Wi), the latent mean μi, and the at-risk probability πi given the respective working regression models, can be expressed as
$$\log\{E(Y_i \mid W_i)\} = \log\{\mu_i\} + \log\{\pi_i\} = \alpha' V_i - \log\{1 + \exp(-\gamma' Z_i)\}.$$
Because α′Vi is linear in parameters, a simple approach for selecting Xi would be to linearize gγ(Zi) = log{1 + exp{−γ′Zi}} in the parameters, using a Taylor expansion of gγ(Zi) around E(Zi). As an example, the second order Taylor expansion
$$g_\gamma(Z_i) \approx g_\gamma\{E(Z_i)\} + \nabla g_\gamma\{E(Z_i)\}'\, h_i + \tfrac{1}{2}\, h_i'\, \nabla^2 g_\gamma\{E(Z_i)\}\, h_i,$$
where hi = Zi − E(Zi), and ∇gγ(.) and ∇²gγ(.) are the gradient and Hessian matrix of gγ(.), would include in Xi all unique elements in Vi, and the first order (linear) and second order (quadratic and interaction) terms of Zi. A first order Taylor approximation of gγ(Zi) around E(Zi) would only include the unique elements in Vi and Zi for the choice of Xi. Naturally, if Zi only contains one dummy 0-1 covariate, the Taylor approximation is not necessary.
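To give a feel for the quality of these approximations, the following Python sketch compares gγ(z) = log{1 + exp(−γ0 − γ1z)} with its first and second order Taylor expansions for a single scalar covariate. With π0 the at-risk probability at the expansion point, the derivative terms used are −(1 − π0)γ1 and π0(1 − π0)γ1². The coefficient values and the expansion point are hypothetical.

```python
import math

def g(z, gamma0, gamma1):
    """g_gamma(z) = log{1 + exp(-(gamma0 + gamma1 * z))} = -log(pi(z))."""
    return math.log1p(math.exp(-(gamma0 + gamma1 * z)))

def taylor_g(z, z0, gamma0, gamma1, order):
    """First or second order Taylor expansion of g around z0 (scalar covariate)."""
    pi0 = 1.0 / (1.0 + math.exp(-(gamma0 + gamma1 * z0)))
    h = z - z0
    approx = g(z0, gamma0, gamma1) - (1.0 - pi0) * gamma1 * h       # gradient term
    if order == 2:
        approx += 0.5 * pi0 * (1.0 - pi0) * gamma1 ** 2 * h ** 2    # Hessian term
    return approx

# Hypothetical coefficients; z0 plays the role of E(Z)
gamma0, gamma1, z0 = 1.5, -0.2, 0.0
errors = []
for z in (-1.0, 0.5, 2.0):
    exact = g(z, gamma0, gamma1)
    err1 = abs(exact - taylor_g(z, z0, gamma0, gamma1, order=1))
    err2 = abs(exact - taylor_g(z, z0, gamma0, gamma1, order=2))
    errors.append((z, err1, err2))
    print(f"z={z:+.1f}  exact={exact:.5f}  err1={err1:.2e}  err2={err2:.2e}")
```

For these values the second order error is smaller than the first order error at each z, which is the trade-off between simplicity of Xi and accuracy mentioned above.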
With Xi specified, we focus on estimating β. We adopt the following notation. Let mi(α, γ) denote the marginal mean on the log scale, log{πi(γ)μi(α)}, where πi(γ) and μi(α) highlight the dependence of πi and μi on γ and α, and let X = (X1, …, Xn)′ and m(α, γ) = (m1(α, γ), …, mn(α, γ))′ respectively be a matrix of dimension n × p and a column vector of dimension n. Under the working independence assumption of elements of m(α̂, γ̂), a consistent estimate β̂der of the unknown β, obtained by minimizing the sum of squared deviations (Xβ − m(α̂, γ̂))′(Xβ − m(α̂, γ̂)), is
$$\hat\beta_{der} = (X'X)^{-1} X'\, m(\hat\alpha, \hat\gamma),$$
with associated variance-covariance matrix cov(β̂der) = (X′X)⁻¹X′cov{m(α̂, γ̂)}X(X′X)⁻¹, where cov{m(α̂, γ̂)} = [cov{mi(α̂, γ̂), mj(α̂, γ̂)}], i, j ∈ {1, …, n}. The matrix cov{m(α̂, γ̂)} can be approximated using the delta method or any resampling technique, including the bootstrap. Details of these calculations for the delta method are given in the Appendix.
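The least squares step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the "fitted" coefficients simply reuse the simulation values of Section 3.1, the covariate grid is hypothetical, Xi is taken from a first order expansion as (1, v1i, v2i)′, and no standard errors are computed.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def derived_beta(X, m):
    """Unweighted least squares: beta = (X'X)^{-1} X'm."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xtm = [sum(row[a] * mi for row, mi in zip(X, m)) for a in range(p)]
    return solve(XtX, Xtm)

# Hypothetical "fitted" latent class coefficients (alpha for log mu, gamma for logit pi)
alpha = (1.5, -0.5, -0.1)
gamma = (1.5, -0.5, -0.2)

X, m = [], []
for v1 in (0, 1):                        # binary exposure
    for k in range(-20, 21):             # confounder on a grid from -2 to 2
        v2 = k / 10.0
        log_mu = alpha[0] + alpha[1] * v1 + alpha[2] * v2
        log_pi = -math.log1p(math.exp(-(gamma[0] + gamma[1] * v1 + gamma[2] * v2)))
        X.append([1.0, float(v1), v2])
        m.append(log_mu + log_pi)        # m_i = log{pi_i * mu_i}

beta_der = derived_beta(X, m)
mean_ratio = math.exp(beta_der[1])       # derived mean ratio, exposed vs unexposed
print(beta_der, mean_ratio)
```

The exposure coefficient combines the effect on the latent mean with the effect on the log at-risk probability, which is why it differs from α1 alone.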
2.2 Direct Marginal Models
We formulate a ZI regression model that directly relates the marginal mean E(Yi|Wi), the desired target of inference, to covariates. We consider the marginal mean model,
$$\log\{E(Y_i \mid W_i)\} = \beta' X_i,$$
where Xi = (1, x1i, …, xp−1,i)′ is a p × 1 vector and a subset of Wi, and β is a vector of unknown regression coefficients that directly captures the effects of covariates on the overall mean. It is worth noting that, unlike the model formulation in (3) where Xi is dictated by the working models of μi and πi, here Xi represents the vector of covariates that are directly observable. To describe heterogeneity in the population, we assume that the unobserved latent variable Si is a Bernoulli variable with success probability πi = Pr(Si = 1|Wi) related to Zi = (1, z1i, …, zq−1,i)′, a subset of Wi, as follows,
$$\operatorname{logit}\{\pi_i\} = \gamma' Z_i,$$
where γ is a vector of unknown regression coefficients. In the current model formulation, ηi = log{E(Yi|Si = 1; Wi)} describing the mean response for a susceptible subject is not directly modeled as a linear function of covariates as in (2) but is related to covariates through the trivial relation ηi = β′Xi − log{πi} = β′Xi + log{1 + exp(−γ′Zi)}. Estimates of β, γ and other finite dimensional parameters defining fi(.) can be obtained by maximizing the joint pmf in (1) viewed as a function of parameters. We denote by β̂dir the MLE of β.
We refer to this formulation as the marginal log-logit zero-inflated regression model for count data. This marginal regression model is conceptually similar to the formulation proposed by Heagerty (1999) in the context of logistic regression models with random effects. In the current formulation, the unobserved class membership plays the role of that author's random effects, which model the within-subject association. The marginalized pattern-mixture model for informative missing data proposed by Wilkins and Fitzmaurice (2007) also shares some similarities with this model formulation. However, unlike in this model where the latent means are averaged across unobserved variables, their marginalized model is averaged over observed missing data patterns. As stated in Section 1, this model is a generalization of the marginalized zero-inflated Poisson regression model recently proposed by Long et al. (2014). It is generally applicable to any ZI regression model for which the associated non-degenerate function fi(.) is a smooth function that decays rapidly at infinity with some degree of uniformity (see for example Preisser et al., 2015).
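A minimal sketch of the marginal log-logit formulation, assuming a negative binomial fi with variance μ + κμ², is given below. The latent mean is never modeled directly; it is recovered as μi = exp(β′Xi)/πi. Function names, the κ parameterization, and the toy data are illustrative assumptions.

```python
import math

def nb_logpmf(y, mu, kappa):
    """Negative binomial log pmf with mean mu and variance mu + kappa * mu**2 (assumed)."""
    r = 1.0 / kappa
    p = r / (r + mu)
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(p) + y * math.log1p(-p))

def implied_latent(beta, gamma, x, z):
    """Return (pi, mu) implied by log E(Y) = beta'x and logit(pi) = gamma'z."""
    log_marginal_mean = sum(b * xi for b, xi in zip(beta, x))
    eta_pi = sum(g * zi for g, zi in zip(gamma, z))
    pi = 1.0 / (1.0 + math.exp(-eta_pi))
    mu = math.exp(log_marginal_mean) / pi     # latent mean: mu = E(Y) / pi
    return pi, mu

def marginal_zinb_loglik(beta, gamma, kappa, y, X, Z):
    """Log joint pmf of the marginal log-logit ZINB model."""
    ll = 0.0
    for yi, xi, zi in zip(y, X, Z):
        pi, mu = implied_latent(beta, gamma, xi, zi)
        log_f = nb_logpmf(yi, mu, kappa)
        if yi == 0:
            ll += math.log((1.0 - pi) + pi * math.exp(log_f))
        else:
            ll += math.log(pi) + log_f
    return ll

# Toy data: intercept plus one binary covariate, shared by both model parts
y = [0, 2, 0, 5, 1, 0]
X = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
Z = X
beta, gamma, kappa = [1.0, -0.5], [1.5, -0.5], 0.5   # hypothetical parameter values
ll = marginal_zinb_loglik(beta, gamma, kappa, y, X, Z)
pi0, mu0 = implied_latent(beta, gamma, X[0], Z[0])
print(ll, pi0 * mu0)
```

Maximizing `marginal_zinb_loglik` over (β, γ, κ) with a general purpose optimizer would give the direct estimate; exp of a component of β is then read as a ratio of overall means.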
2.3 The relative merit of the derived and direct marginal models
For any of the marginally specified mean models, β is interpreted as contrasting the log mean for subgroups defined by measured covariates. In addition to this marginal interpretation of covariate effects, an appealing and interesting feature of the derived approach is that it also allows a latent class interpretation (through α) of covariate effects. This is scientifically valuable in situations where the scientist is not only interested in conducting inferences on variables that affect the mean response for subjects at risk but also variables that affect the overall mean response. The direct approach, however, focuses primarily on the marginal mean E(Yi|Wi), while treating the latent components E(Yi|Si = 1; Wi) and Pr(Si = 1|Wi) which describe how heterogeneity arises in the data as nuisance. But from a technical standpoint, because the parameters in this approach are directly estimated from a likelihood, β̂dir will enjoy well known desirable asymptotic properties associated with MLE, compared to β̂der obtained through the unweighted least squares method. Finally, it is worth mentioning that the degree of accuracy of the derived approach depends primarily on the order of the Taylor expansion. This may constitute a trade-off between the ease of interpretation and the degree of approximation of the derived marginal mean to the fitted mean πi (γ̂)μi(α̂) from the fitted latent class regression models for πi and μi.
3. Numerical studies
3.1 Simulation studies
We conduct a numerical study to evaluate the finite sample performance of estimated covariate effects on the marginal mean using both the derived and the direct approaches. The results of this evaluation are then compared to those of the APV approach of Albert et al. (2011), which we briefly describe. Suppose that v1i is a binary exposure taking value 1 if subject i is exposed and 0 otherwise, and v2i a potential confounder. The APV method compares the overall means of a subject under exposed and unexposed conditions with the confounder integrated out. Because it relies on the integrated mean, we will focus our investigation on the behavior of the estimate of the mean ratio (MR) for binary exposures
$$\mathrm{MR} = \frac{E(Y_i \mid v_{1i} = 1)}{E(Y_i \mid v_{1i} = 0)},$$
where E(Yi|v1i) = ∫v2i E(Yi|v1i, v2i)dF(v2i), v1i is deterministic (v1i = 1, i ≤ [n/2]; v1i = 0, i > [n/2]) and v2i is generated from a standard normal distribution F. Two data generating schemes based on ZINB models are considered. Given covariates v1i and v2i, Yi is generated first with the latent mean model log{μi} = 1.5 − 0.5v1i − 0.1v2i, and second with the marginal mean model log{E(Yi|v1i, v2i)} = 1.5 − 0.5v1i − 0.1v2i. Both schemes set the dispersion parameter κ to 0.5 and relate πi to covariates using the model logit{πi} = 1.5 − 0.5v1i − 0.2v2i. Throughout our simulations, we compute the estimate of the MR using the derived approach, the direct approach, and the APV. Specifically, the APV and the derived estimates are computed using the working regression models log{μi} = α0 + α1v1i + α2v2i and logit{πi} = γ0 + γ1v1i + γ2v2i. The MR estimate from the marginal log-logit model (direct approach) is obtained using the working model log{E(Yi|v1i, v2i)} = β0 + β1v1i + β2v2i. Estimates of the true MR are exp{β̂1,der} and exp{β̂1,dir}, respectively for the derived and the direct method. The APV estimate, in contrast, is computed by integrating out the confounder v2i from the fitted marginal mean πi(γ̂)μi(α̂) predicted from individual models of πi and μi. The three estimation methods of the mean ratio are compared according to the estimated mean ratio (EMR), the relative bias (RB) in percentage, the mean squared error (MSE), and the 95% coverage probability (CP) of Wald confidence intervals of the true mean ratio. Finally, all simulations are replicated 1,000 times and for sample sizes varying from 50 to 1000.
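The true mean ratios under the two generating schemes can be checked numerically. The sketch below integrates the confounder out against a standard normal density with a simple trapezoidal rule; the integration limits, step count, and function names are choices made here for illustration.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def std_normal_pdf(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def marginal_mean_latent_scheme(v1, v2):
    """E(Y | v1, v2) = pi * mu under the first (latent mean) generating scheme."""
    mu = math.exp(1.5 - 0.5 * v1 - 0.1 * v2)
    pi = expit(1.5 - 0.5 * v1 - 0.2 * v2)
    return pi * mu

def integrate_over_v2(v1, lo=-8.0, hi=8.0, steps=4000):
    """Trapezoidal approximation of the integral of E(Y | v1, v2) dF(v2)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        v2 = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * marginal_mean_latent_scheme(v1, v2) * std_normal_pdf(v2)
    return total * h

# Scheme 1: the exposure acts on both mu and pi, so the MR needs the integral
mr_latent = integrate_over_v2(1) / integrate_over_v2(0)

# Scheme 2: log E(Y | v1, v2) is linear in v1, so the v2 integral cancels in the ratio
mr_marginal = math.exp(-0.5)

print(round(mr_latent, 3), round(mr_marginal, 3))
```

Up to rounding, these values should agree with the "True MR" entries reported in Table 1.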
Results in Table 1 show that the three estimation methods work extremely well in finite samples with average estimates of the mean ratio virtually identical to their true values and relative bias below 5%. Moreover, the associated MSEs also decrease with increasing sample sizes leading to the conjecture that the invoked estimates are consistent. The derived estimation approach based on β̂der has 95% coverage probabilities of confidence intervals higher than the nominal level, resulting from larger standard errors. And this behavior does not appear to change with increasing sample sizes. This phenomenon is also apparent in the analysis of early childhood indices in Section 3.2, where the parameter estimates from the derived mean model appear to be more variable than those from the direct method. This loss of precision under working independence assumptions is not uncommon and has been previously reported in the literature (Fitzmaurice, 1995).
Table 1. Simulation results for the mean ratio (exposed vs unexposed), using the APV and the derived mean estimation from a latent (conventional) log-logit model, and the direct mean estimation from a marginal log-logit working model, with data generated from ZINB models.
Column groups: the APV and Derived columns (APV estimation and derived marginal mean estimation) are based on the latent log-logit working model; the Direct columns (direct marginal mean estimation) are based on the marginal log-logit working model.
True MR | n∓ | APV EMR | APV RB(%) | APV MSE | APV CP(%) | Derived EMR | Derived RB(%) | Derived MSE | Derived CP(%) | Derived AIC | Direct EMR | Direct RB(%) | Direct MSE | Direct CP(%) | Direct AIC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Data generated from a ZINB model with a latent mean | | | | | | | | | | | | | | | |
0.543 | 100 | 0.547 | 0.743 | 0.017 | 92.7 | 0.545 | 0.314 | 0.019 | 99.8 | 434.14 | 0.548 | 0.915 | 0.018 | 92.6 | 434.17 |
0.543 | 200 | 0.547 | 0.740 | 0.008 | 94.8 | 0.546 | 0.452 | 0.008 | 98.6 | 863.40 | 0.547 | 0.675 | 0.008 | 95.1 | 863.44 |
0.543 | 400 | 0.546 | 0.546 | 0.005 | 92.1 | 0.545 | 0.281 | 0.005 | 99.9 | 1716.61 | 0.546 | 0.480 | 0.005 | 93.4 | 1716.68 |
0.543 | 1000 | 0.545 | 0.301 | 0.002 | 94.2 | 0.543 | 0.024 | 0.002 | >99.9 | 4285.21 | 0.544 | 0.224 | 0.002 | 95.3 | 4285.30 |
Data generated from a ZINB model with a marginal mean | | | | | | | | | | | | | | | |
0.607 | 100 | 0.625 | 3.086 | 0.021 | 94.7 | 0.625 | 3.071 | 0.045 | 96.7 | 475.74 | 0.626 | 3.281 | 0.021 | 94.7 | 475.72 |
0.607 | 200 | 0.614 | 1.288 | 0.011 | 93.1 | 0.613 | 1.003 | 0.011 | 99.9 | 945.35 | 0.614 | 1.192 | 0.011 | 93.3 | 945.35 |
0.607 | 400 | 0.609 | 0.435 | 0.005 | 94.7 | 0.608 | 0.254 | 0.005 | >99.9 | 1883.04 | 0.608 | 0.395 | 0.005 | 94.5 | 1883.00 |
0.607 | 1000 | 0.608 | 0.228 | 0.002 | 95.6 | 0.607 | 0.055 | 0.002 | >99.9 | 4704.47 | 0.607 | 0.146 | 0.002 | 95.6 | 4704.40 |
∓ n/2 is the sample size per group.
A simulation study was also conducted for situations where the APV cannot be computed, for example when the exposure of interest v1i has infinitely many strata or is continuous. For such cases, the probability of observing a specific exposure profile is zero, rendering the APV computationally infeasible. Table 2 shows that both the derived and the direct estimation approaches give satisfactory results for the mean ratio MR = E(Yi|v1i + 1)/E(Yi|v1i) for a one unit increase of the exposure generated from a standard normal distribution. These methods provide a reasonable estimation of the mean ratio in settings where the APV method cannot be performed. Additional simulations to study the performance of the derived and direct estimation approaches when a second order Taylor expansion is assumed are given in Web supplementary materials.
Table 2. Simulation results for the mean ratio (one unit increase in continuous exposure), for the derived mean estimation from a latent (conventional) log-logit model, and the direct mean estimation from a marginal log-logit working model, with data generated from ZINB models.
Column groups: the Derived columns (derived marginal mean estimation) are based on the latent log-logit working model; the Direct columns (direct marginal mean estimation) are based on the marginal log-logit working model.
True MR | n | Derived EMR | Derived RB(%) | Derived MSE | Derived CP(%) | Derived AIC | Direct EMR | Direct RB(%) | Direct MSE | Direct CP(%) | Direct AIC |
---|---|---|---|---|---|---|---|---|---|---|---|
Data generated from a ZINB model with a marginally specified mean | | | | | | | | | | | |
0.607 | 50 | 0.592 | -2.38 | 0.014 | 97.4 | 162.16 | 0.620 | 2.22 | 0.012 | 94.3 | 162.29 |
0.607 | 100 | 0.592 | -2.41 | 0.007 | 98.5 | 317.52 | 0.612 | 0.94 | 0.005 | 93.4 | 317.34 |
0.607 | 200 | 0.592 | -2.32 | 0.003 | 99.3 | 625.57 | 0.606 | -0.05 | 0.003 | 94.6 | 625.31 |
0.607 | 500 | 0.594 | -2.10 | 0.001 | 98.8 | 1559.95 | 0.605 | -0.30 | 0.001 | 94.6 | 1559.14 |
0.607 | 1000 | 0.595 | -1.88 | 0.0007 | 98.7 | 3110.13 | 0.606 | -0.11 | 0.0005 | 93.6 | 3108.68 |
3.2 Analysis of dental caries indices in primary dentition
We apply the proposed methods to dental caries data generated from the Detroit study aimed at identifying the social, familial, biological, and neighborhood determinants of dental caries and periodontal disease among low-income African American children and their caregivers (Sohn et al., 2007 and Ismail et al., 2011). Our chief focus in this article is to evaluate the effect of the daily amount of sugar intake (DASI), measured in grams per day, on early childhood caries, taking into account potential confounders. We are particularly interested in answering the following scientific question: do inner city African American children with higher levels of sugar intake experience greater caries severity relative to those with lower levels of intake in the modern age of fluoride exposure? The outcome of interest is the so-called dmfs (number of decayed, missing and filled tooth surfaces) index representing the cumulative severity of tooth decay for each surveyed child. This index has well-documented shortcomings but continues to be instrumental in evaluating and comparing the risks of dental caries across population groups (Lewsey and Thomson, 2004). Additional pertinent covariates include the child's age, the caregiver's employment status and oral health practices (measured by the personal hygiene performance-PHP index with lower scores being desirable, described by Podshadley and Haley, 1968).
The data set contains 874 children of which 427 (48.86%) have no caries, resulting in a sizeable frequency of zero dmfs counts. Following earlier analyses of caries indices in this population of inner city children, a traditional zero-inflated negative binomial (ZINB) regression model is considered to accommodate excessive zeros and overdispersion due to some children having large caries indices (Todem et al., 2012 and Cao et al., 2014). Specifically, we postulate that the distribution of the dmfs caries index, which we denote by Yi, for a susceptible child i is negative binomial with mean μi and dispersion parameter κ > 0, and that the latent mean μi and the membership probability πi are related to covariates as follows,
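$$\begin{aligned}
\log\{\mu_i\} &= \alpha_0 + \alpha_1\,\mathrm{Unempl}_i + \alpha_2\,\mathrm{Age}_i + \alpha_3\,\mathrm{SI}_i + \alpha_4\,\mathrm{PHP}_i + \alpha_5\,\mathrm{Age}_i \times \mathrm{SI}_i,\\
\operatorname{logit}\{\pi_i\} &= \gamma_0 + \gamma_1\,\mathrm{Unempl}_i + \gamma_2\,\mathrm{Age}_i + \gamma_3\,\mathrm{SI}_i + \gamma_4\,\mathrm{Age}_i \times \mathrm{SI}_i.
\end{aligned} \tag{4}$$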
For each child i, Unempli is the caregiver's employment status recoded to binary (Unempli = 1 if unemployed and 0 otherwise); Agei is the child's age (standardized); SIi is the child's sugar intake (standardized version of DASI); and PHPi is the caregiver's PHP index (standardized).
Using the parameter estimates of the latent regression models in (4), and assuming a first order Taylor expansion of log{πi} around the mean of its covariates, we indirectly estimate the overall effects of covariates on the marginal mean πiμi, using the model
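$$\log\{E(Y_i \mid W_i)\} = \beta_0 + \beta_1\,\mathrm{Unempl}_i + \beta_2\,\mathrm{Age}_i + \beta_3\,\mathrm{SI}_i + \beta_4\,\mathrm{PHP}_i + \beta_5\,\mathrm{Age}_i \times \mathrm{SI}_i. \tag{5}$$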
Because the Taylor expansion only invokes linear terms of covariates in πi, which are a subset of covariates in μi, the marginal mean model in (5) has the same covariates as the working model for the latent mean μi. In addition to the indirect approach, we also estimate the covariate effects on the overall mean using the marginal log-logit regression model coupled with the maximum likelihood estimation. This approach directly specifies a regression model for the overall mean πiμi using the same covariates as in model (5). It also assumes a regression model for πi similar to the formulation in (4).
Table 3 presents the MLEs and inference results for the parameters of the latent mean regression model in (4), as well as those of the derived and direct covariate effects from the marginal mean in (5). Both the derived and the direct approaches produce similar estimates and inferences, with the effect of sugar intake on average caries indices being significant at the 5% nominal level. This effect, however, fails to reach significance on both the mean caries of the at-risk group and the susceptibility probability, after controlling for caregivers' employment and oral hygiene practices. This analysis constitutes an excellent example where the classical formulation of zero-inflated count regression models fails to capture the overall effect of covariates, in contrast with the models that relate the overall mean to covariates. This finding is reminiscent of the statement by Preisser et al. (2012), who argued that misinterpreting the covariate effects on the mean of the susceptible subpopulation as the covariate effects on the overall mean may lead to the incorrect conclusion that covariate effects are not significant and thus can be grossly misleading.
Table 3. Parameter estimates, standard errors (SE) and p-values for the zero-inflated Negative Binomial model under the latent formulation (with derived overall effects) and the marginal log-logit model with direct overall effects.
Column groups: "latent" refers to the latent log-logit model with derived overall effects on the marginal mean∓; "marginal" refers to the marginal log-logit model (direct overall effects).
Effects | Estimate (latent) | SE (latent) | p-value (latent) | Estimate (marginal) | SE (marginal) | p-value (marginal) |
---|---|---|---|---|---|---|
Mean | | | | | | |
Intercept | 1.915 (1.084) | 0.098 (0.121) | < 10⁻⁴ (< 10⁻⁴) | 1.278 | 0.104 | < 10⁻⁴ |
Unemployed | 0.175 (0.397) | 0.105 (0.003) | 0.096 (0.295) | 0.295 | 0.110 | 0.007 |
Age | 0.268 (0.974) | 0.071 (0.093) | < 10⁻³ (< 10⁻⁴) | 0.700 | 0.077 | < 10⁻⁴ |
SI | 0.056 (0.265) | 0.074 (0.085) | 0.450 (0.002) | 0.266 | 0.081 | 0.001 |
PHP | 0.136 (0.109) | 0.049 (0.050) | 0.006 (0.029) | 0.128 | 0.049 | 0.009 |
Age*SI | -0.057 (-0.347) | 0.077 (0.091) | 0.460 (< 10⁻³) | -0.258 | 0.081 | 0.001 |
Susceptibility probability | | | | | | |
Intercept | 0.331 | 0.213 | 0.120 | 0.228 | 0.193 | 0.250 |
Unemployed | 0.497 | 0.232 | 0.032 | 0.572 | 0.210 | 0.006 |
Age | 1.817 | 0.254 | < 10⁻⁴ | 1.635 | 0.208 | < 10⁻⁴ |
SI | 0.200 | 0.160 | 0.211 | 0.241 | 0.142 | 0.090 |
Age*SI | -0.254 | 0.211 | 0.229 | -0.238 | 0.182 | 0.191 |
Dispersion log(κ) | | | | | | |
Intercept | -0.005 | 0.115 | 0.965 | -0.015 | 0.114 | 0.895 |
Summary statistics | | | | | | |
Max logL | -1872.4 | | | -1877.5 | | |
AIC | 3768.8 | | | 3779.0 | | |
∓ Parameter estimates and inferences for the derived overall effects on the marginal mean under the latent log-logit model are in parentheses.
Children with high sugar intakes appear to exhibit worse caries indices on average, although the size of the effect tends to diminish with age. In Figure 1, we plot the estimates of the ratios of mean caries indices for each unit increase in SI (standardized version of DASI) as a function of Age (years) and the corresponding 95% joint confidence band, for the derived and the direct estimation approaches, holding the caregivers' employment and oral hygiene practices constant. Note that each unit increase in the standardized version of sugar intake, (DASI − 126.47)/102.31, corresponds to an increase of about 102g of sugar intake per day (for example, from the sample mean of 126.47g to about 229g per day), which is substantial in view of the observed data. Nonetheless, the dramatic effect of sugar intake for this nominal increment is seen in infants under the age of 1, who can see their average caries indices multiplied by as much as 2. After age 3 years, however, the effect of sugar intake vanishes and fails to reach the significance level. This analysis shows that the study population clearly has different levels of vulnerability to dental caries from exposure to sugar intake. Such information is critical for designing targeted and age-specific oral health policies pertaining to dental caries in this population of inner city children.
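The point estimates behind such a plot follow from the SI and Age*SI coefficients alone. The sketch below uses the direct estimates reported in Table 3 and evaluates the mean ratio per unit increase in SI on the standardized age scale; the constants needed to convert standardized age back to years are not given in the text, so no conversion is attempted, and no confidence band is computed because the covariance of the estimates is not reported.

```python
import math

BETA_SI, BETA_AGE_SI = 0.266, -0.258     # direct estimates of SI and Age*SI (Table 3)
DASI_MEAN, DASI_SD = 126.47, 102.31      # standardization constants quoted in the text

def mean_ratio_per_unit_si(age_std):
    """Ratio of mean dmfs for a one unit increase in SI at standardized age age_std."""
    return math.exp(BETA_SI + BETA_AGE_SI * age_std)

def dasi_from_si(si):
    """Daily sugar intake (grams per day) corresponding to a standardized value si."""
    return DASI_MEAN + DASI_SD * si

for age_std in (-1.5, -1.0, 0.0, 1.0, 1.5):
    print(f"standardized age {age_std:+.1f}: mean ratio {mean_ratio_per_unit_si(age_std):.3f}")

print(dasi_from_si(0.0), dasi_from_si(1.0))
```

Because the interaction coefficient is negative, the ratio decreases with age and crosses 1 near a standardized age of 0.266/0.258, slightly above one standard deviation over the mean age.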
To study the agreement between the two fitted models, we plot in Figure 2 α̂′Vi, with Vi = (1, Unempli, Agei, SIi, PHPi, AgeiSIi)′, the estimates of the linear predictors of μi, against η̂i from the marginally specified model, only for subjects predicted to be susceptible by these models. Children are classified as susceptible to caries when they have a higher posterior probability Pr(Si = 1|Yi = yi; Wi) = πifi(yi|Wi)/{(1 − πi)δ0yi + πifi(yi|Wi)}, where yi is the observed caries index. An estimate of the susceptibility status is Ŝi = 1 if P̂r(Si = 1|Yi = yi; Wi) > P̂r(Si = 0|Yi = yi; Wi), with children with observed nonzero dmfs indices being naturally classified as being at risk. These estimates generated from the two working models were almost identical, with 465 among 874 children being classified as being at risk. The R² statistic for this plot is 0.72, representing the variation in η̂i explained by covariates of μi in model (4). This level of correlation is consistent with inferential results obtained under the derived and direct approaches, which are virtually identical (Table 3). Additional analysis was conducted to evaluate whether a second order Taylor expansion of log{πi} around the mean of its covariates would lead to significant effects of higher order terms of covariates on the average caries indices. Table W.3 (Web supplementary materials) shows that, except for age, no significant effects of quadratic terms and other higher order interactions on the average caries indices were found.
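The classification rule can be sketched with Bayes' rule applied to the mixture pmf. In the snippet below the cutoff of 0.5 is an assumption made here (it corresponds to the susceptible class being the more probable one), and the function names and input values are hypothetical.

```python
def posterior_susceptible(y, pi, f_y):
    """Pr(S = 1 | Y = y) = pi * f(y) / {(1 - pi) * 1{y = 0} + pi * f(y)}."""
    if y > 0:
        return 1.0                       # a nonzero count can only come from the at-risk class
    return pi * f_y / ((1.0 - pi) + pi * f_y)

def classify(y, pi, f_y, cutoff=0.5):
    """Estimated susceptibility status; the cutoff is an assumed value."""
    return 1 if posterior_susceptible(y, pi, f_y) > cutoff else 0

# Hypothetical fitted values: f_y is the at-risk pmf evaluated at the observed count
examples = [
    (0, 0.9, 0.30),   # zero count, high at-risk probability
    (0, 0.5, 0.20),   # zero count, moderate at-risk probability
    (4, 0.3, 0.05),   # nonzero count
]
for y, pi, f_y in examples:
    print(y, round(posterior_susceptible(y, pi, f_y), 4), classify(y, pi, f_y))
```

Only children with a zero count can be classified as not at risk, which is consistent with the number classified as at risk exceeding the number with nonzero dmfs.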
Table 4 presents the results of goodness-of-fit statistics for the two fitted models and those of competing formulations. Models that accommodate zero-inflation and overdispersion appear to give a better representation for these caries indices. Specifically, the ZINB coupled with the marginal and the latent means provides superior fit according to the AIC and BIC criteria. Although the latent mean and the marginal mean both have the same specification in terms of covariates and link functions, the latent mean model appears to provide a better fit to the data under the same mixing probability model.
Table 4. Goodness-of-fit statistics for alternative models fit to dental caries indices in young children.
| Model | # parameters | −2 logLik | AIC | BIC |
|---|---|---|---|---|
| Homogeneous model | | | | |
| Poisson | 6 | 8574.8 | 8586.8 | 8615.5 |
| Beta Binomial | 7 | 3787.4 | 3801.4 | 3834.8 |
| Negative Binomial | 7 | 3980.2 | 3994.2 | 4027.6 |
| ZI model with latent mean | | | | |
| Poisson | 11 | 5873.4 | 5895.4 | 5947.9 |
| Beta Binomial | 12 | 3749.1 | 3773.1 | 3830.4 |
| Negative Binomial | 12 | 3744.8 | 3768.8 | 3826.1 |
| ZI model with marginal mean∓ | | | | |
| Poisson | 11 | 5901.4 | 5923.4 | 5975.9 |
| Beta Binomial | 12 | 3780.7 | 3804.7 | 3862.0 |
| Negative Binomial | 12 | 3755.0 | 3779.0 | 3836.3 |

∓ Marginal log-logit model.
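The information criteria in Table 4 follow the usual definitions, AIC = −2 logLik + 2k and BIC = −2 logLik + k log(n). The sketch below recomputes them from the tabulated −2 logLik values, assuming that the BIC penalty uses n = 874, the number of children mentioned in the text; under that assumption the tabulated values are reproduced to within rounding.

```python
import math

N_CHILDREN = 874  # sample size mentioned in the text; assumed to be the n used in BIC


def aic(neg2loglik, n_params):
    """Akaike information criterion from -2 log-likelihood."""
    return neg2loglik + 2 * n_params


def bic(neg2loglik, n_params, n_obs=N_CHILDREN):
    """Bayesian information criterion from -2 log-likelihood."""
    return neg2loglik + n_params * math.log(n_obs)


if __name__ == "__main__":
    # (label, number of parameters, -2 logLik) for the three ZINB/NB rows of Table 4
    rows = [
        ("NB, homogeneous", 7, 3980.2),
        ("ZINB, latent mean", 12, 3744.8),
        ("ZINB, marginal mean", 12, 3755.0),
    ]
    for label, k, dev in rows:
        print(f"{label}: AIC={aic(dev, k):.1f}  BIC={bic(dev, k):.1f}")
```

Because the latent-mean and marginal-mean ZI models have the same number of parameters, their ranking by AIC or BIC is determined entirely by −2 logLik.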
4. Discussion
This paper has extended the literature by developing two methods which relate the overall mean response to covariates but use the heterogeneity implied by the ZI models as a device to account for extra zeros. These methods are particularly useful when the overall mean response is the target of inference and the latent class parameterization, based on the characterization of the population into the at-risk and not at-risk subgroups, is scientifically implausible. As argued by Mwalili et al. (2008), the marginal distribution resulting from ZI models for count data does not always imply that there is an underlying classification of the at-risk and not at-risk population, and the marginal distribution model may well provide a reasonable representation of data from a homogeneous population. In the Detroit study, it is unclear why some of the minority low-income children would be considered immune to dental caries.
From a practical standpoint, these methods can be implemented in commercial software with minimal programming effort. They can also be readily extended to latent class models with more than two classes. Consider a mixture population with J latent classes and a probability mass function (pmf) of the form $\Pr(Y_i = y \mid W_i) = \sum_{j=1}^{J} \Pr(S_i = j \mid W_i)\,\Pr(Y_i = y \mid S_i = j; W_i)$, where Wi is the vector of covariates and Si represents the class membership for subject i. Using the trivial expression $E(Y_i \mid W_i) = \sum_{j=1}^{J} \Pr(S_i = j \mid W_i)\,E(Y_i \mid S_i = j; W_i)$, the overall covariate effect on the marginal mean E(Yi|Wi) can be indirectly estimated from MLEs of the basic regression models relating the latent class means E(Yi|Si = j;Wi) and the latent class membership probabilities Pr(Si = j|Wi) to covariates. A marginally specified mixture model can also be formulated by directly relating the marginal mean E(Yi|Wi) to covariates and formulating regression models for the latent class membership probabilities Pr(Si = j|Wi) and all but one of the latent means E(Yi|Si = j; Wi). This extension has wide applications not only to discrete data but also to continuous data, in which case the probability mass functions are replaced by density functions.
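The indirect route for a J-class mixture amounts to weighting the latent class means by the class membership probabilities. The following minimal sketch assumes, purely for illustration, a multinomial-logit (softmax) model for the membership probabilities; the helper names and numerical values are hypothetical.

```python
import math


def softmax(scores):
    """Convert class scores (linear predictors) into membership probabilities."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_mean(class_probs, class_means):
    """E(Y | W) = sum_j Pr(S = j | W) * E(Y | S = j; W)."""
    if len(class_probs) != len(class_means):
        raise ValueError("one mean is required per latent class")
    return sum(p * mu for p, mu in zip(class_probs, class_means))


if __name__ == "__main__":
    # Two-class zero-inflated case: a non-susceptible class with mean 0
    # and a susceptible class with mean 5 (hypothetical values).
    print(marginal_mean([0.4, 0.6], [0.0, 5.0]))

    # Three-class example with hypothetical scores and class means.
    probs = softmax([0.0, 0.5, -1.0])
    print(marginal_mean(probs, [0.0, 2.0, 8.0]))
```

In the two-class zero-inflated case, this reduces to the susceptibility probability times the mean of the susceptible class.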
The proposed methods have some limitations. The derived estimation approach, by relying on the working independence assumption for the elements of m(α̂, γ̂), is apt to yield less precise estimates β̂der. A simple way to circumvent this limitation might be to estimate β by minimizing the weighted sum of squared deviations $(X\beta - m(\hat\alpha, \hat\gamma))' \hat{D}^{-1} (X\beta - m(\hat\alpha, \hat\gamma))$, with $D = \mathrm{cov}\{m(\hat\alpha, \hat\gamma)\}$, in which case $\hat\beta_{der} = (X'\hat{D}^{-1}X)^{-1}X'\hat{D}^{-1}m(\hat\alpha, \hat\gamma)$ and $\mathrm{cov}(\hat\beta_{der}) = (X'\hat{D}^{-1}X)^{-1}$. This approach, however, can be computationally demanding, as it requires inverting D̂, a high-dimensional matrix of order n × n.
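The weighted least squares estimator described above can be sketched in a few lines. This is only an illustrative implementation using plain Python lists and a small Gauss-Jordan solver, and it assumes that the supplied matrix D is nonsingular; in practice a numerical linear algebra library would normally be used. All function names are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def weighted_ls(x, d, m):
    """Return (beta_hat, cov_beta) with
    beta_hat = (X' D^-1 X)^-1 X' D^-1 m and cov_beta = (X' D^-1 X)^-1."""
    xt = transpose(x)
    dinv_x = solve(d, x)                    # D^-1 X without forming D^-1
    dinv_m = solve(d, [[v] for v in m])     # D^-1 m
    a = matmul(xt, dinv_x)                  # X' D^-1 X
    b = matmul(xt, dinv_m)                  # X' D^-1 m
    p = len(a)
    identity = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    cov_beta = solve(a, identity)
    beta_hat = [row[0] for row in matmul(cov_beta, b)]
    return beta_hat, cov_beta


if __name__ == "__main__":
    # Toy example: m lies exactly on the line 1 + 2 * x.
    x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
    d = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
    print(weighted_ls(x, d, [1.0, 3.0, 5.0]))
```

Solving linear systems in D avoids forming its inverse explicitly, but the cost still grows quickly with n, which is the computational burden noted above.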
Another limitation of our methodology is that the marginal mean is assumed to be linearly related to covariates through the log link function, which may be subject to misspecification. Although the methodology is readily applicable to any known link function, the linearity assumption may provide a poor approximation of the true function relating continuous covariates to the marginal mean. Given that the true underlying relationship between the mean response and covariates is usually unknown to the analyst, a general approach that does not specify a priori the form of this relationship appears to be the most robust analytic strategy. Smoothing techniques such as generalized additive models and spline models can then be used to reliably estimate the underlying relationship between the marginal mean and covariates (see, for example, Hastie and Tibshirani, 1986; Xue et al., 2004; Lam et al., 2006; Liu et al., 2012). This extension and other generalizations of the methodology are outside the scope of this paper and may be the subject of further research.
Acknowledgments
This work was supported by the first author's NCI/NIH K-award, 1K01 CA131259 and its supplement from the 2009 American Recovery and Reinvestment Act funding mechanism. The authors are grateful to Dr Amid Ismail for his permission to use the dental caries data.
Appendix: Calculation of cov{m(α̂, γ̂)}
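Using a first-order Taylor expansion of $m_i(\hat\alpha, \hat\gamma)$ around $(\alpha, \gamma)$, we have

$$m_i(\hat\alpha, \hat\gamma) \approx m_i(\alpha, \gamma) + \dot{m}_{i,\alpha}'(\hat\alpha - \alpha) + \dot{m}_{i,\gamma}'(\hat\gamma - \gamma),$$

where $\dot{m}_{i,\alpha} = \partial m_i(\alpha, \gamma)/\partial\alpha$ and $\dot{m}_{i,\gamma} = \partial m_i(\alpha, \gamma)/\partial\gamma$.

Using the delta method, we have for $i, j = 1, \ldots, n$,

$$\mathrm{cov}\{m_i(\hat\alpha, \hat\gamma), m_j(\hat\alpha, \hat\gamma)\} \approx (\dot{m}_{i,\alpha}', \dot{m}_{i,\gamma}')\,\mathrm{cov}\{(\hat\alpha', \hat\gamma')'\}\,(\dot{m}_{j,\alpha}', \dot{m}_{j,\gamma}')',$$

where $\dot{m}_{u,\alpha} = V_u$ and $\dot{m}_{u,\gamma} = (1 - \pi_u(\gamma))Z_u$ with $\pi_u(\gamma) = \{1 + \exp(-\gamma' Z_u)\}^{-1}$, $u = 1, \ldots, n$. Evaluated at $\hat\alpha$ and $\hat\gamma$, $\dot{m}_{u,\alpha}$ and $\dot{m}_{u,\gamma}$ take the values $V_u$ and $(1 - \pi_u(\hat\gamma))Z_u$, respectively.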
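The delta-method calculation can be sketched as follows, using the gradient expressions given in this appendix. The covariance matrix of the stacked estimates (α̂, γ̂) is taken as an input, ordered with the α block first; the function names and the toy inputs are hypothetical.

```python
import math


def expit(t):
    """Numerically stable logistic function {1 + exp(-t)}^-1."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def gradient_row(v_u, z_u, gamma):
    """Gradient of m_u with respect to (alpha, gamma):
    (V_u, (1 - pi_u(gamma)) Z_u), with pi_u(gamma) = expit(gamma' Z_u)."""
    pi_u = expit(sum(g * z for g, z in zip(gamma, z_u)))
    return list(v_u) + [(1.0 - pi_u) * z for z in z_u]


def delta_cov(v, z, gamma_hat, cov_params):
    """Approximate n x n covariance matrix of m(alpha_hat, gamma_hat).

    v, z: lists of covariate vectors V_u and Z_u, one per subject.
    cov_params: covariance matrix of (alpha_hat, gamma_hat), alpha block first.
    """
    rows = [gradient_row(v_u, z_u, gamma_hat) for v_u, z_u in zip(v, z)]
    q = len(cov_params)
    # g_i' * Sigma for every subject i
    g_sigma = [
        [sum(r[k] * cov_params[k][l] for k in range(q)) for l in range(q)]
        for r in rows
    ]
    # D[i][j] = g_i' * Sigma * g_j
    return [[sum(a * b for a, b in zip(gs_i, r_j)) for r_j in rows] for gs_i in g_sigma]


if __name__ == "__main__":
    # Toy inputs: intercept-only V and Z for two subjects, gamma_hat = 0.
    v = [[1.0], [1.0]]
    z = [[1.0], [1.0]]
    sigma = [[1.0, 0.0], [0.0, 1.0]]
    print(delta_cov(v, z, [0.0], sigma))
```

The gradient does not depend on α, so only γ̂ and the covariate vectors are needed to build the rows.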
Footnotes
Supplementary Materials: Supplementary Web Appendices, referenced in Section 3 as well as the code for analyzing the early childhood caries indices, are available with this paper at the Biometrics website on Wiley Online Library.
References
- Albert JM, Wang W, Nelson S. Estimating overall exposure effects for zero-inflated regression models with application to dental caries. Statistical Methods in Medical Research. 2014;23:257–278. doi: 10.1177/0962280211407800.
- Anderson CA, Curzon MEJ, Van Loveren C, Tatsi C, Duggal MS. Sucrose and dental caries: a review of the evidence. Obesity Reviews. 2009;10:41–54. doi: 10.1111/j.1467-789X.2008.00564.x.
- Böhning D, Dietz E, Schlattmann P, Mendonca L, Kirchner U. The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society: Series A (Statistics in Society). 1999;162:195–209.
- Burt B, Pai S. Sugar consumption and caries risk: a systematic review. Journal of Dental Education. 2001;65:1017–1023.
- Cao G, Hsu WW, Todem D. A score-type test for heterogeneity in zero-inflated models in a stratified population. Statistics in Medicine. 2014;33:2103–2114. doi: 10.1002/sim.6092.
- Farewell V, Sprott D. The use of a mixture model in the analysis of count data. Biometrics. 1988;44:1191–1194.
- Fitzmaurice GM. A caveat concerning independence estimating equations with multivariate binary data. Biometrics. 1995;51:309–317.
- Gilthorpe MS, Frydenberg M, Cheng Y, Baelum V. Modelling count data with excessive zeros: The need for class prediction in zero-inflated models and the issue of data generation in choosing between zero-inflated and generic mixture models for dental caries data. Statistics in Medicine. 2009;28:3539–3553. doi: 10.1002/sim.3699.
- Hall DB. Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics. 2000;56:1030–1039. doi: 10.1111/j.0006-341x.2000.01030.x.
- Hastie T, Tibshirani R. Generalized additive models. Statistical Science. 1986;1:297–310. doi: 10.1177/096228029500400302.
- Heagerty PJ. Marginally specified logistic-normal models for longitudinal binary data. Biometrics. 1999;55:688–698. doi: 10.1111/j.0006-341x.1999.00688.x.
- Ismail AI, Lim S, Sohn W. A transition scoring system of caries increment with adjustment of reversals in longitudinal study: evaluation using primary tooth surface data. Community Dentistry and Oral Epidemiology. 2011;39:61–68. doi: 10.1111/j.1600-0528.2010.00565.x.
- Lam KF, Xue H, Bun Cheung Y. Semiparametric analysis of zero-inflated count data. Biometrics. 2006;62:996–1003. doi: 10.1111/j.1541-0420.2006.00575.x.
- Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34:1–14.
- Lewsey JD, Thomson WM. The utility of the zero-inflated Poisson and zero-inflated negative binomial models: a case study of cross-sectional and longitudinal dmf data examining the effect of socio-economic status. Community Dentistry and Oral Epidemiology. 2004;32:183–189. doi: 10.1111/j.1600-0528.2004.00155.x.
- Liu H, Ma S, Kronmal R, Chan KS. Semiparametric zero-inflated modeling in multi-ethnic study of atherosclerosis (MESA). The Annals of Applied Statistics. 2012;6:1236–1255. doi: 10.1214/11-aoas534.
- Long D, Preisser JS, Herring AH, Golind CE. A marginalized zero-inflated Poisson regression model with overall exposure effects. Statistics in Medicine. 2014, in press. doi: 10.1002/sim.6293.
- Mullahy J. Specification and testing of some modified count data models. Journal of Econometrics. 1986;33:341–365.
- Mwalili S, Lesaffre E, Declerck D. The zero-inflated negative binomial regression model with correction for misclassification: an example in caries research. Statistical Methods in Medical Research. 2008;17(2):123–139. doi: 10.1177/0962280206071840.
- Podshadley A, Haley J. A method for evaluating oral hygiene performance. Public Health Rep. 1968;83(3):259–264.
- Preisser J, Stamm J, Long D, Kincade M. Review and recommendations for zero-inflated count regression modeling of dental caries indices in epidemiological studies. Caries Research. 2012;46:413–423. doi: 10.1159/000338992.
- Preisser JS, Das K, Long DL, Divaris K. Marginalized zero-inflated negative binomial regression with application to dental caries. Statistics in Medicine. 2015. doi: 10.1002/sim.6804.
- Ridout M, Demetrio CGB, Hinde J. Models for count data with many zeros. In: Proceedings of International Biometric Conference. Cape Town, South Africa; 1998. pp. 179–192.
- Sohn W, Ismail A, Amaya A, Lepkowski J. Determinants of dental care visits among low-income African-American children. The Journal of the American Dental Association. 2007;138:309–318. doi: 10.14219/jada.archive.2007.0163.
- Tellez M, Sohn W, Burt B, Ismail A. Assessment of the relationship between neighborhood characteristics and dental caries severity among low-income African-Americans: A multilevel approach. Journal of Public Health Dentistry. 2006;66:30–36. doi: 10.1111/j.1752-7325.2006.tb02548.x.
- Todem D, Hsu WW, Kim K. On the efficiency of score tests for homogeneity in two-component parametric models for discrete data. Biometrics. 2012;68:975–982. doi: 10.1111/j.1541-0420.2011.01737.x.
- Wilkins KJ, Fitzmaurice GM. A marginalized pattern-mixture model for longitudinal binary data when nonresponse depends on unobserved responses. Biostatistics. 2007;8:297–305. doi: 10.1093/biostatistics/kxl010.
- Xue H, Lam KF, Li G. Sieve maximum likelihood estimator for semiparametric regression models with current status data. Journal of the American Statistical Association. 2004;99:346–356.