Untangle the Structural and Random Zeros in Statistical Modelings

W Tang; H He; WJ Wang; D G Chen

doi:10.1080/02664763.2017.1391180

Author manuscript; available in PMC: 2019 Mar 20.

Published in final edited form as: J Appl Stat. 2017 Oct 24;45(9):1714–1733. doi: 10.1080/02664763.2017.1391180


PMCID: PMC6426322 NIHMSID: NIHMS1504815 PMID: 30906098

Abstract

Count data with structural zeros are common in public health applications. Considerable research has focused on zero-inflated models, such as the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models, for such data when the count is the response variable. However, when such variables are used as predictors, the difference between structural and random zeros is often ignored, which may result in biased estimates. One remedy is to include an indicator of the structural zero in the model as a predictor, if it is observed. However, structural zeros are often not observed in practice, in which case no statistical method is available to address the bias. This paper aims to fill this methodological gap by developing parametric methods, based on the maximum likelihood approach, to model zero-inflated count data used as predictors. The response variable can be of any type, including continuous, binary, count or even zero-inflated count responses. Simulation studies are performed to assess the numerical performance of this new approach when the sample size is small to moderate. A real data example is also used to demonstrate the application of this method.

Keywords: generalized linear models, maximum likelihood, structural zeros, zero-inflated Poisson, zero-inflated explanatory variables

1. Introduction

Count variables recording frequencies of some specific behaviors during a period of time, such as days of alcohol consumption or number of unprotected sexual activities in the past month, are common in behavioral and social studies. It is important, both conceptually and methodologically, to pay close attention to structural zeros in such count variables. Structural zeros refer to zero responses by those subjects whose count response will always be zero, in contrast to random (or sampling) zeros, which occur in subjects whose count responses can be greater than zero but happen to be zero due to sampling variability. For example, in HIV/AIDS prevention research, the count of unprotected vaginal sex is commonly used to measure the risk of HIV/AIDS. Subjects who have always been, or have become, continually abstinent from unprotected sex in a given time period form a non-risk group as defined by structural zeros in their count outcomes, while the remaining subjects constitute an at-risk group with their count outcomes consisting of random zeros or a positive number of episodes of unprotected sex. Such a partition of the study population is not only supported by the excess number of zeros observed in real studies, but is also conceptually needed to serve as a basis for valid inference.

The main issue in modeling count data with structural zeros is that structural zeros are often not observed. In fact, structural zeros may be latent and not observable, so the issue cannot be solved by refining the study design. For example, in toxicological studies, long-term exposure to food-borne toxins is often estimated using short-term food intake measures. Zeros in these measures may be structural, or random simply due to day-to-day variability in food intake. It is typically impossible to separate these two types of zeros from the survey. In such cases, appropriate statistical methods are needed to address the issue. The issue has been acknowledged and dealt with in the literature when the zero-inflated count variable is treated as the dependent, or response, variable; see, for example, [2, 4–6, 8, 13–15, 19, 21]. Zero-inflated models such as zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models have been developed and successfully applied in various biomedical and health fields such as HIV/AIDS studies, cancer, nursing, and health care outcome research, as well as in non-health fields such as zoology, econometrics, manufacturing and traffic accident modeling [1, 3, 7, 9, 10, 16–18, 20, 22–25]. However, the statistical problems and associated implications that arise when such a count outcome is used as a predictor have received far less attention in the literature. In such cases, count variables are typically treated as continuous predictors in regression models, with no effort to distinguish structural zeros from their random counterparts. This practice is adopted mainly for modeling convenience and, in many studies, does not reflect the true association among the variables involved. For example, as illustrated in a study on alcohol research [11], a structural zero in drinking outcomes represents an individual who abstains from drinking, while a random zero corresponds to a drinker who happens not to drink during a period of time. Thus, the structural and random zeros represent two distinct groups of subjects with different psychosocial outcomes. Indeed, ignoring the differences between structural and random zeros and simply using the count variable as a predictor may yield biased inferences and uninterpretable findings [11, 12].

To tease out the distinctive effects of structural and random zeros on the response of interest, we can include an indicator of structural zeros in the model (in addition to the count variable itself). This approach requires that the structural zeros be observed, as with alcohol abstainers in alcohol research. However, as indicated above, structural zeros are often latent and not directly observable. This paper aims to fill this methodological gap by developing a new approach to model the distinctive effects of structural and random zeros as predictors in regression analysis in situations where the structural zeros are not observed. Our method relies on modeling structural zeros with zero-inflated models and may potentially be applied to the broad range of fields mentioned above.

2. Models for Count Predictors with Structural Zeros

2.1. Problems from Structural Zeros

Given a sample of n subjects, let y_i denote the response of interest and x_i a zero-inflated count predictor from the ith subject (1 ≤ i ≤ n). Suppose that a structural zero in x_i reflects some personal trait, while the random zeros and positive counts measure the level of some behavior of interest, such as alcohol use. Further, we assume there are some other covariates, collectively denoted by z_i = (z_i1,...,z_ip)^T.

Let r_i be an indicator of structural zero of x_i, i.e., r_i = 1 if x_i is a structural zero and r_i = 0 otherwise. In studies where the structural zeros are observed, one may simply add the indicator r_i of structural zero as an additional predictor in the model to address the differential effects between random and structural zeros. However, in many studies r_i is latent as it is only partially observed; for subjects with x_i > 0, r_i = 0; however, r_i is unknown for subjects with x_i = 0.

The latent indicator r_i partitions the study sample (population) into two distinct subgroups, one consisting of all subjects with r_i = 1 and the other comprising the remaining subjects with r_i = 0. Since the trait in many studies is a risk factor, we refer to the first group as the non-risk subgroup and to the second as the at-risk subgroup.

If we do not distinguish between structural and random zeros, we may apply the generalized linear models (GLMs) to model the association between the explanatory variables including the predictor of interest x_i and the covariates z_i, and the outcome, as follows:

$$y_i \mid x_i, z_i \sim \text{i.d. } f, \quad \mu_i = E(y_i \mid x_i, z_i) = h(\alpha x_i + z_i^{\top} \beta), \tag{1}$$

where i.d. denotes independently distributed, f denotes some distribution such as Poisson and h is the inverse of some link function such as the log function [22]. For example, if y_i is continuous, we may use the following linear model:

$$y_i \mid x_i, z_i \sim \text{i.d. } N(\mu_i, \sigma^2), \quad \mu_i = E(y_i \mid x_i, z_i) = \alpha x_i + z_i^{\top} \beta, \quad 1 \le i \le n, \tag{2}$$

where N(μ,σ²) denotes a normal distribution with mean μ and variance σ². Note that z_i includes a covariate with the constant value 1 so the models above contain the intercept term.

However, as mentioned in Section 1, when a count variable x_i has structural zeros, the conceptual difference between structural and random zeros significantly affects the interpretation of the coefficient α in (1) and (2). For example, if x_i is a drinking outcome such as days of heavy drinking, the difference between a subject with r_i = 1 and one with r_i = 0 is substantial. If x_i = 0 is a random zero, the coefficient of x_i represents the differential effect of drinking on the response y_i within the drinker subgroup when the drinking outcome changes from 0 to 1. If x_i = 0 represents a structural zero, such a difference speaks to the effect of the trait of drinking on the response y_i. When only x_i is included, as in (1), the coefficient of x_i mixes these two effects and has a dubious interpretation. Thus, the model in (1) is flawed and must be revised to account for the distinctive effects of structural and random zeros.

Now consider the following GLM:

$$y_i \mid x_i, r_i, z_i \sim \text{i.d. } f, \quad E(y_i \mid x_i, r_i, z_i) = h(\alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta), \quad 1 \le i \le n. \tag{3}$$

The above is identical to (1), except for the additional indicator of structural zeros in the set of explanatory variables. Under the refined model in (3), the effects of traits on the response are explained by α₂, while the effects of the level of activities of the behavior are indicated by α₁.

If r_i is observed, (3) is a regular GLM and commonly used inference tools such as maximum likelihood can be applied for inferences about the model parameters. When r_i is unobserved as in most real studies, (3) cannot be estimated using such standard methods. Next we discuss how to make inferences about (3) in the latter case.
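To illustrate the contrast between (1) and (3) when r_i is observed, the following is a minimal sketch that simulates data under a linear version of (3) and fits both models by ordinary least squares. The function names, the parameter values (ρ = 0.5, μ = 2, unit error variance), the omission of the covariates z_i, and the hand-rolled solver are all assumptions made for this illustration; they are not taken from the paper.

```python
import math
import random

def poisson(rng, mu):
    """Draw one Poisson(mu) variate with Knuth's multiplication method."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta

def ols(rows, y):
    """Least-squares coefficients from the normal equations X'X beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    return solve(xtx, xty)

def compare_models(n=4000, rho=0.5, mu=2.0, c0=-1.0, cx=1.0, cr=1.0, seed=2017):
    """Simulate y = c0 + cx*x + cr*r + noise with observed r; fit (1) and (3)."""
    rng = random.Random(seed)
    xs, rs, ys = [], [], []
    for _ in range(n):
        r = 1 if rng.random() < rho else 0      # structural-zero indicator
        x = 0 if r else poisson(rng, mu)        # count is 0 for structural zeros
        xs.append(x)
        rs.append(r)
        ys.append(c0 + cx * x + cr * r + rng.gauss(0.0, 1.0))
    naive = ols([[1.0, x] for x in xs], ys)                  # model (1): no indicator
    full = ols([[1.0, x, r] for x, r in zip(xs, rs)], ys)    # model (3): with indicator
    return naive, full
```

Calling `compare_models()` returns the coefficient lists `[intercept, slope]` and `[intercept, slope, indicator]`. Under these assumed settings, the slope from the model without r_i tends to fall noticeably below the true value of 1, while the model with r_i recovers it.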

2.2. A Mixture Model

We construct a model with a zero-inflated count predictor under the generalized linear regression model framework. Our mixture model consists of two components, one for modeling the outcome y and the other for modeling the zero-inflated count predictor.

Main Model:

This component pertains to the model of primary interest. Given x_i, z_i and r_i, the outcome y_i follows some parametric distribution indexed by parameter vector α = (α₁, α₂, β):

$$y_i \mid x_i, r_i, z_i \sim \text{i.d. } f, \quad \mu_i = E(y_i \mid x_i, z_i, r_i) = g(\alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta), \quad 1 \le i \le n. \tag{4}$$

The link function g(·) can be specified depending on the type of the outcome y. For example, if y_i is continuous and normally distributed, we can choose the identity link function. Then model (4) becomes

$$y_i \mid x_i, r_i, z_i \sim \text{i.d. } N(\mu_i, \sigma^2), \quad \mu_i = \alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta, \quad 1 \le i \le n. \tag{5}$$

Inclusion of the indicator r_i for the risk, as a predictor in the Main Model, enables us to model the differential effects between structural and random zeros. There are two effects associated with the trait in the Main model. One is for the difference between structural zeros and random zeros, say the trait effect, measured by α₂. The other is the dosage effect of the count predictor for the at-risk subgroup, measured by α₁. The coefficient α₁ measures the change in y_i per unit increase in x_i within the at-risk group, which is the effect of the severity of the risk factor on the response for subjects who have such a risk factor. Without including r_i, two effects are mixed together and hence may potentially provide biased and misleading conclusions. Our model can tease apart the two effects and hence can provide a more comprehensive relationship between the outcome and the trait.

Auxiliary Zero-inflated Model:

This component models the zero-inflated predictor x_i. Because of the inflation of zeros from the non-risk subgroup, we model the count variable x_i with some zero-inflated count response models. For example, we may assume that x_i follows a popular ZIP distribution with probability of being structural zero ρ_i and Poisson mean μ_i, i.e., ZIP(ρ_i, μ_i). The Auxiliary ZIP model, ZIP_x, models both the structural zero and the Poisson count in x_i. We assume that u_i is a set of predictors for both ρ_i and μ_i. Although ρ_i and μ_i may depend on different sets of predictors, for notational brevity we assume a common set u_i which includes all the predictors for both components, but with different coefficients γ₁ and γ₂, i.e.:

$$x_i \mid u_i \sim \text{i.d. } \mathrm{ZIP}(\rho_i, \mu_i), \quad \rho_i = h_1(u_i^{\top} \gamma_1), \quad \mu_i = h_2(u_i^{\top} \gamma_2), \quad 1 \le i \le n, \tag{6}$$

where h₁(·) and h₂(·) are the link functions for the structural zero component and the Poisson component. The predictors u_i may be different from or overlap with z_i in the Main Model (4). Other commonly used zero-inflated count response models such as zero-inflated negative binomial (ZINB) may also be adopted for the Auxiliary zero-inflated model.
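As a concrete reading of (6), the sketch below evaluates the ZIP probability mass function and maps a predictor vector u_i to (ρ_i, μ_i). It assumes a logit link for ρ_i and a log link for μ_i, which is one common choice rather than something (6) requires; the function names are hypothetical.

```python
import math

def zip_pmf(x, rho, mu):
    """Pr(X = x) for X ~ ZIP(rho, mu): a structural zero with probability rho,
    otherwise a Poisson(mu) count (which may itself be a random zero)."""
    poisson_part = math.exp(-mu) * mu ** x / math.factorial(x)
    structural_part = rho if x == 0 else 0.0
    return structural_part + (1.0 - rho) * poisson_part

def zip_parameters(u, gamma1, gamma2):
    """(rho_i, mu_i) from u_i, assuming a logit link for rho and a log link for mu."""
    eta1 = sum(a * b for a, b in zip(u, gamma1))
    eta2 = sum(a * b for a, b in zip(u, gamma2))
    return 1.0 / (1.0 + math.exp(-eta1)), math.exp(eta2)
```

For instance, `zip_pmf(0, rho, mu)` equals ρ + (1 − ρ)e^{−μ}, the total probability of observing a zero from either source.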

The validity of the Main Model and the Auxiliary Model is given by the following assumptions:

Assumption A:

Conditional Independence. Given u_i, we assume that x_i and r_i are independent of z_i, i.e.,

$$(x_i, r_i) \perp z_i \mid u_i.$$

This assumption implies that x_i and r_i may depend on the covariates z_i, but the dependence is only through the predictors u_i. This condition can be satisfied by including additional predictors from z_i in (4), as needed for the conditional independence, into u_i in (6) for the zero-inflated model of x_i.

Assumption B:

Comprehensiveness of the Main Model. Given the predictors x_i,z_i, and r_i, y_i is independent of u_i, i.e.,

$$y_i \perp u_i \mid x_i, z_i, r_i.$$

The assumption implies that y_i may depend on u_i, but only through x_i, z_i and r_i. This condition can always be satisfied by including additional predictors from u_i in (6) in z_i in (4). Comprehensiveness here means that all the information about y_i contained in u_i is captured by x_i, z_i and r_i through the Main Model.

In practice, we may choose the sets of covariates u_i and z_i based on the subject matter of the study. As long as the important predictors for the outcome y_i and the count x_i are included, the two assumptions should hold approximately.

The proposed mixture model can be applied to different types of responses in the Main Model including continuous, categorical, count, and survival data and different models such as ZIP and ZINB for zero-inflated count data x_i in the Auxiliary Model. We discussed the linear regression model for the continuous response in (5). Below we illustrate the approach with some other common response variables for the Main Model.

2.2.1. Models for categorical responses

When y_i is binary, we may consider modeling the response in the Main Model using the following logistic regression:

$$y_i \mid x_i, r_i, z_i \sim \text{i.d. } \mathrm{Bern}(\mu_i), \quad \mu_i = E(y_i \mid x_i, z_i, r_i) = \mathrm{logit}^{-1}(\alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta), \quad 1 \le i \le n, \tag{7}$$

where Bern(μ) denotes a Bernoulli with mean μ and logit⁻¹ (·) denotes the inverse of the logit link. Alternatively, we may apply the probit, complementary log-log, or other commonly used link functions for the binary response y_i. Further, we can readily extend (7) to nominal or ordinal responses using the cumulative logistic or generalized logit models in [22].

2.2.2. Models for count responses

When y_i is a count response, Poisson and negative binomial (NB) regression models may be applied. For example, under a log-linear Poisson regression we may assume:

$$\begin{aligned} y_i \mid x_i, r_i, z_i &\sim \text{i.d. } \mathrm{Poisson}(\mu_i), \quad \log(\mu_i) = \alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta, \\ \mu_i &= E(y_i \mid x_i, z_i, r_i) = \exp(\alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta), \quad 1 \le i \le n, \end{aligned} \tag{8}$$

where Poisson(μ) denotes a Poisson with mean μ.

2.2.3. Models for zero-inflated count responses

If y_i is a zero-inflated count response itself, we may apply ZIP or ZINB to model the data and to account for structural zeros. For example, if using ZIP, we can apply a logistic model for the structural zero and a loglinear model for the Poisson component of the response y_i in (4) as:

$$\begin{aligned} y_i \mid x_i, z_i, r_i &\sim \text{i.d. } \mathrm{ZIP}\bigl(\rho_i(v_i; \theta_1), \mu_i(v_i; \theta_2)\bigr), \quad 1 \le i \le n, \\ \rho_i(v_i; \theta_1) &= \mathrm{logit}^{-1}(\alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta), \\ \mu_i(v_i; \theta_2) &= \exp(\alpha_1' x_i + \alpha_2' r_i + z_i^{\top} \beta'), \\ \theta_1 &= (\alpha_1, \alpha_2, \beta^{\top})^{\top}, \quad \theta_2 = (\alpha_1', \alpha_2', \beta'^{\top})^{\top}. \end{aligned} \tag{9}$$

Note that as in the case of x_i, we assume a common set of explanatory variables $v_{i} = {(x_{i}, r_{i}, z_{i}^{⊤})}^{⊤}$ , but different coefficients θ₁ and θ₂ for the logistic and Poisson components of the ZIP.

In addition to the common types of response variables, this approach can be readily adapted to other response variables.

3. Statistical Inference

3.1. Likelihood Function

Since the indicator variable r_i is only partially observed, inferences cannot be made just based on the Main Model (4). Under Assumptions A and B, we can apply maximum likelihood for inference. For a subject with x_i > 0, note that r_i = 0, so the likelihood is:

$$\begin{aligned} L_{(x_i > 0)} &= f(y_i, x_i, z_i, u_i) = f(y_i, x_i, z_i, u_i, r_i = 0) \\ &= f(y_i \mid x_i, z_i, u_i, r_i = 0) \Pr(x_i \mid z_i, u_i, r_i = 0) \Pr(r_i = 0 \mid z_i, u_i) f(z_i, u_i) \\ &= f(y_i \mid x_i, z_i, r_i = 0) \Pr(x_i \mid u_i, r_i = 0) \Pr(r_i = 0 \mid u_i) f(z_i, u_i). \end{aligned} \tag{10}$$

Here we use f() as a generic notation for the (joint) likelihood for the variables in the parentheses. So it will be the density function for continuous variables and mass probabilities for discrete and categorical variables.

For a subject with x_i = 0, since r_i is unknown, the likelihood can be expressed as:

$$\begin{aligned} L_{(x_i = 0)} &= f(y_i, x_i = 0, z_i, u_i) \\ &= f(y_i, x_i = 0, z_i, u_i, r_i = 0) + f(y_i, x_i = 0, z_i, u_i, r_i = 1) \\ &= f(y_i \mid x_i = 0, z_i, u_i, r_i = 0) \Pr(x_i = 0, r_i = 0 \mid u_i) f(z_i, u_i) \\ &\quad + f(y_i \mid x_i = 0, z_i, u_i, r_i = 1) \Pr(x_i = 0, r_i = 1 \mid u_i) f(z_i, u_i) \\ &= f(z_i, u_i) \bigl\{ f(y_i \mid x_i = 0, z_i, r_i = 0) \Pr(x_i = 0 \mid r_i = 0, u_i) \Pr(r_i = 0 \mid u_i) \\ &\quad + f(y_i \mid x_i = 0, z_i, r_i = 1) \Pr(r_i = 1 \mid u_i) \bigr\}. \end{aligned} \tag{11}$$

In the above likelihood, f(y_i | x_i,z_i,r_i) can be computed based on (4), while Pr(x_i | u_i,r_i) and Pr(r_i | u_i) are provided by (6). For example, under (5) for a continuous y_i, we have:

$$f(y_i \mid x_i, z_i, r_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{[y_i - (\alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta)]^2}{2\sigma^2} \right\}.$$

Under (7) for a binary y_i,

$$\Pr(y_i \mid x_i, z_i, r_i) = \left[ \frac{\exp(\alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta)}{1 + \exp(\alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta)} \right]^{y_i} \left[ \frac{1}{1 + \exp(\alpha_1 x_i + \alpha_2 r_i + z_i^{\top} \beta)} \right]^{1 - y_i}.$$

Under (6), if we assume:

$$h_1(u_i^{\top} \gamma_1) = \mathrm{logit}^{-1}(u_i^{\top} \gamma_1), \quad h_2(u_i^{\top} \gamma_2) = \exp(u_i^{\top} \gamma_2), \quad 1 \le i \le n,$$

then we have

$$\Pr(r_i = 1 \mid u_i) = \rho_i = \frac{\exp(u_i^{\top} \gamma_1)}{1 + \exp(u_i^{\top} \gamma_1)}, \quad \Pr(x_i \mid u_i, r_i = 0) = \frac{\exp(-\exp(u_i^{\top} \gamma_2)) \exp(x_i u_i^{\top} \gamma_2)}{x_i!}.$$

By substituting f(y_i | x_i,z_i, r_i), Pr(r_i | u_i) and Pr(x_i | u_i, r_i = 0) into the likelihood functions $L_{(x_{i} > 0)}$ and $L_{(x_{i} = 0)}$ in (10) and (11), we can apply maximum likelihood methods for inferences about the parameters.

Note that as in standard regression analysis, the likelihood for each subject contains the joint distribution of z_i and u_i. However, since f(z_i,u_i) contains no parameter of primary interest, it provides no contribution to the score equations and thus can be factored out from the likelihood function.
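The per-subject contributions (10) and (11) can be sketched in code for the normal Main Model (5) combined with a ZIP Auxiliary Model using logit and log links. The function below drops the factor f(z_i, u_i), as discussed above; its name, argument layout, and the use of plain lists are assumptions for illustration, and it makes no attempt at numerical safeguards such as log-sum-exp.

```python
import math

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def normal_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mean) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def subject_loglik(y, x, z, u, alpha1, alpha2, beta, sigma, gamma1, gamma2):
    """Log-likelihood contribution of one subject, up to the constant log f(z, u)."""
    rho = 1.0 / (1.0 + math.exp(-dot(u, gamma1)))             # Pr(r = 1 | u)
    mu = math.exp(dot(u, gamma2))                             # Poisson mean given r = 0
    pois = math.exp(-mu) * mu ** x / math.factorial(x)        # Pr(x | u, r = 0)
    f_at_risk = normal_pdf(y, alpha1 * x + dot(z, beta), sigma)   # f(y | x, z, r = 0)
    if x > 0:
        # Equation (10): a positive count implies r = 0.
        return math.log(f_at_risk) + math.log(pois) + math.log(1.0 - rho)
    # Equation (11): a zero count mixes the random-zero and structural-zero cases.
    f_non_risk = normal_pdf(y, alpha2 + dot(z, beta), sigma)      # f(y | x = 0, z, r = 1)
    return math.log(f_at_risk * pois * (1.0 - rho) + f_non_risk * rho)

def total_loglik(data, **params):
    """Sum of contributions over an iterable of (y, x, z, u) tuples."""
    return sum(subject_loglik(y, x, z, u, **params) for y, x, z, u in data)
```

The negative of `total_loglik` could be passed to a general-purpose optimizer to obtain the MLEs; which optimizer to use is left open here.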

Because we are mainly interested in the differential effects of structural zeros in the main model, we naturally adopted the “pattern mixture model” approach, which involves a formulation of f(y_i ,x_i, r_i,z_i,u_i) = f(y_i|x_i,z_i,u_i, r_i)Pr(x_i|r_i,z_i,u_i)f(r_i,z_i,u_i). However, we can also formulate the model following a “selection model” scheme. In the “selection model” scheme, the model would involve a selecting distribution Pr(r_i|x_i,z_i,u_i), and the likelihood is factored as f(y_i, x_i, r_i,z_i,u_i) = f(y_i|x_i,z_i,u_i, r_i)Pr(r_i|x_i,z_i,u_i)f(x_i,z_i,u_i). Thus the likelihood for a subject with x_i > 0 (hence, r_i = 0) will be

$$f(y_i, x_i, z_i, u_i) = f(y_i \mid x_i, z_i, u_i, r_i = 0) \Pr(r_i = 0 \mid x_i, z_i, u_i) f(x_i, z_i, u_i),$$

and the likelihood for a subject with x_i = 0 (hence, r_i can be 0 or 1) will be

$$f(y_i, x_i, z_i, u_i) = \bigl[ f(y_i \mid x_i, z_i, u_i, r_i = 0) \Pr(r_i = 0 \mid x_i, z_i, u_i) + f(y_i \mid x_i, z_i, u_i, r_i = 1) \Pr(r_i = 1 \mid x_i, z_i, u_i) \bigr] f(x_i, z_i, u_i).$$

Under this formulation, the distribution f(x_i,z_i,u_i) does not need to be specified if a model (say, logistic) is specified for r_i and r_i is observed. However, since the structural zeros are unobserved, a (logistic) model for r_i would not be identifiable. We may instead rely on zero-inflated models to model the structural zeros, i.e., use the Auxiliary Model to model r_i, and the pattern mixture approach we adopted above is a natural choice in this situation.

3.2. Hypothesis Testing

As discussed above, there are two effects associated with the trait. One is for the trait effect, the difference between structural and random zeros, measured by α₂, and the other is for the dosage effect of the count predictor on the outcome for the at-risk subgroup, measured by α₁, in the Main Model. We may test them separately using common hypothesis testing techniques such as the Wald test and the likelihood ratio test. For an overall testing of whether the trait is associated with the outcome, a linear composite hypothesis of α₁ = α₂ = 0 needs to be tested.
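For the composite hypothesis α₁ = α₂ = 0, a Wald statistic can be computed from the two estimates and the corresponding 2×2 block of their estimated covariance matrix. The following is a minimal sketch; the function name and argument layout are hypothetical, and it relies on the fact that the chi-square survival function with 2 degrees of freedom is exp(−w/2).

```python
import math

def wald_test_2df(a1, a2, v11, v12, v22):
    """Wald statistic and p-value for H0: alpha1 = alpha2 = 0.

    a1, a2        : estimates of alpha1 and alpha2
    v11, v12, v22 : entries of their estimated 2x2 covariance matrix
    """
    det = v11 * v22 - v12 * v12
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    # Quadratic form theta' V^{-1} theta, with the 2x2 inverse written out explicitly.
    w = (a1 * a1 * v22 - 2.0 * a1 * a2 * v12 + a2 * a2 * v11) / det
    p_value = math.exp(-w / 2.0)   # chi-square survival function with 2 df
    return w, p_value
```

A small p-value from `wald_test_2df` suggests that the trait, through either the trait effect or the dosage effect, is associated with the outcome.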

3.3. Selection of Initial Values

Due to the complexity of the mixture model, we generally do not obtain closed-form ML estimates (MLEs) of the parameters. Numerical optimization is needed to find the MLEs, such as the popular Newton-Raphson (NR) method. In using the Newton-Raphson method, it is important to start with good initial values in order for the iterations to converge to the global maximum of the likelihood function. The following strategies can be used for setting initial values to achieve this objective as well as to speed up convergence for the Newton-Raphson method.

We first estimate the initial values of the parameters in (6) for the count predictor x_i, and then estimate the initial values for the parameters in (4). More specifically, we follow a two-step procedure to obtain initial values:

Step 1. Calculating initial values for the regression parameters γ₁ and γ₂ in the Auxiliary Model in (6), as well as the marginal probability of structural zeros ρ_x and the marginal Poisson mean μ_x, for the count predictor x_i.

(a) Estimate the initial values of ρ_x and μ_x. We fit a ZIP model for x_i with intercept only. The estimated probability of structural zeros and the estimated Poisson mean then serve as the initial values of ρ_x and μ_x, denoted $ρ_{x_{I}}$ and $μ_{x_{I}}$, respectively.
(b) Estimate the initial values of γ₁ and γ₂. We fit a ZIP model for x_i with predictors u_i. The estimated coefficients from the logistic regression for the structural zeros are used as the initial value of γ₁, $γ_{1_{I}}$, while the coefficients from the log-linear model serve as the initial value of γ₂, $γ_{2_{I}}$.

Step 2. Calculating initial values for the parameters for the Main Model of y in (4). The difficulties of model estimation lie in the fact that structural zeros are not observed. However, subjects with positive values of x_i are not structural zeros, i.e., r_i = 0 for these subjects. Thus we may apply regular regression methods to this subsample to obtain initial estimates of all the parameters except for the coefficient of r_i.

For example, if the Main Model for y is the linear regression

$$y_i \mid x_i, r_i, z_i \sim \text{i.d. } N(\mu_i, \sigma^2), \quad \mu_i = c_0 + c_x x_i + c_r r_i + c_z z_i,$$

we can apply the model

$$y_i \mid x_i, z_i \sim \text{i.d. } N(\mu_i, \sigma^2), \quad \mu_i = c_0 + c_x x_i + c_z z_i, \tag{12}$$

to the subjects with x_i > 0 to obtain the initial values for $c_{0_{I}}$ , $c_{x_{I}}$ , $c_{z_{I}}$ and σ_I. We can then set the initial value of c_r based on the following equations:

$$\begin{aligned} E(y \mid x = 0) &= \Pr(x \text{ is a random zero} \mid x = 0) \bigl( c_{0_I} + c_{x_I} E(x \mid x = 0) \bigr) \\ &\quad + \Pr(x \text{ is a structural zero} \mid x = 0) \bigl( c_{0_I} + c_r + c_{x_I} E(x \mid x = 0) \bigr) \\ &= \bigl( c_{0_I} + c_{x_I} E(x \mid x = 0) \bigr) + c_r \Pr(x \text{ is a structural zero} \mid x = 0) \\ &= \bigl( c_{0_I} + c_{x_I} E(x \mid x = 0) \bigr) + c_r \frac{\rho_{x_I}}{\rho_{x_I} + (1 - \rho_{x_I}) e^{-\mu_{x_I}}}, \end{aligned}$$

so, noting that E(x | x = 0) = 0, the initial value of c_r can be obtained by:

$$c_{r_I} = \frac{[E(y_i \mid x_i = 0) - c_{0_I}] \, [\rho_{x_I} + (1 - \rho_{x_I}) e^{-\mu_{x_I}}]}{\rho_{x_I}}. \tag{13}$$

The initial value for c_r is obtained by comparing the mean response outcome for all the subjects with zero counts in x, including both structural and random zeros, to the intercept estimated from the subjects with x_i > 0 in (12). The choices of initial values depend on the models we use, and we will give some examples in the simulation section. Since ignoring the difference between structural and random zeros means the coefficient involving r is zero, it is also reasonable to use 0 as initial values for c_r.
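As a small worked illustration of (13), the sketch below computes the initial value of c_r from the sample mean of y among subjects with x = 0, the initial intercept from (12), and the intercept-only ZIP estimates from Step 1(a). The function names are hypothetical; the arithmetic follows (13) directly.

```python
import math

def prob_structural_given_zero(rho, mu):
    """Pr(structural zero | x = 0) under ZIP(rho, mu)."""
    return rho / (rho + (1.0 - rho) * math.exp(-mu))

def initial_c_r(mean_y_at_zero, c0_init, rho_init, mu_init):
    """Initial value of c_r from equation (13).

    mean_y_at_zero : sample mean of y over subjects with x = 0
    c0_init        : intercept estimated from subjects with x > 0, as in (12)
    rho_init       : intercept-only ZIP estimate of the structural-zero probability
    mu_init        : intercept-only ZIP estimate of the Poisson mean
    """
    return (mean_y_at_zero - c0_init) / prob_structural_given_zero(rho_init, mu_init)
```

As a sanity check on the formula, if the mean of y at x = 0 equals c₀ plus c_r times the conditional probability of a structural zero, then `initial_c_r` returns c_r exactly.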

4. Simulation Studies

4.1. Simulation Setup

We use simulation studies to examine the performance of the proposed method when modeling zero-inflated outcomes as predictors in regression analysis. We assume that the predictor x ∼ ZIP(ρ_x, μ_x), a ZIP with ρ_x denoting the probability of structural zeros in the logistic component and μ_x denoting the mean of a count response in the Poisson component of the ZIP. A larger ρ_x means more structural zeros, while a larger μ_x indicates a smaller proportion of random zeros in the simulated data.

The predictor x is generated based on the following Auxiliary Model:

$$x \sim \text{i.d. } \mathrm{ZIP}(\rho_x, \mu_x), \quad z_1 \sim \text{i.d. } N(0, \sigma_{z_1}^2), \quad \mathrm{logit}(\rho_x) = a_0 + a_1 z_1, \quad \log(\mu_x) = b_0 + b_1 z_1. \tag{14}$$

The values of a₀, a₁, b₀ and b₁ control the amount of structural and random zeros in the predictor. We consider four different types of response y: continuous, binary, Poisson and zero-inflated Poisson. To investigate the performance of the proposed method under different conditions for each type of outcome, we consider three scenarios: a) the structural zeros have an effect on the outcome y and the Main Model (4) correctly specifies the effect; b) the structural zeros have an effect on y, but the Main Model (4) is misspecified by not including the effect of structural zeros, i.e., the difference between structural zeros and random zeros is ignored; c) the structural zeros do not have an effect on y, but the Main Model does include an effect of the structural zeros.
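The data-generating step in (14) can be sketched as follows. The function names, the default seed, and the use of Knuth's multiplication method for Poisson draws are assumptions for illustration, not a description of the software used for the reported simulations.

```python
import math
import random

def poisson(rng, mu):
    """Draw one Poisson(mu) variate with Knuth's multiplication method."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_predictor(n, a0, a1, b0, b1, sd_z1=1.0, seed=0):
    """Generate n triples (x, r, z1) following the Auxiliary Model (14)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z1 = rng.gauss(0.0, sd_z1)
        rho = 1.0 / (1.0 + math.exp(-(a0 + a1 * z1)))   # logit(rho_x) = a0 + a1*z1
        mu = math.exp(b0 + b1 * z1)                     # log(mu_x) = b0 + b1*z1
        r = 1 if rng.random() < rho else 0              # latent structural-zero indicator
        x = 0 if r else poisson(rng, mu)                # at-risk counts may still be 0
        data.append((x, r, z1))
    return data
```

In a simulation study the indicator r would be used only to generate the response y and then discarded, so that the fitted model sees x but not r.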

In all simulations, a Monte Carlo (MC) sample size of 1,000 is used. We summarize the results by reporting point and variance estimates (both model-based estimates obtained from asymptotic theory and empirical estimates from the MC replications), as well as the coverage probability of the confidence intervals (the proportion of replications in which the confidence interval covers the true value).
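The summaries described above can be computed from the replications as in the sketch below. It assumes Wald-type 95% intervals of the form estimate ± 1.96 × standard error, which is a common construction but is an assumption here; the function name is hypothetical.

```python
import statistics

def summarize_replications(estimates, asymptotic_variances, truth, z=1.96):
    """Summaries for one parameter across MC replications.

    Returns (mean estimate, mean asymptotic variance, empirical variance,
    coverage probability in percent), assuming Wald-type intervals.
    """
    mean_estimate = statistics.mean(estimates)
    mean_asymptotic = statistics.mean(asymptotic_variances)
    empirical_variance = statistics.variance(estimates)   # sample variance over replications
    covered = sum(
        1 for est, var in zip(estimates, asymptotic_variances)
        if abs(est - truth) <= z * var ** 0.5
    )
    return mean_estimate, mean_asymptotic, empirical_variance, 100.0 * covered / len(estimates)
```

Close agreement between the mean asymptotic variance and the empirical variance, together with coverage near 95%, is what the tables in this section report for correctly specified models.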

4.2. Continuous Response Y

For a continuous y, the association of y with x, z and r based on (5) is specified as follows:

$$y \mid x, r, z \sim \text{i.d. } N(\mu, \sigma_y^2), \quad z \sim \text{i.d. } N(0, \sigma_z^2), \quad \mu = c_0 + c_x x + c_r r + c_z z, \tag{15}$$

and x is generated based on (14). For the simulation studies, we set $σ_{y}^{2} = σ_{z}^{2} = σ_{z_{1}}^{2} = 1$, c₀ = −1, c_x = c_z = 1, a₀ = b₀ = 0.5, and a₁ = b₁ = 1. To see whether the effect of structural zeros on the response has any impact on the estimates of other parameters such as c_x and c_z, we consider c_r = 1 and c_r = 0. When c_r = 1, we consider both the true Main Model (15) and a misspecified Main Model that excludes c_r in the model fitting, i.e., we fit the model y = c₀ + c_x x + c_z z. The sample sizes considered were 200, 500 and 1000. We set the initial values of the parameters as described in Section 3.3: by applying the model (12) to the subsample with x > 0, we obtained the initial values for c_x and c_z, and the initial value of c_r was set based on (13).

Table 1 and Tables S1-S2 (in the supporting web material) show the averages of the estimates for both the Main and Auxiliary Models. In Table 1, the Main Model (15) is correctly specified, while in Table S1, the Main Model is misspecified. The results for c_r = 0 are provided in Table S2. As shown in Tables 1 and S2, the estimates for both the Main Model and the Auxiliary Model are very close to the true values, the coverage probabilities are very close to 95%, and the asymptotic variances are very close to the empirical variances. Tables 1 and S2 also show that structural zeros do not have much impact on the estimates of other parameters such as c_x and c_z, as long as the Main Model is correctly specified. However, when the Main Model is misspecified by not including the structural zeros, as shown in Table S1, the estimates of c_x are quite biased and the coverage probabilities are very low, and the estimate of the intercept c₀ is also biased. The misspecification does not have a big impact on the estimates of c_z. Therefore, when the structural zeros do have an effect on the outcome y, a model that fails to include the structural zeros of the count variable x cannot capture the true association between x and y, although the associations between y and the other covariates z may not be affected much by the misspecification.

Table 1.

Mean estimated parameters (mean asymptotic variance, simulation variance) and the coverage probabilities (CP) (%) over 1000 realizations for the continuous response. The variances are in 10⁻³ for c_x and c_z, 10⁻¹ for a₁, and 10⁻² for the other estimates.

The estimates of the parameters for the Main Model:

	c₀ = −1		c_x = 1		c_r = 1		c_z = 1

N	Est.	CP	Est.	CP	Est.	CP	Est.	CP

200	−1.000 ( 3.34 3.38 )	93.1	1.000 ( 4.50 4.23 )	95.0	0.998 ( 4.69 4.76 )	95.3	1.003 ( 5.37 5.47 )	94.0
500	−0.999 ( 1.28 1.31 )	93.8	1.000 ( 1.56 1.63 )	94.1	1.001 ( 1.81 1.83 )	94.5	1.000 ( 2.15 2.25 )	94.6
1000	−1.001 ( 0.62 0.61 )	95.0	1.000 ( 0.73 0.76 )	94.8	1.000 ( 0.89 0.88 )	94.3	0.998 ( 1.07 1.04 )	94.3

The estimates of the parameters for the Auxiliary Model:

	a₀ = 0.5		a₁ = 1		b₀ = 0.5		b₁ = 1

N	Est.	CP	Est.	CP	Est.	CP	Est.	CP

200	0.478 ( 5.54 5.77 )	95.5	1.062 ( 0.97 1.18 )	93.8	0.488 ( 1.50 1.58 )	94.4	1.011 ( 1.60 1.65 )	94.7
500	0.492 ( 2.05 2.10 )	95.0	1.016 ( 0.33 0.32 )	96.3	0.494 ( 0.58 0.59 )	95.4	1.003 ( 0.56 0.54 )	96.9
1000	0.495 ( 1.01 1.02 )	95.1	1.014 ( 0.16 0.18 )	94.2	0.499 ( 0.29 0.27 )	95.5	1.001 ( 0.26 0.27 )	94.9


4.3. Binary Response Y

For a binary outcome y, we simulate the data from a GLM for the Main Model with a logit link as follows:

$$y \mid x, r, z \sim \text{i.d. } \mathrm{Bern}(p), \quad \mathrm{logit}(p) = c_0 + c_x x + c_r r + c_z z. \tag{16}$$

The explanatory variables x and z are simulated in the same way as in the continuous case, and the parameter values are also set to be the same. An MC sample of 1000 replications is simulated for each of the sample sizes 200, 500 and 1000. The initial values for a₀, b₀, a₁ and b₁ are again determined as in Section 3.3. For the initial values of c_x and c_z, we apply a logistic regression model to the subset of subjects with x > 0, i.e., $c_{x_{I}}$ and $c_{z_{I}}$ are estimated based on:

E (y | x, z) = {logit}^{- 1} (c_{0} + c_{x} x + c_{z} z), x > 0.

(17)

After obtaining the initial values of c_x and c_z by applying Step 2 in Section 3.3, the initial value of c_r is estimated by:

c_{r_{I}} = \log \frac{A}{1 - A} - c_{0_{I}} - c_{x_{I}} E(x \mid x = 0),

where

A = \frac{\Pr(y = 1 \mid x = 0) - (1 - ρ_{x_{I}}) e^{-μ_{x_{I}}} \operatorname{logit}^{-1}[c_{0_{I}} - c_{x_{I}} E(x \mid x = 0)]}{ρ_{x_{I}}}.

The simulation results are summarized in Tables 2, S3 and S4. As shown in Tables 2 and S4, when the Main Model (16) is correctly specified, all the estimates are very good, even for a relatively small sample size. As the sample size increases from 200 to 1000, the point estimates move closer to the true values. When the Main Model (16) is misspecified, as shown in Table S3, the estimates of c_x become quite biased, although the other parameters, except for c₀, are all estimated quite well.

Table 2.

Mean estimated parameters (mean asymptotic variance, simulation variance) and the coverage probabilities (CP) (%) over 1000 realizations for the binary response. The variances are in 10⁻² for c_z, a₀, b₀ and b₁, and 10⁻¹ for the other estimates.

The estimates of the parameters for the Main Model:

	c₀ = −1		c_x = 1		c_r = 1		c_z = 1

N	Est.	CP	Est.	CP	Est.	CP	Est.	CP

200	−1.118 ( 4.03 4.37 )	95.4	1.101 ( 1.65 1.78 )	96.2	1.129 ( 5.25 5.91 )	94.3	1.045 ( 4.34 4.27 )	95.5
500	−1.048 ( 1.33 1.39 )	94.9	1.040 ( 0.52 0.56 )	95.1	1.058 ( 1.80 1.90 )	95.3	1.022 ( 1.62 1.69 )	94.7
1000	−1.022 ( 0.62 0.65 )	94.8	1.019 ( 0.24 0.25 )	96.0	1.020 ( 0.86 0.86 )	96.3	1.007 ( 0.79 0.81 )	94.2

The estimates of the parameters for the Auxiliary Model:

	a₀ = 0.5		a₁ = 1		b₀ = 0.5		b₁ = 1

N	Est.	CP	Est.	CP	Est.	CP	Est.	CP

200	0.468 ( 6.90 7.06 )	95.8	1.081 ( 1.27 1.54 )	93.9	0.488 ( 1.68 1.71 )	94.4	1.013 ( 1.75 1.80 )	94.5
500	0.488 ( 2.56 2.55 )	94.7	1.023 ( 0.44 0.41 )	96.5	0.493 ( 0.66 0.66 )	95.4	1.004 ( 0.62 0.58 )	96.2
1000	0.494 ( 1.25 1.29 )	93.2	1.015 ( 0.21 0.23 )	94.0	0.499 ( 0.33 0.32 )	95.0	1.001 ( 0.29 0.31 )	95.1


4.4. Poisson Count Response Y

For a Poisson count variable y, we generate y from a GLM with a log link function as follows:

y \mid x, r, z \sim \text{i.d. } \mathrm{Poisson}(μ), \quad μ = \exp(c_{0} + c_{x} x + c_{r} r + c_{z} z).

(18)

With the same set of values of the parameters as in the continuous case, we simulate 1,000 MC samples from each of the three sample sizes considered.

The initial values of the estimates of μ_x and ρ_x are determined by the same algorithm as in the previous cases. To obtain proper initial values of c₀, c_x and c_z, we fit the following Poisson model to the subsample with x > 0:

y \mid x, z \sim \text{i.d. } \mathrm{Poisson}(μ), \quad μ = \exp(c_{0_{I}} + c_{x_{I}} x + c_{z_{I}} z), \quad x > 0.

Given the initial values $μ_{x_{I}}$, $ρ_{x_{I}}$, $c_{0_{I}}$, $c_{x_{I}}$ and $c_{z_{I}}$, we estimate an initial value of c_r using the following estimating equation:

E(y \mid x = 0) = \Pr(x \text{ is random zero}) \cdot e^{c_{0_{I}} + c_{z_{I}} E(z \mid x = 0)} + \Pr(x \text{ is structural zero}) \cdot e^{c_{0_{I}} + c_{r} + c_{z_{I}} E(z \mid x = 0)},

which gives

c_{r} = \log \left( \frac{E(y \mid x = 0) - (1 - ρ_{x_{I}}) e^{-μ_{x_{I}}} \cdot e^{c_{0_{I}} + c_{z_{I}} E(z \mid x = 0)}}{ρ_{x_{I}} \cdot e^{c_{0_{I}} + c_{z_{I}} E(z \mid x = 0)}} \right).
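The closed-form expression for the initial value of c_r can be evaluated directly once the first-stage quantities are available. The sketch below simply transcribes that expression; the function name `initial_c_r`, its argument names and all numerical inputs are hypothetical and serve only to show the arithmetic, with E(y | x = 0) and E(z | x = 0) assumed to be replaced by sample means in practice.

```python
import math


def initial_c_r(mean_y_at_zero, rho_x, mu_x, c0, cz, mean_z_at_zero):
    """Evaluate the displayed closed form for the initial value of c_r."""
    # exp(c0 + cz * E(z | x = 0)): Poisson mean of y for a random zero of x.
    base = math.exp(c0 + cz * mean_z_at_zero)
    # (1 - rho_x) * exp(-mu_x): weight attached to random zeros in the formula.
    random_zero_weight = (1.0 - rho_x) * math.exp(-mu_x)
    numerator = mean_y_at_zero - random_zero_weight * base
    if numerator <= 0.0:
        raise ValueError("the argument of the logarithm is not positive")
    return math.log(numerator / (rho_x * base))


# Hypothetical inputs: build E(y | x = 0) from a known c_r, then recover it.
rho_x, mu_x, c0, cz, zbar, true_c_r = 0.3, 2.0, -1.0, 1.0, 0.5, 1.0
base = math.exp(c0 + cz * zbar)
mean_y = (1.0 - rho_x) * math.exp(-mu_x) * base + rho_x * base * math.exp(true_c_r)
recovered = initial_c_r(mean_y, rho_x, mu_x, c0, cz, zbar)
print(recovered)
```

The guard on the logarithm's argument is a design choice of this sketch: with noisy sample means the numerator can be non-positive, in which case another starting value would be needed.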

The simulation results are summarized in Tables S5, S6 and S7 as the supporting web material. Similar to the continuous and binary cases, all estimates are quite close to the true values when the Main Model (18) is correctly specified. But when the Main Model is misspecified, as shown in Table S6, the estimates are biased and the coverage probabilities are very small as well. Again, misspecification of the Main Model does not have much impact on the Auxiliary Model.

4.5. Zero-inflated Poisson Response Y

Finally, we consider a zero-inflated count response y generated from the following ZIP model:

y \mid v \sim \text{i.d. } \mathrm{ZIP}(ρ(v; θ_{2}), μ(v; θ_{2})), \quad ρ = \operatorname{logit}^{-1}(c_{0} + c_{x} x + c_{r} r + c_{z} z), \quad μ = \exp(c_{0}^{'} + c_{x}^{'} x + c_{r}^{'} r + c_{z}^{'} z),

(19)

where v = (x,r,z)^T. We set $c_{0}^{'} = c_{0} = - 1$ and $c_{x}^{'} = c_{x} = c_{r}^{'} = c_{r} = c_{z}^{'} = c_{z} = 1$ . Since the latent nature of ZIP requires a larger sample size to obtain reliable estimates, especially within the context of a latent x following another ZIP, we consider bigger sample sizes 500, 1000 and 1500 for each case.
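As a small illustration of model (19), the sketch below evaluates the two linear predictors and the resulting ZIP probabilities for a single covariate vector. The function names and the covariate values are hypothetical; the probability mass function uses the standard ZIP form, with ρ the probability of a structural zero and μ the Poisson mean, and the default coefficients are the simulation values stated above.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def zip_pmf(k, rho, mu):
    """P(Y = k) for a ZIP with structural-zero probability rho and Poisson mean mu."""
    poisson = math.exp(-mu) * mu ** k / math.factorial(k)
    return rho * (k == 0) + (1.0 - rho) * poisson


def zip_parameters(x, r, z, c=(-1.0, 1.0, 1.0, 1.0), c_prime=(-1.0, 1.0, 1.0, 1.0)):
    """Return (rho, mu) of model (19); c is the logistic part, c_prime the log-linear part."""
    eta = c[0] + c[1] * x + c[2] * r + c[3] * z
    eta_prime = c_prime[0] + c_prime[1] * x + c_prime[2] * r + c_prime[3] * z
    return expit(eta), math.exp(eta_prime)


# Hypothetical subject: one event of x, not a structural zero, z = 0.
rho, mu = zip_parameters(x=1, r=0, z=0.0)
prob_zero = zip_pmf(0, rho, mu)            # structural plus random zeros of y
total = sum(zip_pmf(k, rho, mu) for k in range(40))
print(rho, mu, prob_zero, total)
```

The zero probability combines both sources of zeros, ρ + (1 − ρ)e^{−μ}, which is why the zeros of y alone do not reveal which subjects are structural zeros.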

Again, we determine the initial values for estimating μ_x and ρ_x as discussed in Section 3.3. For the initial values of c₀, c_x, c_z of the logistic component and $c_{0}^{'}, c_{x}^{'}, c_{z}^{'}$ of the log-linear component of the ZIP for y, we apply the following model to the subsample with x > 0:

y \mid x, z \sim \text{i.d. } \mathrm{ZIP}(ρ(x, z; η_{1}), μ(x, z; η_{2})), \quad ρ = \operatorname{logit}^{-1}(c_{0} + c_{x} x + c_{z} z), \quad \log(μ) = c_{0}^{'} + c_{x}^{'} x + c_{z}^{'} z, \quad η_{1} = (c_{0}, c_{x}, c_{z})^{⊤}, \quad η_{2} = (c_{0}^{'}, c_{x}^{'}, c_{z}^{'})^{⊤}.

Due to the complexity of the model, we set $c_{r} = c_{r}^{'} = 0$ as the initial values for estimating c_r and $c_{r}^{'}$ .

Tables 3, S8 and S9 show the simulation results. When the Main Model (19) is correctly specified, as shown in Tables 3 and S9, the estimates from both the log-linear Poisson component and the logistic zero-inflated component of the Main Model are very good. The point estimates are close to the true values and the coverage probabilities are close to 95%. The asymptotic variances are also quite close to their corresponding empirical counterparts. But when the Main Model is misspecified, as shown in Table S8, the estimates from both components of the Main Model are biased, especially the estimates of c_xp and c_xb. This indicates that when the outcome follows a ZIP model and has a zero-inflated count predictor x, if the difference between the structural and random zeros in the predictor is ignored, the Main Model can detect neither the true relationship between y and x nor the associations between y and other covariates. Comparing the two components of the Main Model, the estimates of c_zb in the zero-inflated component are relatively better than the estimates of c_zp in the Poisson component. The misspecification of the Main Model does not have much impact on the performance of the Auxiliary Model.

Table 3.

Mean estimated parameters (mean asymptotic variance, simulation variance) and the coverage probabilities (CP) (%) over 1000 realizations for the ZIP response. The variances are in 10⁻² for c_zp, a₀, b₀ and b₁, and 10⁻¹ for the other estimates except for c_rb and c_0b.

The estimates of the parameters for the Poisson component of the Main Model:

	c_0p = −1		c_xp = 1		c_rp = 1		c_zp = 1

N	Est.	CP	Est.	CP	Est.	CP	Est.	CP

200	−1.078 ( 3.29 3.86 )	91.3	1.024 ( 0.99 1.27 )	93.9	1.031 ( 3.53 4.01 )	93.2	1.029 ( 2.86 3.18 )	93.0
500	−1.031 ( 0.93 0.90 )	94.3	1.016 ( 0.22 0.21 )	94.3	1.012 ( 0.98 0.96 )	94.3	1.009 ( 0.95 0.99 )	94.5
1000	−1.011 ( 0.39 0.41 )	94.4	1.003 ( 0.07 0.08 )	93.9	1.000 ( 0.41 0.45 )	93.6	1.004 ( 0.43 0.46 )	93.7

The estimates of the parameters for the zero-inflated component of the Main Model:

	c_0b = −1		c_xb = 1		c_rb = 1		c_zb = 1

N	Est.	CP	Est.	CP	Est.	CP	Est.	CP

200	−1.412 ( 1.80 1.78 )	95.5	1.234 ( 4.75 5.15 )	96.8	1.254 ( 1.65 1.71 )	95.7	1.216 ( 2.92 3.33 )	95.1
500	−1.194 ( 0.49 0.48 )	96.2	1.113 ( 1.17 1.17 )	95.4	1.140 ( 0.47 0.49 )	96.4	1.085 ( 0.79 0.82 )	94.8
1000	−1.076 ( 0.21 0.21 )	94.8	1.051 ( 0.49 0.51 )	95.4	1.038 ( 0.21 0.21 )	95.1	1.036 ( 0.36 0.34 )	96.0

The estimates of the parameters for the Auxiliary Model:

	a₀ = 0.5		a₁ = 1		b₀ = 0.5		b₁ = 1

N	Est.	CP	Est.	CP	Est.	CP	Est.	CP

200	0.464 ( 6.89 6.78 )	95.4	1.102 ( 1.28 1.50 )	94.4	0.486 ( 1.67 1.68 )	95.1	1.014 ( 1.79 1.8 )	94.5
500	0.490 ( 2.52 2.49 )	95.3	1.033 ( 0.43 0.41 )	96.4	0.493 ( 0.65 0.65 )	95.7	1.002 ( 0.64 0.6 )	96.5
1000	0.494 ( 1.23 1.32 )	94.2	1.024 ( 0.21 0.24 )	93.3	0.498 ( 0.32 0.31 )	95.5	0.998 ( 0.3 0.33 )	94.8


5. Real Data Analysis

5.1. The Data

We now use the 2009–2010 National Health and Nutrition Examination Survey (NHANES) study discussed in [11] as an illustrative real-data application. The NHANES is a survey research program conducted by the National Center for Health Statistics to assess the health and nutritional status of people in the United States. A brief introduction to the study and a more detailed description of the NHANES data can be found in [11]. In the NHANES study, alcohol use is measured by the number of days of alcohol consumption (DAD) in a week, while depressive symptoms are assessed by the Patient Health Questionnaire (PHQ-9). As discussed in [11], both DAD and PHQ-9 have excessive zeros in their distributions. By fitting a zero-inflated Poisson (ZIP) model, we found that the DAD outcome has excessive zeros and that the proportion of structural zeros in DAD was as high as 30%. Note that, for illustrative purposes, in all of these analyses we ignore the complex survey design of NHANES and do not incorporate the sampling weights.

We apply the proposed approach to examine potential differential rates of depression between the at-risk and non-risk subgroups of alcohol use. In the proposed method, the essential component is to tease apart the effect of alcohol use, a trait of an individual, from the effect of the amount of alcohol use when modelling the relationship between alcohol use and depression. One of the unique features of the NHANES is the inclusion of the variable “NeverDrink”, which measures lifetime abstinence from alcohol. This variable asks if a subject has ever used alcohol in his/her life. It is not a perfect indicator of structural zeros in our context, since subjects who have used alcohol but became abstinent from it (structural zeros) are not counted as Never Drinkers. Nonetheless, the variable “NeverDrink” may serve at least as a crude benchmark to examine the performance of the proposed approach. Regarding the main predictor DAD, we also want to know whether any demographic information predicts DAD.

5.2. Statistical Model

We apply the approach to model the effect of alcohol use on PHQ-9 score. For the PHQ-9 score, we applied a ZIP, with age, race, gender, education, and DAD as well as the indicator of structural zeros of DAD as the explanatory variables. Since our initial univariate analysis of DAD vs. PHQ-9 suggested a quadratic association between the two variables, a square of DAD (DAD²) was also included as a predictor. We also consider a ZIP model for the DAD variable by including age, race, gender, education as predictors for both components of the DAD variable. So, our model (Model I) to study the effect of alcohol use on depression is specified as follows:

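\text{PHQ-9}_{i} \sim \mathrm{ZIP}(ρ_{i}, μ_{i}), \quad \text{DAD} \sim \mathrm{ZIP}(ρ_{x i}, μ_{x i}),
ρ_{i} \sim \text{Structural zero of DAD} + \text{DAD} + \text{DAD}^{2} + \text{age} + \text{gender} + \text{race} + \text{education},
μ_{i} \sim \text{Structural zero of DAD} + \text{DAD} + \text{DAD}^{2} + \text{age} + \text{gender} + \text{race} + \text{education},
ρ_{x i} \sim \text{age} + \text{gender} + \text{race} + \text{education},
μ_{x i} \sim \text{age} + \text{gender} + \text{race} + \text{education},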

(20)

where ρ_xi is the probability for structural zeros and μ_xi is the Poisson mean of the DAD variable.

We apply the maximum likelihood method discussed in Section 2.2 to make inference about the parameters of the model in (20). The coefficients of the structural zeros of DAD indicate the effect of an individual's trait for alcohol use on depressive symptoms, while those of DAD and DAD² give the effect of the amount of drinking on this response for subjects in the at-risk group for alcohol use. We also apply a ZIP model to the PHQ-9 score with exactly the same explanatory variables, except that the indicator of structural zeros of DAD in (20) is replaced by the variable “NeverDrink”. This second ZIP model (Model II) does not involve the latent variable of structural zeros of DAD, and thus provides a benchmark for assessing the performance of the proposed approach. Unlike Model I, Model II does not include a ZIP Auxiliary Model for the predictor DAD.

5.3. Results

Due to some missing values, the actual sample size for the analysis is 5,261 (out of 5,283 subjects in the data). Table 5 shows the parameter estimates for the logistic and Poisson components of ZIP Models I and II for the PHQ-9 score. Both models identify significant associations between alcohol use and depression in both components. In the logistic component, which models the likelihood of non-depression (structural zeros of the PHQ-9 score), the non-drinkers are more likely to be in the non-risk group for depression, i.e., less likely to be at risk for depression (p-value 0.0142 for Model I and <0.0001 for Model II). Among the at-risk subgroup for alcohol use, the coefficients for DAD² are significant in both models (p-value <0.0001 and 0.0006 for Model I and II, respectively). The negative signs of these coefficients indicate that subjects with DAD at the two ends, near 0 (few days of alcohol use) or near 7 (most days of alcohol use), are at higher risk for depressive symptoms. Based on Model I (II), subjects with 2.70 (2.04) days of any alcohol use per week are least likely to be depressed.

Table 4.

Mean estimated parameters (mean asymptotic variance, simulation variance) and the coverage probabilities (CP) (%) over 1000 realizations for the ZIP response when the Main Model is misspecified. The variances are in 10⁻² for c_zp, a₀, b₀ and b₁, and 10⁻¹ for the other estimates except for c_0b.

The estimates of the parameters for the Poisson component of the Main Model:

	c_0p = −1		c_xp = 1		c_zp = 1

N	Est.	CP	Est.	CP	Est.	CP

200	−0.214 ( 0.38 0.47 )	4.3	0.609 ( 0.38 0.70 )	29.5	0.969 ( 2.46 3.35 )	90.6
500	−0.211 ( 0.14 0.16 )	0.0	0.676 ( 0.08 0.17 )	2.9	0.947 ( 0.82 1.15 )	85.0
1000	−0.210 ( 0.07 0.09 )	0.0	0.699 ( 0.03 0.08 )	0.1	0.946 ( 0.38 0.58 )	78.1

The estimates of the parameters for the zero-inflated component of the Main Model:

	c_0b = −1		c_xb = 1		c_zb = 1

N	Est.	CP	Est.	CP	Est.	CP

200	−0.230 ( 0.24 0.26 )	52.0	0.730 ( 1.67 1.36 )	79.4	1.038 ( 2.07 2.47 )	92.5
500	−0.162 ( 0.08 0.07 )	16.9	0.704 ( 0.33 0.24 )	56.7	0.945 ( 0.63 0.67 )	92.7
1000	−0.149 ( 0.04 0.04 )	1.9	0.701 ( 0.16 0.11 )	33.2	0.916 ( 0.30 0.30 )	90.1

The estimates of the parameters for the Auxiliary Model:

	a₀ = 0.5		a₁ = 1		b₀ = 0.5		b₁ = 1

N	Est.	CP	Est.	CP	Est.	CP	Est.	CP

200	0.468 ( 7.33 7.47 )	94.9	1.095 ( 13.70 1.60 )	94.3	0.486 ( 1.75 1.82 )	95.1	1.013 ( 1.86 1.92 )	93.9
500	0.488 ( 2.72 2.74 )	95.3	1.036 ( 4.80 0.40 )	96.6	0.493 ( 0.68 0.68 )	95.5	1.003 ( 0.66 0.61 )	96.6
1000	0.493 ( 1.33 1.42 )	93.7	1.025 ( 2.30 0.30 )	93.5	0.498 ( 0.34 0.33 )	94.8	0.998 ( 0.32 0.35 )	94.1


For the Poisson component, the non-drinkers have fewer depressive symptoms based on both models (p-value <0.0001 for both models). Among the subjects who are at risk for alcohol use, the coefficients of DAD² are again significant in both models (p-value <0.0001 for both models). The positive signs of these coefficients indicate that subjects with DAD at the two ends, near 0 and near 7, have higher PHQ-9 scores. Based on Model I (II), subjects with 3.55 (2.82) days of alcohol use per week have the lowest PHQ-9 scores.
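The days-per-week values quoted for both components follow from the turning point of the quadratic in DAD, −β_DAD / (2 β_DAD²), evaluated at the estimates reported in Table 5. The short sketch below reproduces that arithmetic; the function and dictionary names are hypothetical, and the coefficients are copied from Table 5.

```python
def turning_point(b_dad, b_dad_sq):
    """Stationary point of b_dad * d + b_dad_sq * d**2 as a function of d."""
    return -b_dad / (2.0 * b_dad_sq)


# (coefficient of DAD, coefficient of DAD squared) from Table 5.
table5 = {
    ("logistic", "Model I"): (0.4050, -0.0751),
    ("logistic", "Model II"): (0.1602, -0.0392),
    ("Poisson", "Model I"): (-0.5246, 0.0739),
    ("Poisson", "Model II"): (-0.1316, 0.0233),
}

turning = {key: round(turning_point(*coef), 2) for key, coef in table5.items()}
for key, value in turning.items():
    print(key, value)
```

In the logistic component the negative DAD² coefficient makes the turning point a maximum of the probability of non-depression, whereas in the Poisson component the positive DAD² coefficient makes it a minimum of the expected PHQ-9 score.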

The results for the Auxiliary Model of DAD are summarized in Table S10. Gender, age and education are significant predictors for both the Poisson and logistic components: older males with higher education are more likely to have more drinks, and older females with lower education are more likely to be in the non-risk group for alcohol drinking. Compared to people of other races, Mexican Americans and Non-Hispanics are more likely to be at risk and also to drink more if they are at risk for drinking.

Regarding the relationship between alcohol drinking and depression, both models yield similar conclusions, although Model I models the latent trait of alcohol use, while Model II uses the observed measure of this trait when examining the effects of alcohol use on depression. Abstinence from alcohol and moderate alcohol consumption are both protective against depression. However, there is some discrepancy in the estimates between Model I and Model II. Our estimated percentage of structural zeros (non-risk group) is 35.0%, while the percentage of structural zeros based on the NeverDrink variable is only 12.0%. Since the NeverDrink variable asks whether subjects have had any drink in their lifetime, those who have used alcohol but have since become abstinent are treated as structural zeros (non-drinkers) in the proposed approach (Model I). In contrast, such individuals are regarded as part of the at-risk subgroup in Model II. So the difference in the percentage of structural zeros between the two approaches likely reflects the different interpretations of lifetime abstinence from alcohol.

6. Discussions

Zero-inflation, in which the observed number of zeros is larger than would be expected under a statistical model, is a common phenomenon in public health and medical research, and it is often associated with the existence of structural zeros. It is important both statistically and conceptually to distinguish random and structural zeros. However, structural zeros are often latent, and whether a zero is structural or random is often not observed directly.

A comparatively large body of literature has focused on statistical methodology and its applications in addressing the structural zero issue when the count variable is used as the response. Little attention has been paid to the case where such count variables are used as predictors. In such cases, simply ignoring the differential effects of structural zeros in the data analyses may yield biased estimates and uninterpretable findings [11]. In this paper, we have developed statistical models to address the differential effect of structural zeros and random zeros in such predictors.

The proposed approach fills a critical gap in the literature by addressing the structural zero issue in predictors through jointly modeling the response of interest (Main Model) and the zero-inflated count predictors (Auxiliary Model). To untangle the effect of structural zeros from that of random zeros, an indicator of structural zeros, which is partially latent, is included in the Main Model to address the confounding effects of the two types of zeros. Validity of the zero-inflated model for the count predictor is critical for the application of the method.

We described four popular types of responses for the Main Model and two types of count predictors for the Auxiliary Model in this paper. In the proposed approach, we assumed conditional independence for both components, or equivalently, that there is no confounder in either the Main or the Auxiliary Model. Such assumptions are standard in regression analysis. The approach is easy to implement using popular statistical packages such as R and SAS. As with any mixture model, initial values are important for finding the maximum likelihood estimates. Based on our experience, the two-step procedure works quite well for selecting satisfactory initial values for computing the estimates. Also, our simulations and real data example have shown good performance of the approach.

Like any statistical method, the proposed approach also has some limitations. The method discussed in this paper is only applicable to cross-sectional studies. Further research is needed to extend the approach to longitudinal studies. Since it is premised upon parametric distribution assumptions, the approach lacks robustness against departures from assumed parametric models. Semiparametric approaches are needed for both cross-sectional and longitudinal studies to address such limitations, which is the next development we are pursuing.

Supplementary Material

Supp1

NIHMS1504815-supplement-Supp1.pdf (181KB, pdf)

Table 5.

Comparison of estimates (Estimate), standard errors (Std Err) and p-values (P-value) from the ZIP models of depression for the 2009–2010 NHANES study

The estimates of the Poisson component of depression:

Parameter		Model I			Model II

		Estimate	Std Err	P-value	Estimate	Std Err	P-value

Intercept		2.7791	0.0635	<.0001	2.3457	0.0498	<.0001

NeverDrink	Yes vs. No	−1.3195	0.0267	<.0001	−0.1265	0.0256	<.0001

DAD		−0.5246	0.0177	<.0001	−0.1316	0.0145	<.0001

DAD²		0.0739	0.0028	<.0001	0.0233	0.0025	<.0001

Gender	Male vs. Female	−0.1822	0.0197	<.0001	−0.1895	0.0165	<.0001

AGE		0.0007	0.0006	0.2540	−0.0032	0.0005	<.0001

Race/Ethnicity	Mexican American	−0.1724	0.0521	0.0009	−0.0954	0.0407	0.0192

	Other Hispanic	−0.0260	0.0537	0.6287	−0.0336	0.0425	0.4288

	Non-Hispanic White	−0.1571	0.0479	0.0010	−0.0889	0.0376	0.0181

	Non-Hispanic Black	−0.0212	0.0507	0.6756	0.0090	0.0401	0.8228

	Other Race	0.0000	0.0000		0.0000	0.0000

Education		−0.1320	0.0081	<.0001	−0.1202	0.0068	<.0001

The estimates of the zero-inflated component of depression:

Parameter		Model I			Model II

		Estimate	Std Err	P-value	Estimate	Std Err	P-value

Intercept		−2.1316	0.2322	<.0001	−1.9935	0.2016	<.0001

NeverDrink	Yes vs. No	0.3959	0.1615	0.0142	0.4755	0.0970	<.0001

DAD		0.4050	0.1026	0.0001	0.1602	0.0591	0.0067

DAD²		−0.0751	0.0168	<.0001	−0.0392	0.0114	0.0006

Gender	Male vs. Female	0.5152	0.0677	<.0001	0.5684	0.0645	<.0001

AGE		0.0159	0.0020	<.0001	0.0165	0.0018	<.0001

Race/Ethnicity	Mexican American	0.0626	0.1708	0.7138	0.0786	0.1597	0.6224

	Other Hispanic	−0.1192	0.1808	0.5097	−0.0971	0.1695	0.6012

	Non-Hispanic White	−0.2984	0.1580	0.0589	−0.2402	0.1473	0.1028

	Non-Hispanic Black	0.0114	0.1675	0.9457	0.0488	0.1564	0.7549

	Other Race	0.0000	0.0000		0.0000	0.0000

Education		0.0184	0.0280	0.5095	0.0415	0.0262	0.1136


Acknowledgements

This work was supported by the NIH under grants R33 DA027521, R01GM108337, and R01HD075635.

The authors thank Professor Xin Tu and the reviewers for their comments and suggestions.

Appendix

See the Web-based Supplementary Materials.

References

[1] Böhning D, Zero-inflated Poisson models and C.A.MAN: A tutorial collection of evidence, Biometrical Journal 40 (1998), pp. 833–843.
[2] Buu A, Johnson N, Li R, and Tan X, New variable selection methods for zero-inflated count data with applications to the substance abuse field, Statistics in Medicine 30 (2011), pp. 2326–2340.
[3] Calsyn DA, Hatch-Maillette M, Tross S, Doyle SR, Crits-Christoph P, Song YS, Harrer JM, Lalos G, and Berns SB, Motivational and skills training HIV/sexually transmitted infection sexual risk reduction groups for men, Journal of Substance Abuse Treatment 37 (2009), pp. 138–150.
[4] Connor J, Kypri K, Bell M, and Cousins K, Alcohol outlet density, levels of drinking and alcohol-related harm in New Zealand: a national study, Journal of Epidemiology and Community Health 65 (2011), pp. 841–846.
[5] Cranford J, Zucker R, Jester J, Puttler L, and Fitzgerald H, Parental alcohol involvement and adolescent alcohol expectancies predict alcohol involvement in male adolescents, Psychology of Addictive Behaviors 24 (2010), pp. 386–396.
[6] Fernandez A, Wood M, Laforge R, and Black J, Randomized trials of alcohol-use interventions with college students and their parents: lessons from the transitions project, Clinical Trials 8 (2011), pp. 205–213.
[7] Gurmu S and Trivedi PK, Excess zeros in count models for recreational trips, Journal of Business & Economic Statistics 14 (1996), pp. 469–477.
[8] Hagger-Johnson G, Bewick B, Conner M, O’Connor D, and Shickle D, Alcohol, conscientiousness and event-level condom use, British Journal of Health Psychology 16 (2011), pp. 828–845.
[9] Hall D and Zhang Z, Marginal models for zero inflated clustered data, Statistical Modelling 4 (2004), pp. 161–180.
[10] Hall DB, Zero-inflated Poisson and binomial regression with random effects: A case study, Biometrics 56 (2000), pp. 1030–1039.
[11] He H, Wang W, Crits-Christoph P, Gallop R, Tang W, Chen D, and Tu X, On the implication of structural zeros as independent variables in regression analysis: application to alcohol research, Journal of Data Science 12 (2014), pp. 439–460.
[12] He H, Wang W, Chen D, and Tang W, On the effect of structural zeros in regression models, in Innovative Statistical Methods for Public Health Data, ICSA Book Series in Statistics, Chen D and Wilson J, eds., Springer, Cham, 2015, pp. 97–115.
[13] Hernandez-Avila C, Song C, Kuo L, Tennen H, Armeli S, and Kranzler H, Targeted versus daily naltrexone: secondary analysis of effects on average daily drinking, Alcoholism: Clinical and Experimental Research 30 (2006), pp. 860–865.
[14] Hildebrandt T, McCrady B, Epstein E, Cook S, and Jensen N, When should clinicians switch treatments? An application of signal detection theory to two treatments for women with alcohol use disorders, Behaviour Research and Therapy 48 (2010), pp. 524–530.
[15] Horton N, Bebchuk J, Jones C, Lipsitz S, Catalano P, Zahner G, and Fitzmaurice G, Goodness-of-fit for GEE: An example with mental health service utilization, Statistics in Medicine 18 (1999), pp. 213–222.
[16] Hur K, Hedeker D, Henderson W, Khuri S, and Daley J, Modeling clustered count data with excess zeros in health care outcomes research, Health Services and Outcomes Research Methodology 3 (2002), pp. 5–20.
[17] Lambert D, Zero-inflated Poisson regression, with an application to defects in manufacturing, Technometrics 34 (1992), pp. 1–14.
[18] Miaou SP, The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions, Accident Analysis & Prevention 26 (1994), pp. 471–482.
[19] Neal D, Sugarman D, Hustad J, Caska C, and Carey K, It’s all fun and games... or is it? Collegiate sporting events and celebratory drinking, Journal of Studies on Alcohol and Drugs 66 (2005), pp. 291–294.
[20] Ozmen I and Famoye F, Count regression models with an application to zoological data containing structural zeros, Journal of Data Science 5 (2007), pp. 491–502.
[21] Pardini D, White H, and Stouthamer-Loeber M, Early adolescent psychopathology as a predictor of alcohol use disorders by young adulthood, Drug and Alcohol Dependence 88 (2007), pp. S38–S49.
[22] Tang W, He H, and Tu X, Applied Categorical and Count Data Analysis, Chapman & Hall/CRC, Boca Raton, 2012.
[23] Welsh AH, Cunningham RB, Donnelly C, and Lindenmayer DB, Modelling the abundance of rare species: statistical models for counts with extra zeros, Ecological Modelling 88 (1996), pp. 297–308.
[24] Wilde MH, Crean HF, McMahon JM, McDonald MV, Tang W, Brasch J, Fairbanks E, Shah S, and Zhang F, Testing a model of self-management of fluid intake in community-residing long-term indwelling urinary catheter users, Nursing Research 65 (2016), pp. 97–106.
[25] Yu Q, Chen R, Tang W, He H, Gallop R, Crits-Christoph P, Hu J, and Tu X, Distribution free models for longitudinal count responses with overdispersion and structural zeros, Statistics in Medicine 32 (2012), pp. 2390–2405.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supp1

NIHMS1504815-supplement-Supp1.pdf^{(181KB, pdf)}

[R1] [1].BoÈhning D, Zero-inflated poisson models and ca man: A tutorial collection of evidence, Biometrical Journal 40 (1998), pp. 833–843. [Google Scholar]

[R2] [2].Buu A, Johnson N, Li R, and Tan X, New variable selection methods for zero-inflated count data with applications to the substance abuse field, Statistics in medicine 30 (2011), pp. 2326–2340. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] [3].Calsyn DA, Hatch-Maillette M, Tross S, Doyle SR, Crits-Christoph P, Song YS, Harrer JM, Lalos G, and Berns SB, Motivational and skills training hiv/sexually transmitted infection sexual risk reduction groups for men, Journal of Substance Abuse Treatment 37 (2009), pp. 138–150. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] [4].Connor J, Kypri K, Bell M, and Cousins K, Alcohol outlet density, levels of drinking and alcohol-related harm in new zealand: a national study, Journal of epidemiology and community health 65 (2011), pp. 841–846. [DOI] [PubMed] [Google Scholar]

[R5] [5].Cranford J, Zucker R, Jester J, Puttler L, and Fitzgerald H, Parental alcohol involvement and adolescent alcohol expectancies predict alcohol involvement in male adolescents., Psychology of Addictive Behaviors; Psychology of Addictive Behaviors 24 (2010), pp. 386–396. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] [6].Fernandez A, Wood M, Laforge R, and Black J, Randomized trials of alcohol-use interventions with college students and their parents: lessons from the transitions project, Clinical Trials 8 (2011), pp. 205–213. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] [7].Gurmu S and Trivedi PK, Excess zeros in count models for recreational trips, Journal of Business & Economic Statistics 14 (1996), pp. 469–477. [Google Scholar]

[R8] [8].Hagger-Johnson G, Bewick B, Conner M, O’Connor D, and Shickle D, Alcohol, conscientiousness and event-level condom use, British journal of health psychology 16 (2011), pp. 828–845. [DOI] [PubMed] [Google Scholar]

[R9] [9].Hall D and Zhang Z, Marginal models for zero inflated clustered data, Statistical Modelling 4 (2004), pp. 161–180. [Google Scholar]

[R10] [10].Hall DB, Zero-inflated Poisson and binomial regression with random effects: A case study, Biometrics 56 (2000), pp. 1030–1039. [DOI] [PubMed] [Google Scholar]

[R11] [11].He H, Wang W, Crits-Christoph P, Gallop R, Tang W, Chen D, and Tu X, On the implication of structural zeros as independent variables in regression analysis: applicatoin to alcohol research, Journal of Data Science 12 (2014), pp. 439–460. [PMC free article] [PubMed] [Google Scholar]

[R12] [12].He H, Wang W, Chen D, and Tang W, On the effect of structural zeros in regression models, in Innovative Statistical Methods for Public Health Data ICSA Book Series in Statistics, Chen D and Wilson J, eds., Springer, Cham, 2015, pp. 97–115. [Google Scholar]

[R13] [13].Hernandez-Avila C, Song C, Kuo L, Tennen H, Armeli S, and Kranzler H, Targeted versus daily naltrexone: secondary analysis of effects on average daily drinking, Alcoholism: Clinical and Experimental Research 30 (2006), pp. 860–865. [DOI] [PubMed] [Google Scholar]

[R14] [14].Hildebrandt T, McCrady B, Epstein E, Cook S, and Jensen N, When should clinicians switch treatments? an application of signal detection theory to two treatments for women with alcohol use disorders, Behaviour research and therapy 48 (2010), pp. 524–530. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] [15].Horton N, Bebchuk J, Jones C, Lipsitz S, Catalano P, Zahner G, and Fitzmaurice G, Goodness-of-fit for GEE: An example with mental health service utilization, Statistics in Medicine 18 (1999), pp. 213–222. [DOI] [PubMed] [Google Scholar]

[R16] [16].Hur K, Hedeker D, Henderson W, Khuri S, and Daley J, Modeling clustered count data with excess zeros in health care outcomes research, Health Services and Outcomes Research Methodology 3 (2002), pp. 5–20. [Google Scholar]

[17] Lambert D, Zero-inflated Poisson regression, with an application to defects in manufacturing, Technometrics 34 (1992), pp. 1–14.

[18] Miaou SP, The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions, Accident Analysis & Prevention 26 (1994), pp. 471–482.

[19] Neal D, Sugarman D, Hustad J, Caska C, and Carey K, It’s all fun and games... or is it? Collegiate sporting events and celebratory drinking, Journal of Studies on Alcohol and Drugs 66 (2005), pp. 291–294.

[20] Ozmen I and Famoye F, Count regression models with an application to zoological data containing structural zeros, Journal of Data Science 5 (2007), pp. 491–502.

[21] Pardini D, White H, and Stouthamer-Loeber M, Early adolescent psychopathology as a predictor of alcohol use disorders by young adulthood, Drug and Alcohol Dependence 88 (2007), pp. S38–S49.

[22] Tang W, He H, and Tu X, Applied Categorical and Count Data Analysis, Chapman & Hall/CRC, Boca Raton, 2012.

[23] Welsh AH, Cunningham RB, Donnelly C, and Lindenmayer DB, Modelling the abundance of rare species: statistical models for counts with extra zeros, Ecological Modelling 88 (1996), pp. 297–308.

[24] Wilde MH, Crean HF, McMahon JM, McDonald MV, Tang W, Brasch J, Fairbanks E, Shah S, and Zhang F, Testing a model of self-management of fluid intake in community-residing long-term indwelling urinary catheter users, Nursing Research 65 (2016), pp. 97–106.

[25] Yu Q, Chen R, Tang W, He H, Gallop R, Crits-Christoph P, Hu J, and Tu X, Distribution free models for longitudinal count responses with overdispersion and structural zeros, Statistics in Medicine 32 (2012), pp. 2390–2405.
