Abstract
Zero-inflated count data are frequently encountered in public health and epidemiology research. Two-part models are often used to accommodate the excess zeros; they treat the data as a mixture of two components: a point mass at zero and a count distribution, such as a Poisson distribution. When the rate of events per unit exposure is of interest, an offset is commonly used to account for the varying extent of exposure; the offset is essentially a predictor whose regression coefficient is fixed at one. Such an assumption about the exposure effect is, however, quite restrictive for many practical problems. Further, for zero-inflated models, the offset is often included only in the count component of the model, although the probability of an excess zero could also be affected by the amount of ‘exposure’. We therefore propose incorporating the varying exposure as a covariate, rather than an offset term, in both the excess-zero probability and the conditional count components of the zero-inflated model. A real example is used to illustrate the proposed methods, and simulation studies are conducted to assess their performance in a broad variety of situations.
Keywords: Count data, zero-inflated models, exposure, offset
2010 MATHEMATICS SUBJECT CLASSIFICATION: 62-07
1. Introduction
In public health and epidemiology research, count data with a large proportion of zeros are often encountered. For example, in health services utilization studies, the service utilization count often includes a large number of zeros, representing patients with no utilization during the study period. A common feature of this type of data is that the count measure has more zeros than a common count distribution, such as the Poisson or negative binomial (NB), can accommodate.
To overcome the issue of excess zeros, the so-called zero-inflated (ZI) models [19] can be specified, which are a mixture of two components: a point mass at zero and a count distribution, such as a Poisson or negative binomial distribution. An alternative modeling strategy is the hurdle model [17,23], which assumes that all zeros come from one ‘structural’ source: one part of the model is a binary model for whether the response variable is zero or positive, and the other part is a zero-truncated model, such as a zero-truncated Poisson or zero-truncated NB distribution, for the positive data. Both types of models have been rapidly embraced in a large number of areas, from population and epidemiological studies to ecological studies [7,10,11,20,22,24,25,27,34,35].
Despite the widespread application of both types of models in the literature, a question arises as to how to address the effect of varying exposure (underlying population or duration at risk) in such models. For example, for geographically distributed disease count data, the number of cases in region i might be larger than that in region j because region i had a substantially larger population at risk than region j. As a result, a higher number of disease counts in one region compared to another does not necessarily imply that subjects in this region have a higher susceptibility to the disease. Similarly, in longitudinal studies, the number of repeated events may depend on the follow-up time for the patients. Event rates can be calculated as events per unit time, which allows the observation window to vary for each unit. In these examples, exposure is respectively unit area, person–years and unit time.
In the literature, a commonly used method for incorporating the population size at risk and/or the amount of exposure time is the introduction of an offset term, i.e. the log of exposure, as an explanatory variable whose coefficient is fixed at one [1]. For example, for a Poisson log-linear model with expected mean $\mu$, exposure $E$ and covariate $x$, the model can be written as $\log(\mu) = \log(E) + \beta_0 + \beta_1 x$, where $\log(E)$ is referred to as an offset. This implies
$$\mu = E \exp(\beta_0 + \beta_1 x). \tag{1}$$
This means that the mean count is proportional to the exposure, with a proportionality constant that depends on the values of the explanatory variables. However, such a proportionality assumption may not be plausible. For example, for modeling the incidence of infectious disease, the heterogeneity of the underlying population may have a varying effect on the likelihood and intensity of disease transmission. As a result, the number of events in the response variable may increase non-proportionally with the population at risk. When such sophisticated exposure effects arise in applications, the assumption embedded in the offset term becomes inadequate. Hence, the widely accepted practice of using an offset to adjust for the varying extent of exposure should be carefully examined and used with caution.
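The proportionality built into an offset can be made concrete with a short numerical sketch. The function name `poisson_mean`, the argument `exposure_coef` and the coefficient values below are hypothetical and chosen only for illustration; with the coefficient on log exposure fixed at one (the offset case), doubling the exposure exactly doubles the mean, whereas any other coefficient breaks that proportionality.

```python
import math

def poisson_mean(exposure, beta0, beta1, x, exposure_coef=1.0):
    """Mean of a log-linear Poisson model.

    exposure_coef = 1.0 reproduces the offset case; other values treat
    log(exposure) as an ordinary covariate with a free coefficient.
    """
    return math.exp(beta0 + beta1 * x + exposure_coef * math.log(exposure))

# Hypothetical coefficients, for illustration only.
beta0, beta1, x = -2.0, 0.5, 1.0

for coef in (1.0, 0.7):
    ratio = (poisson_mean(200, beta0, beta1, x, coef)
             / poisson_mean(100, beta0, beta1, x, coef))
    print(f"coefficient {coef}: doubling exposure multiplies the mean by {ratio:.3f}")
```

The ratio does not depend on `beta0`, `beta1` or `x`, which is why a misspecified exposure coefficient cannot be absorbed by the remaining covariates alone.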
For zero-inflated models, offset is often incorporated only in the count component of the model ([13,20,21,32,38], for example). Hall [16] considered relaxing the assumption in offset by setting the coefficient on the logarithm of exposure as an unknown parameter in the count component to be estimated in the model-fitting procedure. However, the probability of observing excessive zeros can also be impacted by varying exposure in many situations; that is, the probability of excessive zeros is expected to decrease with increasing exposure. Baetschmann et al. [4] proposed a modified zero-inflated count model where the probability of extra zero is derived from an underlying duration model with Weibull hazard rate. However, to the best of our knowledge, no attempt has been made to extend the zero-inflated model to adjust for the extent of exposure as a covariate in both excess zeros and the count components, particularly in the context of modeling disease incidence collected over a geographical region. Further, simulation studies to explore the impact of misspecification of modeling the effect of varying exposures are limited and therefore warranted.
The overall goal of this study is to discuss the impact of misspecification of the modeling method for varying exposures for zero-inflated models. We explored two potential types of misspecification:
(i) The effect of varying exposure may differ from one. When the mean count is not proportional to the population at risk, imposing the offset term by constraining the effect of the population at risk to one can be very restrictive. Such a constraint misestimates the effect of exposure and may thereby lead to biased parameter estimates and predictions.
(ii) Both the excess zero and count components may depend on varying exposures. In the literature, the zero-inflated model typically includes the exposure only as an offset in the count component. We show that ignoring varying exposures in the binary part can lead to biased parameter estimates, and that the bias can be sensitive to the size of the exposure effect.
The results for the hurdle models are consistent with those for zero-inflated models, so for ease of presentation we focus on the zero-inflated models. The remainder of the paper is organized as follows. In Section 2, a review of zero-inflated models without and with varying extent of exposure is presented. Section 3 describes the methods for model selection and diagnostic checks for zero-inflated models. To demonstrate the pitfalls of using an offset term for zero-inflated models, a real example of a health care utilization study is given in Section 4. Simulation studies comparing the finite sample performance of the approaches for accounting for varying extent of exposure in zero-inflated models are presented in Section 5. Concluding remarks are given in Section 6.
2. Statistical models
2.1. Zero-inflated model
In a zero-inflated (ZI) model [19], zero observations have two different origins: ‘structural’ and ‘sampling’. The sampling zeros are due to the usual Poisson or negative binomial (NB) distribution, which assumes that those zero observations happened by chance.
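Let $Y_i$ denote the response for the ith subject, $i = 1, \ldots, n$. The zero-inflated Poisson (ZIP) model is given by:
$$Y_i \sim \begin{cases} 0 & \text{with probability } \pi_i, \\ \text{Poisson}(\mu_i) & \text{with probability } 1 - \pi_i, \end{cases} \tag{2}$$
where $\pi_i$ denotes the probability of the observation arising from the degenerate distribution at zero and $\mu_i$ represents the mean of the Poisson distribution. This formulation allows for more zeros than permitted under the Poisson assumption when $\pi_i > 0$. The probability mass function of the ZIP model can be written as
$$P(Y_i = y_i) = \begin{cases} \pi_i + (1 - \pi_i)\, e^{-\mu_i}, & y_i = 0, \\ (1 - \pi_i)\, \dfrac{e^{-\mu_i} \mu_i^{y_i}}{y_i!}, & y_i = 1, 2, \ldots \end{cases} \tag{3}$$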
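As a minimal sketch of the ZIP probability mass function in Equation (3), the following computes the probability of any count from an excess-zero probability and a Poisson mean. The function name `zip_pmf` and the values of `pi` and `mu` are illustrative assumptions, not part of the original analysis.

```python
import math

def zip_pmf(y, pi, mu):
    """P(Y = y) for a zero-inflated Poisson.

    pi is the excess-zero probability, mu the Poisson mean. The Poisson
    term is evaluated on the log scale for numerical stability.
    """
    poisson = math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))
    if y == 0:
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson

pi, mu = 0.3, 2.0  # hypothetical values
print(zip_pmf(0, pi, mu))                           # inflated above exp(-mu)
print(sum(zip_pmf(y, pi, mu) for y in range(100)))  # very close to 1
```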
The ZIP model can include covariates for modeling both $\pi_i$ and $\mu_i$. Generally, $\pi_i$ is modeled with a logistic regression and $\mu_i$ with a log-linear regression. The ZI model can be written as
$$\text{logit}(\pi_i) = \mathbf{z}_i^\top \boldsymbol{\gamma}, \qquad \log(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta}, \tag{4}$$
where $\boldsymbol{\gamma}$ is the column vector of parameters associated with the excess zeros, $\boldsymbol{\beta}$ is the column vector of parameters associated with the Poisson process, and $\mathbf{z}_i$ and $\mathbf{x}_i$ are the vectors of covariates for the ith study subject for the excess zeros and Poisson processes, respectively. Note that the explanatory variables describing $\pi_i$ do not need to be the same as those describing $\mu_i$. To account for unobserved heterogeneity, one can specify the Poisson process as $\log(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta} + b_i$, where $b_i$ is a random effect term, which can follow an independent $N(0, \sigma^2)$ distribution. In this article, we consider the common logit and log link functions for the binary and count outcomes. Other choices of link functions are possible, such as probit or complementary log-log link functions for the binary component.
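The ZIP model can be regarded as a mixture of a Poisson component and a degenerate component with all of its mass at zero. As such, this model posits an unobserved, latent binary variable $Z_i$ with $P(Z_i = 1) = \pi_i$, such that $Y_i = 0$ when $Z_i = 1$ and $Y_i \mid Z_i = 0$ is a Poisson variate with mean $\mu_i$. The marginal mean of the ZIP model can then be derived as
$$\mathbb{E}(Y_i) = \mathbb{E}\{\mathbb{E}(Y_i \mid Z_i)\} = (1 - \pi_i)\, \mathbb{E}(Y_i \mid Z_i = 0) = (1 - \pi_i)\mu_i. \tag{5}$$
The second equality holds because $Z_i = 1$ implies $Y_i = 0$. The variance can be derived as
$$\text{Var}(Y_i) = \mathbb{E}\{\text{Var}(Y_i \mid Z_i)\} + \text{Var}\{\mathbb{E}(Y_i \mid Z_i)\} \tag{6}$$
$$= (1 - \pi_i)\mu_i + \pi_i (1 - \pi_i)\mu_i^2 \tag{7}$$
$$= (1 - \pi_i)\mu_i (1 + \pi_i \mu_i). \tag{8}$$
As a result, the zero-inflated model can accommodate overdispersion relative to a Poisson model, since $\text{Var}(Y_i) > \mathbb{E}(Y_i)$ when $\pi_i > 0$. This also indicates that misspecification of either the binary or the Poisson component of a ZIP model can lead to biased predicted means and variance estimates.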
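The closed-form mean and variance can be checked numerically by summing over the ZIP probability mass function. The sketch below is only an illustration: the helper names `zip_pmf` and `zip_moments`, the truncation point `upper` and the parameter values are assumptions made for the example.

```python
import math

def zip_pmf(y, pi, mu):
    """ZIP probability mass function (excess-zero probability pi, Poisson mean mu)."""
    poisson = math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))
    return pi + (1 - pi) * poisson if y == 0 else (1 - pi) * poisson

def zip_moments(pi, mu, upper=500):
    """Mean and variance by direct summation over y = 0, ..., upper - 1."""
    probs = [zip_pmf(y, pi, mu) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

pi, mu = 0.4, 3.0  # hypothetical values
mean, var = zip_moments(pi, mu)
print(mean, (1 - pi) * mu)                  # numerical vs closed-form mean
print(var, (1 - pi) * mu * (1 + pi * mu))   # numerical vs closed-form variance
```

The variance exceeds the mean by the factor `1 + pi * mu`, which is the overdispersion induced by zero inflation.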
2.2. Zero-inflated models with varying exposures
In modeling zero-inflated count data, the population at risk and/or the amount of time of exposure are often heterogeneous among the study subjects. In this article, we refer to the population at risk and/or amount of time of exposure as ‘exposure’. The variable exposure for the positive count process is typically handled through an offset term, as is done in a log-linear Poisson regression. For example, let $E_i$ denote the population at risk for the disease of interest. An offset term, $\log(E_i)$, is often incorporated into the count component of a ZI model to account for variable exposure, so the model can be written as
$$\text{logit}(\pi_i) = \mathbf{z}_i^\top \boldsymbol{\gamma}, \qquad \log(\mu_i) = \log(E_i) + \mathbf{x}_i^\top \boldsymbol{\beta}. \tag{9}$$
This model implicitly assumes that all subjects who belong to the excess zero component with the same covariate profiles are at the same risk of experiencing the outcome regardless of the size of the population at risk. This may not be plausible, as the probability of observing an excess zero is likely to decrease as the exposure size increases. Ignoring the differential exposure may result in biased estimates for both the binary and Poisson components of the ZIP model. A direct adaptation of the model ZIP-O in Equation (9) is to introduce an offset term in the binary component of the model as well; that is,
$$\text{logit}(\pi_i) = \log(E_i) + \mathbf{z}_i^\top \boldsymbol{\gamma}, \qquad \log(\mu_i) = \log(E_i) + \mathbf{x}_i^\top \boldsymbol{\beta}. \tag{10}$$
Nevertheless, this model can be implausible, since zero inflation and the conditional model work in opposite directions (i.e. a higher expected value for the zero inflation ($\pi_i$) leads to a lower response, but a higher value for the conditional model ($\mu_i$) leads to a higher response). Further, restricting the effect of exposure to one can be inconsistent with the true extent of association between exposure and the outcome.
Therefore, we propose modifying the ZIP model to address the variable exposure not only for the count component but also for the binary process, by incorporating exposure as a covariate in both the binary and count components of the model. The modified ZIP model can be expressed as
$$\text{logit}(\pi_i) = \mathbf{z}_i^\top \boldsymbol{\gamma} + f_1(E_i), \qquad \log(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta} + f_2(E_i), \tag{11}$$
where $f_1$ and $f_2$ represent the functional effects of the extent of exposure ($E_i$) for the binary and count components of a ZIP model, respectively, which can take on any form, such as polynomials or spline functions, or can be modeled as:
$$f_1(E_i) = \delta_1 \log(E_i), \qquad f_2(E_i) = \delta_2 \log(E_i), \tag{12}$$
where $\delta_1$ and $\delta_2$ are the regression coefficients for the log-transformed $E_i$. This model relaxes the assumption made on the offset term by allowing $\delta_1$ and $\delta_2$ to deviate from one and also to have opposite signs.
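The contrast between an offset in both components and exposure entered as a covariate can be sketched numerically. In the snippet below, `zip_parts` and all coefficient values are hypothetical; the first setting fixes both log-exposure coefficients at one (the double-offset model), the second lets them take opposite signs, as the proposed model allows.

```python
import math

def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))

def zip_parts(exposure, gamma0, delta1, beta0, delta2):
    """Excess-zero probability and Poisson mean with log exposure in both parts.

    delta1 and delta2 are the log-exposure coefficients of the binary and
    count components; setting either to 1.0 mimics an offset in that part.
    """
    log_e = math.log(exposure)
    pi = expit(gamma0 + delta1 * log_e)
    mu = math.exp(beta0 + delta2 * log_e)
    return pi, mu

for exposure in (5, 50):
    offset_both = zip_parts(exposure, gamma0=-2.0, delta1=1.0, beta0=0.0, delta2=1.0)
    covariate = zip_parts(exposure, gamma0=2.0, delta1=-1.5, beta0=0.0, delta2=0.7)
    print(exposure, offset_both, covariate)
```

Under the double offset, the excess-zero probability is forced to rise with exposure, which is the implausible behaviour described above; with a negative binary coefficient it falls while the conditional mean still rises.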
2.3. Statistical inference
The log-likelihood function for the proposed ZIP model as presented in Equation (11) is given by:
$$\ell(\boldsymbol{\gamma}, \boldsymbol{\beta}) = \sum_{i:\, y_i = 0} \log\{\pi_i + (1 - \pi_i)\, e^{-\mu_i}\} + \sum_{i:\, y_i > 0} \{\log(1 - \pi_i) - \mu_i + y_i \log(\mu_i) - \log(y_i!)\},$$
where $\pi_i = \exp\{\mathbf{z}_i^\top \boldsymbol{\gamma} + f_1(E_i)\} / [1 + \exp\{\mathbf{z}_i^\top \boldsymbol{\gamma} + f_1(E_i)\}]$ and $\mu_i = \exp\{\mathbf{x}_i^\top \boldsymbol{\beta} + f_2(E_i)\}$. For the model accounting for the unobserved regional variation, the marginal log-likelihood function of the ZIP model with random effects can be written as
$$\ell = \sum_{i=1}^{n} \log \int_{-\infty}^{\infty} P(Y_i = y_i \mid b_i)\, \frac{1}{\sigma}\, \phi\!\left(\frac{b_i}{\sigma}\right) db_i, \tag{13}$$
where the unobserved heterogeneity is quantified by the random effect $b_i$, which is assumed to be Gaussian on the scale of the linear predictor with mean zero and standard deviation $\sigma$, and $\phi$ is the standard normal density function. Lambert [19] expressed the log-likelihood in terms of latent variables and used the EM algorithm for maximum likelihood fitting by treating the latent excess-zero indicators $Z_i$ as missing values. For a model with random effect terms, numerical integration techniques, such as Gauss-Hermite quadrature, or Markov chain Monte Carlo (MCMC) can be used. Nevertheless, those methods can be computationally intensive. Alternatively, the models can be fitted using the glmmTMB R package [8], which performs maximum likelihood estimation via TMB (Template Model Builder) [18]. To maximize computational efficiency, TMB uses the Laplace approximation to integrate over random effects and automatic differentiation to estimate the first and second derivatives of the log-likelihood function [18]. This package is more flexible than other packages available for estimating zero-inflated models via maximum likelihood estimation and is faster than packages that use MCMC sampling for estimation [8].
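For intuition, the fixed-effects part of this log-likelihood (without the random effect and its integral) can be written in a few lines. This is a sketch under stated assumptions, not the glmmTMB implementation: the function name `zip_loglik`, its argument layout, and the toy data are hypothetical, and exposure enters both linear predictors through its logarithm.

```python
import math

def zip_loglik(y, z_rows, x_rows, exposure, gamma, beta, delta1, delta2):
    """ZIP log-likelihood with log exposure as a covariate in both components.

    gamma, beta: coefficient lists for the binary and count covariates;
    delta1, delta2: log-exposure coefficients of the two components.
    """
    total = 0.0
    for yi, zi, xi, ei in zip(y, z_rows, x_rows, exposure):
        eta_binary = sum(g * v for g, v in zip(gamma, zi)) + delta1 * math.log(ei)
        eta_count = sum(b * v for b, v in zip(beta, xi)) + delta2 * math.log(ei)
        pi = 1.0 / (1.0 + math.exp(-eta_binary))
        mu = math.exp(eta_count)
        if yi == 0:
            # a zero can come from either the excess-zero or the Poisson part
            total += math.log(pi + (1 - pi) * math.exp(-mu))
        else:
            total += math.log(1 - pi) - mu + yi * math.log(mu) - math.lgamma(yi + 1)
    return total

# Toy data, for illustration only.
y = [0, 0, 3, 1]
z_rows = [[1.0]] * 4                                   # intercept only
x_rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
exposure = [2.0, 5.0, 20.0, 8.0]
print(zip_loglik(y, z_rows, x_rows, exposure,
                 gamma=[1.0], beta=[0.2, 0.3], delta1=-1.0, delta2=0.8))
```

Maximizing this function over the coefficients gives the maximum likelihood fit of the model without random effects; adding the random effect requires the integral in Equation (13).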
3. Model selection and diagnostic checks
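To inform model selection, we use the Akaike information criterion (AIC) [2] and the Bayesian information criterion (BIC) [28]. AIC and BIC are defined as $\text{AIC} = D + 2m$ and $\text{BIC} = D + m \log(n)$, where m is the number of parameters in the model, n is the number of observations, and D is the deviance, defined as twice the negative log-likelihood of the ZIP model, $D = -2\ell$, where
$$\ell = \sum_{i:\, y_i = 0} \log\{\pi_i + (1 - \pi_i)\, e^{-\mu_i}\} + \sum_{i:\, y_i > 0} \{\log(1 - \pi_i) - \mu_i + y_i \log(\mu_i) - \log(y_i!)\}. \tag{14}$$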
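The two criteria differ only in the penalty applied to the deviance, which the following sketch makes explicit. The inputs are assumptions for illustration: n = 315 is the number of municipalities in the motivating example, m = 7 is our own count of the parameters of the best-fitting model there, and the deviance is back-calculated from the AIC reported for that model in Table 1 rather than taken from the original analysis.

```python
import math

def aic(deviance, m):
    """AIC = D + 2m."""
    return deviance + 2 * m

def bic(deviance, m, n):
    """BIC = D + m log(n)."""
    return deviance + m * math.log(n)

n, m = 315, 7                 # assumed sample size and parameter count
deviance = 1358.04 - 2 * m    # back-calculated from a reported AIC (assumption)

print(round(aic(deviance, m), 2))
print(round(bic(deviance, m, n), 2))
```

Because log(315) is larger than 2, BIC penalizes each additional parameter more heavily than AIC at this sample size.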
The smaller the values of AIC and BIC, the better a model fits the data. Examining residuals is a standard tool for assessing the adequacy of regression models. For a discrete response, Pearson or deviance residuals are far from normal, so graphical and quantitative inspection of these residuals provides little information for model diagnosis [14]. Hence, the adequacy of the ZIP models is examined on the basis of randomized quantile residuals (RQR), as developed by Dunn and Smyth [14]. The RQR can be defined as follows. Let $F(y_i; \pi_i, \mu_i)$ denote the CDF of the response variable $Y_i$ following the ZIP distribution given the sets of covariates $\mathbf{z}_i$ and $\mathbf{x}_i$ for the binary and count components, respectively, where $\text{logit}(\pi_i) = \mathbf{z}_i^\top \boldsymbol{\gamma}$ and $\log(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta}$. Let $p(y_i; \pi_i, \mu_i)$ be the corresponding probability mass function of $Y_i$. Since F is discrete, it is randomized into a uniform random number, which is defined as a function with a random number $u_i$ from the uniform distribution on $(0, 1)$ as an additional argument,
$$F^*(y_i; \pi_i, \mu_i, u_i) = F(y_i^-; \pi_i, \mu_i) + u_i\, p(y_i; \pi_i, \mu_i), \tag{15}$$
where $F(y_i^-; \pi_i, \mu_i)$ is the lower limit of F at $y_i$, i.e. $P(Y_i < y_i)$, the lower limit in the ‘gap’ of $F$ at $y_i$. The RQR for $y_i$ is the standard normal quantile corresponding to the random lower tail probability, with $\pi_i$ and $\mu_i$ estimated from the sample,
$$r_i = \Phi^{-1}\{F^*(y_i; \hat{\pi}_i, \hat{\mu}_i, u_i)\}, \tag{16}$$
where $\Phi^{-1}$ is the quantile function of the standard normal distribution, and $u_i$ is a random number uniformly distributed on $(0, 1)$. These residuals are expected to approximately follow a standard normal distribution if the model is correctly specified. Hence, the validity of the model can be assessed by graphing the RQRs versus the predicted response variable. If the model fits the data well, RQRs should be randomly scattered between $-3$ and 3 without a discernible pattern. The normality of the RQRs can be examined using Q-Q plots, in which the points should lie along the straight diagonal line if the model is correctly specified.
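A minimal sketch of the randomized quantile residual for a ZIP response follows. The helper names (`zip_pmf`, `zip_cdf`, `rqr`), the small clamping constant and the example values are assumptions for illustration; in practice the excess-zero probability and Poisson mean would be the fitted values for each observation.

```python
import math
import random
from statistics import NormalDist

def zip_pmf(y, pi, mu):
    """ZIP probability mass function."""
    poisson = math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))
    return pi + (1 - pi) * poisson if y == 0 else (1 - pi) * poisson

def zip_cdf(y, pi, mu):
    """P(Y <= y); returns 0 for y < 0."""
    return sum(zip_pmf(k, pi, mu) for k in range(y + 1)) if y >= 0 else 0.0

def rqr(y, pi, mu, rng):
    """Randomized quantile residual: fill the gap of the discrete CDF uniformly."""
    lower = zip_cdf(y - 1, pi, mu)              # P(Y < y)
    p = lower + rng.random() * zip_pmf(y, pi, mu)
    p = min(max(p, 1e-12), 1 - 1e-12)           # keep strictly inside (0, 1)
    return NormalDist().inv_cdf(p)

rng = random.Random(0)
# Hypothetical observations with a common (pi, mu), for illustration only.
print([round(rqr(y, 0.3, 2.0, rng), 3) for y in (0, 0, 1, 4)])
```

Because each residual involves a random draw, repeated calls give different values for the same observation; fixing the seed makes a diagnostic plot reproducible.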
4. Motivating example: respiratory hospital admissions data
The adverse effect of ambient air pollution has drawn considerable attention over the past decade, and air pollution has been shown to be associated with respiratory morbidity and mortality. In this motivating example, our goal is to study the relationship between nitrogen dioxide (NO$_2$) and the number of hospital admissions for respiratory causes in Turin province (Italy) in 2004, while accounting for the differential size of the population at risk across the study area. The dataset records the number of observed hospitalizations for respiratory causes and the population size at the municipality level, as well as the average NO$_2$ for the same period and the same areas. The data were obtained from the data repository of the book Spatial and Spatial-temporal Bayesian Models with R-INLA [6].
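Of the 315 municipalities, 173 (54.9%) had zero hospitalizations. We categorize NO$_2$ according to its tertiles, i.e. 80.57 and 126.54, into three levels: $\le 80.57$ (reference category), $(80.57, 126.54]$ and $> 126.54$. As shown in Figure 1, the distribution of the hospitalization counts is highly positively skewed and depends on the values of NO$_2$, with higher response values occurring at a higher level of NO$_2$. Let $y_i$ and $E_i$ denote the number of observed hospitalizations and the size of the population at risk at the ith municipality, respectively. In the context of disease mapping, the expected number is often included as an offset term, which is often expressed as the number of cases defined by an epidemiologic ‘null model’ of incidence, i.e. the product of $n_i$, the number of individuals at risk in region i, and r, a constant ‘baseline’ risk per individual defined as $r = \sum_i y_i / \sum_i n_i$, the global observed disease rate. The following four competing models are considered, which are expressed as
$$\text{ZIP-W}^b: \quad \text{logit}(\pi_i) = \gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i} + \delta_1 \log(E_i), \quad \log(\mu_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \delta_2 \log(E_i) + b_i \tag{17}$$
$$\text{ZIP-W}^c: \quad \text{logit}(\pi_i) = \gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i}, \quad \log(\mu_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \delta_2 \log(E_i) + b_i \tag{18}$$
$$\text{ZIP-O}^b: \quad \text{logit}(\pi_i) = \log(E_i) + \gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i}, \quad \log(\mu_i) = \log(E_i) + \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + b_i \tag{19}$$
$$\text{ZIP-O}^c: \quad \text{logit}(\pi_i) = \gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i}, \quad \log(\mu_i) = \log(E_i) + \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + b_i \tag{20}$$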
where $\pi_i$ is the probability of no hospital admissions at the ith municipality and $\mu_i$ is the expected mean number of hospitalizations of the Poisson distribution; $x_{1i}$ and $x_{2i}$ denote the dummy variables for the middle and upper NO$_2$ tertile categories, respectively. To account for unobservable heterogeneity, the area-specific random effect, $b_i$, is included in the model. In the model ZIP-W$^b$, the coefficient on $\log(E_i)$ is considered as an unknown parameter in both the binary and Poisson processes, to be estimated in the model-fitting procedure, where the superscript b refers to ‘both’ components; ZIP-W$^c$ includes $\log(E_i)$ as an explanatory variable only in the count component, where the superscript c refers to the ‘count’ component. In contrast, ZIP-O$^b$ considers the population at risk as an offset term in both the binary and count components of the ZIP model, and ZIP-O$^c$ only includes the offset term in the count component. The parameter estimation was carried out in R (R Core Team, 2019) via the glmmTMB package [8]. For the binary component of the candidate models, $x_{1i}$ and $x_{2i}$ are not significantly associated with the probability of excess zeros and therefore were removed from the binary component of the models.
From Table 1, ZIP-W$^b$ gave the smallest values of AIC and BIC, suggesting that it provided the best fit to the data among the competing models. The intercept for the binary component of the ZIP-W$^b$ model is positive and significantly different from zero (Table 1). Under model ZIP-O$^b$, the estimated intercept is negative and significantly different from zero. By comparison, the estimated intercept of the binary component is not significantly different from zero under models ZIP-W$^c$ and ZIP-O$^c$. The opposite sign of the estimated intercept under ZIP-O$^b$ compared to ZIP-W$^b$, and its non-significance under models ZIP-W$^c$ and ZIP-O$^c$, are due to the misspecification of the exposure effect in the binary component, so the intercept is trying to recover from this misspecification.
Table 1. Parameter estimates (Est), standard error (SE), p-value, AIC and BIC values for the ZIP-W$^b$, ZIP-W$^c$, ZIP-O$^b$ and ZIP-O$^c$ models for modeling the number of hospital admissions for respiratory causes over 315 municipalities in Turin province (Italy) in 2004.
| Parameter | ZIP-W$^b$ Est | ZIP-W$^b$ SE | ZIP-W$^b$ p-value | ZIP-W$^c$ Est | ZIP-W$^c$ SE | ZIP-W$^c$ p-value | ZIP-O$^b$ Est | ZIP-O$^b$ SE | ZIP-O$^b$ p-value | ZIP-O$^c$ Est | ZIP-O$^c$ SE | ZIP-O$^c$ p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Binary component | | | | | | | | | | | | |
| (intercept) | 4.24 | 0.52 | 0.00 | 0.03 | 0.12 | 0.81 | −3.31 | 0.20 | 0.00 | −0.11 | 0.13 | 0.39 |
| $\log(E_i)$ | −2.27 | 0.27 | 0.00 | | | | | | | | | |
| Count component | | | | | | | | | | | | |
| (intercept) | 1.04 | 0.13 | 0.00 | 0.79 | 0.14 | 0.00 | −0.55 | 0.15 | 0.00 | 0.21 | 0.13 | 0.10 |
| $x_{1i}$ (NO$_2$, middle tertile) | 0.24 | 0.12 | 0.04 | 0.34 | 0.12 | 0.00 | 0.85 | 0.17 | 0.00 | 0.36 | 0.14 | 0.01 |
| $x_{2i}$ (NO$_2$, upper tertile) | 0.43 | 0.12 | 0.00 | 0.48 | 0.13 | 0.00 | 0.91 | 0.17 | 0.00 | 0.26 | 0.14 | 0.07 |
| $\log(E_i)$ | 0.73 | 0.04 | 0.00 | 0.78 | 0.04 | 0.00 | | | | | | |
| Random-effect variance $\sigma^2$ | 0.15 | 0.38 | 0.00 | 0.15 | 0.39 | 0.00 | 0.48 | 0.69 | 0.00 | 0.23 | 0.48 | 0.00 |
| Model fit | | | | | | | | | | | | |
| AIC | 1358.04 | | | 1560.22 | | | 1856.33 | | | 1585.32 | | |
| BIC | 1384.30 | | | 1582.74 | | | 1875.10 | | | 1604.08 | | |
The effect of $\log(E_i)$ for the binary component of the model ZIP-W$^b$ is estimated as $-2.27$, which reflects the fact that the probability of observing an excess zero count decreases as the extent of exposure increases. In other words, the odds of observing an excess zero count are inversely related to $E_i$. In contrast, the effect of $\log(E_i)$ for the count component of the ZIP-W$^b$ model is estimated as $0.73$, which reflects that the conditional mean count increases as the extent of exposure increases. Thus, including the population at risk as an offset term by fixing its effect at one results in an incorrect assumption, particularly for the binary component. This is evident from the model comparison between ZIP-W$^b$ (AIC = 1358.04, BIC = 1384.30) and ZIP-O$^b$ (AIC = 1856.33, BIC = 1875.10), or between ZIP-W$^c$ (AIC = 1560.22, BIC = 1582.74) and ZIP-O$^c$ (AIC = 1585.32, BIC = 1604.08). For covariates included in the Poisson component of the ZIP models, the NO$_2$ dummy variables $x_{1i}$ and $x_{2i}$ are significantly and positively related to the expected counts of hospital admissions under the best fitting model ZIP-W$^b$, with estimated effects equal to 0.24 for $x_{1i}$ and 0.43 for $x_{2i}$. The effects of $x_{1i}$ and $x_{2i}$ are estimated to be slightly larger under model ZIP-W$^c$ and much larger under ZIP-O$^b$ as compared to ZIP-W$^b$. A notable difference between ZIP-W$^b$ and ZIP-O$^b$ is that the two models provide estimates of opposite signs for the intercept. Under model ZIP-O$^c$, only $x_{1i}$ is positively and significantly associated with the conditional mean number of hospitalizations ($p = 0.01$), whereas $x_{2i}$ is not significant at the 0.05 level ($p = 0.07$). The variances of the random effect are all estimated to be significantly different from zero, suggesting the importance of accounting for the unobserved regional effect. We also observe that the variance component under the model ZIP-O$^b$ (0.48) is much larger than under the model ZIP-O$^c$ (0.23). Models ZIP-W$^b$ and ZIP-W$^c$ give an almost identical estimate for the variance component (0.15). These results suggest that including exposure as a covariate in either or both of the binary and count components of the ZIP model can explain part of the residual variation of the response variable.
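The size of these exposure effects is easier to read on the multiplicative scale. The sketch below takes the two log-exposure coefficients as we read them from Table 1 for the best-fitting model (binary and count components); that reading of the table rows is our assumption. It then computes what a doubling of exposure implies for the odds of an excess zero and for the conditional mean count.

```python
# Log-exposure coefficients as read from Table 1 (assumed row labels).
delta_binary = -2.27   # binary (excess-zero) component
delta_count = 0.73     # count component

# A coefficient d on log(exposure) multiplies the odds (or the mean) by 2**d
# when exposure doubles; under an offset the factor would be exactly 2.
odds_factor = 2 ** delta_binary
mean_factor = 2 ** delta_count

print(f"odds of an excess zero are multiplied by {odds_factor:.2f}")
print(f"conditional mean count is multiplied by {mean_factor:.2f}")
```

Doubling the exposure thus cuts the odds of an excess zero to roughly a fifth and raises the conditional mean by roughly two thirds, rather than doubling it as an offset would require.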
We have so far only considered NO$_2$ and the extent of exposure to have an ‘additive’ effect; however, NO$_2$ and exposure may have an interaction effect. We therefore extended the ZIP-W$^b$ and ZIP-W$^c$ models by including the interactions between $\log(E_i)$ and the NO$_2$ dummy variables $x_{1i}$ and $x_{2i}$, either in both the binary and count components (the interaction version of ZIP-W$^b$) or only in the count component (the interaction version of ZIP-W$^c$). Our model fit shows that $x_{1i}$ and $x_{2i}$ have no main or interaction effect with $\log(E_i)$ for the binary component of the ZIP model, and they were therefore excluded from the binary component of the interaction version of ZIP-W$^b$. Nevertheless, $x_{1i}$ and $x_{2i}$ have a significant interaction effect with exposure for the Poisson component of the interaction version of ZIP-W$^b$, but not for that of ZIP-W$^c$, as shown in Table 2. The interaction version of ZIP-W$^b$ also gives the best model fit among all the competing models (AIC = 1325.53, BIC = 1359.30). These results indicate that (i) failing to include $\log(E_i)$ in the binary component of the model may lead to incorrect inference for the parameters in the count component, and (ii) covariates and extent of exposure can have a significant interaction effect on the outcome.
Table 2. Parameter estimates (Est), standard error (SE), p-value, AIC and BIC values for the ZIP-W$^b$ and ZIP-W$^c$ models including the interaction between exposure and NO$_2$ for modeling the number of hospital admissions for respiratory causes over 315 municipalities in Turin province (Italy) in 2004.
| Parameter | ZIP-W$^b$ with interaction: Est | SE | p-value | ZIP-W$^c$ with interaction: Est | SE | p-value |
|---|---|---|---|---|---|---|
| Binary component | | | | | | |
| (Intercept) | 4.25 | 0.52 | 0.00 | 0.13 | 0.15 | 0.39 |
| $\log(E_i)$ | −2.28 | 0.27 | 0.00 | | | |
| Count component | | | | | | |
| (Intercept) | 2.21 | 0.26 | 0.00 | 1.37 | 0.72 | 0.06 |
| $x_{1i}$ | −0.57 | 0.31 | 0.06 | 0.24 | 0.73 | 0.74 |
| $x_{2i}$ | −1.26 | 0.31 | 0.00 | −0.44 | 0.73 | 0.55 |
| $x_{1i} \times \log(E_i)$ | 0.34 | 0.12 | 0.00 | 0.04 | 0.28 | 0.90 |
| $x_{2i} \times \log(E_i)$ | 0.63 | 0.11 | 0.00 | 0.31 | 0.27 | 0.25 |
| $\log(E_i)$ | 0.25 | 0.10 | 0.02 | 0.57 | 0.27 | 0.04 |
| Random-effect variance $\sigma^2$ | 0.11 | 0.33 | 0.00 | 0.12 | 0.35 | 0.00 |
| Model fit | | | | | | |
| AIC | 1325.53 | | | 1550.40 | | |
| BIC | 1359.30 | | | 1580.42 | | |
We also considered fitting Poisson and NB models including the random effect term at the municipality level, with and without interactions between NO$_2$ and exposure, to examine whether a simpler model can adequately describe the data. Our results indicate that the Poisson and NB models fit the data worse than the candidate ZIP models, i.e. for Poisson without interaction ( , ), NB without interaction ( , ), Poisson with interaction ( , ) and NB with interaction ( , ). These results suggest the importance of accounting for excess zeros in this application.
In many applications, interest focuses on estimating and predicting marginal means given the explanatory variables. Misspecification of the extent of the population at risk may also have an impact on the marginal mean response, since the marginal means depend on the estimated parameters in both the binary and Poisson processes. In this application, we are interested in estimating and predicting the population-level mean counts of the number of hospitalizations over the studied area in association with NO$_2$ and the size of the population at risk. For ease of computation, without marginalizing over or conditioning on the random effects, the population-level predicted values of the response variable are calculated as the predicted unconditional counts at the mode (i.e. area-level effect $b_i = 0$), as suggested by Brooks et al. [8]. For example, the predicted response value under model ZIP-W$^b$ can be calculated as
$$\hat{y}_i = (1 - \hat{\pi}_i)\, \hat{\mu}_i, \qquad \hat{\pi}_i = \frac{\exp\{\hat{\gamma}_0 + \hat{\delta}_1 \log(E_i)\}}{1 + \exp\{\hat{\gamma}_0 + \hat{\delta}_1 \log(E_i)\}}, \qquad \hat{\mu}_i = \exp\{\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \hat{\delta}_2 \log(E_i)\}.$$
For the standard errors of the predicted values, posterior predictive simulations were used, drawing multivariate normal samples of the fixed-effect parameters, given that the resulting estimators asymptotically follow normal distributions [8].
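A rough sketch of this prediction-at-the-mode calculation, with a simulation-based interval, is given below. Several things here are assumptions and not the original computation: the estimates and standard errors are the values we read from Table 1 for the best-fitting model, the exposure value of 10 is arbitrary, and, because the estimated covariance matrix is not reported, the parameters are drawn as independent normals, whereas the approach described above uses multivariate normal draws. The resulting interval is therefore only indicative.

```python
import math
import random

def predicted_mean(log_e, x1, x2, g0, d1, b0, b1, b2, d2):
    """Unconditional mean (1 - pi) * mu with the random effect set to zero."""
    pi = 1.0 / (1.0 + math.exp(-(g0 + d1 * log_e)))
    mu = math.exp(b0 + b1 * x1 + b2 * x2 + d2 * log_e)
    return (1 - pi) * mu

# Estimates and standard errors as read from Table 1 (assumed row labels).
est = {"g0": 4.24, "d1": -2.27, "b0": 1.04, "b1": 0.24, "b2": 0.43, "d2": 0.73}
se = {"g0": 0.52, "d1": 0.27, "b0": 0.13, "b1": 0.12, "b2": 0.12, "d2": 0.04}

rng = random.Random(2024)
log_e = math.log(10.0)   # hypothetical exposure value
point = predicted_mean(log_e, 0, 0, **est)   # reference NO2 category

# Independent normal draws: a simplification, since no covariances are available.
draws = sorted(
    predicted_mean(log_e, 0, 0, **{k: rng.gauss(est[k], se[k]) for k in est})
    for _ in range(2000)
)
lower, upper = draws[int(0.025 * 2000)], draws[int(0.975 * 2000) - 1]
print(point, lower, upper)
```

With the actual fitted model, the draws would instead come from a multivariate normal centred at the estimates with the estimated covariance matrix, which typically gives narrower intervals than the independent draws used here.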
Figure 2 displays the predicted response values and point-wise confidence intervals of the respiratory hospitalization counts at the mode (i.e. area-specific random effect $b_i = 0$) against the exposure, by the three categories of NO$_2$, for all the considered ZIP models. The results based on the interaction versions of ZIP-W$^b$ and ZIP-W$^c$ clearly demonstrate the significance of the interaction effect between NO$_2$ and exposure, with the number of hospitalizations increasing more sharply with exposure for NO$_2$ in the highest category (127, 196] than for the other NO$_2$ levels. For the models without interaction terms, i.e. ZIP-W$^b$, ZIP-W$^c$ and ZIP-O$^c$, the predicted marginal means did not differ substantially across the levels of NO$_2$ and appeared to be averaged out across the levels. ZIP-O$^b$ yielded unreasonably low predicted values compared to the other models.
We also examined the model goodness of fit by comparing the predicted and the observed response variable, as shown in Figure 3. The panels in the left column are for the entire data and the panels in the right column are for displaying the data after excluding the largest observed response value for the ease of visualization. It is evident that the predicted values under models ZIP-W and ZIP-W are very close to the observed values. The other competing models result in underestimated response values. The results suggest that the models ZIP-W and ZIP-W predict the number of hospitalizations better than the other candidate models.
Despite the aforementioned numerical comparisons of model fit, a careful residual analysis indicates that model ZIP-W has the best model fit, with RQRs nearly normally distributed and no discernible pattern, as shown in the top panels of Figure 4. The distributions of the RQRs under the models ZIP-W , ZIP-W , ZIP-O and ZIP-O exhibit a bimodal pattern separated at zero, which suggests the importance of including exposure as an explanatory variable in the binary component of the ZIP model in this application. Altogether, failure to properly model the effect of exposure in both the binary and count components of the ZIP model can have a serious impact on the marginal parameter estimates. Inevitably, incorrect inferences may follow.
5. Simulation studies
Simulations were carried out to investigate the performances of the proposed method relative to the traditional methods specifying the underlying population at risk as an offset term in the model under a wide range of scenarios.
5.1. Data-generating mechanism
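In the simulations, zero-inflated counts were generated from the ZIP-W$^b$ model with sample sizes n = 250, 500 and 1000, defined as
$$\text{logit}(\pi_i) = \gamma_0 + \gamma_1 x_{1i} + \gamma_2 x_{2i} + \delta_1 \log(E_i), \qquad \log(\mu_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \delta_2 \log(E_i), \tag{21}$$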
where , , and with and . We considered and , respectively, to reflect the fact that the probability of excess zeros decreases with increasing extent of exposure and the conditional mean count increases with the extent of exposure. The values are chosen to mimic our motivating example presented in Section 4. The intercept for the binary component is set to 5 to yield about 53% zeros in the simulated datasets. The expected population at risk is simulated from a zero truncated negative binomial distribution with probability mass function where r is set as 0.1 and p is set as 0.0005, which gives the mean about 40, median 10 and variance 6000. For simplicity of presentation, we assume $x_{1i}$, $x_{2i}$ and $\log(E_i)$ do not interact.
For each simulation scenario, we generated 200 random samples from the true model and fitted ZIP-W$^b$, ZIP-W$^c$, ZIP-O$^b$ and ZIP-O$^c$ to determine the impact of mismodelling the effect of exposure on the parameter estimates in both parts of the ZIP model and on the overall model fit. To simulate zero-inflated data, we first simulate the latent variable $Z_i$ from Bernoulli($\pi_i$), with $\pi_i$ given by the binary component of the model. Then, if $Z_i = 1$, $Y_i = 0$; otherwise, $Y_i$ is simulated from a Poisson distribution with mean $\mu_i$.
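The two-step generation scheme can be sketched as follows. Everything specific in this example is an assumption for illustration: the function names, the coefficient values, the small set of exposure values, and the omission of the other covariates; the Poisson sampler uses Knuth's multiplication method, which is adequate only for moderate means.

```python
import math
import random

def rpois(mu, rng):
    """Poisson draw by Knuth's multiplication method (moderate mu only)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_zip(exposures, gamma0, delta1, beta0, delta2, rng):
    """Draw the latent excess-zero indicator first, then the Poisson count."""
    ys = []
    for e in exposures:
        pi = 1.0 / (1.0 + math.exp(-(gamma0 + delta1 * math.log(e))))
        mu = math.exp(beta0 + delta2 * math.log(e))
        z = 1 if rng.random() < pi else 0      # latent Bernoulli(pi) indicator
        ys.append(0 if z == 1 else rpois(mu, rng))
    return ys

rng = random.Random(42)
exposures = [1, 2, 5, 10, 20, 50] * 200        # hypothetical exposure values
y = simulate_zip(exposures, gamma0=2.0, delta1=-1.5, beta0=0.0, delta2=0.7, rng=rng)
print(sum(v == 0 for v in y) / len(y))         # overall share of zeros
```

With a negative log-exposure coefficient in the binary part, zeros concentrate among the low-exposure units, which is the feature the simulation design is meant to reproduce.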
To assess how the bias of the parameter estimates relates to an increased or decreased effect of exposure, additional simulation studies were conducted by setting and , respectively. We also considered simulating data from ZIP-O$^c$, which includes the exposure as an offset term only in the count component of the model, to examine the impact of over-parametrization of our proposed model ZIP-W$^b$ on statistical inference.
5.2. Performance measures
The goals of the simulation study are to determine how parameter estimates, standard errors, coverage probability and overall model fit are affected by misspecification of the effect of the extent of exposure. To this end, we present the bias (the mean of the estimated parameter minus the true value), the mean square error (MSE, the average of the squared differences between the estimated parameter and the true value), the coverage probabilities (CP) of the 95% confidence intervals of the estimated parameters, and the average values of AIC and BIC over repeated samples.
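These performance measures are straightforward to compute from the per-replicate estimates and standard errors. The sketch below assumes Wald-type 95% intervals (estimate plus or minus 1.96 standard errors), which is our assumption about how the intervals are formed; the function name and the toy inputs are hypothetical.

```python
def performance(estimates, std_errors, truth, z=1.959964):
    """Bias, MSE and coverage of Wald intervals over simulation replicates."""
    n = len(estimates)
    bias = sum(estimates) / n - truth
    mse = sum((e - truth) ** 2 for e in estimates) / n
    covered = sum(
        1 for e, s in zip(estimates, std_errors)
        if e - z * s <= truth <= e + z * s
    )
    return bias, mse, covered / n

# Toy replicates, for illustration only.
estimates = [1.0, 1.2, 0.8]
std_errors = [0.10, 0.05, 0.20]
print(performance(estimates, std_errors, truth=1.0))
```

In the toy input the second interval misses the true value because its standard error is small relative to its error, which is the pattern that drives coverage towards zero for a biased estimator as the sample size grows.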
5.3. Simulation results
Table 3 reports the bias, MSE, and CP of the confidence intervals for the parameter estimates from ZIP-W$^b$, ZIP-W$^c$, ZIP-O$^b$ and ZIP-O$^c$ fitted to the 200 simulated datasets generated from model ZIP-W$^b$ with sample sizes n = 250, 500 and 1000, respectively. For the binary component, only ZIP-W$^b$ results in unbiased estimates with CPs very close to the nominal level of 0.95. In contrast, fitting the misspecified models results in severely biased parameter estimates with CPs far from the nominal level. CP also decreases towards zero as the sample size increases from 250 to 1000. In particular, the intercept is highly affected by model misspecification, yielding very large bias, and there is also a substantial bias in the two estimated regression coefficients. The biased parameter estimates of the ZIP-W$^c$ model indicate that omitting exposure as a covariate in the binary component of the model results in biased estimates and invalid inference for that component. For the count component, in contrast, ZIP-W$^c$ gives nearly unbiased estimates with CPs close to the nominal level, as does ZIP-W$^b$ (Table 3). This result is consistent with the finding from the literature, which showed that if a covariate is removed from a Poisson model, both the estimated regression coefficient and the standard error are the same as the results based on the full model [5]. ZIP-O$^b$ gives the worst model fit, yielding the largest bias and MSE and the lowest CP. This result is not surprising, since ZIP-O$^b$ constrains the regression coefficients for the exposure in both model components to equal one, whereas both the binary and count processes are influenced by the exposure, but in opposite directions. Hence, the degree of model misspecification is higher than for the other candidate models.
Table 3. Bias, MSE and coverage probability (CP) of the confidence intervals for the parameter estimates from models ZIP-W , ZIP-W , ZIP-O and ZIP-O fitted to the 200 simulated datasets generated from the model ZIP-W of sample size n = 250, 500 and 1000, respectively.
n | Measure | Parameter | Binary: ZIP-W | Binary: ZIP-W | Binary: ZIP-O | Binary: ZIP-O | Count: ZIP-W | Count: ZIP-W | Count: ZIP-O | Count: ZIP-O |
---|---|---|---|---|---|---|---|---|---|---|
250 | Bias | Intercept | 0.229 | −4.786 | −8.100 | −4.944 | −0.005 | −0.012 | −0.946 | −0.938 |
250 | Bias | Coefficient 1 | −0.033 | 0.667 | 1.248 | 0.778 | 0.000 | 0.001 | 0.030 | 0.025 |
250 | Bias | Coefficient 2 | −0.097 | 0.674 | 1.273 | 0.787 | 0.002 | 0.003 | 0.039 | 0.033 |
250 | MSE | Intercept | 0.583 | 22.918 | 65.716 | 24.464 | 0.002 | 0.002 | 0.896 | 0.881 |
250 | MSE | Coefficient 1 | 0.252 | 0.539 | 2.246 | 0.705 | 0.000 | 0.000 | 0.006 | 0.005 |
250 | MSE | Coefficient 2 | 0.240 | 0.524 | 2.166 | 0.695 | 0.000 | 0.000 | 0.008 | 0.008 |
250 | CP | Intercept | 0.965 | 0.000 | 0.000 | 0.000 | 0.950 | 0.940 | 0.000 | 0.000 |
250 | CP | Coefficient 1 | 0.935 | 0.345 | 0.285 | 0.265 | 0.940 | 0.945 | 0.360 | 0.365 |
250 | CP | Coefficient 2 | 0.955 | 0.275 | 0.240 | 0.230 | 0.975 | 0.970 | 0.260 | 0.290 |
500 | Bias | Intercept | 0.088 | −4.764 | −8.437 | −4.966 | −0.001 | −0.010 | −0.958 | −0.949 |
500 | Bias | Coefficient 1 | 0.003 | 0.707 | 1.528 | 0.845 | 0.000 | 0.001 | 0.021 | 0.015 |
500 | Bias | Coefficient 2 | −0.024 | 0.714 | 1.436 | 0.845 | 0.000 | 0.001 | 0.033 | 0.027 |
500 | MSE | Intercept | 0.255 | 22.703 | 71.240 | 24.668 | 0.001 | 0.001 | 0.919 | 0.900 |
500 | MSE | Coefficient 1 | 0.128 | 0.542 | 2.721 | 0.759 | 0.000 | 0.000 | 0.002 | 0.002 |
500 | MSE | Coefficient 2 | 0.121 | 0.546 | 2.396 | 0.754 | 0.000 | 0.000 | 0.004 | 0.003 |
500 | CP | Intercept | 0.960 | 0.000 | 0.000 | 0.000 | 0.955 | 0.935 | 0.000 | 0.000 |
500 | CP | Coefficient 1 | 0.955 | 0.090 | 0.100 | 0.020 | 0.955 | 0.950 | 0.425 | 0.475 |
500 | CP | Coefficient 2 | 0.940 | 0.040 | 0.070 | 0.020 | 0.970 | 0.975 | 0.285 | 0.300 |
1000 | Bias | Intercept | 0.051 | −4.811 | −8.644 | −5.028 | −0.001 | −0.009 | −0.991 | −0.982 |
1000 | Bias | Coefficient 1 | −0.007 | 0.703 | 1.494 | 0.849 | 0.000 | 0.001 | 0.023 | 0.018 |
1000 | Bias | Coefficient 2 | −0.027 | 0.713 | 1.390 | 0.848 | −0.001 | 0.000 | 0.027 | 0.022 |
1000 | MSE | Intercept | 0.137 | 23.146 | 74.749 | 25.284 | 0.000 | 0.000 | 0.982 | 0.965 |
1000 | MSE | Coefficient 1 | 0.057 | 0.513 | 2.424 | 0.741 | 0.000 | 0.000 | 0.002 | 0.002 |
1000 | MSE | Coefficient 2 | 0.052 | 0.529 | 2.129 | 0.742 | 0.000 | 0.000 | 0.002 | 0.002 |
1000 | CP | Intercept | 0.950 | 0.000 | 0.000 | 0.000 | 0.950 | 0.935 | 0.000 | 0.000 |
1000 | CP | Coefficient 1 | 0.950 | 0.000 | 0.020 | 0.000 | 0.955 | 0.935 | 0.290 | 0.310 |
1000 | CP | Coefficient 2 | 0.950 | 0.000 | 0.005 | 0.000 | 0.935 | 0.920 | 0.255 | 0.270 |
For the count component of the ZIP models, ZIP-W and ZIP-W result in nearly unbiased estimates with CPs very close to the nominal level of 0.95. The results based on the ZIP-W model indicate that omitting exposure as a covariate in the binary component has minimal impact on the parameter estimates for the count component. This result is consistent with the finding from the literature that if a covariate is removed from a Poisson model when the data are balanced, both the estimated regression coefficient and the standard error are the same as those of the full model [26]. Nevertheless, under the ZIP-O and ZIP-O models, the intercept is severely underestimated, the estimated regression coefficients are biased upward and the CPs decline as the sample size increases. For ease of comparison among the considered models, we also graphed the estimated regression coefficients over the 200 simulated samples in Figure 5 and the CPs of the 95% confidence intervals for the regression coefficients in Figure 6. Table 4 presents the average AIC and average BIC over the 200 simulated datasets for model comparison. The results indicate that ZIP-W consistently gives the smallest AIC and BIC compared to the other competing models in all scenarios, followed by ZIP-W , ZIP-O and ZIP-O . Model ZIP-O gives the worst model fit, since it forces a model fit that is inconsistent with the binary and Poisson processes of the ZIP model.
Table 4. Comparison of model fit of ZIP-W , ZIP-W , ZIP-O and ZIP-O in terms of average AIC and BIC across 200 simulated datasets generated from the model ZIP-W .
n | ZIP-W | ZIP-W | ZIP-O | ZIP-O |
---|---|---|---|---|
AIC | ||||
250 | 976.520 | 1162.257 | 2064.654 | 1732.514 |
500 | 1900.825 | 2288.361 | 4130.966 | 3448.865 |
1000 | 3887.104 | 4671.542 | 8960.393 | 7597.910 |
BIC | ||||
250 | 1004.692 | 1186.908 | 2085.782 | 1753.643 |
500 | 1934.542 | 2317.863 | 4156.254 | 3474.153 |
1000 | 3926.366 | 4705.896 | 8989.840 | 7627.356 |
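The relationship between the two criteria in Table 4 can be checked with a short Python sketch. It uses the standard definitions AIC = 2k − 2 log L and BIC = k log(n) − 2 log L, so that BIC − AIC = k (log(n) − 2) for every dataset and therefore also for the averages. The helper names are hypothetical, and the parameter counts recovered below are inferred from the table rather than stated in this section.

```python
import math


def aic(loglik, k):
    """Akaike information criterion for log-likelihood loglik and k parameters."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion for sample size n."""
    return k * math.log(n) - 2 * loglik


def implied_num_params(aic_value, bic_value, n):
    """Solve BIC - AIC = k * (log(n) - 2) for k."""
    return (bic_value - aic_value) / (math.log(n) - 2)


# Average (AIC, BIC) pairs for n = 250, in the column order of Table 4.
table4_n250 = [
    (976.520, 1004.692),
    (1162.257, 1186.908),
    (2064.654, 2085.782),
    (1732.514, 1753.643),
]
implied = [implied_num_params(a, b, 250) for a, b in table4_n250]
print([round(k, 2) for k in implied])
```

For the n = 250 row, the implied parameter counts come out at roughly 8, 7, 6 and 6. This would be consistent with an intercept, two covariates and an exposure coefficient in each component of the full model, one fewer parameter when exposure is dropped from the binary component, and no estimated exposure coefficient in either offset model, although that reading is an assumption.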
Additional simulation studies were conducted by increasing or decreasing the values of and . For the scenario when , the results are presented in Figures S1 and S2 in the web supplementary materials, which indicate that the results for the binary component remain roughly consistent with the results from the previous simulation setting when and . However, for the count component, the performances of ZIP-O and ZIP-O become worse with increased bias and much lower CPs as compared to Figures 5 and 6, since models ZIP-O and ZIP-O constrain and deviates more from one than . As a comparison, when and , as shown in Figures S3 and S4, the parameter estimates of the regression coefficients in the binary component are less biased and CPs are closer to the nominal level compared to the previous setting, since is less deviated from 1 as compared to . The results of the model fit for these additional simulation studies are presented in Tables S1 and S2 and are consistent with the previous simulation setting, showing that model ZIP-W outperforms the other competing models.
In summary, the results based on the model ZIP-W suggest that failing to model the effect of varying exposure in the binary component of the ZIP model can bias the estimation of the covariate effects in the binary component. Such bias becomes more severe as the effect of varying exposure increases. The results based on the ZIP-O and ZIP-O models indicate that incorporating the varying exposure as an offset term can lead to biased and inefficient parameter estimates in both the binary and count components of the ZIP model. The simulation results confirm that the degree of bias and variation of the estimated regression coefficients depends on the effect of the exposure variable.
In another set of simulations, we simulated data from the ZIP-O model. Our results (Figures S5 and S6) indicate that both the ZIP-W and ZIP-W models provide parameter estimates with negligible bias, low MSE and coverage probabilities reasonably close to the nominal 95% level. Only ZIP-O yielded biased estimates for the binary component but not for the count component, since it imposes an unreasonable assumption on the effect of exposure in the binary component. Overall, it appears that overparameterization, that is, estimating the exposure coefficients rather than restricting them to equal one as specified by the offset terms in the binary and count components, has negligible effects on inference.
We also ran another set of simulations to ascertain whether the percentage of zeros is an important feature in determining the impact of misspecifying the effect of varying exposure. We generated data from the ZIP-W model with three different percentages of zeros. Our simulation results are comparable to the results presented earlier.
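To see how the percentage of zeros in a ZIP sample is controlled, recall that a zero arises either from the point mass (with probability π) or from the Poisson part (with probability e^(−μ)), so P(Y = 0) = π + (1 − π) e^(−μ). The Python sketch below checks this by simulation. The values of π and μ are hypothetical and are not the settings used in the paper; the Poisson sampler is Knuth's simple multiplication method, chosen only to keep the example free of external libraries.

```python
import math
import random


def zip_zero_probability(pi, mu):
    """P(Y = 0) under a ZIP model: structural zeros plus Poisson zeros."""
    return pi + (1.0 - pi) * math.exp(-mu)


def draw_poisson(mu, rng):
    """Knuth's multiplication method; adequate for small mu."""
    limit = math.exp(-mu)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def draw_zip(pi, mu, rng):
    """Draw one ZIP observation: a structural zero with probability pi."""
    return 0 if rng.random() < pi else draw_poisson(mu, rng)


rng = random.Random(1)
pi, mu = 0.3, 2.0  # hypothetical zero-inflation probability and Poisson mean
sample = [draw_zip(pi, mu, rng) for _ in range(20000)]
observed_zero_share = sample.count(0) / len(sample)
expected_zero_share = zip_zero_probability(pi, mu)
print(observed_zero_share, expected_zero_share)
```

Raising π or lowering μ increases the share of zeros, which is one way a simulation design could target different zero percentages.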
6. Conclusion and future work
In this study, we reviewed zero-inflated regression models with a focus on investigating the extent to which misspecifying the effect of the underlying population at risk affects the estimation of the regression coefficients and the overall fit of the zero-inflated model. We showed that including an offset term could be very restrictive in the sense that it forces the effect of exposure to equal one, which can be inconsistent with the data, as shown in our motivating example. Therefore, we formulated and developed a framework for understanding the nature of zero-inflated models by allowing the extent of exposure to be included as a regular covariate in both the binary and count components of the ZI model.
The evidence provided in this paper serves as a warning not to make strong assumptions about the effect of exposure, such as those embodied in using an offset in a Poisson model. It is wise at least to perform a sensitivity check by estimating the effect of the varying exposure. In addition, the probability of an excessive zero may depend on the population at risk. The relationship between exposure and the probability of the excessive zero component therefore needs to be carefully assessed and properly incorporated in the model.
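The contrast between the two specifications can be written side by side. The notation here is assumed for illustration and may differ from the symbols used elsewhere in the paper: $t_i$ denotes the exposure, $\mu_i$ the Poisson mean, $\pi_i$ the probability of an excessive zero (with a logit link assumed), and $\beta_t$, $\gamma_t$ the exposure coefficients in the count and binary components.

```latex
\begin{align*}
\text{Offset (count component only):}\quad
  & \log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \log t_i,
  && \operatorname{logit}(\pi_i) = \mathbf{z}_i^{\top}\boldsymbol{\gamma}, \\
\text{Covariate (both components):}\quad
  & \log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \beta_t \log t_i,
  && \operatorname{logit}(\pi_i) = \mathbf{z}_i^{\top}\boldsymbol{\gamma} + \gamma_t \log t_i.
\end{align*}
```

In this notation the offset model is the special case $\beta_t = 1$ and $\gamma_t = 0$, so a sensitivity check amounts to estimating $\beta_t$ and $\gamma_t$ and examining whether the data are compatible with those fixed values.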
In our motivating example, the ZIP model with varying exposure included as a covariate in both the binary and count components fits this particular data set well. However, in some situations, after accounting for zero-inflation and adjusting for the effects of the covariates and varying exposure, the data may still suggest additional overdispersion. The proposed modeling approach could then be applied to other zero-inflated models that account for additional overdispersion, such as the zero-inflated negative binomial model and the zero-inflated generalized Poisson model [12,33]. A score test could be conducted to help determine whether a more complex model is appropriate, without fitting the more complex model [33].
In addition, including varying exposure as a covariate in both the binary and count components of the ZI model increases the number of parameters to be estimated. When many covariates are involved, variable selection needs to be conducted to address the potential overparameterization problem, especially when the sample size is small. Traditional variable selection procedures, such as automated variable selection methods, may result in models that are unstable and not reproducible [3]. Penalized regression methods, which keep all the variables in the model but constrain the regression coefficients by shrinking them toward zero, are popular for variable selection. A variety of penalty functions can be considered, such as the Least Absolute Shrinkage and Selection Operator (LASSO) [29], the Smoothly Clipped Absolute Deviation penalty (SCAD) [15], and the minimax concave penalty (MCP) [37]. Penalized regression methods have been extended for selecting parsimonious zero-inflated models [9,30,31,36]. Future studies will be conducted to evaluate the performance of the proposed modeling strategy with these variable selection methods under different ratios of the number of candidate covariates to the sample size.
Note that zero-inflated and hurdle models have been extended to model longitudinal or clustered count measures with excess zeros by linking the binary and count components using a shared subject-specific random effect term or a bivariate normal distribution. The linkage of the model components allows for dependence between the binary and count components of the model [24,25]. As a result, the binary and count processes will not act independently; that is, model misspecification for one component may affect the other component through the shared random effect terms. Future work will be conducted to investigate the impact of misspecification of the exposure effect on such correlated random effects models.
In our simulations and empirical studies, we considered the linear effect of the log of the exposure variable. However, more flexible modeling of exposure-outcome associations can be applied to avoid constraining the functional form of this relationship a priori to a particular parametric family of functions, such as the conventionally used linear functions.
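For readers less familiar with the three penalties named above, the following Python sketch writes out their standard forms for a single coefficient. It is only an illustration of the penalty functions themselves, not of any fitting algorithm for zero-inflated models. The SCAD default a = 3.7 is the value suggested by Fan and Li [15]; the MCP default of 3.0 for its concavity parameter is an arbitrary illustrative choice.

```python
def lasso_penalty(beta, lam):
    """LASSO: grows linearly in |beta| without bound."""
    return lam * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD: linear near zero, quadratic transition, then constant."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp_penalty(beta, lam, gamma=3.0):
    """MCP: penalization rate tapers smoothly to zero at |beta| = gamma * lam."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2 * gamma)
    return gamma * lam * lam / 2


# Compare the penalties for a small and a large coefficient (lam = 1).
for beta in (0.5, 5.0):
    print(beta, lasso_penalty(beta, 1.0), scad_penalty(beta, 1.0), mcp_penalty(beta, 1.0))
```

The comparison shows the design difference: LASSO keeps penalizing large coefficients in proportion to their size, whereas SCAD and MCP level off, which reduces the shrinkage bias for large effects.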
Acknowledgments
This research was supported by the discovery grant from the Natural Sciences and Engineering Research Council of Canada. The author is also grateful to the Editor, Associate Editor, and two anonymous referees for their very valuable and constructive comments, which greatly helped to improve the quality of this paper.
Funding Statement
This research was supported by the discovery grant from the Canadian Network for Research and Innovation in Machining Technology, Natural Sciences and Engineering Research Council of Canada [RGPIN 07212-2019].
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Agresti A., Foundations of Linear and Generalized Linear Models, Wiley, New York, 2015.
- 2. Akaike H., Information theory as an extension of the maximum likelihood principle, in Second International Symposium on Information Theory, B.V. Petrov and B.F. Csaki, eds., Academiai Kiado, Budapest, 1973.
- 3. Austin P. and Tu J., Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality, J. Clin. Epidemiol. 57 (2004), pp. 1138–1146. doi: 10.1016/j.jclinepi.2004.04.003
- 4. Baetschmann G. and Winkelmann R., Modeling zero-inflated count data when exposure varies: With an application to tumor counts, Biom. J. 55 (2013), pp. 679–686. doi: 10.1002/bimj.201200021
- 5. Begg M.D. and Lagakos S., Loss in efficiency caused by omitting covariates and misspecifying exposure in logistic regression models, J. Am. Stat. Assoc. 88 (1993), pp. 166–170.
- 6. Blangiardo M. and Cameletti M., Spatial and Spatio-temporal Bayesian Models with R-INLA, Wiley, New York, 2015.
- 7. Bohning D., Dietz E., Schlattmann P., Mendonca L., and Kirchner U., The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology, J. R. Stat. Soc. Ser. A 162 (1999), pp. 195–209. doi: 10.1111/1467-985X.00130
- 8. Brooks M.E., Kristensen K., van Benthem K.J., Magnusson A., Berg C.W., Nielsen A., Skaug H.J., Maechler M., and Bolker B.M., glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling, R J. 9 (2017), pp. 378–400. Available at https://journal.r-project.org/archive/2017/RJ-2017-066/index.html. doi: 10.32614/RJ-2017-066
- 9. Buu A., Johnson N.J., Li R., and Tan X., New variable selection methods for zero-inflated count data with applications to the substance abuse field, Stat. Med. 30 (2011), pp. 2326–2340. doi: 10.1002/sim.4268
- 10. Buu A., Li R., Tan X., and Zucker R., Statistical models for longitudinal zero-inflated count data with applications to the substance abuse field, Stat. Med. 31 (2012), pp. 4074–4086. doi: 10.1002/sim.5510
- 11. Cheung Y., Zero-inflated models for regression analysis of count data: A study of growth and development, Stat. Med. 21 (2002), pp. 1461–1469. doi: 10.1002/sim.1088
- 12. Czado C., Erhardt V., Min A., and Wagner S., Zero-inflated generalized Poisson models with regression effects on the mean, dispersion and zero-inflation level applied to patent outsourcing rates, Stat. Modelling 7 (2007), pp. 125–153. doi: 10.1177/1471082X0700700202
- 13. Dai L., Sweat M.D., and Gebregziabher M., Modeling excess zeros and heterogeneity in count data from a complex survey design with application to the demographic health survey in sub-Saharan Africa, Stat. Methods Med. Res. 27 (2018), pp. 208–220. doi: 10.1177/0962280215626608
- 14. Dunn P.K. and Smyth G.K., Randomized quantile residuals, J. Comput. Graph. Stat. 5 (1996), pp. 236–244.
- 15. Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001), pp. 1348–1360. doi: 10.1198/016214501753382273
- 16. Hall D.B., Zero-inflated Poisson and binomial regression with random effects: A case study, Biometrics 56 (2000), pp. 1030–1039. doi: 10.1111/j.0006-341X.2000.01030.x
- 17. Heilbron D.C., Zero-altered and other regression models for count data with added zeros, Biom. J. 36 (1994), pp. 531–547. doi: 10.1002/bimj.4710360505
- 18. Kristensen K., Nielsen A., Berg C.W., Skaug H., and Bell B.M., TMB: Automatic differentiation and Laplace approximation, J. Stat. Softw. 70 (2016), pp. 1–21. doi: 10.18637/jss.v070.i05
- 19. Lambert D., Zero-inflated Poisson regression with an application to defects in manufacturing, Technometrics 34 (1992), pp. 1–14. doi: 10.2307/1269547
- 20. Lee A.H., Wang K., and Yau K.K., Analysis of zero-inflated Poisson data incorporating extent of exposure, Biom. J. 43 (2001), pp. 963–975.
- 21. Loquiha O., Hens N., Chavane L., Temmerman M., Osman N., Faes C., and Aerts M., Mapping maternal mortality rate via spatial zero-inflated models for count data: A case study of facility-based maternal deaths from Mozambique, PLoS ONE 13 (2018), e0202186. doi: 10.1371/journal.pone.0202186
- 22. Min Y. and Agresti A., Random effect models for repeated measures of zero-inflated count data, Stat. Modelling 5 (2005), pp. 1–19. doi: 10.1191/1471082X05st084oa
- 23. Mullahy J., Specification and testing of some modified count data models, J. Econom. 33 (1986), pp. 341–365. doi: 10.1016/0304-4076(86)90002-3
- 24. Neelon B., Ghosh P., and Loebs P., A spatial Poisson hurdle model for exploring geographic variation in emergency department visits, J. R. Stat. Soc. Ser. A 176 (2013), pp. 389–413. doi: 10.1111/j.1467-985X.2012.01039.x
- 25. Neelon B., O'Malley A., and Normand S., A Bayesian model for repeated measures zero-inflated count data with application to outpatient psychiatric service use, Stat. Modelling 10 (2010), pp. 421–439. doi: 10.1177/1471082X0901000404
- 26. Petersen M.R. and Deddens J.A., Effects of omitting a covariate in Poisson models when the data are balanced, Canad. J. Statist. 28 (2000), pp. 439–445. doi: 10.2307/3315990
- 27. Rose C., Martin S., Wannemuehler K., and Plikaytis B., On the use of zero-inflated and hurdle models for modeling vaccine adverse event count data, J. Biopharm. Stat. 16 (2006), pp. 463–481. doi: 10.1080/10543400600719384
- 28. Schwarz G., Estimating the dimension of a model, Ann. Statist. 6 (1978), pp. 461–464. doi: 10.1214/aos/1176344136
- 29. Tibshirani R., Regression shrinkage and selection via the LASSO, J. R. Stat. Soc. 58 (1996), pp. 267–288.
- 30. Wang Z., Ma S., and Wang C., Variable selection for zero-inflated and overdispersed data with application to health care demand in Germany, Biom. J. 57 (2015), pp. 867–884. doi: 10.1002/bimj.201400143
- 31. Wang Z., Ma S., Wang C., Zappitelli M., Devarajan P., and Parikh C., EM for regularized zero-inflated regression models with applications to postoperative morbidity after cardiac surgery in children, Stat. Med. 33 (2014), pp. 5192–5208. doi: 10.1002/sim.6314
- 32. Xia Y., Sun J., and Chen D.G., Modeling zero-inflated microbiome data, in Statistical Analysis of Microbiome Data with R, Springer Singapore, Singapore, 2018, pp. 453–496.
- 33. Yang Z., Hardin J.W., and Addy C.L., Testing overdispersion in the zero-inflated Poisson model, J. Stat. Plan. Inference 139 (2009), pp. 3340–3353. doi: 10.1016/j.jspi.2009.03.016
- 34. Yau K. and Lee A., Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme, Stat. Med. 20 (2001), pp. 2907–2920. doi: 10.1002/sim.860
- 35. Zeileis A., Kleiber C., and Jackman S., Regression models for count data in R, J. Stat. Softw. 27 (2008), pp. 1–25.
- 36. Zeng P., Wei Y., Zhao Y., Liu J., Liu L., Zhang R., Gou J., Huang S., and Chen F., Variable selection approach for zero-inflated count data via adaptive lasso, J. Appl. Stat. 41 (2014), pp. 879–894. doi: 10.1080/02664763.2013.858672
- 37. Zhang C.H., Nearly unbiased variable selection under minimax concave penalty, Ann. Statist. 38 (2010), pp. 894–942. doi: 10.1214/09-AOS729
- 38. Zhen Z., Shao L., and Zhang L., Spatial hurdle models for predicting the number of children with lead poisoning, Int. J. Environ. Res. Public Health 15 (2018), p. 1792. doi: 10.3390/ijerph15091792