Multiply robust imputation procedures for zero-inflated distributions in surveys

Sixia Chen; David Haziza

doi:10.1007/s40300-017-0128-9

Author manuscript; available in PMC: 2018 Dec 1.

Published in final edited form as: Metron. 2017 Oct 11;75(3):333–343. doi: 10.1007/s40300-017-0128-9

PMCID: PMC5777636 NIHMSID: NIHMS908576 PMID: 29371744

Abstract

Item nonresponse in surveys is usually treated by some form of single imputation. In practice, the survey variable subject to missing values may exhibit a large number of zero-valued observations. In this paper, we propose multiply robust imputation procedures for treating this type of variable. Our procedures may be based on multiple imputation models and/or multiple nonresponse models. An imputation procedure is said to be multiply robust if the resulting estimator is consistent when all models but one are misspecified. The variance of the imputed estimators is estimated through a generalized jackknife variance estimation procedure. Results from a simulation study suggest that the proposed procedures perform well in terms of bias, efficiency and coverage rate.

Keywords: Double robustness, Item nonresponse, Imputation, Multiple robustness, Variance estimation, Zero-valued observations

1 Introduction

Item nonresponse occurs in surveys when some, but not all, survey variables are missing. Estimators based on the complete cases are generally biased unless the data are missing completely at random (Rubin, 1976). Reducing the potential bias due to nonresponse requires the availability of fully observed variables that are recorded for both the responding and nonresponding units. For more information about different approaches for handling missing values, see Little and Rubin (2002) and Kim and Shao (2013), among others. Item nonresponse is usually treated by some form of single imputation, whereby a missing value is replaced by a single plausible imputed value constructed on the basis of fully observed variables. With imputation, a complete data file is produced, allowing data users to obtain point estimates through complete data estimation procedures. A number of single imputation procedures are used in practice, e.g., mean, ratio and regression imputation, nearest-neighbour imputation and random hot-deck imputation; see Haziza (2009) for a detailed description of commonly used imputation procedures.

In practice, the survey variable subject to missingness may be prone to a large number of zero-valued observations. This situation arises, for example, in the context of the Industrial Consumption of Energy survey (ICE) conducted by Statistics Canada. ICE provides estimates of energy consumption by manufacturing establishments in Canada. It collects data on the consumption of various energy commodities such as electricity, natural gas, propane, diesel, wood and steam. For variables such as propane consumption, the data contain a large number of zero-valued observations as many Canadian establishments do not use propane. Imputing this type of survey variable requires a zero-inflated imputation model to reflect the mixture of zero-valued and nonzero-valued observations. The imputation model is a set of assumptions about the survey variable requiring imputation. Most often, the imputed values are derived from the assumed imputation model. However, being based on a single model, the resulting point estimators are vulnerable to misspecification of the imputation model. Protection against misspecification of the model may be achieved by assuming a nonresponse model, which is a set of assumptions about the unknown nonresponse mechanism. The imputed values are then derived using both models. The resulting estimator is doubly robust in the sense that it remains consistent if either the imputation model or the nonresponse model is correctly specified; see Haziza and Rao (2006), Kim and Park (2006) and Kim and Haziza (2014), among others. Haziza et al. (2014) proposed doubly robust imputation procedures for finite population means in the presence of a large number of zero-valued observations. Gelein (2016) extended the results of Haziza et al. (2014) to the case of random regression imputation and considered the estimation of the finite population distribution function. Although doubly robust methods provide some protection, some empirical investigations have shown that they tend to perform poorly in terms of bias and efficiency if both models are misspecified; e.g., Kang and Schafer (2007), Han (2014b) and Chen and Haziza (2017a).

In recent years, multiply robust estimation procedures have attracted some attention; see Han and Wang (2013), Han (2014a, 2014b), Chan and Yam (2014) and Chen and Haziza (2017a, b, c), among others. The concept of multiple robustness can be viewed as an extension of double robustness. Unlike doubly robust procedures, multiply robust procedures make use of multiple nonresponse models and/or multiple imputation models. An estimation procedure is multiply robust if the resulting estimator is consistent when all models but one are misspecified. In the context of single imputation, Chen and Haziza (2017a) proposed multiply robust imputation procedures for complex survey data. Multiply robust multiple imputation procedures were studied in Chen and Haziza (2017c). When all the models are misspecified, a number of empirical investigations available in the literature suggest that multiply robust estimation procedures perform much better than doubly robust procedures based on the same working models, in terms of bias and efficiency; see, e.g., Han (2014b) and Chen and Haziza (2017a). In this paper, we extend the approach of Chen and Haziza (2017a) to the case of survey variables exhibiting a large number of zero-valued observations and propose multiply robust imputation procedures for treating this type of variable.

Our paper is organized as follows. The notation and the basic setup are introduced in Section 2. In Section 3, we describe two multiply robust imputation procedures tailored for survey variables subject to missingness and exhibiting a large number of zeros. Jackknife variance estimation is discussed in Section 4. The results of a simulation study, comparing several estimation procedures in terms of bias, efficiency and coverage rate, are presented in Section 5. Section 6 concludes the article with some discussion.

2 Basic setup

Consider a finite population ℘ of size N. We are interested in estimating the finite population mean Ȳ = t_y/N, where t_y = Σ_{i∈℘} y_i denotes the finite population total corresponding to the survey variable y. A sample S is selected from ℘ according to a given sampling design p(S) with first-order inclusion probabilities π_i and second-order inclusion probabilities π_ij, i ≠ j. A complete data estimator of Ȳ is the Hájek estimator (Hájek, 1971):

\hat{\bar{Y}} = \frac{1}{\hat{N}} \sum_{i \in S} w_{i} y_{i},

(1)

where $w_{i} = π_{i}^{- 1}$ is the design (or basic) weight attached to unit i and N̂ = Σ_{i∈S} w_i is a design-unbiased estimator of the population size N. When the y-variable is subject to missing values, the complete data estimator (1) cannot be computed.
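As a minimal sketch of the complete data estimator (1), the following Python function computes the Hájek estimator from a list of observed values and inclusion probabilities. The function name `hajek_mean` and the toy values are illustrative assumptions, not part of the paper.

```python
def hajek_mean(y, pi):
    """Hajek estimator (1): weighted mean with design weights w_i = 1 / pi_i."""
    w = [1.0 / p for p in pi]
    n_hat = sum(w)  # N-hat, the estimated population size
    return sum(wi * yi for wi, yi in zip(w, y)) / n_hat


# Hypothetical sample: values and inclusion probabilities are made up.
y_sample = [0.0, 12.0, 0.0, 30.0]
pi_sample = [0.1, 0.2, 0.1, 0.5]
y_bar_hat = hajek_mean(y_sample, pi_sample)
```

With these made-up inputs the weights are 10, 5, 10 and 2, so the estimate is the weighted total 120 divided by N̂ = 27.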

We assume the y-variable obeys the following zero-inflated model:

y_{i} = δ_{i} \{m (x_{i}; β) + ε_{i}\} + (1 - δ_{i}) \times 0,

(2)

where m(·) is an unknown function, δ_i is a zero/non-zero indicator such that δ_i = 1 if y_i ≠ 0 and δ_i = 0 otherwise, x_i is a vector of fully observed variables associated with unit i and β is a vector of unknown parameters. We assume that δ_i is independent of ε_i after accounting for x_i. We also assume E(ε_i | x_i) = 0, E(ε_iε_j | x_i) = 0 for i ≠ j and V (ε_i | x_i) = σ², where σ² is an unknown parameter. Further, we assume that δ_i follows a Bernoulli distribution with unknown probability function

Pr (δ_{i} = 1 ∣ x_{i}, y_{i}) = Pr (δ_{i} = 1 ∣ x_{i}) \equiv q (x_{i}; γ),

(3)

where q(x_i; .) is an unknown function with parameter γ. Combining (2) and (3), we have E(y_i | x_i) = q(x_i; γ)m(x_i; β) ≡ M(x_i; β, γ). Model (2) is called an imputation model. Note that for the imputation model to be correctly specified, both m(x; β) and q(x; γ) must be correctly specified.
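The step from (2) and (3) to E(y_i | x_i) = q(x_i; γ)m(x_i; β) can be spelled out as follows. This is a short derivation sketch that uses only the stated assumptions: δ_i is independent of ε_i given x_i, and E(ε_i | x_i) = 0.

```latex
\begin{aligned}
E (y_{i} \mid x_{i})
  &= E [δ_{i} \{m (x_{i}; β) + ε_{i}\} \mid x_{i}] \\
  &= m (x_{i}; β) E (δ_{i} \mid x_{i}) + E (δ_{i} \mid x_{i}) E (ε_{i} \mid x_{i}) \\
  &= q (x_{i}; γ) m (x_{i}; β) .
\end{aligned}
```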

Let r_i be the response indicator such that r_i = 1 if y_i is observed and r_i = 0 if y_i is missing. We make the Missing At Random (MAR) assumption (Rubin, 1976); i.e.,

Pr (r_{i} = 1 ∣ x_{i}, y_{i}) = Pr (r_{i} = 1 ∣ x_{i}) \equiv p (x_{i}; α),

(4)

where p(x_i; .) is an unknown function with parameter α. Model (4) is called a nonresponse model.

If one places a complete reliance on the nonresponse model (4), a natural estimator of Ȳ is the propensity score weighted estimator

{\hat{\bar{Y}}}_{P S} = \frac{1}{\hat{N}} \sum_{i \in S} w_{i} \frac{r_{i}}{p (x_{i}; \hat{α})} y_{i},

(5)

where α̂ is a consistent estimator of α.

If one places a complete reliance on the imputation model (2), a natural estimator of Ȳ is the imputed estimator

{\hat{\bar{Y}}}_{I} = \frac{1}{\hat{N}} \left\{\sum_{i \in S} w_{i} r_{i} y_{i} + \sum_{i \in S} w_{i} (1 - r_{i}) M (x_{i}; \hat{β}, \hat{γ})\right\},

(6)

where γ̂ and β̂ denote consistent estimators of γ and β, respectively.

Alternatively, one may combine both the imputation model (2) and the nonresponse model (4) to obtain the doubly robust estimator

{\hat{\bar{Y}}}_{D R} = \frac{1}{\hat{N}} \left[\sum_{i \in S} w_{i} \frac{r_{i}}{p (x_{i}; \hat{α})} y_{i} + \sum_{i \in S} w_{i} \left\{1 - \frac{r_{i}}{p (x_{i}; \hat{α})}\right\} M (x_{i}; \hat{β}, \hat{γ})\right] .

(7)

Because V (ε_i | x_i) = σ²λ^⊤x_i, the estimator ${\hat{\bar{Y}}}_{D R}$ can be written as a weighted sum of observed and imputed values; see, e.g., Haziza and Rao (2006) and Haziza et al. (2014).
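To make the three estimators (5), (6) and (7) concrete, here is a small Python sketch that evaluates each of them on a toy sample. The function names, the toy data and the convention of storing a placeholder value for a missing y_i (it is always multiplied by r_i = 0) are assumptions made for illustration; the fitted values p(x_i; α̂) and M(x_i; β̂, γ̂) are taken as given rather than estimated.

```python
def ps_estimator(y, w, r, p_hat):
    """Propensity score weighted estimator (5)."""
    n_hat = sum(w)
    total = sum(wi * ri / pi * yi for yi, wi, ri, pi in zip(y, w, r, p_hat))
    return total / n_hat


def imputed_estimator(y, w, r, big_m):
    """Imputed estimator (6): observed values plus model predictions."""
    n_hat = sum(w)
    total = sum(wi * (ri * yi + (1 - ri) * mi)
                for yi, wi, ri, mi in zip(y, w, r, big_m))
    return total / n_hat


def dr_estimator(y, w, r, p_hat, big_m):
    """Doubly robust estimator (7)."""
    n_hat = sum(w)
    total = sum(wi * (ri / pi * yi + (1 - ri / pi) * mi)
                for yi, wi, ri, pi, mi in zip(y, w, r, p_hat, big_m))
    return total / n_hat


# Toy data (made up). The third y-value is a placeholder for a missing value.
y = [10.0, 0.0, 0.0, 20.0]
w = [2.0, 2.0, 2.0, 2.0]
r = [1, 1, 0, 1]
p_hat = [0.5, 0.8, 0.5, 1.0]      # assumed fitted response probabilities
big_m = [9.0, 1.0, 5.0, 18.0]     # assumed fitted values of M(x; beta, gamma)

est_ps = ps_estimator(y, w, r, p_hat)
est_i = imputed_estimator(y, w, r, big_m)
est_dr = dr_estimator(y, w, r, p_hat, big_m)
```

The doubly robust form only adds a correction term to the propensity score weighted estimator, which is why it needs both sets of fitted values.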

In the next section, we introduce multiply robust imputation procedures that allow for multiple models for p(x; α) and/or m(x; β) and/or q(x; γ), leading to an imputed estimator that remains consistent if at least one of the imputation models (which entails correctly specifying both m(x; β) and q(x; γ)) or nonresponse models is correctly specified.

3 Proposed method

Let 𝒞₁ = {p^j(x_i; α^j); j = 1, . . . , J} be a class of J nonresponse models, 𝒞₂ = {m^k(x_i; β^k); k = 1, . . . , K} be a class of K models for m(x; β) and 𝒞₃ = {q^l(x_i; γ^l); l = 1, . . . , L} be a class of L models for q(x; γ).

Estimators of γ^l are obtained by solving the following survey weighted estimating equations

S_{γ}^{l} (γ^{l}) = \sum_{i \in S} w_{i} r_{i} \frac{δ_{i} - q^{l} (x_{i}; γ^{l})}{q^{l} (x_{i}; γ^{l}) \{1 - q^{l} (x_{i}; γ^{l})\}} \frac{\partial q^{l} (x_{i}; γ^{l})}{\partial γ^{l}} = 0, \quad l = 1, \dots, L .

Define Û_qi = (q¹(x_i; γ̂¹), . . . , q^L(x_i; γ̂^L))^⊤. If one of the models in 𝒞₃ is correctly specified, then a consistent estimator of q(x_i; γ) is ${\hat{q}}_{i} = {\hat{U}}_{q i}^{⊤} \times ({\hat{η}}_{q}^{2} / {\hat{η}}_{q}^{⊤} {\hat{η}}_{q})$ , where

{\hat{η}}_{q} = {(\sum_{i \in S} w_{i} r_{i} {\hat{U}}_{q i} {\hat{U}}_{q i}^{⊤})}^{- 1} \sum_{i \in S} w_{i} r_{i} {\hat{U}}_{q i} δ_{i} .

Here and throughout the paper, if a = (a₁, · · ·, a_ℓ) is an ℓ-vector, a² denotes the column vector ( $a_{1}^{2}, \dots, a_{ℓ}^{2}$ ). Using the normalized scores rather than the customary prediction ${\hat{U}}_{q i}^{⊤} {\hat{η}}_{q}$ ensures that the weighted average lies in the appropriate range; see Duan and Yin (2017). Note that η̂_q is the weighted least squares estimator obtained by regressing δ_i on Û_qi based on the responding units. The prediction q̂_i can be viewed as a score that compresses the model information contained in the L models for q(x; γ). Due to the nature of the least squares procedure, q̂_i reproduces the true fitted probability if one of the models in 𝒞₃ is correctly specified. Suppose that the first model, q¹(x_i; γ¹), in 𝒞₃ is correctly specified. Then, it can be shown that q̂_i = q¹(x_i; γ̂¹) + O_p(n^{−1/2}); see Duan and Yin (2017).
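The compression step can be sketched in a few lines of Python for the special case of L = 2 candidate models. The code computes η̂_q by weighted least squares, turns it into the normalized weights η̂_q²/η̂_q^⊤η̂_q, and forms q̂_i as a convex combination of the two fitted probabilities. The helper names, the restriction to two models and the toy fitted values are assumptions for illustration only.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def compressed_score(u, target, weight):
    """Compress two sets of fitted values into one score.

    u      : list of pairs (fitted value from model 1, from model 2)
    target : regression response (delta_i for the q-score)
    weight : regression weight (w_i * r_i for the q-score)
    """
    a = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for ui, ti, wi in zip(u, target, weight):
        for s in range(2):
            b[s] += wi * ui[s] * ti
            for t in range(2):
                a[s][t] += wi * ui[s] * ui[t]
    eta = solve_2x2(a, b)                       # weighted least squares fit
    norm = eta[0] ** 2 + eta[1] ** 2            # eta' eta
    lam = [eta[0] ** 2 / norm, eta[1] ** 2 / norm]  # nonnegative, sums to 1
    score = [lam[0] * ui[0] + lam[1] * ui[1] for ui in u]
    return score, lam


# Toy fitted probabilities (made up): model 1 varies with x, model 2 is constant.
u_hat = [(0.2, 0.5), (0.7, 0.5), (0.9, 0.5), (0.4, 0.5)]
delta = [0, 1, 1, 0]
wr = [1.0, 1.0, 1.0, 1.0]   # w_i * r_i, all units responding in this toy case
q_hat, lam = compressed_score(u_hat, delta, wr)
```

Because the normalized weights are nonnegative and sum to one, each q̂_i lies between the smallest and largest candidate fitted probability for unit i, which is the range property mentioned in the text.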

The estimators α̂^j and β̂^k can be obtained by solving the following survey weighted estimating equations

S_{α}^{j} (α^{j}) = \sum_{i \in S} w_{i} \frac{r_{i} - p^{j} (x_{i}; α^{j})}{p^{j} (x_{i}; α^{j}) {1 - p^{j} (x_{i}; α^{j})}} \frac{\partial p^{j} (x_{i}; α^{j})}{\partial α^{j}} = 0,

and

S_{β}^{k} (β^{k}) = \sum_{i \in S} w_{i} r_{i} δ_{i} {y_{i} - m^{k} (x_{i}; β^{k})} \frac{\partial m^{k} (x_{i}; β^{k})}{\partial β^{k}} = 0.

Define

{\hat{U}}_{p i} = {(p^{1} (x_{i}; {\hat{α}}^{1}), \dots, p^{J} (x_{i}; {\hat{α}}^{J}))}^{⊤}, \quad {\hat{U}}_{m i} = {(m^{1} (x_{i}; {\hat{β}}^{1}), \dots, m^{K} (x_{i}; {\hat{β}}^{K}))}^{⊤} .

To summarize the information in the working models, we regress r_i on Û_pi and y_i on Û_mi, which leads to the weighted least squares regression coefficients

{\hat{η}}_{p} = {(\sum_{i \in S} w_{i} {\hat{U}}_{p i} {\hat{U}}_{p i}^{⊤})}^{- 1} \sum_{i \in S} w_{i} {\hat{U}}_{p i} r_{i}

and

{\hat{η}}_{m} = {(\sum_{i \in S} w_{i} r_{i} δ_{i} {\hat{U}}_{m i} {\hat{U}}_{m i}^{⊤})}^{- 1} \sum_{i \in S} w_{i} r_{i} δ_{i} {\hat{U}}_{m i} y_{i} .

Let ${\hat{p}}_{i} = {\hat{U}}_{p i}^{⊤} \times ({\hat{η}}_{p}^{2} / {\hat{η}}_{p}^{⊤} {\hat{η}}_{p})$ and ${\hat{m}}_{i} = {\hat{U}}_{m i}^{⊤} \times ({\hat{η}}_{m}^{2} / {\hat{η}}_{m}^{⊤} {\hat{η}}_{m})$ . As for q̂_i, the scores p̂_i and m̂_i compress respectively the information contained in the J nonresponse models and the K models for m(x; β). It can be shown that p̂_i is a consistent estimator of p(x_i; α) if one of the models in 𝒞₁ is correctly specified and m̂_i is a consistent estimator of m(x_i; β) if one of the models in 𝒞₂ is correctly specified.

Using the compressed scores q̂_i and m̂_i, we define M̂_i = q̂_i m̂_i. Following Chen and Haziza (2017a), we consider the imputed value

y_{i}^{*} = h_{i}^{⊤} \hat{τ}

(8)

for the missing y_i, where h_i = (1, M̂_i)^⊤ with

\hat{τ} = {\left\{\sum_{i \in S} w_{i} r_{i} ({\hat{p}}_{i}^{- 1} - 1) h_{i} h_{i}^{⊤}\right\}}^{- 1} \sum_{i \in S} w_{i} r_{i} ({\hat{p}}_{i}^{- 1} - 1) h_{i} y_{i} .

The resulting imputed estimator is given by

{\hat{\bar{Y}}}_{M R} = \frac{1}{\hat{N}} \left\{\sum_{i \in S} w_{i} r_{i} y_{i} + \sum_{i \in S} w_{i} (1 - r_{i}) h_{i}^{⊤} \hat{τ}\right\} .

(9)

The imputation procedure (8) is referred to as a deterministic multiply robust imputation procedure. Since the first component of h_i is 1, we have

\sum_{i \in S} r_{i} w_{i} ({\hat{p}}_{i}^{- 1} - 1) (y_{i} - h_{i}^{⊤} \hat{τ}) = 0

and (9) can be rewritten as

{\hat{\bar{Y}}}_{M R} = \frac{1}{\hat{N}} \left\{\sum_{i \in S} w_{i} \frac{r_{i}}{{\hat{p}}_{i}} y_{i} + \sum_{i \in S} w_{i} \left(1 - \frac{r_{i}}{{\hat{p}}_{i}}\right) h_{i}^{⊤} \hat{τ}\right\} .

Using similar techniques to those used in Chen and Haziza (2017a), it can be shown that ${\hat{\bar{Y}}}_{M R}$ is multiply robust.
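The following Python sketch illustrates the deterministic procedure on toy data: it computes τ̂ from the respondents with regression weights w_i(p̂_i⁻¹ − 1), forms the imputed values h_i^⊤τ̂, and checks numerically that the imputed estimator (9) coincides with its inverse-probability-weighted rewriting, with the design weights w_i retained in both sums. The compressed scores p̂_i and M̂_i are taken as given, and all names and numbers are hypothetical.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def mr_predictions(y, w, r, p_hat, m_hat):
    """Return h_i' tau-hat for every unit, with h_i = (1, M-hat_i)."""
    a = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for yi, wi, ri, pi, mi in zip(y, w, r, p_hat, m_hat):
        if ri == 0:
            continue                      # only respondents enter tau-hat
        c = wi * (1.0 / pi - 1.0)         # regression weight w_i (1/p_i - 1)
        h = (1.0, mi)
        for s in range(2):
            b[s] += c * h[s] * yi
            for t in range(2):
                a[s][t] += c * h[s] * h[t]
    tau = solve_2x2(a, b)
    return [tau[0] + tau[1] * mi for mi in m_hat]


# Toy data (made up); y is a placeholder wherever r_i = 0.
y = [0.0, 12.0, 0.0, 25.0, 0.0, 8.0]
r = [1, 1, 0, 1, 0, 1]
w = [5.0, 5.0, 10.0, 10.0, 5.0, 5.0]
p_hat = [0.6, 0.7, 0.5, 0.8, 0.6, 0.9]    # assumed compressed scores p-hat_i
m_hat = [3.0, 10.0, 6.0, 20.0, 4.0, 9.0]  # assumed scores M-hat_i = q-hat_i * m-hat_i

pred = mr_predictions(y, w, r, p_hat, m_hat)
n_hat = sum(w)

# Form (9): observed values for respondents, imputed values otherwise.
est_imputed = sum(wi * (ri * yi + (1 - ri) * fi)
                  for yi, wi, ri, fi in zip(y, w, r, pred)) / n_hat

# Rewritten form: inverse-probability weighting plus a correction term.
est_ipw_form = sum(wi * (ri / pi * yi + (1 - ri / pi) * fi)
                   for yi, wi, ri, pi, fi in zip(y, w, r, p_hat, pred)) / n_hat

# First normal equation, which holds because h_i contains an intercept.
residual_sum = sum(ri * wi * (1.0 / pi - 1.0) * (yi - fi)
                   for yi, wi, ri, pi, fi in zip(y, w, r, p_hat, pred))
```

The two forms agree up to floating-point error because their difference is exactly `residual_sum / n_hat`, which the intercept in h_i forces to zero.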

To better reflect the mixture of zero-valued and nonzero-valued observations, we consider the following random version of (8):

y_{i}^{*} = \begin{cases} {\hat{q}}_{i}^{- 1} h_{i}^{⊤} \hat{τ} & \text{with probability } {\hat{q}}_{i}, \\ 0 & \text{with probability } 1 - {\hat{q}}_{i} . \end{cases}

(10)

The imputation procedure (10) is referred to as a random multiply robust imputation procedure. We denote by ${\hat{\bar{Y}}}_{M R}^{*}$ the resulting imputed estimator. Let E_I denote the expectation with respect to the random imputation mechanism. Noting that $E_{I} (y_{i}^{*}) = h_{i}^{⊤} \hat{τ}$ , it follows that the estimator $E_{I} ({\hat{\bar{Y}}}_{M R}^{*}) = {\hat{\bar{Y}}}_{M R}$ , where ${\hat{\bar{Y}}}_{M R}$ is given by (9). Therefore, ${\hat{\bar{Y}}}_{M R}^{*}$ is also multiply robust.
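A minimal Python sketch of the random procedure (10) is given below, together with the exact mean and variance of an imputed value under the imputation mechanism. The function names and the numerical inputs are hypothetical; `pred` stands for h_i^⊤τ̂ and `q_hat` for q̂_i, both assumed already computed.

```python
import random


def random_mr_impute(pred, q_hat, rng):
    """Draw y* as in (10): pred / q_hat with probability q_hat, otherwise 0."""
    return pred / q_hat if rng.random() < q_hat else 0.0


def imputation_moments(pred, q_hat):
    """Exact mean and variance of y* with respect to the random imputation."""
    nonzero = pred / q_hat
    mean = q_hat * nonzero                       # equals pred
    var = q_hat * nonzero ** 2 - mean ** 2       # equals (1 / q_hat - 1) * pred ** 2
    return mean, var


rng = random.Random(2017)                        # seeded for reproducibility
draws = [random_mr_impute(6.0, 0.4, rng) for _ in range(1000)]
mean_exact, var_exact = imputation_moments(6.0, 0.4)
```

Each draw is either 0 or the inflated value h_i^⊤τ̂/q̂_i, so the imputed data keep a mixture of zeros and nonzero values while the expected imputed value stays equal to h_i^⊤τ̂.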

4 Variance estimation

It is well known that treating the imputed values as observed leads to an underestimation of the variance of the imputed estimators and ultimately to confidence intervals that are too narrow. In the case of negligible sampling fractions, there exist several variance estimation procedures taking nonresponse and imputation into account; see, e.g., Rao and Shao (1992) and Shao and Sitter (1996). In this section, we consider the generalized jackknife procedure of Berger (2007) for estimating the variance of ${\hat{\bar{Y}}}_{M R}$ and ${\hat{\bar{Y}}}_{M R}^{*}$ . In the complete data situation, Berger (2007) showed that his jackknife procedure leads to a consistent variance estimator provided that the sampling design has a high entropy and that the sampling fraction n/N is negligible. Examples of high entropy sampling designs include conditional Poisson sampling design (Hájek, 1964), the Rao-Sampford design (Rao, 1965; Sampford, 1967), randomized systematic sampling (Tillé, 2006, Section 7.2) and sequential Poisson sampling (Ohlsson, 1998).

We start by describing the jackknife procedure for ${\hat{\bar{Y}}}_{M R}$ . Let w_{i(j)} be the jackknife weights such that w_{i(j)} = n(n − 1)⁻¹w_i if i ≠ j and w_{i(j)} = 0 if i = j. Let ${\hat{\bar{Y}}}_{M R (j)}$ be the version of ${\hat{\bar{Y}}}_{M R}$ computed from the data with the j-th unit excluded,

{\hat{\bar{Y}}}_{M R (j)} = \frac{1}{{\hat{N}}_{(j)}} \left\{\sum_{i \in S} w_{i (j)} r_{i} y_{i} + \sum_{i \in S} w_{i (j)} (1 - r_{i}) y_{i (j)}^{*}\right\},

where N̂_{(j)} = Σ_{i∈S} w_{i(j)}, $y_{i (j)}^{*} = h_{i (j)}^{⊤} {\hat{τ}}_{(j)}$ with $h_{i (j)}^{⊤}$ and τ̂_{(j)} computed in the same way as $h_{i}^{⊤}$ and τ̂ in Section 3 but with the jackknife weights w_{i(j)} instead of the original weights w_i. A jackknife variance estimator is given by

{\hat{V}}_{J} = \frac{n}{n - 1} \sum_{i \in S} (1 - π_{i}) {(u_{i} - \sum_{k \in S} ψ_{k} u_{k})}^{2},

(11)

where $u_{i} = (1 - w_{i}) ({\hat{\bar{Y}}}_{M R} - {\hat{\bar{Y}}}_{M R (i)})$ and ψ_i = c_i/Σ_{k∈S} c_k with c_i = n/(n − 1)(1 − π_i). The consistency of V̂_J can be established through the reverse approach of Shao and Steel (1999); see also Fay (1991) and Kim and Rao (2009). It is worth noting that V̂_J remains consistent for the true variance of ${\hat{\bar{Y}}}_{M R}$ even if all the models in 𝒞₁, 𝒞₂ and 𝒞₃ are misspecified. A (1 − ζ%) confidence interval for Ȳ is

{\hat{\bar{Y}}}_{M R} \pm z_{ζ / 2} \sqrt{{\hat{V}}_{J}},

(12)

where z_{ζ/2} is the upper (1−ζ/2%) critical value for the standard normal distribution. The confidence interval (12) is multiply robust in the sense that its coverage rate is close to the nominal rate if all models but one are misspecified.
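The delete-one step of the jackknife can be sketched as follows. The code builds the jackknife weights (n/(n − 1) times w_i for retained units, zero for the deleted unit) and recomputes an estimator for each deleted unit. For simplicity the estimator plugged in here is a plain weighted mean in the complete data case; for the multiply robust estimator, every fitted quantity (the scores, τ̂ and the imputed values) would have to be recomputed with the jackknife weights, which this sketch does not do. All names and data are illustrative assumptions.

```python
def jackknife_weights(w, j):
    """Weights with unit j deleted: n/(n-1) * w_i for i != j, and 0 for i == j."""
    n = len(w)
    return [0.0 if i == j else n / (n - 1) * wi for i, wi in enumerate(w)]


def weighted_mean(y, w):
    """Hajek-type weighted mean, used here as a stand-in estimator."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def delete_one_estimates(y, w, estimator):
    """Recompute the estimator once for each deleted unit."""
    return [estimator(y, jackknife_weights(w, j)) for j in range(len(w))]


# Toy complete data (made up).
y = [1.0, 2.0, 3.0, 4.0]
w = [2.0, 2.0, 2.0, 2.0]
replicates = delete_one_estimates(y, w, weighted_mean)
```

With equal weights, deleting the first unit leaves the mean of 2, 3 and 4, and so on; the rescaling factor n/(n − 1) cancels in a ratio-type estimator but matters for totals.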

Turning to ${\hat{\bar{Y}}}_{M R}^{*}$ , we express its variance as

\begin{aligned} V ({\hat{\bar{Y}}}_{M R}^{*}) &= V ({\hat{\bar{Y}}}_{M R} + {\hat{\bar{Y}}}_{M R}^{*} - {\hat{\bar{Y}}}_{M R}) \\ &= V ({\hat{\bar{Y}}}_{M R}) + V ({\hat{\bar{Y}}}_{M R}^{*} - {\hat{\bar{Y}}}_{M R}) \\ &= V ({\hat{\bar{Y}}}_{M R}) + E \{V_{I} ({\hat{\bar{Y}}}_{M R}^{*} - {\hat{\bar{Y}}}_{M R})\} \\ &= V ({\hat{\bar{Y}}}_{M R}) + E \left[V_{I} \left\{\frac{1}{\hat{N}} \sum_{i \in S} w_{i} (1 - r_{i}) (y_{i}^{*} - h_{i}^{⊤} \hat{τ})\right\}\right] \\ &= V ({\hat{\bar{Y}}}_{M R}) + E \left\{\frac{1}{{\hat{N}}^{2}} \sum_{i \in S} w_{i}^{2} (1 - r_{i}) ({\hat{q}}_{i}^{- 1} - 1) {(h_{i}^{⊤} \hat{τ})}^{2}\right\}, \end{aligned}

(13)

where V_I(.) denotes the variance with respect to the random imputation mechanism. It follows that a consistent variance estimator is given by

{\hat{V}}_{J}^{*} = {\hat{V}}_{J} + {\hat{V}}_{I}^{*},

(14)

where V̂_J is given by (11) and

{\hat{V}}_{I}^{*} = \frac{1}{{\hat{N}}^{2}} \sum_{i \in S} w_{i}^{2} (1 - r_{i}) ({\hat{q}}_{i}^{- 1} - 1) {(h_{i}^{⊤} \hat{τ})}^{2} .

A (1 − ζ%) confidence interval for Ȳ is

{\hat{\bar{Y}}}_{M R}^{*} \pm z_{ζ / 2} \sqrt{{\hat{V}}_{J}^{*}} .
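The imputation variance component in (14) is a simple weighted sum over the nonrespondents, as the following Python sketch shows. The function name and the toy inputs are hypothetical; `pred` stands for h_i^⊤τ̂, and the value used for V̂_J is a made-up placeholder rather than the output of the jackknife.

```python
def imputation_variance(w, r, q_hat, pred):
    """Imputation variance component: only nonrespondents (r_i = 0) contribute."""
    n_hat = sum(w)
    total = sum(wi ** 2 * (1 - ri) * (1.0 / qi - 1.0) * fi ** 2
                for wi, ri, qi, fi in zip(w, r, q_hat, pred))
    return total / n_hat ** 2


# Toy inputs (made up).
w = [2.0, 2.0, 4.0]
r = [1, 0, 0]
q_hat = [0.5, 0.5, 0.25]
pred = [3.0, 2.0, 1.0]

v_imp = imputation_variance(w, r, q_hat, pred)
v_jack = 0.8                     # placeholder standing in for the jackknife estimate (11)
v_total = v_jack + v_imp         # total variance estimate as in (14)
```

In this toy case the two nonrespondents contribute 16 and 48 to the sum, N̂² = 64, and the component equals 1. The component shrinks as q̂_i approaches 1, since the random imputation then rarely produces a zero.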

5 Simulation study

We conducted a simulation study to assess the performance of several estimation procedures in terms of bias and efficiency. We considered the simulation setup of Haziza et al. (2014). We generated a finite population of size N = 10,000 consisting of two variables: a survey variable y and an auxiliary variable x. First, the x-values were generated from a Gamma distribution with shift parameter 2 and scale parameter 5. Given the x-values, the population y-values were generated according to the model

y_{i} = δ_{i} (15 + 1.5 x_{i} + ε_{i}),

where the errors ε_i’s were generated from a normal distribution with mean equal to zero and variance equal to σ², whose value was set so as to lead to a coefficient of determination between x and y of 70% for units with δ_i = 1 (i.e., the non-zero part of the population). The δ_i’s were generated from a Bernoulli distribution with probability q(x_i; γ) such that

\log \left(\frac{q (x_{i}; γ)}{1 - q (x_{i}; γ)}\right) = γ_{0} + γ_{1} x_{i},

where γ₀ and γ₁ were set so that the proportion of zero-valued observations was approximately equal to 50%.

From the finite population, we selected B = 1,000 samples, of size n = 200, according to simple random sampling without replacement. In each sample, the response indicators r_i were generated from a Bernoulli distribution with probability p(x_i; α) such that

\log \left(\frac{p (x_{i}; α)}{1 - p (x_{i}; α)}\right) = α_{0} + α_{1} x_{i},

where α₀ and α₁ were set so that the overall response rate was approximately equal to 70%.

We were interested in estimating the finite population mean of the y-values, Ȳ. Because the proposed imputation procedures may be based on different combinations of the models, we used six digits between parentheses to distinguish estimators constructed using different models. The first two digits correspond to the correct and incorrect specification of p(x; α), respectively, the third and fourth digits correspond to the correct and incorrect specification of m(x; β), respectively, and the last two digits correspond to the correct and incorrect specification of q(x; γ), respectively. For instance, the estimator $\hat{\bar{Y}} (10 ∣ 10 ∣ 10)$ is based on correct specification of p(x; α), m(x; β) and q(x; γ), whereas the estimator $\hat{\bar{Y}} (11 ∣ 11 ∣ 11)$ is based on all the (correct and incorrect) models for p(x; α), m(x; β) and q(x; γ). In our experiments, the incorrectly specified models for p(x; α), m(x; β) and q(x; γ) only included the intercept; that is, the x-variable was omitted from each of the three models.
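The six-digit labels can be decoded mechanically. The helper below is an illustrative sketch (the function `parse_label` is not from the paper) that maps a label to the working models used for p, m and q.

```python
def parse_label(label):
    """Decode a label such as '11|01|10' into the working models it uses.

    Each two-digit block refers to p, m and q in turn; within a block the
    first digit flags the correct model and the second the incorrect one.
    """
    blocks = [b.strip() for b in label.replace("∣", "|").split("|")]
    if len(blocks) != 3 or any(len(b) != 2 or set(b) - {"0", "1"} for b in blocks):
        raise ValueError("expected three two-digit blocks of 0s and 1s")
    models = {}
    for name, digits in zip(("p", "m", "q"), blocks):
        models[name] = [kind for kind, d in zip(("correct", "incorrect"), digits)
                        if d == "1"]
    return models


decoded = parse_label("11|01|10")
```

Here `decoded` says that both nonresponse models are used, only the incorrect model for m, and only the correct model for q. A block of `00` means that no model of that type enters the estimator.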

We computed the following estimators: (i) The complete data estimator, ${\hat{\bar{Y}}}_{COM}$ , given by (1); (ii) The imputed estimator ${\hat{\bar{Y}}}_{I}$ , given by (6) based on correct and incorrect specification of m(x; β) and q(x; γ); (iii) The multiply robust estimator, ${\hat{\bar{Y}}}_{M R}$ , given by (9) based on correct and incorrect specification of p(x; α), m(x; β) and q(x; γ).

For each point estimator, we computed the Monte Carlo percent relative bias (RB), the relative standard error (RSE) and the relative root mean square error (RRMSE). The results are presented in Table 1. Before discussing the results, it is worth noting that, in our simulation set-up, some pairs of estimators were identical. This is due to the misspecified nonresponse model p(x; α) that contains only the intercept. As a result, the estimated response probability based on the incorrect model was equal to the overall response rate for all the units. In this case, only one estimator of each pair is shown in Table 1. Pairs of identical estimators include ${\hat{\bar{Y}}}_{M R} (10 ∣ 11 ∣ 11)$ and ${\hat{\bar{Y}}}_{M R} (11 ∣ 11 ∣ 11), {\hat{\bar{Y}}}_{M R} (10 ∣ 11 ∣ 01)$ and ${\hat{\bar{Y}}}_{M R} (11 ∣ 11 ∣ 01)$ and ${\hat{\bar{Y}}}_{M R} (11 ∣ 01 ∣ 01)$ and ${\hat{\bar{Y}}}_{M R} (10 ∣ 01 ∣ 01)$ .
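The three Monte Carlo measures are not defined explicitly in the text. The sketch below assumes the usual definitions, each expressed as a percentage of the true mean: RB is the average error, RSE is the Monte Carlo standard deviation, and RRMSE is the root of the average squared error. The function name and the toy replicates are illustrative.

```python
import math


def monte_carlo_summary(estimates, true_mean):
    """Return (RB, RSE, RRMSE) in percent, under the assumed definitions."""
    b = len(estimates)
    avg = sum(estimates) / b
    var = sum((e - avg) ** 2 for e in estimates) / b          # Monte Carlo variance
    mse = sum((e - true_mean) ** 2 for e in estimates) / b    # Monte Carlo MSE
    rb = 100.0 * (avg - true_mean) / true_mean
    rse = 100.0 * math.sqrt(var) / true_mean
    rrmse = 100.0 * math.sqrt(mse) / true_mean
    return rb, rse, rrmse


# Toy replicates (made up) around a true mean of 10.
rb, rse, rrmse = monte_carlo_summary([9.0, 11.0, 10.0, 12.0], 10.0)
```

Under these definitions RRMSE² = RB² + RSE², and the values reported in Table 1 are consistent with this decomposition up to rounding (for instance 0.77² + 8.62² ≈ 8.65²).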

Table 1.

Relative bias (RB), relative standard error (RSE) and relative root mean squared error (RRMSE) of all estimators

| Estimators | RB(%) | RSE(%) | RRMSE(%) |
|---|---|---|---|
| ${\hat{\bar{Y}}}_{COM}$ | 0.38 | 7.56 | 7.57 |
| ${\hat{\bar{Y}}}_{I} (00 ∣ 10 ∣ 10)$ | 0.77 | 8.62 | 8.65 |
| ${\hat{\bar{Y}}}_{I} (00 ∣ 01 ∣ 10)$ | 10.64 | 9.31 | 14.14 |
| ${\hat{\bar{Y}}}_{I} (00 ∣ 10 ∣ 01)$ | −29.54 | 8.85 | 30.84 |
| ${\hat{\bar{Y}}}_{I} (00 ∣ 01 ∣ 01)$ | −26.52 | 9.25 | 28.09 |
| ${\hat{\bar{Y}}}_{M R} (10 ∣ 10 ∣ 10)$ | 0.87 | 8.82 | 8.86 |
| ${\hat{\bar{Y}}}_{M R} (10 ∣ 01 ∣ 10)$ | 1.23 | 8.92 | 9.00 |
| ${\hat{\bar{Y}}}_{M R} (10 ∣ 10 ∣ 01)$ | 1.50 | 8.97 | 9.10 |
| ${\hat{\bar{Y}}}_{M R} (01 ∣ 10 ∣ 01)$ | −5.97 | 10.41 | 12.00 |
| ${\hat{\bar{Y}}}_{M R} (01 ∣ 11 ∣ 11)$ | 0.65 | 8.64 | 8.66 |
| ${\hat{\bar{Y}}}_{M R} (11 ∣ 01 ∣ 11)$ | 1.19 | 8.92 | 9.00 |
| ${\hat{\bar{Y}}}_{M R} (11 ∣ 10 ∣ 11)$ | 0.89 | 8.75 | 8.79 |
| ${\hat{\bar{Y}}}_{M R} (11 ∣ 11 ∣ 11)$ | 0.89 | 8.75 | 8.79 |
| ${\hat{\bar{Y}}}_{M R} (01 ∣ 11 ∣ 01)$ | −5.97 | 10.41 | 12.00 |
| ${\hat{\bar{Y}}}_{M R} (11 ∣ 01 ∣ 01)$ | 1.04 | 8.88 | 8.94 |
| ${\hat{\bar{Y}}}_{M R} (11 ∣ 10 ∣ 01)$ | 1.52 | 8.98 | 9.11 |
| ${\hat{\bar{Y}}}_{M R} (11 ∣ 11 ∣ 01)$ | 1.52 | 8.98 | 9.11 |


From Table 1, the complete data estimator showed negligible bias and was the most efficient estimator in all the scenarios, as expected. We now turn to the estimators based on a single imputation model, which is typically what would be done in practice. The estimator ${\hat{\bar{Y}}}_{I} (00 ∣ 10 ∣ 10)$ , based on the correct imputation model only, showed a negligible bias (approximately 0.77%) and a value of RRMSE of 8.65%. This estimator can be viewed as the gold standard in terms of bias and efficiency. On the other hand, when m(x; β) and/or q(x; γ) was misspecified, the resulting estimator was highly biased. For example, the estimator ${\hat{\bar{Y}}}_{I} (00 ∣ 01 ∣ 10)$ showed a value of RB of 10.64%, whereas the estimator ${\hat{\bar{Y}}}_{I} (00 ∣ 10 ∣ 01)$ showed a value of RB equal to −29.54%. Finally, under the misspecification of both m(x; β) and q(x; γ), the estimator ${\hat{\bar{Y}}}_{I} (00 ∣ 01 ∣ 01)$ exhibited a bias approximately equal to −26.52%. Therefore, the estimators based on a single misspecified imputation model (i.e., m(x; β) and/or q(x; γ) was misspecified) performed poorly in terms of bias, which is not surprising.

The MR estimators based on a single nonresponse model and a single imputation model showed a small bias when either the model for p(x; α) or the imputation model (which entails the correct specification of both m(x; β) and q(x; γ)) was correctly specified. When both models were correctly specified, the estimator ${\hat{\bar{Y}}}_{M R} (10 ∣ 10 ∣ 10)$ showed a value of RB of 0.87% and a value of RRMSE equal to 8.86%. A comparison of ${\hat{\bar{Y}}}_{I} (00 ∣ 10 ∣ 10)$ and ${\hat{\bar{Y}}}_{M R} (10 ∣ 10 ∣ 10)$ suggests that the latter was almost as efficient as the former with a slight increase in RRMSE of 2.4% approximately. However, when both p(x; α) and the imputation model were misspecified, the MR estimators were significantly biased. For example, the RB of ${\hat{\bar{Y}}}_{M R} (01 ∣ 10 ∣ 01)$ was equal to −5.97%. It is interesting to note that, despite being inconsistent, the estimator ${\hat{\bar{Y}}}_{M R} (01 ∣ 10 ∣ 01)$ performed much better than ${\hat{\bar{Y}}}_{I} (00 ∣ 10 ∣ 01)$ in terms of bias (−5.97% vs. −29.54%).

Finally, we discuss the performance of the MR estimators based on more than one imputation model and/or more than one nonresponse model. All the multiply robust estimators showed small biases and were almost as efficient as ${\hat{\bar{Y}}}_{I} (00 ∣ 10 ∣ 10)$ in all the scenarios. For example, the estimator ${\hat{\bar{Y}}}_{M R} (11 ∣ 11 ∣ 11)$ , which was based on all the models, showed a value of RB of 0.89% and a value of RRMSE of 8.79%. Therefore, the loss of efficiency with respect to the gold standard ${\hat{\bar{Y}}}_{I} (00 ∣ 10 ∣ 10)$ was only about 1.6%. These results suggest that incorporating multiple models does not affect the efficiency significantly.
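The efficiency comparisons quoted in this section can be reproduced from the RRMSE column of Table 1, taking the relative increase with respect to the gold standard value of 8.65:

```latex
\begin{aligned}
100 \times \frac{8.86 - 8.65}{8.65} &\approx 2.4 \quad \text{for } {\hat{\bar{Y}}}_{M R} (10 ∣ 10 ∣ 10), \\
100 \times \frac{8.79 - 8.65}{8.65} &\approx 1.6 \quad \text{for } {\hat{\bar{Y}}}_{M R} (11 ∣ 11 ∣ 11) .
\end{aligned}
```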

In addition, we assessed the performance of the jackknife variance estimator given by (11) in terms of percent relative bias (RB) as well as the coverage rate (CR) of the confidence interval (12). Table 2 presents the results corresponding to five multiply robust estimators. Other estimators led to very similar results and hence are not shown here. From Table 2, the jackknife variance estimator showed a small bias with a value of RB of at most 7.6%. The coverage rates were close to the nominal rate (95%) in all the scenarios.

Table 2.

Relative bias (RB) and Coverage rate (CR) of the jackknife variance estimator

| Estimators | RB(%) | CR(%) |
|---|---|---|
| ${\hat{\bar{Y}}}_{M R} (01 ∣ 11 ∣ 11)$ | 4.0 | 94.8 |
| ${\hat{\bar{Y}}}_{M R} (10 ∣ 11 ∣ 11)$ | 5.7 | 95.0 |
| ${\hat{\bar{Y}}}_{M R} (11 ∣ 01 ∣ 11)$ | 7.5 | 95.5 |
| ${\hat{\bar{Y}}}_{M R} (11 ∣ 10 ∣ 11)$ | 7.6 | 95.5 |
| ${\hat{\bar{Y}}}_{M R} (11 ∣ 11 ∣ 11)$ | 7.6 | 95.0 |


6 Discussion

We proposed multiply robust imputation procedures for complex survey data when the variable subject to missingness exhibits a large number of zero-valued observations. The proposed procedures showed good performances in terms of bias and efficiency. To better reflect the mixture of zero-valued and nonzero-valued observations, we suggested the use of (10) rather than (8). However, the resulting estimator, ${\hat{\bar{Y}}}_{M R}^{*}$ , suffers from an additional variability arising from the random imputation mechanism. Eliminating this variance can be achieved through the use of a balanced random imputation procedure similar to the one described in Haziza et al. (2014). This is a topic of future research.

If the interest lies in estimating the finite population distribution function, a random noise may be added for the portion of sample units with δ_i = 1. More specifically, the missing y_i is replaced by

y_{i}^{*} = {\hat{δ}}_{i}^{*} ({\hat{m}}_{i} + \hat{σ} \sqrt{c_{i}} ε_{i}^{*}),

where σ̂ is an estimator of σ, ${\hat{δ}}_{i}^{*} = 1$ with probability q̂_i and ${\hat{δ}}_{i}^{*} = 0$ with probability 1 − q̂_i, m̂_i is defined in Section 3 and the $ε_{i}^{*}$ are selected independently and with replacement from the set of residuals R_r = {e_j ; r_j = 1, δ_j = 1}, where $e_{j} = {(\hat{σ} \sqrt{c_{j}})}^{- 1} (y_{j} - {\hat{m}}_{j})$ , with probability proportional to w_j. Estimation of the finite population distribution function will be presented elsewhere.

Acknowledgments

The authors would like to sincerely thank the guest co-editors Mary Dickson, Jean Opsomer and Giovanna Ranalli for their invitation to contribute to this special issue of Metron. The first author wishes to acknowledge the partial funding provided by National Institutes of Health, National Institute of General Medical Sciences (Grant 1 U54GM104938), an IDeA-CTR to the University of Oklahoma Health Sciences Center. The second author wishes to acknowledge the support of grants from the Natural Sciences and Engineering Research Council and the Canadian Statistical Sciences Institute.

References

Berger YG. A jackknife variance estimator for unistage stratified samples with unequal probabilities. Biometrika. 2007;94:953–964. [Google Scholar]
Chan KCG, Yam SCP. Oracle, multiple robust and multipurpose calibration in missing response problem. Statistical Science. 2014;29:380–386. [Google Scholar]
Chen S, Haziza D. Multiply robust imputation procedures for the treatment of item nonresponse in surveys. Biometrika. 2017a;102:439–453. [Google Scholar]
Chen S, Haziza D. Jackknife empirical likelihood method for multiply robust estimation with missing data. 2017b doi: 10.1016/j.csda.2018.05.011. Submitted. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen S, Haziza D. Multiply robust nonparametric multiple imputation for the treatment of missing data. 2017c In revision for Statistica Sinica. [Google Scholar]
Duan X, Yin G. Ensemble approaches to estimation of the population mean with missing responses. 2017 To appear in the Scandinavian Journal of Statistics. [Google Scholar]
Fay RE. A design-based perspective on missing data variance. Proceedings of the 1991 Annual Research Conference, US Bureau of the Census; 1991. pp. 429–440. [Google Scholar]
Gelein B. Preserving the distribution function in surveys in case of imputation for zero inflated data. 2016 Submitted. [Google Scholar]
Hájek J. Asymptotic theory of rejective sampling with varying probabilities from a finite population. Annals of the Institute of Statistical Mathematics. 1964;35:1491–1523. [Google Scholar]
Hájek J. Comment on “An essay on the logical foundations of survey sampling” by Basu, D. In: Godambe VP, Sprott DA, editors. Foundations of Statistical Inference. Toronto: Holt, Rinehart, and Winston; 1971. p. 236.
Han P, Wang L. Estimation with missing data: beyond double robustness. Biometrika. 2013;100:417–430.
Han P. A further study of the multiply robust estimator in missing data analysis. Journal of Statistical Planning and Inference. 2014a;148:101–110.
Han P. Multiply robust estimation in regression analysis with missing data. Journal of the American Statistical Association. 2014b;109:1159–1173.
Haziza D. Imputation and inference in the presence of missing data. In: Pfeffermann D, Rao CR, editors. Handbook of Statistics, Volume 29A, Sample Surveys: Theory Methods and Inference. Elsevier; 2009. pp. 215–246.
Haziza D, Rao JNK. A nonresponse model approach to inference under imputation for missing survey data. Survey Methodology. 2006;32:53–64.
Haziza D, Nambeu CO, Chauvet G. Doubly robust imputation procedures for finite population means in the presence of a large number of zeros. The Canadian Journal of Statistics. 2014;42:650–669.
Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data (with discussion). Statistical Science. 2007;22:523–539. doi: 10.1214/07-STS227.
Kim JK, Haziza D. Doubly robust inference with missing data in survey sampling. Statistica Sinica. 2014;24:375–394.
Kim JK, Park H. Imputation using response probability. Canadian Journal of Statistics. 2006;34:171–182.
Kim JK, Rao JNK. Unified approach to linearization variance estimation from survey data after imputation for item nonresponse. Biometrika. 2009;96:917–932.
Kim JK, Shao J. Statistical Methods for Handling Incomplete Data. London: Chapman and Hall/CRC; 2013.
Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. Hoboken, NJ: John Wiley & Sons; 2002.
Ohlsson E. Sequential Poisson sampling. Journal of Official Statistics. 1998;14:149–162.
Rao JNK. On two simple schemes of unequal probability sampling without replacement. Journal of the Indian Statistical Association. 1965;3:173–180.
Rao JNK, Shao J. Jackknife variance estimation with survey data under hot-deck imputation. Biometrika. 1992;79:811–822.
Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
Sampford MR. On sampling without replacement with unequal probabilities of selection. Biometrika. 1967;54:499–513.
Shao J, Sitter RR. Bootstrap for imputed survey data. Journal of the American Statistical Association. 1996;91:1278–1288.
Shao J, Steel P. Variance estimation for survey data with composite imputation and nonnegligible sampling fractions. Journal of the American Statistical Association. 1999;94:254–265.
Tillé Y. Sampling Algorithms. New York: Springer-Verlag; 2006.

