Abstract
Missing data are common in medical and social science studies and often pose a serious challenge in data analysis. Multiple imputation methods are popular and natural tools for handling missing data, replacing each missing value with a set of plausible values that represent the uncertainty about the underlying values. We consider a case of missing at random (MAR) and investigate the estimation of the marginal mean of an outcome variable in the presence of missing values when a set of fully observed covariates is available. We propose a new nonparametric multiple imputation (MI) approach that uses two working models to achieve dimension reduction and define the imputing sets for the missing observations. Compared with existing nonparametric imputation procedures, our approach can better handle covariates of high dimension, and is doubly robust in the sense that the resulting estimator remains consistent if either of the working models is correctly specified. Compared with existing doubly robust methods, our nonparametric MI approach is more robust to the misspecification of both working models; it also avoids the use of inverse-weighting and hence is less sensitive to missing probabilities that are close to 1. We propose a sensitivity analysis for evaluating the validity of the working models, allowing investigators to choose the optimal weights so that the resulting estimator relies either completely or more heavily on the working model that is likely to be correctly specified and achieves improved efficiency. We investigate the asymptotic properties of the proposed estimator, and perform simulation studies to show that the proposed method compares favorably with some existing methods in finite samples. The proposed method is further illustrated using data from a colorectal adenoma study.
Keywords: Doubly robust, Missing at random, Multiple imputation, Nearest neighbor, Nonparametric imputation, Sensitivity analysis
1 Introduction
Missing data are common in medical and social science studies. When the missingness does not depend on the data values, missing or observed, the data are called missing completely at random (MCAR) and one can perform a so-called complete case analysis by ignoring the observations with missing values (Little and Rubin (2002)). In practice, the assumption of MCAR is rarely satisfied, and a complete case analysis can lead to biased estimation. A more realistic missing mechanism is that the data are missing at random (MAR) (Rubin (1987); Little and Rubin (2002)), which means that the missing status is independent of missing values conditional on observed covariates.
1.1 The Problem
Let Y denote the outcome variable of interest with missing values, δ denote the missingness indicator, with δ = 0 if Y is missing and δ = 1 if Y is observed, and X = (X1, X2, ..., Xp) denote a set of fully observed covariates that are predictive of Y or δ. Suppose (Yi, Xi, δi), i = 1, . . . , n, constitute an independent and identically distributed sample of n subjects. The observed data can be written as (δiYi, Xi, δi), i = 1, . . . , n, where Yi is missing when δi = 0. We consider the estimation of μ = E(Y) when Y is independent of δ given X. For this type of problem, one can use imputation methods (either single or multiple) or inverse probability weighting methods that are doubly robust.
1.2 Imputation Methods
An imputation method replaces each missing value with one “plausible” value (single imputation) or a set of “plausible” values (multiple imputation, MI), and standard analyses are subsequently performed on the imputed data sets. Adjustments are necessary when computing standard errors to account for the uncertainty of the imputed values (Rubin (1987); Little and Rubin (2002)). The imputation models can be parametric (Matloff (1981); Little and Rubin (2002)), semiparametric (Wang, Linton, and Härdle (2004)), or nonparametric (Titterington and Sedransk (1989); Cheng (1994); Aerts et al. (2002)). Although efficient when the parametric component is correctly specified, parametric and semiparametric imputation methods are sensitive to model misspecification. A nonparametric imputation approach is more robust to model misspecification, but a different challenge arises: most existing nonparametric regression imputation methods focus on the case of a single fully observed covariate. For example, Cheng (1994) studied a single imputation approach that imputes missing values with kernel estimates of the conditional mean of the outcome variable given a continuous covariate, and Aerts et al. (2002), among others, studied an MI approach based on the distribution of the outcome variable conditional on the covariate, estimated nonparametrically using kernel methods. The main difficulty with nonparametric imputation methods is that, as the number of covariates increases, it becomes increasingly difficult to estimate either the conditional distribution or the conditional expectation of the outcome variable given the covariates, due to the curse of dimensionality. In practice, the performance loss for nonparametric imputation methods can be substantial even when only a small number of covariates are used. It is therefore important to have a nonparametric approach that alleviates the curse of dimensionality in the presence of multiple covariates.
1.3 Doubly Robust Methods
The earliest doubly robust method for missing data is the calibration estimator (CE), also known as the generalized regression estimator (Cassel, Särndal, and Wretman (1976)), which extended the inverse probability weighting method (Horvitz and Thompson (1952)). The CE uses two working models based on the observed covariates: one model predicts the missing values, and the other model predicts the missing probabilities. Specifically, the estimator results from expressing the mean of Y as a sum of a prediction term and an inverse probability-weighted prediction error term,

μ = E{E(Y|X)} + E[δ{Y − E(Y|X)}/π(X)],

where π(X) = E(δ|X). Plugging the estimate of each unknown quantity into this expression leads to

μ̂CE = n−1 Σi Ŷi + n−1 Σi δi(Yi − Ŷi)/π̂(Xi), (1)

where Ŷi is the prediction for subject i based on a regression model for E(Y|X) that is fit using the complete cases, π̂(Xi)−1 is the inverse of the estimated probability of being observed, often computed using a regression model for π(X), and μ̂CE denotes the resulting calibration estimator. On the right-hand side (RHS) of (1), the first term is equivalent to imputing all Y values using a model for E(Y|X), and the second term is a sum of inverse probability-weighted prediction errors from the model for E(Y|X), with weights estimated from the model for π(X). If the model for E(Y|X) is correctly specified, then the second term converges to 0 and μ̂CE converges to μ. If the model for π(X) is correctly specified, then one can show that the second term consistently removes any bias that may be associated with the first term, and hence μ̂CE still converges to μ. As a result, μ̂CE is consistent if either of the two working models is correctly specified.
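As a concrete illustration of (1), the following Python sketch computes a calibration-type estimator, assuming a linear working model for E(Y|X), a logistic working model for π(X), and simulated data; the variable names and data-generating choices are ours and purely illustrative, not part of the original formulation.

```python
# Minimal sketch of a calibration-type estimator of the form (1), assuming a
# linear working model for E(Y|X) and a logistic working model for pi(X) = E(delta|X).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 400
X = rng.uniform(-1, 1, size=(n, 5))
y = 10 + X @ np.array([2, -2, 3, -3, 1.5]) + rng.normal(0, 3, n)    # illustrative outcome
pi_true = 1 / (1 + np.exp(-X.sum(axis=1)))
delta = rng.binomial(1, pi_true)                                    # 1 = observed, 0 = missing

obs = delta == 1
y_hat = LinearRegression().fit(X[obs], y[obs]).predict(X)           # outcome model, complete cases
pi_hat = LogisticRegression().fit(X, delta).predict_proba(X)[:, 1]  # missingness model (default penalty)

# First term: impute every subject with y_hat; second term: inverse-probability-
# weighted prediction errors among the observed cases.
mu_ce = y_hat.mean() + np.mean(delta * (y - y_hat) / pi_hat)
print(round(mu_ce, 3))   # should be close to the true mean of 10 in this setup
```

Since both working models are (approximately) correctly specified in this toy setup, the printed estimate should be close to the true mean of 10.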
Other doubly robust estimators have been introduced that use a parametric model to impute missing values and inverse probability-weighted prediction errors to correct potential bias that is associated with the parametric model for imputation. In particular, doubly robust methods were extended to regression settings (Robins, Rotnitzky, and Zhao (1994)) and repeated measurement data (Robins, Rotnitzky, and Zhao (1995)). In the context of estimating a population mean, Qin, Shao, and Zhang (2008) and Cao, Tsiatis, and Davidian (2009) proposed two elegant approaches to improve the efficiency of the CE when the imputation model is incorrectly specified, and their methods achieve the semiparametric efficient lower bound when both models are correctly specified. During the review process, a recent related work by Hu, Follmann, and Qin (2010) was brought to our attention; they extended the CE estimator through the use of a nonparametric imputation model, where the dimension of the covariates is reduced through a parametric working index.
The double-robustness property of the CE and its extensions, though attractive, has its limitations. If one working model is misspecified, especially if it is seriously misspecified, a doubly robust estimator, although consistent, can have increased bias or variance in small samples. When both models are misspecified, a doubly robust estimator can underperform other estimators that are not doubly robust (Kang and Schafer (2007)). In addition, the inverse probability weighting step can be sensitive to missing probabilities that are close to 1. Therefore, it is desirable to develop an inference procedure that reduces the impact of misspecification of both working models, allows us to identify and rely more heavily on the working model that is correctly specified, and is less sensitive to missing probabilities that are close to 1.
1.4 Doubly Robust Multiple Imputation
We propose a new nonparametric MI method that alleviates the curse of dimensionality, which limits the usefulness of existing nonparametric imputation methods. The proposed method is doubly robust and differs from the CE and its extensions in that it does not use inverse probability weighting. Our method has several advantages. First, being nonparametric, it is more robust when the two working models are misspecified. Second, it lessens the adverse impact of extreme missing probabilities: the method avoids inverse probability weighting and relies only on imputation based on two working models, and since one imputation creates only one pseudo observation for each observation with missing values, its impact is considerably less than that of inverse probability weighting in the presence of extreme missing probabilities. We also propose a new sensitivity analysis for empirically evaluating the validity of the working models by varying the weights that are used to define similarity between observations under the two working models. We note that for the CE and its related estimators the existing sensitivity analyses primarily focus on the impact of non-ignorable missingness (Rotnitzky, Robins, and Scharfstein (1998); Scharfstein, Rotnitzky, and Robins (1999)), and a sensitivity analysis similar to ours is neither available nor obvious. The proposed sensitivity analysis allows investigators to select optimal weights so that the resulting estimator relies completely or more heavily on the working model that is likely to be correctly specified, thereby achieving improved efficiency. Furthermore, the use of two weights allows investigators to incorporate their prior beliefs about the validity of the two working models. In summary, our approach is intended to combine the strengths of nonparametric imputation methods and the CE method, and to overcome their respective limitations. Our main goal is not to achieve a gain in efficiency; thus, we primarily focus on comparing our approach with existing imputation methods and the CE.
The rest of the paper is organized as follows. In Section 2, we present the doubly robust nonparametric MI approach and its sensitivity analysis. In Section 3, we evaluate finite sample performance in simulation studies. In Section 4, we illustrate the proposed approach using data from a colorectal adenoma study. We conclude with some discussion in Section 5.
2 The Methodology
We first introduce the working models, then describe in detail the doubly robust nonparametric multiple imputation procedures.
2.1 Working Models and Predictive Scores
In order to use the fully observed X to define an imputing set for each observation with missing Y , we consider two working models. Based on the idea of predictive mean matching (Rubin (1987)), the first model is for the outcome Y ,
E(Y | Xo) = l1(Xo, β), (2)
where l1 is a specified real-valued smooth function, Xo is a set of p1 observed covariates, and β = (β1, . . . , βp1)T is a vector of regression coefficients. Here l1 is considered a predictive score for Y; for example, one can use the linear regression model, l1(Xo, β) = βTXo. When the working model (2) for Y is correctly specified, an imputing set for each missing Y defined through the predictive score can lead to an improvement in efficiency when the missing mechanism is MCAR, i.e., E(δ|X, Y) = E(δ), and a reduction in bias when the missing mechanism is MAR, i.e., E(δ|X, Y) = E(δ|X). If the working regression model for Y is misspecified, bias may remain even under a MAR mechanism. Hence, based on the idea of propensity score matching (Rosenbaum and Rubin (1983)), we take a model for predicting missingness to be
E(δ | Xm) = l2(Xm, α), (3)
where l2 is a specified real-valued smooth function, Xm is a set of p2 observed covariates, and α = (α1, . . . , αp2)T is a vector of regression coefficients. Here l2 is considered a predictive score for δ; for example, one can use the logistic regression model, l2(Xm, α) = exp(αTXm)/(1 + exp(αTXm)). We note that more complicated models can be used in (2) and (3). For instance, when p1 and p2 are large, a Lasso regression (Tibshirani (1996)) can be used for Model (2), and a generalized boosted model (GBM) (McCaffrey, Ridgeway, and Morral (2004)) can be used for Model (3). The functions l1 and l2 may also include higher order terms of X. We denote the estimators of β and α based on (2) and (3) by β̂ and α̂; throughout, β̂ and α̂ are assumed to be M-estimators or Z-estimators (van der Vaart (1998)). The incorrect specification of a working model can lie in the functional form of l1 or l2 or in the set of covariates included. In practical applications, it is difficult to correctly choose a model for E(Y|Xo) = l1(Xo, β), and there is no guarantee that the assumed model is correct. Our hope is that the proposed method can improve estimation efficiency if the assumed model is reasonably good, though not perfect.
Let Z1 = l1(Xo, β) and Z2 = l2(Xm, α). After (2) and (3) are fit using methods such as maximum likelihood or estimating equations, the estimated predictive scores are Ẑ ≡ (Ẑ1, Ẑ2) = (l1(Xo, β̂), l2(Xm, α̂)). Alternatively, one could take a monotonic transformation of (Z1, Z2) as the predictive scores. For example, if l2(Xm, α) = exp(αTXm)/(1 + exp(αTXm)), the linear combination αTXm could be taken as the predictive score for δ. The proposed strategy summarizes the multi-dimensional X with a two-dimensional predictive score, Z ≡ (Z1, Z2). In the presence of only one predictive covariate, the predictive score is the covariate itself, and there is no need to fit the two working models.
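To make the dimension-reduction step concrete, the following Python sketch computes estimated predictive scores under working models of the forms used above (linear regression for (2), logistic regression for (3)); the simulated data and variable names are illustrative assumptions, and the linear predictor of the logistic fit is used as the score for δ, as suggested above.

```python
# Sketch of the estimated predictive scores (Z1_hat, Z2_hat) from the two working
# models, assuming a linear model for (2) and a logistic model for (3); data are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
n = 400
X = rng.uniform(-1, 1, size=(n, 5))
y = 10 + X @ np.array([2, -2, 3, -3, 1.5]) + rng.normal(0, 3, n)
delta = rng.binomial(1, 1 / (1 + np.exp(-X.sum(axis=1))))   # missingness indicator

obs = delta == 1
z1 = LinearRegression().fit(X[obs], y[obs]).predict(X)      # working model (2), complete cases
fit = LogisticRegression().fit(X, delta)                    # working model (3), all cases
z2 = X @ fit.coef_.ravel() + fit.intercept_[0]              # linear predictor as the score for delta

# Standardize each score (Section 2.2) to obtain S = (S1, S2).
S = np.column_stack([(z - z.mean()) / z.std() for z in (z1, z2)])
```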
2.2 Multiple Imputation (MI) Estimator
To stabilize the imputation, each predictive score is standardized by subtracting its mean and dividing by its standard deviation; the resulting score is denoted by S ≡ (S1, S2). Given S, for each subject with missing Y we create an imputing set that consists of observed responses from subjects who are similar. Specifically, S is used to select the imputing set by calculating the distance between subjects as
d(i, j) = {ωo(S1i − S1j)2 + ωm(S2i − S2j)2}1/2, (4)
where ωo and ωm are non-negative weights for the predictive scores from (2) and (3), respectively, satisfying ωo + ωm = 1. The choice of the weights (ωo, ωm) can reflect the confidence of investigators in each working model. While the double-robustness property no longer holds when ωo or ωm is 1, such weights are useful in sensitivity analysis, as will be illustrated in Section 2.5. For each subject i with missing Y, the distance d(i, j) is used to define a neighborhood, denoted by RK(i), consisting of the K subjects with observed Y that have the K smallest distances from subject i.
Given the imputing sets, we propose a multiple imputation (MI) estimator for the parameter of interest, μ. In the lth imputation, given RK(i) for each subject i with a missing outcome, a value is randomly drawn with equal probability from RK(i) to replace the missing Y for subject i. We repeat this step for all subjects with missing Y, and let the lth imputed data set and the associated sample mean be denoted by D(l) and μ̂(l), respectively. The imputation scheme is independently repeated L times to obtain L imputed data sets, and the subsequent analysis of the multiply imputed data sets follows the well-established rules in Rubin (1987) and Little and Rubin (2002). The final MI estimator of μ is
μ̂MI = (1/L) Σl μ̂(l). (5)
We refer to this method as MI(K, ωo, ωm), where K is the number of nearest neighbors, and ωo and ωm are the weights used to define the distance in (4).
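A minimal sketch of MI(K, ωo, ωm) is given below: it computes the weighted distance (4) on standardized scores, forms the imputing set RK(i) from the K nearest observed neighbors, and averages L single-imputation estimates as in (5). The function name, toy data, and default arguments are our own illustrative choices.

```python
# Sketch of MI(K, w_o, w_m): weighted distance (4), imputing sets R_K(i) of the K
# nearest observed neighbors, and the average of L imputed-data means as in (5).
import numpy as np

def mi_estimate(S, y, delta, K=3, L=5, w_o=0.5, w_m=0.5, rng=None):
    """S: (n, 2) standardized predictive scores; y: outcomes (NaN where missing);
    delta: 1 if Y is observed, 0 if missing."""
    rng = rng if rng is not None else np.random.default_rng()
    obs_idx = np.flatnonzero(delta == 1)
    mis_idx = np.flatnonzero(delta == 0)
    # Distance (4) between each incomplete case i and every observed case j.
    diff = S[mis_idx, None, :] - S[None, obs_idx, :]
    dist = np.sqrt(w_o * diff[:, :, 0] ** 2 + w_m * diff[:, :, 1] ** 2)
    # Imputing set R_K(i): the K observed cases with the smallest distances.
    neighbors = obs_idx[np.argsort(dist, axis=1)[:, :K]]
    estimates = []
    for _ in range(L):
        y_imp = y.copy()
        # One donor per incomplete case, drawn with equal probability within R_K(i).
        pick = neighbors[np.arange(len(mis_idx)), rng.integers(0, K, len(mis_idx))]
        y_imp[mis_idx] = y[pick]
        estimates.append(y_imp.mean())
    return float(np.mean(estimates))

# Toy usage with illustrative data.
rng = np.random.default_rng(2)
n = 200
S = rng.normal(size=(n, 2))
delta = rng.binomial(1, 0.7, n)
y = np.where(delta == 1, 5 + S[:, 0] + rng.normal(0, 1, n), np.nan)
print(round(mi_estimate(S, y, delta, K=3, L=5, w_o=0.8, w_m=0.2, rng=rng), 3))
```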
2.3 Theoretical Properties of MI Estimator
We set forth the asymptotic properties of the proposed MI estimator as n → ∞ and L → ∞; a sketch of the proofs is provided in the Appendix. Let →p denote convergence in probability.
Proposition 1. Under Conditions (B1) and (B2) in the Appendix, there exist β0 and α0 such that β̂ →p β0 and α̂ →p α0.
Proposition 1 implies that the limits of β̂ and α̂ exist even if both working models are misspecified. When the working models (2) and (3) are correctly specified, β0 and α0 are the true parameter values. Denote by Z0 ≡ (Z01, Z02) = (l1(Xo, β0), l2(Xm, α0)) the predictive scores evaluated at the limits β0 and α0.
Proposition 2. If Y is independent of δ conditional on X and if either (2) or (3) is correctly specified, then E(Y|δ, Z0) = E(Y|Z0).
Note that the result in Proposition 2 is weaker than the conditional independence between Y and δ given Z, as (2) is postulated on the mean of Y only, not on the distribution of Y.
We consider here the multiple imputation estimator computed using Z0 instead of Ẑ, denoted by μ̃MI. Take μ(Z0) = E(Y|Z0), π(Z0) = Pr(δ = 1|Z0), and σ2(Z0) = var(Y|Z0).
Theorem 1. Under Conditions (A1)-(A3) in the Appendix, if ωo and ωm are positive and either (2) or (3) is correctly specified, then n1/2(μ̃MI − μ) has an asymptotic normal distribution with mean 0 and variance σ1² + σ2² + σ3² + 2σ23,

where σ1², σ2², σ3², and σ23 are the variance and covariance components arising from the decomposition of n1/2(μ̃MI − μ) given in the Appendix.
Theorem 1 implies that μ̃MI is doubly robust and achieves the root-n (n1/2) convergence rate. We note that the results in Theorem 1 hold for all fixed positive weights; this is analogous to results for kernel methods (Cheng (1994); Aerts et al. (2002)), where the asymptotic results do not depend on the specific form of the kernel function. In finite samples, the impact of varying the weights can be appreciable, as seen later in the simulation studies. In Theorem 1, we do not need to specify the full conditional distribution of Y given X in (2), and Y can be continuous or discrete. Given additional conditions, one can simplify the asymptotic variance in Theorem 1.
Corollary 1. If ωo and ωm are positive and either (2) or (3) is correctly specified, and if Y is independent of δ given Z0, or E(Y|δ, Z0) = E(Y|Z0) and E(Y2|δ, Z0) = E(Y2|Z0), then n1/2(μ̃MI − μ) is asymptotically normal with mean 0 and variance var{μ(Z0)} + E{σ2(Z0)/π(Z0)}.
Two remarks are in order. First, when a monotonic transformation of Z0 is used to define the predictive scores, the asymptotic results in Theorem 1 still hold. In our numerical examples in Sections 3 and 4, the predictive scores are defined as linear combinations of the covariates (possibly including higher order terms of the covariates) in both the linear and logistic regression models. Second, since β0 and α0 are unknown in practice, we replace them with their estimators β̂ and α̂ to compute Ẑ and subsequently the MI estimator μ̂MI. Using the influence functions for β̂ and α̂, and arguments similar to those in the proof of Theorem 1, it is straightforward to show that μ̂MI has the same asymptotic normal distribution as in Theorem 1.
We can rewrite the asymptotic variance of μ̃MI in Corollary 1 as n−1(var(Y) + E[{π(Z0)−1 − 1}σ2(Z0)]). When both working models are correctly specified, this asymptotic variance reduces to n−1(var(Y) + E[{π(X)−1 − 1}σ2(X)]), which is the same as the asymptotic variance shown in Table 1 of Hu et al. (2010). Furthermore, when (2) is correctly specified and (3) is incorrectly specified, one can perform the sensitivity analysis described in Section 2.5, which will likely lead to an estimator that relies only on the correctly specified working model for E(Y|X); as shown in Tsiatis and Davidian (2007), this estimator is optimal and the propensity score is not needed.
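The rewrite is a direct consequence of the law of total variance, var(Y) = var{μ(Z0)} + E{σ2(Z0)}; a one-line verification, written in LaTeX with the notation above, is

```latex
% Verification of the variance rewrite via the law of total variance.
\mathrm{var}\{\mu(Z_0)\} + E\!\left\{\frac{\sigma^{2}(Z_0)}{\pi(Z_0)}\right\}
  = \mathrm{var}(Y) - E\{\sigma^{2}(Z_0)\} + E\!\left\{\frac{\sigma^{2}(Z_0)}{\pi(Z_0)}\right\}
  = \mathrm{var}(Y) + E\!\left[\left\{\pi(Z_0)^{-1} - 1\right\}\sigma^{2}(Z_0)\right].
```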
Table 1.
Method | RB(%) | SD | SE | MSE | CR(%) |
---|---|---|---|---|---|
CC | 13.81 | 0.290 | 0.287 | 1.99 | 0.2 |
Correct Working Models for Both Y and δ | |||||
CE | -0.03 | 0.305 | 0.300 | 0.090 | 94.5 |
PMI | 0.01 | 0.291 | 0.243 | 0.059 | 89.2 |
MIB(3,1.0,0.0) | 0.60 | 0.307 | 0.307 | 0.098 | 94.7 |
MIB(3,0.8,0.2) | 0.60 | 0.317 | 0.305 | 0.097 | 93.3 |
MIB(3,0.5,0.5) | 0.64 | 0.311 | 0.307 | 0.098 | 94.3 |
MIB(3,0.2,0.8) | 0.68 | 0.314 | 0.310 | 0.101 | 93.6 |
MIB(3,0.0,1.0) | 1.07 | 0.319 | 0.331 | 0.121 | 94.9 |
Wrong Working Model for Y only (O1W) | |||||
CE | 0.02 | 0.343 | 0.362 | 0.131 | 96.2 |
PMI | 6.68 | 0.295 | 0.245 | 0.506 | 29.2 |
MIB(3,1.0,0.0) | 6.94 | 0.307 | 0.303 | 0.573 | 44.8 |
MIB(3,0.8,0.2) | 1.86 | 0.306 | 0.298 | 0.123 | 91.4 |
MIB(3,0.5,0.5) | 1.39 | 0.311 | 0.303 | 0.111 | 93.0 |
MIB(3,0.2,0.8) | 1.10 | 0.311 | 0.310 | 0.108 | 94.7 |
MIB(3,0.0,1.0) | 1.07 | 0.321 | 0.328 | 0.119 | 94.1 |
Wrong Working Model for δ only (M1W) | |||||
CE | -0.01 | 0.289 | 0.275 | 0.076 | 93.5 |
MIB(3,1.0,0.0) | 0.60 | 0.305 | 0.307 | 0.098 | 94.5 |
MIB(3,0.8,0.2) | 0.84 | 0.308 | 0.297 | 0.095 | 93.1 |
MIB(3,0.5,0.5) | 1.15 | 0.303 | 0.295 | 0.100 | 94.2 |
MIB(3,0.2,0.8) | 1.70 | 0.304 | 0.294 | 0.115 | 90.7 |
MIB(3,0.0,1.0) | 7.06 | 0.313 | 0.309 | 0.594 | 43.3 |
As shown in Theorem 1 and Corollary 1, the formula for the asymptotic variance is fairly complicated and involves the density function of the estimated predictive scores (Ẑ). In practice, this density is often estimated using nonparametric methods, so the practical usefulness of the asymptotic variance is limited, particularly in small samples. We propose a more convenient alternative for estimating the variance of the proposed estimator: the well-established method (Rubin (1987); Little and Rubin (2002)) for estimating the variance of an MI estimator by combining the within-imputation and between-imputation variances. However, it follows from Little and Rubin (2002) that the MI procedure in Section 2.2 is improper in the sense that it fails to incorporate the variability of estimating β̂ and α̂. As a result, the standard method cannot be directly applied.
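For reference, the standard combining rule referred to here (which the bootstrap step of Section 2.4 makes applicable) can be written, in the notation of Section 2.2 and with Wl denoting the estimated within-imputation variance of μ̂(l) (a symbol introduced here for convenience), as

```latex
% Rubin's combining rule for the variance of an MI estimator (standard form).
\widehat{\mathrm{var}}(\hat{\mu}_{MI}) = \bar{W} + \Bigl(1 + \tfrac{1}{L}\Bigr) B,
\qquad
\bar{W} = \frac{1}{L}\sum_{l=1}^{L} W_{l},
\qquad
B = \frac{1}{L-1}\sum_{l=1}^{L}\bigl(\hat{\mu}^{(l)} - \hat{\mu}_{MI}\bigr)^{2}.
```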
2.4 Bootstrap Multiple Imputation
To overcome the difficulty of estimating the variance of μ̂MI, we incorporate a bootstrap step. The lth imputation consists of the following steps.

1. Draw a random sample of equal size with replacement from the original data set, fit models (2) and (3) using this bootstrap sample, and compute the estimated predictive scores based on the bootstrap fits.

2. Compute the distance, as defined in (4), between a subject with a missing outcome, say subject i, and all subjects with an observed outcome in the bootstrap sample. The imputing set RK(i) for subject i is the nearest neighborhood consisting of the K subjects in the bootstrap sample with the K smallest distances from subject i. Draw a value for subject i at random from RK(i).

3. Repeat Step 2 for every subject with a missing outcome, i = 1, . . . , n, and let the resulting lth bootstrap imputed data set and the associated mean estimator be denoted by D(l)B and μ̂(l)B, respectively.
Repeating the bootstrap imputation L times, the final bootstrap MI estimator is μ̂MIB = (1/L) Σl μ̂(l)B, which is referred to as MIB(K, ωo, ωm). The bootstrap MI is a proper imputation (Little and Rubin (2002)), and hence its variance can be readily estimated as the sum of a between-imputation and a within-imputation component. The addition of a bootstrap step has been shown to allow the estimation of the variance of MI estimators in other settings (Heitjan and Little (1991); Rubin and Schenker (1991)). In our experience, L = 5 imputations suffice to achieve good performance in finite samples. We note that the bootstrap MI method in Aerts et al. (2002) uses a different bootstrap scheme that resamples only the complete observations, whereas our bootstrap scheme also resamples the observations with missing values. In practice, non-convergence may arise when repeatedly fitting working models, say the logistic regression model, in bootstrap samples; when this happens, the bootstrap samples with the non-convergence issue are discarded.
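The following Python sketch illustrates the bootstrap MI procedure and the variance combination described above; the working-model forms (linear and logistic), helper names, and toy data are our own assumptions, not part of the original implementation.

```python
# Sketch of MIB(K, w_o, w_m): bootstrap refitting of the two working models within
# each imputation, K-nearest-neighbor donor draws, and Rubin-type variance combining.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def fitted_scores(X_fit, y_fit, d_fit, X_eval):
    """Fit working models (2) (linear) and (3) (logistic) on one sample and return
    the raw predictive scores evaluated at X_eval."""
    obs = d_fit == 1
    z1 = LinearRegression().fit(X_fit[obs], y_fit[obs]).predict(X_eval)
    fit = LogisticRegression().fit(X_fit, d_fit)
    z2 = X_eval @ fit.coef_.ravel() + fit.intercept_[0]       # linear predictor as score
    return np.column_stack([z1, z2])

def mib(X, y, delta, K=3, L=5, w_o=0.8, w_m=0.2, seed=0):
    """Bootstrap nonparametric MI estimate of E(Y) and a Rubin-type standard error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mis = np.flatnonzero(delta == 0)
    means, within = [], []
    for _ in range(L):
        b = rng.integers(0, n, n)                             # bootstrap sample (with replacement)
        Xb, yb, db = X[b], y[b], delta[b]
        raw = fitted_scores(Xb, yb, db, np.vstack([Xb, X[mis]]))
        S = (raw - raw.mean(axis=0)) / raw.std(axis=0)        # standardized scores
        S_boot, S_mis = S[:n], S[n:]
        donors = np.flatnonzero(db == 1)                      # observed cases in the bootstrap sample
        diff = S_mis[:, None, :] - S_boot[None, donors, :]
        dist = np.sqrt(w_o * diff[:, :, 0] ** 2 + w_m * diff[:, :, 1] ** 2)
        nbrs = donors[np.argsort(dist, axis=1)[:, :K]]        # imputing sets R_K(i)
        pick = nbrs[np.arange(len(mis)), rng.integers(0, K, len(mis))]
        y_imp = y.copy()
        y_imp[mis] = yb[pick]                                 # one random donor per incomplete case
        means.append(y_imp.mean())
        within.append(y_imp.var(ddof=1) / n)                  # within-imputation variance of the mean
    means, within = np.array(means), np.array(within)
    total_var = within.mean() + (1 + 1 / L) * means.var(ddof=1)   # Rubin's combining rule
    return float(means.mean()), float(np.sqrt(total_var))

# Toy usage with illustrative data (not the study data).
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(400, 5))
y_full = 10 + X @ np.array([2, -2, 3, -3, 1.5]) + rng.normal(0, 3, 400)
delta = rng.binomial(1, 1 / (1 + np.exp(-X.sum(axis=1))))
y = np.where(delta == 1, y_full, np.nan)
print(mib(X, y, delta, K=3, L=5, w_o=0.8, w_m=0.2))
```

In this sketch the refitting of the working models within each bootstrap sample is what makes the imputation proper, so the usual within-plus-between combination can be applied to the L estimates.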
2.5 Sensitivity Analysis
The choice of (ωo, ωm) plays an important role in computing MIB(K, ωo, ωm), and multiple estimates can be obtained using different weights. As a natural extension, a sensitivity analysis can be performed to evaluate the validity of (2) and (3). Specifically, since the MIB(K, ωo, ωm) estimators with nonzero weights are doubly robust whereas the MIB(K, 1, 0) and MIB(K, 0, 1) estimators are not, the differences among MIB(K, ωo, ωm) with nonzero weights, MIB(K, 1, 0), and MIB(K, 0, 1) can inform on the validity of both working models and hence provide a justification for a sensitivity analysis. For example, if the working model (2) is correctly specified and (3) is not, then one can expect an MIB(K, ωo, ωm) estimator with nonzero weights to be similar to the MIB(K, 1, 0) estimator, with both estimators differing from the MIB(K, 0, 1) estimator because of its bias. In this case, the MIB(K, 1, 0) estimator may be preferred to an MIB(K, ωo, ωm) estimator with nonzero weights, as the use of the misspecified working model (3) may introduce extra noise into the estimation. If the MIB estimates do not vary considerably when the weights are changed from one extreme (ωo = 1, ωm = 0) to the other (ωo = 0, ωm = 1), one might have more confidence in the results and choose the optimal weights (ωo = 1, ωm = 0). In addition, the specification of (ωo, ωm) provides a natural way to incorporate prior beliefs about the validity of the two working models.
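In practice, the sensitivity analysis amounts to recomputing the estimator over a grid of weights and inspecting how the estimates move between the two extremes; a brief sketch, reusing the hypothetical mib() function and illustrative data (X, y, delta) from the sketch in Section 2.4, is

```python
# Sketch of the sensitivity analysis: recompute MIB(K, w_o, w_m) over a grid of
# weights, including the two extremes, and compare the resulting estimates.
# Assumes mib() and the illustrative data (X, y, delta) from the Section 2.4
# sketch are already in scope.
for w_o, w_m in [(1.0, 0.0), (0.8, 0.2), (0.5, 0.5), (0.2, 0.8), (0.0, 1.0)]:
    est, se = mib(X, y, delta, K=3, L=5, w_o=w_o, w_m=w_m)
    print(f"MIB(3,{w_o:.1f},{w_m:.1f}): estimate = {est:.3f}, SE = {se:.3f}")
```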
3 Simulation Studies
We conducted simulations to evaluate the finite-sample performance of the methods and, in particular, the impact of incorrectly specifying one or both working models and of the choice of weights (ωo, ωm). The following estimators are compared: the sample mean of the observed Y values (CC); the calibration estimator (CE) proposed by Cassel et al. (1976); a parametric MI estimator (PMI), in which a regression model with the fully observed covariates is fit and then used to draw imputed values for each missing observation; and the proposed bootstrap nonparametric MI estimator (MIB). The CE, PMI, and MIB estimators involve fitting a regression model for Y, and the CE and MIB estimators also involve fitting a regression model for δ. In our simulation studies and the data example, the working models were fit using maximum likelihood.
Five fully observed covariates (X = (X1, . . . , X5)) were generated from independent uniform distributions on (–1, 1). For the outcome of interest (Y), two true models were considered: Model (O1), where, conditional on X, Y was generated from a normal distribution with mean E(Y|X) = 10 + 2X1 – 2X2 + 3X3 – 3X4 + 1.5X5 and variance 9; and Model (O2), where, conditional on X, log(Y) was normal with mean 0.5 + 0.5X1 – X2 + 1.5X3 – 2X4 + 0.5X5 and variance 3. For the missingness indicator (δ), two true models were considered: Model (M1), where δ was generated from the logit model logit[Pr(δ = 1|X)] = 0.5X1 – X2 + X3 – X4 + X5; and Model (M2), where δ was generated from the logit model logit[Pr(δ = 1|X)] = 0.5 + 2X1 – 4X2 + 2X3 – 2X4 + 2X5. Model (M1) generates missing probabilities that are mostly bounded away from 1, whereas Model (M2) generates more missing probabilities that are close to 1; specifically, the probability of being missing is greater than 0.95 for 0.5% of observations under Model (M1) and for 15.5% of observations under Model (M2). Simulations were conducted for combinations of the true models for Y and δ, and the following incorrect working models were used: only three predictors, X1, X2, and X3, were included in the working models for Y and δ, denoted by Models (O1W), (O2W), (M1W), and (M2W), respectively; when the true model for Y is Model (O2), Model (O1) was also used as an incorrect working model (2). For each simulation scenario, the following measures were evaluated using 1000 Monte Carlo data sets: the average relative bias (RB), computed as the ratio of the bias to the absolute value of the nonzero true value; the sampling standard deviation (SD) of the estimates; the average standard error (SE), computed using a bootstrap method for CE and by combining the within- and between-imputation variances for PMI and MIB; the mean squared error (MSE); and the coverage rate (CR) of 95% Wald confidence intervals (CI).
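For concreteness, a minimal sketch of the data-generating step for one Monte Carlo replicate under Models (O1) and (M1) is given below; the variable names are illustrative, and the misspecified working models (O1W)/(M1W) would simply drop X4 and X5 when fitting.

```python
# Minimal sketch of one Monte Carlo replicate under true models (O1) and (M1).
import numpy as np

rng = np.random.default_rng(2024)
n = 400
X = rng.uniform(-1, 1, size=(n, 5))

# Model (O1): Y | X ~ Normal(10 + 2X1 - 2X2 + 3X3 - 3X4 + 1.5X5, variance 9)
y = 10 + X @ np.array([2, -2, 3, -3, 1.5]) + rng.normal(0, 3, n)

# Model (M1): logit Pr(delta = 1 | X) = 0.5X1 - X2 + X3 - X4 + X5
eta = X @ np.array([0.5, -1, 1, -1, 1])
delta = rng.binomial(1, 1 / (1 + np.exp(-eta)))

y_obs = np.where(delta == 1, y, np.nan)   # observed data with missing outcomes
X_wrong = X[:, :3]                        # covariates used by working models (O1W)/(M1W)
```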
The MIB estimators were computed using five different sets of values for (ωo, ωm): (1,0), (0.8,0.2), (0.5,0.5), (0.2,0.8), and (0,1). Note that MIB(K, 1, 0) is similar to the local multiple imputation (LMI) estimator of Aerts et al. (2002); both use only the outcome prediction model. However, as the number of covariates increases, LMI is subject to the curse of dimensionality, whereas MIB(K, 1, 0) is not.
The simulation results for n = 400 are reported in Tables 1-3, in which K = 3 is used for the MIB estimators. Note that the CC estimator exhibits substantial bias in all cases. In Table 1, the true models for Y and δ are Models (O1) and (M1), respectively, and the missing probabilities are moderate. When both working models are correctly specified, the bias is negligible for all estimators including the MIB estimators with different weighting schemes; among them, the PMI estimator has the worst coverage rate. While the bias is negligible for all MIB estimators with non-zero weights, the MIB method leads to an even smaller bias when a larger weight is assigned to the working model for Y (ωo). When only the working model for Y is misspecified as Model (O1W), the PMI and MIB(3,1,0) estimators, both of which rely solely on the correct specification of the working model for Y, exhibit considerable bias. The other four MIB estimators and the CE estimator show negligible bias. In this case, as the weight for the correct working model for δ increases from 0.2 to 1, the bias of the MIB estimator decreases slightly; these MIB estimators also have slightly lower MSE compared to the CE estimator. Similarly, when only the working model for δ is misspecified as Model (M1W), the MIB(3,0,1) estimator exhibits considerable bias due to its sole reliance on the working model for δ, whereas the other four MIB estimators and the CE estimator exhibit negligible bias. In this case, the CE estimator has slightly lower MSE compared to the MIB estimators. A similar pattern regarding the impact of weights on bias is also observed in Tables 2-3, which indicates that a sensitivity analysis using different weights is useful in choosing better weighting schemes. In addition, the impact of weights on SE is minimal for MIB estimators with nonzero weights when the missing probabilities are moderate. In all cases considered in Table 1, the doubly robust MIB estimators achieve similar performance in terms of MSE when compared to the CE estimator. In addition, the doubly robust MIB estimators show slightly larger bias and MSE when the working model for Y is misspecified compared to when the working model for δ is misspecified.
Table 3.
Method | RB(%) | SD | SE | MSE | CR(%) |
---|---|---|---|---|---|
GS | 0.87 | 1.560 | 1.418 | 2.017 | 89.4 |
CC | 54.69 | 2.988 | 2.610 | 30.676 | 58.0 |
Wrong Working Model for Y only (O2W) | |||||
CE | -0.98 | 1.638 | 1.523 | 2.327 | 90.2 |
PMI | 24.43 | 2.034 | 2.176 | 9.497 | 94.0 |
MIB(3,1.0,0.0) | 19.75 | 2.024 | 1.819 | 6.421 | 93.6 |
MIB(3,0.8,0.2) | 0.94 | 1.686 | 1.527 | 2.339 | 91.9 |
MIB(3,0.5,0.5) | 0.23 | 1.660 | 1.510 | 2.281 | 90.5 |
MIB(3,0.2,0.8) | -0.16 | 1.656 | 1.504 | 2.262 | 90.1 |
MIB(3,0.0,1.0) | 0.01 | 1.658 | 1.530 | 2.341 | 90.9 |
Wrong Working Model for Y only (O1) | |||||
CE | -2.99 | 1.947 | 2.102 | 4.490 | 91.3 |
PMI | 38.61 | 1.605 | 2.267 | 17.034 | 65.1 |
MIB(3,1.0,0.0) | -0.20 | 1.661 | 1.495 | 2.235 | 89.6 |
MIB(3,0.8,0.2) | -1.14 | 1.629 | 1.472 | 2.177 | 89.7 |
MIB(3,0.5,0.5) | -1.22 | 1.648 | 1.475 | 2.187 | 89.9 |
MIB(3,0.2,0.8) | -1.60 | 1.633 | 1.469 | 2.178 | 88.6 |
MIB(3,0.0,1.0) | 0.10 | 1.663 | 1.525 | 2.326 | 90.4 |
Wrong Working Model for δ only (M1W) | |||||
CE | 14.88 | 1.958 | 1.843 | 5.163 | 96.3 |
MIB(3,1.0,0.0) | -0.30 | 1.664 | 1.493 | 2.230 | 90.3 |
MIB(3,0.8,0.2) | -0.36 | 1.677 | 1.486 | 2.209 | 89.6 |
MIB(3,0.5,0.5) | -0.21 | 1.658 | 1.482 | 2.197 | 90.2 |
MIB(3,0.2,0.8) | -0.09 | 1.653 | 1.483 | 2.199 | 89.6 |
MIB(3,0.0,1.0) | 19.92 | 2.027 | 1.849 | 6.584 | 93.8 |
Wrong Working Models for both (O1) and (M1W) | |||||
CE | 29.62 | 1.380 | 1.933 | 10.737 | 65.5 |
MIB(3,1.0,0.0) | -0.20 | 1.665 | 1.493 | 2.229 | 90.5 |
MIB(3,0.8,0.2) | -0.47 | 1.664 | 1.478 | 2.186 | 90.1 |
MIB(3,0.5,0.5) | -0.31 | 1.662 | 1.480 | 2.191 | 89.7 |
MIB(3,0.2,0.8) | 0.02 | 1.670 | 1.480 | 2.190 | 90.1 |
MIB(3,0.0,1.0) | 19.78 | 2.035 | 2.035 | 6.530 | 94.1 |
Table 2.
Method | RB(%) | SD | SE | MSE | CR(%) |
---|---|---|---|---|---|
CC | 17.47 | 0.256 | 0.259 | 3.119 | 0.0 |
Correct Working Models for Both | |||||
CE | -0.6 | 1.640 | 0.515 | 0.269 | 91.7 |
PMI | 0.04 | 0.323 | 0.241 | 0.058 | 87.0 |
MIB(3,1.0,0.0) | 1.80 | 0.365 | 0.375 | 0.173 | 90.0 |
MIB(3,0.8,0.2) | 1.83 | 0.396 | 0.375 | 0.174 | 88.6 |
MIB(3,0.5,0.5) | 2.04 | 0.420 | 0.396 | 0.198 | 89.3 |
MIB(3,0.2,0.8) | 2.40 | 0.439 | 0.420 | 0.234 | 88.8 |
MIB(3,0.0,1.0) | 2.99 | 0.462 | 0.487 | 0.327 | 90.3 |
Wrong Working Model for Y only (O1W) | |||||
CE | -0.57 | 3.659 | 0.759 | 0.579 | 79.9 |
PMI | 9.15 | 0.302 | 0.234 | 0.892 | 7.1 |
MIB(3,1.0,0.0) | 9.92 | 0.314 | 0.327 | 1.091 | 24.5 |
MIB(3,0.8,0.2) | 4.30 | 0.366 | 0.354 | 0.310 | 77.8 |
MIB(3,0.5,0.5) | 3.61 | 0.404 | 0.382 | 0.276 | 82.6 |
MIB(3,0.2,0.8) | 3.13 | 0.430 | 0.408 | 0.264 | 86.3 |
MIB(3,0.0,1.0) | 2.97 | 0.463 | 0.488 | 0.328 | 90.6 |
Wrong Working Model for δ only (M2W) | |||||
CE | 0.02 | 0.377 | 0.334 | 0.112 | 92.6 |
MIB(3,1.0,0.0) | 1.78 | 0.366 | 0.374 | 0.172 | 91.9 |
MIB(3,0.8,0.2) | 2.30 | 0.381 | 0.355 | 0.179 | 87.8 |
MIB(3,0.5,0.5) | 3.09 | 0.392 | 0.359 | 0.224 | 84.7 |
MIB(3,0.2,0.8) | 4.23 | 0.387 | 0.362 | 0.310 | 78.6 |
MIB(3,0.0,1.0) | 10.72 | 0.373 | 0.373 | 1.288 | 31.5 |
In Table 2, the true models for Y and δ are Models (O1) and (M2), respectively. The true outcome model is the same as in Table 1, but the missing probabilities are more extreme. Since the estimated missing probabilities are unstable, the performance of all estimators degrades, though to different degrees. When both working models are correctly specified, or only the working model for Y is misspecified, the CE estimator has considerably larger MSE than the MIB estimators with nonzero weights, and its SE substantially underestimates the sampling standard deviation (SD), indicating that the CE estimate is not stable. When only the working model for δ is misspecified, the weights for CE are stabilized while the correct working model for Y protects CE from being inconsistent; as a result, its performance is slightly better than that of the MIB estimators. Also in this case, the impact of weights on SE is appreciable for MIB estimators with nonzero weights due to unstable estimated missing probabilities; specifically, a larger weight for the predictive score for δ tends to result in a larger SD. The results regarding PMI and the impact of weighting on bias are similar to those observed in Table 1.
Table 3 presents the simulation results when Y was generated from Model (O2) and δ was generated from Model (M1). When both working models are correctly specified, the results are similar to those in Tables 1 and 2 and hence are not included in Table 3. Two incorrect working models were used for Y: Model (O2W), which used a wrong set of covariates, and Model (O1), which used the correct set of covariates but assumed an incorrect mean function. In addition, we considered a case where both working models were misspecified. Since Y does not follow a normal distribution, we also computed the sample mean of all Y values, which is regarded as the gold standard (GS), and the coverage rate for GS is shown to be somewhat below the nominal level. When the working model for Y is misspecified as Model (O2W), CE achieves satisfactory performance; when the working model for Y is misspecified as Model (O1), CE shows appreciable bias and larger MSE compared to the MIB estimators with nonzero weights. In both cases, the MIB estimators with nonzero weights achieve satisfactory performance and PMI shows substantial bias. Interestingly, the MIB(3,1,0) estimator exhibits substantial bias when the incorrect working model (O2W) is used, whereas it shows negligible bias when the incorrect working model (O1) is used; this suggests that the MIB(3,1,0) estimator is robust to the misspecification of the working model for Y if the correct set of covariates is included. As discussed previously, the MIB method is nonparametric and only uses the predictive scores to evaluate the similarity between subjects, hence its dependence on the two working models is weaker than that of the CE estimator. As long as the estimated predictive scores (Ẑ) are highly correlated with the true scores, the MIB method, say, MIB(3,1,0), is expected to work. Similarly, when using the two incorrect working models (O1) and (M1W), the MIB estimators with nonzero weights still achieve performance that is comparable to GS, whereas the CE estimator shows substantial bias and coverage well below the nominal level. When the working model for δ is misspecified as Model (M1W), the CE estimator also exhibits appreciable and greater bias compared to the results in Tables 1 and 2. We note that the coverage rates of the CE estimator can be misleading in this case, since it usually exhibits large sample variation in addition to its appreciable bias. In this setting, the impact of extreme missing probabilities was also examined; it is similar to what is observed in Tables 1 and 2.
We conducted additional simulations to investigate the impact of the number of the nearest neighbors (K) and the sample size (n). As K increases, the bias of the MIB estimator increases and its SE decreases slightly (results not shown). The MSE is comparable for different K's, though K = 3 in general leads to slightly lower MSEs. As the sample size increases, the performances of MIB and CE methods improve, whereas the performances of CC and PMI methods remain unsatisfactory when the mean model for Y is misspecified.
To summarize, the proposed MIB estimators achieve similar or better performance in all settings compared with the other estimators considered in our simulation studies. Our results suggest that it is more important to correctly specify the working model for Y, and that a larger weight for the predictive score for δ can lead to a larger SD for the MIB estimators. Thus, it is recommended to choose a larger ωo value (say, ≥ 0.5) in the absence of prior knowledge on the working models.
4 Data Example
We illustrate the proposed method using a colorectal adenoma data set. A colorectal polyp prevention trial was conducted at the Arizona Cancer Center, in which data were collected from 1192 patients who underwent removal of a colorectal adenoma. Demographic information such as age, gender, and body mass index (BMI), and dietary information (e.g., vitamin D) based on the Arizona Food Frequency Questionnaire (AFFQ) (Martínez et al. (1999)), were collected for all participants. The dietary intake based on the AFFQ is known to be subject to measurement error. To have a more accurate measurement, an assay based on blood/tissue samples was performed to measure the dietary intake at the serum level for a subpopulation of the 1192 participants. In particular, 598 participants were selected to have their serum vitamin D levels (Y) measured. For those participants who were not selected, their serum vitamin D levels were regarded as missing data (δ = 0). We were interested in estimating the mean serum vitamin D level in the overall study population. While the selection for performing the assay was not explicitly based on demographics or disease characteristics, practical constraints in the implementation of the selection procedure may well have led to an imbalance between those who were selected and those who were not. To account for a potential MAR mechanism, we applied the proposed method to estimating the overall mean serum vitamin D level.
We first constructed working models for Y and δ. Based on linear regression analyses, the serum vitamin D level was shown to be significantly associated with gender and the BMI of a patient, the number of baseline adenomas, and the vitamin D intake derived from the AFFQ. Based on logistic regression analyses for the missingness, its association with the number of baseline adenomas achieves statistical significance with an estimated odds ratio (OR) 1.18 and a 95% CI (1.04, 1.34), and its association with the gender of a patient achieves marginal statistical significance with an estimated OR 1.27 and a 95% CI (0.99, 1.61). Consequently, to compute the CE and MI estimators, we included the gender and BMI of a patient, the number of baseline adenomas, and the vitamin D intake from the AFFQ as covariates to fit a linear regression model for predicting serum vitamin D level, and included the patient's gender and the number of baseline adenomas as covariates to fit a logistic regression model for predicting the missing probability. To compute MIB estimators with different weighting schemes, we chose K = 3 and L = 5.
The results are reported in Table 4 for CC, CE, and MIB. All methods produce a similar point estimate of the mean serum vitamin D level. The CE method produces a lower estimate (5.4% lower) of standard error (SE) compared to the CC analysis. The MIB method produces a lower SE (19.8% lower) compared to the CC analysis when a small weight (e.g. 0.2) is used for the predictive score for the missing probability. When building the working model for δ, one sees that the association between missingness and other covariates is in general weak, i.e., ORs close to 1. Thus, it is likely that the missing data mechanism is close to MCAR in this dataset. In addition, our results seem to indicate that the working model for Y is approximately correct. Consequently, a larger weight for the predictive score of the missing probability may introduce extra noise to the estimation in a single data set, which can manifest itself in the form of higher SE, larger bias, or both. In summary, the working model for the missing probability is likely incorrect, whereas the working model for the outcome is likely close to the truth. As a result, either MIB(3, 0.8, 0.2) or MIB(3, 1, 0) could be chosen as the estimate of the overall mean serum vitamin D level.
Table 4.
Method | Estimate | SE | 95% CI |
---|---|---|---|
CC | 26.262 | 0.385 | (25.508, 27.016) |
CE | 26.267 | 0.364 | (25.554, 26.981) |
MIB(3,1.0, 0.0) | 26.133 | 0.315 | (25.516, 26.751) |
MIB(3,0.8, 0.2) | 26.364 | 0.309 | (25.759, 26.969) |
MIB(3,0.5, 0.5) | 26.249 | 0.500 | (25.269, 27.229) |
MIB(3,0.2, 0.8) | 26.558 | 0.330 | (25.911, 27.206) |
MIB(3,0.0, 1.0) | 26.438 | 1.642 | (23.220, 29.656) |
5 Discussion
Under MAR, we have investigated a nonparametric multiple imputation approach to estimating the marginal mean of a continuous or discrete random variable that is missing for some subjects. Working models are used to achieve two main goals: dimension reduction and double-robustness. Compared to CE and its related estimators, our approach has weak reliance on both working models in the sense that it only uses the estimated predictive scores to evaluate the similarity between subjects; along the lines of Hu et al. (2010) (in particular, DEFINITION 1, p. 306), as long as our predictive scores Z1 and Z2 are an atom of E(Y|X) and E(δ|X), respectively, the results in Section 2 still hold. Our approach also incorporates a bootstrap step, which provides a convenient way to estimate the variance of the estimator. In addition, our proposed sensitivity analysis allows investigators to incorporate prior beliefs on the validity of the working models, and, more importantly, evaluate the validity of the working models, which in turn enables investigators to choose an optimal estimator. For CE and its related estimators, a similar sensitivity analysis is lacking and it is not obvious how to develop such a sensitivity analysis. The proposed approach can be extended to other settings such as regression analysis in the presence of missing data. Based on our numerical results, we recommend that investigators always perform the sensitivity analysis and set ωo to a larger value in the absence of strong prior knowledge.
In the context of surveys, Haziza and Beaumont (2007) proposed imputation methods based on two scores that are similar to our Z. They proposed to use a classification algorithm to partition the sample into disjoint classes and then to impute the missing values within each class; this differs from our approach in that our method allows the K-nearest neighbors to overlap. Their approach may encounter difficulty when no obvious clusters exist in the data, and its theoretical properties are unknown. We have used a K-nearest neighbor method to allow for adaptation to the local density of the data and missing probabilities. It is a future interest to study principled approaches for selecting K as well as additional adaptations.
Acknowledgment
We would like to thank Editor Kung-Yee Liang, an associate editor, and two referees for helpful comments that greatly improved an earlier draft of this manuscript. This work was supported in part by an NIH/NCI R03 grant.
APPENDIX: PROOFS
Let h1(Z0) be the density function of Z0. The following regularity conditions are stated for Theorem 1 and Corollary 1.
(A1) Y has finite first and second moments, and σ1², σ2², σ3², and σ23, as defined in Theorem 1, are finite.
(A2) and .
(A3) h1(Z0) and π(Z0) are continuous and bounded away from 0 in the compact support of Z0.
A Proof of Proposition 1
We prove Proposition 1 for β̂ only, since the arguments for α̂ are similar. The following conditions are assumed to hold for β̂:

(B1) β̂ is the maximizer of a strictly concave objective function Ln(β), or the unique solution to a set of estimating equations, Un(β) = 0.

(B2) Ln(β) (or Un(β)) converges almost surely to L(β) = E{Ln(β)} (or U(β) = E{Un(β)}), uniformly in β; L(β) is strictly concave with a unique maximizer β0 (or U(β) = 0 has a unique solution β0).

Note that Conditions (B1) and (B2) are satisfied for most regression models, including linear and generalized linear models, and for many estimating equations. In either case, it follows from arguments similar to those in Section 5.2 of van der Vaart (1998) that β̂ →p β0. As discussed in Section 5.2 of van der Vaart (1998), Conditions (B1) and (B2) can be relaxed, and Proposition 1 holds for most, if not all, potential working models for Y.
B Proof of Proposition 2
If (3) is correctly specified, Z02 = l2(Xm, α0) is the propensity score defined in Rosenbaum and Rubin (1983), and Proposition 2 follows from arguments similar to theirs. If (2) is correctly specified, then E(Y|δ, Z0) = E{E(Y|δ, X)|δ, Z0} = E{E(Y|X)|δ, Z0} = E(Z01|δ, Z0) = Z01, where the second equality is due to MAR; since Z01 does not involve δ, it follows that E(Y|δ, Z0) = E(Y|Z0). The proof is complete.
C Proof of Theorem 1
Under MAR, if the weights (ωo, ωm) are positive and either of the working models (2) and (3) is correctly specified, then it follows from Proposition 2 that E(Y|Z0, δ) = E(Y|Z0); this implies that we can use E(Y|Z0, δ = 1) based on observed data to impute E(Y|Z0, δ = 0) for observations with missing Y . This result is used implicitly throughout the proof.
To derive the asymptotic distribution of μ̃MI with positive weights, we first consider another estimator, the K-nearest-neighbor plug-in estimator

μ̃P = n−1 Σi [δiYi + (1 − δi) K−1 Σj∈RK(i) Yj],

where RK(i) is the set of the K nearest observed neighbors of subject i defined using the distance in (4). We note that the K nearest neighbors are chosen from the subjects with observed outcomes, i.e., δ = 1.
C.1 Asymptotic Distribution of μ̃P
Consider a more general case where
and are consistent probability weights as defined in Stone (1977). Then we write
where , , , and .
It is straightforward to show that and , where and . Let
where can be estimated by . Next, we show that and . Since proofs are similar, we only focus on T3. Appealing to previous work on the uniform convergency of nearest neighbor density estimates (Devroye and Wagner (1977)) and kernel density estimates (Silverman (1978)), it is straightforward to show the strong uniform convergency of ĥ(Z0) to h(Z0) under Conditions (A2) and (A3). Note that
Following the proof of the asymptotic distribution of that is discussed later, it can be shown that , the RHS of which is bounded. Then, given the uniform integrability of and the uniform convergence of ĥ(Z0), one can establish the asymptotic mean square equivalence of and . It follows that . Similarly one can prove that .
We now show that . Take , we can reexpress as
where , . Now, is a standard U-Statistic and . It is straightforward to show that . Applying standard U-statistic theory, let the projection of U be Û and then where
It can be readily shown that Var with . Since the are mutually independent, it follows that and hence . Similarly, one can show that .
Finally, it is straightforward to show that T1 ⊥ T2, T1 ⊥ Û, and , where . It then follows that , where .
C.2 Asymptotic Distribution of μ̃MI
After rearranging terms, we have
where lij is the number of imputed data sets in which Yj is used to impute Yi, and . Then we have
where , , and . One can then show that , where C = E(Y2) and is finite under Condition (A1). If , then converges to 0 in L2, and hence in probability. The asymptotic distribution of is therefore where . The proof of Theorem 1 is now complete.
Two remarks are in order. First, X may include both categorical and continuous variables; when the components of X are all categorical in one or both working models, one or both components of Z0 are discrete, and continuity and compactness are then defined with respect to the usual topology on a discrete space. It is well known that a compact discrete space is finite; as a result, it is straightforward to show that the results of Theorem 1 still hold. Second, the main result in Theorem 1 is proved under Condition (A3); this condition can be relaxed using a trimming technique similar to that in Härdle, Janssen, and Serfling (1988).
D Proof of Corollary 1
If Y is independent of δ given Z0, or E(Y|δ,Z0) = E(Y|Z0) and E(Y2|δ, Z0) = E(Y2|Z0), then it is straightforward to show that . Similarly, one can show that and . It follows that . The proof is complete.
References
- Aerts M, Claeskens G, Hens N, Molenberghs G. Local multiple imputation. Biometrika. 2002;89:375–388.
- Cao W, Tsiatis A, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96:723–734.
- Cassel CM, Särndal CE, Wretman JH. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika. 1976;63:615–620.
- Cheng PE. Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association. 1994;89:81–87.
- Devroye LP, Wagner TJ. The strong uniform consistency of nearest neighbor density estimates. The Annals of Statistics. 1977;5:536–540.
- Härdle W, Janssen P, Serfling R. Strong uniform consistency rates for estimators of conditional functionals. The Annals of Statistics. 1988;16:1428–1449.
- Haziza D, Beaumont J. On the construction of imputation classes in surveys. International Statistical Review. 2007;75:25–43.
- Heitjan DF, Little RJA. Multiple imputation for the fatal accident reporting system. Applied Statistics. 1991;40:13–29.
- Horvitz D, Thompson D. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–685.
- Hu Z, Follmann D, Qin J. Semiparametric dimension reduction estimation for mean response with missing data. Biometrika. 2010;97:305–319.
- Kang JDY, Schafer JL. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science. 2007;22:523–539.
- Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd Edition. Wiley; New York: 2002.
- Martínez ME, Marshall JR, Graver E, Whitacre R, Woolf K, Ritenbaugh C, Alberts DS. Reliability and validity of a self-administered food frequency questionnaire in a chemoprevention trial of adenoma recurrence. Cancer Epidemiology, Biomarkers & Prevention. 1999;8:941–946.
- Matloff NM. Use of regression functions for improved estimation of means. Biometrika. 1981;68:685–689.
- McCaffrey D, Ridgeway G, Morral A. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods. 2004;9:403–425.
- Qin J, Shao J, Zhang B. Efficient and doubly robust imputation for covariate-dependent missing responses. Journal of the American Statistical Association. 2008;103:797–809.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89:846–866.
- Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association. 1995;90:106–121.
- Rosenbaum P, Rubin D. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
- Rotnitzky A, Robins J, Scharfstein D. Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the American Statistical Association. 1998;93:1321–1339.
- Rubin DB. Multiple Imputation for Nonresponse in Surveys. Wiley; New York: 1987.
- Rubin DB, Schenker N. Multiple imputation in health-care databases: An overview and some applications. Statistics in Medicine. 1991;10:585–598.
- Scharfstein D, Rotnitzky A, Robins J. Adjusting for nonignorable dropout using semiparametric nonresponse models. Journal of the American Statistical Association. 1999;94:1096–1120.
- Silverman BW. Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. The Annals of Statistics. 1978;6:177–184.
- Stone CJ. Consistent nonparametric regression. The Annals of Statistics. 1977;5:595–645.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B. 1996;58:267–288.
- Titterington DM, Sedransk J. Imputation of missing values using density estimation. Statistics and Probability Letters. 1989;8:411–418.
- Tsiatis A, Davidian M. Comment on “Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data.” Statistical Science. 2007;22:569–573.
- van der Vaart AW. Asymptotic Statistics. Cambridge University Press; New York: 1998.
- Wang Q, Linton O, Härdle W. Semiparametric regression analysis with missing response at random. Journal of the American Statistical Association. 2004;99:334–345.