A TWO-STEP SEMIPARAMETRIC METHOD TO ACCOMMODATE SAMPLING WEIGHTS IN MULTIPLE IMPUTATION

Hanzhi Zhou; Michael R Elliott; Trviellore E Raghunathan

doi:10.1111/biom.12413

Author manuscript; available in PMC: 2018 Jul 25.

Published in final edited form as: Biometrics. 2015 Sep 22;72(1):242–252. doi: 10.1111/biom.12413

PMCID: PMC6058975 NIHMSID: NIHMS981246 PMID: 26393409

Summary:

Multiple imputation (MI) is a well-established method to handle item-nonresponse in sample surveys. Survey data obtained from complex sampling designs often involve features that include unequal probability of selection. MI requires imputation to be congenial, that is, for the imputations to come from a Bayesian predictive distribution and for the observed and complete data estimator to equal the posterior mean given the observed or complete data, and similarly for the observed and complete variance estimator to equal the posterior variance given the observed or complete data; more colloquially, the analyst and imputer make similar modeling assumptions. Yet multiply-imputed datasets from complex sample designs with unequal sampling weights are typically imputed under simple random sampling assumptions and then analyzed using methods that account for the sampling weights. This is a setting in which the analyst assumes more than the imputer, which can lead to biased estimates and anti-conservative inference. Less commonly used alternatives, such as including case weights as predictors in the imputation model, typically require interaction terms for more complex estimators such as regression coefficients, and can be vulnerable to model misspecification and difficult to implement. We develop a simple two-step MI framework that accounts for sampling weights using a weighted finite population Bayesian bootstrap method to validly impute the whole population (including item non-response) from the observed data. In the second step, having generated posterior predictive distributions of the entire population, we use standard IID imputation to handle the item non-response. Simulation results show that the proposed method has good frequentist properties and is robust to model misspecification compared to alternative approaches. 
We apply the proposed method to accommodate missing data in the Behavioral Risk Factor Surveillance System when estimating means and parameters of regression models.

Keywords: Polya posterior, Bayesian bootstrap, missing data, sampling design, Behavioral Risk Factor Surveillance System (BRFSS)

1. Introduction

Both item nonresponse and sampling weights are typical features of survey data obtained from complex sample designs. Item nonresponse occurs when some respondents do not answer all the items in a survey questionnaire; for example, both “don’t know” and refusal answers are considered item nonresponse. Sampling weights arise as a correction factor to compensate for over- or under-representation of units in the target population due to unequal selection probabilities. The Behavioral Risk Factor Surveillance System (BRFSS) has both a substantial proportion of missing data on income measures and survey weights that adjust for different sampling rates among states and oversampling of adults in smaller households, as well as for non-response bias by poststratifying and raking to known control totals for basic demographics.

When the proportion of item-level missing values is nontrivial and the data are not missing completely at random (MCAR), typical solutions for missing data like complete case analysis often lead to increased bias and reduced statistical power. Multiple imputation (MI) (Rubin, 1987) is a principled method for addressing item-level missing data. MI has a Bayesian conceptualization. The basic idea is to fill in missing data with M sets of plausible values. These are obtained as repeated draws from the posterior predictive distribution of the missing components of the sample Y_mis given its observed components Y_obs, i.e. p(Y_mis ∣ Y_obs). The production of multiple “completed” datasets $\{(Y_{obs}, Y_{mis}^{(1)}), \dots, (Y_{obs}, Y_{mis}^{(M)})\}$ is typically done by an “imputer” who has access to the data to develop reasonable models for generating the predictive distribution of Y_mis, allowing the “analyst” to then analyze each of the M imputed datasets and combine the point and variance estimates using the combining rules developed by Rubin (1987). Examples of this approach include imputation for blood alcohol concentration in the Fatal Accident Reporting System (FARS) (Heitjan & Little, 1991) and income imputation in the National Health Interview Survey (NHIS) (Schenker et al., 2006).
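As a minimal sketch of the standard combining rules of Rubin (1987) referred to above, the following function pools M completed-data point estimates and their variance estimates. The function name `rubin_combine` and its interface are illustrative assumptions, not code from this manuscript.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine M completed-data results with Rubin's (1987) rules.

    estimates: point estimates q_1..q_M, one per imputed dataset
    variances: their estimated variances u_1..u_M
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need M >= 2 estimates with matching variances")
    q_bar = mean(estimates)            # pooled point estimate
    w_bar = mean(variances)            # average within-imputation variance
    b = variance(estimates)            # between-imputation variance (M - 1 denominator)
    total = w_bar + (1 + 1 / m) * b    # total variance
    if b == 0:
        df = float("inf")              # no between-imputation variability
    else:
        df = (m - 1) * (1 + w_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, total, df
```

Inference then uses a t reference distribution with the returned degrees of freedom, centered at the pooled estimate and scaled by the square root of the total variance.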

While the imputer/analyst distinction is convenient, Meng (1994) pointed out that this can lead to problems with inference when the imputer and analyst assume different data models (“uncongeniality”). Meng shows that, when the imputer assumes a richer model than the analyst, the resulting MI analysis would typically be mildly conservative, whereas if the analyst assumed a richer model than the imputer, the resulting MI analysis could be either conservative or anti-conservative. In settings where the observed data are obtained using an unequal probability sampling design, data are typically imputed using models in which variables are assumed to be independent and identically distributed (IID), and then analyzed using a weighted design-based approach that accounts for the unequal selection probability. This presents an example of the latter form of uncongeniality that can lead to biased point estimation and below-nominal confidence interval coverage. It is therefore important to incorporate unequal probabilities of selection/sampling weights in the imputation procedure.

A simple and seemingly straightforward way to incorporate sampling weights in MI is to let the imputer’s model condition on a few key design variables that determine probabilities of inclusion, such as measure of size and stratification variables. However, not all design information is typically available in public use data due to disclosure risk concerns. Another option is to summarize the design information by using weights as a covariate in the imputation, perhaps after log transformation or categorization into “weight strata” modeled as dummy indicators. However, the modeling task may be complicated by attempting to include all interactions of weights (or weight-related design variables) with other covariates in the model (Meng 1994; Kim et al. 2006). Moreover, this approach typically requires the functional form of the interaction to be modeled correctly, using a spline or other non-parametric form to be robust against model misspecification (Elliott and Little 2000; Zheng and Little 2005; Breidt, Claeskens, and Opsomer 2005). In addition, Kim et al. (2006) and Seaman et al. (2011) show that, in the case of a target complete data estimator of a weighted total, the standard Rubin MI variance formula is no longer unbiased even if the imputation model is correctly specified without the weights, since the weights induce a covariance between the MI point estimator and the (latent) complete data estimator, a quantity that is not accounted for in the Rubin MI variance formula. Typically this covariance is negative, so that the standard MI variance estimators and associated confidence intervals are conservative, and complex adjustments must be made to regain nominal p-values and coverage.

This manuscript develops a modified MI framework to account for sampling weights from single-stage designs. We propose a two-step MI procedure. In the first step, we develop and use a weighted finite population Bayesian bootstrap (weighted FPBB) to validly impute the whole population (including item non-response) from the observed data. In the second step, having generated posterior predictive distributions of the entire population, we use standard IID imputation to handle the item non-response. With the suggested procedure, the parametric imputation model no longer needs to include interactions between weights and covariates to guard against model misspecification. In addition, since we are imputing to a synthetic population, all weights are constant and equal to 1, so no covariance between the MI point estimator and the complete data estimator is induced.

The rest of this manuscript is organized as follows. Section 2 provides a detailed overview of the proposed two-step semiparametric multiple imputation procedure to accommodate weighted data. (We term it “semiparametric” because the design features, in particular the weights, are accommodated non-parametrically, whereas the actual imputation is conducted under a standard parametric model.) We focus on the setting where the selection probabilities are obtained from a probability proportional to size (PPS) sample design, although the methods we develop can be used with any selection weights. Section 2 then discusses point estimation and inference using the MI datasets from the proposed procedure. Section 3 provides a simulation study in the context of a single-stage probability-proportional-to-size sample design to estimate population means and regression coefficients under a variety of settings where sampling weights are associated to differing degrees with both the outcome and the probability of nonresponse, and where failure to account for design in the imputation procedure has differing degrees of impact. We compare the performances of the proposed two-step MI and the fully parametric MI in terms of robustness to different degrees of model misspecification. Section 4 applies the proposed procedure to estimate means, linear, and loglinear regression models, describing marginal and joint distributions of income and health insurance accessibility, using data from the 2009 Behavioral Risk Factor Surveillance System (BRFSS). Section 5 concludes with a brief discussion of possible extensions.

2. A Two-Step Semiparametric MI Procedure

Bayesian finite population inference (Ericson 1969) has been proposed as a means to harmonize design and model-based approaches for sample survey inference (Little 2004, 2011). Under this approach, we focus on the posterior predictive distribution of our finite population quantity of interest (e.g., population mean, population regression parameter) obtained from the posterior predictive distribution for the non-sampled elements of the population. To make matters more concrete, consider the setting without missing data, where we have a scalar outcome Y and a sampling weight w based on a single-stage PPS design. Our complete data consist of the vector of sampling indicators I for the population, the sampled Y_s for which I = 1, the non-sampled Y_ns for which I = 0, and similarly w_s and w_ns. Given the sampling weights, the sampling mechanism generating I is assumed to be independent of Y (p(I∣Y,w) = p(I∣w)), and thus ignorable in the modeling. Assuming a model for the outcome given the sampling weights p(Y∣θ,w) parameterized by θ with prior p(θ), the posterior predictive distribution for the non-sampled elements of the population Y_ns is given by

$$p(Y_{ns} \mid Y_{s}, w_{s}) \propto \int p(Y_{ns} \mid Y_{s}, \theta, w)\, p(\theta \mid Y_{s}, w)\, p(w_{ns} \mid w_{s})\, d\theta\, dw_{ns} \tag{1}$$

Previous work has tackled estimation of this predictive distribution in a variety of ways. Zheng and Little (2004, 2005) and Chen, Little and Elliott (2010) assumed that the sampling weights were known for all subjects, so that w_s = w, reducing (1) to p(Y_ns∣ Y_s, w) ∝ ∫ p(Y_ns∣ Y_s, θ, w) p(θ∣ Y_s, w)dθ; these authors then obtained draws from the posterior predictive distribution under fairly weak modeling assumptions (parametric regression model for p(Y∣ θ,w) based on penalized splines). Little and Zheng (2007) and Zangeneh, Keener, and Little (2011) considered the situation in which weights are observed only for the sample (as in a public use data setting), and obtained predictive draws for p(w_ns ∣ w_s) under a Dirichlet model with a non-informative (Haldane) prior; the resulting predictive draw of the population of weights was then used as in Zheng and Little to obtain posterior predictive draws of Y_ns. Dong, Elliott, and Raghunathan (2014) consider a different factorization of (1):

$$p(Y_{ns} \mid Y_{s}, w_{s}) \propto \int p(Y_{ns}, w_{ns} \mid Y_{s}, w_{s})\, p(Y_{s}, w_{s})\, dw_{ns}. \tag{2}$$

The parameter θ is dropped because the draws of Y_s, w_s are made directly from the posterior of the empirical CDF of Y_s, w_s using a Bayesian bootstrap (BB) procedure (Rubin 1981). Draws from Y_ns, w_ns ∣ Y_s, w_s are then made using a weighted finite population Bayesian bootstrap (FPBB) procedure described in Cohen (1997).

Here we extend the approach of Dong, Elliott, and Raghunathan to accommodate missing data due to item-level non-response. We assume that, had we taken a census of the entire population, we could have observed a vector of response indicators R = (R_s, R_ns) where R_s corresponds to the response indicators observed in the sample, and R_ns to the response indicators associated with the non-sampled elements. We then divide the sampled Y_s = (Y_s,obs, Y_s,mis) into the fully-observed and missing elements, corresponding to the sampled Y values associated with R_s = 1 and R_s = 0 respectively, and similarly the non-sampled Y_ns = (Y_ns,obs, Y_ns,mis) into those that would have been observed had they been sampled (R_ns = 1), and those that would have had missing values (R_ns = 0). We also assume a fully-observable covariate X = (X_s, X_ns) consisting of the sampled and non-sampled elements respectively. Note that we can combine the observed elements from the sampled and non-sampled parts of the population to obtain the potentially “observable” Y_obs = (Y_s,obs, Y_ns,obs), and similarly the missing elements Y_mis = (Y_s,mis, Y_ns,mis). We assume ignorable missingness, so that p(R∣ Y,w) = p(R∣ Y_obs,w), allowing R to be ignored in the model along with I. Extending (1) to incorporate item-level missingness then yields

$$p(Y_{ns,obs}, X_{ns} \mid Y_{s,obs}, X_{s}, w_{s}) = \int p(Y_{ns,obs}, X_{ns}, Y_{mis} \mid Y_{s,obs}, X_{s}, w_{s})\, dY_{mis}$$

We can generate from p(Y_ns,obs, X_ns, Y_mis ∣ Y_s,obs, X_s, w_s) by simply allowing the missing values in Y to be generated along with the observed values for Y and X using the weighted FPBB procedure. We then integrate out Y_mis by assuming a parametric model for Y ∣ X:

$$\begin{aligned} &\int p(Y_{ns,obs}, X_{ns}, Y_{mis} \mid Y_{s,obs}, X_{s}, w_{s})\, dY_{mis} \\ &\quad = \int p(Y_{mis} \mid Y_{ns,obs}, X_{ns}, Y_{s,obs}, X_{s}, w_{s})\, p(Y_{ns,obs}, X_{ns} \mid Y_{s,obs}, X_{s}, w_{s})\, dY_{mis} \\ &\quad = \int\!\!\int p(Y_{mis} \mid Y_{ns,obs}, X_{ns}, Y_{s,obs}, X_{s}, w_{s}, \theta)\, p(Y_{ns,obs}, X_{ns} \mid Y_{s,obs}, X_{s}, w_{s}, \theta)\, p(\theta \mid Y_{s,obs}, X_{s}, w_{s})\, d\theta\, dY_{mis} \\ &\quad \propto \int\!\!\int p(Y_{mis} \mid Y_{ns,obs}, X_{ns}, Y_{s,obs}, X_{s}, w_{s}, \theta)\, p(Y_{ns,obs}, X_{ns} \mid Y_{s,obs}, X_{s}, w_{s}, \theta)\, p(Y_{s,obs}, X_{s}, w_{s} \mid \theta)\, p(\theta)\, d\theta\, dY_{mis} \end{aligned} \tag{3}$$

We can implement the integration in (3) by use of a standard Gibbs sampler for multiple imputation that iterates between draws of

$$p(\theta \mid Y_{ns,obs}, X_{ns}, Y_{s,obs}, X_{s}, w_{s}, Y_{mis}) = p(\theta \mid Y, X, w_{s}) = p(\theta \mid Y, X) \tag{4}$$

and

$$p(Y_{mis} \mid Y_{ns,obs}, X_{ns}, Y_{s,obs}, X_{s}, w_{s}, \theta) \tag{5}$$

Note that (4) follows from the fact that, conditional on the entire population, the observed weights are superfluous for the draws of θ, so that it is sufficient to develop a parametric model for Y that does not involve the weights together with a prior for θ (possibly conditional on X): p(θ∣Y,X,w_s) = p(θ∣Y,X) ∝ p(Y∣θ,X)p(θ∣X). The presence of w_s in (5) indicates that the observed weights may still be important in the imputation of the missing elements of Y if missingness itself is a function of the probability of selection, as we note below.

2.1. Step 1: Undo Sampling Weights through Nonparametric Synthetic Data Generation

Here we briefly review the work of Dong, Elliott, and Raghunathan (2014) to obtain draws from a posterior predictive distribution of the population that is free of the effects of unequal probability of selection. This work builds on the work of Ghosh and Meeden (1983), Lo (1988) and Cohen (1997), where details of the derivations of the results can be found.

2.1.1. The Weighted Pólya Posterior.

The purpose of developing the weighted Pólya posterior is to be able to draw from a posterior predictive distribution of a finite population based on an unequal probability-of-selection sampling design without making any parametric assumptions about the probability mechanism that generated the data. We begin by describing the Pólya posterior developed by Ghosh and Meeden (1983) in the simple random sampling setting. Assume that a simple random sample of size n is drawn from a finite population of size N, denoted by y_s = {y₁,…,y_n}. Let Γ(•) denote the gamma function, {d₁,d₂,…,d_K} denote the set of K distinct values in the sample and λ = {λ₁, λ₂,…, λ_K} denote the vector of probabilities such that pr(y_i = d_k∣ λ) = λ_k, for i = 1,2,…,n, with $\sum_{j = 1}^{K} λ_{j} = 1$. Let n_j and u_j be the number of units taking value d_j in the sample and in the nonsampled part of the population, respectively, for j = 1,2,…, K, with $\sum_{j = 1}^{K} n_{j} = n$ and $\sum_{j = 1}^{K} u_{j} = N - n$. Assuming a noninformative Haldane prior for λ, λ ~ Dir(0,…,0), together with a multinomial distribution for the counts of sample data, n₁,…,n_K ∣ λ ~ Mult(n; λ), Ghosh and Meeden show that the predictive distribution of counts in the nonsampled data is given by:

$$p(u_{1}, \dots, u_{K} \mid n_{1}, \dots, n_{K}) = \frac{\prod_{j=1}^{K} \Gamma(n_{j} + u_{j}) / \Gamma(n_{j})}{\Gamma(N) / \Gamma(n)} \tag{6}$$

Cohen (1997) generalized (6) to the case where the sample is selected with unequal probabilities. We now assume that we have a sample of size n consisting of (Y_s, X_s, w_s, R_s) = {(Y_i, X_i, w_i, R_i), i = 1,…, n}, where R is a response indicator for Y, so that Y_i = Y_i,obs if R_i = 1 and Y_i = Y_i,mis if R_i = 0, X consists of fully observed covariates, and w_i denotes the sampling weight for the i^th unit in the sample, normalized to sum to N, i.e. $\sum_{i = 1}^{n} w_{i} = N$. Let {${\tilde{d}}_{1}$, ${\tilde{d}}_{2}$,…, ${\tilde{d}}_{K}$} denote the set of K distinct vectors of (Y_i, X_i, w_i, R_i) in the sample and ζ = {ζ₁, ζ₂,…, ζ_K} denote the vector of probabilities such that $Pr((Y_{i}, X_{i}, w_{i}, R_{i}) = {\tilde{d}}_{k} ∣ ζ) = ζ_{k}$, for i = 1, 2,…, n, k = 1,…, K, with $\sum_{j = 1}^{K} ζ_{j} = 1$. Let n_j and u_j be the number of units taking vector ${\tilde{d}}_{j}$ in the sample and in the nonsampled part of the population, respectively, for j = 1, 2,…, K, with $\sum_{j = 1}^{K} n_{j} = n$ and $\sum_{j = 1}^{K} u_{j} = N - n$. Again assuming a noninformative Haldane prior for ζ, ζ ~ Dir(0,…,0), together with multinomially distributed weighted counts in the data, $p(w_{1}, \dots, w_{K} ∣ ζ) \propto \prod_{j = 1}^{K} {ζ_{j}}^{w_{j}}$, Cohen (1997) posits and Dong, Elliott, and Raghunathan (2014) prove that the posterior predictive distribution of counts in the nonsampled data is given by:

$$p(u_{1}, \dots, u_{K} \mid w_{1}, \dots, w_{K}) = \frac{\prod_{j=1}^{K} \Gamma(w_{j} + u_{j}) / \Gamma(w_{j})}{\Gamma(2N - n) / \Gamma(N)}. \tag{7}$$

2.1.2. The Adapted-weighted FPBB method.

The adapted-weighted FPBB (Dong, Elliott, and Raghunathan 2014) consists of two stages. The first stage resamples the original sample using the standard Bayesian bootstrap assuming IID, and the second stage reverses/undoes the sampling weights using the weighted FPBB. This two-stage algorithm is analogous to the fully parametric Bayesian method, where the first stage is equivalent to drawing values of the parameter (ζ) from its posterior distribution given the counts in sampled data (n₁,…, n_K) and the second stage draws the predicted counts in the nonsampled data (u₁,…, u_K) given the drawn parameter. The method is described as follows:

Resampling Using the Standard Bayesian Bootstrap (BB)

The standard Bayesian bootstrap of Rubin (1981) assuming IID is used to generate L replicate BB samples each of size n, i.e. $\{(Y_{s}^{(l)}, X_{s}^{(l)}, w_{s}^{(l)}, R_{s}^{(l)}), l = 1, \dots, L\}$. This essentially generates the posterior joint distribution (denoted by f) of all the variables in the population given their realized values in the sample data set. Equivalently, the posterior distribution of the parameter vector ζ is drawn given the sample, i.e.

$$f^{(l)}(Y, X, w, R) \mid (Y_{s}, X_{s}, w_{s}, R_{s}) \Leftrightarrow (\zeta^{(l)} \mid Y_{s}, X_{s}, w_{s}, R_{s}) \sim \mathrm{Dir}(n_{1}, \dots, n_{K}), \quad l = 1, \dots, L, \tag{8}$$
where $\zeta^{(l)} = (\zeta_{1}^{(l)}, \dots, \zeta_{K}^{(l)})$.

This stage captures the sampling variability. The uncertainty in the posterior draws of the parameter ζ^(l) is reflected in the varying counts of distinct units in the original sample being selected in different replicate BB samples. Let t_l(i) denote the number of times unit i is selected in the l^th replicate BB sample, for l = 1,…, L. We incorporate this source of uncertainty in computing “the l^th bootstrap weight for unit i”, i.e. $w_{i}^{(l)^{*}} = w_{i} \cdot t_{l}(i)$, where w_i denotes the original sampling weight for unit i. The bootstrap weights are carried forward as input weights to the next stage.
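The first stage can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: it assumes every sampled unit is distinct (so that the Dirichlet posterior in (8) has all parameters equal to 1), realizes one BB replicate by drawing cell probabilities and then resampling n units with those probabilities, and uses the hypothetical function name `bayesian_bootstrap_weights`.

```python
import random


def bayesian_bootstrap_weights(weights, rng):
    """One replicate of stage 1: return (counts t(i), bootstrap weights w_i * t(i)).

    Assumes all n sampled units are distinct, so the posterior of the cell
    probabilities is Dirichlet(1, ..., 1).
    """
    n = len(weights)
    # Dirichlet(1, ..., 1) draw obtained by normalizing independent Gamma(1, 1) variates.
    gammas = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Resample n units with those probabilities and count selections per unit.
    counts = [0] * n
    for i in rng.choices(range(n), weights=probs, k=n):
        counts[i] += 1
    # Bootstrap weight for unit i: original weight times its selection count.
    boot_weights = [w * t for w, t in zip(weights, counts)]
    return counts, boot_weights
```

Calling the function L times with the same `random.Random` instance would give L replicate sets of counts and bootstrap weights; units with a count of zero simply do not appear in that replicate.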

Undo Sampling Weight using the weighted Polya posterior/ weighted FPBB

To capture the variability due to “imputing” the nonsampled units, the weighted Pólya posterior in equation (7) is used to create S synthetic populations for each of the L BB samples obtained from the previous stage, i.e. $\{(Y_{s}^{(l)}, X_{s}^{(l)}, R_{s}^{(l)}), (Y_{ns}^{(ls)}, X_{ns}^{(ls)}, R_{ns}^{(ls)})\}$, for s = 1,…, S, l = 1,…, L. The distribution in Equation (7) does not lend itself to direct calculation; however, draws from (7) can be obtained using Monte Carlo simulation. Specifically, we apply a procedure suggested by Cohen (1997), who extended the algorithm developed by Lo (1988) in the simple random sampling setting to a weighted sampling setting:

Take a Pólya sample of size N − n, denoted by ($Y_{ns}^{(ls)}$, $X_{ns}^{(ls)}$, $R_{ns}^{(ls)}$), from the urn ($Y_{s}^{(l)}$, $X_{s}^{(l)}$, $R_{s}^{(l)}$) by selecting each element in the urn with probability
$$\frac{w_{i}^{(l)^{*}} - 1 + l_{i,k-1} \times \left(\frac{N - n}{n}\right)}{N - n + (k - 1) \times \left(\frac{N - n}{n}\right)}, \quad k = 1, 2, \dots, N - n + 1, \tag{9}$$
where $w_{i}^{(l)^{*}}$ is the bootstrap weight for the i^th unit in the l^th replicate BB sample, and l_{i,k−1} is the number of selections of unit i up to the (k−1)^th selection, setting l_{i,0} = 0.
Form the weighted FPBB synthetic population $P_{(s)}^{(l)} = \{(Y_{s}^{(l)}, X_{s}^{(l)}, R_{s}^{(l)}), (Y_{ns}^{(ls)}, X_{ns}^{(ls)}, R_{ns}^{(ls)})\}$ so that it has exact size N.
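The urn draw with the selection probabilities in (9) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `weighted_polya_sample` is hypothetical, and the sketch assumes the input weights have already been scaled to sum to N and are each at least 1, so that every numerator in (9) is non-negative. It takes exactly N − n draws, matching the stated Pólya sample size.

```python
import random


def weighted_polya_sample(boot_weights, N, rng):
    """Draw N - n urn indices using the selection probabilities in (9).

    boot_weights: one weight per urn element, assumed to sum to N and be >= 1.
    Returns the list of selected indices, in order of selection.
    """
    n = len(boot_weights)
    if abs(sum(boot_weights) - N) > 1e-8 * N:
        raise ValueError("weights must sum to N")
    if min(boot_weights) < 1:
        raise ValueError("each weight must be at least 1")
    step = (N - n) / n
    selected = [0] * n   # l_{i,k-1}: times element i has been drawn so far
    draws = []
    for k in range(1, N - n + 1):
        # Numerators of (9). They sum to N - n + (k - 1) * step, which is the
        # denominator of (9), so normalizing them gives the stated probabilities.
        numer = [w - 1 + l * step for w, l in zip(boot_weights, selected)]
        i = rng.choices(range(n), weights=numer, k=1)[0]
        selected[i] += 1
        draws.append(i)
    return draws
```

Appending the records indexed by the returned draws to the urn itself yields a synthetic population of size N, which corresponds to the final step above.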

This results in the “unweighted” synthetic populations $P_{(s)}^{(l)} = (Y^{(ls)}, X^{(ls)}, R^{(ls)}) = (P_{(s)obs}^{(l)}, Y_{mis}^{(ls)})$, s = 1,…, S, l = 1, 2,…, L, where L and S are the numbers of datasets generated in the first and second stages, respectively, and $P_{(s)obs}^{(l)} = ((Y_{s,obs}^{(l)}, X_{s}^{(l)}, R_{s}^{(l)}), (Y_{ns,obs}^{(ls)}, X_{ns}^{(ls)}, R_{ns}^{(ls)}))$ and $Y_{mis}^{(ls)} = (Y_{s,mis}^{(l)}, Y_{ns,mis}^{(ls)})$ consist of the observed and unobserved data, respectively, in the ls^th FPBB synthetic population dataset.

2.2. Step 2: Multiply Impute Missing Data through Parametric Models

Now that we have effectively “undone” the sampling design, we are ready to perform conventional MI under an IID assumption. Following the standard MI procedure or approximations such as SRMI (Raghunathan, Lepkowski, Van Hoewyk, & Solenberger, 2001), we obtain draws from the posterior predictive distribution $p(Y_{mis}^{(ls)} ∣ P_{(s)obs}^{(l)})$. Because the FPBB population generated in the previous step is self-weighting, the weights no longer need to be included in the imputation model to account for the sampling design, and our task can concentrate on correctly modeling the covariates. Note, however, that the elimination of the weights from the self-weighting FPBB population does not obviate the need to account for the weights in the imputation process if the probability of selection (I) and non-response (R) are associated with each other (i.e., p(R∣Y_obs, w) ≠ p(R∣Y_obs)). This step results in M imputed synthetic datasets for each of the L × S FPBB synthetic populations generated in the first step, $P_{sM}^{l} = (P_{(s)1}^{(l)}, P_{(s)2}^{(l)}, \dots, P_{(s)M}^{(l)})$, for s = 1,2,…, S, l = 1,2,…, L.

2.3. Point and Variance Estimates for the Two-Step MI Procedure.

Conditional on $P^{imp} = \{P_{(11)}^{(1)}, \dots, P_{(1M)}^{(1)}, \dots, P_{(S1)}^{(1)}, \dots, P_{(SM)}^{(1)}, \dots, P_{(SM)}^{(L)}\}$, the posterior predictive distribution of a scalar population statistic Q(Y) ≡ Q is given by

$$Q \mid P^{imp} \overset{\cdot}{\sim} t_{L-1}\left(\bar{Q}_{L},\ (1 + L^{-1}) V_{L}\right) \tag{10}$$

where $\bar{Q}_{L} = \frac{1}{L} \sum_{l} \tilde{Q}^{(l)}$ and $V_{L} = \frac{1}{L - 1} \sum_{l} (\tilde{Q}^{(l)} - \bar{Q}_{L})^{2}$, with $\tilde{Q}^{(l)} = \lim_{S \to \infty,\, M \to \infty} \frac{1}{SM} \sum_{s} \sum_{m} q^{(lsm)}$, where q^(lsm) is an estimate of Q obtained from the m^th imputation of the s^th synthetic population within the l^th Bayesian bootstrap sample; in practice we estimate $\tilde{Q}^{(l)}$ by the finite average $\frac{1}{SM} \sum_{s} \sum_{m} q^{(lsm)}$. The result follows immediately from Section 4.1 of Raghunathan, Reiter, and Rubin (2003), and is based on the standard Rubin (1987) multiple imputation combining rules, where (Y_ns, X_ns, R_ns) and Y_s,mis are missing data and (Y_s,obs, X_s, R_s) is observed. The average “within” imputation variance is zero, since the entire population is being synthesized; hence the posterior variance of Q is entirely a function of the between-imputation variance, and the degrees of freedom are determined simply by the number of BB samples (L − 1). This result requires E(q^(lsm)) = Q, which implies that our imputation model for Y_mis is correctly specified, as well as the standard requirement of a sufficiently large sample size for the t approximation to be reasonable. In addition, since we are imputing under the synthesized population, all weights are constant and equal to 1, so no covariance between the MI point estimator and the complete data estimator is induced (Kim et al. 2006; Seaman et al. 2012).

These results assume S → ∞ and M → ∞; in practice we have found that only relatively modest values of S and M are needed for the imputation approximations to hold. In particular, we show below that S=20 and M=5 yield reasonable results in simulation studies, consistent with the results in Dong et al. (2014). In addition, in settings where N is very large, generating a synthetic population large enough to have a relatively trivial sampling fraction (e.g., N* = 10n) will generally be sufficient.
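The combining rule in (10) with finite S and M can be sketched as below. The nested-list layout `q[l][s][m]` and the function name `combine_two_step` are assumptions made for illustration; the arithmetic follows the definitions of the per-replicate average, the overall mean, and the between-replicate variance given above.

```python
from statistics import mean, variance


def combine_two_step(q):
    """Combine estimates q[l][s][m] from the two-step procedure as in (10).

    q: list over L BB samples, each a list over S synthetic populations,
       each a list of M completed-data estimates.
    Returns (Q_bar_L, (1 + 1/L) * V_L, L - 1): the center, squared scale,
    and degrees of freedom of the t approximation.
    """
    L = len(q)
    if L < 2:
        raise ValueError("need at least two BB samples")
    # Average over synthetic populations and imputations within each BB sample.
    q_tilde = [mean(x for population in boot for x in population) for boot in q]
    q_bar = mean(q_tilde)
    v_l = variance(q_tilde)   # between-BB-sample variance, L - 1 denominator
    return q_bar, (1 + 1 / L) * v_l, L - 1
```

No within-imputation variance term appears, reflecting the fact that each completed dataset is a full synthetic population.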

3. Simulation Study

A simulation study was designed to investigate the inferential properties of the proposed method. In particular, we are interested to see how the two-step MI procedure performs in comparison with the existing fully parametric methods under four simulation designs defined by crossing the following two factors:

Associations of the probabilities of selection with the mechanism generating the data. We call the design ‘outcome relevant’ if the probabilities of selection are correlated with the outcome variable Y, otherwise we term it an ‘outcome irrelevant’ design.
Associations of the probabilities of selection with the mechanism generating the missing values. We use ‘MAR_X’ (weight-independent missingness) and ‘MAR_X,W’ (weight-dependent missingness) to denote, respectively, situations where the missing data mechanism depends on fully-observed covariates only and where it depends on probabilities of selection as well as other fully-observed covariates.

We first generate a population of three variables: the outcome variable Y, a covariate X and a variable Z based on which probability proportionate to size without replacement (PPSWOR) sampling is conducted. The joint distribution of Z, X, and Y is given by:

$$\begin{aligned} \log Z &\sim N(2, 1) \\ X \mid Z &\sim N(0.1 \log Z,\ \sigma_{x}^{2}) \\ Y_{1} \mid X, Z &\sim N(0.1 X + 0.5 \log Z + 0.6 X \log Z,\ \sigma_{y_{1}}^{2}) \\ Y_{2} \mid X, Z &\sim N(0.2 X,\ \sigma_{y_{2}}^{2}) \end{aligned}$$

Thus (Y₁, X, Z) constitutes the “relevant design” population and (Y₂, X, Z) constitutes the “irrelevant design” population. Both populations have size N=4,000. For each population, we drew 500 independent samples of size n=200 without replacement, with inclusion probability for the i^th unit $π_{i} = n Z_{i} ∕ \sum_{j = 1}^{N} Z_{j}$ . We call the 500 PPSWOR samples “before deletion (BD) samples”.

Next, probit models were used as deletion functions to create missing data in the outcome variable Y for each of the 100 simulations. Both X and Z are assumed to be completely observed. We generate T₁ = −0.635 + 0.4X + e and T₂ = −0.55 + 0.4X − 0.5logZ + 0.4X * logZ + e, where $e \overset{iid}{\sim} N(0, 1)$, corresponding to the MAR_X condition and the MAR_X,W condition respectively. The outcome is then missing if T_j > 0, j = 1, 2 (i.e., P(M = 1∣T_j) = Φ(E(T_j)), where Φ(x) is the standard normal CDF). This yields a missingness fraction of approximately 30% in all four scenarios.
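The data-generating model and the two deletion functions described above can be sketched as follows. The residual standard deviations are not stated in this section, so the default values of 1 for `sigma_x`, `sigma_y1`, and `sigma_y2` are placeholders, and the function names are hypothetical. The PPSWOR sampling step is omitted from this sketch.

```python
import random


def simulate_population(N, rng, sigma_x=1.0, sigma_y1=1.0, sigma_y2=1.0):
    """Generate N units of (log Z, X, Y1, Y2) from the joint model above.

    The sigma defaults are placeholder assumptions, not values from the paper.
    """
    units = []
    for _ in range(N):
        log_z = rng.gauss(2.0, 1.0)
        x = rng.gauss(0.1 * log_z, sigma_x)
        y1 = rng.gauss(0.1 * x + 0.5 * log_z + 0.6 * x * log_z, sigma_y1)
        y2 = rng.gauss(0.2 * x, sigma_y2)
        units.append({"logZ": log_z, "X": x, "Y1": y1, "Y2": y2})
    return units


def missing_indicator(unit, rng, weight_dependent):
    """Return True if Y is deleted for this unit (latent probit index T > 0)."""
    x, log_z = unit["X"], unit["logZ"]
    e = rng.gauss(0.0, 1.0)
    if weight_dependent:   # MAR_{X,W}: depends on X and log Z
        t = -0.55 + 0.4 * x - 0.5 * log_z + 0.4 * x * log_z + e
    else:                  # MAR_X: depends on X only
        t = -0.635 + 0.4 * x + e
    return t > 0
```

Applying `missing_indicator` to each sampled unit and blanking the outcome where it returns True would produce the after-deletion data for one simulated sample.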

For each of the four simulation designs, we analyze the data using five imputation models. Model 1 ignores weights altogether in the imputation process, a procedure typically adopted. Model 2 includes log(Z) in the imputation model (Schenker et al., 2006). Model 3 includes both log(Z) and its interactions with other covariates in the imputation model. Model 4 and Model 5 are equivalent to Model 2 and Model 3, except that log(Z) is replaced with 1/Z corresponding to the weight, as suggested in Kim et al. (2006) and Seaman et al. (2011). All five imputation models are tested with both the fully parametric MI method and the proposed two-step synthetic MI procedure. The only difference is that we perform design-based analyses on the imputed data from the former, while with the new method we perform simple unweighted analyses instead. We implement the MI using the MICE package (R Core Team, 2013).

Finally, we focus on estimating the population mean of Y (i.e. $\bar{Y}$) and the population regression coefficients of Y on X: Y = β₀ + β₁X. We used five quantities to evaluate the performance of the various methods under comparison: bias, empirical root mean square error (RMSE), empirical interval coverage, empirical variance, and the mean of the estimated variance (to compare with the empirical variance). For the standard parametric analysis, we use Rubin’s combining rules (Rubin 1987), using weighted point estimates and Taylor Series approximations (Binder 1983) to account for the weights in the variance estimation of the filled-in datasets. Population means and regression parameters are used to compute bias and mean square error.

3.1. Simulation Results

In deciding how many synthetic populations S are needed, we conducted a preliminary study based on the before deletion (BD) data. (We let L=100.) Simulation results are shown in Table 1. We observe that as we increase S, the variance estimate decreases, and stabilizes close to the actual sample variance when S ≥ 20. This is consistent with a similar result in Dong et al. (2014), which found that 20 synthetic populations were sufficient to yield appropriate coverage intervals in a complete data setting. Therefore, we use S=20 along with L=100 and M=5 in the after deletion (AD) simulation.

Table 1.

Before deletion study of the effects of the number of generated FPBB populations (S) on variance estimate

Parameters Of Interest	Performance Criteria	Weighted FPBB Method with S Synthetic Populations Created								Actual Sample
Parameters Of Interest	Performance Criteria	S=1	S=5	S=10	S=15	S=20	S=25	S=30	S=40	Actual Sample
Mean	Pt. est.	1.460	1.459	1.458	1.460	1.458	1.460	1.460	1.460	1.450
	Emp.Est.Var	0.048	0.036	0.034	0.034	0.033	0.033	0.033	0.033	0.033
	Emp.Var	0.031	0.031	0.032	0.031	0.032	0.031	0.031	0.031	0.032
	RMSE	0.176	0.178	0.177	0.177	0.178	0.175	0.177	0.177	0.178
	95% CI cov.	99%	97%	96%	97%	96%	96%	95%	96%	96%
Intercept	Pt. est.	1.251	1.250	1.249	1.250	1.249	1.250	1.250	1.250	1.241
	Emp.Est.Var	0.028	0.022	0.022	0.022	0.021	0.021	0.021	0.021	0.021
	Emp.Var	0.021	0.021	0.021	0.011	0.021	0.021	0.021	0.021	0.022
	RMSE	0.145	0.145	0.146	0.145	0.145	0.144	0.144	0.146	0.150
	95% CI cov.	96%	94%	94%	94%	94%	95%	94%	94%	94%
Slope	Pt. est.	1.280	1.281	1.280	1.281	1.280	1.281	1.280	1.280	1.264
	Emp.Est.Var	0.036	0.029	0.028	0.027	0.027	0.027	0.027	0.027	0.027
	Emp.Var	0.030	0.029	0.029	0.029	0.029	0.029	0.030	0.030	0.033
	RMSE	0.172	0.171	0.172	0.171	0.171	0.171	0.173	0.173	0.181
	95% CI cov.	95%	92%	91%	91%	91%	90%	91%	90%	89%

Table 2 and Table 3 present the results from our simulation study. Each table is divided into two parts, containing the results from the MAR_X scenario and the MAR_X,W scenario, respectively. Within each scenario, we compare our new method with the fully parametric method; the columns labeled ‘X’, ‘X, logZ’, ‘X*logZ’, ‘X,W’ and ‘X*W’ correspond to the estimates under the five imputation models described above.

Table 2.

Performance of the proposed method in contrast to the fully parametric method under the relevant design condition (true model italicized).

Actual Parameters	Performance Criteria	MAR_X										MAR_X,W
		Standard Parametric MI					Semi-Parametric MI					Standard Parametric MI					Semi-Parametric MI
		X	X, logZ	X*logZ	X,W	X*W	X	X, logZ	X*logZ	X,W	X*W	X	X, logZ	X*logZ	X,W	X*W	X	X, logZ	X*logZ	X,W	X*W
Mean=1.450	Bias	0.248	0.084	0.003	0.063	−0.078	0.019	0.012	0.005	−.011	−.056	0.211	0.032	0.000	0.001	−0.098	0.061	0.022	0.006	−.001	−.048
	MeanEst.Var	0.057	0.050	0.041	0.080	0.133	0.047	0.043	0.040	.059	.089	0.062	0.065	0.049	0.080	0.123	0.040	0.039	0.037	.057	.078
	Emp.Var	0.047	0.042	0.035	0.065	0.125	0.044	0.039	0.036	.051	.067	0.062	0.066	0.045	0.074	0.125	0.036	0.034	0.032	.048	.065
	RMSE	0.329	0.222	0.188	0.262	0.362	0.211	0.196	0.190	.224	.265	0.327	0.258	0.211	0.273	0.367	0.199	0.186	0.179	.219	.259
	95%Cov	84%	95%	96%	95%	98%	95%	95%	95%	96%	97%	84%	92%	96%	96%	97%	95%	95%	96%	97%	98%
Intercept=1.241	Bias	0.208	0.056	0.004	0.035	−0.064	0.019	0.010	0.007	−.013	−.043	0.181	0.003	−0.001	−0.019	−0.081	0.051	0.015	0.007	−.009	−.038
	MeanEst.Var	0.032	0.030	0.028	0.058	0.090	0.029	0.028	0.027	.045	.062	0.032	0.033	0.032	0.058	0.087	0.023	0.024	0.024	.043	.053
	Emp.Var	0.027	0.030	0.026	0.065	0.092	0.029	0.027	0.026	.043	.051	0.027	0.031	0.026	0.067	0.097	0.022	0.023	0.022	.039	.048
	RMSE	0.266	0.183	0.162	0.257	0.309	0.171	0.165	0.160	.205	.227	0.243	0.174	0.160	0.258	0.321	0.155	0.151	0.147	.195	.220
	95% Cov	76%	92%	94%	90%	95%	93%	93%	94%	95%	95%	76%	92%	96%	94%	96%	92%	94%	95%	94%	95%
Slope=1.264	Bias	0.208	0.144	−0.009	0.156	−0.064	0.003	0.013	0.008	.019	−.041	0.155	0.165	−0.003	0.117	−0.083	0.060	0.047	0.015	.053	−.026
	MeanEst.Var	0.035	0.031	0.034	0.046	0.128	0.035	0.032	0.034	.039	.085	0.038	0.036	0.039	0.049	0.117	0.028	0.027	0.030	.038	.073
	Emp.Var	0.034	0.032	0.037	0.048	0.155	0.037	0.033	0.035	.036	.072	0.054	0.046	0.037	0.052	0.138	0.030	0.029	0.031	.037	.066
	RMSE	0.277	0.229	0.192	0.268	0.398	0.191	0.181	0.186	.191	.269	0.278	0.269	0.191	0.257	0.381	0.182	0.176	0.177	.198	.256
	95% Cov	75%	82%	94%	82%	94%	93%	93%	93%	94%	95%	78%	82%	97%	85%	93%	88%	89%	91%	90%	92%

Table 3.

Performance of the proposed method in contrast to the fully parametric method under the irrelevant design condition (true model italicized).

Actual Parameters	Performance Criteria	MAR_X										MAR_X,W
		Standard Parametric MI					Semi-Parametric MI					Standard Parametric MI					Semi-Parametric MI
		X	X, logZ	X*logZ	X,W	X*W	X	X, logZ	X*logZ	X,W	X*W	X	X, logZ	X*logZ	X,W	X*W	X	X, logZ	X*logZ	X,W	X*W
Mean=−0.0191	Bias	−0.025	−0.034	−0.037	0.004	−0.006	−0.000	−0.000	0.002	.001	−.006	−0.006	−0.025	−0.026	0.007	−0.001	0.007	0.006	0.004	.009	.003
	MeanEst.Var	0.020	0.020	0.021	0.025	0.032	0.018	0.019	0.020	.025	.034	0.024	0.029	0.029	0.021	0.027	0.015	0.016	0.016	.023	.030
	Emp.Var	0.014	0.019	0.018	0.023	0.026	0.018	0.018	0.019	.021	.024	0.013	0.021	0.022	0.019	0.022	0.015	0.016	0.016	.019	.022
	RMSE	0.122	0.141	0.140	0.154	0.168	0.133	0.134	0.136	.146	.155	0.114	0.147	0.151	0.143	0.154	0.121	0.127	0.128	.137	.148
	95%Cov	97%	94%	93%	96%	94%	94%	95%	96%	94%	95%	97%	94%	95%	95%	94%	95%	95%	95%	95%	95%
Intercept=−0.0569	Bias	−0.102	−0.109	−0.113	0.003	−0.003	−0.003	−0.003	−0.004	.001	−.002	−0.001	−0.021	−0.023	0.006	0.001	0.007	0.005	0.004	.008	.003
	MeanEst.Var	0.019	0.019	0.020	0.023	0.026	0.017	0.018	0.019	.023	.028	0.022	0.027	0.025	0.020	0.023	0.015	0.016	0.016	.022	.026
	Emp.Var	0.014	0.018	0.018	0.022	0.024	0.018	0.018	0.018	.020	.021	0.012	0.023	0.024	0.020	0.021	0.015	0.016	0.016	.019	.021
	RMSE	0.157	0.174	0.174	0.185	0.193	0.132	0.132	0.133	.140	.145	0.110	0.152	0.157	0.176	0.185	0.121	0.126	0.127	.136	.143
	95% Cov	88%	85%	86%	86%	85%	95%	95%	95%	94%	95%	98%	95%	97%	93%	94%	94%	95%	94%	95%	96%
Slope=0.222	Bias	0.006	0.003	0.003	0.005	−0.012	0.008	0.008	0.006	.001	−.008	−0.023	−0.022	−0.008	0.004	−0.014	0.000	−0.000	−0.003	.002	−.010
	MeanEst.Var	0.017	0.017	0.019	0.019	0.030	0.017	0.017	0.019	.019	.033	0.023	0.020	0.025	0.016	0.026	0.013	0.013	0.015	.015	.028
	Emp.Var	0.014	0.014	0.017	0.014	0.027	0.018	0.017	0.019	.017	.024	0.011	0.014	0.025	0.013	0.025	0.013	0.013	0.015	.014	.021
	RMSE	0.118	0.119	0.132	0.119	0.164	0.133	0.131	0.136	.128	.154	0.108	0.120	0.158	0.113	0.157	0.115	0.115	0.121	.116	.143
	95% Cov	95%	95%	94%	97%	95%	94%	95%	95%	94%	95%	98%	98%	92%	96%	95%	93%	94%	94%	94%	95%

When the design is relevant to the outcome variable Y yet uncorrelated with missingness (Table 2: MAR_X scenario), the synthetic method shows clear advantages over the fully parametric method. For the fully parametric method to work properly under this condition, the imputation model has to be correctly specified; otherwise all inferences based on this method are invalid: there is substantial bias in all three parameter estimates and a corresponding deterioration in coverage rates, which is particularly severe when the design is completely ignored in the model. In contrast, our proposed method results in nearly unbiased estimates and actual coverage that is closer to the nominal level under all five models, regardless of the misspecification. Substantial gains in terms of RMSE over the model-based method were also consistently observed in all scenarios considered. This indicates that the ‘unweighting’ procedure plays a dual role in the process: its effect is not limited to undoing the unequal probabilities of selection and saving the effort of design-based analyses afterwards; it also captures much of the interaction between the design and the survey variable of interest, so that ignoring the design in the imputation model does little harm. Unnecessarily incorporating the probability of selection in the imputation model has a modest impact, with a somewhat greater increase in variability and MSE when the weights rather than the log MOS are included, since the weights are more variable.

With a relevant design that is also a correlate of missingness (Table 2: MAR_X,W scenario), the imputation models require use of the design variable (here the weight) to maintain an ignorable missing data mechanism. The model-based method behaves similarly to the case where the design is associated only with Y: failure to include the weight in the imputation model substantially biases all of the estimators considered, while including the weight as a covariate corrects for bias in the mean and intercept estimators but not in the slope. The synthetic model partially corrects for these biases by providing a correct estimate of the population distribution in the presence of missing data; however, unless the imputation model is correctly specified, some biases remain. Nevertheless, the synthetic model still has substantially reduced RMSE relative to the fully parametric approach for the mean and intercept estimators when the weight is ignored in the imputation model, and reduced RMSE when estimating the slope when the weight is included as a covariate but the interaction between the slope and the probability of selection is ignored. The synthetic model also has nearly exact to slightly conservative coverage properties, in contrast to the anti-conservative coverage of the fully parametric estimator when the model is misspecified for the estimator of interest. Misspecifying the functional form of the probability of selection in the imputation model (using weights instead of the log MOS) generally increases bias and MSE for both the fully parametric and semi-parametric approaches, although the increased bias is not sufficient to reduce coverage below that obtained under the correctly specified functional form.

With an outcome irrelevant design (Table 3), there are very slight effects on the estimates when compared across methods and models. Including the irrelevant design variable in the imputation model results in negligible biases and introduces some modest inefficiencies, consistent with the findings in Reiter et al. (2006). The only impact of using weights rather than log MOS in the imputation is to modestly increase MSE.

It is also worth noting that the MI variance/standard errors under the new method are consistently lower than the fully parametric method, in addition to their better coverage properties. This is observed for all 12 scenarios considered.

4. Application to the Behavioral Risk Factor Surveillance System (BRFSS)

We next examine the effect of incorporating the survey weight in MI using data from one design stratum (n=388) of the 2009 Michigan BRFSS. This design stratum contains sampled households that belong to the medium-density (unlisted) telephone numbers group. The BRFSS is a telephone survey conducted with a random sample of adults living in telephone-equipped households in the US. An independent sample of telephone numbers is used as the sampling frame; thus case weights are constructed to account for the fact that the probability of selection is proportional to the number of telephone lines and inversely proportional to the number of adults in a household. In addition, poststratification weights are used to adjust the age-sex-race/ethnicity distributions to Census totals. A mix of categorical and continuous variables is selected for analysis. These include health insurance coverage (yes/no), body-mass index (BMI) in kg/m², high blood pressure (yes/no), and five demographic variables: age (in years), race (White vs. Nonwhite), annual household income (low: <$25,000; medium: $25,000–$75,000; high: >$75,000), gender, and employment status (yes/no/other). All survey variables except gender have some missing data: income has the highest missing rate (16.5%), while the others are missing 0–6%.
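The weight construction described above can be illustrated with a simplified sketch: a base weight inversely proportional to the selection probability (number of adults divided by number of telephone lines), followed by a poststratification ratio adjustment to known cell totals. The field names (`adults`, `lines`, `cell`) and the toy totals are hypothetical, and the actual BRFSS weighting procedure involves further components that are not modeled here.

```python
from collections import defaultdict


def final_weights(respondents, census_totals):
    """Base weight times a poststratification factor, per respondent.

    respondents:   list of dicts with hypothetical keys
                   'adults', 'lines', and 'cell'
    census_totals: dict mapping each poststratification cell to its
                   known population total
    """
    # Selection probability is proportional to lines / adults, so the
    # base weight is proportional to adults / lines.
    base = [r["adults"] / r["lines"] for r in respondents]

    # Sum the base weights within each poststratification cell.
    cell_sums = defaultdict(float)
    for r, b in zip(respondents, base):
        cell_sums[r["cell"]] += b

    # Scale each cell so its weights add up to the known total.
    return [
        b * census_totals[r["cell"]] / cell_sums[r["cell"]]
        for r, b in zip(respondents, base)
    ]
```

After this adjustment the weights within each cell sum to the corresponding total, which is why the poststratification variables are natural candidates for the imputation model.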

4.1. Imputation Method

We compare results from the conventional fully parametric MI method with the proposed two-step semi-parametric MI method, with two imputation modeling strategies applied with each method: 1) assuming SRS, and 2) including the log of weights as a predictor in the model. We also include the weighted complete case analysis. Both imputation models used all available substantive covariates (health insurance, BMI, high blood pressure status, age, race, income, gender, and employment status). For the standard parametric analysis, as in the simulation study, we use Rubin’s combining rules (Rubin 1987) with weighted point estimates and Taylor Series approximations to account for the weights in the variance estimation. For the new method, we generated L=100 Bayesian bootstrap (BB) samples and created S=30 FPBB populations within each BB sample, with M=5 multiple imputations performed for each FPBB population. Since we do not know the population size in advance and the individual final weights sum up to nearly 200,000 cases which is unrealistic to generate, we assume that N=4500 is large enough to be treated as a synthetic population (corresponding to a sampling fraction of less than 10%). Since the degrees of freedom is L-1=99, a normal distribution was used for inference.
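For the standard parametric analysis, Rubin's combining rules (Rubin 1987) can be sketched as follows. The inputs are assumed to be the M weighted point estimates and their design-based (Taylor series) variance estimates, computed elsewhere; the function name is hypothetical.

```python
import math
import statistics


def rubin_combine(point_estimates, variances):
    """Combine M completed-data analyses with Rubin's rules.

    Returns (combined estimate, total variance, degrees of freedom).
    """
    m = len(point_estimates)
    q_bar = statistics.fmean(point_estimates)   # combined point estimate
    w_bar = statistics.fmean(variances)         # within-imputation variance
    b = statistics.variance(point_estimates)    # between-imputation variance
    total = w_bar + (1.0 + 1.0 / m) * b         # total variance
    if b > 0:
        df = (m - 1) * (1.0 + w_bar / ((1.0 + 1.0 / m) * b)) ** 2
    else:
        df = math.inf  # no between-imputation variability
    return q_bar, total, df
```

With M=5 imputations the between-imputation component is inflated by a factor of 1.2, and the resulting degrees of freedom determine the reference t distribution for interval estimation.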

4.2. Analyses

We consider three different analyses: 1) the marginal distribution of income and health insurance accessibility (Table 4); 2) a linear regression model of BMI on key demographic variables (Table 4); and 3) a loglinear model of a four-way contingency table defined by four categorical variables with no second-or-higher-order interactions (Table 5). We consider an analysis using the full dataset, as well as a stratified analysis restricted to subjects identifying as white (“white domain”). Multivariate imputation by chained equations (MICE) in R was used to impute the missing data under both MI methods.

Table 4.

Estimation of marginal distributions for income and health insurance, and linear regression coefficients for the regression of BMI (dependent variable) on income, age and gender (independent variables). (Complete case analysis presented is weighted.)

			Methods
Sample	Estimation	Variable	Complete Case		Parametric MI (M=5)				Synthetic MI (L=100, S=30, M=5)
			Complete Case		Exclude weights		Include log(weights)		Exclude weights		Include log(weights)
			Pt.est.	SE	Pt.est.	SE	Pt.est.	SE	Pt.est.	SE	Pt.est.	SE
Full Sample	Marginal	Low Income	0.50	0.04	0.50	0.04	0.52	0.05	0.52	0.04	0.51	0.04
		Medium Income	0.38	0.04	0.36	0.04	0.36	0.04	0.36	0.04	0.36	0.04
		High Income	0.12	0.03	0.14	0.03	0.12	0.03	0.13	0.03	0.13	0.03
		No insurance	0.22	0.04	0.24	0.04	0.24	0.04	0.24	0.04	0.24	0.04
	Regression	Intercept	27.0	2.75	26.1	2.02	25.8	2.05	26.3	2.29	26.2	2.30
		Medium income	0.47	1.40	0.35	1.21	0.39	1.19	0.37	0.94	0.37	0.95
		High income	0.27	1.43	−0.47	1.32	−0.33	1.37	−0.36	1.40	−0.31	1.40
		Age	0.02	0.04	0.03	0.03	0.04	0.03	0.03	0.03	0.03	0.03
		Female	2.29	1.30	2.72	1.06	2.56	1.07	2.57	1.06	2.55	1.05
Whites Domain	Marginal	Low Income	0.30	0.07	0.36	0.07	0.35	0.06	0.34	0.06	0.34	0.06
		Medium Income	0.53	0.08	0.48	0.07	0.50	0.07	0.49	0.06	0.49	0.06
		High Income	0.17	0.06	0.16	0.06	0.15	0.05	0.17	0.06	0.17	0.06
		No insurance	0.24	0.07	0.21	0.06	0.21	0.06	0.19	0.06	0.19	0.06
	Regression	Intercept	31.1	3.9	32.4	4.7	31.0	4.1	31.0	4.2	31.0	4.1
		Medium income	−1.6	3.25	−2.8	2.83	−2.1	2.72	−1.8	2.96	−1.7	2.97
		High income	−3.1	3.60	−3.5	3.42	−3.2	3.18	−3.1	3.65	−3.0	3.62
		Age	0.02	0.06	−0.01	0.06	0.02	0.06	0.02	0.06	0.02	0.05
		Female	−1.7	2.39	−0.13	2.13	−0.68	2.11	−0.80	2.17	−0.75	2.17

Table 5.

Estimation of log-linear model for four categorical variables (collapse categories for medium and high income): 2009 Michigan BRFSS.

		Methods
Estimation	Variable Level	Complete Case		Parametric MI Exclude weights		Parametric MI Include log(weights)		Synthetic MI Exclude weights		Synthetic MI Include log(weights)
Estimation	Variable Level	Coef.	SE	Coef.	SE	Coef.	SE	Coef.	SE	Coef.	SE
Main effects	Low income	−0.01	0.12	0.08	0.13	0.04	0.12	0.04	0.13	0.02	0.13
	Has insurance	0.61	0.12	0.64	0.12	0.62	0.11	0.69	0.12	0.68	0.12
	White	−0.94	0.11	−1.0	0.10	−1.0	0.11	−1.1	0.12	−1.1	0.12
	Male	−0.11	0.12	−0.09	0.10	−0.07	0.10	−0.07	0.11	−0.07	0.11
Two-way Interactions	Low income x Has insurance	−0.36	0.12	−0.37	0.12	−0.33	0.13	−0.31	0.11	−0.30	0.11
	Low income x White	−0.28	0.10	−0.20	0.09	−0.20	0.09	−0.22	0.09	−0.22	0.09
	Low income x Male	−0.03	0.09	−0.03	0.09	−0.07	0.09	−0.05	0.08	−0.05	0.08
	Has insurance x White	−0.13	0.12	−0.02	0.12	−0.02	0.12	0.02	0.13	0.03	0.13
	Has insurance x Male	−0.01	0.13	−0.12	0.10	−0.14	0.10	−0.15	0.10	−0.15	0.10
	White x Male	−0.08	0.09	−0.11	0.08	−0.11	0.09	−0.11	0.08	−0.10	0.08

4.3. Results

Since the poststratification adjustment factor constitutes an important component of the final weight in BRFSS dataset, we presume that including the variables used to construct poststratification cells (age, race and gender in this case) in the imputation model should help in predicting the missing Y variable. A linear regression of final weights on age, sex, and race shows that these covariates explain 40% of the variance of the weights, suggesting that there are other design variables that contribute to the survey weights unknown to us. Thus we conclude that imputation approaches that condition only on the available design variables will be insufficient to account for the sampling weights.

Table 4 shows that under the fully parametric MI method, including survey weights in the imputation model has a large impact on the estimated proportions of income levels and the regression coefficients of BMI on income and gender. In fact, these differences are particularly significant for the whites-only analysis. Under the new method, however, all estimates are similar to those from the model-based method with weights accounted for. Moreover, there is essentially no difference whether or not we incorporate weights into the imputation model once the sample data are synthesized, indicating that, as we expect, the new method can adjust for the weight effects at the synthesizing step without the need to model survey weights at the imputation step. Similar results are obtained in Table 5 with respect to the log-linear model.

5. Discussion

We propose using the weighted finite population Bayesian bootstrap to account for one-stage sampling weights in MI for item missing data in the Behavioral Risk Factor Surveillance System. We also evaluate the performance of this method in a simulation setting: our findings suggest that it can bring significant reductions in bias relative to the existing model-based methods with little loss in efficiency. Meanwhile, the weighted FPBB method potentially protects against model misspecification, for example, wrongly including or excluding interactions between design variables and other covariates in the imputation model, while also maintaining population-level multivariate relationships. A further advantage is that, unlike the fully parametric methods, which include design variables in the imputation model and still require complex survey packages to analyze the imputed datasets, the new method fully accounts for the unequal selection probabilities by unweighting them and restoring a population in a separate step; therefore, only simple, unweighted complete-data analysis techniques are needed for inferences with the combining rules. This potentially allows a much wider variety of models to be considered using existing software, which, despite recent improvements, often does not have straightforward methods for accounting for complex sample designs.
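To make the 'unweighting' idea concrete, the sketch below draws one synthetic population with a weighted Pólya urn. It assumes the draw probabilities take the form given for the weighted Pólya posterior in Dong et al. (2014), namely (w_i − 1 + l_i (N − n)/n) / (N − n + (k − 1)(N − n)/n) at the k-th draw, with weights normalized to sum to N and l_i the number of times unit i has been drawn so far. The function name is hypothetical, and the Bayesian bootstrap resampling that precedes this step in the full procedure is omitted.

```python
import random


def weighted_polya_counts(weights, pop_size, rng):
    """Return how often each sampled unit appears in one synthetic population.

    weights:  case weights of the n sampled units
    pop_size: synthetic population size N (must exceed n)
    rng:      a random.Random instance
    """
    n = len(weights)
    if pop_size <= n:
        raise ValueError("pop_size must exceed the sample size")
    total = sum(weights)
    w = [wi * pop_size / total for wi in weights]  # normalize to sum to N
    if min(w) < 1.0:
        raise ValueError("normalized weights must be at least 1")

    ratio = (pop_size - n) / n
    drawn = [0] * n  # l_i: extra copies of unit i selected so far
    for _ in range(pop_size - n):
        # Unnormalized urn probabilities; rng.choices rescales them.
        urn = [w[i] - 1.0 + drawn[i] * ratio for i in range(n)]
        i = rng.choices(range(n), weights=urn, k=1)[0]
        drawn[i] += 1
    # Every sampled unit is kept once, plus its extra copies.
    return [1 + d for d in drawn]
```

Because units with larger weights are replicated more often on average, an unweighted analysis of the resulting population targets the same quantity as a weighted analysis of the original sample.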

A limitation of the proposed method is the need for the weights to be included in the imputation model if the probability of item response is a function of the selection probability. However, by separating the modeling of the weights in the complete data, through a relatively easy-to-implement nonparametric algorithm, from the modeling of the weights in the missingness mechanism, the method a) reduces the impact of misspecified missingness mechanisms (as noted in Table 2, where the RMSEs and coverage of the misspecified models are greatly improved over the standard parametric approaches), and b) allows more careful inspection and modeling of the missingness mechanism as a function of the weights. In particular, this suggests that the imputation model be developed using the weighted FPBB datasets, to include appropriate functions of and interactions with the design weights.

The proposed two-stage semi-parametric multiple imputation approach has a number of possible extensions. First, while we have imputed the missing data in our second step using a model-based approach, a fully non-parametric approach using a Bayesian bootstrap (Rubin and Schenker 1986) can be used instead. Second, while our approach has focused on sampling weights, extensions that incorporate unit non-response into the synthetic population generation and multiple imputation to propagate uncertainty in unit-nonresponse weighting adjustments are possible. (However, when only final weights incorporating non-response adjustments are provided, treating the final weight as a sampling weight as we did in the BRFSS application may be the only practical alternative.) Third, while we made a missing at random assumption with a single missing outcome in our simulation study and application, it is certainly possible at the imputation stage to accommodate missingness in multiple covariates via sequential regression multiple imputation (Raghunathan et al. 2001), or even to consider not missing at random mechanisms (Little 2008). Fourth, we could extend the method to incorporate unit nonresponse by generating Bayesian bootstraps of the entire sample including the unit non-responders, applying standard unit nonresponse adjustments to the base weights to obtain the nonresponse-adjusted weights, and then applying the weighted Polya posterior with the non-response-adjusted weights as the input weight in the algorithm to create synthetic populations. Finally, our method developed here is for a one-stage design; extensions to account for multi-stage designs with clustering and stratification as part of the finite population Bayesian bootstrap are required as well, and are the focus of current research efforts.

Acknowledgements

This work was supported in part by Grant Number R01CA129101 from the National Cancer Institute. The authors would like to thank Rod Little, Brady West, and Richard Valliant for their review and helpful comments, as well as the Editor, Associate Editor, and two reviewers, whose careful review and comments substantially improved the manuscript.

BIBLIOGRAPHY

Binder DA. On the variances of asymptotically normal estimators from complex surveys. International Statistical Review 1983; 51: 279–292.
Breidt J, Claeskens G, Opsomer J. Model-assisted estimation for complex surveys using penalized splines. Biometrika 2005; 92: 831–846.
Chen Q, Elliott MR, Little RJA. Bayesian penalized spline model-based inference for finite population proportions in unequal probability sampling. Survey Methodology 2010; 36: 22–34.
Cohen MP. The Bayesian bootstrap and multiple imputation for unequal probability sample designs. ASA Proceedings of the Section on Survey Research Methods 1997; 635–638.
Dong Q, Elliott MR, Raghunathan TE. A nonparametric method to generate synthetic populations to adjust for complex sample designs. Survey Methodology 2014; 40: 29–46.
Elliott MR, Little RJ. Model-based alternatives to trimming survey weights. Journal of Official Statistics 2000; 16: 191–209.
Ericson WA. Subjective Bayesian models in sampling finite populations. Journal of the Royal Statistical Society, Series B 1969; 31: 195–233.
Feller W. An Introduction to Probability Theory and Its Applications, Vol. I. New York, NY: Wiley; 1968.
Ghosh M, Meeden G. Estimation of the variance in finite population sampling. Sankhyā: The Indian Journal of Statistics, Series B 1983; 45: 362–375.
Heitjan DF, Little RJA. Multiple imputation for the Fatal Accident Reporting System. Applied Statistics 1991; 40: 13–29.
Kim JK, Brick JM, Fuller WA, Kalton G. On the bias of the multiple-imputation variance estimator in survey sampling. Journal of the Royal Statistical Society, Series B 2006; 68: 509–521.
Little RJA. To model or not to model? Competing modes of inference for finite population sampling. Journal of the American Statistical Association 2004; 99: 546–556.
Little RJA. Selection and pattern mixture models. In: Fitzmaurice G, Davidian M, Verbeke G, Molenberghs G, editors. Longitudinal Data Analysis. Boca Raton, FL: Chapman & Hall/CRC Press; 2008. p. 409–430.
Little RJA. Calibrated Bayes, for statistics in general, and missing data in particular (with Discussion and Rejoinder). Statistical Science 2011; 26: 162–186.
Little RJA, Zheng H. The Bayesian approach to the analysis of finite population surveys. Bayesian Statistics 2007; 8: 283–302.
Lo AY. A Bayesian bootstrap for a finite population. Annals of Statistics 1988; 16: 1684–1695.
Meng XL. Multiple imputation inferences with uncongenial sources of input. Statistical Science 1994; 9: 538–558.
Olkin I, Tate RF. Multivariate correlation models with mixed discrete and continuous variables. Annals of Mathematical Statistics 1961; 32: 448–465.
Raghunathan TE, Lepkowski JM, Van Hoewyk J, Solenberger P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 2001; 27: 85–95.
Raghunathan TE, Reiter JP, Rubin DB. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics 2003; 19: 1–16.
Reiter JP, Raghunathan TE, Kinney SK. The importance of modeling the sampling design in multiple imputation for missing data. Survey Methodology 2006; 32: 143–149.
Rubin DB. The Bayesian bootstrap. Annals of Statistics 1981; 9: 130–134.
Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: Wiley; 1987.
Rubin D, Schenker N. Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association 1986; 81: 366–374.
Schafer JL. Analysis of Incomplete Multivariate Data. London: Chapman & Hall; 1997a.
Schafer JL. Imputation of missing covariates under a multivariate linear mixed model. Technical report 97–04, Dept. of Statistics, The Pennsylvania State University; 1997b. http://www.stat.psu.edu/reports/1997/tr9704.pdf
Schenker N, Raghunathan TE, Chiu P, Makuc DM, Zhang G, Cohen AJ. Multiple imputation of missing income data in the National Health Interview Survey. Journal of the American Statistical Association 2006; 101: 924–933.
Seaman SR, White IR, Copas AJ, Li L. Combining multiple imputation and inverse-probability weighting. Biometrics 2012; 68: 129–137.
van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 2011; 45: 1–67.
Zangeneh SZ, Keener RW, Little RJ. Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units. ASA Proceedings of the Section on Survey Research Methods 2011; 3429–3440.
Zheng H, Little RJ. Penalized Spline Nonparametric Mixed Models for Inference about a Finite Population Mean from Two-Stage Samples. Survey Methodology 2004; 30: 209–218.
Zheng H, Little RJ. Inference for the Population Total from Probability-Proportional-to-Size Samples Based on Predictions from a Penalized Spline Nonparametric Model. Journal of Official Statistics 2005; 21: 1–20.

[R1] Binder DA On the variances of asymptotically normal estimators from complex surveys International Statistical Review 1983. 51 279–292 [Google Scholar]

[R2] Breidt J Claeskens G Opsomer J Model-assisted estimation for complex surveys using penalized splines Biometrika 2005. 92 831–846 [Google Scholar]

[R3] Chen Q Elliott MR Little RJA Bayesian penalized spline model-based inference for finite population proportions in unequal probability sampling Survey Methodology 2010. 36 22–34 [PMC free article] [PubMed] [Google Scholar]

[R4] Cohen MP the Bayesian bootstrap and multiple imputation for unequal probability sample designs ASA Proceedings of the Section on Survey Research Methods 1997. 635–638 [Google Scholar]

[R5] Dong Q Elliott MR Raghunathan TE A nonparametric method to generate synthetic populations to adjust for complex sample designs Survey Methodology 2014. 40 29–46 [PMC free article] [PubMed] [Google Scholar]

[R6] Elliott MR Little RJ Model-based alternatives to trimming survey weights Journal of Official Statistics 2000. 16 191–209 [Google Scholar]

[R7] Ericson WA Subjective Bayesian models in sampling finite populations Journal of the Royal Statistical Society 1969. B31 195–233 [Google Scholar]

[R8] Feller W. An Introduction to Probability Theory and Its Applications. I. Wiley; New York, NY: 1968. [Google Scholar]

[R9] Ghosh M Meeden G Estimation of the variance in finite population sampling Sankhyā: The Indian Journal of Statistics 1983. B45 362–375 [Google Scholar]

[R10] Heitjan DF Little RJA Multiple imputation for the Fatal Accident Reporting System Applied Statistics 1991. 40 13–29 [Google Scholar]

[R11] Kim JK Brick JM Fuller WA Kalton G On the bias of the multiple-imputation variance estimator in survey sampling Journal of the Royal Statistical Society 2006. B68 509–521 [Google Scholar]

[R12] Little RJA To model or not to model? Competing modes of inference for finite population sampling Journal of the American Statistical Association 2004. 99 546–556 [Google Scholar]

[R13] Little RJA Fitzmaurice G Davidian M Verbeke G Molenberghs G Selection and Pattern Mixture Models, in Longitudinal Data Analysis 2008. 409–430 Chapman & Hall/CRC Press; Boca Raton.FL: [Google Scholar]

[R14] Little RJA Calibrated Bayes, for statistics in general, and missing data in particular (with Discussion and Rejoinder) Statistical Science 2011. 26 162–186 [Google Scholar]

[R15] Little RJA Zheng H The Bayesian approach to the analysis of finite population surveys Bayesian Statistics 2007. 8 283–302 [Google Scholar]

[R16] Lo AY A Bayesian bootstrap for a finite population Annals of Statistics 1988. 16 1684–1695 [Google Scholar]

[R17] Meng XL Multiple imputation inferences with uncongenial sources of input Statistical Science 1994. 9 538–558 [Google Scholar]

[R18] Olkin I Tate RF Multivariate correlation models with mixed discrete and continuous variables Annals of Mathematical Statistics 1961. 32 448–465 [Google Scholar]

[R19] Raghunathan TE Lepkowski JM Van Hoewyk J Solenberger P A multivariate technique for multiply imputing missing values using a sequence of regression models Survey Methodology 2001. 27 85–95 [Google Scholar]

[R20] Raghunathan TE Reiter JP Rubin DB Multiple imputation for statistical disclosure limitation Journal of Official Statistics 2003. 19 1–16 [Google Scholar]

[R21] Reiter JP Raghunathan TE Kinney SK The importance of modeling the sampling design in multiple imputation for missing data Survey Methodology 2006. 32 143–149 [Google Scholar]

[R22] Rubin DB The Bayesian bootstrap Annals of Statistics 1981. 9 130–134 [Google Scholar]

[R23] Rubin DB. Multiple Imputation for Nonresponse in Surveys. Wiley; New York: 1987. [Google Scholar]

[R24] Rubin D Schenker N Multiple imputation for interval estimation from simple random samples with ignorable nonresponse Journal of the American Statistical Association 1986. 81 366–374 [Google Scholar]

[R25] Schafer JL. Analysis of Incomplete Multivariate Data. Chapman & Hall; London: 1997a. [Google Scholar]

[R26] Schafer JL Imputation of missing covariates under a multivariate linear mixed model Technical report 1997b. 97–04 Dept. of Statistics, The Pennsylvania State University; http://www.stat.psu.edu/reports/1997/tr9704.pdf [Google Scholar]

[R27] Schenker N, Raghunathan TE, Chiu P, Makuc DM, Zhang G, Cohen AJ. Multiple imputation of missing income data in the National Health Interview Survey. Journal of the American Statistical Association. 2006;101:924–933.

[R28] Seaman SR, White IR, Copas AJ, Li L. Combining multiple imputation and inverse-probability weighting. Biometrics. 2012;68:129–137.

[R29] van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software. 2011;45:1–67.

[R30] Zangeneh SZ, Keener RW, Little RJ. Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units. ASA Proceedings of the Section on Survey Research Methods. 2011:3429–3440.

[R31] Zheng H, Little RJ. Penalized Spline Nonparametric Mixed Models for Inference about a Finite Population Mean from Two-Stage Samples. Survey Methodology. 2004;30:209–218.

[R32] Zheng H, Little RJ. Inference for the Population Total from Probability-Proportional-to-Size Samples Based on Predictions from a Penalized Spline Nonparametric Model. Journal of Official Statistics. 2005;21:1–20.

