Summary
Missing data are common in longitudinal cohort studies and can lead to bias, particularly in studies with informative missingness. Many common methods for handling informatively missing data in survey samples require correctly specifying a model for missingness. Although doubly robust methods exist to provide unbiased regression coefficients in the presence of missing outcome data, these methods do not account for correlation due to clustering inherent in longitudinal or cluster-sampled studies. In this work, we developed a doubly robust method to estimate the regression of an outcome on a predictor in the presence of missing multilevel data on the outcome, which results in consistent estimation of regression coefficients assuming correct specification of either (1) the probability of missingness or (2) the outcome model. This method involves specification of separate hierarchical models for missingness and for the outcome, conditional on observed auxiliary variables and cluster-specific random effects, to account for correlation among observations. We showed this proposed estimator is doubly robust and derived its asymptotic distribution, conducted simulation studies to compare the method to an existing doubly robust method developed for independent data, and applied the method to data from the China Health and Nutrition Survey, an ongoing multilevel longitudinal cohort study.
Keywords: Clustering, Doubly robust, Hierarchical modeling, Longitudinal, Missing data
1 |. INTRODUCTION
The China Health and Nutrition Survey (CHNS) is an ongoing longitudinal cohort study, consisting of a diverse population-based sample1. The CHNS was implemented to study the effects of various government programs and the rapidly changing social and economic environments in China on the nutrition and health of the population. The original CHNS cohort included households from eight provinces in China, with households from additional provinces and municipalities added in later waves of data collection, resulting in a cohort of about 7,200 households with over 30,000 people from 15 provinces and municipalities, with nine waves of data collection that started in 1991. A diverse set of individual-level and household-level data was collected via household-based surveys and physical exams, in addition to the collection of community-level data.
The CHNS cohort includes entire households (i.e., all participants from a given household) and multiple households per community (i.e., geographic neighborhood), resulting in natural clusters in the data. In addition, repeated measures from multiple study visits are clustered within individuals. This natural clustering introduces correlation to the data, which can complicate statistical analysis. Similarly to most long-term, longitudinal studies, CHNS suffers from a substantial amount of missing data due to individual- or household-level non-response at particular years of data collection. It can be of interest to estimate the effect of an independent variable on an outcome variable via regression analysis. However, estimating this regression based only on the subset of the sample with non-missing outcome data may result in biased effect estimates if the missing values of the outcome variable differ systematically from the observed values after conditioning on the observed variables in the model, including observed model covariates and observed outcome data within the same cluster2. The CHNS collected a rich set of variables, and so if there is a set of observed (i.e., non-missing) variables that are related either to the outcome variable or to whether an individual provides outcome data, then it may be possible to use these auxiliary variables to adjust for the missing data in certain situations.
Many methods currently exist to accommodate missing data in regression analysis. One commonly used method is multiple imputation, which involves imputing (i.e., filling in) the missing data multiple times, performing identical analyses on each set of imputed data using standard statistical methods for complete data, and combining the results3. Another common method for handling missing data is inverse probability weighting (IPW), which involves estimating the probability of data being non-missing for each record, and performing statistical analysis among the sub-sample with non-missing data, weighted by the inverse of this estimated probability4. However, the validity of multiple imputation depends on correct specification of the imputation model3, and IPW requires correct specification of a model for the probability of missingness4. Therefore, there is a need for missing data methods that have the double robustness property, meaning that the method is unbiased if either the imputation model (i.e., model for the missing variable) or the probability of missingness model, but not necessarily both, is specified correctly. Several doubly robust methods have been developed to estimate a regression of an outcome variable on a predictor variable in the presence of missing outcome data for independent records. Scharfstein et al.5 proposed estimating the probability that the outcome is observed (i.e., non-missing) for each data record based on a specified working model for missingness, estimating the predicted outcome for each data record based on a specified working model for the outcome, and solving a set of estimating equations that depends on the predicted non-missing probabilities and predicted outcomes to estimate the regression coefficients of interest.
Zeng and Chen6 also proposed estimating the non-missing probability and predicted outcome for each data record based on specified working models, and then estimating the regression coefficients of interest from these predicted non-missing probabilities and predicted outcomes together with a partition of the support of the predictor variable(s) of interest. When the working model for either missingness or the outcome (or both) is specified correctly, both the Scharfstein et al.5 estimator and the Zeng and Chen6 estimator are consistent for the true effect of the predictor on the outcome. However, both of these methods were developed for independent data, and therefore estimate the non-missing probabilities and predicted outcomes from working models that ignore any multilevel data structure.
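The double robustness property for independent data can be illustrated numerically. The sketch below is not the exact estimator of Scharfstein et al.5 or Zeng and Chen6; it assumes a generic augmented inverse-probability-weighted estimating equation with an identity link, in which case the solution reduces to least squares on a pseudo-outcome. The function name `dr_linear_fit`, the simulated data-generating values, and the use of known (rather than fitted) working-model predictions are all assumptions made for illustration.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def dr_linear_fit(x, y, r, pi, eta):
    """Solve sum_i D_i [R_i/pi_i (Y_i - mu_i) - (R_i - pi_i)/pi_i (eta_i - mu_i)] = 0
    for mu_i = b0 + b1 * x_i and D_i = (1, x_i). With an identity link this is
    ordinary least squares on the pseudo-outcome eta_i + R_i/pi_i (Y_i - eta_i)."""
    y_filled = np.where(r == 1, y, 0.0)        # unobserved outcomes are never used
    pseudo = eta + r / pi * (y_filled - eta)   # augmented pseudo-outcome
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, pseudo, rcond=None)
    return beta

# Hypothetical independent-data example (all numeric values are illustrative).
rng = np.random.default_rng(2024)
n = 200_000
x = rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)               # auxiliary variable
y = 1.0 + x + z + rng.normal(size=n)           # implies E[Y | X] = 1 + 1.5 X
pi_true = expit(0.5 + z)                       # true P(R = 1 | Z)
r = rng.binomial(1, pi_true)
eta_true = 1.0 + x + z                         # true E[Y | X, Z]

pi_wrong = np.full(n, r.mean())                # ignores Z entirely
eta_wrong = np.zeros(n)                        # ignores X and Z entirely

beta_pi_ok = dr_linear_fit(x, y, r, pi_true, eta_wrong)
beta_eta_ok = dr_linear_fit(x, y, r, pi_wrong, eta_true)
beta_both_wrong = dr_linear_fit(x, y, r, pi_wrong, eta_wrong)
```

In this sketch, the estimates stay close to (1, 1.5) when either working quantity is correct, while the intercept drifts away from 1 when both are wrong. In practice both working models would be estimated from the data rather than supplied.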
We build upon the Scharfstein et al.5 method to propose a new doubly robust approach to estimate the association between a predictor and outcome variable for multilevel data, when the outcome variable is missing for some records. Since the CHNS data contain natural clusters, it is expected that missingness for different records within the same cluster may be correlated, and similarly the outcome variable may be correlated among different records within the same cluster. Therefore, we propose estimating the probability of missingness and the mean of the outcome variable conditional on cluster-specific random effects (in addition to observed data), which differs from existing doubly robust methods5,6 that estimate these quantities conditional on observed variables only. Allowing cluster-specific random effects in the working models may improve the plausibility that one or both of the working models will be specified correctly, particularly when there is high within-cluster correlation in the missingness and/or the outcome variable beyond what can be explained by observed covariates.
The rest of the paper is organized in the following way. Section 2 provides details for regression estimation using our new approach, and shows that this approach is doubly robust. Section 3 provides results from a simulation study comparing the numerical performance of this new approach with other existing methods. Section 4 illustrates the use of this approach on data from the CHNS. Section 5 concludes with a discussion.
2 |. DOUBLY ROBUST METHOD FOR SEMIPARAMETRIC REGRESSION WITH MISSING DATA
2.1 |. Notation
Without loss of generality, let us focus on the case with two-level data (e.g., repeated measures data on independent individuals), where j = 1, …, m denotes the cluster (i.e., level 2), i = 1, …, nj denotes the data record within cluster j (i.e., level 1), and n = n1 + ⋯ + nm denotes the total number of data records (a discussion of the extension to more than two levels is included in Section 5). Let Yij denote an outcome of interest, Rij denote an indicator that Yij is observed (i.e., non-missing), Xij denote a vector of predictors for Yij, and Zij denote a high-dimensional vector of auxiliary variables related to missingness and/or the outcome variable (including all variables in Xij). Let Rj = (R1j, …, Rnj,j)′, Yj = (Y1j, …, Ynj,j)′, Xj = (X1j, …, Xnj,j)′ be a matrix of dimension nj by p where p equals the number of predictor variables, Zj = (Z1j, …, Znj,j)′ be a matrix of dimension nj by q where q equals the number of auxiliary variables, and the data (Rj, Yj, Zj) be independent and identically distributed for j = 1, …, m. Let Y = (Y1, …, Ym)′ and R = (R1, …, Rm)′ be vectors of length n, X = (X1′, …, Xm′)′ be a matrix of dimension n by p, and Z = (Z1′, …, Zm′)′ be a matrix of dimension n by q. Let
$$E[Y_{ij} \mid X_{ij}] = \mu(X_{ij}^{T}\beta) \qquad (1)$$
be the semi-parametric regression model of interest, where β is an unknown vector of constant regression coefficients with dimension p, β* is the true value of β, and μ(·) is some known function of XTβ. Throughout the remainder of the paper, assume that Rij and Yij are independent conditional on the auxiliary variables Zij and independent cluster-specific random vectors aj and bj (i.e., Rij ⊥ Yij|Zij, aj, bj), that Rij depends on Zij and aj only (i.e., Rij ⊥ bj|Zij, aj) and Yij depends on Zij and bj only (i.e., Yij ⊥ aj|Zij, bj), and that the parameters for the joint distributions for (Rj, aj) and (Yj, bj) conditional on Zj are distinct (i.e., the model for the joint distribution for (Rj, aj) conditional on Zj and the model for the joint distribution for (Yj, bj) conditional on Zj do not share the parameters); see assumption (A1) in Web Appendix A in the Supplementary Materials. Note that taken together, these assumptions imply that the outcome data are missing at random (MAR)2; in other words, these assumptions imply that the outcome variable is independent of missingness, conditional on the observed data (i.e., Rij ⊥ Yij|Zij). These assumed relationships between the different variables described here are illustrated in Figure 1.
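The step from the conditional independence assumptions to MAR can be written out as a short derivation. This is a sketch under one additional stated assumption, namely that aj and bj are independent of each other conditional on Zij; the density notation f(·) is introduced here for illustration only.

```latex
\begin{aligned}
P(R_{ij}, Y_{ij} \mid Z_{ij})
 &= \iint P(R_{ij}, Y_{ij} \mid Z_{ij}, a_j, b_j)\, f(a_j, b_j \mid Z_{ij})\, da_j\, db_j \\
 &= \iint P(R_{ij} \mid Z_{ij}, a_j)\, P(Y_{ij} \mid Z_{ij}, b_j)\,
      f(a_j \mid Z_{ij})\, f(b_j \mid Z_{ij})\, da_j\, db_j \\
 &= P(R_{ij} \mid Z_{ij})\, P(Y_{ij} \mid Z_{ij}).
\end{aligned}
```

The second line uses Rij ⊥ Yij|Zij, aj, bj together with Rij ⊥ bj|Zij, aj and Yij ⊥ aj|Zij, bj; the third line integrates each factor over its own random effect.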
2.2 |. Proposed Doubly Robust Method for Multilevel Data
First, specify hierarchical working models for [Rj, aj|Zj] and [Yj, bj|Zj], where aj and bj are independent vectors of cluster-specific random effects to account for within-cluster correlation in missingness and the outcome variable respectively; let the cluster-specific random effects (aj, bj) be independent and identically distributed for j = 1, …, m. For example, generalized linear mixed effect models may be specified for [Rj, aj|Zj] and [Yj, bj|Zj], with fixed-effect linear predictors Zijα and Zijγ and cluster-specific random intercepts aj and bj respectively. If the random effects were known, then the predicted values from these working models (the predicted non-missing probability given Zij and aj, and the predicted outcome given Zij and bj) could be substituted in the set of doubly robust estimating equations for independent data introduced by Scharfstein et al.5, resulting in the following estimating equations conditional on the random effects:
(2) |
Solving this set of conditional estimating equations would result in a doubly robust estimator for β if the random effects aj and bj were known5. In practice, however, the random effects aj and bj are unknown and are instead assumed to follow some specified working distribution. Therefore, it is necessary to integrate the estimating equations over the posterior distribution of the random effects conditional on the observed data that is implied by the working models, to obtain a revised set of estimating equations that depend on observed data only. The working models comprise a working density for [Rij|Zij, aj] with parameters α, a working density for [Yij|Zij, bj] with parameters γ, a working density for the random effects aj with parameters τ, and a working density for the random effects bj with parameters ϕ. The posterior densities for the random effects (aj, bj) are conditioned on the observed data in cluster j. Assumption (A1) (see Web Appendix A in the Supplementary Materials) implies that the random effects aj and bj are independent conditional on the observed data, that the working posterior distribution for aj depends only on the working model [Rj, aj|Zj; α, τ], and that the working posterior distribution for bj depends only on the working model [Yj, bj|Zj; γ, ϕ]:
(3) |
Here, the posterior density of aj conditional on the observed data is based on the working model for [Rj, aj|Zj] and the estimated α and τ, and the posterior density of bj conditional on the observed data is based on the working model for [Yj, bj|Zj] and the estimated γ and ϕ.
Therefore, we obtain the following set of estimating equations:
(4) |
Since these estimating equations are a function of the observed data (R, RY, Z) only, we propose to estimate β by solving these estimating equations.
Since the working model parameters (α, τ, γ, ϕ) are unknown in practice, in order to solve the estimating equations, it is necessary to obtain estimates of these parameters, such as by maximizing the observed data likelihoods for Y and R. For example, if the models [Rj, aj|Zj] and [Yj, bj|Zj] are generalized linear mixed effects models, then the observed data likelihoods for R and Y are each a product over clusters of the working conditional densities integrated over the working random-effects density, and the working model parameter estimates are the values that maximize these observed data likelihoods. After estimating the working model parameters, these estimates are substituted into the estimating equations, and these estimating equations are solved for β to obtain the effect estimate of interest. The observed data likelihood functions for the working models can be maximized, and the final set of estimating equations solved, using the EM algorithm, Markov chain Monte Carlo, or Gauss-Hermite quadrature.
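For concreteness, the observed data likelihoods for random-intercept working models can be sketched as follows. The density symbols f_R, f_Y, g_a, and g_b are notation assumed here for illustration (the original symbols are not available), and the form for Y assumes that only records with Rij = 1 contribute, as is standard under MAR.

```latex
\begin{aligned}
L_R(\alpha, \tau) &= \prod_{j=1}^{m} \int \Bigl\{ \prod_{i=1}^{n_j}
    f_R(R_{ij} \mid Z_{ij}, a_j; \alpha) \Bigr\}\, g_a(a_j; \tau)\, da_j, \\
L_Y(\gamma, \phi) &= \prod_{j=1}^{m} \int \Bigl\{ \prod_{i=1}^{n_j}
    f_Y(Y_{ij} \mid Z_{ij}, b_j; \gamma)^{R_{ij}} \Bigr\}\, g_b(b_j; \phi)\, db_j .
\end{aligned}
```

Neither integral has a closed form for a logistic working model with a normal random intercept, which is why numerical integration is needed.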
2.3 |. Asymptotic Properties
We now present arguments to justify the consistency and asymptotic normality of the proposed estimator when at least one of the hierarchical working models, [Rj, aj|Zj] and/or [Yj, bj|Zj], is specified correctly. For example, the working model for [Rj, aj|Zj] would be specified correctly if both the working model for Rij conditional on aj and the working density for aj are specified correctly; similarly, the working model for [Yj, bj|Zj] would be specified correctly if both the working model for Yij conditional on bj and the working density for bj are specified correctly. Let ∂α,τ denote a vector of first-order partial derivatives with respect to α and τ, with a corresponding operator for the matrix of second-order derivatives; let ∂γ,ϕ and its second-order counterpart denote the corresponding operations with respect to γ and ϕ.
More generally, note that the proposed estimating equations (4) can be rewritten as follows:
(5) |
The estimating equations (5) are equivalent to a generalized estimating equations (GEE) model to estimate β. Therefore, it follows that this set of proposed estimating equations inherits the properties of standard GEE models (e.g., unique solution for β, identifiability of β from the observed data distribution) under some mild regularity conditions.
According to maximum likelihood theory for generalized linear mixed effect models7, the working model parameter estimates converge in probability to a constant (α*, τ*, γ*, ϕ*), where (α*, τ*) are the true parameter values if the working model [Rj, aj|Zj; α, τ] is correct, and (γ*, ϕ*) are the true parameter values if the working model [Yj, bj|Zj; γ, ϕ] is correct. It can be shown by Taylor series expansion of the log-likelihood for each specified working model around the limits of the maximum likelihood estimates of the parameters that
(6) |
Where , and
(7) |
where .
The following results show that the proposed estimating equations are unbiased for β if at least one of the working models is correct, and therefore that the proposed estimator is consistent for the true value β*. It is also shown that the proposed estimator is asymptotically normally distributed.
Lemma 1.
Let , and either the working model [Rj, aj|Zj] or the working model [Yj, bj|Zj] be correct. Then E[Sm(β*; α*, τ*, γ*, ϕ*)] = 0.
An outline of the proof is presented in the Supplementary Materials (Web Appendix B). In particular, this proof follows from (1) the independence of the posterior distributions of the random effects aj and bj, (2) a property that holds if working model [Rj, aj|Zj; α, τ] is correct, and (3) a property that holds if working model [Yj, bj|Zj; γ, ϕ] is correct.
Theorem 1.
Let , and either the working model [Rj, aj|Zj] or the working model [Yj, bj|Zj] be correct. Under assumptions (A1)-(A8) (see Web Appendix A in the Supplementary Materials), converges in probability to the true parameter value β*, and converges to a normal distribution with mean zero and a covariance matrix that can be estimated by
(8) |
where the averaging operator in (8) indicates the empirical mean, and u⊗2 = uuT for a p × 1 vector u.
An outline of the proof is presented in the Supplementary Materials (Web Appendix B). Essentially, the proof involves applying the asymptotic properties of the working model parameter estimates, Lemma 1, and a Taylor series expansion of the estimating equations around β* to obtain that
(9) |
The covariance estimator for the proposed estimator can be obtained from the empirical covariance of this expression, substituting the parameter estimates for (β*; α*, γ*, τ*, ϕ*) and substituting empirical means for expected values.
3 |. SIMULATION STUDY
3.1 |. General Set-Up
We conducted simulation studies to examine the performance of our proposed estimator and to compare its performance to existing methods in finite samples. One thousand datasets were simulated, each with 1000 clusters of 2 data records each (i.e., 1000 individuals with data for 2 time-points each). Let j indicate the individual and i = 1, 2 indicate the time-point. One time-varying predictor variable of interest, Xj = (X1j, X2j)′, was generated for each cluster from a multivariate normal distribution, where the first element of the random vector Xj corresponded to the first time-point and the second element corresponded to the second time-point. Similarly, three time-varying auxiliary variables (Z1,ij, Z2,ij, and Z3,ij) were generated for each cluster based on the value of Xij; for example, Z3,ij ~ Exp(mean = |0.7 + 0.2Xij|). In addition, one time-invariant auxiliary variable was generated for each cluster: Z4,1j = Z4,2j ~ Bernoulli(0.5). Two random intercepts, aj (used to generate missingness Rij) and bj (used to generate the outcome Yij), were independently generated from a normal distribution with mean 0 and variance 1. We considered the outcome variable Yij to be continuous (presented in this section) and binary (see Web Appendix C in the Supplementary Materials).
For the proposed method, separate mixed effects models were fit for missingness Rij (logistic mixed effect model) and the outcome Yij (linear mixed effect model) to estimate the working model parameters, each with a cluster-specific intercept specified as normally distributed with mean zero; in other words, the working model for missingness was logit{P (Rij = 1|Zij, aj)} = Zijα + aj and aj ~ N(0, τ), and the working model for the outcome was E[Yij |Zij, bj ] = Zijγ + bj and bj ~ N(0, ϕ), where Zij is a vector of covariates for data record i from cluster j. Since the random effects were assumed to be normally distributed, all intractable integrals were evaluated using Gauss-Hermite quadrature. The Newton-Raphson algorithm was used to solve the estimating equations Sm(β) for β. Mixed effects models were fit using PROC GLIMMIX in SAS. An R function written by the authors was used to solve the estimating equations and estimate the covariance matrix, using the estimated working model parameters.
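Gauss-Hermite quadrature for a normally distributed random intercept can be sketched in a few lines. This is an illustration in Python with NumPy, not the authors' SAS or R code; the helper name `normal_expectation` and the example values (intercept 1, variance 1, 30 nodes) are assumptions. The rule approximates E[g(a)] for a ~ N(0, variance) via the change of variable a = sqrt(2 · variance) · x.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def normal_expectation(g, variance, n_nodes=30):
    """Approximate E[g(a)] for a ~ N(0, variance) by Gauss-Hermite quadrature.

    hermgauss returns nodes and weights for integrals of exp(-x^2) f(x),
    so the nodes are rescaled and the sum is divided by sqrt(pi).
    """
    nodes, weights = hermgauss(n_nodes)
    a = np.sqrt(2.0 * variance) * nodes
    return np.sum(weights * g(a)) / np.sqrt(np.pi)

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

# Sanity check: E[a^2] should equal the variance.
second_moment = normal_expectation(lambda a: a ** 2, variance=2.0)

# Marginal non-missing probability under a random-intercept logistic model
# with a hypothetical fixed-effect linear predictor equal to 1 and tau = 1.
marginal_prob = normal_expectation(lambda a: expit(1.0 + a), variance=1.0)
conditional_prob = expit(1.0)  # probability for a cluster with a_j = 0
```

Here `marginal_prob` is smaller than `conditional_prob`, which illustrates why averaging over the random intercept changes the implied marginal probabilities when the link function is not the identity.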
The proposed method was also compared to an available case analysis (i.e., dropping missing records from the dataset prior to statistical analysis) and the method introduced by Scharfstein et al.5 for independent data (marginal approach). For the available case analysis, generalized estimating equations with an exchangeable working correlation matrix were used to estimate the marginal effect of Xij on Yij (ignoring the additional information provided by the auxiliary variables Zij). For the marginal approach, independent-data regression models were fit for Rij (logistic regression) and Yij (linear regression) conditional on Zij (ignoring the clustering in the data), and β was estimated as the solution to the estimating equations introduced by Scharfstein et al.5, where the non-missing probability was predicted from the estimated independent-data model for Rij and the outcome was predicted from the estimated independent-data model for Yij. Note that for the working models with a non-identity link function (e.g., logistic regression for Rij), even the marginal working model fit with the correct set of fixed effects was misspecified, since Rij and Yij were generated based on models conditional on a random intercept.
3.2 |. Misspecification of Working Models by Omitting an Important Covariate
First, we considered the performance of the proposed method when either working model was misspecified by omitting an important covariate. The outcome variable Yij was generated from a normal distribution with mean (1 + Z1,ij + Z2,ij + γ3 ∗ Z3,ij + Z4,ij + Xij + bj) and variance 1, where γ3 equaled 0.2 (weak effect) or 1 (strong effect). An indicator that Yij was observed (Rij) was generated from a Bernoulli distribution with probability logit−1(α0 − Z1,ij + Z2,ij + α3 ∗ Z3,ij − Z4,ij + Xij + aj), where α0 equaled 0.5 (20% missing) or −1 (35% missing), and α3 equaled 0.2 (weak effect) or 1 (strong effect). For both the proposed multilevel approach and the marginal approach, each working model was either fit using the correct set of fixed effects, or by excluding Z3,ij from the model.
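The data-generating mechanism described above can be sketched as follows. This is an illustrative Python version, not the authors' code. The outcome and missingness models, Z3,ij, Z4,ij, and the random intercepts follow the text; the distributions of Xj, Z1,ij, and Z2,ij are not available here, so the choices below are placeholder assumptions, and the resulting missingness rates will not necessarily match the 20% and 35% reported. The function name `simulate_dataset` is hypothetical.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def simulate_dataset(m=1000, alpha0=0.5, alpha3=0.2, gamma3=0.2, seed=0):
    """Simulate one dataset with m clusters of 2 records each."""
    rng = np.random.default_rng(seed)
    # ASSUMPTION: bivariate normal X with unit variances and correlation 0.5.
    cov = np.array([[1.0, 0.5], [0.5, 1.0]])
    x = rng.multivariate_normal(np.zeros(2), cov, size=m)   # shape (m, 2)
    # ASSUMPTION: placeholder distributions for Z1 and Z2, both depending on X.
    z1 = rng.normal(loc=0.5 * x, scale=1.0)
    z2 = rng.binomial(1, expit(x)).astype(float)
    # From the text: Z3 ~ Exp(mean = |0.7 + 0.2 X|).
    z3 = rng.exponential(scale=np.abs(0.7 + 0.2 * x))
    # From the text: Z4 ~ Bernoulli(0.5), constant within cluster.
    z4 = np.repeat(rng.binomial(1, 0.5, size=(m, 1)), 2, axis=1).astype(float)
    # From the text: independent N(0, 1) random intercepts.
    a = rng.normal(size=(m, 1))
    b = rng.normal(size=(m, 1))
    # Outcome model with the Z3 coefficient gamma3.
    mean_y = 1.0 + z1 + z2 + gamma3 * z3 + z4 + x + b
    y = rng.normal(loc=mean_y, scale=1.0)
    # Missingness model with intercept alpha0 and Z3 coefficient alpha3.
    p_obs = expit(alpha0 - z1 + z2 + alpha3 * z3 - z4 + x + a)
    r = rng.binomial(1, p_obs)
    y_obs = np.where(r == 1, y, np.nan)   # outcome is missing when r == 0
    return {"x": x, "z1": z1, "z2": z2, "z3": z3, "z4": z4, "r": r, "y_obs": y_obs}

data = simulate_dataset(m=500, seed=1)
```

Omitting `z3` from a working model fit to such data corresponds to the misspecification studied in this subsection.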
Table 1 presents bias, empirical standard deviation of the estimates (SDE), average estimated standard errors (ESE), mean square error (MSE), and coverage rates for 95% confidence intervals (CP) for the proposed multilevel approach. Table 2 presents ratios of the empirical variance and MSE for the multilevel approach to the available case and marginal approaches. The proposed multilevel approach exhibited essentially no bias when the working model for [Rj, aj|Zj], the working model for [Yj, bj|Zj], or both were specified correctly, consistent with the double robustness property. Bias for the multilevel approach tended to decrease as the percent missing decreased. Also, bias when the [Yj, bj|Zj] working model was misspecified tended to decrease as the magnitude of the omitted effect decreased. The 95% confidence interval coverage rates were close to the nominal level when at least one working model was specified correctly.
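The performance measures reported in the tables can be computed from the simulation replicates along the following lines. This is a sketch in Python; the function name `summarize`, the normal critical value 1.96, and the n − 1 divisor for the SDE are assumptions, since the exact conventions used are not stated.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z_crit=1.96):
    """Summarize simulation replicates for a single regression coefficient."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = estimates.mean() - truth              # average estimate minus truth
    sde = estimates.std(ddof=1)                  # empirical SD of the estimates
    ese = std_errors.mean()                      # average estimated standard error
    mse = np.mean((estimates - truth) ** 2)      # mean square error
    # Wald-type interval: estimate +/- z_crit * SE covers the truth.
    covered = np.abs(estimates - truth) <= z_crit * std_errors
    return {"bias": bias, "sde": sde, "ese": ese, "mse": mse,
            "cp": 100.0 * covered.mean()}

# Tiny made-up example with four replicates and a true coefficient of 1.
summary = summarize([0.9, 1.1, 1.0, 1.2], [0.1, 0.1, 0.1, 0.05], truth=1.0)
```

As a consistency check on the reported values, MSE is approximately bias² + SDE²: for the strong effect, 35% missing scenario with both working models misspecified, 0.246² + 0.137² ≈ 0.079, which matches the tabulated MSE for β0.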
TABLE 1.
Effect strength | % Miss. | R^a | Y^a | Bias (β0) | SDE (β0) | ESE (β0) | MSE (β0) | CP (β0) | Bias (β1) | SDE (β1) | ESE (β1) | MSE (β1) | CP (β1) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Weak | 20 | T | T | 0.003 | 0.104 | 0.103 | 0.011 | 95.0 | −0.002 | 0.055 | 0.054 | 0.003 | 95.0 |
Weak | 20 | T | F | 0.006 | 0.104 | 0.103 | 0.011 | 94.7 | −0.002 | 0.055 | 0.054 | 0.003 | 95.3 |
Weak | 20 | F | T | 0.003 | 0.104 | 0.103 | 0.011 | 94.7 | −0.002 | 0.055 | 0.054 | 0.003 | 95.2 |
Weak | 20 | F | F | 0.011 | 0.104 | 0.103 | 0.011 | 94.7 | −0.003 | 0.055 | 0.054 | 0.003 | 95.3 |
Weak | 35 | T | T | 0.005 | 0.127 | 0.123 | 0.016 | 95.4 | −0.002 | 0.064 | 0.062 | 0.004 | 95.6 |
Weak | 35 | T | F | 0.011 | 0.128 | 0.124 | 0.016 | 95.3 | −0.003 | 0.065 | 0.062 | 0.004 | 95.4 |
Weak | 35 | F | T | 0.005 | 0.127 | 0.123 | 0.016 | 95.2 | −0.002 | 0.064 | 0.061 | 0.004 | 95.6 |
Weak | 35 | F | F | 0.019 | 0.127 | 0.123 | 0.017 | 95.0 | −0.003 | 0.065 | 0.062 | 0.004 | 95.1 |
Strong | 20 | T | T | −0.001 | 0.107 | 0.105 | 0.011 | 95.1 | 0.000 | 0.059 | 0.057 | 0.003 | 94.3 |
Strong | 20 | T | F | 0.026 | 0.111 | 0.109 | 0.013 | 93.7 | −0.006 | 0.060 | 0.059 | 0.004 | 94.4 |
Strong | 20 | F | T | −0.001 | 0.106 | 0.105 | 0.011 | 95.1 | 0.000 | 0.058 | 0.057 | 0.003 | 94.6 |
Strong | 20 | F | F | 0.112 | 0.112 | 0.111 | 0.025 | 80.6 | −0.026 | 0.060 | 0.060 | 0.004 | 91.5 |
Strong | 35 | T | T | −0.004 | 0.125 | 0.163 | 0.016 | 95.1 | 0.001 | 0.065 | 0.080 | 0.004 | 94.7 |
Strong | 35 | T | F | 0.060 | 0.142 | 0.195 | 0.024 | 92.8 | −0.011 | 0.073 | 0.084 | 0.006 | 93.5 |
Strong | 35 | F | T | −0.004 | 0.120 | 0.159 | 0.014 | 94.6 | 0.001 | 0.063 | 0.079 | 0.004 | 94.7 |
Strong | 35 | F | F | 0.246 | 0.137 | 0.151 | 0.079 | 58.5 | −0.045 | 0.070 | 0.077 | 0.007 | 89.6 |
a T = Working model specified correctly. F = Working model misspecified by excluding the covariate Z3,ij.
TABLE 2.
Effect strength | % Miss. | R^a | Y^a | Emp var ratio^b (β0), available case | Emp var ratio^b (β0), marginal | MSE ratio^b (β0), available case | MSE ratio^b (β0), marginal | Emp var ratio^b (β1), available case | Emp var ratio^b (β1), marginal | MSE ratio^b (β1), available case | MSE ratio^b (β1), marginal |
---|---|---|---|---|---|---|---|---|---|---|---|
Weak | 20 | T | T | 0.994 | 0.844 | 0.992 | 0.845 | 1.103 | 0.858 | 1.104 | 0.858 |
Weak | 20 | T | F | 0.993 | 0.847 | 0.994 | 0.850 | 1.104 | 0.861 | 1.106 | 0.862 |
Weak | 20 | F | T | 0.993 | 0.846 | 0.991 | 0.847 | 1.101 | 0.861 | 1.102 | 0.861 |
Weak | 20 | F | F | 0.993 | 0.848 | 1.002 | 0.851 | 1.102 | 0.863 | 1.105 | 0.864 |
Weak | 35 | T | T | 1.002 | 0.671 | 0.971 | 0.672 | 1.042 | 0.667 | 1.040 | 0.667 |
Weak | 35 | T | F | 1.009 | 0.679 | 0.983 | 0.684 | 1.054 | 0.676 | 1.053 | 0.677 |
Weak | 35 | F | T | 0.997 | 0.671 | 0.966 | 0.671 | 1.037 | 0.668 | 1.036 | 0.669 |
Weak | 35 | F | F | 1.003 | 0.679 | 0.991 | 0.684 | 1.050 | 0.676 | 1.049 | 0.677 |
Strong | 20 | T | T | 0.901 | 0.891 | 0.626 | 0.891 | 0.965 | 0.907 | 0.905 | 0.907 |
Strong | 20 | T | F | 0.964 | 0.935 | 0.707 | 0.985 | 1.023 | 0.954 | 0.968 | 0.963 |
Strong | 20 | F | T | 0.892 | 0.913 | 0.620 | 0.912 | 0.951 | 0.928 | 0.892 | 0.928 |
Strong | 20 | F | F | 0.996 | 0.937 | 1.385 | 0.973 | 1.021 | 0.953 | 1.136 | 0.965 |
Strong | 35 | T | T | 0.877 | 0.739 | 0.398 | 0.738 | 0.915 | 0.762 | 0.824 | 0.762 |
Strong | 35 | T | F | 1.123 | 0.857 | 0.602 | 1.010 | 1.151 | 0.866 | 1.061 | 0.886 |
Strong | 35 | F | T | 0.811 | 0.795 | 0.368 | 0.795 | 0.848 | 0.819 | 0.764 | 0.819 |
Strong | 35 | F | F | 1.044 | 0.890 | 2.016 | 1.007 | 1.058 | 0.896 | 1.338 | 0.946 |
a T = Working model specified correctly. F = Working model misspecified by excluding the covariate Z3,ij.
b Ratio comparing multilevel approach to corresponding comparison method.
The proposed standard error estimator for the multilevel approach approximated the SDE well in most cases. One exception was the scenario with 35% missing data and a strong omitted effect, where the estimated standard errors greatly overestimated the SDE for a few simulated datasets, resulting in an ESE that was considerably larger than the SDE; when the simulated datasets with the largest estimated standard errors were removed (≤5 datasets), the ESE approximated the SDE as well as in the other scenarios. Additionally, when the proportion of missing data in each dataset was reduced to 30% (results not shown), the ESE approximated the SDE well, suggesting that although the proposed standard error estimator performs well for moderate amounts of missing data (e.g., ≤30%), it may overestimate the variability in some situations when the proportion of missing data is particularly high. Both the empirical variance and MSE were almost always smaller for the proposed multilevel approach than for the marginal approach that ignored the clustering, and this difference between the two approaches was generally greater when the percent missing was higher. Analogous simulation results considering a binary outcome variable (Yij) are presented in the Supplementary Materials (Web Tables 1 and 2); results from simulations with a binary outcome variable were similar to those with the continuous outcome presented here, except that the reductions in the empirical variance and MSE for the proposed method compared to the marginal method were smaller for the binary outcome than for the continuous outcome.
3.3 |. Misspecification of Working Models by Omitting a Non-Linear Effect
We also considered the performance of the proposed method when either working model was misspecified by omitting a quadratic term. The outcome variable Yij was generated from a normal distribution with mean (1 + Z1,ij + Z2,ij + Z3,ij + Z4,ij + γ5 ∗ [quadratic term] + Xij + bj) and variance 1, where γ5 equaled 0.1 (weak effect) or 0.5 (strong effect). An indicator that Yij was observed (Rij) was generated from a Bernoulli distribution with probability logit−1(α0 − Z1,ij + Z2,ij + Z3,ij − Z4,ij + α5 ∗ [quadratic term] + Xij + aj), where α0 equaled 0.5 (20% missing) or −1 (35% missing) and α5 equaled −0.1 (weak effect) or −0.5 (strong effect). For both the proposed multilevel approach and the marginal approach, each working model was either fit using the correct set of fixed effects, or by excluding the quadratic term from the model.
Table 3 presents bias, SDE, ESE, MSE, and CP for the proposed multilevel approach. Table 4 presents ratios of the empirical variance and MSE for the multilevel approach to the available case and marginal approaches. The proposed multilevel approach exhibited essentially no bias when the working model for [Rj, aj|Zj], the working model for [Yj, bj|Zj], or both were specified correctly, consistent with the double robustness property. Bias for the multilevel approach tended to decrease as the percent missing decreased. Also, bias when the [Yj, bj|Zj] working model was misspecified tended to decrease as the magnitude of the omitted effect decreased. The 95% confidence interval coverage rates were close to the nominal level for almost all cases where at least one working model was specified correctly. The proposed standard error estimator for the multilevel approach approximated the SDE well in most cases. Both the empirical variance and MSE were almost always smaller for the proposed multilevel approach than for the marginal approach, and this difference between the two approaches was generally greater when the percent missing was higher. Analogous simulation results considering a binary outcome variable (Yij) are presented in the Supplementary Materials (Web Tables 3 and 4); results from simulations with a binary outcome variable were similar to those with the continuous outcome presented here, except that the reductions in the empirical variance and MSE for the proposed method compared to the marginal method were smaller for the binary outcome than for the continuous outcome.
TABLE 3.
Effect strength | % Miss. | R^a | Y^a | β0 Bias | β0 SDE | β0 ESE | β0 MSE | β0 CP | β1 Bias | β1 SDE | β1 ESE | β1 MSE | β1 CP
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Weak | 20 | T | T | −0.001 | 0.110 | 0.109 | 0.012 | 95.2 | 0.001 | 0.060 | 0.059 | 0.004 | 94.9
Weak | 20 | T | F | −0.010 | 0.111 | 0.109 | 0.012 | 95.2 | 0.004 | 0.061 | 0.060 | 0.004 | 94.8
Weak | 20 | F | T | −0.001 | 0.110 | 0.108 | 0.012 | 95.2 | 0.001 | 0.060 | 0.059 | 0.004 | 95.2
Weak | 20 | F | F | −0.014 | 0.110 | 0.109 | 0.012 | 95.0 | 0.006 | 0.060 | 0.060 | 0.004 | 95.1
Weak | 35 | T | T | −0.002 | 0.128 | 0.124 | 0.016 | 95.0 | 0.001 | 0.067 | 0.065 | 0.004 | 94.8
Weak | 35 | T | F | −0.017 | 0.128 | 0.125 | 0.017 | 94.5 | 0.006 | 0.067 | 0.066 | 0.005 | 94.2
Weak | 35 | F | T | −0.002 | 0.126 | 0.122 | 0.016 | 95.0 | 0.001 | 0.066 | 0.065 | 0.004 | 94.6
Weak | 35 | F | F | −0.024 | 0.127 | 0.123 | 0.017 | 94.4 | 0.008 | 0.067 | 0.065 | 0.005 | 94.2
Strong | 20 | T | T | 0.001 | 0.134 | 0.130 | 0.018 | 94.3 | −0.001 | 0.075 | 0.073 | 0.006 | 94.2
Strong | 20 | T | F | −0.059 | 0.148 | 0.143 | 0.025 | 91.1 | 0.015 | 0.081 | 0.079 | 0.007 | 93.6
Strong | 20 | F | T | 0.000 | 0.132 | 0.129 | 0.017 | 94.8 | −0.000 | 0.074 | 0.072 | 0.005 | 94.7
Strong | 20 | F | F | −0.163 | 0.133 | 0.131 | 0.044 | 75.1 | 0.037 | 0.075 | 0.073 | 0.007 | 91.6
Strong | 35 | T | T | 0.008 | 0.156 | 0.149 | 0.024 | 93.8 | −0.004 | 0.083 | 0.080 | 0.007 | 94.9
Strong | 35 | T | F | −0.091 | 0.186 | 0.170 | 0.043 | 88.8 | 0.019 | 0.098 | 0.091 | 0.010 | 93.0
Strong | 35 | F | T | 0.008 | 0.150 | 0.144 | 0.022 | 94.7 | −0.004 | 0.080 | 0.078 | 0.006 | 95.2
Strong | 35 | F | F | −0.215 | 0.156 | 0.149 | 0.071 | 67.4 | 0.038 | 0.083 | 0.080 | 0.008 | 91.1
^a T = Working model specified correctly. F = Working model misspecified by excluding the quadratic term.
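The summary measures reported in Table 3 can be computed from simulation replicates along the following lines. This is a minimal sketch, not the authors' code: the function name `summarize` is hypothetical, and it assumes the usual definitions (SDE as the empirical standard deviation of the estimates, ESE as the average estimated standard error, CP as the percentage of Wald 95% intervals covering the true value), which are not restated in this section.

```python
import statistics

def summarize(estimates, std_errors, truth, z=1.96):
    """Summarize simulation replicates for a single regression coefficient.

    Assumed definitions (not restated in this section of the paper):
      bias = mean estimate minus the true value
      sde  = empirical standard deviation of the estimates
      ese  = mean of the estimated standard errors
      mse  = mean squared error of the estimates
      cp   = percentage of Wald intervals (estimate +/- z * SE) covering the truth
    """
    n = len(estimates)
    bias = statistics.fmean(estimates) - truth
    sde = statistics.stdev(estimates)
    ese = statistics.fmean(std_errors)
    mse = statistics.fmean([(e - truth) ** 2 for e in estimates])
    covered = sum(abs(e - truth) <= z * s for e, s in zip(estimates, std_errors))
    return {"bias": bias, "sde": sde, "ese": ese, "mse": mse, "cp": 100.0 * covered / n}

# Toy replicates, for illustration only (not data from the paper).
toy = summarize([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], truth=2.0)
print(toy)

# Consistency check against Table 3 (strong effect, 35% missing, both models
# misspecified, beta0): MSE is approximately bias^2 + SDE^2.
mse_from_table = (-0.215) ** 2 + 0.156 ** 2
print(round(mse_from_table, 3))  # 0.071, matching the tabulated MSE
```

The last two lines illustrate why the MSE for β0 grows sharply when both working models are misspecified: the squared bias term dominates while the SDE changes little.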
TABLE 4.
Effect strength | % Miss. | R^a | Y^a | β0 emp. var. ratio^b: available case | β0 emp. var. ratio^b: marginal | β0 MSE ratio^b: available case | β0 MSE ratio^b: marginal | β1 emp. var. ratio^b: available case | β1 emp. var. ratio^b: marginal | β1 MSE ratio^b: available case | β1 MSE ratio^b: marginal
---|---|---|---|---|---|---|---|---|---|---|---
Weak | 20 | T | T | 0.825 | 0.905 | 0.612 | 0.905 | 0.873 | 0.911 | 0.842 | 0.911
Weak | 20 | T | F | 0.835 | 0.898 | 0.625 | 0.906 | 0.889 | 0.903 | 0.862 | 0.907
Weak | 20 | F | T | 0.818 | 0.920 | 0.606 | 0.920 | 0.865 | 0.925 | 0.835 | 0.925
Weak | 20 | F | F | 0.825 | 0.917 | 0.622 | 0.929 | 0.875 | 0.921 | 0.852 | 0.927
Weak | 35 | T | T | 0.838 | 0.740 | 0.434 | 0.740 | 0.859 | 0.758 | 0.824 | 0.758
Weak | 35 | T | F | 0.845 | 0.727 | 0.445 | 0.740 | 0.870 | 0.744 | 0.841 | 0.750
Weak | 35 | F | T | 0.814 | 0.775 | 0.422 | 0.775 | 0.837 | 0.791 | 0.803 | 0.791
Weak | 35 | F | F | 0.825 | 0.776 | 0.442 | 0.795 | 0.852 | 0.792 | 0.828 | 0.800
Strong | 20 | T | T | 0.829 | 0.891 | 0.307 | 0.891 | 0.883 | 0.910 | 0.818 | 0.910
Strong | 20 | T | F | 1.007 | 0.765 | 0.433 | 0.888 | 1.042 | 0.791 | 0.996 | 0.816
Strong | 20 | F | T | 0.796 | 0.936 | 0.295 | 0.936 | 0.864 | 0.943 | 0.800 | 0.943
Strong | 20 | F | F | 0.819 | 0.950 | 0.754 | 1.004 | 0.884 | 0.959 | 1.020 | 0.982
Strong | 35 | T | T | 0.767 | 0.774 | 0.235 | 0.774 | 0.791 | 0.807 | 0.735 | 0.808
Strong | 35 | T | F | 1.087 | 0.680 | 0.412 | 0.838 | 1.117 | 0.712 | 1.073 | 0.736
Strong | 35 | F | T | 0.703 | 0.857 | 0.216 | 0.857 | 0.739 | 0.880 | 0.686 | 0.880
Strong | 35 | F | F | 0.766 | 0.893 | 0.677 | 1.008 | 0.789 | 0.903 | 0.887 | 0.940
^a T = Working model specified correctly. F = Working model misspecified by excluding the quadratic term.
^b Ratio comparing the multilevel approach to the corresponding comparison method.
4 |. APPLICATION TO CHNS
We applied the proposed method to data from the CHNS. Starting in 2009, the CHNS collected fasting blood samples from participants aged seven years or older, providing data on a variety of cardiovascular and nutrition biomarkers8. For our analysis, we were interested in estimating the mean trajectory of triglycerides measured via fasting blood samples, both as a continuous outcome (mg/dL) and as a dichotomized outcome (high triglycerides defined as ≥ 150 mg/dL), adjusted for sex and age in 2009 (the first wave that collected fasting blood samples). Fasting blood samples were collected during the 2009 and 2015 study waves, and so the analysis was restricted to those two study years. The analysis was also restricted to participants who were from the nine provinces included in the study in 2009, were adults (at least 18 years old) in 2009, and participated in the 2009 and/or 2015 wave of data collection (13,370 study participants). However, 7,225 individuals from this sample participated in only one of the 2009 and 2015 waves, and an additional 1,444 individuals who participated in both waves did not provide a valid fasting blood sample in 2009 and/or 2015. In total, 7,141 individuals in the analytic sample were missing biomarker data for one wave and 1,528 individuals were missing biomarker data for both the 2009 and 2015 waves, so that biomarker data were missing for 38.1% of data records in the analytic sample.
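The 38.1% figure follows directly from the counts above, since each participant contributes two data records (one per wave). The short check below reproduces the arithmetic; the variable names are illustrative only.

```python
# Counts reported in the text.
n_participants = 13_370
records_per_person = 2            # one record each for the 2009 and 2015 waves
missing_one_wave = 7_141          # individuals missing the biomarker for one wave
missing_both_waves = 1_528        # individuals missing the biomarker for both waves

total_records = n_participants * records_per_person
missing_records = missing_one_wave + 2 * missing_both_waves
pct_missing = 100 * missing_records / total_records

# The two ways of counting affected individuals in the text agree.
assert 7_225 + 1_444 == missing_one_wave + missing_both_waves

print(total_records, missing_records, round(pct_missing, 1))  # 26740 10197 38.1
```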
For the purposes of this example analysis, the proposed method accounted for within-individual clustering only (i.e., clusters of 2 data records each) and ignored higher levels of clustering (Section 5 discusses the extension of the proposed method to more than one level of clustering). For the working models for missingness and the outcome, we considered individual-level variables, household-level variables, community-level variables, and study design variables; see Table 5 for a complete list of covariates included in the working models. While some covariates, such as time-invariant variables and variables that could be calculated based on the study design, were available for all records for all individuals in the analytic sample, time-varying variables were missing for waves in which the individual, household, or community of residence did not participate in data collection. In addition, some participants refused to provide data for variables at some waves (i.e., variable-level missingness). Since the focus of this example is to handle missing data on the response variable (i.e., triglycerides), and since most individuals had data for these covariates from other waves, missing data on time-varying auxiliary variables were handled in the following way: (1) if the covariate was reported in other waves of data collection, then the missing covariate was filled in with the value of the variable from the closest wave in which the variable was observed, and (2) if the individual did not report the covariate at any wave, then the individual was dropped from the analytic sample. In total, 381 individuals (2.8% of the analytic sample) were dropped because they had no observed data at any wave for at least one of the time-varying auxiliary variables, resulting in a final sample size of 12,989 participants.
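The closest-wave fill-in rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name, the example wave years, and the example values are hypothetical, and the tie-breaking rule (preferring the earlier wave when two observed waves are equally close) is an assumption, since the text does not say how ties were resolved.

```python
def fill_from_nearest_wave(values_by_wave):
    """Fill missing values of one time-varying covariate for one individual.

    values_by_wave maps a wave year to the observed value, or to None if missing.
    Returns a new dict with each missing value replaced by the value from the
    closest wave in which the covariate was observed. Returns None if the
    covariate was never observed, signalling that the individual is dropped.
    Assumption: ties in distance are broken in favor of the earlier wave.
    """
    observed = {w: v for w, v in values_by_wave.items() if v is not None}
    if not observed:
        return None
    filled = {}
    for wave, value in values_by_wave.items():
        if value is None:
            nearest = min(observed, key=lambda w: (abs(w - wave), w))
            value = observed[nearest]
        filled[wave] = value
    return filled

# Hypothetical body mass index values for one participant.
print(fill_from_nearest_wave({2006: 23.1, 2009: None, 2015: 24.0}))
# {2006: 23.1, 2009: 23.1, 2015: 24.0}
print(fill_from_nearest_wave({2009: None, 2015: None}))  # None (individual dropped)
```

Because this is a single deterministic fill-in, it does not propagate uncertainty about the auxiliary variables; the text treats it as a pragmatic step so that the example can focus on missingness in the outcome.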
TABLE 5.
Level | Covariates
---|---
Individual | Time since 2009; sex; age in 2009; education level; current marital status; current employment status; body mass index; waist circumference; current smoking status; current alcohol consumption; average dietary intake of nutrients from 3 daily 24-hour dietary recalls; MET-hours per week of physical activity from different lifestyle activities (including interactions with time)
Household | Total gross household income; total household expenses; household income from different sources; composite score summarizing assets owned by at least one household member
Community | Components of the urbanization index9: population density, economic activity, traditional markets, modern markets, transportation infrastructure, sanitation, communications, housing, education, diversity, health infrastructure, social services
Design | Province; city/county
The final regression models of interest were the linear regression model E[triglycerides | time, sex, age] = β0 + β1·time + β2·sex + β3·age and the logistic regression model logit{P(high triglycerides | time, sex, age)} = β0 + β1·time + β2·sex + β3·age, where continuous triglycerides (mg/dL) and dichotomized triglycerides (≥ 150 mg/dL) were the outcome variables for the linear and logistic regressions respectively, and the predictor variables were time since 2009, sex, and age in 2009. These final regression models were estimated using available case analysis, the marginal approach ignoring clustering, and the proposed multilevel approach (i.e., using the same methods as in the simulation study).
For the proposed multilevel approach, the working model for missingness was specified as a logistic mixed effect model and the working model for the outcome as a generalized linear mixed effect model (a linear mixed effect model for continuous triglycerides and a logistic mixed effect model for dichotomized triglycerides), each with a random intercept for the individual and fixed effects for the covariates listed in Table 5 and the specified interactions; standard errors were estimated using the proposed estimator for the asymptotic covariance matrix introduced in (8). For the marginal approach (i.e., ignoring clustering), the working models were the corresponding fixed-effects-only models: a logistic regression model for missingness, and a linear regression model for continuous triglycerides or a logistic regression model for dichotomized triglycerides, with the same fixed effects as in the multilevel working models; standard errors were estimated using a bootstrap procedure. Table 6 presents the regression coefficients and standard errors based on all three methods for both outcomes.
All methods suggest that triglycerides were higher for older individuals, higher for men, and lower in 2015 than in 2009, based on models for both continuous triglycerides (mg/dL) and high triglycerides (≥ 150 mg/dL). Most estimated associations for the available case approach were attenuated compared to the proposed multilevel approach, especially the estimated associations of age and time with continuous triglycerides and the estimated association of time with dichotomized triglycerides. In addition, standard errors for the marginal approach were similar to or larger than those for the proposed multilevel approach, which was consistent with the results from the simulation study in Section 3.
TABLE 6.
Method | Effect | Triglycerides (mg/dL): Estimate | Triglycerides (mg/dL): SE | High triglycerides (≥ 150 mg/dL): Estimate | High triglycerides (≥ 150 mg/dL): SE
---|---|---|---|---|---
Multilevel | Intercept | 163.29 | 2.120 | −0.563 | 0.030
Multilevel | Age in 2009 (years) | 0.195 | 0.066 | 0.006 | 0.001
Multilevel | Women | −24.744 | 2.208 | −0.311 | 0.038
Multilevel | Time since 2009 (years) | −2.863 | 0.293 | −0.037 | 0.005
Marginal | Intercept | 162.15 | 2.942 | −0.569 | 0.042
Marginal | Age in 2009 (years) | 0.188 | 0.071 | 0.005 | 0.001
Marginal | Women | −23.929 | 2.311 | −0.308 | 0.039
Marginal | Time since 2009 (years) | −2.856 | 0.390 | −0.037 | 0.007
Available Case | Intercept | 162.04 | 1.738 | −0.586 | 0.030
Available Case | Age in 2009 (years) | 0.129 | 0.072 | 0.006 | 0.001
Available Case | Women | −23.536 | 2.160 | −0.293 | 0.037
Available Case | Time since 2009 (years) | −2.255 | 0.267 | −0.026 | 0.005
Abbreviations: SE, standard error.
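To aid interpretation of Table 6, the multilevel estimates can be translated onto more familiar scales. The snippet below is a reading aid only: it assumes that the logistic coefficients are log odds ratios and that the 2015 wave corresponds to 6 years since 2009, and the variable names are illustrative.

```python
import math

# Multilevel estimates taken from Table 6.
time_slope_mgdl = -2.863   # change in mean triglycerides (mg/dL) per year since 2009
log_or_women = -0.311      # log odds ratio of high triglycerides, women vs. men
log_or_time = -0.037       # log odds ratio of high triglycerides per year since 2009
years = 2015 - 2009

# Estimated change in mean triglycerides between the two waves, holding sex and
# age in 2009 fixed.
change_2009_to_2015 = time_slope_mgdl * years

# Odds ratios implied by the logistic model.
or_women = math.exp(log_or_women)
or_six_years = math.exp(log_or_time * years)

print(round(change_2009_to_2015, 1))  # -17.2 mg/dL
print(round(or_women, 2))             # 0.73
print(round(or_six_years, 2))         # 0.8
```

Under these assumptions, the multilevel fit corresponds to a decline of roughly 17 mg/dL in mean triglycerides between 2009 and 2015, compared with roughly 13.5 mg/dL (−2.255 × 6) for the available case analysis.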
5 |. DISCUSSION
This research extended the doubly robust approach for handling missing outcome data in semi-parametric regression introduced by Scharfstein et al.5 to the case with clustered, and thereby correlated, data. The new approach estimates separate hierarchical working models for the missingness mechanism and the outcome, with random effects specified in each model to account for within-cluster correlation. A set of estimating equations was proposed, in which the estimating equations are averaged across the unknown random effects. This approach was shown to have the double robustness property, and was shown in simulation studies to be generally more precise than the approach ignoring clustering in the data.
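The double robustness property can be illustrated with a deliberately simplified toy example. The sketch below is not the proposed multilevel estimator: it uses independent data, estimates a population mean rather than regression coefficients, and plugs in fixed working models instead of fitted ones. The data-generating values, function names, and sample size are all hypothetical. It shows that an augmented inverse probability weighted estimator stays close to the truth when either the missingness model or the outcome model is correct, while the complete case mean is biased.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate(n, seed=2024):
    """Generate (z, y, r): auxiliary variable, outcome, and observation indicator."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.uniform(-1.0, 1.0)
        y = 1.0 + z + z * z + rng.gauss(0.0, 1.0)    # true outcome model is quadratic in z
        r = 1 if rng.random() < expit(0.5 + z) else 0  # missingness depends on z
        rows.append((z, y, r))
    return rows

def complete_case_mean(rows):
    observed = [y for _, y, r in rows if r == 1]
    return sum(observed) / len(observed)

def aipw_mean(rows, propensity_model, outcome_model):
    """Augmented IPW estimate of E[Y]: average of m(z) + r * (y - m(z)) / pi(z)."""
    total = 0.0
    for z, y, r in rows:
        m = outcome_model(z)
        total += m + r * (y - m) / propensity_model(z)  # y is only used when r == 1
    return total / len(rows)

rows = simulate(50_000)
truth = 1.0 + 1.0 / 3.0   # E[Y] = 1 + E[Z] + E[Z^2] for Z ~ Uniform(-1, 1)

true_propensity = lambda z: expit(0.5 + z)
wrong_propensity = lambda z: 0.6             # ignores z entirely
true_outcome = lambda z: 1.0 + z + z * z
wrong_outcome = lambda z: 1.0 + z            # omits the quadratic term

cc = complete_case_mean(rows)
dr_outcome_wrong = aipw_mean(rows, true_propensity, wrong_outcome)
dr_propensity_wrong = aipw_mean(rows, wrong_propensity, true_outcome)
print(truth, cc, dr_outcome_wrong, dr_propensity_wrong)
```

With this setup the complete case mean overestimates the truth because records with larger z are both more likely to be observed and have larger outcomes, whereas both doubly robust estimates should land near 4/3.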
We derived the asymptotic covariance matrix for the proposed doubly robust estimator, and proposed an empirical covariance estimator based on these asymptotic results. An alternative would be to estimate the covariance matrix using a bootstrap approach. Nonparametric bootstrap sampling approaches for variance estimation with clustered data have been described elsewhere, where the recommended approach involves randomly sampling the highest-level clusters with replacement and then selecting all data records within each sampled cluster10. Further research is needed to verify the performance of a bootstrap approach for variance estimation for the proposed doubly robust estimator.
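The cluster-level resampling scheme described above can be sketched as follows. This is a generic illustration, not a validated variance estimator for the proposed method: the function name `cluster_bootstrap_se`, the record layout (cluster identifier first), and the toy estimator are assumptions made for the example.

```python
import random
import statistics
from collections import defaultdict

def cluster_bootstrap_se(records, estimator, n_boot=200, seed=0):
    """Bootstrap standard error that resamples whole top-level clusters.

    records:   list of tuples whose first element is the cluster identifier
    estimator: function mapping a list of records to a single number
    Each bootstrap sample draws cluster identifiers with replacement and keeps
    every record belonging to each drawn cluster.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec[0]].append(rec)
    cluster_ids = list(by_cluster)

    estimates = []
    for _ in range(n_boot):
        drawn = rng.choices(cluster_ids, k=len(cluster_ids))  # with replacement
        resample = [rec for cid in drawn for rec in by_cluster[cid]]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

def mean_outcome(recs):
    return sum(rec[1] for rec in recs) / len(recs)

# Toy data: 30 clusters of 2 records each, with a cluster-specific shift.
toy_rng = random.Random(1)
toy_records = []
for cid in range(30):
    shift = toy_rng.gauss(0.0, 1.0)
    toy_records += [(cid, shift + toy_rng.gauss(0.0, 0.5)) for _ in range(2)]

se = cluster_bootstrap_se(toy_records, mean_outcome)
print(se)
```

When the estimator involves fitting a mixed model, a cluster drawn more than once would typically need to be relabeled with a fresh identifier in each copy so that the copies are treated as distinct clusters; the sketch above omits that step because the toy estimator ignores the identifier.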
The proposed method accounts for within-cluster correlation by including random cluster-specific effects in the working models, with an assumed distribution for the random effects. An alternative approach could have instead incorporated cluster-specific fixed effects11. There are a couple of key advantages to the random effects modeling approach employed in the proposed method. First, the CHNS data considered in Section 4 contained a large number of clusters with comparatively few data records per cluster (i.e., data were clustered within 12,989 individuals with 2 records per cluster), and maximum likelihood estimation may be unstable and may not be statistically consistent for a model with fixed effects for such a large number of relatively small clusters12. By contrast, a random effects modeling approach reduces the number of unknown parameters estimated by maximum likelihood, and allows prediction of cluster-specific random effects to “borrow” information from the average (i.e., marginal) distribution, where the amount of “borrowing” for each cluster depends on cluster size (with smaller clusters “borrowing” more)13,14. Second, a random effects framework can easily accommodate cluster-specific regression coefficients in addition to a cluster-specific intercept (e.g., a cluster-specific slope for time when modeling longitudinal data). One notable disadvantage of the random effects modeling approach is that it requires additional distributional assumptions about the random effects, which are not required for a comparable fixed effects modeling approach. Although it is not possible to verify that the working distribution for the random effects is specified correctly, the sensitivity of the working model to these distributional assumptions can be tested15,16.
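The dependence of “borrowing” on cluster size can be made concrete in the simplest special case, a linear random-intercept model with known variance components, where the predicted cluster mean is a weighted average of the cluster's own mean and the overall mean. The sketch below uses that textbook special case only; it is not taken from the paper's working models, and the variance values are hypothetical.

```python
def shrinkage_weight(var_between, var_within, cluster_size):
    """Weight on the cluster's own mean in a linear random-intercept model.

    weight = var_between / (var_between + var_within / cluster_size)
    A weight near 1 means little borrowing; a weight near 0 means the
    prediction is pulled almost entirely toward the overall mean.
    """
    return var_between / (var_between + var_within / cluster_size)

def predicted_cluster_mean(cluster_mean, overall_mean, weight):
    return overall_mean + weight * (cluster_mean - overall_mean)

# Hypothetical variance components: between-cluster 1.0, within-cluster 4.0.
for size in (2, 20):
    w = shrinkage_weight(1.0, 4.0, size)
    print(size, round(w, 3), round(predicted_cluster_mean(10.0, 4.0, w), 2))
# 2 0.333 6.0
# 20 0.833 9.0
```

With only 2 records per cluster, as in the CHNS example, two thirds of the distance between the cluster mean and the overall mean is removed in this illustration, whereas a cluster of 20 records retains most of its own information.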
In practice, a convenient choice for the working distribution of the random effects would be a normal distribution, since assuming normally distributed random effects allows the use of Gauss-Hermite quadrature to approximate integrals (e.g., for maximizing the observed data likelihoods and for solving the proposed estimating equations in (4)), which is generally simpler and faster to implement than alternative approaches such as the EM algorithm or Markov chain Monte Carlo. Note, however, that the asymptotic properties of the estimator hold regardless of whether the true random effects are normally distributed (as long as at least one of the working models [Rj, aj|Zj] and [Yj, bj|Zj] is specified correctly). Exploratory simulation studies have shown a benefit in precision for the effect estimates of interest when the random effects distributions for both the missingness and outcome models were specified correctly, but future research should explore this in detail.
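As a small illustration of how Gauss-Hermite quadrature averages a quantity over a normally distributed random intercept, the sketch below uses a fixed 5-point rule built from closed-form nodes and weights. It is not the authors' implementation (their R function is described in the Supporting Information); the function names and the choice of five nodes are assumptions, and practical software would usually use more nodes or an adaptive rule.

```python
import math

# 5-point Gauss-Hermite rule for the weight function exp(-x^2), in closed form:
# nodes are the roots of the degree-5 Hermite polynomial.
_S10 = math.sqrt(10.0)
_X_INNER = math.sqrt((5.0 - _S10) / 2.0)
_X_OUTER = math.sqrt((5.0 + _S10) / 2.0)
_SQRT_PI = math.sqrt(math.pi)
GH_NODES = (-_X_OUTER, -_X_INNER, 0.0, _X_INNER, _X_OUTER)
GH_WEIGHTS = (
    _SQRT_PI * (7.0 - 2.0 * _S10) / 60.0,
    _SQRT_PI * (7.0 + 2.0 * _S10) / 60.0,
    _SQRT_PI * 8.0 / 15.0,
    _SQRT_PI * (7.0 + 2.0 * _S10) / 60.0,
    _SQRT_PI * (7.0 - 2.0 * _S10) / 60.0,
)

def normal_expectation(g, sigma):
    """Approximate E[g(b)] for b ~ N(0, sigma^2) via the substitution b = sqrt(2)*sigma*x."""
    total = sum(w * g(math.sqrt(2.0) * sigma * x) for x, w in zip(GH_NODES, GH_WEIGHTS))
    return total / _SQRT_PI

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def marginal_probability(eta, sigma):
    """P(event) averaged over a random intercept b ~ N(0, sigma^2) on the logit scale."""
    return normal_expectation(lambda b: expit(eta + b), sigma)

print(normal_expectation(lambda b: b * b, 1.5))   # close to sigma^2 = 2.25
print(marginal_probability(0.0, 1.0))             # close to 0.5 by symmetry
print(marginal_probability(1.0, 1.0), expit(1.0)) # the average is pulled toward 0.5
```

The rule is exact for polynomials up to degree 9, which is why the variance check recovers 2.25; for the logistic integrand it is only an approximation whose accuracy depends on the number of nodes and on the random effect variance.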
Hierarchical working models have been employed elsewhere to adjust for bias due to informative missing data or confounding in statistical analyses of clustered data. Kasim and Raudenbush17 have developed a two-level linear imputation model for normally-distributed data that can be used with fully conditional specification imputation methods, which has been implemented in the mice package in R18. In addition, random effects models have been used to estimate propensity scores to adjust for unmeasured cluster-level confounding when estimating causal effects19,20. The methods proposed in this research have further contributed to this growing literature by employing hierarchical working models to model both missingness and a regression outcome in a doubly robust approach to adjust for bias due to informatively missing data.
The approach proposed here addresses missing data in the outcome variable for a semi-parametric regression. However, there may be cases with missing data on the predictor variables for the regression model of interest and/or auxiliary variables used in the hierarchical working models. One possibility for handling this situation could be to impute any missing predictor or auxiliary variables using a model that accounts for correlation due to clustering, use these imputed data to estimate the doubly robust regression estimator described here in the presence of missing outcome data, and then obtain standard errors using multiple imputation3 or a bootstrap approach21. However, although this approach would be robust to misspecification of the imputation model for the outcome variable (i.e., the working model for Y ), it would not be robust to misspecification of the imputation model for the other missing variables.
For illustration, we specified the hierarchical working models for the missingness mechanism and the outcome as generalized linear mixed effect models with a cluster-specific random intercept. More general models could be used to increase the chance that the working model(s) are specified correctly. For example, higher-order polynomial terms, interaction terms, or splines could be included as fixed effects in the working models. Additional random effects could also be included in the working models. However, specification of higher-dimensional random effects will generally be more computationally intensive, both for estimating the parameters of the hierarchical working models and for averaging the final set of estimating equations across the random effects to estimate Sm(β). Generally, specification of the working models can be viewed as a trade-off between specifying models that are general enough to make it more likely that the model is correctly specified and simple enough to be computationally feasible.
Although doubly robust estimators, such as the estimator presented here, can protect against bias due to a misspecified outcome model if the missingness model is specified correctly (and vice versa), simulation results from previous research22 have illustrated situations where methods that depend on only a missingness model (e.g., inverse probability weighting) or only an outcome model (e.g., imputation) performed better than doubly robust estimators when both models were misspecified. This highlights the importance of carefully specifying working models that are as plausible as possible. In this research, we used fully parametric working models for both missingness and the outcome variable. An alternative would be to specify non-parametric working models for both missingness and the outcome variable. In the case of independent data, previous research has shown that if the estimators for the non-parametric working models converge at rates faster than n^(−1/4), then the asymptotic behavior of the resulting doubly robust estimator would be the same as if the working models were correctly specified parametric models (e.g., √n-consistent and asymptotically normal)23,24. Therefore, assessing the statistical properties of the proposed method for doubly robust estimation with missing multilevel data using non-parametric working models is an important topic for future research.
The statistical derivations and simulation studies presented in this paper involve two-level data (e.g., longitudinal data for independently sampled study participants, cross-sectional data collected on all individuals within a sample of households). However, many large cohort studies contain data consisting of more than two levels. For example, the CHNS collected longitudinal data on all people living in households included in the cohort, and these households were clustered within neighborhoods, which were further clustered within cities/counties. Extending the approach described in this paper to data with an arbitrary number of levels of clustering is straightforward in theory. One could fit hierarchical working models with a vector of random effects for each level of clustering. Then the final set of estimating equations Sm(β) would be obtained by averaging across all random effects from the hierarchical working models. Assuming that the entire set of random effects (for all levels of clustering) are independent between the two working models, then the double robustness property would still hold. However, in practice increasing the dimension of the random effects in either working model would increase the computational burden, both for estimating the hierarchical working models and for averaging the final set of estimating equations across the random effects to estimate Sm(β). 
Therefore, for datasets with many levels of clustering (e.g., CHNS), it may be more reasonable to carefully select just a few levels of clustering that are most important to account for in either working model (e.g., the levels of clustering that are hypothesized to induce the most correlation after conditioning on the observed covariates), and/or to include fixed effects for observed covariates that help explain the within-cluster correlation for levels of clustering for which no random effects are included (e.g., include fixed effects for household income to help account for within-household correlation, include community-level variables to help account for within-community correlation).
One key assumption required for this method to be doubly robust is that the random effects for the working models for the missingness mechanism and the outcome need to be independent of each other. If the random effects from both working models are correlated, then misspecification of one of the working models (e.g., ignoring an important covariate, specifying the wrong functional form for a covariate, specifying the wrong link function) would necessarily misspecify the other working model. Therefore, when the random effects from both models are in fact correlated, the proposed method would only produce unbiased results when both working models are specified correctly, and therefore would no longer possess the double robustness property. Extending this methodology to the case with correlated random effects is a topic for future research.
In the simulation studies presented in Section 3 and Web Appendix C in the Supplementary Materials, the proposed method was generally less precise when the missingness model was correctly specified and the outcome model was misspecified than when the outcome model was correctly specified. In addition, previous research has shown that the doubly robust estimator of Scharfstein et al.5, from which our proposed method was derived, is inefficient when the outcome model is misspecified25. For the case with independent data, some doubly robust methods have been proposed that have improved efficiency, particularly for this scenario where the missingness model is correctly specified and the outcome model is misspecified26,27. Extending our proposed methodology to improve the efficiency in this case is not straightforward, and so this is another topic for future research.
ACKNOWLEDGMENTS
This research uses data from China Health and Nutrition Survey (CHNS). The authors are grateful to research grant funding for CHNS from the National Institute for Health (NIH), the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) for R01 HD30880, National Institute on Aging (NIA) for R01 AG065357, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) for R01DK104371 and R01HL108427, the NIH Fogarty grant D43 TW009077 since 1989, and the China-Japan Friendship Hospital, Ministry of Health for support for CHNS 2009, Chinese National Human Genome Center at Shanghai since 2009, and Beijing Municipal Center for Disease Prevention and Control since 2011. We thank the National Institute for Nutrition and Health, China Center for Disease Control and Prevention, Beijing Municipal Center for Disease Control and Prevention, and the Chinese National Human Genome Center at Shanghai. The authors are also grateful for support from the Carolina Population Center (P2C HD050924, T32 HD007168), the University of North Carolina at Chapel Hill. This research including the analysis was supported by grant R01DK104371 from the NIDDK, NIH, by grant U01DK098246 from the NIDDK, NIH, for the Glycemia Reduction Approaches in Diabetes: A Comparative Effectiveness (GRADE) Study, and by grant P01CA142538 from the National Cancer Institute (NCI), NIH.
Data availability statement
The data that support the findings of this study are available from the China Health and Nutrition Survey (CHNS). Public use versions of the data from CHNS can be downloaded from https://www.cpc.unc.edu/projects/china/.
Footnotes
Conflict of interest
The authors declare no potential conflict of interests.
Financial disclosure
None reported.
SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of the article: A list of assumptions needed for Lemma 1 and Theorem 1 to hold (Web Appendix A), proofs for Lemma 1 and Theorem 1 (Web Appendix B), additional simulation studies considering a binary outcome (Web Appendix C), sample SAS and R code to implement the proposed method in statistical practice (Web Appendix D), and an R function written by the authors (DoublyRobust_Multilevel.R) to solve the estimating equations for β (equation (4) in the main text) and estimate the covariance matrix (based on equation (8) in the main text).
References
- 1. Popkin BM, Du S, Zhai F, Zhang B. Cohort profile: The China Health and Nutrition Survey - monitoring and understanding socio-economic and health change in China, 1989–2011. International Journal of Epidemiology 2009; 39(6): 1435–1440.
- 2. Little RJA, Rubin DB. Statistical Analysis with Missing Data. Hoboken, NJ: John Wiley & Sons, Inc. 2nd ed. 2002.
- 3. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, Inc. 1987.
- 4. Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer. 2006.
- 5. Scharfstein DO, Rotnitzky A, Robins JM. Rejoinder to adjusting for non-ignorable drop-out using semiparametric non-response models. Journal of the American Statistical Association 1999; 94(448): 1135–1146.
- 6. Zeng D, Chen Q. Adjustment for missingness using auxiliary information in semiparametric regression. Biometrics 2010; 66: 115–122.
- 7. Cnaan A, Laird NM, Slasor P. Using the general linear mixed model to analyse unbalanced repeated measures and longitudinal data. Statistics in Medicine 1997; 16: 2349–2380.
- 8. Yan S, Li J, Li S, et al. The expanding burden of cardiometabolic risk in China: The China Health and Nutrition Survey. Obesity Reviews 2012; 13(9): 810–821.
- 9. Jones-Smith J, Popkin BM. Understanding community context and adult health changes in China: Development of an urbanicity scale. Social Science & Medicine 2010; 71(8): 1436–1446.
- 10. Ren S, Lai H, Tong W, Aminzadeh M, Hou X, Lai S. Nonparametric bootstrapping for hierarchical data. Journal of Applied Statistics 2010; 37(9): 1487–1498.
- 11. Allison PD. Fixed Effects Regression Methods for Longitudinal Data Using SAS. Cary, NC: SAS Institute. 2005.
- 12. Neyman J, Scott EL. Consistent estimates based on partially consistent observations. Econometrica 1948; 16: 1–32.
- 13. Stein C. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability 1955; 1: 197–206.
- 14. Naumova EN, Must A, Laird NM. Evaluating the impact of “critical periods” in longitudinal studies of growth using piecewise mixed effects models. International Journal of Epidemiology 2001; 30: 1332–1341.
- 15. Hausman JA. Specification tests in econometrics. Econometrica 1978; 46: 1251–1271.
- 16. Tchetgen EJ, Coull BA. A diagnostic test for the mixing distribution in a generalized linear mixed model. Biometrika 2006; 93: 1003–1010.
- 17. Kasim RM, Raudenbush SW. Application of Gibbs Sampling to Nested Variance Components Models with Heterogeneous Within-Group Variance. Journal of Educational and Behavioral Statistics 1998; 23(2): 93–116.
- 18. van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 2011; 45(3): 1–67.
- 19. Arpino B, Mealli F. The specification of the propensity score in multilevel observational studies. Computational Statistics & Data Analysis 2011; 55(4): 1770–1780.
- 20. Li F, Zaslavsky AM, Landrum MB. Propensity score weighting with multilevel data. Statistics in Medicine 2013; 32(19): 3373–3387.
- 21. Efron B. Missing data, imputation, and the bootstrap. Journal of the American Statistical Association 1994; 89(426): 463–475.
- 22. Kang JDY, Schafer JL. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 2007; 22(4): 523–539.
- 23. van der Laan MJ, Rubin D. Targeted maximum likelihood learning. The International Journal of Biostatistics 2006; 2(1): Article 11.
- 24. Kennedy EH, Balakrishnan S. Discussion of “Data-driven confounder selection via Markov and Bayesian networks” by Jenny Haggstrom. Biometrics 2018; 74(2): 399–402.
- 25. Rubin D, van der Laan MJ. Empirical efficiency maximization: Improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics 2008; 4(5): Article 5.
- 26. Tan Z. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika 2010; 97(3): 661–682.
- 27. Rotnitzky A, Lei Q, Sued M, Robins JM. Improved double-robust estimation in missing data and causal inference models. Biometrika 2012; 99(2): 439–456.