Abstract
Background:
Missing data is common in studies using electronic health records (EHR)-derived data. Missingness in EHR data is related to healthcare utilization patterns, resulting in complex missingness mechanisms that may be missing not at random. Prior research has suggested that machine learning-based multiple imputation methods may outperform traditional methods and may perform well even when data are missing not at random.
Methods:
We used plasmode simulations based on a nationwide EHR-derived de-identified database for patients with metastatic urothelial carcinoma to compare performance of multiple imputation using chained equations, random forests, and denoising autoencoders in terms of bias and precision of hazard ratio estimates under varying proportions of observations with missing values and missingness mechanisms (missing completely at random, missing at random, and missing not at random).
Results:
Multiple imputation by chained equations and random forest methods had low bias and similar standard errors for parameter estimates under missingness completely at random. Under missingness at random, denoising autoencoders had higher bias than multiple imputation by chained equations and random forests. Contrary to results of prior studies of denoising autoencoders, all imputation methods exhibited substantial bias under missingness not at random, with bias increasing with the amount of missing data.
Conclusions:
We found no advantage of denoising autoencoders for multiple imputation in the setting of an epidemiologic study conducted using EHR data. Results suggested that denoising autoencoders may overfit the data leading to poor confounder control. Use of more flexible imputation approaches does not mitigate bias induced by missingness not at random and can produce estimates with spurious precision.
Keywords: electronic health records, missing data, multiple imputation, chained equations, random forests, denoising autoencoders, plasmode simulation
Introduction
Electronic health records (EHR)-derived data represent an enormous resource that includes exposures and outcomes for a large, diverse population and reflects care and outcomes as they are experienced in routine practice. These data facilitate the study of rare diseases such as uncommon cancers through pooling of EHR databases across multiple institutions. Despite this potential, prior research has identified numerous deficiencies in EHR-derived data including measurement error, missing data, and informative observation patterns.1–3
Appropriately accounting for missing data in EHR-based studies is challenging because of the complexity of the missing data mechanism. In traditional studies making use of primary data collection, missing data is defined as failure to observe a data element that was intended to be collected.4 Since EHR data are not captured according to any prespecified design, the concept of missingness itself requires re-evaluation.5,6 Data needed for a given analysis may fail to be observed for a variety of reasons including lack of patient interaction with the healthcare system, lack of clinical or administrative need to record the data element of interest, and patient or provider preference.6 In aggregate, the complexity of this missing data mechanism motivates the need for flexible approaches to missing data in studies using EHR data.
Many epidemiologic studies make use of multiple imputation to address missingness.4 Multiple imputation is appealing because it leads to unbiased estimates under missing completely at random and missing at random missingness mechanisms, appropriately accounts for uncertainty introduced by missing data, and can be implemented in the context of a wide variety of study designs and analytic approaches. Multiple imputation via chained equations7 is frequently used in EHR-based studies because it allows for imputation of multiple variable types, can accommodate clustered data, and does not require specification of the joint distribution of the variables under study.
Despite its popularity, multiple imputation via chained equations requires specification of the conditional distribution of each variable, typically via a regression model assuming linear, additive relationships, which may be insufficiently flexible to capture relationships among variables in EHR data. Prior research has investigated random forests as an alternative, more flexible approach to specification of the conditional distribution of covariates with missingness and found that multiple imputation using random forests resulted in parameter estimates with lower bias than traditional multiple imputation via chained equations, particularly in the setting of non-linearity and interaction among variables.8,9 Denoising autoencoders have also been proposed as a flexible approach for imputation in the machine learning literature.10,11 Past studies found that denoising autoencoders outperformed multiple imputation approaches in terms of the root mean squared error of predictions for the missing values, even under missing not at random. However, denoising autoencoders have not been evaluated in terms of bias in downstream regression parameter estimates using imputed data sets. Studies of deep learning for missing data imputation have primarily quantified performance using measures of the accuracy of the imputed values, which may not correspond well to bias and inferential properties of parameter estimates generated using the imputed data.9 Often the goal of clinical and epidemiologic research is to understand relationships between individual covariates and outcomes of interest. Thus, the quality of downstream statistical inference based on the imputed data takes precedence over the accuracy of the imputed values.
The objectives of the current study were: (1) to conduct a head-to-head comparison of multiple imputation using chained equations, random forests, and denoising autoencoders; (2) to use plasmode simulations based on real-world EHR-derived data to evaluate methods in a setting that reflects the relationships among variables in EHR data; and (3) to evaluate each method’s performance using measures relevant to epidemiologic studies, including bias, variance, and confidence interval coverage probability. Sampling from a cohort of patients with metastatic urothelial carcinoma in a real-world oncology EHR-derived database, we conducted simulation studies to compare these alternative multiple imputation approaches.
Methods
Study sample and data
We analyzed data from a cohort of patients with metastatic urothelial carcinoma included in the nationwide Flatiron Health EHR-derived database. This longitudinal database is comprised of de-identified patient-level structured and unstructured data, curated via technology-enabled abstraction.12,13 During the study period, the de-identified data originated from approximately 280 cancer clinics (∼800 sites of care). The study included patients diagnosed with stage IV urothelial carcinoma (bladder, renal pelvis, ureter, or urethra) and those diagnosed with early-stage urothelial carcinoma who subsequently developed metastatic disease and initiated first-line therapy. The study excluded patients who did not receive systemic therapy for advanced urothelial cancer, received first-line agents as part of a clinical trial, or had a >90-day gap between diagnosis and first structured EHR activity.
The exposure variable was a binary indicator of first-line treatment with immunotherapy. We defined immunotherapy treatment as single-agent nivolumab, pembrolizumab, atezolizumab, durvalumab, or avelumab. The primary outcome was overall survival, defined as the time from the start of first-line treatment to the date of death. The analytic data set included nine covariates: gender (Male or Female), age, race and ethnicity (White, Black, Hispanic/Latino, or Other), smoking status (yes/no), practice type (community vs. academic), surgery (yes/no), disease grade (1–4), primary site of disease (bladder, ureter, renal, or urethra), and performance status (0–4). Disease grade was dichotomized as grade 2–4 versus grade 1. We measured performance status using the ordinal Eastern Cooperative Oncology Group (ECOG) Scale, which describes a patient’s level of functioning in terms of self-sufficiency, daily activity, and physical ability. Prior to conduct of this analysis, we obtained Institutional Review Board approval of the study protocol which included a waiver of informed consent.
All imputation models described below incorporated the exposure, all covariates, and the survival outcome by using the cumulative baseline hazard of survival and event status indicator, as recommended by prior research.14
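As a concrete illustration, the cumulative hazard can be computed with the nelsonaalen() helper in the mice package and stored alongside the event indicator so that both enter the imputation models as predictors. The sketch below uses a small synthetic data frame and illustrative variable names (time, death) rather than the study's data.

```r
library(mice)
library(survival)

# Toy data frame standing in for the analytic data set (illustrative names only)
set.seed(1)
dat <- data.frame(time  = rexp(200, rate = 0.05),        # follow-up time in months
                  death = rbinom(200, 1, 0.7),           # event status indicator
                  age   = rnorm(200, mean = 72, sd = 9))

# Nelson-Aalen estimate of the cumulative hazard at each patient's follow-up time;
# together with the event indicator it is included as a predictor in the
# imputation models described below.
dat$cumhaz <- nelsonaalen(dat, time, death)
```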
Plasmode Simulation
We used plasmode simulations to sample with replacement from observations in the EHR-derived database. Plasmode simulation allows for evaluation of bias of parameter estimates because outcome data are simulated based on a model specified by the researcher, while empirical relationships among other variables are preserved by resampling from the original data set.15 In our case, the outcome model we wished to estimate was an adjusted Cox proportional hazards model for the association between overall survival and treatment (immunotherapy versus other systemic therapy). In addition to the nine covariates listed in the previous section, this model included a non-linear transformation of age and the interaction between a transformation of age, surgery and ECOG performance score (see eTable 1). Coefficients for covariates included in the outcome model were motivated by the estimated hazard ratios observed in the data and are provided in eTable 1. After imputing missing data as described below, we estimated the outcome model using all covariates included in the data generating model, including non-linear and interaction terms, to ensure that bias in estimated coefficients was attributable to imputation and not model misspecification. Censoring times were simulated according to a Cox proportional hazards model including treatment and the same set of covariates, including the non-linear and interaction terms. Coefficients are provided in eTable 1.
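The sketch below illustrates the general plasmode recipe under simplifying assumptions: a toy stand-in for the cohort, a shortened covariate list, placeholder coefficients rather than the eTable 1 values, and exponential baseline hazards for the event and censoring models. It is intended only to make the resample-then-simulate step concrete.

```r
# Toy stand-in for the analytic cohort; variable names and coefficients are
# placeholders, not those used in the study.
set.seed(2023)
cohort <- data.frame(
  immuno = rbinom(500, 1, 0.25),                              # first-line immunotherapy indicator
  ecog   = sample(0:4, 500, replace = TRUE),                  # ECOG performance score
  age    = rnorm(500, mean = 72, sd = 9),
  female = rbinom(500, 1, 0.27),                              # binary covariates coded 0/1
  race   = factor(sample(c("White", "Black", "Hispanic/Latino", "Other"),
                         500, replace = TRUE))
)

n    <- nrow(cohort)
boot <- cohort[sample(n, n, replace = TRUE), ]                # resample to preserve empirical covariate structure

# Linear predictors for the event and censoring Cox models (placeholder coefficients)
lp_event  <- with(boot, -0.3 * immuno + 0.4 * ecog + 0.02 * (age - 70))
lp_censor <- with(boot,  0.1 * immuno + 0.1 * ecog)

# Exponential baseline hazards are assumed here purely for illustration
t_event  <- rexp(n, rate = 0.05 * exp(lp_event))
t_censor <- rexp(n, rate = 0.02 * exp(lp_censor))

boot$time  <- pmin(t_event, t_censor)                         # observed follow-up time
boot$death <- as.numeric(t_event <= t_censor)                 # event indicator
```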
In simulated data, we explored performance of alternative missing data methods under varying missingness mechanisms and proportions of observations with missing data. Missingness in ECOG performance score was introduced under missing completely at random, missing at random, and two forms of missing not at random mechanisms. Under missing completely at random, missingness in ECOG performance score was not dependent on any covariates. Under missing at random, missingness in ECOG performance score was dependent only on fully observed covariates, not including performance score itself. Under the first missing not at random setting, missingness in ECOG performance score was dependent on performance score itself. In the second missing not at random setting, missingness in the performance score was dependent on performance score itself and an unobserved covariate. We introduced missing completely at random, missing at random, and missing not at random missingness in ECOG performance score by varying the strength of association between each variable in the data set and the probability of missingness in performance score in a logistic regression model (see eTables 2–4). More details are given in eAppendix 1.
In addition to missingness in ECOG performance score, missingness completely at random was introduced into all other covariates in the data set except for the non-linear and interaction terms. For the non-linear and interaction terms, we simulated missingness completely at random in the missing completely at random setting; in the missing at random and missing not at random settings, we induced missingness at random in these terms, with the probability of missingness dependent on all other covariates including ECOG performance score. We set to missing a random sample consisting of 1% (for the low missingness setting), 3% (medium), or 6% (high) of each covariate. We selected these missingness proportions such that approximately 10%, 30%, or 50% of observations in the data set had missing data for at least one covariate.
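Continuing the sketch above, missingness of this kind can be induced by drawing a missingness indicator from a logistic model. The coefficients below are placeholders rather than the values in eTables 2–4; for the missing at random setting the ECOG term would simply be dropped from the linear predictor.

```r
# MNAR-style missingness in ECOG performance score: the probability of
# missingness depends on ECOG itself (drop the ecog term for MAR).
alpha0  <- -2.5                                               # intercept controls the overall missingness proportion
lp_miss <- alpha0 + 0.5 * boot$immuno + 0.03 * (boot$age - 70) + 0.6 * boot$ecog

ecog_true <- boot$ecog                                        # retain true values for later evaluation
boot$ecog[rbinom(n, 1, plogis(lp_miss)) == 1] <- NA

# Missingness completely at random in the remaining covariates (about 3% each)
for (v in c("age", "female", "race")) {
  boot[[v]][rbinom(n, 1, 0.03) == 1] <- NA
}
```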
Multiple Imputation with Chained Equations
We implemented multiple imputation with chained equations using the R package mice16 in R version 4.0.2. We used predictive mean matching to impute variables treated as continuous (age and ECOG performance score) and all binary variables. Predictive mean matching selects the five observations with predictions closest to the prediction for an observation with a missing value and chooses one at random. The chosen observation’s true value is then used to impute the missing value. For race/ethnicity and primary site we used polytomous regression, i.e., the default method for unordered categorical variables.
We repeated the entire process described above for each variable with missing values. We considered a single imputation for all variables to be one cycle. We repeated the process for the default of five cycles using the updated imputed values from the previous cycle. We repeated the complete process 10 times to create 10 imputed data sets.
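A minimal mice call reflecting these settings, continuing the sketch above, might look as follows. Variable names are illustrative, binary covariates are coded 0/1 so that predictive mean matching applies directly, and the cumulative hazard and event indicator are included as recommended above.

```r
library(mice)

boot$cumhaz <- nelsonaalen(boot, time, death)   # survival information for the imputation models

meth <- make.method(boot)
meth[c("age", "ecog")] <- "pmm"                 # continuous variables: predictive mean matching (default of 5 donors)
meth["female"]         <- "pmm"                 # binary variables also imputed with PMM
meth["race"]           <- "polyreg"             # unordered categorical variable: polytomous regression

imp_mice <- mice(boot, method = meth, m = 10, maxit = 5, printFlag = FALSE)
```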
Multiple Imputation using Random Forests
We used multiple imputation with imputed values generated using random forests as implemented in the R package CALIBERrfimpute.17 This imputation approach is a modified version of multiple imputation with chained equations in which missing values for each variable are imputed using a random forest. For continuous variables, this approach begins by fitting a random forest to bootstrap samples of the observations where the variable being imputed is observed. Then, missing values are imputed by random draws from independent normal distributions centered on the conditional means predicted by the model. Based on the assumption that the random forest prediction errors are normally distributed, the out-of-bag mean squared error of the predictions is used as the variance of the normal distribution used for imputation. For categorical variables, the random forest method fits a set of classification trees to the bootstrap sample and chooses each imputed value as the prediction of a randomly chosen tree. We considered age and ECOG performance score as continuous variables and the remaining covariates as categorical variables. In all simulations, we used the CALIBERrfimpute default of 10 trees for each variable with missingness.
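Because CALIBERrfimpute plugs into the chained-equations framework, the corresponding call (again continuing the sketch above) differs mainly in the method strings, with "rfcont" for continuous variables and "rfcat" for categorical ones; the package default of 10 trees per imputed variable applies. Variable names remain illustrative.

```r
library(mice)
library(CALIBERrfimpute)

boot_rf <- boot
boot_rf$female <- factor(boot_rf$female)        # treat binary covariates as categorical for rfcat

meth_rf <- make.method(boot_rf)
meth_rf[c("age", "ecog")]    <- "rfcont"        # continuous variables imputed via random-forest regression
meth_rf[c("female", "race")] <- "rfcat"         # categorical variables imputed via classification forests

imp_rf <- mice(boot_rf, method = meth_rf, m = 10, maxit = 5, printFlag = FALSE)
```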
Multiple Imputation using Overcomplete Denoising Autoencoder
Autoencoders are neural networks that learn an encoded representation of input data and then attempt to reconstruct the original input as accurately as possible; a denoising autoencoder performs this reconstruction from a corrupted version of the input. The architecture consists of a series of encoder layers followed by decoder layers, with adjacent layers connected by activation functions that transform the information passed from the previous layer. The input layer is the corrupted version of the observed data, the output layer is the reconstructed data, and intermediate layers between the input and output are referred to as hidden layers. A denoising autoencoder is considered “overcomplete” if the encoding layers are of equal or higher dimension than the input layer. A more detailed description of the architecture and training of a denoising autoencoder is provided in the online Supplement.
To preserve comparability with prior research, we sought to mirror the implementation of the denoising autoencoder used in a prior study.11 Therefore, we used a denoising autoencoder with five hidden layers and an input dropout ratio of 0.5, which corresponds to replacing a randomly selected half of the input values with 0. Consistent with previous research using overcomplete denoising autoencoders for multiple imputation, we used a batch size of 256 and applied the tanh activation function at all layers except the last encoder and last decoder layers, where a linear activation function was used to ensure that the output vector could include negative values. We trained the denoising autoencoder for 500 epochs, where an epoch is one complete forward and backward pass of the entire data set through the neural network.
We trained the autoencoder using only the complete cases. Once the autoencoder was trained, we replaced all missing values with 0 and used the trained denoising autoencoder model to obtain predictions for each missing value from the reconstructed data set. We repeated the model training and reconstruction process to create 10 imputed data sets for each simulation iteration. Since the denoising autoencoder does not distinguish variable types, we split the unordered categorical variables race–ethnicity and primary site into dummy variables prior to imputation. As a result, it is possible for an individual’s imputed data to consist of more than one race–ethnicity or primary site in the downstream analysis with the imputed data set. To study sensitivity to the number of layers and nodes, we trained a lower-dimensional denoising autoencoder with one fewer hidden layer and only 4 added nodes at each layer. Results from the lower-dimensional denoising autoencoder were qualitatively similar and are provided in eFigures 1–2 and eTables 5–6.
We implemented denoising autoencoders in Python version 3.8.1 using MIDA-pytorch, a custom library built on the deep learning framework PyTorch. The MIDA-pytorch library is available on GitHub: https://github.com/Harry24k/MIDA-pytorch.
Evaluation
We implemented the following procedure for each combination of missingness proportion (10%, 30%, 50%) and missingness mechanism (missing completely at random, missing at random, missing not at random 1, missing not at random 2). After simulating 1000 data sets, we performed multiple imputation using chained equations, random forests, and denoising autoencoders to create 10 imputed data sets for each method at each simulation iteration. Next, we fit a Cox model with treatment, ECOG performance score, the remaining eight covariates, a non-linear function of age, and an interaction between age, ECOG performance score, and surgery to each of the imputed data sets and pooled the results using Rubin’s rules.4 We focused our evaluation on estimation and inference for the treatment and ECOG performance score hazard ratio estimates. While the treatment coefficient was considered the estimand of interest, we also examined bias in the ECOG performance score coefficient to assess the potential for residual confounding of the treatment estimate due to incomplete confounder control.
We compared the results to a complete case analysis and to analysis of the original data set with no missingness (the Oracle). We then compared the known coefficients of interest (treatment, ECOG performance score) from the true outcome model used to simulate the data to the coefficients estimated using each missing data method. At each iteration we calculated the difference between the estimated and true coefficients, along with the standard error, and summarized each with its mean across iterations. We treated the mean difference between estimated and true coefficients as an empirical estimate of bias. We also constructed boxplots of the difference between estimated and true coefficients obtained from the 1000 simulation iterations. Last, we computed the coverage of 95% confidence intervals for the coefficients of interest as the proportion of confidence intervals that covered the true value.
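A sketch of the per-iteration analysis step is shown below, continuing the example above. The model formula is abbreviated relative to the full outcome model, variable names are illustrative, and true_loghr_trt stands in for the true treatment coefficient from eTable 1.

```r
library(mice)
library(survival)

true_loghr_trt <- -0.3                                        # placeholder for the true treatment log hazard ratio

# Fit the outcome model to each imputed data set and pool with Rubin's rules
fits   <- with(imp_mice, coxph(Surv(time, death) ~ immuno + ecog + age))
pooled <- summary(pool(fits), conf.int = TRUE)

# Bias and 95% confidence interval coverage for the treatment coefficient
trt     <- pooled[pooled$term == "immuno", ]
bias    <- trt$estimate - true_loghr_trt
covered <- trt$`2.5 %` <= true_loghr_trt & true_loghr_trt <= trt$`97.5 %`
```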
Results
In the original, EHR-derived dataset 6,102 patients met study inclusion criteria, among whom 1,328 were treated with immunotherapy and the remainder with other systemic therapy. In this dataset, 614 deaths were observed in the immunotherapy arm and 1,971 deaths in the other systemic therapy arm over a median (interquartile range) follow-up of 9 months (4, 18). After excluding patients with missing data, the data set used to generate inputs for our simulation studies included 967 patients treated with immunotherapy and 2,891 patients treated with other systemic therapy for a total of 3,858 patients (Table 1).
Table 1:
Distribution of potential confounders by treatment arm in the EHR-derived cohort of patients with metastatic urothelial carcinoma.
| Characteristic | Chemotherapy | Immunotherapy |
|---|---|---|
| N | 2891 | 967 |
| Died, N (%) | 1971 (68) | 614 (64) |
| Age, mean (SD) | 70 (8.9) | 75 (8.3) |
| Sex, N (%) | | |
| Female | 746 (26) | 259 (27) |
| Male | 2145 (74) | 708 (73) |
| Race, N (%) | | |
| White | 2305 (80) | 753 (78) |
| Black | 123 (4) | 34 (4) |
| Hispanic or Latino | 108 (4) | 33 (3) |
| Other | 355 (12) | 147 (15) |
| Primary site, N (%) | | |
| Bladder | 2188 (76) | 737 (76) |
| Ureter | 255 (9) | 100 (10) |
| Renal | 415 (14) | 121 (13) |
| Urethra | 33 (1) | 9 (1) |
| Practice type, N (%) | | |
| Community | 2812 (97) | 953 (99) |
| Non-community | 79 (3) | 14 (1) |
| Smoking status, N (%) | | |
| History of smoking | 2110 (73) | 722 (75) |
| No history of smoking | 781 (27) | 245 (25) |
| Grade, N (%) | | |
| High (G2/G3/G4) | 2416 (84) | 815 (84) |
| Low | 475 (16) | 152 (16) |
| Surgery, N (%) | | |
| Surgery | 1444 (50) | 487 (50) |
| No surgery | 1447 (50) | 480 (50) |
| ECOG, N (%) | | |
| ECOG PS = 0 | 1071 (37) | 258 (27) |
| ECOG PS = 1 | 1298 (45) | 414 (43) |
| ECOG PS = 2 | 424 (15) | 225 (23) |
| ECOG PS > 2 | 98 (3) | 70 (7) |
Figures 1 and 2 display boxplots of point estimates minus the true parameter value across iterations for each missing data method applied to simulated data under each of the three missingness mechanisms and proportions of missingness investigated. Under missing completely at random, the distribution of the ECOG performance score effect estimate minus the true value was mostly contained within −0.1 to 0.1 for multiple imputation with chained equations and random forests for all amounts of missingness (Figure 1). The distribution of denoising autoencoder results shifted more noticeably away from zero with increasing amounts of missingness. Since the true ECOG performance score effect was positive, negative values of the parameter estimate minus the true value correspond to attenuation towards the null. For multiple imputation with chained equations and random forests, the spread of the distribution was largely constant as the percentage of missing data increased. However, the spread of the complete case and denoising autoencoder distributions increased with larger amounts of missing data, reflecting a loss of efficiency. This was also reflected in a greater increase in the average SE for the complete case and denoising autoencoder compared with multiple imputation with chained equations and random forests as missingness increased (Table 2, missing completely at random results). Coverage of 95% confidence intervals for the ECOG performance score effect declined for the three imputation methods as the proportion of missing data increased. Coverage was close to the nominal level for the complete case across all amounts of missingness.
Figure 1:
Boxplots of the distribution of coefficient point estimates minus the true parameter value for ECOG performance score under missing completely at random (MCAR; top left), missing at random (MAR; top right), and missing not at random (MNAR; bottom) for three levels of missingness (10, 30, 50 percent). Oracle = estimate based on data with no missingness, CC = complete case, MICE = multiple imputation via chained equations, RF = random forest, DAE = denoising autoencoder.
Figure 2:
Boxplots of the distribution of coefficient point estimates minus the true parameter value for treatment under missing completely at random (MCAR; top left), missing at random (MAR; top right), and missing not at random (MNAR; bottom) for three levels of missingness (10, 30, 50 percent). Oracle = estimate based on data with no missingness, CC = complete case, MICE = multiple imputation via chained equations, RF = random forest, DAE = denoising autoencoder.
Table 2:
Bias, average standard errors (SEs), and coverage of 95% confidence intervals for the effect of ECOG performance score. Low, medium, and high correspond to approximately 10%, 30%, and 50% of observations with at least one missing covariate value. Oracle = estimate based on data with no missingness, CC = complete case, MICE = multiple imputation via chained equations, RF = random forest, DAE = denoising autoencoder, MCAR = missing completely at random, MAR = missing at random, MNAR = missing not at random.
| Method | Bias (Low) | Bias (Medium) | Bias (High) | SE (Low) | SE (Medium) | SE (High) | Coverage (Low) | Coverage (Medium) | Coverage (High) |
|---|---|---|---|---|---|---|---|---|---|
| ORACLE | 0.00 | 0.00 | 0.00 | 0.02 | 0.02 | 0.02 | 0.94 | 0.94 | 0.95 |
| MCAR | |||||||||
| CC | 0.00 | 0.00 | 0.01 | 0.03 | 0.03 | 0.05 | 0.95 | 0.94 | 0.95 |
| MICE | 0.00 | −0.01 | −0.02 | 0.02 | 0.03 | 0.03 | 0.95 | 0.94 | 0.91 |
| RF | 0.00 | −0.01 | −0.03 | 0.02 | 0.03 | 0.03 | 0.95 | 0.94 | 0.89 |
| DAE | −0.01 | −0.04 | −0.08 | 0.03 | 0.05 | 0.06 | 0.95 | 0.92 | 0.82 |
| MAR | |||||||||
| CC | 0.00 | 0.00 | 0.01 | 0.03 | 0.03 | 0.04 | 0.95 | 0.94 | 0.94 |
| MICE | −0.01 | −0.03 | −0.04 | 0.02 | 0.03 | 0.04 | 0.94 | 0.87 | 0.83 |
| RF | 0.00 | −0.01 | −0.01 | 0.02 | 0.03 | 0.04 | 0.95 | 0.95 | 0.96 |
| DAE | −0.01 | −0.03 | −0.06 | 0.03 | 0.04 | 0.06 | 0.95 | 0.95 | 0.93 |
| MNAR 1 | |||||||||
| CC | 0.00 | 0.00 | 0.01 | 0.03 | 0.04 | 0.05 | 0.96 | 0.95 | 0.95 |
| MICE | −0.01 | −0.02 | −0.05 | 0.03 | 0.03 | 0.04 | 0.93 | 0.90 | 0.76 |
| RF | 0.00 | −0.03 | −0.06 | 0.03 | 0.03 | 0.04 | 0.95 | 0.88 | 0.77 |
| DAE | −0.02 | −0.04 | −0.06 | 0.04 | 0.06 | 0.08 | 0.97 | 0.97 | 0.96 |
| MNAR 2 | |||||||||
| CC | 0.00 | 0.00 | 0.00 | 0.03 | 0.03 | 0.05 | 0.94 | 0.94 | 0.95 |
| MICE | −0.01 | −0.02 | −0.04 | 0.03 | 0.03 | 0.04 | 0.94 | 0.90 | 0.84 |
| RF | 0.00 | −0.01 | −0.03 | 0.03 | 0.03 | 0.04 | 0.94 | 0.94 | 0.88 |
| DAE | −0.02 | −0.05 | −0.07 | 0.03 | 0.06 | 0.08 | 0.96 | 0.98 | 0.95 |
Under missing completely at random, the distribution of the treatment effect estimate minus the true value was mostly contained within −0.2 to 0.2 for multiple imputation with chained equations and random forests for all amounts of missingness (Figure 2). As the amount of missingness increased, the distribution of denoising autoencoder estimates began to deviate from the true value in the direction of the null. The bias was close to zero for the complete case regardless of the amount of missingness, but this method was less efficient than the others (Table 3, missing completely at random results). Average SEs for the three imputation methods were similar. The denoising autoencoder under-covered slightly under low missingness, and under-coverage worsened as missingness increased. The complete case and multiple imputation with chained equations covered close to the nominal level across all levels of missingness. Random forests had nominal or close to nominal coverage in the low and medium settings but covered the true parameter value only approximately 91% of the time under high missingness.
Table 3:
Bias, average standard errors (SEs), and coverage of 95% confidence intervals for the true treatment effect. Low, medium, and high correspond to approximately 10%, 30%, and 50% of observations with at least one missing covariate value. Oracle = estimate based on data with no missingness, CC = complete case, MICE = multiple imputation via chained equations, RF = random forest, DAE = denoising autoencoder, MCAR = missing completely at random, MAR = missing at random, MNAR = missing not at random.
| Method | Bias (Low) | Bias (Medium) | Bias (High) | SE (Low) | SE (Medium) | SE (High) | Coverage (Low) | Coverage (Medium) | Coverage (High) |
|---|---|---|---|---|---|---|---|---|---|
| ORACLE | −0.01 | 0.00 | 0.00 | 0.05 | 0.05 | 0.05 | 0.94 | 0.96 | 0.96 |
| MCAR | |||||||||
| CC | −0.01 | 0.00 | −0.01 | 0.06 | 0.07 | 0.10 | 0.94 | 0.95 | 0.95 |
| MICE | 0.00 | 0.01 | 0.02 | 0.05 | 0.05 | 0.05 | 0.94 | 0.95 | 0.95 |
| RF | 0.00 | 0.01 | 0.03 | 0.05 | 0.05 | 0.05 | 0.94 | 0.95 | 0.91 |
| DAE | 0.01 | 0.05 | 0.08 | 0.05 | 0.05 | 0.05 | 0.93 | 0.85 | 0.63 |
| MAR | |||||||||
| CC | 0.00 | −0.01 | −0.01 | 0.06 | 0.08 | 0.11 | 0.95 | 0.95 | 0.95 |
| MICE | 0.00 | 0.02 | 0.04 | 0.05 | 0.05 | 0.05 | 0.95 | 0.94 | 0.89 |
| RF | 0.00 | 0.02 | 0.05 | 0.05 | 0.05 | 0.06 | 0.95 | 0.93 | 0.88 |
| DAE | 0.03 | 0.08 | 0.11 | 0.05 | 0.06 | 0.06 | 0.93 | 0.77 | 0.60 |
| MNAR 1 | |||||||||
| CC | 0.00 | 0.00 | −0.01 | 0.06 | 0.08 | 0.12 | 0.94 | 0.95 | 0.94 |
| MICE | 0.04 | 0.06 | 0.08 | 0.05 | 0.05 | 0.05 | 0.89 | 0.76 | 0.64 |
| RF | 0.04 | 0.07 | 0.08 | 0.05 | 0.05 | 0.05 | 0.88 | 0.70 | 0.65 |
| DAE | 0.05 | 0.09 | 0.11 | 0.06 | 0.06 | 0.06 | 0.87 | 0.72 | 0.57 |
| MNAR 2 | |||||||||
| CC | 0.00 | −0.01 | −0.01 | 0.06 | 0.08 | 0.12 | 0.95 | 0.96 | 0.95 |
| MICE | 0.02 | 0.05 | 0.06 | 0.05 | 0.05 | 0.05 | 0.93 | 0.87 | 0.79 |
| RF | 0.02 | 0.05 | 0.07 | 0.05 | 0.05 | 0.05 | 0.93 | 0.83 | 0.72 |
| DAE | 0.04 | 0.08 | 0.11 | 0.05 | 0.06 | 0.06 | 0.90 | 0.74 | 0.58 |
Under missing at random, the distribution of the ECOG performance score estimates minus the true value under multiple imputation with chained equations shifted in the direction of the null increasingly with the amount of missingness (Figure 1). This reflects the misspecified multiple imputation with chained equations models that only included linear main effect terms for each covariate. The random forest distribution did not shift away from zero as much, demonstrating the value of a more flexible imputation model that can capture non-linear effects and interactions. With increasing missingness, the denoising autoencoder distribution shifted away from zero slightly more than the multiple imputation with chained equations distribution. Average SEs for all methods increased with increasing missingness, with the denoising autoencoder having the largest average SEs amongst the imputation methods. At low missingness, most methods achieved approximately nominal coverage. At medium and high missingness, coverage decreased slightly in the complete case while coverage of the multiple imputation with chained equations method declined more noticeably. The random forest approach retained nominal coverage at medium and high missingness, while the denoising autoencoder slightly under-covered in the high missingness setting (Table 2, missing at random results).
Results for the treatment effect under missing at random in ECOG performance score were similar to the results in the missing completely at random setting (Figure 2). In all settings, the denoising autoencoder had the largest magnitude of bias (Table 3, missing at random results). Average SEs for the complete case increased with the amount of missingness, while the average SEs for the imputation methods were similar to each other across all missingness levels. The complete case achieved nominal coverage at all levels of missingness, while coverage of multiple imputation with chained equations and random forests achieved the nominal level for low missingness but decreased for medium and high missingness. The denoising autoencoder did not reach the nominal level under any amount of missingness at random (Table 3).
Under the missing not at random setting 1, the distribution of the ECOG performance score estimate minus its true value was mostly contained within −0.1 to 0.1 except at the high missingness setting, where all three methods appeared to shift away from zero in the direction of the null (Figure 1). The denoising autoencoder had the largest average SEs of the three imputation methods and the complete case, which translated into slight over-coverage at all missingness levels. The complete case achieved nominal coverage at all missingness levels, whereas multiple imputation with chained equations and random forests demonstrated decreasing coverage rates as missingness increased. Results for the missing not at random 2 setting were qualitatively similar to missing not at random 1 (Table 2).
Under the missing not at random setting 1, the distribution of the treatment effect estimate minus its true value noticeably shifted away from zero with increasing amount of missingness (Figure 2). All three imputation methods displayed a large magnitude of bias with the denoising autoencoder having the largest. Average SEs were comparable between multiple imputation with chained equations and random forests. The denoising autoencoder had slightly higher average SEs that also increased more than those of multiple imputation with chained equations and random forests with increasing missingness. The complete case had close to nominal coverage for all amounts of missingness. The three imputation methods under-covered at low missingness, with poorer coverage at higher levels. Results for the missing not at random 2 setting were qualitatively similar to missing not at random 1 (Table 3).
Discussion
In the setting of a real-world investigation using EHR data, we found that using multiple imputation with chained equations or random forest-based imputation in the presence of missingness at random or completely at random in confounder variables resulted in efficient estimation of the treatment effect with minimal bias. Denoising autoencoder-based multiple imputation also performed reasonably well, although with slightly more bias, particularly under the highest amount of missingness investigated. When a confounder variable featured missingness not at random, all three multiple imputation approaches returned biased treatment effect estimates, notably exacerbating bias relative to the complete case analysis. Given that our simulations assumed that exposure and outcome information was available without missingness, this suggests that bias in the missing not at random setting resulted from poor confounder control due to inadequate handling of missingness in confounder variables. This is consistent with expectations, given that multiple imputation relies on a missing at random assumption, but is contrary to prior results suggesting that denoising autoencoders may improve estimation even under missing not at random.10,11 Estimates for the coefficient of ECOG performance score, the variable subject to missingness not at random, further confirmed that missingness not at random induces substantial bias in regression parameter estimates that is not mitigated by more flexible imputation approaches. Complete case analysis was relatively unbiased across the range of scenarios investigated. This result is intuitive given that the missingness in ECOG performance score was conditionally independent of the outcome.18 However, complete case analysis suffers from loss of efficiency, which can be severe under high levels of missingness. In EHR settings with large datasets, where efficiency may be a less important concern than bias, complete case analysis may be a reasonable approach.
Although denoising autoencoders have recently been popularized for imputing missing data in the machine learning literature, few studies to our knowledge have evaluated this approach in the context of multiple imputation. In one prior study using the method to conduct multiple imputation in a pooled clinical trials database including about 2000 patients, the denoising autoencoder was found to outperform multiple imputation using singular value decomposition and k-nearest neighbors.10 This study evaluated relative performance of multiple imputation approaches through comparison of root mean squared error (RMSE) of the imputed values and RMSE of predictions for a clinical outcome based on a random forest constructed using the imputed data. In a second study using a variety of data sets commonly used for evaluating machine learning methods, the denoising autoencoder outperformed multiple imputation with chained equations in terms of RMSE of the imputed values and RMSE and classification accuracy of a prediction model constructed using a random forest.11 While these prior studies do not differ substantially from the current study in terms of the size of the data sets investigated or the number of features included in the imputation model, neither evaluated the performance of denoising autoencoders with respect to measures that are important in epidemiologic studies, including bias in a treatment effect estimate in the presence of missing data in confounders. In contrast to these prior analyses, our results suggest that denoising autoencoders may not outperform other multiple imputation methods in terms of reducing bias in regression parameters of interest in analyses with characteristics similar to this investigation.
Prior studies have also compared multiple imputation using random forests to multiple imputation with chained equations. In simulations including missing at random, random forest-based imputation was more efficient than multiple imputation with chained equations and also decreased bias when the latter model was misspecified.8,19 However, this advantage of random forests was eliminated when interaction terms were incorporated into the parametric specification of the multiple imputation with chained equations model.20 Additionally, a recent study found that, while random forest-based imputation had low RMSE for predicting missing values, it had much higher bias for estimating regression coefficients than multiple imputation with chained equations using predictive mean matching.9 That study, along with the current investigation, suggests that results from flexible imputation methods should be interpreted cautiously, particularly in settings with sample sizes in the thousands.
Overall, our study found little or no benefit to using more flexible imputation approaches, including random forests and denoising autoencoders, relative to multiple imputation with chained equations in the setting of an EHR data set with a modest number of continuous and categorical variables. Denoising autoencoders were also more computationally intensive than the other methods and appeared to introduce some bias in settings with a high proportion of data missing at random. In the case of a high proportion of observations with missing values, the greater flexibility of a denoising autoencoder may lead to overfitting, resulting in spurious precision and below-nominal confidence interval coverage probabilities. Advantages of this approach may become evident in settings with very large sample sizes, but in the context of a sample size in the thousands, which is particularly common in EHR-based oncology studies, we found no benefit of using denoising autoencoders compared to more traditional imputation approaches.
This study has several limitations. First, as an empirical investigation based on simulation, our conclusions are dependent on the simulation scenarios we have investigated. Specifically, we used the plasmode simulation approach to resample data from an oncology EHR-derived data set. While the resultant simulation study reflects the data structure of this context well, results may not be generalizable to other settings, particularly those with much larger sample sizes or larger numbers of variables. Nonetheless, many real-world epidemiologic investigations use data similar in size, complexity and data source to the data used in this study. We thus believe our results have relevance to a variety of epidemiologic investigations. Second, modern machine learning methods are sensitive to the precise specification of hyperparameters. Our implementation of denoising autoencoders was based on the approach used by Gondara and Wang.11 Performance using alternative hyperparameter specifications may vary. Similarly, we have implemented random forests using the CALIBERrfimpute R package17 because it has been previously investigated in the context of multiple imputation and found to perform well.8 However, a previous study found varying performance across a variety of random forest implementations for imputation.21
In conclusion, we found that multiple imputation with chained equations, random forest and denoising autoencoder-based multiple imputation approaches all provide treatment effect estimates with low bias in the presence of missingness completely at random or at random in confounder variables, but denoising autoencoders suffer from sub-nominal confidence interval coverage when the proportion of observations with missing data is high. None of these imputation approaches are able to provide unbiased regression parameter estimates in the setting of missing not at random in confounders. These results suggest that multiple imputation is a valuable tool for increasing efficiency in the presence of missing data in EHR-based studies, but use of more flexible imputation approaches does not mitigate bias induced by missing not at random and can produce estimates with spurious precision.
Supplementary Material
Acknowledgments
Research reported in this publication was supported by the National Institutes of Health under award numbers R21CA227613 and R21AG075574. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
The authors have no conflicts of interest.
Data and Code Availability:
The data that support the findings of this study have been originated by Flatiron Health, Inc. These de-identified data may be made available upon request, and are subject to a license agreement with Flatiron Health; interested researchers should contact <DataAccess@flatiron.com> to determine licensing terms.
All analytic code is available at https://github.com/kalinn/modern_mi_simulations
References
- [1] Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):144–51.
- [2] Shortreed SM, Cook AJ, Coley RY, Bobb JF, Nelson JC. Challenges and opportunities for using big health care data to advance medical science and public health. American Journal of Epidemiology. 2019;188(5):851–861.
- [3] Beesley LJ, Salvatore M, Fritsche LG, Pandit A, Rao A, Brummett C, Willer CJ, Lisabeth LD, Mukherjee B. The emerging landscape of health research based on biobanks linked to electronic health records: existing resources, statistical challenges, and potential opportunities. Statistics in Medicine. 2020;39(6):773–800.
- [4] Little RJ, Rubin DB. Statistical Analysis with Missing Data. Vol. 793. John Wiley & Sons; 2019.
- [5] Haneuse S, Daniels M. A general framework for considering selection bias in EHR-based studies: what data are observed and why? eGEMs. 2016;4(1).
- [6] Haneuse S, Arterburn D, Daniels MJ. Assessing missing data assumptions in EHR-based studies: a complex and underappreciated task. JAMA Network Open. 2021;4(2):e210184.
- [7] White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine. 2011;30(4):377–399.
- [8] Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology. 2014;179(6):764–774.
- [9] Hong S, Lynn HS. Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction. BMC Medical Research Methodology. 2020;20(1):1–12.
- [10] Beaulieu-Jones BK, Moore JH, Pooled Resource Open-Access ALS Clinical Trials Consortium. Missing data imputation in the electronic health record using deeply learned autoencoders. In: Pacific Symposium on Biocomputing 2017. World Scientific; 2017:207–218.
- [11] Gondara L, Wang K. MIDA: multiple imputation using denoising autoencoders. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer; 2018:260–272.
- [12] Ma X, Long L, Moon S, Adamson BJ, Baxi SS. Comparison of population characteristics in real-world clinical oncology databases in the US: Flatiron Health, SEER, and NPCR. medRxiv. 2020.
- [13] Birnbaum B, Nussbaum N, Seidl-Rathkopf K, Agrawal M, Estevez M, Estola E, Haimson J, He L, Larson P, Richardson P. Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research. arXiv preprint arXiv:2001.09765. 2020.
- [14] White IR, Royston P. Imputing missing covariate values for the Cox model. Statistics in Medicine. 2009;28(15):1982–1998.
- [15] Franklin JM, Schneeweiss S, Polinski JM, Rassen JA. Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases. Computational Statistics & Data Analysis. 2014;72:219–226.
- [16] van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. Journal of Statistical Software. 2011;45:1–67.
- [17] Shah A. CALIBERrfimpute: Imputation in MICE using Random Forest. 2021. R package version 1.0-5.
- [18] Bartlett JW, Carpenter JR, Tilling K, Vansteelandt S. Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics. 2014;15(4):719–730.
- [19] Burgette LF, Reiter JP. Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology. 2010;172(9):1070–1076.
- [20] Slade E, Naylor MG. A fair comparison of tree-based and parametric methods in multiple imputation by chained equations. Statistics in Medicine. 2020;39(8):1156–1166.
- [21] Tang F, Ishwaran H. Random forest missing data algorithms. Statistical Analysis and Data Mining: The ASA Data Science Journal. 2017;10(6):363–377.