Sequential BART for imputation of missing covariates

Dandan Xu; Michael J Daniels; Almut G Winterstein

doi:10.1093/biostatistics/kxw009

Biostatistics. 2016 Mar 15;17(3):589–602. doi: 10.1093/biostatistics/kxw009


PMCID: PMC4915613 PMID: 26980459

Abstract

To conduct comparative effectiveness research using electronic health records (EHR), many covariates are typically needed to adjust for selection and confounding biases. Unfortunately, it is typical to have missingness in these covariates. Just using cases with complete covariates will result in considerable efficiency losses and likely bias. Here, we consider the covariates missing at random with missing data mechanism either depending on the response or not. Standard methods for multiple imputation can either fail to capture nonlinear relationships or suffer from the incompatibility and uncongeniality issues. We explore a flexible Bayesian nonparametric approach to impute the missing covariates, which involves factoring the joint distribution of the covariates with missingness into a set of sequential conditionals and applying Bayesian additive regression trees to model each of these univariate conditionals. Using data augmentation, the posterior for each conditional can be sampled simultaneously. We provide details on the computational algorithm and make comparisons to other methods, including parametric sequential imputation and two versions of multiple imputation by chained equations. We illustrate the proposed approach on EHR data from an affiliated tertiary care institution to examine factors related to hyperglycemia.

Keywords: Bayesian additive regression trees, Congenial models, Multiple imputation

1. Introduction

Incomplete covariate data is commonly encountered in regression analysis and in comparative effectiveness analysis using electronic health records (EHR). We focus on inference in a regression model with missingness in (some) covariates. We consider the case when the missing data are missing at random (MAR) (Rubin, 1976), i.e., the missingness only depends on what has been observed. When the missing data mechanism (MDM) for the covariates also does not depend on the (fully observed) response, a complete case (CC) analysis can provide unbiased inference if the inference regression model is correctly specified. However, when the number of covariates is large, even a small proportion of missingness can result in substantial loss of cases and big efficiency losses; in addition, if missingness depends on the response, there will be bias. In both situations, explicit (or implicit) multiple imputation (MI) (Rubin, 1978, 1987) is recommended to avoid losing information and/or to avoid bias. MI approaches impute the missing data by sampling from its predictive distribution given the observed data. The missing data are imputed multiple times to incorporate the uncertainty in the imputation. Regression analysis is then performed on each imputed dataset and the parameter estimates and standard errors (SE) are calculated by combining the estimates from the multiply imputed datasets using Rubin's formula (Rubin, 1987).
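The combining step can be sketched in a few lines. The following is a minimal illustration of Rubin's rules for a single scalar parameter; the function name `pool_rubin` and the example numbers are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Combine point estimates and SEs from m imputed datasets (Rubin, 1987)."""
    m = len(estimates)
    q_bar = mean(estimates)                       # pooled point estimate
    w = mean([se ** 2 for se in std_errors])      # within-imputation variance
    b = variance(estimates)                       # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b                       # total variance
    # Degrees of freedom of the reference t distribution; infinite if b == 0
    df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2 if b > 0 else math.inf
    return q_bar, math.sqrt(t), df

# Hypothetical estimates of one regression coefficient from m = 5 imputed datasets
est = [1.0, 1.2, 0.8, 1.1, 0.9]
se = [0.5, 0.5, 0.5, 0.5, 0.5]
pooled, pooled_se, df = pool_rubin(est, se)
```

The pooled SE exceeds the average per-dataset SE whenever the estimates differ across imputations, which is how the imputation uncertainty enters the final inference.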

One commonly used approach in practice for MI is multiple imputation via chained equations (MICE) (Van Buuren and Oudshoorn, 1999; Raghunathan and others, 2001). This approach specifies a univariate conditional distribution for each variable subject to missingness given all other variables, $p(x_j \mid x_{-j})$. The univariate distribution is usually assumed to be a parametric model, a nonparametric model such as recursive partitioning (Burgette and Reiter, 2010; Doove and others, 2014), or the actual distribution “sampled” using an ad hoc method such as predictive mean matching (Little, 1988). Although this chained approach has been widely used, it has theoretical drawbacks since there is often no joint distribution compatible with the univariate conditionals (Arnold and Press, 1989; Li and others, 2012; Liu and others, 2013). To alleviate the problem, Li and others (2014) proposed an MI approach that combines MICE and the sequential imputation method for monotone missing data by decomposing any missing data pattern into a collection of constructed monotone patterns referred to as “ordered monotone blocks”.

Other approaches directly specify a joint model for the variables, $p(x_1, \ldots, x_p)$. A common choice is a parametric multivariate model, for example, a multivariate normal model, a multinomial log–linear model, or a general location model (Schafer, 1997). Since a parametric model is often not flexible enough for complex data structures, mixture models have been used recently for MI to capture complex dependence structures and distributional shapes. For example, finite mixtures and Bayesian Dirichlet process (DP) mixtures of multivariate normals were used to model continuous variables (Di Zio and others, 2007; Kim and others, 2014) and DP mixtures of product multinomial distributions to model categorical variables (Si and Reiter, 2013). The two models were further hierarchically coupled with local dependence by an infinite tensor factorization prior to model continuous and categorical covariates together (Murray and Reiter, 2014). However, the performance of such models in large sample, high-dimensional settings is unclear.

An alternative approach to specify a (valid) joint model is via specification of univariate sequential conditional distributions, $p(x_1)\,p(x_2 \mid x_1) \cdots p(x_p \mid x_1, \ldots, x_{p-1})$. A drawback of this approach is that the specification is not invariant to the order of the conditioning. Thus, different orderings can lead to different joint distributions. There are many ways to model the univariate sequential conditional distributions parametrically (Lipsitz and Ibrahim, 1996; Ibrahim and others, 1999). Chen and Ibrahim (2006) proposed a semiparametric model using a generalized additive model (GAM) for the univariate conditional models. To our knowledge, there is no flexible “default” fully Bayesian method for this sequential approach that allows for additional complexity, including interactions.

One important consideration in constructing the MI model is to make sure that the imputation model is congenial with the inference model (Meng, 1994). An uncongenial imputation model can lead to inconsistent parameter estimates and biased variance estimates. Let $p(y \mid x, \theta)$ be the inference regression model of the response $y$ on covariates $x$. One source of uncongeniality is that there may exist no joint model with the imputation model and the inference regression model as its conditionals. This is a greater issue for MICE than for the sequential approach since the latter can always specify the imputation model as $p(x \mid y) \propto p(y \mid x, \theta)\,p(x)$. Another common source of incompatibility is the inclusion of auxiliary variables in the imputation model, which does not marginalize to the inference regression model (Daniels and others, 2014). For simplicity, we do not consider including auxiliary variables in the imputation model in this paper. We discuss extensions in Section 5.

In this paper, we propose a Bayesian nonparametric joint model as a product of sequential univariate conditional models for continuous and categorical covariates. We propose to use Bayesian treed models for the univariate conditional distributions. Specifically, we use Bayesian classification and regression trees (CART) (Chipman and others, 1998; Denison and others, 1998) for categorical covariates (with more than two categories), and Bayesian additive regression trees (BART) (Chipman and others, 2010) for continuous and binary covariates. Our approach offers several advantages: (i) the imputation model corresponds to a valid joint model for the covariates; (ii) the imputation model is compatible with the inference regression model; and (iii) the approach is computationally efficient and simple to implement in practice.

The motivating application is a hyperglycemia study using EHR data from an affiliated tertiary care institution at the University of Florida. Hyperglycemia is known to be associated with increased morbidity and mortality during inpatient hospital stays. The study aims to understand the risk factors associated with hyperglycemia. While a variety of clinical parameters are ascertained regularly for every admitted patient, these variables commonly have a substantial amount of missing values. Since clinical variables typically interact with each other and have nonlinear relationships, a flexible nonparametric MI approach with theoretical justification is proposed.

The remainder of the paper is organized as follows. In Section 2, we first briefly review Bayesian CART and BART models, followed by the introduction of the proposed model including the model specification and (MCMC) posterior analysis. In Section 3, we present simulation results that compare the performance of our approach with a sequential parametric approach and two MICE approaches. In Section 4, we apply the proposed approach to a logistic regression with missing covariates in a study of hyperglycemia using EHR data. We conclude in Section 5 with comments and further extensions.

2. Multiple imputation using Bayesian CART and BART

2.1. Review of Bayesian CART and BART model

CART models can be used to specify the conditional distribution of $y$ given $x$, where $x = (x_1, \ldots, x_p)$ is a vector of predictors. CART partitions the predictor space into regions via a binary tree (a parent node has one left- and one right-child node) denoted by $T$, with a set of interior node decision rules and $b$ terminal nodes. Each terminal node represents a predictor subspace and has associated with it a parameter $\mu_k$ in $M = \{\mu_1, \ldots, \mu_b\}$. For a given value $x$ in the predictor space, the assignment function denoted by $g(x; T, M)$ returns the parameter associated with the terminal node of the predictor subspace in which $x$ falls. If $y$ is continuous, the conditional distribution is specified as normal with mean function $g(x; T, M)$; this is called a regression tree model. If $y$ is a categorical variable with categories $C_1, \ldots, C_K$, where $K > 2$, the conditional distribution is specified as a multinomial distribution with parameters $g(x; T, \Pi)$, where $\Pi = \{\pi_1, \ldots, \pi_b\}$; this is called a classification tree model. $\pi_k = (\pi_{k1}, \ldots, \pi_{kK})$ are the parameters associated with the terminal nodes with $\pi_{kl} \ge 0$ and $\sum_{l=1}^{K} \pi_{kl} = 1$ for $k = 1, \ldots, b$. If $y$ is binary, the conditional distribution is Bernoulli. We can either specify a classification tree model or a regression tree model for a latent normal variable $y^*$, which can be useful for binary responses (more in what follows).

The Bayesian version of CART places a prior over $T$ and $M$ (or $\Pi$). The specification of the prior on $T$ includes three key components: (i) the prior on the tree generation and the number and the structure of nodes; (ii) the prior on the splitting variable at each interior node, usually a uniform prior on available variables which could be further split; (iii) the prior on the splitting threshold conditional on the splitting variable, usually a uniform prior on the available splitting values. There were several proposals in the literature for the tree-generating prior (Denison and others, 1998; Chipman and others, 1998; Wu and others, 2007). We use the approach of Chipman and others (1998) who define a stochastic process such that a node at depth $d$ is nonterminal with probability $\alpha(1 + d)^{-\beta}$. The prior for $M$ assumes independent normal distributions for each $\mu_k$. The prior for $\Pi$ assumes independent Dirichlet distributions for each $\pi_k$.
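To give a feel for how the tree-generating prior of Chipman and others (1998) favors small trees, the sketch below simulates tree shapes under the split probability $\alpha(1 + d)^{-\beta}$ for a node at depth $d$. The values $\alpha = 0.95$ and $\beta = 2$ are the commonly cited BART defaults and are an assumption here, as is the simplification that a split is always available (the splitting variable and threshold are ignored).

```python
import random

def p_split(depth, alpha=0.95, beta=2.0):
    """Prior probability that a node at the given depth is nonterminal."""
    return alpha * (1.0 + depth) ** (-beta)

def sample_tree_size(rng, alpha=0.95, beta=2.0, depth=0):
    """Draw a tree shape from the prior; return its number of terminal nodes."""
    if rng.random() < p_split(depth, alpha, beta):
        # The node splits: grow the left and right children one level deeper
        return (sample_tree_size(rng, alpha, beta, depth + 1)
                + sample_tree_size(rng, alpha, beta, depth + 1))
    return 1  # terminal node

rng = random.Random(0)
sizes = [sample_tree_size(rng) for _ in range(10000)]
share_small = sum(s <= 3 for s in sizes) / len(sizes)  # share of trees with at most 3 leaves
```

Under these assumed settings most simulated trees have two or three terminal nodes, which is the sense in which the prior keeps individual trees simple.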

Chipman and others (2010) later proposed BART, which uses sums of (simple) regression trees instead of a single (complex) tree as the mean function, $\sum_{j=1}^{m} g(x; T_j, M_j)$. The number of trees is assumed to be large, so the individual tree effect is small and the trees will not be overly complex. The tree components are independent of each other and each component is given the prior described earlier (Chipman and others, 1998). Using the sum of trees, BART has minimal mixing issues unlike Bayesian CART.
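The assignment function and the sum-of-trees mean can be illustrated with a small data structure. This is only a sketch of how a fitted sum of trees is evaluated at a point; the `Node` class, the convention of sending `x[var] <= cut` to the left child, and the two example trees are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: float = 0.0              # terminal-node parameter (used only at leaves)
    var: Optional[int] = None       # index of the splitting variable; None marks a leaf
    cut: float = 0.0                # splitting threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def g(x, tree):
    """Assignment function: drop x down the tree and return the leaf parameter."""
    node = tree
    while node.var is not None:
        node = node.left if x[node.var] <= node.cut else node.right
    return node.value

def sum_of_trees(x, trees):
    """BART-style mean function: the sum of the leaf parameters over all trees."""
    return sum(g(x, t) for t in trees)

t1 = Node(var=0, cut=0.5, left=Node(value=-1.0), right=Node(value=1.0))
t2 = Node(var=1, cut=2.0, left=Node(value=0.25),
          right=Node(var=0, cut=0.0, left=Node(value=0.5), right=Node(value=0.75)))
pred = sum_of_trees([0.7, 3.0], [t1, t2])
```

Because the second tree splits on both predictors, the sum can represent an interaction even though each tree is shallow.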

The “bcart” function in the R package tgp can implement Bayesian CART for a continuous response; however, to our knowledge, there is no R package or program available to implement Bayesian CART for a categorical response (more than two levels). The R package BayesTree can implement BART for continuous and binary responses.

2.2. Notation and inference regression model

For the $i$th subject, $i = 1, \ldots, n$, let $y_i$ be the response which, in what follows, we assume is always observed. Let $(z_i, x_i)$ be the covariates where $z_i$ is a set of covariates that are always observed and $x_i = (x_{i1}, \ldots, x_{ip})$ is a vector of covariates with potentially missing values. We can partition it into $(x_i^{\mathrm{obs}}, x_i^{\mathrm{mis}})$. Let $p(y_i \mid x_i, z_i, \theta)$ be the inference regression model with parameters $\theta$ of primary interest. Let $r_i$ be the missing indicator for $x_i$. We assume that the missing data are MAR and the MDM either depends on the response, i.e., $p(r_i \mid x_i, z_i, y_i) = p(r_i \mid x_i^{\mathrm{obs}}, z_i, y_i)$, or not, i.e., $p(r_i \mid x_i, z_i, y_i) = p(r_i \mid x_i^{\mathrm{obs}}, z_i)$. Note that even in the case when the MDM does not depend on the response, we include the response in the imputation model to avoid biased estimates (Moons and others, 2006).

2.3. Imputation model

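Since the subjects are independent, we omit the subscript $i$ for simplicity in what follows. Given the parameters $\theta$ and $\alpha = (\alpha_1, \ldots, \alpha_p)$, the joint distribution of $(y, x)$ given $z$ is specified as

$$p(y, x \mid z, \theta, \alpha) = p(y \mid x, z, \theta) \prod_{j=1}^{p} p(x_j \mid x_1, \ldots, x_{j-1}, z, \alpha_j),$$

from which we can derive the imputation model, $p(x^{\mathrm{mis}} \mid x^{\mathrm{obs}}, z, y, \theta, \alpha) \propto p(y \mid x, z, \theta) \prod_{j=1}^{p} p(x_j \mid x_1, \ldots, x_{j-1}, z, \alpha_j)$.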

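If $x_j$ is continuous, where $j \in \{1, \ldots, p\}$, the univariate conditional distribution is specified as normal with the sum of trees as the mean function:

$$x_j \mid x_1, \ldots, x_{j-1}, z, \alpha_j \sim N\Big(\sum_{k=1}^{m} g(x_1, \ldots, x_{j-1}, z; T_{jk}, M_{jk}),\ \sigma_j^2\Big),$$

where $\alpha_j = (T_{j1}, M_{j1}, \ldots, T_{jm}, M_{jm}, \sigma_j^2)$.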

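If $x_j$ is binary, the univariate conditional distribution is a Bernoulli distribution. Let $x_j^*$ be the latent normal variable with the sum of trees as the mean function, i.e., $x_j = 1$ if $x_j^* > 0$ and $x_j = 0$ if $x_j^* \le 0$,

$$P(x_j = 1 \mid x_1, \ldots, x_{j-1}, z, \alpha_j) = \Phi\Big(\sum_{k=1}^{m} g(x_1, \ldots, x_{j-1}, z; T_{jk}, M_{jk})\Big),$$

where $\Phi$ is the cdf of a standard normal random variable and $\alpha_j = (T_{j1}, M_{j1}, \ldots, T_{jm}, M_{jm})$.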
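The latent-variable formulation for a binary covariate is what makes data augmentation convenient: given the observed binary value, the latent normal variable is drawn from a truncated normal. The sketch below shows that single draw by inverting the normal cdf. It assumes unit variance for the latent variable (the usual probit convention), and the function name `draw_latent` is hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def draw_latent(x_binary, mean, rng):
    """Draw x* ~ N(mean, 1), truncated to (0, inf) if x_binary == 1, else to (-inf, 0]."""
    p0 = STD_NORMAL.cdf(-mean)              # P(x* <= 0)
    u = rng.random()
    if x_binary == 1:
        q = p0 + u * (1.0 - p0)             # uniform on the upper part of the cdf scale
    else:
        q = u * p0                          # uniform on the lower part of the cdf scale
    q = min(max(q, 1e-12), 1.0 - 1e-12)     # keep inv_cdf away from 0 and 1
    return mean + STD_NORMAL.inv_cdf(q)

rng = random.Random(1)
mu = 0.3                                    # hypothetical value of the sum-of-trees mean
prob_one = STD_NORMAL.cdf(mu)               # implied P(x = 1)
pos_draws = [draw_latent(1, mu, rng) for _ in range(1000)]
neg_draws = [draw_latent(0, mu, rng) for _ in range(1000)]
```

Once the latent values are drawn, the trees for a binary covariate can be updated exactly as for a continuous one.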

All categorical covariates with more than two categories are combined into one (big) categorical variable with categories $C_1, \ldots, C_K$, where $K > 2$. Each category in the big variable corresponds to a unique combination of all the original categorical covariates. If any of the original categorical covariates have missing values, the big variable is missing. The univariate conditional distribution is then specified as a multinomial distribution with probabilities,

$$\big(P(x_j = C_1 \mid x_1, \ldots, x_{j-1}, z, \alpha_j), \ldots, P(x_j = C_K \mid x_1, \ldots, x_{j-1}, z, \alpha_j)\big) = g(x_1, \ldots, x_{j-1}, z; T_j, \Pi_j),$$

where $\alpha_j = (T_j, \Pi_j)$; this is a Bayesian CART model.
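The construction of the combined categorical variable can be sketched as follows. The helper `combine_categoricals` and the example rows are hypothetical, and for brevity the sketch enumerates only the combinations that actually occur in the data rather than every possible combination.

```python
def combine_categoricals(rows):
    """rows: list of tuples of category labels, with None marking a missing value.

    Returns (codes, levels): one combined code per row (None when any component
    is missing) and the mapping from each observed combination to its code.
    """
    levels = {}
    codes = []
    for row in rows:
        if any(v is None for v in row):
            codes.append(None)              # any missing component makes the combined variable missing
        else:
            codes.append(levels.setdefault(tuple(row), len(levels)))
    return codes, levels

rows = [("A", "low"), ("B", "high"), ("A", None), ("A", "low"), ("B", "low")]
codes, levels = combine_categoricals(rows)
```

Imputing the combined variable then fills in all of its components at once, which is why a single multinomial conditional suffices.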

There is typically not a default order for the covariates. We propose to order the covariates in terms of specifying the joint distribution based on covariate type and proportion of missingness. The distribution of the combined categorical variable is specified first, followed by binary variables, and then continuous variables. Within the same data type, variables with less missingness are specified before those with more missingness. This default order will facilitate the efficiency of the MCMC algorithm (see Section 2.4).
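The default ordering rule amounts to a two-key sort: first by covariate type (combined categorical, then binary, then continuous), then by proportion of missingness. A minimal sketch, with hypothetical covariate names and missingness fractions:

```python
TYPE_RANK = {"categorical": 0, "binary": 1, "continuous": 2}

def default_order(covariates):
    """covariates: dict mapping name -> (type, fraction_missing).

    Returns the names sorted by type rank, then by increasing missingness.
    """
    return sorted(covariates,
                  key=lambda name: (TYPE_RANK[covariates[name][0]], covariates[name][1]))

# Hypothetical covariates; the fractions are illustrative only
covs = {
    "cont_a": ("continuous", 0.27),
    "bin_a": ("binary", 0.02),
    "cont_b": ("continuous", 0.05),
    "combined_cat": ("categorical", 0.10),
    "bin_b": ("binary", 0.01),
}
order = default_order(covs)
```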

The priors of parameters from different univariate conditional distributions are independent, i.e., $p(\alpha_1, \ldots, \alpha_p) = \prod_{j=1}^{p} p(\alpha_j)$. The parameters are given the default prior based on the R package BayesTree for continuous and binary variables. The default prior for Bayesian CART (Chipman and others, 1998; Wu and others, 2007) is used for categorical variables. If a prior for $\theta$ is given in the inference regression analysis, the same prior is adopted for the imputation model.

2.4. Posterior and MCMC algorithm

We use a data augmentation (DA) algorithm to generate samples from the posterior distribution $p(\theta, \alpha, x^{\mathrm{mis}} \mid y, x^{\mathrm{obs}}, z)$. Given the model specification in Section 2.3, the posteriors of the parameters from each univariate conditional distribution are independent (conditional on the data-augmented missing covariates), i.e., $p(\alpha \mid x, z) = \prod_{j=1}^{p} p(\alpha_j \mid x_1, \ldots, x_j, z)$. Each conditional distribution can be sampled simultaneously using MCMC algorithms for either Bayesian CART or BART and $\theta$ can be sampled based on the assumed regression model.

The conditional distribution for the missing data $x^{\mathrm{mis}}$ can be sampled as follows. Specifically, the posterior for a categorical (including binary) variable with categories $C_1, \ldots, C_K$, given all the other variables, is a multinomial distribution. For continuous covariates, we use Metropolis–Hastings to sample from the full conditionals. The steps of the MCMC algorithm are given below with details in the supplementary material available at Biostatistics online.

1. Sample and from .
2. Sample from .
3. Sample from

Since the conditionals for the categorical and binary variables can be sampled exactly, they are placed before continuous variables in specifying the joint distribution. The candidate distribution for the missing continuous covariates (see the supplementary material available at Biostatistics online) only uses the previous covariates, and so to increase the acceptance rate, we place them later in the order. Also, for efficiency, we order the continuous and binary covariates from less to more missingness. The above algorithm is coded in C++; the code for Step 1 is written on the basis of parallel BART code from http://www.rob-mcculloch.org/. Parallel computation can be implemented in several components of the MCMC algorithm to speed up computations (details can be found in the supplementary material available at Biostatistics online).
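The idea of proposing a missing continuous covariate from its conditional given the previous covariates can be illustrated on a toy model. This is not the authors' sampler: the sketch assumes a single missing covariate $x \sim N(0, 1)$ with no later covariates, and a response $y \mid x \sim N(\beta x, 1)$ with $\beta$ known. Because the candidate is drawn from the covariate's own conditional, that term cancels and the acceptance ratio reduces to the ratio of the remaining factors, here only the response model.

```python
import math
import random

def log_lik(y, x, beta):
    """Log density (up to a constant) of y | x ~ N(beta * x, 1)."""
    return -0.5 * (y - beta * x) ** 2

def impute_chain(y, beta, n_iter, rng):
    """Independence Metropolis-Hastings draws of a missing x given the response y."""
    x = 0.0
    draws = []
    for _ in range(n_iter):
        cand = rng.gauss(0.0, 1.0)          # candidate from the conditional of x alone
        log_ratio = log_lik(y, cand, beta) - log_lik(y, x, beta)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = cand                        # accept the candidate
        draws.append(x)
    return draws

rng = random.Random(42)
draws = impute_chain(y=1.0, beta=1.0, n_iter=20000, rng=rng)
kept = draws[1000:]                         # discard burn-in
post_mean = sum(kept) / len(kept)
```

In this toy model the exact conditional of $x$ given $y$ is $N(\beta y / (1 + \beta^2),\ 1 / (1 + \beta^2))$, so the chain average should be close to 0.5. When more variables follow the covariate in the ordering, more factors enter the ratio, which is consistent with the stated motivation for placing continuous covariates later.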

3. Simulations

We examine the operating characteristics of our approach (which we call sequential BART) via simulations. We compare sequential BART to a sequential parametric approach, default MICE using predictive mean matching for continuous variables and logistic regression for binary variables, and MICE–CART using CART as the univariate conditional imputation model. The primary goal is to compare the parameter estimates from the inference regression model obtained using the missing data imputed under the four methods.

For each scenario, we generate 1000 replicated datasets. For the sequential BART approach, we use the default priors and initialize the missing data by the sample mean of the observed data for each variable. We run an MCMC algorithm for 2000 iterations with the first 1000 iterations burned in and create 5 imputed datasets from the posterior sample of missing data at every 200 iterations afterwards. We assess MCMC convergence by running multiple chains using different initial values. We also monitor the autocorrelation and the acceptance rate for the Metropolis–Hastings algorithm. The MCMC algorithms for our proposed method converge very fast and mix well in the simulation examples.
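The imputation schedule described above can be written out explicitly. The sketch assumes that "every 200 iterations afterwards" means the first retained draw is 200 iterations after the end of burn-in; the variable names are illustrative.

```python
n_iter, burn_in, thin = 2000, 1000, 200

# Iterations (1-indexed) whose imputed covariates are saved as completed datasets
kept_iterations = list(range(burn_in + thin, n_iter + 1, thin))
n_imputations = len(kept_iterations)
```

Under that assumption the five imputed datasets come from iterations 1200, 1400, 1600, 1800, and 2000, so consecutive imputations are separated by 200 MCMC updates.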

For the sequential parametric approach, we use normal linear models for continuous covariates and logistic regression models for binary covariates as the univariate sequential conditional distributions. All the parameters are given diffuse priors. JAGS (Plummer, 2003) is used to fit the models, which uses an appropriate DA procedure for missing data. We create 5 imputed datasets evenly from 1000 iterations after 1000 burn-in. For MICE, we use both the default settings of the “mice” function in R package mice and an option of “CART” as the univariate method in the “mice” function. After the imputation, we run the inference regression model on the five imputed complete datasets and calculate the combined parameter estimates $\hat{\theta}$, SEs and CIs by Rubin's formula. We calculate the percent of times that CIs cover the true value (CI coverage), average width of CIs, and the average bias of $\hat{\theta}$ from $\theta$ across all the replicates. We also compute an overall measure, a sum of squared errors summary statistic.
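The per-parameter operating characteristics can be computed from the replicate-level pooled estimates and SEs as sketched below. The function `summarize` is hypothetical, and the sketch uses a normal quantile for the interval as a simplifying assumption, whereas intervals based on Rubin's formula use a t reference distribution.

```python
from statistics import mean

def summarize(estimates, std_errors, truth, z=1.96):
    """Coverage, average CI width, and average bias for one parameter.

    estimates, std_errors: one pooled estimate and SE per simulated replicate.
    """
    lower = [e - z * s for e, s in zip(estimates, std_errors)]
    upper = [e + z * s for e, s in zip(estimates, std_errors)]
    coverage = mean([1.0 if lo <= truth <= hi else 0.0 for lo, hi in zip(lower, upper)])
    width = mean([hi - lo for lo, hi in zip(lower, upper)])
    bias = mean(estimates) - truth
    return coverage, width, bias

# Three hypothetical replicates for a parameter whose true value is 1.0
coverage, width, bias = summarize([0.9, 1.1, 1.6], [0.1, 0.1, 0.1], truth=1.0)
```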

The five simulations have the following focus: (i) the covariates are missing completely at random (MCAR); (ii) the covariates are MAR with MDM not depending on the response; (iii) the covariates are MAR with MDM depending on the response; (iv) exploration of variable ordering in specifying the imputation model; (v) a simple true multivariate normal model (in the supplementary material available at Biostatistics online).

3.1. Example 1: Covariates are MCAR

Each simulated dataset has Inline graphic observations. There are 10 continuous covariates including one fully observed and a response variable from a linear regression model on the covariates. We use five latent variables to define a relatively complex joint model for the covariates with nonlinearity, nonadditivity, and skewed residual distributions.

Figure 1 shows the operating characteristics for the estimation of the regression parameters graphically. In particular, it shows that the CI coverages are around 95% for all approaches. Sequential BART has the smallest width of CI, followed by MICE–CART, sequential parametric, and MICE. Sequential BART and sequential parametric have smaller bias than MICE and MICE–CART. The Monte Carlo SEs of the biases tend to be around 0.005, so that the differences between MICE–CART and other approaches apparent in Figure 1 are unlikely to be the result of Monte Carlo error. The estimated sum of squared errors summary statistic is 122 (SE: 1.7) for sequential BART, compared with 127 (SE: 1.7) for MICE–CART, 137 (SE: 2.0) for MICE, and 139 (SE: 2.0) for sequential parametric.

Fig. 1. — Simulation Example 1: coverage of 95% CIs, average width of CIs, and average bias of the 10 regression parameter estimates based on 1000 replicates using the four imputation methods.

3.2. Example 2: Covariates are MAR with MDM not depending on response

Each simulated dataset has Inline graphic observations with 17 continuous and 9 binary covariates where 6 covariates are fully observed. The true distribution of the covariates with missing values is composed of nonlinear and nonadditive functions of the observed covariates. The response variable is from a linear regression model on the covariates. The true data model can be found in the supplementary material available at Biostatistics online.

Each of the 20 covariates has around 19% missing values. Larger values of Inline graphic and larger absolute values of , , and are more likely to be missing. Figure 2 shows that sequential BART has good CI coverage for all the covariates, whereas the other three approaches have poor CI coverage for at least some covariates. Sequential BART has smallest width of CI, followed by MICE–CART and the other two approaches. Sequential BART has bias around 0 for all the covariates, whereas the other approaches have large bias for some covariates. The Monte Carlo SEs of the empirical biases tend to be around 0.002 for regression parameters for continuous covariates and 0.005 for binary covariates for the two MICE approaches where sequential BART and parametric have smaller Monte Carlo SEs, so that the differences between BART and other approaches apparent in Figure 2 are unlikely to be the result of Monte Carlo error. The estimated summary statistic Inline graphic is 75 (SE: 1.1) for sequential BART, compared with 286 (SE: 2.4) for MICE–CART, 243 (SE: 2.4) for MICE, and 253 (SE: 2.5) for sequential parametric.

Fig. 2. — Simulation Example 2: coverage of 95% CIs, average width of CIs, and average bias of the 26 regression parameter estimates based on 1000 replicates using the four imputation methods.

3.3. Example 3: Covariates are MAR with MDM depending on response

This example has the MDM depending on the response. We have Inline graphic observations and 3 continuous covariates. One covariate is fully observed. We sequentially specify the joint distribution of covariates. The covariates depend on the previous covariates both nonlinearly and nonadditively. We present two MDMs where one depends only on the response and the other depends on both the response and the fully observed covariate. The following is the true data model:

For the MDM depending only on the response, the average percentage of missingness is 16%. Larger values of and have higher probability of missingness. For the MDM depending on both the response and the fully observed covariate, the average percentage of missingness is 14%. For , large values are less likely to be missing. For , values around 0 are less likely to be missing and larger absolute values are more likely to be missing. Figure 3 shows that sequential BART and sequential parametric perform similarly and better than MICE and MICE–CART in terms of CI coverage, width of CI and bias of parameter estimates. For the MDM depending only on the response, the sum of squared errors summary statistic is 7 (SE: 0.2) for sequential BART, 13 (SE: 0.4) for sequential parametric, 33 (SE: 1.2) for MICE–CART, and 34 (SE: 1.6) for MICE. The average Monte Carlo SE for biases is 0.004. For the MDM depending on both the response and the fully observed covariate, the sum of squared errors summary statistic is 5 (SE: 0.1) for both sequential BART and sequential parametric, 7 (SE: 0.2) for MICE, and 14 (SE: 0.4) for MICE–CART. The average Monte Carlo SE for biases is 0.003. The poor performance of both MICE approaches is likely related to the lack of congeniality between the imputation models and the inference regression model.

Fig. 3. — Simulation Example 3: coverage of 95, 90, and 80% CIs, average width of CIs, and average bias of the 3 regression parameter estimates based on 1000 replicates using the four imputation methods.

3.4. Example 4: Covariate ordering

To explore the effect of variable ordering in sequential specification of the joint distribution of covariates, we simulate a small example where the true joint distribution is sequentially specified and interaction and nonlinear terms are added to define a complex joint model. In specifying the imputation model, we try all possible orderings of covariates including the true ordering. We generate Inline graphic observations, 3 continuous and 1 binary covariates with one fully observed, and a response variable from a linear regression on the covariates. The true data model can be found in the supplementary material available at Biostatistics online.

For the sequential parametric, each univariate conditional model is specified as a normal regression with the mean including all linear and interaction terms of previously ordered variables if it is continuous, or logistic regression with all linear and interaction terms of previously ordered variables if it is binary. For example, if the imputation model use the variable ordering from the true model, the parametric model is specified as Inline graphic , , and with .

Figure 4 shows that all the approaches have CI coverage around 95% for all the variable orderings. The results of sequential parametric fluctuate most dramatically with the variable ordering in terms of the width of CI and bias. The Monte Carlo SEs of biases for the four parameter estimates are 0.002, 0.006, 0.003, and 0.006, respectively; thus, the fluctuation in the biases for the sequential parametric approach is not due to Monte Carlo error. Unlike the sequential parametric approach, sequential BART is not much affected by variable ordering. However, there are situations in which the performance of sequential BART can be affected by the ordering of variables, for example, when the conditional means have nonlinear relationships with previous covariates and the error term has a skewed distribution, as shown in the simulation example in the supplementary material available at Biostatistics online.

Fig. 4. — Simulation Example 4: coverage of 95% CIs, average width of CIs, and average bias of the 4 regression parameter estimates based on 1000 replicates using the six possible covariate orderings and four imputation methods. The six orderings are: (i) , (ii) , (iii) , (iv) , (v) and (vi) .


4. EHR hyperglycemia data example

We illustrate our proposed approach using a logistic regression analysis of hyperglycemia with missing covariates using inpatient EHR data from a tertiary care institution affiliated to the University of Florida. The data includes the demographic, diagnostic, and procedural information extracted from the discharge record as well as lab results, medication administration, and other clinical variables obtained for inpatient adults admitted to UF Health Shands Hospital from July 1, 2011 to September 30, 2011. The binary response is defined based on the glucose values taken on the second day of admission. In particular, hyperglycemia ( Inline graphic ) is defined as having one glucose value mg/dL if only one value was taken on that day or two glucose values mg/dL if they were taken at least 6 h apart. The covariates include medical information extracted on the admission day. The goal of the regression analysis is to find significant risk factors associated with hyperglycemia. The final analysis dataset has a total of 4947 observations with 24 covariates. Fourteen covariates have missing data ranging from 0.4% to 28% (Table 1 has the % of missingness for these covariates). The CC analysis only uses 65% of the observations. We use sequential BART, sequential parametric, MICE, and MICE–CART to impute the missing covariates. We also perform a CC analysis. Table 1 reports the estimated odds ratios from the logistic regression under the four imputation approaches and CC analysis. As expected, CC analysis has slightly wider CIs than the other imputation approaches. The estimates are close for different imputation approaches in this example, but slightly different for CC analysis. 
Diabetes, anti-diabetic drug administration, hyperglycemia on previous day, systemic corticosteroid administration, high pulse rates, low O$_2$ saturation (for MICE–CART), low chloride (for sequential BART and MICE), high systolic blood pressure on admission day (except CC analysis) and low WBC (for sequential parametric) are significantly associated with hyperglycemia on the second day of admission. The observation that the results are similar for all the methods is likely related to the fact that the covariates with the most missingness turn out to be less important predictors of the response. Therefore, the impact of the imputation model on the regression parameter estimates was lessened.

Table 1.

Odds ratio estimates from fitting the hyperglycemia logistic regression with missing covariates imputed using the four approaches and a CC analysis.

	Odds ratio with CIs
Covariates (unit, % missing)	Sequential BART	Sequential parametric	MICE	MICE–CART	CC analysis
Gender: female vs male	1.21 (0.93, 1.58)	1.22 (0.94, 1.59)	1.21 (0.93, 1.57)	1.20 (0.93, 1.56)	1.29 (0.96, 1.75)
Admission type: surgical vs medical admission	0.83 (0.63, 1.09)	0.83 (0.63, 1.09)	0.83 (0.63, 1.09)	0.82 (0.62, 1.08)	0.80 (0.58, 1.10)
Transplant: yes vs no	1.11 (0.35, 3.52)	1.07 (0.34, 3.37)	1.14 (0.36, 3.59)	1.16 (0.37, 3.66)	1.12 (0.34, 3.42)
Care unit: intensive vs acute care unit	1.12 (0.82, 1.53)	1.09 (0.80, 1.49)	1.13 (0.83, 1.54)	1.09 (0.80, 1.48)	1.10 (0.79, 1.54)
Diabetes: yes vs no	5.03 (3.62, 6.99)	5.01 (3.62, 6.95)	5.00 (3.61, 6.94)	5.05 (3.65, 7.01)	4.46 (3.09, 6.46)
Dialysis: yes vs no	0.71 (0.36, 1.41)	0.73 (0.37, 1.45)	0.76 (0.38, 1.51)	0.75 (0.37, 1.50)	0.89 (0.42, 1.82)
Anti-diabetic drug administration: yes vs no	3.43 (2.47, 4.75)	3.40 (2.45, 4.71)	3.48 (2.51, 4.83)	3.44 (2.48, 4.76)	3.16 (2.18, 4.58)
Hyperglycemia on previous day: yes vs no	6.63 (4.96, 8.88)	6.93 (5.20, 9.22)	6.66 (4.97, 8.94)	6.66 (5.00, 8.87)	6.13 (4.42, 8.54)
Systemic corticosteroid administration: yes vs no	3.77 (2.74, 5.18)	3.73 (2.72, 5.13)	3.73 (2.72, 5.13)	3.79 (2.76, 5.20)	3.78 (2.65, 5.39)
Age (year)	1.00 (0.99, 1.01)	1.00 (0.99, 1.01)	1.00 (0.99, 1.01)	1.00 (0.99, 1.01)	1.00 (0.99, 1.01)
Body temperature (F, 0.5%)	0.96 (0.87, 1.07)	0.96 (0.87, 1.07)	0.96 (0.87, 1.07)	0.96 (0.87, 1.07)	0.94 (0.84, 1.06)
Respiration rate (breaths/min, 0.4%)	1.01 (0.98, 1.04)	1.01 (0.98, 1.04)	1.01 (0.98, 1.04)	1.01 (0.97, 1.04)	1.03 (0.99, 1.07)
Pulse rate (beats/min, 2%)	1.01 (1.00, 1.02)	1.01 (1.00, 1.02)	1.01 (1.00, 1.02)	1.01 (1.00, 1.02)	1.01 (1.00, 1.02)
O$_2$ saturation (%, 5%)	0.97 (0.94, 1.00)	0.99 (0.95, 1.02)	0.97 (0.94, 1.00)	0.97 (0.94, 1.00)	0.98 (0.93, 1.03)
Sodium (mEq/L, 27%)	0.99 (0.93, 1.04)	1.01 (0.97, 1.05)	0.98 (0.94, 1.03)	0.99 (0.94, 1.03)	0.98 (0.94, 1.03)
Hemoglobin (g/dL, 15%)	0.83 (0.66, 1.06)	0.83 (0.67, 1.03)	0.85 (0.68, 1.07)	0.87 (0.69, 1.10)	0.85 (0.67, 1.09)
Hematocrit (%, 16%)	1.06 (0.98, 1.16)	1.07 (0.99, 1.16)	1.06 (0.98, 1.15)	1.05 (0.97, 1.14)	1.06 (0.98, 1.16)
Chloride (mEq/L, 28%)	0.97 (0.94, 1.00)	0.97 (0.94, 1.00)	0.97 (0.94, 1.00)	0.98 (0.94, 1.01)	0.97 (0.93, 1.00)
Serum creatinine (mg/dL, 27%)	1.01 (0.89, 1.14)	1.00 (0.88, 1.13)	0.99 (0.88, 1.12)	1.00 (0.88, 1.13)	0.99 (0.86, 1.12)
Weight (lb, 5%)	1.00 (0.96, 1.04)	1.00 (0.97, 1.04)	1.00 (0.96, 1.04)	1.00 (0.96, 1.04)	0.99 (0.95, 1.04)
Systolic blood pressure (mmHg, 1%)	1.01 (1.00, 1.01)	1.01 (1.00, 1.01)	1.01 (1.00, 1.01)	1.01 (1.00, 1.01)	1.08 (0.96, 1.22)
Diastolic blood pressure (mmHg, 1%)	0.99 (0.98, 1.00)	0.99 (0.98, 1.00)	0.99 (0.98, 1.00)	0.99 (0.98, 1.00)	0.99 (0.98, 1.01)
Potassium (mEq/L, 27%)	0.86 (0.66, 1.11)	0.90 (0.72, 1.13)	0.89 (0.71, 1.13)	0.88 (0.71, 1.11)	0.81 (0.63, 1.03)
WBC (/L, 17%)	0.98 (0.96, 1.01)	0.98 (0.96, 1.00)	0.98 (0.96, 1.01)	0.99 (0.97, 1.02)	0.99 (0.96, 1.01)
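
For readers who want to see how one entry of a table like this can be produced from multiply imputed datasets, the sketch below pools per-imputation log odds ratios with Rubin's formula and exponentiates the result. It is a minimal illustration, not the authors' code: the function names `pool_rubin` and `odds_ratio_ci` are hypothetical, the input numbers are made up, and the interval uses a normal quantile (1.96) as a simplification rather than the t reference distribution that Rubin's rules prescribe.

```python
import math


def pool_rubin(estimates, variances):
    """Combine M per-imputation estimates and their squared SEs (Rubin, 1987).

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = sum(estimates) / m                               # pooled point estimate
    u_bar = sum(variances) / m                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b                      # total variance
    return q_bar, math.sqrt(total_var)


def odds_ratio_ci(log_or, se, z=1.96):
    """Odds ratio and approximate 95% interval from a pooled log odds ratio.

    Assumption: normal approximation instead of Rubin's t reference distribution.
    """
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical per-imputation log odds ratios and standard errors (illustrative only).
log_ors = [1.60, 1.62, 1.61, 1.63, 1.59]
ses = [0.17, 0.17, 0.17, 0.17, 0.17]

pooled, pooled_se = pool_rubin(log_ors, [s ** 2 for s in ses])
or_hat, lower, upper = odds_ratio_ci(pooled, pooled_se)
print(f"{or_hat:.2f} ({lower:.2f}, {upper:.2f})")
```

With few imputations or a large between-imputation variance, the t-based interval would be wider than this normal approximation.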


5. Conclusions

We have proposed a Bayesian nonparametric model as a flexible imputation tool for missing covariates. We show via simulations that the proposed method is able to capture complex dependence between covariates and competes favorably with MICE. We prefer our method over MICE since it always corresponds to a valid joint distribution and is compatible with the (inference) regression model (Daniels and others, 2014).

In our simulations, in simple linear cases (see simulation in the supplementary material available at Biostatistics online), our proposed method has comparable performance with MICE and parametric models. In more complex settings, our approach works better. Our proposed method, like MICE, is simple to implement: one only needs to specify (default) univariate conditional distributions, and it requires minimal tuning. We are currently finalizing an R package to implement our approach.

The issue of the order of variables in specifying the sequential univariate conditional distributions is alleviated in our proposed method compared to parametric models, since our method flexibly models the conditional mean function and easily adapts to the data when the order changes. In fact, MICE is not invariant to the ordering either when the conditionals do not correspond to a valid joint distribution, though the impact of the order typically seems small (Li and others, 2012). However, the ordering can be more important when the true conditionals are very nonadditive with skewed residuals (see simulation example in supplementary material available at Biostatistics online). In any case, we still recommend our ordering of the variables generally for the most efficient computation.
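
To make concrete what "order" refers to, the factorization behind a sequential specification can be written out by the chain rule. The notation here is illustrative and assumed for this sketch, not necessarily the symbols used in the body of the paper: $x_1, \dots, x_p$ denote the covariates with missingness in the chosen order and $z$ denotes the fully observed covariates.

```latex
% Chain-rule factorization of the joint distribution of the covariates with
% missingness (x_1, ..., x_p), given fully observed covariates z.
% Notation is assumed for illustration only.
p(x_1, \dots, x_p \mid z)
  = p(x_1 \mid z) \prod_{j=2}^{p} p(x_j \mid x_1, \dots, x_{j-1}, z)
```

Any permutation of the indices gives an equally valid factorization of the same joint distribution. The order matters in practice only because each univariate conditional is replaced by a fitted model, and a model class that fits well under one ordering may fit less well under another.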

Computations for our method can be further optimized using parallel computation as discussed in Section 2.4 (see details in supplementary material available at Biostatistics online). The parallel BART presented by Pratola and others (2014) can scale nearly linearly in the number of processor cores and handle massive datasets.

We can extend our approach to include auxiliary variables in the imputation model. To avoid uncongeniality issues, we can adopt the approach proposed by Daniels and others (2014), specifying a model for the response conditional on both the covariates and the auxiliary variables, with a constraint that gives the desired form for the inference model. Alternatively, we can extend our sequential conditional specification to include the auxiliary variables so that the imputation model is always congenial with the inference model.

We are also currently working on extending our model to settings where the covariates are MNAR and/or the covariates have different types of missingness. Also, as the number of covariates grows, better proposal distributions for the Metropolis–Hastings algorithms detailed in the supplementary material available at Biostatistics online might be required.

Supplementary materials

Supplementary Material is available at http://biostatistics.oxfordjournals.org.


Acknowledgments

This work was partially supported by NIH R01 CA183854. We thank Gigi Lipori and Jennifer Mylers for providing the EHR data from UF Health Shands Hospital for the study. Conflict of Interest: None declared.

References

Arnold B. C., Press S. J. (1989). Compatible conditional distributions. Journal of the American Statistical Association 84(405), 152–156.
Burgette L. F., Reiter J. P. (2010). Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology 172(9), 1070–1076.
Chen Q., Ibrahim J. G. (2006). Semiparametric models for missing covariate and response data in regression models. Biometrics 62(1), 177–184.
Chipman H. A., George E. I., McCulloch R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association 93(443), 935–948.
Chipman H. A., George E. I., McCulloch R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4(1), 266–298.
Daniels M., Wang C., Marcus B. (2014). Fully Bayesian inference under ignorable missingness in the presence of auxiliary covariates. Biometrics 70(1), 62–72.
Denison D. G., Mallick B. K., Smith A. F. (1998). A Bayesian CART algorithm. Biometrika 85(2), 363–377.
Di Zio M., Guarnera U., Luzi O. (2007). Imputation through finite Gaussian mixture models. Computational Statistics & Data Analysis 51(11), 5305–5316.
Doove L., Van Buuren S., Dusseldorp E. (2014). Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis 72, 92–104.
Ibrahim J. G., Lipsitz S. R., Chen M.-H. (1999). Missing covariates in generalized linear models when the missing data mechanism is non-ignorable. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(1), 173–190.
Kim H. J., Reiter J. P., Wang Q., Cox L. H., Karr A. F. (2014). Multiple imputation of missing or faulty values under linear constraints. Journal of Business and Economic Statistics 32(3), 375–386.
Li F., Baccini M., Mealli F., Zell E. R., Frangakis C. E., Rubin D. B. (2014). Multiple imputation by ordered monotone blocks with application to the anthrax vaccine research program. Journal of Computational and Graphical Statistics 23(3), 877–892.
Li F., Yu Y., Rubin D. B. (2012). Imputing missing data by fully conditional models: some cautionary examples and guidelines. Duke University Department of Statistical Science Discussion Paper.
Lipsitz S. R., Ibrahim J. G. (1996). A conditional model for incomplete covariates in parametric regression models. Biometrika 83(4), 916–922.
Little R. J. (1988). Missing-data adjustments in large surveys. Journal of Business & Economic Statistics 6(3), 287–296.
Liu J., Gelman A., Hill J., Su Y.-S., Kropko J. (2013). On the stationary distribution of iterative imputations. Biometrika 101(1), 155–173.
Meng X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science 9(4), 538–558.
Moons K. G., Donders R. A., Stijnen T., Harrell F. E. (2006). Using the outcome for imputation of missing predictor values was preferred. Journal of Clinical Epidemiology 59(10), 1092–1101.
Murray J. S., Reiter J. P. (2014). Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. Journal of the American Statistical Association (Preprint). arXiv:1410.0438.
Plummer M. (2003). JAGS: a program for analysis of Bayesian graphical models using Gibbs sampling. In: Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Volume 124. Technische Universität Wien, p. 125.
Pratola M. T., Chipman H. A., Gattiker J. R., Higdon D. M., McCulloch R., Rust W. N. (2014). Parallel Bayesian additive regression trees. Journal of Computational and Graphical Statistics 23(3), 830–852.
Raghunathan T. E., Lepkowski J. M., Van Hoewyk J., Solenberger P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27(1), 85–96.
Rubin D. B. (1976). Inference and missing data. Biometrika 63(3), 581–592.
Rubin D. B. (1978). Multiple imputations in sample surveys-a phenomenological Bayesian approach to nonresponse. In: Proceedings of the Section on Survey Research Methods, p. 20.
Rubin D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
Schafer J. (1997). Analysis of Incomplete Multivariate Data. New York: Chapman & Hall.
Si Y., Reiter J. P. (2013). Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics 38(5), 499–512.
Van Buuren S., Oudshoorn K. (1999). Flexible Multivariate Imputation by MICE. Leiden, The Netherlands: TNO Prevention Center.
Wu Y., Tjelmeland H., West M. (2007). Bayesian CART: prior specification and posterior simulation. Journal of Computational and Graphical Statistics 16(1), 44–66.
