Summary:
Missing data are a pervasive issue in observational studies using electronic health records or patient registries, and they pose unique challenges for statistical inference, especially causal inference. Handling missing data inappropriately can bias causal estimation. In addition, observational health data typically contain mixed-type variables - continuous and categorical covariates - whose joint distribution is often too complex to be captured by simple parametric models. Missing values in covariates and outcomes make causal inference even more challenging, yet most standard causal inference approaches assume fully observed data or begin only after missing values have been imputed in a separate pre-processing stage. To address these problems, we introduce a Bayesian nonparametric causal model to estimate causal effects in the presence of missing data. The proposed approach simultaneously imputes missing values, accommodates multiple outcomes, and estimates causal effects under the potential outcomes framework. We present three simulation studies that evaluate the performance of the proposed method under complicated data settings resembling our case studies; for example, Simulation 3 considers the case where missing values exist in both outcomes and covariates. Two case studies apply the method to evaluate the comparative effectiveness of treatments for chronic disease management in juvenile idiopathic arthritis and cystic fibrosis.
Keywords: Average causal effect, Mixed covariate structures, Multiple imputation, Nonparametrics, Observational health data, Regression trees
1. Introduction
Data collected from electronic health records and patient registries are routinely used to conduct comparative effectiveness research on different treatment approaches (Miriovsky et al., 2012; Szczesniak et al., 2017). Employing a causal inference method is crucial to address confounding bias and obtain an accurate estimate of the treatment effect when analyzing observational data. However, a common challenge in observational research is dealing with incomplete data. In the chronic disease settings that motivate our research, analyses are often conducted on encounter-level records with missing values arising for many reasons, such as missed patient visits or incomplete recording by the health service provider. The missingness in collected health records is often informative (Groenwold, 2020) because it is associated with disease progression; for example, sicker patients can be expected to visit the clinic more often. The convenient practice of complete-case analysis is inefficient because of the reduced sample size and may lead to bias under informative missingness.
A growing literature addresses causal inference with incomplete data. A primary focus of this research area has been propensity score analysis under the (uncheckable) assumption of ignorable missingness (Lu and Ashmead, 2018). One strategy is to use a generalized propensity score that incorporates the missing data patterns into the propensity score estimation (Rosenbaum and Rubin, 1984; D’Agostino Jr and Rubin, 2000). Another approach, suggested by Leyrat et al. (2019), is first to multiply impute the missing covariates and then to apply propensity score weighting to each completed dataset, combining the results with Rubin’s rules (Rubin, 1987). Both propensity-score approaches assume a correct treatment propensity model and may suffer if that model is misspecified. In addition to propensity model misspecification, the imputation model may also be misspecified, a concern that can be alleviated by adopting a flexible, data-dependent model. Kapelner and Bleich (2016) suggested a modified version of the Bayesian additive regression trees method that handles missing covariates, implemented in the R package bartMachine. However, to our knowledge, the performance of bartMachine with missing covariates has not been formally evaluated for estimating causal effects. Roy et al. (2018) adopted an enriched Dirichlet process approach to causal inference with missing at random (MAR) covariates. Taking full advantage of flexible nonparametric Bayesian modeling, they showed that the enriched Dirichlet process causal model works well when missingness occurs in the covariates. This model is similar in spirit to our suggested nonparametric approach but is less suited to our applications, where missing values occur in both covariates and multiple continuous outcome variables. Mayer et al. (2020) proposed a doubly robust method for causal inference with missing covariates, but their method may yield biased causal estimates if both the outcome and propensity score models are misspecified (Kang et al., 2007).
Our work is motivated by clinical effectiveness studies of different therapeutic approaches using data collected from routine clinical care in chronic disease contexts. Our first motivational study uses data extracted from electronic health records from patients with juvenile idiopathic arthritis within a single healthcare center. The second motivational study uses the Cystic Fibrosis Foundation Registry, a multi-center national patient registry (Knapp et al., 2016). As typically found in electronic health records or registries, both applications present complex data features such as mixed continuous and categorical covariates, missing values in covariates and outcomes, and complex data distribution. The cystic fibrosis study has two outcomes of interest with substantial missing values.
Motivated by the two case studies, we propose a Bayesian nonparametric causal model relying on the dependent Dirichlet process (MacEachern, 1999) by extending the hierarchically coupled mixture model of Murray and Reiter (2016). The model captures the joint distribution of outcomes and covariates, consisting of observed and missing values, given the treatment indicator and the missingness indicators, with the population average treatment effect explicitly incorporated as a model parameter. Because the adopted nonparametric model is highly flexible, concerns about misspecifying the imputation model that we use (and about the propensity score model, which we do not use) are reduced relative to existing causal inference approaches. Moreover, the joint model is well adapted to our real data, where covariates are a mix of continuous and categorical variables, missing values exist in covariates as well as outcomes, and the treatment effect must be estimated for multiple outcomes.
The article is outlined as follows. Section 2 introduces the Bayesian nonparametric causal model with its prior distribution construction, derives the conditional posterior distributions used for Bayesian inference, and describes how to use the model to estimate causal effects. Section 3 describes three simulation studies and applies our proposed model; we also introduce some existing causal inference methods and compare their performance with our suggested approach in the presence of missing data. Section 4 applies our model to the juvenile idiopathic arthritis and cystic fibrosis studies. Finally, Section 5 concludes with a summary of our results and some discussion topics. The R code for the proposed method is available at https://github.com/huaiyu0112/BNP-causal.
2. Bayesian Nonparametric Causal Model and Causal Effect Estimation
2.1. Notations and Assumptions
For patient record $i = 1, \ldots, n$, let $\boldsymbol{Y}_i$ denote a vector of $q$ continuous outcomes. For example, our cystic fibrosis study has $q = 2$ outcome variables, which are lung function measurements at two time points. Let $Z_i$ denote a binary treatment indicator, with the value one for a unit assigned to the treatment group and zero for a unit assigned to the control group. Let $\boldsymbol{X}_i$ represent a vector of baseline covariates. Following the potential outcomes framework (Rubin, 1974), let $\boldsymbol{Y}_i(1)$ and $\boldsymbol{Y}_i(0)$ denote the vectors of potential outcomes that would be observed if a unit were assigned to the treatment and the control, respectively. A main inferential issue in this setting is that a researcher observes only one outcome, either $\boldsymbol{Y}_i(1)$ or $\boldsymbol{Y}_i(0)$, never both in practice. To address this issue, statistical causal inference has been actively studied (Imbens and Rubin, 2015). Our application studies focus on estimating the population average treatment effect (ATE) on the multiple outcomes, $\boldsymbol{\tau} = E\{\boldsymbol{Y}_i(1) - \boldsymbol{Y}_i(0)\}$.
We made three standard causal assumptions to identify the ATE: consistency, $\boldsymbol{Y}_i = \boldsymbol{Y}_i(Z_i)$ for all $i$; positivity, $0 < \Pr(Z_i = 1 \mid \boldsymbol{X}_i) < 1$; and unconfoundedness, $\{\boldsymbol{Y}_i(1), \boldsymbol{Y}_i(0)\} \perp Z_i \mid \boldsymbol{X}_i$. For causal inference with incomplete data, we assumed ignorable missingness, meaning that the missingness mechanism depends only on observed data and can be ignored in the inference; formally, the missingness indicators are conditionally independent of the missing values given the observed data. Our joint modeling approach does not distinguish missingness in outcomes from missingness in covariates and automatically imputes their missing values within the algorithm. For nonignorable missing data, we refer interested readers to studies of the impact of missing not at random mechanisms on causal estimation (e.g., Lu and Ashmead, 2018; Yang et al., 2019; Josefsson and Daniels, 2021).
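Under these assumptions, the ATE is identified from the observed-data distribution by the standard argument below, written in the notation above (the derivation is generic and applies to each outcome component):

$$
\begin{aligned}
E\{\boldsymbol{Y}_i(1)\}
&= E\big[E\{\boldsymbol{Y}_i(1) \mid \boldsymbol{X}_i\}\big]
 = E\big[E\{\boldsymbol{Y}_i(1) \mid Z_i = 1, \boldsymbol{X}_i\}\big] && \text{(unconfoundedness, positivity)} \\
&= E\big[E\{\boldsymbol{Y}_i \mid Z_i = 1, \boldsymbol{X}_i\}\big] && \text{(consistency)},
\end{aligned}
$$

and analogously for $E\{\boldsymbol{Y}_i(0)\}$, so that $\boldsymbol{\tau} = E\big[E\{\boldsymbol{Y}_i \mid Z_i = 1, \boldsymbol{X}_i\} - E\{\boldsymbol{Y}_i \mid Z_i = 0, \boldsymbol{X}_i\}\big]$.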
2.2. Proposed Model
To perform causal inference with a complex form of observational data containing missing values in outcomes and covariates, we propose a Bayesian nonparametric (BNP) causal model by refining the hierarchically coupled mixture imputation model of Murray and Reiter (2016). The flexibility of the adopted hierarchically coupled mixture model comes from data-dependent learning of the mixed-type variables, which are re-grouped into a continuous block (the outcomes and continuous covariates) and a categorical block (the categorical covariates). After the joint distribution is learned, it is used to impute the missing values from their conditional distribution given the observed data and the model parameters, and we estimate the ATE directly from the model parameter $\boldsymbol{\tau}$. The sections below explain the details of the rather complicated model with this overarching scheme in mind.
2.2.1. Data model.
For individual $i$, the vector of all baseline covariates consists of categorical covariates and continuous covariates. We collect the outcome variables and the continuous covariates into a single vector of continuous variables. After re-grouping the incomplete data into the continuous block and the categorical block, the joint distribution of interest is modelled as the conditional distribution of the continuous block given the categorical covariates and the treatment, multiplied by the marginal distribution of the categorical covariates. In our data model, both parts rely on the flexible nonparametric modeling introduced below.
For the regression modeling of the continuous block, the original categorical variables are transformed into a vector of indicator variables to avoid an identifiability issue. In the juvenile idiopathic arthritis study, gender and arthritis subtype are examples of categorical covariates, while the disease activity score (the outcome) and the age at diagnosis are examples of continuous variables.
Let $T_i$ be a treatment indicator matrix and $\boldsymbol{\tau}$ be the corresponding regression coefficients, whose elements are the treatment effects on the outcomes. For a treated unit with $Z_i = 1$, $T_i$ is defined by stacking an identity matrix of dimension equal to the number of outcomes on top of a zero matrix corresponding to the continuous covariates. If the $i$-th unit is assigned to the control group with $Z_i = 0$, $T_i$ is the zero matrix. For each unit $i$, let $S^y_i$ and $S^x_i$ be the mixture component indices of the continuous variables and the categorical variables, respectively. Then, the data model is written as
(1)
where the categorical covariates follow component-specific categorical distributions and the continuous variables follow a component-specific multivariate normal regression on the categorical covariates and the treatment. Web Figure 1 shows a graphical representation of the proposed model’s structure. Complicated relationships within and between the categorical and continuous variables can be captured by such a mixture model: the joint distribution among the categorical variables is captured via a mixture of categorical distributions indexed by $S^x_i$, and the joint distribution of the continuous variables (including the outcomes) is conditionally specified given the categorical variables via a mixture of regression models indexed by $S^y_i$, whose components capture heterogeneous regression effects.
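In schematic form, and writing $\boldsymbol{W}_i$ for the stacked continuous block, $X_{ij}$ for the $j$-th categorical covariate, and $\psi$, $B_k$, $\Sigma_k$ for illustrative component-specific parameters (these symbols are chosen here for exposition and need not match the article’s original display of Equation (1)), the two parts of the data model are

$$
\Pr\big(X_{ij} = c \mid S^x_i = s\big) = \psi^{(j)}_{s,c} \quad \text{for each categorical covariate } j,
$$
$$
\boldsymbol{W}_i \mid S^y_i = k, \boldsymbol{d}_i, Z_i \;\sim\; \mathcal{N}\!\big(B_k \boldsymbol{d}_i + T_i \boldsymbol{\tau}, \; \Sigma_k\big),
$$

where $\boldsymbol{d}_i$ collects an intercept and the dummy-coded categorical covariates, and the term $T_i \boldsymbol{\tau}$ adds the treatment effects to the outcome coordinates only.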
The estimates of $\boldsymbol{\tau}$ have the desired ATE interpretation under the three causal assumptions in Section 2.1, following the explanation in Imbens and Rubin (2015). From Equation (1), the response surface of each outcome of unit $i$ is a component-weighted combination of regression surfaces, which is a flexible functional form for capturing nonlinear relationships.
2.2.2. Dirichlet process priors for mixture components.
A hierarchical prior is assigned to the mixture components $S^y_i$ and $S^x_i$. We introduce a top-level mixture component that incorporates new forms of local dependence to capture the joint distribution of the continuous and categorical blocks. In Web Appendix B, we present a simulation example that explains the benefit of having the top-level component: removing it amounts to assuming (unconditional) independence between the two blocks, which results in bias or inefficiency in the posterior inference of the full joint distribution. For each unit, the hierarchical model draws the top-level component first and then draws $S^y_i$ and $S^x_i$ conditionally on it. This hierarchical construction lets the model represent a wide variety of shapes for the regression surface of the continuous variables on the baseline covariates and captures interactions among the categorical covariates. Generally, our model is insensitive to the specific choices of the truncation levels, provided that they allow for unoccupied components.
For the prior on the mixture weights, a truncated version of the stick-breaking construction (Ishwaran and James, 2001) is used for the top-level component and for each of the two block-specific components, with beta-distributed stick-breaking variables. We assume independent conjugate Gamma prior distributions with shape and rate parameters equal to 0.5 for the concentration parameters. The details of the prior distributions for the data model parameters are discussed in Web Appendix C, which includes guidelines to help practitioners understand the role of the specified prior distributions in inference and adjust them if needed.
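For reference, a generic truncated stick-breaking prior with truncation level $K$ and concentration parameter $\alpha$ (the block-specific indices are suppressed here for readability) takes the form

$$
V_h \sim \mathrm{Beta}(1, \alpha), \quad h = 1, \ldots, K - 1, \qquad V_K = 1,
\qquad
\pi_h = V_h \prod_{l < h} (1 - V_l),
$$

so that a component index takes value $h$ with probability $\pi_h$, and $\alpha \sim \mathrm{Gamma}(0.5, 0.5)$ as stated above.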
2.2.3. Posterior computation.
The posterior samples are efficiently obtained through Gibbs sampling. Imputation within the Gibbs sampler is straightforward: we sample predictive values for the missing entries from their full conditional distributions in the Gibbs steps. The details of the MCMC algorithm are presented in Web Appendix D. To estimate the ATE from the posterior samples of $\boldsymbol{\tau}$, we use the posterior mean as the point estimate and the 95% equal-tailed credible interval as the interval estimate for the ATE on each outcome. It is often not straightforward to check the convergence of the MCMC chain for the mixture model parameters; however, we can indirectly assess convergence with the trace plot of the ATE estimate and its Gelman–Rubin diagnostic.
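A minimal sketch of this post-processing step in R is shown below; the objects tau_draws (a matrix of posterior ATE draws with one column per outcome) and chains (a list of such matrices from independent chains) are hypothetical names, not part of our package interface.

```r
library(coda)

# Point and interval estimates of the ATE from posterior draws
ate_point <- colMeans(tau_draws)                                      # posterior means
ate_ci <- apply(tau_draws, 2, quantile, probs = c(0.025, 0.975))      # 95% equal-tailed CIs

# Convergence checks based on the ATE draws from several independent chains
mcmc_chains <- mcmc.list(lapply(chains, mcmc))
gelman.diag(mcmc_chains)      # potential scale reduction factors near 1 suggest convergence
traceplot(mcmc_chains[, 1])   # trace plot of the ATE draws for the first outcome
```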
3. Simulation Studies
In this section, three simulation studies with incomplete data are introduced to compare the BNP causal model to existing causal inference approaches with 1000 repeated simulations for each scenario. We designed the simulation studies so that their features are similar to our case studies whose data show an irregular distribution with missing values in mixed-type covariates and (multiple) outcomes.
3.1. Existing Causal Methods
We compared four existing causal methods to our suggested approach. The first is Bayesian additive regression trees (BART), a nonparametric Bayesian causal inference method for estimating the ATE (Hill, 2011). We implemented the R package bartCause with default settings (Dorie and Hill, 2018) as well as another package, bartMachine, which can natively handle missing covariate data by incorporating missingness into the BART trees (Kapelner and Bleich, 2016). Second, we used a linear regression model (LM) adjusting for all the covariates. Third, we implemented the inverse probability of treatment weighting (IPTW) with the propensity score estimated by a logistic model assuming an additive, linear form of the covariates. The R package WeightIt (Greifer, 2019) was used to produce the propensity score weights, and the survey package (Lumley et al., 2004) was used to incorporate these estimated weights when obtaining the ATE and its standard error. The last method is targeted maximum likelihood estimation (TMLE), a two-stage semi-parametric approach for causal inference (Van der Laan and Rubin, 2006). To estimate the outcome and propensity score models in the TMLE approach, we used the ensemble machine learning method of Van der Laan et al. (2007), which uses cross-validation to weight different prediction algorithms. The default prediction algorithms, such as step, glm, and glm.interaction, were implemented using the R package tmle (Van der Laan and Rose, 2011).
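As an illustration, the IPTW step applied to a single (completed) dataset can be sketched as below; dat, y, trt, and x1-x3 are placeholder names, and the propensity model is the additive logistic form described above.

```r
library(WeightIt)
library(survey)

# Estimate ATE weights from a logistic propensity score model (additive in the covariates)
w_out <- weightit(trt ~ x1 + x2 + x3, data = dat, method = "ps", estimand = "ATE")

# Fit a weighted outcome regression on treatment; the survey design gives robust SEs
design <- svydesign(ids = ~1, weights = w_out$weights, data = dat)
fit <- svyglm(y ~ trt, design = design)
summary(fit)   # the coefficient on trt is the IPTW estimate of the ATE
```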
Similar to our application studies introduced in the next section, all simulation settings assume that an analyst is interested in causal inference with incomplete data where outcome variables and covariates have missing values. If one wants to use a standard causal inference method that was not designed to handle missing values, a common practice is a two-stage process: impute the missing values in the first stage, then apply the standard causal inference method to the imputed datasets. In the simulation studies, we mimicked this practice. Ten imputed datasets were created using the multiple imputation by chained equations (MICE) algorithm implemented in the R package mice (Van Buuren and Groothuis-Oudshoorn, 2011), and the existing causal methods were then applied to the multiply imputed datasets. We used the predictive mean matching method for MICE and set the imputation model to include the outcome(s), treatment indicator, and covariates. The final causal inference was made using Rubin’s combining rules to produce a pooled estimate of the treatment effect and a confidence interval that accounts for the uncertainty introduced by the imputation process.
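A sketch of this two-stage pipeline for one of the comparison methods (here the linear model; dat, y, trt, and x1-x3 are placeholder names) is

```r
library(mice)

# Stage 1: create 10 completed datasets with predictive mean matching;
# the imputation model includes the outcome, treatment indicator, and covariates in dat
imp <- mice(dat, m = 10, method = "pmm", printFlag = FALSE)

# Stage 2: fit the causal model to each completed dataset and pool with Rubin's rules
fits <- with(imp, lm(y ~ trt + x1 + x2 + x3))
pooled <- pool(fits)
summary(pooled, conf.int = TRUE)   # pooled treatment effect (trt coefficient) and 95% CI
```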
3.2. Simulation Study 1 (Outcome with a Bimodal Distribution)
Simulation Study 1 assumes a single-outcome case where missing values exist in covariates only. Each simulated record consists of four continuous covariates and two binary covariates, with the binary covariates generated from Bernoulli distributions. The treatment indicator was generated from a conditional distribution given some of the covariates. To generate the potential outcome under control, we mimicked the simulation setting in Roy et al. (2018), which assumes a bimodal distribution of the outcome. By setting the simulation true causal effect to 1.5, we produced the potential outcome under treatment by adding 1.5 to the potential outcome under control, and the observed outcome was then generated according to the treatment indicator.
We assumed that two continuous and two categorical covariates have missing values. We generated the missing values under the MAR assumption, resulting in about 90% of records having at least one missing item. The details of the missing data mechanism used in this simulation study are introduced in Web Appendix E.
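The exact coefficients of the data-generating model and the missingness mechanism are given in the original formulas and Web Appendix E; the sketch below uses purely illustrative values only to convey the flavor of the setup (a bimodal control outcome, a constant unit-level effect of 1.5, and MAR amputation of the covariates via mice::ampute).

```r
library(mice)
set.seed(1)
n <- 1000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)   # continuous covariates
c1 <- rbinom(n, 1, 0.5); c2 <- rbinom(n, 1, 0.4)                 # binary covariates
z <- rbinom(n, 1, plogis(-0.3 + 0.5 * x1 - 0.4 * c1))            # treatment (illustrative model)

# Bimodal control outcome: a two-component mixture shifted by covariates; true ATE = 1.5
comp <- rbinom(n, 1, 0.5)
y0 <- ifelse(comp == 1, -2, 2) + 0.5 * x2 + 0.3 * c2 + rnorm(n, sd = 0.5)
y <- y0 + 1.5 * z

dat <- data.frame(y, z, x1, x2, x3, x4, c1, c2)
amp <- ampute(dat[, c("x1", "x2", "c1", "c2")], prop = 0.9, mech = "MAR")  # MAR missingness
dat[, c("x1", "x2", "c1", "c2")] <- amp$amp    # about 90% of records have a missing item
```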
3.2.1. Causal inference results.
For the causal inference with the BNP causal model in all simulation studies of this paper, we set upper bounds (truncation levels) on the numbers of mixture components and confirmed from the MCMC results that the upper bounds are large enough to accommodate all occupied components. The ATE estimate was obtained by averaging over 10,000 MCMC iterations after 2,000 burn-in iterations, which we checked are sufficient to approximate the stationary distribution.
In Table 1, we compare the BNP causal model to the existing causal inference methods. We report the point estimate of the ATE, standard error, root mean square error (RMSE), median absolute error (MAE), 95% confidence interval coverage, and the averaged standard error estimate. Note that this simulation study uses a correctly specified propensity score model for IPTW, so the performance of IPTW in practice, where the true propensity score model is unknown, will likely be worse than suggested by this simulation.
Table 1:
Average treatment effects (ATE) estimation in Simulation Study 1, with the simulation true ATE of 1.5. BNPc denotes the proposed Bayesian nonparametric causal inference method; bartMachine denotes applying the BART method to the simulated incomplete data using the R package bartMachine, which is designed to handle missing values; MICE + denotes the existing causal inference methods (BART, LM, IPTW, and TMLE) applied to the imputed data generated from MICE. Ref denotes the proposed BNP method applied to the simulation dataset before introducing missingness, i.e., all covariates have no missing values; Est denotes the average of the mean estimates over 1000 repeated simulations; SE, the standard error of the mean estimates; RMSE, the root mean square error; MAE, the median absolute error; Coverage, the empirical 95% confidence interval coverage; and SEE, the average of the standard error estimates. The average percentage of complete cases over the repeated simulations is around 9.1%.
| Method | Est | SE | RMSE | MAE | Coverage | SEE |
|---|---|---|---|---|---|---|
| Ref | 1.487 | 0.120 | 0.121 | 0.083 | 0.974 | 0.144 |
| BNPc | 1.542 | 0.132 | 0.139 | 0.095 | 0.950 | 0.141 |
| bartMachine | 1.684 | 0.351 | 0.396 | 0.265 | 0.912 | 0.356 |
| MICE + BART | 1.370 | 0.324 | 0.349 | 0.240 | 0.959 | 0.352 |
| MICE + LM | 1.514 | 0.440 | 0.440 | 0.304 | 0.949 | 0.438 |
| MICE + IPTW | 1.473 | 0.445 | 0.446 | 0.313 | 0.986 | 0.574 |
| MICE + TMLE | 1.479 | 0.452 | 0.452 | 0.318 | 0.949 | 0.449 |
Table 1 shows that the suggested BNP causal method (denoted BNPc in the table) estimates the ATE more precisely than the other methods, achieving the smallest RMSE and MAE. The suggested method’s performance is comparable to the reference result (denoted Ref in the table), which applies the BNP causal model to the simulation dataset before introducing missing values. The slight increase in the standard error of the BNP causal method compared to the reference result reflects the uncertainty introduced by the imputation process. The BNP causal method also yields confidence interval coverage close to the nominal value, suggesting that the proposed method correctly estimates the total variance, which combines the uncertainty from the missing values and from the causal estimation. Web Table 2 presents the performance of the proposed method under different missing rate scenarios, confirming that the method consistently works well, with smaller variances at lower missing rates.
3.2.2. Imputation performance and causal inference under no missing data case.
To compare the imputation performance of the suggested method with MICE, we calculated the mean estimates of two covariates with missing values from their completed datasets. Web Table 3 shows that the imputation performance of the MICE approach for the covariates is as good as (or slightly better than) that of the suggested method for this simulation data. This result implies that the relatively superior performance of the BNP causal method comes from the causal inference itself, not from better imputation, at least for this simulation data. To confirm this conjecture, we applied all methods to the simulation data before deletion, i.e., with no missing values in the dataset. Web Table 4 shows that the BNP causal model results in the smallest RMSE, outperforming the other methods by large margins, implying that the flexible feature of our joint modeling helps estimate the treatment effect efficiently.
3.3. Simulation Study 2 (Outcome and Covariates with a Mixture Distribution)
The second simulation aims to compare the performance of different causal inference approaches when mixed-type data (with both continuous and categorical covariates) show a multimodal and complex distribution. Mimicking the data structure in Section 2.2.1, we produced the simulation data by combining a mixture of categorical distributions with conditional regression models. The data were generated as follows. First, we randomly generated latent mixture labels for the top-level component and for the continuous and categorical blocks from categorical distributions. The top-level components follow a categorical distribution with probabilities (0.05, 0.40, 0.30, 0.15, 0.10), and the probability mass functions of the block-specific labels conditional on the top-level label are shown in Web Appendix F. Second, we generated the categorical covariates, whose marginal distributions are given by the probability mass functions in Web Appendix F. Third, we assumed that the continuous variables, consisting of the potential outcome under control and two continuous covariates, follow a multivariate normal mixture distribution whose component-specific mean is a linear function of a design matrix of the categorical covariates, with a specified covariance matrix. The true values of the parameters are shown in Web Appendix F. Fourth, we generated the treatment indicator from a conditional distribution given the continuous and categorical covariates. Last, the true unit-level causal effect was set to 1.5 for all units, and the observed outcome was generated according to the treatment indicator.
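A schematic of the mixture draw for the continuous block is sketched below with purely illustrative parameter values; in the actual study the component means also depend on the categorical covariates through a regression design matrix, and the true parameter values are those in Web Appendix F.

```r
library(MASS)
set.seed(2)
n <- 1000
probs <- c(0.05, 0.40, 0.30, 0.15, 0.10)                       # top-level component weights
h <- sample(seq_along(probs), n, replace = TRUE, prob = probs) # latent component labels

# Illustrative component means and covariance for (control outcome, two continuous covariates)
mu <- matrix(c(-3, 0, 0, -1, 1, 0, 0, 0, 1, 2, -1, 1, 4, 1, -1), nrow = 5, byrow = TRUE)
Sigma <- matrix(c(1, 0.3, 0.2, 0.3, 1, 0.1, 0.2, 0.1, 1), nrow = 3)

cont <- t(vapply(h, function(k) mvrnorm(1, mu[k, ], Sigma), numeric(3)))
colnames(cont) <- c("y0", "x1", "x2")   # multimodal joint distribution across components
```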
As shown in Web Figure 2, the bivariate relationships among the continuous variables show a complex distributional form, suggesting that a single-component multivariate Gaussian model cannot capture this irregular and skewed joint density. For missing covariates, we randomly deleted some values of six covariates with missing rates from 20% to 28% for each covariate. The simulation parameters and missing data generation details are presented in Web Appendix F.
3.3.1. Causal inference results.
The causal inference performance given in Table 2 shows a pattern similar to the previous simulation study in Section 3.2. Compared to the reference method, which assumes no missing data, the BNP causal model results in a slightly larger standard error due to the uncertainty introduced by the missing covariates. Similar results are observed with incomplete data at lower missing rates (Web Table 5). Compared to the other approaches, the proposed model outperforms the classical causal inference methods in terms of bias, RMSE, and MAE. Another interesting observation is that the existing causal inference approaches coupled with MICE imputation result in larger biases than the same methods applied to complete-case-only data (Web Table 6), which implies that MICE does not work well with this simulation dataset with an irregular, multimodal joint density. In addition, Web Table 7 confirms that our BNP model outperforms the other methods when the missing rate is as low as about 10%.
Table 2:
Average treatment effects (ATE) estimation in Simulation Study 2, with the simulation true ATE of 1.5, comparing the proposed BNP method (BNPc) with other existing methods. Ref denotes the reference result, which is only available in the simulation study. Est denotes the average of the ATE estimates over 1000 repeated simulations; SE, the standard error of the ATE estimates; RMSE, the root mean square error; MAE, the median absolute error; Coverage, the empirical 95% confidence interval coverage; and SEE, the average of the standard error estimates. The average percentage of complete cases over repeated simulations is around 22.2%.
| Method | Est | SE | RMSE | MAE | Coverage | SEE |
|---|---|---|---|---|---|---|
| Ref | 1.489 | 0.107 | 0.108 | 0.076 | 0.968 | 0.117 |
| BNPc | 1.496 | 0.126 | 0.126 | 0.085 | 0.957 | 0.130 |
| bartMachine | 1.438 | 0.165 | 0.177 | 0.115 | 0.912 | 0.155 |
| MICE + BART | 1.261 | 0.146 | 0.280 | 0.236 | 0.748 | 0.172 |
| MICE + LM | 1.308 | 0.176 | 0.260 | 0.188 | 0.828 | 0.186 |
| MICE + IPTW | 1.148 | 0.355 | 0.500 | 0.369 | 0.998 | 0.713 |
| MICE + TMLE | 1.331 | 0.212 | 0.271 | 0.185 | 0.947 | 0.264 |
3.3.2. Imputation performance.
Figure 1 (and Web Table 8) shows that the suggested BNP method imputes the missing continuous variables reasonably well, which results in decent performance in the point estimation and the confidence interval coverage. Under this complex data distribution setting, MICE imputation does not work well. The bottom panels in Figure 1 show that some imputed values from MICE (crosses in the figure) are located far from their original values (empty dots in the figure), which leads to the biased estimation and low confidence interval coverage of the MICE-based methods.
Figure 1:

Scatter plots of continuous variables in Simulation Study 2 imputed from the BNP method (top panels) and MICE (bottom panels). The solid dots on the background represent original data points; the empty dots represent points removed from input data as missing data points; the crosses represent imputed data points drawn from each method.
3.4. Simulation Study 3 (Missingness in Both Outcomes and Covariates)
The third simulation investigates the performance of the causal inference approaches in a setting similar to our cystic fibrosis case study, which has multiple outcomes and missing values in both outcomes and covariates. The data generation process of Simulation Study 3 is the same as in Simulation Study 2 except that there are two outcomes that are highly correlated. The true ATEs for the two outcomes were set to 1.5 and 0.5, respectively. Web Figure 3 shows the complex distributional features of the continuous variables in the simulation data, which cannot be adequately fit by a parametric model. To introduce missingness, some values of the outcomes and covariates were randomly deleted, with the missing patterns of the two outcomes highly associated. A similar missing pattern is observed in the cystic fibrosis study: if the lung function measurement is missing at 6 months, the following lung function measurement at 12 months is also highly likely to be missing. Our missingness mechanism results in a large amount of missing data, and the proportion of complete cases is only around 19% of all records. Web Appendix G gives the details of the simulation data generation with missing values.
We compared the BNP method with the existing causal methods. Because the existing methods are designed to handle a single outcome, the ATEs of the two outcomes were estimated separately by those methods, while our proposed BNP method simultaneously estimates the ATEs of both outcomes using the joint distributional information of the variables.
Table 3 summarizes the 1000 repeated simulations, each using a dataset with a sample size of 1000. The BNP method performs substantially better than the existing causal inference methods in terms of RMSE and MAE, demonstrating the efficiency gain of our joint approach over existing causal models fitted separately to each outcome. The coverage rate of our method, close to the nominal 95%, also suggests that its variance estimation is correct. Among the existing methods, the causal models applied to MICE-imputed datasets show larger biases than the complete-case-only analyses for the ATEs of both outcomes, implying that the MICE imputation introduces bias into the causal inference of all existing methods (Web Table 9).
Table 3:
Average treatment effects (ATE) estimation for the two outcome variables in Simulation Study 3. The true ATEs on the first and second outcomes are 1.5 and 0.5, respectively. Est denotes the average of the ATE estimates over 1000 repeated simulations; SE, the standard error of the ATE estimates; RMSE, the root mean square error; MAE, the median absolute error; Coverage, the empirical 95% confidence interval coverage; and SEE, the average of the standard error estimates. The average percentage of complete cases is around 19%.
| Estimand | Method | Est | SE | RMSE | MAE | Coverage | SEE |
|---|---|---|---|---|---|---|---|
| ATE on outcome 1 | BNPc | 1.474 | 0.094 | 0.098 | 0.066 | 0.951 | 0.095 |
| | MICE + BART | 1.138 | 0.161 | 0.396 | 0.361 | 0.485 | 0.176 |
| | MICE + LM | 1.123 | 0.182 | 0.418 | 0.371 | 0.466 | 0.176 |
| | MICE + IPTW | 0.991 | 0.481 | 0.700 | 0.528 | 0.909 | 0.570 |
| | MICE + TMLE | 1.263 | 0.209 | 0.315 | 0.253 | 0.852 | 0.228 |
| ATE on outcome 2 | BNPc | 0.472 | 0.090 | 0.094 | 0.063 | 0.942 | 0.093 |
| | MICE + BART | −0.124 | 0.271 | 0.680 | 0.625 | 0.542 | 0.318 |
| | MICE + LM | −0.189 | 0.298 | 0.751 | 0.705 | 0.477 | 0.324 |
| | MICE + IPTW | −0.054 | 0.490 | 0.740 | 0.572 | 0.873 | 0.621 |
| | MICE + TMLE | 0.046 | 0.357 | 0.577 | 0.460 | 0.836 | 0.404 |
The imputation performance shows a pattern similar to Simulation Study 2: the suggested BNP method imputes the missing values reasonably well, while the MICE imputation does not work well for some variables (Web Figure 4).
4. Application Studies
4.1. Juvenile Idiopathic Arthritis Data Analysis
4.1.1. Data.
Juvenile idiopathic arthritis (JIA) is a childhood autoimmune disease. It is an umbrella term for a set of pediatric-onset arthritis conditions and is considered a major cause of childhood disability (Harrold et al., 2013). Among many treatment options, the two most common approaches are non-biologic disease modifying anti-rheumatic drugs (DMARDs) and biologic DMARDs. This study aims at understanding whether the therapy using an early aggressive combination of non-biologic and biologic DMARDs is more effective than the more commonly adopted non-biologic DMARDs monotherapy in treating children with recently (< 6 months) diagnosed polyarticular courses of JIA.
Our application study is a re-analysis of electronic healthcare data obtained from 465 patients cared for at a single JIA care center (Huang et al., 2020). The primary outcome of interest is the clinical Juvenile Arthritis Disease Activity Score (cJADAS) measured after 6 months of treatment. It is the sum of the physician’s global assessment of disease activity (range, 0–10), the patient/parent’s global assessment of well-being (range, 0–10), and the active joint count truncated above 10. A higher cJADAS indicates higher disease activity. Patients assigned to the early aggressive plan received the combination of non-biologic and biologic DMARDs, and patients assigned to the conservative plan were initiated on non-biologic DMARDs monotherapy and did not receive any biologic DMARDs for at least 3 months.
Web Table 10 summarizes the thirteen baseline covariates we considered in the causal inference based on our clinical knowledge. Confounding by indication was evident: patients on the early aggressive treatment had significantly more active disease at baseline (e.g., mean±SD of cJADAS 16.08±7.14 vs. 12.39±5.91). In addition, the data suffer from missing values in the outcome and some covariates; for example, the missing rates in the cJADAS at 6 months, the uveitis ever-positive indicator, and the quality of life total index are 0.53, 0.50, and 0.48, respectively. Overall, only 19 patients have fully observed outcome and covariates, so we did not perform a complete-case analysis given the small sample size.
4.1.2. Results.
We applied the proposed BNP causal model to the JIA data to compare the effectiveness of the early aggressive and conservative treatment plans, using the same prior distributions described in Section 2.2. The ATE estimate was obtained by averaging over 30,000 posterior MCMC draws after discarding the first 5,000 iterations as burn-in. For comparison purposes, we applied the existing causal analysis methods to multiply imputed datasets generated from the mice package with the predictive mean matching method. The variance of the treatment effect estimated from each existing method was calculated using Rubin’s combining rules, as in the simulation studies. For the detailed implementation of the existing methods, we used the settings introduced in Section 3.1.
Table 4 shows the case study results with the JIA data. The estimated causal effect from our BNP causal model indicates that the early aggressive plan is more effective than the conservative plan, decreasing the disease activity measured by cJADAS at 6 months by 1.26 points on average. The posterior distribution of the causal effect (displayed in Web Figure 5) yields a 95% equal-tailed credible interval of (−2.42, −0.06), which does not contain zero. Moreover, the posterior probability that the treatment effect is less than zero is around 98%, i.e., the posterior probability that the early aggressive plan is more effective in suppressing the disease activity than the conservative plan is 98%. Both Bayesian approaches can provide this posterior probability of the ATE being less than zero.
Table 4:
Average treatment effect estimation in the juvenile idiopathic arthritis (JIA) data analysis. The proposed BNP method (BNPc) is compared with the four existing causal inference methods applied to the completed datasets generated by multiple imputation by chained equations (MICE). A negative estimate means that the disease activity under the early aggressive plan is lower than under the conservative plan, i.e., the early aggressive plan is more effective. 95% CI indicates the credible interval for BNPc and the confidence intervals for the other methods. The p-value for BNPc is calculated based on Bayesian asymptotic theory using the posterior mean and variance. P(ATE < 0) denotes the posterior probability that the ATE is less than 0, i.e., that prescribing the early aggressive plan decreases the disease activity; P(ATE < 0) for MICE + BART is calculated based on one imputed dataset generated from MICE.
| Method | Est | SE | 95% CI | p-value | P(ATE <0) |
|---|---|---|---|---|---|
| BNPc | −1.26 | 0.60 | (−2.42, −0.06) | 0.018 | 0.980 |
| MICE + BART | −1.76 | 0.94 | (−3.72, 0.21) | 0.077 | 0.944 |
| MICE + LM | −1.47 | 0.87 | (−3.28, 0.35) | 0.107 | |
| MICE + IPTW | −1.65 | 0.96 | (−3.63, 0.33) | 0.098 | |
| MICE + TMLE | −0.38 | 0.53 | (−1.41, 0.65) | 0.470 | |
4.2. Cystic Fibrosis Data Analysis
Cystic fibrosis (CF) is a lethal autosomal recessive disease that primarily affects the lungs, requiring chronic medications and management throughout the lifespan. Pseudomonas aeruginosa (Pa) is an opportunistic bacterium commonly found in the environment that causes widespread, severe infection in individuals with CF. Nearly half of all people living with CF have Pa infection (Cystic Fibrosis Foundation, 2020). Inhaled tobramycin (hereafter, Tobi) is an antibiotic that has been shown in clinical trial settings to improve lung function in CF patients with Pa. Observational studies aimed at estimating the causal effect of Tobi on lung function and survival have indicated confounding by disease severity (Sawicki et al., 2012). However, the best timing for Tobi use remains unclear, particularly in real-world clinical settings.
4.2.1. Study goal and data.
This study aims to investigate whether early initiation of Tobi can improve lung function outcomes measured at 6 months and 12 months after the first Pa infection. The percent predicted forced expiratory volume in 1 second (hereafter, FEV1) is our measure of lung function. The covariates identified based on clinical input and the aforementioned Tobi observational studies include sex, age, baseline FEV1, weight percentile, height percentile, insurance type, dornase alfa use, pancreatic insufficiency (defined as taking pancreatic enzymes), staphylococcus co-infection, and hypertonic saline use.
The study uses encounter-level data from the Cystic Fibrosis Foundation Patient Registry, ranging from January 1, 1996, to December 31, 2015. We excluded records from individuals younger than 6 years old because of limitations in measuring lung function in young children. Patients who did not experience at least one Pa infection or were not recorded as using Tobi were also excluded, leaving 555 patients eligible for the study. We evaluated the effects of two treatment strategies according to the timing of tobramycin delivery: receiving Tobi at the first occurrence of Pa infection (referred to as early Tobi) or receiving Tobi at the second or later Pa infection occurrences (late Tobi). Of the 555 eligible patients, 289 and 266 were in the early Tobi and late Tobi groups, respectively.
Like many other clinical datasets, this dataset suffers from a measurement timing issue: physicians do not always measure lung function on the exact date on which a patient was infected by Pa but rather within a certain time after the first Pa infection is observed. Therefore, the current practice in data preprocessing is to set a time window defined by the time of interest (e.g., date of Pa infection diagnosis) plus or minus a margin, and then treat the measurement at the visit within the time window as the measurement at the time of interest. The optimal or reasonable time window size is often unknown in practice, and our study examined four different time window sizes: 1 week, 2 weeks, 1 month, and 1.5 months. There is a trade-off between the size of the time window and the missing rate in the outcome variables: as the time window increases, we have fewer accurate measurements but fewer missing values. When a short time window is used, the missing data problem is substantial; for example, with a window of plus/minus one week, the missing rates for each outcome are more than 75% (Web Table 11).
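A sketch of this windowing step on encounter-level data is shown below; the data frame encounters and its columns id, visit_date, target_date, and fev1 are placeholder names.

```r
library(dplyr)

# For each patient, keep the FEV1 measurement closest to the target date
# (e.g., 6 months after the first Pa infection) within a +/- `window`-day margin;
# patients with no qualifying visit drop out here and become missing after merging back
window <- 14   # e.g., a 2-week window
fev_at_target <- encounters %>%
  mutate(gap = abs(as.numeric(visit_date - target_date))) %>%
  filter(gap <= window) %>%
  group_by(id) %>%
  slice_min(gap, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(id, fev1)
```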
4.2.2. Results.
We implemented our proposed BNP method in the CF study with the two lung function outcome measures processed using the four windows. The ATE estimates were obtained by averaging over 15,000 posterior MCMC draws after 5,000 burn-in iterations. Table 5 summarizes the results. The ATEs of lung function at the 6-month follow-up are positive for all four windows, suggesting a beneficial effect of early Tobi. With the 1-week time window, the ATE at the 6-month follow-up is 6.04, the largest among the four window sizes, and the 95% credible interval of (1.36, 10.72) does not cover zero. Despite having the largest standard error due to the highest missing rate with the 1-week time window, its posterior probability of a positive treatment effect is 99.4%. The ATE estimate at the 6-month follow-up decreases as the time window increases. For the ATE at 12 months, we observed some variation across windows, and there is insufficient evidence to conclude that early Tobi is effective for 12-month FEV1.
Table 5:
ATE estimates of lung function in the CF study from the proposed BNP method. The lung function outcome measures are processed using varying time windows, i.e., measurements at 6 months plus/minus a time window and 12 months plus/minus a time window. The ATE estimate, its standard error, the 95% credible interval, and the posterior probability that the ATE is greater than zero are presented for different measurements with varying time windows.
| Time window | Estimand | Est | SE | 95% CI | P(ATE >0) |
|---|---|---|---|---|---|
| 1 week | ATE at 6 months | 6.04 | 2.37 | (1.36, 10.72) | 0.994 |
| | ATE at 12 months | 2.71 | 1.98 | (−1.24, 6.45) | 0.908 |
| 2 weeks | ATE at 6 months | 2.90 | 1.79 | (−0.61, 6.39) | 0.946 |
| | ATE at 12 months | −0.68 | 1.66 | (−4.00, 2.58) | 0.341 |
| 1 month | ATE at 6 months | 1.51 | 1.44 | (−1.36, 4.28) | 0.852 |
| | ATE at 12 months | 1.77 | 1.34 | (−0.85, 4.40) | 0.906 |
| 1.5 months | ATE at 6 months | 0.66 | 1.24 | (−1.76, 3.09) | 0.705 |
| | ATE at 12 months | 0.95 | 1.21 | (−1.42, 3.37) | 0.787 |
5. Discussion
The proposed Bayesian hierarchical model for average treatment effect estimation with incomplete observational data has several key advantages for practical use. First, our method allows causal inference when missing values exist in the covariates (continuous and categorical) and the outcome variables. Second, our flexible, data-adaptive method minimizes model misspecification concerns and is better suited for complex data than parametric alternatives. If the true data-generating mechanism is simple, a flexible nonparametric model leads to a larger variance than the simple true model, but this cost is often outweighed by its benefit in practice, where it is difficult to ensure that a parametric model contains the true data-generating model. We extended the flexible but imputation-only approach of Murray and Reiter (2016) to causal inference, and our simulation results confirm the finding in Murray and Reiter (2016) that the hierarchically coupled mixture imputation model is superior to MICE imputation in capturing complex dependencies among mixed continuous and categorical variables. Third, the fully joint modeling feature of the suggested model applies to multiple continuous outcome variables, which are often of research interest. As demonstrated in the CF study, our joint modeling approach accounts for the multivariate structure of multiple outcomes, which improves estimation efficiency compared to fitting separate models to each outcome. Fourth, we wrote an R package, HCMMcausal, available at https://github.com/huaiyu0112/BNP-causal, which is easy to use. The runtime for Simulation 1 (12,000 iterations of the MCMC sampler) was about 12 minutes on a 2018 Dell G3 3579 with an Intel Core i7-8750H processor at 2.20 GHz and 16 GB RAM. Lastly, relying on Bayesian inference, our method presents a direct probabilistic statement (posterior summary) of the treatment effect and its uncertainty, which we believe increases the interpretability of the causal inference result.
We note that the settings in Simulation Studies 2 and 3 slightly favor our method because of the mixture distributional form of the simulation data. However, real data often have this irregular form, our model does not fix the number of mixture components, and we did not generate the simulation data from the exact analysis model. In addition, our CF application study was conducted before the recent introduction of disease-modifying therapies called modulators, so we refrain from making a definitive statement regarding the effect of early Tobi. Moreover, increasing the number of categorical variables is feasible with reasonable computational resources, while increasing the number of continuous variables imposes a significant computational burden because the dimension of the covariance matrix grows and its sampling becomes increasingly costly (Murray and Reiter, 2016). Despite this limitation, the proposed method suffices for our JIA dataset, which has a total of 8 categorical variables and 6 continuous variables. As a stress test, we checked that our model works for a larger simulated dataset with 20 continuous variables and two categorical variables, which took 37 minutes for 12,000 iterations of the MCMC sampler with the same CPU capacity used in the simulation studies. Furthermore, verifying the missingness mechanism in practice without external knowledge is difficult or even impossible; one may perform a sensitivity analysis for possible departures from the MAR assumption (Van Buuren, 2018) to assess the robustness (or sensitivity) of the estimated results. Finally, because of the label-switching phenomenon in Dirichlet process mixture modeling and its computation, one cannot check the model fit of each component in the nonparametric mixture model. Instead, we checked the model fit by comparing the joint distributions of the completed data (including imputed values) and the original simulation data, as presented in Figure 1. We also indirectly checked the model fit by assessing the performance of the treatment effect estimation, which is our main goal of inference.
There are several interesting directions for future research. First, our method can be extended to investigate heterogeneous causal effects across subgroups (e.g., treatment-covariate interactions) by allowing the causal parameters to vary across the mixture components. However, some additional challenges would need to be addressed in this direction: the mixture components of the BNP model do not necessarily represent meaningful subgroups and are subject to label switching, which requires additional post-processing, and missing data may complicate the problem because the subgroup effects could be related to the missingness. Second, our method can also be extended to account for time-varying confounding. One limitation of our study is that we assume only continuous outcomes; generalizing our data model to incorporate categorical outcomes would be an appealing extension for other real data applications, for example by adding a latent structure through a probit specification as DeYoreo et al. (2017) suggested. An additional limitation is that we only consider a point treatment setting in our JIA and CF motivational studies; the extension to dynamic treatment regimes may be of practical interest.
Supplementary Material
Acknowledgments
The authors thank the referees and the Associate Editor for their helpful suggestions. This study was supported in part by the Patient-Centered Outcomes Research Institute (grant ME-1408-19894), the Innovation Fund at Cincinnati Children’s Hospital and Center for Clinical and Translational Science and Training, the National Center for Advancing Translational Sciences of the National Institutes of Health (under award number 5UL1TR001425-03), the Cystic Fibrosis Foundation (grant SZCZES18Y7), and the National Institutes of Health/Heart, Lung and Blood Institute (grant R01 HL141286).
Footnotes
Supporting Information
Web Appendices, Tables, and Figures referenced in Sections 2–4 are available with this paper at the Biometrics website on Wiley Online Library. The R codes used in simulation studies are available at https://github.com/huaiyu0112/BNP-causal.
References
- Cystic Fibrosis Foundation (2020). Cystic Fibrosis Foundation Patient Registry 2019 annual data report. Cystic Fibrosis Foundation, Bethesda, MD, https://www.cff.org/medical-professionals/patient-registry.
- D’Agostino RB Jr and Rubin DB (2000). Estimating and using propensity scores with partially missing data. Journal of the American Statistical Association 95, 749–759.
- DeYoreo M, Reiter JP, Hillygus DS, et al. (2017). Bayesian mixture models with focused clustering for mixed ordinal and nominal data. Bayesian Analysis 12, 679–703.
- Dorie V and Hill J (2018). bartCause: Causal Inference using Bayesian Additive Regression Trees. R package version 1.0-0, https://cran.r-project.org/package=bartCause.
- Greifer N (2019). WeightIt: Weighting for Covariate Balance in Observational Studies. R package version 0.5.1, https://CRAN.R-project.org/package=WeightIt.
- Groenwold RH (2020). Informative missingness in electronic health record systems: the curse of knowing. Diagnostic and Prognostic Research 4, 1–6.
- Harrold LR, Salman C, Shoor S, Curtis JR, Asgari MM, Gelfand JM, Wu JJ, and Herrinton LJ (2013). Incidence and prevalence of juvenile idiopathic arthritis among children in a managed care population, 1996–2009. The Journal of Rheumatology 40, 1218–1225.
- Hill JL (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20, 217–240.
- Huang B, Qiu T, Chen C, Zhang Y, Seid M, Lovell D, Brunner HI, and Morgan EM (2020). Timing matters: real-world effectiveness of early combination of biologic and conventional synthetic disease-modifying antirheumatic drugs for treating newly diagnosed polyarticular course juvenile idiopathic arthritis. RMD Open 6, e001091.
- Imbens GW and Rubin DB (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. New York, NY: Cambridge University Press.
- Ishwaran H and James LF (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96, 161–173.
- Josefsson M and Daniels MJ (2021). Bayesian semi-parametric g-computation for causal inference in a cohort study with MNAR dropout and death. Journal of the Royal Statistical Society: Series C (Applied Statistics) 70, 398.
- Kang JD, Schafer JL, et al. (2007). Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22, 523–539.
- Kapelner A and Bleich J (2016). bartMachine: Machine learning with Bayesian additive regression trees. Journal of Statistical Software 70, 1–40.
- Knapp EA, Fink AK, Goss CH, Sewall A, Ostrenga J, Dowd C, Elbert A, Petren KM, and Marshall BC (2016). The Cystic Fibrosis Foundation Patient Registry. Design and methods of a national observational disease registry. Annals of the American Thoracic Society 13, 1173–1179.
- Leyrat C, Seaman SR, White IR, Douglas I, Smeeth L, Kim J, Resche-Rigon M, Carpenter JR, and Williamson EJ (2019). Propensity score analysis with partially observed covariates: How should multiple imputation be used? Statistical Methods in Medical Research 28, 3–19.
- Lu B and Ashmead R (2018). Propensity score matching analysis for causal effects with MNAR covariates. Statistica Sinica 28, 2005–2025.
- Lumley T et al. (2004). Analysis of complex survey samples. Journal of Statistical Software 9, 1–19.
- MacEachern SN (1999). Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science, pages 50–55.
- Mayer I, Sverdrup E, Gauss T, Moyer J-D, Wager S, and Josse J (2020). Doubly robust treatment effect estimation with missing attributes. The Annals of Applied Statistics 14, 1409–1431.
- Miriovsky BJ, Shulman LN, and Abernethy AP (2012). Importance of health information technology, electronic health records, and continuously aggregating data to comparative effectiveness research and learning health care. Journal of Clinical Oncology 30, 4243–4248.
- Murray JS and Reiter JP (2016). Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. Journal of the American Statistical Association 111, 1466–1479.
- Rosenbaum PR and Rubin DB (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 79, 516–524.
- Roy J, Lum KJ, Zeldow B, Dworkin JD, Re VL III, and Daniels MJ (2018). Bayesian nonparametric generative models for causal inference with missing at random covariates. Biometrics 74, 1193–1202.
- Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 688.
- Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys, volume 81. Hoboken, NJ: John Wiley & Sons.
- Sawicki GS, Signorovitch JE, Zhang J, Latremouille-Viau D, von Wartburg M, Wu EQ, and Shi L (2012). Reduced mortality in cystic fibrosis patients treated with tobramycin inhalation solution. Pediatric Pulmonology 47, 44–52.
- Szczesniak R, Heltshe SL, Stanojevic S, and Mayer-Hamblett N (2017). Use of FEV1 in cystic fibrosis epidemiologic studies and clinical trials: a statistical perspective for the clinical researcher. Journal of Cystic Fibrosis 16, 318–326.
- Van Buuren S (2018). Flexible Imputation of Missing Data. CRC Press.
- Van Buuren S and Groothuis-Oudshoorn K (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45, 1–68.
- Van der Laan MJ, Polley EC, and Hubbard AE (2007). Super learner. Statistical Applications in Genetics and Molecular Biology 6, 1–21.
- Van der Laan MJ and Rose S (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY: Springer.
- Van der Laan MJ and Rubin D (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics 2, 1–38.
- Yang S, Wang L, and Ding P (2019). Causal inference with confounders missing not at random. Biometrika 106, 875–888.