Summary
We propose a general Bayesian nonparametric (BNP) approach to causal inference in the point treatment setting. The joint distribution of the observed data (outcome, treatment, and confounders) is modeled using an enriched Dirichlet process. The combination of the observed data model and causal assumptions allows us to identify any type of causal effect—differences, ratios, or quantile effects, either marginally or for subpopulations of interest. The proposed BNP model is well-suited for causal inference problems, as it does not require parametric assumptions about the distribution of confounders and naturally leads to a computationally efficient Gibbs sampling algorithm. By flexibly modeling the joint distribution, we are also able to impute (via data augmentation) values for missing covariates within the algorithm under an assumption of ignorable missingness, obviating the need to create separate imputed data sets. This approach for imputing the missing covariates has the additional advantage of guaranteeing congeniality between the imputation model and the analysis model, and because we use a BNP approach, parametric models are avoided for imputation. The performance of the method is assessed using simulation studies. The method is applied to data from a cohort study of human immunodeficiency virus/hepatitis C virus co-infected patients.
Keywords: Bayesian modeling, Causal effect, Cluster, Enriched Dirichlet process mixture model, Missing data, Observational studies
1. Introduction
Bayesian methods have not been widely used for causal inference in observational studies. A possible reason for this is that causal inference in a likelihood-based framework often requires modeling the joint distribution of all of the observed data, including covariates or at least complex relationships between outcomes and confounders. For example, to estimate marginal causal effects, integration over the distribution of confounders is required (Robins, 2000). Because the dimension of covariates that need to be controlled for might be high, modeling the joint distribution of these covariates offers many opportunities for model misspecification. As a result, semiparametric methods that do not require specification of the joint distribution of the covariates have dominated the causal inference literature (e.g., Robins et al., 2000; van der Laan, 2010a,b; Neugebauer et al., 2013).
However, recent developments in Bayesian nonparametric (BNP) modeling, along with increasing computing capacity, have opened the door to new, potentially powerful approaches to causal inference. In the point treatment setting, one option is to directly model the conditional distribution of the outcome given covariates using a dependent Dirichlet process (MacEachern, 1999). Modeling a conditional distribution directly is what Shahbaba and Neal (2009) refer to as discriminative models. Marginal causal effects can be obtained by integrating the conditional distribution over the empirical distribution of the covariates. This approach was used by Roy et al. (2017) to directly parameterize causal effects from a marginal structural model. However, dependent Dirichlet process models can be computationally expensive. In addition, they do not easily allow for imputation of covariates within the model. A popular approach for flexible estimation of causal effects is Bayesian additive regression trees (BART) (Hill, 2011); however, this approach is restricted to “average” causal effects. Parametric Bayesian approaches that incorporate information about the relationship of the covariates with the treatment have also been explored (Wang et al., 2015); however, current development of those is restricted to parametric, additive regression models.
An alternative to discriminative models is generative models, which model the joint distribution of the data (i.e., the outcomes and covariates). BNP generative models can be used to induce the conditional distribution of the outcome given covariates (Müller et al., 1996) and were used by Xu et al. (2018) for causal inference using the propensity score. In this article, we use a Dirichlet process model similar to that proposed by Shahbaba and Neal (2009). These models can easily accommodate discrete and continuous covariates as well as missing covariates under a specific assumption about the missingness. We use a refinement of the model proposed by Wade et al. (2014) to obtain a flexible, yet computationally tractable regression model for the outcome. These priors/models also provide consistent estimation of the distribution function (Wade et al., 2011), which will hold in our setting under the (uncheckable) assumption of ignorable missingness for the covariates.
Our approach to causal inference is therefore as follows. We model the joint distribution of the observed data (outcome, treatment, and covariates) and then use standardization (the g-formula) to obtain causal effects. More specifically, from the joint model of the observed data, we “extract” the conditional distribution of the outcome given treatment and covariates and then “standardize” using the marginal distribution of the covariates. The implicit missing data imputation uses all the observed data (outcome, treatment, other covariates). It is important to note that the same BNP model applied to the same observed data could be used to extract a variety of causal effect parameters, including average treatment effects, quantile treatment effects, the causal effect of treatment on the treated, conditional treatment effects, and so on. This is the same approach that was taken by Daniels et al. (2012) and Kim et al. (2016) in causal mediation settings. Note that specific choices for the causal effects, for example, quantile causal effects, preclude certain approaches for the joint distribution of the observed data (e.g., BART will only be appropriate for average causal effects).
Modeling the full observed data distribution instead of just the conditional distribution of the outcome has many potential benefits, including possible efficiency gains, full posterior inference rather than just point estimates and confidence intervals, automatic imputation of missing data under an assumption of ignorable missingness, and a general way to account for uncertainty about a variety of assumptions. A drawback of many Bayesian approaches, including ours, is computational: we need to sample from the posterior distribution of the observed data parameters using MCMC. In addition, we use G-computation, which requires an MC integration for each posterior sample (though this step can be parallelized).
The article is organized as follows. Section 2 specifies causal identifying assumptions and causal effects that may be of interest. In Section 3, we develop a flexible model for the joint distribution of the observed data. Computations are described in Section 4. There are simulation studies in Section 5 which compare the proposed approach to several semiparametric alternatives. The BNP approach is applied to data from a study of human immunodeficiency virus/hepatitis C virus (HIV/HCV) co-infected patients in Section 6 followed by a discussion in Section 7.
2. Causal Effects
Suppose we are interested in causal effects of treatment A on outcome Y. We assume that treatment is discrete/categorical and not continuous, taking one of q possible values. For the ith subject (i = 1, . . . , n), the treatment is represented by a vector of indicator variables Ai = (A1,i, . . . , Aq,i), where At,i is an indicator for treatment category t. Most typically, Ai will just be a single variable indicating whether the subject received the new treatment. Denote by Li a p × 1 vector of pre-treatment variables.
Our goal is to identify causal effects from the observed data (Y, A, L). In this section, assume the joint distribution of the observed data, p(y, a, l) is known. That is, our goal here is to specify causal effects of interest and identification assumptions, given that p(y, a, l) is known. In this section, it does not matter whether the joint distribution of the observed data is described with few (very parametric) or infinitely many (nonparametric) parameters. Estimation of the joint distribution is a distinct step from defining causal effects and making identifying assumptions.
We consider definitions of causal effects that are functions of potential outcomes. Each subject has q potential outcomes {Ya}, where Ya is the outcome that would be observed if treatment were set to a. There are many possible causal effects that could be of interest to researchers. For simplicity, we will focus here on the situation where q = 2. Some examples include:
E(Y1 − Y0): average causal effect (continuous outcome) or average causal risk difference (binary outcome)
E(Y1)/E(Y0): average causal relative risk (binary outcome)
E(Y1 − Y0|V): conditional average causal effect (where V ⊂ L)
E(Y1 − Y0|A = 1): average effect of treatment on treated
η1(p) − η0(p), where ηa(p) is the pth quantile of the cumulative distribution function P(Ya ≤ y): a quantile causal effect (Xu et al., 2018).
All of the above causal effects are functionals of the distribution of the potential outcomes. We can identify these causal effects from the following three assumptions. The first assumption is consistency, which states that Ya = Y among subjects with A = a, for all a. That is, the potential outcome if we were to set A = a is the same as the outcome that we observe if A = a. We next assume positivity p(A = a|L) > 0 if p(L) > 0. This implies that at each possible level of the confounders, each treatment level has non-zero probability. Finally, we assume ignorability, or {Ya ⫫ A|L}. In other words, given confounders L, treatment can be thought of as randomly assigned.
These three assumptions imply F(y|A = a, L) = F(ya|A = a, L) = F(ya|L). We can, therefore, identify any functional of F(ya|L) from p(Y, A, L); for example, E(Ya) = E{E(Y|A = a, L)}, where the outer expectation is over the marginal distribution of L = (V, W). For E(Ya|V), the integration is instead over p(W|V); for E(Ya|A = a′), integration is over p(L|A = a′), which is known if p(Y, A, L) is known.
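As a concrete illustration of the standardization E(Ya) = E{E(Y|A = a, L)}, the following sketch averages a conditional mean over draws of the confounder. The outcome regression, its coefficients (including an A-by-L interaction, so the marginal effect genuinely depends on the distribution of L), and the confounder distribution are all made up for illustration; nothing here is from the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical outcome regression E(Y | A, L); coefficients are invented and
# include an A-by-L interaction, so the marginal effect depends on p(L)
def e_y_given(a, l):
    return 0.2 + 0.1 * a + 0.25 * l + 0.2 * a * l

# stand-in draws for the confounder distribution p(L)
L = rng.binomial(1, 0.4, size=100_000)

# g-formula: average the conditional mean over the distribution of L
e_y1 = e_y_given(1, L).mean()
e_y0 = e_y_given(0, L).mean()
ate = e_y1 - e_y0  # E(Y1) - E(Y0) = 0.1 + 0.2 E(L), about 0.18 here
```

Because of the interaction term, the marginal contrast 0.1 + 0.2E(L) differs from the conditional contrast at any fixed L, which is why integration over p(L) is required.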
3. BNP Model for Observed Data
In order to estimate the causal effects described in Section 2, we first need to estimate the joint distribution p(Y, A, L). Let X = (A, L) and consider estimation of p(Y, X). While estimation of this joint distribution could be parametric or nonparametric, we propose a Bayesian nonparametric approach. This will allow us to flexibly model the joint distribution (whose parameter values are not of interest) while allowing ignorable missingness in L (more on the latter in Section 4).
We propose to model the joint distribution of (Y, X) using the following enriched Dirichlet process (EDP) mixture (Wade et al., 2011; Wade et al., 2014):
Yi | Xi, θi ~ p(y | Xi, θi),
Xi,r | ωi ~ p(xr | ωi,r), independently over r = 1, . . . , p + 1,
(θi, ωi) | P ~ P,
P ~ EDP(αθ, αω, P0).    (1)
The notation P ~ EDP(αθ, αω, P0) means that Pθ ~ DP(αθ, P0,θ) and Pω|θ ~ DP(αω, P0,ω|θ) with base measures P0 = P0,θ × P0,ω|θ.
This formulation implies that each subject i has their own parameters θi and ωi. However, because P is discrete (Ferguson, 1973), some clusters of subjects will have the same θi and ωi. Note that the covariates, Xir are specified to be independent within clusters. The number of clusters depends on the concentration parameters αθ and αω, where low values indicate fewer clusters. Typically DP models have a single concentration parameter. The enrichment of the usual DP is to have nested concentration parameters. This allows for more x-clusters than y-clusters, which is important because the dimension of x will typically be much larger than that of y. Importantly, this is accomplished while keeping cluster membership dependent on both y|x and x through the nesting of the random partition.
We assume a local generalized linear model for p(y|x, θi) (Hannah et al., 2011). That is, within a cluster, Yi | Xi, θi follows a generalized linear model with g{b′(ηi)} = Xiβi, where b′(ηi) is the mean implied by the canonical parameter ηi and g{·} is a link function. For example, if Y is binary then Yi|Xi, θi ~ Bern{logit−1(Xiβi)}, where θi = βi and Xi is the design vector involving A and L. In the linear regression case, θi would include both regression coefficients and a variance. In practice, covariates are standardized.
An important aspect of this model is that covariates X are specified to be locally independent. That is given ωi, covariates are independent; this is similar to latent class models, where given latent class membership, random variables, here covariates, are assumed independent. Two subjects in the same subcluster would have similar values of X. In general, dependence between random variables decreases as the window under consideration shrinks. As an illustration, consider bivariate normal random variables x1 and x2 with mean 0, variance 1, and correlation 0.9. In that case, cor{x1, x2|x1 ∈ (0, 0.2), x2 ∈ (0, 0.2)} ≈ 0.02. The local independence specification makes it easy to include many continuous and discrete confounders, because the joint distribution is just a product of marginal distributions. In addition, computations are considerably faster because covariance matrices for the joint distribution of confounders are not needed. Note that while we assume that locally the generalized linear model is correctly specified for y and x and that the x’s are independent from each other, globally all of the variables are dependent with potentially non-linear relationships.
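The near-independence claim for the narrow window can be checked directly by simulation; this sketch estimates the correlation of bivariate normal draws (correlation 0.9) restricted to the square (0, 0.2) × (0, 0.2):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2_000_000

# bivariate normal with mean 0, variance 1, and correlation 0.9
cov = [[1.0, 0.9], [0.9, 1.0]]
x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
x1, x2 = x[:, 0], x[:, 1]

# the global correlation is close to 0.9
r_global = np.corrcoef(x1, x2)[0, 1]

# restrict to a narrow window: both coordinates in (0, 0.2)
keep = (x1 > 0) & (x1 < 0.2) & (x2 > 0) & (x2 < 0.2)
r_local = np.corrcoef(x1[keep], x2[keep])[0, 1]
# r_local is near 0: within the window the variables are nearly independent
```

The estimated within-window correlation lands near the 0.02 figure quoted in the text, while the unrestricted correlation remains near 0.9.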
The EDP model (1) can equivalently be represented with the square-breaking formulation (Wade et al., 2011), which is a generalization of the standard stick-breaking representation of DP models (Sethuraman, 1994). The joint distribution of the observed data for subject i can be written

p(yi, xi) = Σj γj K(yi | xi, θ*j) Σl γl|j K(xi | ω*jl),

where j indexes the y-clusters, l indexes the x-clusters nested within them, and the K(·) are the kernels of the corresponding distributions in (1). The weights have priors γj = γ′j Πl<j(1 − γ′l) and γl|j = γ′l|j Πm<l(1 − γ′m|j), where γ′j ~ Beta(1, αθ) and γ′l|j ~ Beta(1, αω).
The conditional distribution implied by the joint model is p(y | x) = Σj wj(x)K(y | x, θ*j), where

wj(x) = γj Σl γl|j K(x | ω*jl) / Σj′ γj′ Σl γl|j′ K(x | ω*j′l).
Notice that the weights wj(x) depend on x. Therefore, even though K(y|x, θj) is a generalized linear model, p(y|x) is a computationally tractable, flexible, non-linear, non-additive model.
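To illustrate how covariate-dependent weights turn local GLMs into a flexible, non-additive conditional model, here is a minimal two-cluster sketch. The weights, Gaussian x-kernels, and logistic coefficients are all invented for illustration, not taken from any fitted model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# hypothetical two-cluster mixture: cluster j has prior weight gamma[j],
# a Gaussian kernel for x, and a local logistic regression for y | x
gamma = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])       # x-kernel means
sd = np.array([1.0, 1.0])
beta = np.array([[0.0, 1.0],     # intercept and slope for cluster 1
                 [1.0, -0.5]])   # intercept and slope for cluster 2

def p_y1_given_x(x):
    # covariate-dependent weights: w_j(x) proportional to gamma_j * K(x | omega_j)
    w = gamma * normal_pdf(x, mu, sd)
    w = w / w.sum()
    # mixture of local GLM predictions
    local = sigmoid(beta[:, 0] + beta[:, 1] * x)
    return float(w @ local)

# deep in the left cluster the first GLM dominates; deep right, the second
left, right = p_y1_given_x(-4.0), p_y1_given_x(4.0)
```

Even though each cluster's regression is linear on the logit scale, the mixture p(y|x) bends smoothly between the two local fits as the weights shift with x.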
4. Computations
We use a Gibbs sampler to obtain draws from the posterior distribution. In particular, we use an extension of Neal's (2000) Algorithm 8 to accommodate nested clustering. This approach alternates between sampling cluster membership for each subject and sampling values of the parameters given the cluster partitioning. Sampling cluster membership is not complex, due to the closed form resulting from the Pólya urn in the collapsed Gibbs sampler.
Here, we briefly describe the Gibbs sampling steps. Detailed steps are given in Web Appendix A. Following Wade et al. (2014), let si = (si,y, si,x) denote cluster membership for subject i. Note that the value of si,x is only meaningful in conjunction with si,y, as it indicates to which x-cluster within y-cluster si,y the subject belongs. The basic steps in the Gibbs sampler are as follows. We sample si for each subject, and then, given s, we sample parameters θ and ω from their conditional distributions, given the data. Denote by θ*j the θ that is associated with the jth currently non-empty y-cluster (j = 1, . . . , k), with ω*jl defined similarly, and denote by kj the number of currently non-empty x-clusters within the jth y-cluster. The clusters, subclusters, and their corresponding parameters are depicted in Figure 1. To obtain a new draw of si given the current partition and current cluster-specific parameters, we simply draw from a multinomial distribution (see appendix for multinomial probabilities). Next, we provide further details on two additional components necessary for the causal inference setting: estimation of causal effects and imputation of missing covariates.
Figure 1.
Diagram of current clusters, subclusters, and values of parameters, along with proposed new clusters and subclusters. When updating cluster membership, the probability of being in each current and proposed new S is computed for subject i, and they are then randomly assigned to an S from the corresponding multinomial distribution. Once all subjects have been assigned to a cluster at a given iteration in the Gibbs sampler, then the parameters are updated, given cluster membership.
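The Pólya-urn form that makes cluster-membership sampling simple can be sketched in a simplified, non-nested version (a single DP rather than the nested EDP partition; cluster sizes and the concentration parameter below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def crp_membership_probs(counts, alpha):
    """Polya-urn prior probabilities for one subject's cluster label:
    existing cluster j w.p. n_j / (n + alpha), a new cluster w.p.
    alpha / (n + alpha)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    return np.append(counts, alpha) / (n + alpha)

# e.g., three existing clusters of sizes 5, 3, 2 and concentration alpha = 1
probs = crp_membership_probs([5, 3, 2], alpha=1.0)

# in the full sampler these prior weights are multiplied by the likelihood of
# subject i's data under each cluster's parameters, renormalized, and the
# label is drawn from the resulting multinomial
label = rng.choice(len(probs), p=probs)
```

In the nested EDP sampler the same urn logic is applied twice: once over y-clusters and once over the x-subclusters within the chosen y-cluster.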
4.1. Post-Processing Steps for Estimation of Causal Effects
Once we have obtained draws of the parameters from the posterior distribution, we can compute any functionals of the distribution of potential outcomes. Here, we focus on the following expectations of the potential outcomes: E(Ya), E(Ya|V), or E(Ya|A = a′).
Suppose, for example, that we would like to obtain draws from the posterior of E(Ya). Recall that E(Ya) = E{E(Y|A = a, L)}, and we assume a GLM within clusters, with E(y|a, l, θ*j) = g−1{(a, l)β*j}, where (a, l) denotes the design vector. We can alternate between drawing cluster membership and covariates from p(s, L) and computing E(Y|A = a, L = l, θ*, ω*, s). We can use MC integration to integrate out s and L. In particular, given current values of the parameters, {θ*, ω*, s}, from the Gibbs sampler, we can compute E(Y|A = a, L = l, θ*, ω*, s) as

E(Y|A = a, L = l, θ*, ω*, s) = {Σj=1,...,k wj(a, l)E(y|a, l, θ*j) + wk+1(a, l)E0(y|a, l)} / {Σj=1,...,k wj(a, l) + wk+1(a, l)},

where, for j = 1, . . . , k, wj(a, l) = [nj/(αθ + n)]{Σc=1,...,kj [nc|j/(αω + nj)]K(a, l|ω*jc) + [αω/(αω + nj)]K0(a, l)}, with nj the number of subjects in y-cluster j and nc|j the number in its cth x-subcluster, and wk+1(a, l) = [αθ/(αθ + n)]K0(a, l).
The terms K0(a, l) and E0(y|a, l) are the distribution and mean, respectively, after integrating the parameters over the prior distributions. That is, K0(a, l) = ∫ K(a, l|ω) dP0,ω|θ(ω) and E0(y|a, l) = ∫ E(y|a, l, θ) dP0,θ(θ). For non-conjugate distributions, Monte Carlo (MC) integration can be used to obtain these quantities.
We can then obtain a draw of E(Ya) by integrating over the marginal distribution of L using MC integration. In particular, for each posterior sample of the observed data parameters, we do an MC integration over the current “estimate” of the marginal distribution of L; this can be viewed as sampling from the posterior predictive distribution of L. For this, we must first draw M samples from p(L, s) as follows. For m = 1, . . . , M,

Draw s(m)y from a multinomial on {1, . . . , k + 1} with probabilities {n1/(αθ + n), . . . , nk/(αθ + n), αθ/(αθ + n)}.

If s(m)y = j for some j ≤ k, draw s(m)x from a multinomial on {1, . . . , kj + 1} with probabilities {n1|j/(αω + nj), . . . , nkj|j/(αω + nj), αω/(αω + nj)}; else, set s(m)x = 1.

Draw Lm from the covariate kernel K(l|ω*) corresponding to s(m) = (s(m)y, s(m)x), where, if s(m)y = k + 1 or s(m)x = kj + 1 (i.e., if a new cluster is opened up), ω* is drawn from the prior distribution.
Once we have obtained M values (lm, s(m)), we can approximate the integral as E(Ya) ≈ M−1 Σm=1,...,M E(Y|A = a, L = lm, θ*, ω*, s(m)).
Computing this separately for a = 1 and a = 0, for example, would allow us to obtain a draw of a causal effect, such as E(Y1) − E(Y0). For causal effects conditional on V = v or A = a, we essentially repeat the above steps, but integrate over the conditional distribution of W|V = v or L|A = a, respectively, rather than the marginal distribution of L.
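The predictive sampling of L and the MC integration can be sketched as follows. The cluster counts, concentration parameter, covariate kernels, and regression function below are all illustrative stand-ins for the quantities maintained by the Gibbs sampler, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(7)

# illustrative sampler state: k = 2 non-empty y-clusters with subject counts
# n_j and concentration alpha_theta; cluster-specific covariate kernels are
# normals with the (assumed) means mu_j
alpha_theta = 1.0
n_j = np.array([60.0, 40.0])
n = n_j.sum()
mu_j = np.array([-1.0, 1.5])

def draw_L():
    # draw a y-cluster from the urn: existing cluster j w.p. n_j/(n + alpha),
    # a new cluster w.p. alpha/(n + alpha)
    probs = np.append(n_j, alpha_theta) / (n + alpha_theta)
    j = rng.choice(len(probs), p=probs)
    # draw L from that cluster's kernel, or from the base measure (prior)
    # if a new cluster is opened
    mu = mu_j[j] if j < len(mu_j) else rng.normal(0.0, 2.0)
    return rng.normal(mu, 1.0)

def e_y_given(a, l):
    # stand-in for the cluster-averaged regression E(Y | A = a, L = l, ...)
    return 0.1 + 0.5 * a + 0.2 * l

# MC integration over the posterior predictive of L gives one draw of E(Ya),
# and hence one draw of the causal contrast
M = 20_000
L_draws = np.array([draw_L() for _ in range(M)])
ate_draw = e_y_given(1, L_draws).mean() - e_y_given(0, L_draws).mean()
```

Repeating this for each retained Gibbs iteration yields a full posterior sample of the causal effect.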
Because this is a post-processing step, its computation is not necessary for the Gibbs sampler used to sample the observed data model parameters. Therefore, to improve computational efficiency, draws of the causal effect parameters would not need to be obtained for every draw of the Gibbs sampler and could be done in parallel.
4.2. Imputation of Missing Covariates
Missing values of covariates (L's) can be dealt with using data augmentation under an assumption of ignorable missingness. Because we have already specified a full model for (Y, A, L), we simply need to obtain draws of missing L's from the appropriate conditional posterior distribution at each iteration in the Gibbs sampler. Suppose Li,r is a binary covariate that is missing for subject i. At each step in the Gibbs sampler, we do the following. Denote by πi,r the current value of the binomial probability parameter for the rth covariate. Note that this value of ω is based on the cluster assigned to subject i. Denote by X(k)i the original vector Xi in which covariate Li,r is set to a value of k. We draw Li,r from a Bernoulli distribution with probability

p(Li,r = 1 | −) = πi,r p(yi|X(1)i, θi) / {πi,r p(yi|X(1)i, θi) + (1 − πi,r)p(yi|X(0)i, θi)}.
To draw values for missing continuous covariates, we use the Metropolis–Hastings algorithm with a random walk candidate distribution. The conditional posterior for a missing continuous covariate Li,r is proportional to K(Li,r|ωi)p(yi|Xi, θi).
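A single imputation step for a missing binary covariate can be sketched as follows, with invented values for the cluster-specific parameters and a logistic outcome kernel (none of these numbers come from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# assumed state for one subject: cluster-specific Bernoulli parameter for the
# missing covariate, logistic outcome coefficients, observed (a, y)
pi_r = 0.3                         # cluster value of P(L_r = 1)
beta = np.array([-0.5, 0.8, 0.6])  # intercept, treatment, L_r (invented)
a_i, y_i = 1, 1

def outcome_lik(y, l_r):
    # p(y | a, l_r, beta) under the local logistic model
    p = sigmoid(beta[0] + beta[1] * a_i + beta[2] * l_r)
    return p if y == 1 else 1.0 - p

# covariate kernel times outcome likelihood, normalized over l_r in {0, 1}
num = pi_r * outcome_lik(y_i, 1)
den = num + (1.0 - pi_r) * outcome_lik(y_i, 0)
p_impute = num / den
l_imputed = rng.binomial(1, p_impute)  # the data-augmentation draw
```

Because the outcome likelihood enters the draw, the imputation is congenial with the analysis model by construction.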
5. Simulation Studies
We carry out simulation studies under six different data generating models to assess the performance of the proposed BNP method. The first two simulation scenarios involve binary outcomes. The causal parameters of interest are a marginal relative risk, ψrr = E(Y1)/E(Y0), and a marginal risk difference, ψrd = E(Y1) − E(Y0). In the remaining scenarios, there is a continuous outcome, and the causal parameter of interest is the average causal effect ψ = E(Y1) − E(Y0). Note that we only present the first three scenarios in the main paper. Scenarios 4–6, which explore complex distributions, residual confounding, and sensitivity to the prior on the precision parameters, are presented in Web Appendix B.
Methods:
In each simulation scenario, we estimate the causal parameter(s) using several methods. For the first two methods, inverse probability of treatment weighting (IPTW) and targeted maximum likelihood estimation (TMLE), we estimate the propensity score with a logistic model assuming an additive, linear form of the covariates, L. In scenarios 1 and 2, this propensity score model is correctly specified. In scenarios 3–5, the propensity score model is misspecified because the treatment is generated using a complex functional form. To estimate the outcome model in the TMLE approach, we use Super Learner (van der Laan et al., 2007), which is an ensemble machine learning method that uses cross-validation to weight different prediction algorithms. We use six algorithms (mean, glm, step, gam, randomForest, glmnet), implemented using the R package tmle (Gruber and van der Laan, 2012). In scenarios 1 and 2 only, we compare the proposed BNP approach with a parametric Bayesian approach in which we fit fully Bayesian logistic regression models, with the treatment and covariates included in the model as additive, linear predictors. In scenario 2, the specified parametric distribution does not match the data generating distribution for the outcome. Average causal effects are obtained by averaging over the empirical distribution of the covariates. Finally, we use the BART approach proposed by Hill (2011) in scenarios 3–5. The results for scenarios 4–6 can be found in the supplementary materials.
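For orientation, a minimal IPTW (Horvitz-Thompson) estimate on toy data looks like the following. For simplicity this sketch plugs in the true propensity score, whereas the simulations estimate it by logistic regression; the generating model and coefficients are invented.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000

# toy data with one confounder; all generating values are illustrative
L = rng.normal(0.0, 1.0, n)
e = 1.0 / (1.0 + np.exp(-0.5 * L))               # true propensity P(A=1 | L)
A = rng.binomial(1, e)
Y = 1.0 * A + 0.8 * L + rng.normal(0.0, 1.0, n)  # true ATE = 1

# the naive difference in means is confounded by L
naive = Y[A == 1].mean() - Y[A == 0].mean()

# IPTW (Horvitz-Thompson) estimate of E(Y1) - E(Y0): weighting by the
# inverse propensity removes the confounding by L
ate_iptw = np.mean(A * Y / e) - np.mean((1 - A) * Y / (1 - e))
```

The naive contrast overshoots the true effect of 1 because treated subjects have systematically larger L, while the weighted estimate recovers it.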
For the proposed BNP approach, we first standardize the continuous covariates. We then assume

Yi | Xi, θi ~ Bern{logit−1(Xiβi)} (binary outcome) or N(Xiβi, σ2i) (continuous outcome),
Xi,r | ωi ~ Bern(πi,r), r = 1, . . . , p1 + 1,
Xi,r | ωi ~ N(μi,r, σ2i,r), r = p1 + 2, . . . , p1 + p2 + 1,    (2)

where the first p1 + 1 components of Xi = (Ai, Li) are the treatment and the binary covariates and the remaining p2 components are the continuous covariates. The Xi,r are independent over i and r. The base measure for β is N(β0, cΣβ0), and when the outcome is continuous, p0θ(σ2) = Scale Inv-χ2(1, 1). We set β0 and Σβ0 to the maximum likelihood estimates from an ordinary linear or logistic regression of Y on X (Taddy, 2008). That is, our prior guess is that cluster-specific regression coefficients will equal the corresponding coefficients from a regression model applied to all of the data, with uncertainty in that guess reflected by a constant c > 1 times the covariance matrix (weakly informative). In the data analyses we used c = n/5. We assume conjugate priors p0(πr) = Beta(ax, bx) for binary covariate parameters, where ax = bx = 1 for r = 1, . . . , p1 + 1. Priors for the continuous covariate parameters are p0(μr | σ2r) = N(μ0, σ2r/c0) and p0(σ2r) = Scale Inv-χ2(ν0, τ20), where ν0 = 2, c0 = 0.5, and μ0 = 0, for r = p1 + 2, . . . , p1 + p2 + 1. For the concentration parameters αθ and αω, we assume Gam(1, 1) priors. For scenarios with n = 250, we used a burn-in of 10,000 and then used an additional 90,000 Gibbs samples for posterior inference. For n = 1000 or n = 3000, we found that 1000 draws from the Gibbs sampler were sufficient for a burn-in period and that 19,000 additional draws were enough to accurately capture the posterior.
For each scenario and each method, we tested the methods on 1000 generated datasets, and we report the absolute bias, empirical standard deviation (ESD), coverage probability, and the width of 95% credible or confidence intervals.
5.1. Scenario 1: Binary Outcome, Simple Functional Forms
We simulated a binary treatment, two binary covariates, two continuous covariates, and the binary outcome as follows: L1 ~ Bern(0.2), L2|L1 ~ Bern{logit−1(0.3 + 0.2L1)}, L3|L1, L2 ~ N(L1 − L2, 1²), L4|L1, L2, L3 ~ N(1 + 0.5L1 + 0.2L2 − 0.3L3, 2²), A|L1, . . . , L4 ~ Bern{logit−1(−0.4 + L1 + L2 + L3 − 0.4L4)}, Y|A, L1, . . . , L4 ~ Bern{logit−1(−0.5 + 0.78A − 0.5L1 − 0.3L2 + 0.5L3 − 0.5L4)}. The true causal parameters are ψrr = 1.5 and ψrd = 0.13.
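The scenario 1 generating model can be coded directly; the following sketch also checks the reported true values of ψrr and ψrd by Monte Carlo, averaging the two potential-outcome probabilities over a large covariate sample.

```python
import numpy as np

rng = np.random.default_rng(2023)

def logit_inv(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate(n):
    # covariates, treatment, and outcome as specified for scenario 1
    L1 = rng.binomial(1, 0.2, n)
    L2 = rng.binomial(1, logit_inv(0.3 + 0.2 * L1))
    L3 = rng.normal(L1 - L2, 1.0)
    L4 = rng.normal(1 + 0.5 * L1 + 0.2 * L2 - 0.3 * L3, 2.0)
    A = rng.binomial(1, logit_inv(-0.4 + L1 + L2 + L3 - 0.4 * L4))
    eta = -0.5 - 0.5 * L1 - 0.3 * L2 + 0.5 * L3 - 0.5 * L4
    Y = rng.binomial(1, logit_inv(eta + 0.78 * A))
    return A, Y, eta

A, Y, eta = simulate(250)  # one dataset of the size used in the paper

# Monte Carlo check of the true causal parameters: average the potential-
# outcome probabilities (with and without treatment) over the covariates
*_, eta = simulate(400_000)
p1, p0 = logit_inv(eta + 0.78), logit_inv(eta)
psi_rd = (p1 - p0).mean()        # reported true value: 0.13
psi_rr = p1.mean() / p0.mean()   # reported true value: 1.5
```

Note that eta is computed from covariates drawn marginally (not conditional on A), which is exactly the standardization over p(L) that defines E(Ya).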
Missing data:
We carried out an additional analysis for the BNP approach only, where we randomly deleted values of L based on the following probabilities: L1 missing with probability logit−1(−2 + L2 + Y), L2 missing with probability logit−1(2 + L3 + A), L3 missing with probability logit−1(−1.5 − A + Y), and L4 missing with probability logit−1(−.9 − L1 − L2). This results in about 20% missing values for each covariate. We then analyzed the data using the data augmentation approach described in Section 4.
Results:
The results are given in Table 1. The proposed BNP approach performed well with nearly unbiased point estimates and the smallest ESDs, matching the performance of the correctly specified parametric Bayesian model. This is because the BNP model essentially settled on one cluster making it equivalent to the parametric Bayesian model. Thus, in this simulation, no efficiency price was paid for fitting the more complex model. TMLE and IPTW also had low bias, but had higher ESD than the Bayesian approaches.
Table 1.
Results from simulation scenario 1: binary outcome, simple functional forms. The true values were ψrr = 1.5 and ψrd = 0.13. IPTW uses a correctly specified propensity score. TMLE uses a correctly specified propensity score and Super Learner with 6 prediction algorithms for the outcome model. The Bayesian parametric (Bayesian par.) approach uses a correctly specified logistic regression model and integrates over confounders using the empirical distribution. EDP is the proposed method. “EDP missing data” is the BNP approach, with data augmentation, applied to a data set where approximately 20% of the covariate values were set to missing. The first four methods were applied to the full data set with no missing covariate values. Bias is the absolute bias, ESD is the empirical standard deviation, and Coverage and CI width are the empirical coverage and width of 95% interval estimates. Results are from 1000 simulated datasets.
| Method | ψrr Bias | ψrr Coverage | ψrr ESD | ψrr CI width | ψrd Bias | ψrd Coverage | ψrd ESD | ψrd CI width |
|---|---|---|---|---|---|---|---|---|
| n = 250 | | | | | | | | |
| IPTW | 0.09 | 0.96 | 0.43 | 1.76 | 0.00 | 0.96 | 0.08 | 0.33 |
| TMLE | 0.06 | 0.91 | 0.37 | 1.29 | 0.00 | 0.90 | 0.07 | 0.25 |
| Bayesian par. | 0.05 | 0.93 | 0.33 | 1.29 | 0.00 | 0.94 | 0.06 | 0.24 |
| EDP | 0.03 | 0.93 | 0.32 | 1.20 | 0.00 | 0.93 | 0.06 | 0.23 |
| EDP missing data | 0.05 | 0.94 | 0.33 | 1.31 | 0.00 | 0.94 | 0.07 | 0.25 |
| n = 1000 | | | | | | | | |
| IPTW | 0.03 | 0.97 | 0.20 | 0.84 | 0.00 | 0.97 | 0.04 | 0.17 |
| TMLE | 0.01 | 0.93 | 0.18 | 0.65 | 0.00 | 0.93 | 0.04 | 0.13 |
| Bayesian par. | 0.01 | 0.95 | 0.15 | 0.59 | 0.00 | 0.94 | 0.03 | 0.12 |
| EDP | 0.01 | 0.94 | 0.15 | 0.58 | 0.00 | 0.94 | 0.03 | 0.12 |
| EDP missing data | 0.01 | 0.93 | 0.16 | 0.63 | 0.00 | 0.93 | 0.03 | 0.13 |
The above results are all for the case of complete data. When fitted to data with missing covariates, the ESD increased slightly relative to the BNP approach when all covariates were observed (for example, 0.15 versus 0.16). This particular simulation shows a potential benefit of the BNP approach in terms of minimal degradation of performance in the presence of MAR covariates.
5.2. Scenario 2: Binary Outcome, Mixture Distribution
For scenario 2, we generated a continuous confounder, binary treatment, and a binary outcome that depends on the confounder in a complex way as follows: L ~ N(4, 2²), A|L ~ Bern{logit−1(1.3 − 0.8L)},
where
The true causal parameters are ψrr = 1.4 and ψrd = 0.155.
Missing data:
We carried out an additional analysis for the BNP approach only, where we randomly deleted values of L with probability logit−1(−2 + A + Y). This resulted in about 20% missing values of L.
Results:
The results are given in Table 2. Here, the outcome was generated using a mixture distribution; however, the Bayesian parametric method modeled the outcome using a logistic model with all covariates included as additive, linear terms. The parametric approach, therefore, performed poorly (with large bias and coverage dropping below 20% in the n = 1000 case). In contrast, the BNP approach had absolute bias of 0.04 or below at both sample sizes, good coverage, and a lower ESD than IPTW and TMLE. Deleting about 20% of the covariate values, and then imputing within the BNP approach, resulted in credible intervals that were typically only about 2% wider than in the complete data scenario using the BNP approach.
Table 2.
Results from simulation scenario 2: binary outcome, mixture distribution. The true values were ψrr = 1.4 and ψrd = 0.155. IPTW uses a correctly specified propensity score. TMLE uses a correctly specified propensity score and Super Learner with 6 prediction algorithms for the outcome model. The Bayesian parametric (Bayesian par.) approach uses a misspecified logistic regression model. EDP is the proposed method. “EDP missing data” is the BNP approach, with data augmentation, applied to a data set where approximately 20% of the covariate values were set to missing. The first four methods were applied to the full data set with no missing covariate values. Bias is the absolute bias, ESD is the empirical standard deviation, and Coverage and CI width are the empirical coverage and width of 95% interval estimates. Results are from 1000 simulated datasets.
| Method | ψrr Bias | ψrr Coverage | ψrr ESD | ψrr CI width | ψrd Bias | ψrd Coverage | ψrd ESD | ψrd CI width |
|---|---|---|---|---|---|---|---|---|
| n = 250 | | | | | | | | |
| IPTW | 0.02 | 0.92 | 0.27 | 1.27 | 0.01 | 0.89 | 0.13 | 0.44 |
| TMLE | 0.00 | 0.87 | 0.32 | 1.04 | 0.01 | 0.85 | 0.12 | 0.36 |
| Bayesian par. | 0.36 | 0.65 | 0.25 | 0.94 | 0.12 | 0.62 | 0.08 | 0.29 |
| EDP | 0.04 | 0.93 | 0.26 | 0.97 | 0.01 | 0.93 | 0.09 | 0.34 |
| EDP missing data | 0.07 | 0.95 | 0.26 | 1.00 | 0.02 | 0.94 | 0.09 | 0.35 |
| n = 1000 | | | | | | | | |
| IPTW | 0.00 | 0.92 | 0.19 | 0.71 | 0.00 | 0.92 | 0.07 | 0.26 |
| TMLE | 0.01 | 0.92 | 0.16 | 0.56 | 0.00 | 0.91 | 0.06 | 0.20 |
| Bayesian par. | 0.33 | 0.19 | 0.13 | 0.46 | 0.12 | 0.17 | 0.04 | 0.14 |
| EDP | 0.04 | 0.95 | 0.13 | 0.54 | 0.02 | 0.94 | 0.05 | 0.20 |
| EDP missing data | 0.02 | 0.94 | 0.15 | 0.58 | 0.01 | 0.94 | 0.05 | 0.21 |
5.3. Scenario 3: Continuous Outcome, Complex Treatment, Many Covariates
For scenario 3, we simulated a continuous outcome. We also assume there are a large number of covariates (84), but only a few of them really matter. Specifically, we generated 20 binary covariates (L1, . . . , L20), independently distributed as Bernoulli(0.05), 20 binary covariates (L21, . . . , L40), independently distributed as Bernoulli(0.5), and 44 continuous covariates (L41, . . . , L84), distributed as multivariate normal with mean 0, variance 1, and correlation 0.3. We then generated a binary treatment, A|L41, . . . , L44, from the following distribution:
where
The outcome was generated as follows:
where
μ1 = −4 + 2A − 0.5L42 − L43 + 0.5L44, . The true average causal effect is ψ = 1.503.
Results:
The results from scenario 3 are given in Table 3. In each of the methods, all 84 covariates were treated as confounders and included in all models. BART performed slightly better than EDP for n = 1000, and the two methods performed about equivalently for n = 3000. For IPTW and TMLE, we estimated the propensity score using a logistic model assuming an additive, linear form of the 84 covariates. These methods both had some bias and undercoverage, especially IPTW.
Table 3.
Results from simulation scenario 3: continuous outcome, complex treatment, many covariates. The true average causal effect was ψ = 1.503. IPTW and TMLE use a misspecified propensity score. TMLE uses Super Learner for the outcome model. BART is the causal inference approach proposed by Hill (2011). EDP is the proposed approach. Bias is the absolute bias, ESD is the empirical standard deviation, and Coverage and CI width are the empirical coverage and width of 95% interval estimates. Results are from 1000 simulated datasets.
| Method | Bias | Coverage | ESD | CI width |
|---|---|---|---|---|
| n = 1000 | | | | |
| IPTW | 0.25 | 0.91 | 0.23 | 1.10 |
| TMLE | 0.10 | 0.90 | 0.21 | 0.76 |
| BART | 0.03 | 0.93 | 0.18 | 0.69 |
| EDP | 0.04 | 0.89 | 0.23 | 0.72 |
| n = 3000 | | | | |
| IPTW | 0.27 | 0.60 | 0.13 | 0.61 |
| TMLE | 0.10 | 0.84 | 0.12 | 0.43 |
| BART | 0.04 | 0.91 | 0.11 | 0.40 |
| EDP | 0.06 | 0.90 | 0.11 | 0.41 |
6. Application
Antiretroviral therapy (ART) is recommended for all human immunodeficiency virus (HIV)/chronic hepatitis C virus (HCV)-coinfected patients. ART regimens often include drugs from the nucleoside reverse transcriptase inhibitor (NRTI) class. There is concern that some drugs in the NRTI class (didanosine, stavudine, zidovudine, and zalcitabine) might cause depletion of mitochondrial DNA, leading to liver injury. We apply the proposed BNP approach to compare the outcome Y (death within 2 years) between patients prescribed a mitochondrial toxic NRTI (mtNRTI)-containing ART regimen and those prescribed an ART regimen containing other NRTIs.
To address this question, we used data from a study of HIV/HCV patients who newly initiated ART within the Veterans Aging Cohort Study (VACS) (Fultz et al., 2006). The study population included co-infected patients who newly initiated an ART regimen that included NRTIs (either mtNRTIs or other NRTIs) from 2002 to 2009, for a total of n = 1747 patients. As can be seen from Table S3 in Web Appendix C, use of mtNRTI-containing ART regimens as first-line therapy decreased over time, from a large majority of cases in 2002 to a small minority in 2009.
Our exposure A was set to 1 for patients initiating an ART regimen that included an mtNRTI, and to 0 for patients initiating an ART regimen that included some other NRTI. The outcome was all-cause mortality, and we focused on the event occurring within 2 years of ART initiation. We had follow-up data on all patients through 2011, so even patients who initiated ART in 2009 had 2 years of follow-up. There were 76 deaths out of 836 patients in the mtNRTI group and 89 deaths out of 911 patients in the other-NRTI group. Our causal parameter of interest is the relative risk ψrr = E(Y1)/E(Y0).
The confounders L included the following baseline demographic and clinical variables: age at baseline (years), race/ethnicity, body mass index, diabetes mellitus, alcohol dependence/abuse, injection/non-injection drug abuse, year of ART initiation, and exposure to other antiretrovirals associated with hepatotoxicity (i.e., abacavir, nevirapine, saquinavir, tipranavir). In addition, the following baseline laboratory variables were included in L: CD4 count, HIV RNA, alanine aminotransferase (ALT), aspartate aminotransferase (AST), and fibrosis-4 (FIB-4) score. The percentage of missing data for each variable was as follows: ALT 1.3%, AST 2.5%, CD4 1.8%, FIB-4 3.1%; 4.8% of patients had at least one missing variable.
For the observed data, we used the model specified in (2) with a logistic regression model for the outcome. The prior distributions were the same as those specified in the simulation studies.
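To make the estimand concrete: given any fitted outcome model, the causal relative risk ψrr = E(Y1)/E(Y0) is obtained by standardization (G-computation), averaging model-based risks over the empirical confounder distribution under each treatment level. The sketch below uses a hypothetical logistic outcome model with a single confounder as a stand-in for the BNP model; all coefficients are illustrative, not from the analysis. In the full Bayesian approach, this computation is repeated at each posterior draw to obtain the posterior distribution of ψrr.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# one hypothetical confounder; treatment assignment depends on L (confounding)
L = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-0.5 * L)))

def risk(a, L, beta=(-1.0, 0.4, 0.6)):
    # stand-in outcome model P(Y = 1 | A = a, L); coefficients are illustrative
    b0, bA, bL = beta
    return 1 / (1 + np.exp(-(b0 + bA * a + bL * L)))

# G-computation: average the model-based risk over the empirical confounder
# distribution under A = 1 and A = 0, then take the ratio
psi_rr = risk(1, L).mean() / risk(0, L).mean()
```

Note that psi_rr is a ratio of standardized (marginal) risks, not the conditional odds ratio exp(bA); the two generally differ because the logistic model is noncollapsible.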
Results:
We ran three chains of the Gibbs sampler, each with 20,500 iterations. The chains mixed well, with convergence appearing to be reached by iteration 500. The Gelman–Rubin convergence diagnostic was 1.04, providing further evidence of convergence. We calculated the average causal relative risk at every 100th draw of the sampler after a burn-in of 500, for a total of 200 draws per chain. The number of y-clusters k and x-subclusters kj varied from iteration to iteration and depended on the most recent posterior sample of αθ and αω, with larger values leading to more clusters. The posterior medians and 95% credible intervals (CIs) for αθ and αω were 0.63 (0.21, 1.43) and 0.74 (0.43, 1.16), respectively. The value of k tended to be about 4, while kj tended to range from 1 to 7. For example, at the last iteration of the first chain, there were k = 5 y-clusters, with the following x-subcluster sizes: cluster sy = 1, (36, 164, 134, 45, 32, 38, 76, 1); cluster sy = 2, (171, 211, 131, 68, 18, 1); cluster sy = 3, (171, 281, 172, 50, 28); cluster sy = 4, (137, 30, 2, 1); cluster sy = 5, (2).
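The Gelman–Rubin diagnostic used above compares within-chain and between-chain variability. A generic sketch of the standard potential scale reduction factor (this is the textbook formula, not the paper's code):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for a list of equal-length
    1-D sample arrays, one per chain, with burn-in already discarded."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)

# three well-mixed chains targeting the same distribution give R-hat near 1
rng = np.random.default_rng(2)
rhat = gelman_rubin([rng.normal(size=2000) for _ in range(3)])
```

Values near 1 (e.g., below about 1.1) are usually taken as consistent with convergence, which is how the reported value of 1.04 is interpreted.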
The posterior median and 95% CI of the average causal relative risk (RR), ψrr, were 1.16 (0.87, 1.54); thus, the point estimate corresponds to a 16% increased risk of death within 2 years for mtNRTI-containing ART regimens compared with other NRTI-containing ART regimens. However, the uncertainty about ψrr is substantial enough that we cannot rule out a small reduced risk of death (e.g., RR about 0.9) or a larger increased risk (e.g., RR about 1.5). The posterior distribution of ψrr is displayed in Figure S1 of Web Appendix C, along with the trace plot. As a sensitivity analysis, we fitted the model with a different gamma prior for the concentration parameters (Gam(0.1, 0.1)) and found similar results: 1.17 (0.89, 1.55).
As comparators, we implemented an IPTW approach using logistic regression (additive, linear) for the propensity score, and a TMLE approach in R with the following SuperLearner library: mean, glm, step, gam, randomForest, glmnet. For missing covariates, we imputed 10 data sets using predictive mean matching. The point estimate and 95% CI for ψrr were 1.02 (0.97, 1.08) for IPTW and 1.22 (1.06, 1.47) for TMLE.
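Predictive mean matching, used to impute the missing covariates for the comparator analyses, can be sketched as follows. This is a toy single-imputation version for one continuous variable; real multiple imputation would also redraw the regression parameters and repeat the procedure (here, 10 times). The function name and donor count are illustrative.

```python
import numpy as np

def pmm_impute(y, X, n_donors=5, rng=None):
    """Single predictive-mean-matching imputation for a vector y with
    missing entries (NaN), using linear predictions from covariates X.
    Each missing value is filled with an observed value drawn from the
    donors whose predicted means are closest to the target's."""
    if rng is None:
        rng = np.random.default_rng()
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])       # add intercept
    beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    pred = Xd @ beta                                  # predicted means for all cases
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        # donors: observed cases with the closest predicted means
        d = np.argsort(np.abs(pred[obs] - pred[i]))[:n_donors]
        y_imp[i] = rng.choice(y[obs][d])
    return y_imp
```

Because imputed values are drawn from the observed data, PMM preserves the support of the variable, but as noted later (Bartlett et al., 2015), such off-the-shelf imputation models need not be compatible with the analysis model.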
7. Discussion
In this article, we developed a fully Bayesian approach for causal inference that can handle discrete or continuous outcomes and categorical treatment. While the full distribution of outcome, treatment, and confounders is modeled, the proposed BNP approach allows for flexible modeling of these distributions, estimation of any functional of the potential outcome distribution, and high-dimensional confounding. In addition, because we have a model for the joint distribution, imputation of missing covariates under ignorable missingness is straightforward and does not require multiply imputed datasets. A drawback is the need to use MCMC for posterior computations and Monte Carlo integration for the G-computation step; however, the latter can be done in parallel to speed computations. Computational challenges are not unique to BNP; for example, TMLE combined with multiple imputation and the bootstrap also poses computational challenges.
Our simulations showed overall good performance of the BNP approach. It is worth noting that we found (in simulation 1) that if the base model of the BNP prior is the true model, no efficiency price was paid for fitting the more complex model. Compared to IPTW and TMLE, the BNP approach had the smallest ESD for all scenarios and sample sizes. Scenario 3 had a relatively large number of non-confounders, and each of the methods displayed some amount of undercoverage. Thus, future work on settings with many covariates that are not actually confounders is of interest. For example, zero-inflated or shrinkage priors for the coefficients in the BNP model could be explored; such priors should be relatively simple to implement and should mitigate the undercoverage. Simulation scenario 6 (in the supplementary materials) demonstrated that the BNP approach can be quite robust to residual confounding.
Methods such as IPTW and TMLE make use of propensity scores, whereas our approach involves modeling the outcome given confounders. Potential advantages of propensity score-based methods include that they facilitate discovery of lack of overlap between treatment groups, can control for high-dimensional confounding even if the outcome has few events, and do not require specification of all treatment-confounder interactions. However, direct standardization, as done here, allows for estimation of different types of causal effects simultaneously and does not require a different objective function or estimating equation for each. Standardization might also be preferred when treatment is rare but the outcome is common. When missing covariates are involved, off-the-shelf imputation methods that are commonly used in conjunction with propensity score-based methods might lead to incompatibility between the missing covariate models and the (mean) models for inference (Bartlett et al., 2015).
Our BNP approach directly models the full joint distribution p(A, L, Y); however, if the sample size in each treatment category is sufficient, one could alternatively condition on treatment and use a separate BNP model p(L, Y|A = a) for each a. This might reduce the risk of residual confounding and can improve computation time, as these models can be fitted in parallel. Also, the general EDP approach allows αω to be a function of θ (cf. equation (1)); in our analyses we included only a single αω parameter, so more complex models could be considered. An area for future research is the extension to the time-varying confounding setting. G-computation in that setting requires generating draws of the confounders, and the relatively simple form of the (local) joint distribution of covariates should make this approach feasible.
With BNP models such as those used here, there can be sensitivity to the prior on the precision parameter, though we observed little sensitivity in our setting, as described in Section 6 and in simulation scenario 6 in Web Appendix B. A discussion of this issue, as well as recommendations on priors, can be found in Murugiah and Sweeting (2012).
Acknowledgements
This research was supported by NIH grants R01GM112327 and R01CA183854. The VACS was funded by the National Institute on Alcohol Abuse and Alcoholism (U01 AA13566, U24 AA20794, and U01 AA20790). We thank Amy Justice, Janet Tate, and Michael Kallan for their work on the design and creation of VACS data set that was re-analyzed in this methodological research project. We also thank the reviewers for their helpful comments and suggestions.
Footnotes
Web Appendices A, B, and C, referenced in Sections 4, 5, and 6, respectively, are available with this article at the Biometrics website on Wiley Online Library. Code used to simulate and analyze the data is available at https://github.com/jasonroy0/EDP_causal.
References
- Bartlett JW, Seaman SR, White IR, and Carpenter JR (2015). Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Statistical Methods in Medical Research 24, 462–487.
- Daniels MJ, Roy JA, Kim C, Hogan JW, and Perri MG (2012). Bayesian inference for the causal effect of mediation. Biometrics 68, 1028–1036.
- Ferguson TS (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1, 209–230.
- Fultz SL, Skanderson M, Mole LA, Gandhi N, Bryant K, Crystal S, et al. (2006). Development and verification of a “virtual” cohort using the National VA Health Information System. Medical Care 44, 25–30.
- Gruber S and van der Laan MJ (2012). tmle: An R package for targeted maximum likelihood estimation. Journal of Statistical Software 51, 1–35.
- Hannah LA, Blei DM, and Powell WB (2011). Dirichlet process mixtures of generalized linear models. Journal of Machine Learning Research 12, 1923–1953.
- Hill JL (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20, 217–240.
- Kim C, Daniels MJ, Marcus BH, and Roy JA (2016). A framework for Bayesian nonparametric inference for causal effects of mediation. Biometrics 73, 401–409.
- MacEachern SN (1999). Dependent nonparametric processes. ASA Proceedings of the Section on Bayesian Statistical Science, 50–55.
- Murugiah S and Sweeting T (2012). Selecting the precision parameter prior in Dirichlet process mixture models. Journal of Statistical Planning and Inference 142, 1947–1959.
- Müller P, Erkanli A, and West M (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika 83, 67–79.
- Neal RM (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 249–265.
- Neugebauer R, Fireman B, Roy JA, Raebel MA, Nichols GA, and O’Connor PJ (2013). Super learning to hedge against incorrect inference from arbitrary parametric assumptions in marginal structural modeling. Journal of Clinical Epidemiology 66, 99–109.
- Robins JM (2000). Marginal structural models versus structural nested models as tools for causal inference. In Statistical Models in Epidemiology, the Environment, and Clinical Trials, Halloran ME and Berry D (eds), 95–133. The IMA Volumes in Mathematics and its Applications, vol. 116. New York: Springer.
- Robins JM, Hernán MA, and Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11, 550–560.
- Roy J, Lum KJ, and Daniels MJ (2017). A Bayesian nonparametric approach to marginal structural models for point treatments and a continuous or survival outcome. Biostatistics 18, 32–47.
- Sethuraman J (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650.
- Shahbaba B and Neal R (2009). Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research 10, 1829–1850.
- Taddy MA (2008). Bayesian nonparametric analysis of conditional distributions and inference for Poisson point processes. PhD thesis, University of California, Santa Cruz.
- van der Laan MJ (2010a). Targeted maximum likelihood based causal inference: Part I. The International Journal of Biostatistics 6, Article 2.
- van der Laan MJ (2010b). Targeted maximum likelihood based causal inference: Part II. The International Journal of Biostatistics 6, Article 3.
- van der Laan MJ, Polley EC, and Hubbard AE (2007). Super Learner. Statistical Applications in Genetics and Molecular Biology 6, 1–21.
- Wade S, Mongelluzzo S, and Petrone S (2011). An enriched conjugate prior for Bayesian nonparametric inference. Bayesian Analysis 6, 359–385.
- Wade S, Dunson DB, Petrone S, and Trippa L (2014). Improving prediction from Dirichlet process mixtures via enrichment. Journal of Machine Learning Research 15, 1041–1071.
- Wang C, Dominici F, Parmigiani G, and Zigler CM (2015). Accounting for uncertainty in confounder and effect modifier selection when estimating average causal effects in generalized linear models. Biometrics 71, 654–665.
- Xu D, Daniels MJ, and Winterstein AG (2018). A Bayesian nonparametrics approach to causal inference on quantiles. Biometrics. doi:10.1111/biom.12863.