Abstract
Causal inference is formulated using the counterfactual framework, enabling direct investigation of causal questions. Causal inference methods can incorporate machine learning techniques into the estimation process, allowing for more flexible models. However, the integration of machine learning methods adds complexity to statistical inference. In this paper, we systematically assess several methods for making causal inference with multiple treatment groups, including the outcome regression, inverse propensity score weighting, double-robust estimators, and their counterparts when employing a super learner in the estimation process, as well as the targeted maximum likelihood estimator (TMLE). We conduct numerical studies with complex data-generating models to evaluate these different estimators. Our results suggest that the double-robust estimator, when combined with machine learning, is the most favourable approach, demonstrating lower biases, a valid variance estimator, and improved coverage probabilities for the 95% confidence interval.
Keywords: Doubly robust, influence curve, observational study, propensity score, targeted maximum likelihood estimator
AMS SUBJECT CLASSIFICATION: 62D20
1. Introduction
Causal inference has received increasing attention in recent years because it empowers us to answer fundamental questions about cause and effect, enabling more rational decisions and effect comparisons across various fields, from healthcare, medical science and economics, to social sciences and policy-making. The origins of causal inference can be traced back to ancient philosophy, where thinkers like Aristotle and Hume contemplated the concept of causality. In the 20th century, Sir Ronald A. Fisher’s work on experimental design and randomised controlled trials (RCTs) became a cornerstone for establishing causation in scientific research. Contemporary causal inference was developed by Rubin (2005) using the formalisation of counterfactuals and potential outcomes as key concepts to directly address causal effect questions, and was subsequently extended in works such as Pearl et al. (2016), Imbens and Rubin (2015), Hernán and Robins (2020), and Jewell (2003).
Causal inference methods have been used for analysing observational data to make inferences on the effect of treatment on a response. A vast literature has emerged employing propensity scores, building on the balancing property of the propensity score established by Rosenbaum and Rubin (1983). These methods include propensity score stratification, propensity score regression, propensity score matching, and inverse probability weighting (D’Agostino 1998). In addition to these propensity score-based methods, one can apply the potential outcome concepts directly to model the outcomes; such a method is called the outcome regression method, or parametric g-formula (Hernán and Robins 2020). In addition, Robins et al. (1994) developed the semi-parametric theory and proposed a semi-parametric efficient ‘augmented’ estimator that relies on both the propensity score and the outcome regression model. Scharfstein et al. (1999) and Bang and Robins (2005) called it the ‘double-robust’ estimator since this estimator is consistent if either the propensity score model or the outcome regression model is correctly specified.
Recent advances in data analysis have been driven by machine learning methods, which encompass a wide range of techniques and algorithms that enable computers to learn from data and make predictions or decisions without specifying a parametric model (Hastie et al. 2009). Since these methods are generally intended for making predictions, they are not directly applicable to medical and health research when the objective is to evaluate a treatment effect on the response variable. However, in the causal inference framework using potential outcomes, all that is needed for causal treatment estimation involves predictions from the outcome model and/or the propensity score model, thereby enabling the use of machine learning methods in causal inference. With numerous machine learning methods available, one often faces the dilemma of choosing a suitable method among many possible algorithms for a specific application. Therefore, an ensemble method that can utilise multiple machine learning methods becomes desirable. One such ensemble method is the super learner introduced by van der Laan et al. (2007), which is guaranteed to perform asymptotically as well as the best choice among the family of weighted combinations of the candidate algorithms.
Statistical inference for approaches using traditional parametric models is relatively straightforward. However, when building a model using data-driven machine learning methods, inference for such ‘data dependent parameters’ is no longer straightforward, as pointed out by Hines et al. (2022). If we ignore the data-dependent nature of finding a statistical model, bias and excess variability will be introduced that are not accounted for in the parametric inference procedure or standard statistical software. Recent developments made by van der Laan and Rubin (2006), van der Laan and Rose (2011), Chernozhukov et al. (2018), and Hines et al. (2022) have shown how to identify unbiased estimators and obtain correct variance estimation using the efficient influence function approach. These resulting strategies enable the use of data-driven estimation strategies to build the model while permitting valid inference for the estimand of interest. Two ‘debiased’ approaches are the double-robust estimator (Scharfstein et al. 1999; Bang and Robins 2005) and the targeted maximum likelihood estimator (van der Laan and Rubin 2006; van der Laan and Rose 2011).
Multi-valued treatments are very common in observational studies in medical and public health research. Multiple treatments may include more than two different interventions or multiple dose levels of one drug. Imbens (2000) proposed the generalised propensity score and the weak unconfoundedness assumption for making inferences in the multiple treatment situation, using a potential outcome framework for each treatment group. Under the ignorability assumption, the purpose of causal inference with multiple treatments is to compare the average treatment effects between any two treatment groups. Although some newer software can accommodate multiple treatment groups, it is easy to encounter non-convergence problems, or the software may lack flexibility in modelling choices. While the existing literature has extensively explored the methods and their performance, the targeted maximum likelihood estimator (TMLE) and super learner employed in these methods have not been thoroughly investigated in the context of multiple treatments with a complex data-generating process. In addition, researchers often overlook the underlying theory when applying machine learning techniques in causal inference. This paper makes three main contributions: First, we outline the theory of causal inference in the context of multiple treatment group comparisons, incorporating both traditional regression models and machine learning methods within causal inference methods. Second, we evaluate the performance of these approaches through simulation studies, including super learners for machine learning methods and TMLE, focussing on complex data generation mechanisms that are likely to suffer from model mis-specification and relatively small propensity scores in multiple-treatment settings.
Finally, we provide practical recommendations for analysts regarding these methods and review the latest software for their implementation – a gap in the existing literature – offering useful guidance for practitioners.
In this paper, we will consider various types of estimators for conducting causal inference in the context of multiple treatment group comparisons. These estimators include: (1) the outcome regression (regression adjustment) model (RA); (2) the propensity score-based inverse probability weighting estimator (IPW); and (3) the double-robust estimator (DR). We will assess their performance when the parametric regression models are specified correctly and incorrectly for both the outcome model and the propensity score model. Furthermore, we will investigate how these approaches (RA, IPW, and DR) perform when machine learning methods, such as a super learner, are used instead of the parametric models. Finally, we will examine the TMLE, which also utilises the super learner method for both the outcome and propensity score models. This paper is organised as follows: Section 2 outlines the generalised propensity score, the weak unconfoundedness assumption, and the potential outcome framework for estimating multiple treatment effects. Section 3 introduces various approaches, including RA, IPW, DR, super learner-based methods (RA_SL, IPW_SL, and DR_SL), and TMLE, along with their inference methods. Section 4 presents the design and outcomes of the Monte Carlo simulation. A real data analysis is shown in Section 5, and Section 6 offers concluding remarks.
2. Assumptions for making causal inference for multiple treatments
We adopt the notation of the potential outcome framework in causal inference (Rubin 2005). We denote the observed samples as $\{(Y_i, \mathbf{X}_i, T_i),\ i = 1, \dots, n\}$, where $Y_i$, $\mathbf{X}_i$, and $T_i$ represent the outcome, covariates, and treatment group, respectively, for the $i$th individual. In the binary treatment case, the treatment variable $T_i$ can take values of 0 or 1; in the multi-valued treatment case, $T_i$ can take multiple values $1, 2, \dots, L$.
We denote $\mathbf{D}_i = (D_{i1}, \dots, D_{iL})^{\top}$ as an $L$-dimensional vector for the treatment indicator, with $D_{il} = I(T_i = l)$ for $l = 1, \dots, L$. For example, for an individual $i$ with treatment $T_i = l$, the treatment indicator vector $\mathbf{D}_i$ has a 1 in the $l$th position and 0 elsewhere.
2.1. Definition of causal treatment effect
The average causal treatment effect (ATE) comparing treatment $j$ with treatment $l$ is
$$\tau_{jl} = E[Y(j)] - E[Y(l)],$$
where $Y(j)$ and $Y(l)$ are the potential outcomes of $Y$ when the treatment is $j$ and $l$, respectively. They are counterfactuals or potential outcomes because an individual can only receive one treatment, so $T$ can only take one value for each person. Therefore, only one of the potential outcomes can be observed for each patient, and the observed outcome $Y$ can be represented as
$$Y = \sum_{l=1}^{L} I(T = l)\, Y(l),$$
where $I(\cdot)$ is the indicator function. Notice that
$$\tau_{jl} = E[Y(j)] - E[Y(l)] = \big\{E[Y(j)] - E[Y(L)]\big\} - \big\{E[Y(l)] - E[Y(L)]\big\} = \tau_{jL} - \tau_{lL}.$$
Thus, if we treat the $L$th group as the reference group, we only need to estimate $\tau_{lL}$ for $l = 1, \dots, L-1$, and all other comparisons are simply linear functions of these contrasts.
2.2. Assumptions and potential outcomes
To make causal inference with binary or multiple treatment groups, a basic condition we need is the Stable Unit Treatment Value Assumption (SUTVA) (Imbens and Rubin 2015), which states that there is no interference between treatments (i.e. the potential outcomes for any unit do not vary with the treatments assigned to other units), and the treatment is consistent (i.e. for each unit, there are no different versions of each treatment level that lead to different potential outcomes).
Following Imbens (2000), the generalised propensity score for multiple treatment groups is
$$r(l, \mathbf{X}) = \Pr(T = l \mid \mathbf{X}), \quad l = 1, \dots, L.$$
In order to make causal inference for comparisons among multiple treatment groups, two additional assumptions are necessary:
- Weak unconfoundedness assumption (Imbens 2000): The treatment assignment to group $l$ is independent of the potential outcome $Y(l)$ conditioned on covariates $\mathbf{X}$, i.e.
$$I(T = l) \perp Y(l) \mid \mathbf{X}, \quad l = 1, \dots, L.$$
- Positivity assumption: the generalised propensity score is bounded away from 0, i.e.
$$r(l, \mathbf{X}) \geq \delta > 0 \quad \text{for some constant } \delta \text{ and all } l, \mathbf{X}.$$
The weak unconfoundedness assumption and the positivity condition are together referred to as the ignorability condition (Rosenbaum and Rubin 1983).
3. Estimation of average treatment effects with multiple treatment groups
In this section, we introduce several methods of causal inference with multiple treatments, including regression adjustment, inverse probability weighting, double-robust estimator, and the targeted maximum likelihood estimator under the SUTVA and ignorability assumptions. We also outline the methods for obtaining variability estimates for these different estimators.
Although propensity score matching is a popular tool in causal inference, this study did not consider this method for the following reasons: (1) The matched sample may not be representative of the original population due to the exclusion of some participants during matching, which could result in estimates that do not reflect the ATE; (2) Estimating the standard error of the estimates is still an open research question (Lopez and Gutman 2017). Although Abadie and Imbens (2016), Yang et al. (2016), and Scotina et al. (2020) derived asymptotic variances that account for the variability induced by estimating the propensity scores and by the matching procedure with parametric propensity score models, there is no appropriate analytic variance estimator when the propensity scores are estimated using machine learning methods; (3) The results can vary depending on the choice of matching algorithm and calliper width, introducing some subjectivity into the analysis.
3.1. Regression adjustment (RA)
Under the ignorability condition, for any treatment group $l$, we have
$$E[Y(l) \mid \mathbf{X}] = E[Y \mid T = l, \mathbf{X}],$$
which means that the conditional mean of each potential outcome can be estimated by the conditional mean of the observed outcome using the units in treatment group $l$.
Specifically, to obtain the predicted potential outcome for each individual $i$, denoted as $\hat{Y}_i(l)$, we first regress the observed outcome $Y$ on the covariates $\mathbf{X}$ in treatment group $l$ and obtain a prediction function $\hat{m}_l(\mathbf{X})$. Note that this prediction function can also be obtained using machine learning methods such as a super learner (van der Laan et al. 2007). Next, we predict the potential outcomes for every subject from all treatment groups by plugging the covariates for individual $i$ into the above prediction functions, i.e.
$$\hat{Y}_i(l) = \hat{m}_l(\mathbf{X}_i).$$
We repeat this process for all the treatment groups and thus obtain the predicted potential outcomes $\hat{Y}_i(l)$, $l = 1, \dots, L$, for all individuals $i = 1, \dots, n$.
Therefore, the ATE between any treatment group $j$ and treatment group $l$ can be estimated by contrasting the average of the predicted potential outcomes between the two treatment groups,
$$\hat{\tau}^{RA}_{jl} = \frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i(j) - \frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i(l).$$
This method is called the regression adjustment method (RA).
The variance formula for the RA estimator based on a parametric outcome model can be derived; however, it cannot be applied when the outcome is predicted by machine learning methods. A naive way of estimating the standard error for the regression adjustment estimator is based on the influence curve of the RA estimator (Ichimura and Newey 2022):
$$\widehat{IC}_i^{RA} = \hat{Y}_i(j) - \hat{Y}_i(l) - \hat{\tau}^{RA}_{jl},$$
and the variance of the RA estimator can be estimated as
$$\widehat{\mathrm{Var}}\big(\hat{\tau}^{RA}_{jl}\big) = \frac{1}{n^2}\sum_{i=1}^{n}\big(\widehat{IC}_i^{RA}\big)^2. \quad (1)$$
However, Hines et al. (2022) pointed out that such a ‘plug-in’ estimator would be biased and the naive confidence intervals and tests would be invalid (see more discussion in Section 3.4). As demonstrated in our simulations in Section 4, this variance estimator performs poorly.
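To make the RA estimator and the naive variance in Equation (1) concrete, here is a minimal numerical sketch in Python (not the paper's code); the fitted prediction functions are hypothetical linear models standing in for any regression or super learner fit:

```python
import numpy as np

def ra_ate(pred_j, pred_l):
    """RA estimate of tau_{jl} and its naive IC-based standard error (Equation (1))."""
    tau = pred_j.mean() - pred_l.mean()
    ic = pred_j - pred_l - tau                    # estimated influence curve per subject
    se = np.sqrt(np.sum(ic ** 2)) / len(pred_j)   # sqrt of (1/n^2) * sum(IC^2)
    return tau, se

# Toy example: hypothetical fitted outcome models for groups j and l
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
pred_j = 1 + X @ np.array([1.0, 0.5])   # hypothetical \hat{m}_j(X_i)
pred_l = 3 + X @ np.array([0.5, 0.2])   # hypothetical \hat{m}_l(X_i)
tau, se = ra_ate(pred_j, pred_l)        # estimate should be near 1 - 3 = -2
```

Because the covariates are centred, the estimate is driven by the difference in intercepts, and the influence-curve variance reflects only the spread of the predicted contrasts.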
3.2. Inverse probability weighting (IPW)
The IPW estimator, also known as the Horvitz–Thompson estimator, was first proposed by Horvitz and Thompson (1952) for survey research. Later, researchers applied IPW to causal inference with binary treatment groups; for instance, Rosenbaum (1987), Robins et al. (1995), Hahn (1998), Hirano et al. (2003), and Lunceford and Davidian (2004). Recently, this weighting method has also been extended to multiple treatment situations, see Cattaneo (2010) and Uysal (2013, 2015).
According to the notation described in Section 2, the IPW estimator of $\mu_l = E[Y(l)]$ is
$$\hat{\mu}_l^{IPW} = \frac{1}{n}\sum_{i=1}^{n}\frac{D_{il} Y_i}{\hat{r}(l, \mathbf{X}_i)},$$
where $\hat{r}(l, \mathbf{X}_i)$ is the estimated propensity score for individual $i$ being assigned to treatment group $l$. This propensity score can be obtained by using a multinomial logistic regression, machine learning methods, or a super learner that incorporates a variety of parametric and machine learning methods.
An extremely small propensity score makes the IPW estimator unstable because of the huge weight it induces. Weight trimming is often applied to mitigate this issue (Sturmer et al. 2021). In addition, a normalised version of the IPW estimator is
$$\hat{\mu}_l^{IPW} = \left(\sum_{i=1}^{n}\frac{D_{il}}{\hat{r}(l, \mathbf{X}_i)}\right)^{-1}\sum_{i=1}^{n}\frac{D_{il} Y_i}{\hat{r}(l, \mathbf{X}_i)},$$
which often enhances the precision and stability of the estimator (Kang and Schafer 2007). We will use this normalised version of the IPW estimator in the following sections.
The IPW estimator of $\tau_{jl}$ is
$$\hat{\tau}^{IPW}_{jl} = \hat{\mu}_j^{IPW} - \hat{\mu}_l^{IPW}.$$
Lunceford and Davidian (2004) derived the variance formula for the IPW estimator based on a logistic regression model for the propensity score (their equation (15)). They also provided a more stable variance estimator using the empirical sandwich method (Stefanski and Boos 2002) in their equation (19). However, the formula requires the specification of a parametric form for the propensity score model, so it cannot be applied when the propensity score is obtained by machine learning methods. Yan et al. (2019) proposed a naive way of obtaining the variance as
$$\widehat{\mathrm{Var}}\big(\hat{\tau}^{IPW}_{jl}\big) = \frac{1}{n^2}\sum_{i=1}^{n}\big(\widehat{IC}_i^{IPW}\big)^2, \quad (2)$$
where
$$\widehat{IC}_i^{IPW} = \frac{D_{ij}\big(Y_i - \hat{\mu}_j^{IPW}\big)}{\hat{r}(j, \mathbf{X}_i)} - \frac{D_{il}\big(Y_i - \hat{\mu}_l^{IPW}\big)}{\hat{r}(l, \mathbf{X}_i)}.$$
We will evaluate the performance of this variance estimator when using machine learning methods (including super learners) for estimating the propensity score model.
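As an illustration of the normalised IPW estimator and the naive variance in Equation (2), here is a short Python sketch on a randomised toy design where the true propensity scores are known; all names are hypothetical:

```python
import numpy as np

def ipw_mean(Y, T, ps, l):
    """Normalised (Hajek-type) IPW estimate of E[Y(l)]; ps has one column per group."""
    w = (T == l) / ps[:, l]              # inverse-probability weights
    return np.sum(w * Y) / np.sum(w)

def ipw_ate(Y, T, ps, j, l):
    """IPW contrast between groups j and l with the naive variance of Equation (2)."""
    mu_j, mu_l = ipw_mean(Y, T, ps, j), ipw_mean(Y, T, ps, l)
    ic = ((T == j) * (Y - mu_j) / ps[:, j]
          - (T == l) * (Y - mu_l) / ps[:, l])   # estimated influence curve
    se = np.sqrt(np.sum(ic ** 2)) / len(Y)
    return mu_j - mu_l, se

# Toy data: 3 groups assigned with known probabilities; group means 0, 1, 2
rng = np.random.default_rng(1)
n = 5000
T = rng.choice(3, size=n, p=[0.2, 0.3, 0.5])
Y = T + rng.normal(scale=0.5, size=n)
ps = np.tile([0.2, 0.3, 0.5], (n, 1))    # true propensity scores
tau, se = ipw_ate(Y, T, ps, 0, 2)        # true ATE = 0 - 2 = -2
```

With the true propensity scores supplied, the estimate should land close to the true contrast of −2.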
3.3. Double-robust estimator (DR)
The consistency of the IPW estimator depends on the correct specification of the propensity score model, while the RA estimator’s consistency relies on the correct specification of the outcome model. In real-world applications, it is often challenging to determine which model is correct. In such cases, a double-robust estimator becomes valuable as it remains consistent if either of the two models is correctly specified (Scharfstein et al. 1999; Bang and Robins 2005). This estimator, also known as the augmented inverse probability weighting (AIPW) estimator, incorporates an augmented term into the IPW estimator. For more details about the AIPW estimator with binary treatment, refer to Robins et al. (1995), Lunceford and Davidian (2004), and Kang and Schafer (2007).
In recent years, researchers have extended AIPW to the case of multiple treatments; see McCaffrey et al. (2004), Feng et al. (2012), Linden et al. (2016), and Rose and Normand (2019). As mentioned in Section 3.1, under the ignorability assumption, the potential outcomes in treatment group $l$ for individual $i$ can be estimated based on the outcome model, $\hat{Y}_i(l) = \hat{m}_l(\mathbf{X}_i)$.
Using the double-robust estimator, the population mean of potential outcome $\mu_l = E[Y(l)]$ can be estimated by
$$\hat{\mu}_l^{DR} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{m}_l(\mathbf{X}_i) + \frac{D_{il}\big(Y_i - \hat{m}_l(\mathbf{X}_i)\big)}{\hat{r}(l, \mathbf{X}_i)}\right].$$
Similar to the normalised IPW estimator, a normalised version of the DR estimator can be used to enhance stability (Kang and Schafer 2007):
$$\hat{\mu}_l^{DR} = \frac{1}{n}\sum_{i=1}^{n}\hat{m}_l(\mathbf{X}_i) + \left(\sum_{i=1}^{n}\frac{D_{il}}{\hat{r}(l, \mathbf{X}_i)}\right)^{-1}\sum_{i=1}^{n}\frac{D_{il}\big(Y_i - \hat{m}_l(\mathbf{X}_i)\big)}{\hat{r}(l, \mathbf{X}_i)},$$
which is obtained by normalising the weights for the residual part in the DR estimator.
The estimate of the ATE comparing treatment $j$ vs. treatment $l$ is
$$\hat{\tau}^{DR}_{jl} = \hat{\mu}_j^{DR} - \hat{\mu}_l^{DR}.$$
Extending the results from Lunceford and Davidian (2004) to multiple treatments, the variance of the DR estimator can be obtained by
$$\widehat{\mathrm{Var}}\big(\hat{\tau}^{DR}_{jl}\big) = \frac{1}{n^2}\sum_{i=1}^{n}\big(\widehat{IC}_i^{DR}\big)^2, \quad (3)$$
where
$$\widehat{IC}_i^{DR} = \hat{m}_j(\mathbf{X}_i) + \frac{D_{ij}\big(Y_i - \hat{m}_j(\mathbf{X}_i)\big)}{\hat{r}(j, \mathbf{X}_i)} - \hat{m}_l(\mathbf{X}_i) - \frac{D_{il}\big(Y_i - \hat{m}_l(\mathbf{X}_i)\big)}{\hat{r}(l, \mathbf{X}_i)} - \hat{\tau}^{DR}_{jl}.$$
Another advantage of the DR estimator is that it is one of the ‘debiased’ estimators, especially when a data-driven approach is employed to identify the optimal outcome model and propensity score model (Hines et al. 2022). This estimator not only improves upon the RA estimator by reducing bias but also offers a dependable variance estimator, as shown above, even when incorporating machine learning methods, including super learners.
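The double robustness property can be checked numerically. Below is an illustrative Python sketch (not the paper's implementation) of the normalised DR estimator and the variance of Equation (3), in which the outcome model is deliberately mis-specified (constant zero) while the propensity score model is correct:

```python
import numpy as np

def dr_mean(Y, T, ps, pred, l):
    """Normalised DR (AIPW) estimate of E[Y(l)]; `pred` holds outcome-model predictions."""
    w = (T == l) / ps[:, l]
    # outcome-model mean plus normalised inverse-weighted residual correction
    return pred.mean() + np.sum(w * (Y - pred)) / np.sum(w)

def dr_ate(Y, T, ps, pred_j, pred_l, j, l):
    """DR contrast between groups j and l with the variance of Equation (3)."""
    tau = dr_mean(Y, T, ps, pred_j, j) - dr_mean(Y, T, ps, pred_l, l)
    ic = (pred_j + (T == j) * (Y - pred_j) / ps[:, j]
          - pred_l - (T == l) * (Y - pred_l) / ps[:, l] - tau)
    se = np.sqrt(np.sum(ic ** 2)) / len(Y)
    return tau, se

# Toy check: wrong outcome model (all zeros), correct propensity scores
rng = np.random.default_rng(2)
n = 5000
T = rng.choice(3, size=n, p=[0.3, 0.3, 0.4])
Y = T + rng.normal(scale=0.5, size=n)
ps = np.tile([0.3, 0.3, 0.4], (n, 1))
zero = np.zeros(n)                       # deliberately mis-specified predictions
tau, se = dr_ate(Y, T, ps, zero, zero, 0, 2)   # true ATE = 0 - 2 = -2
```

Despite the useless outcome model, the estimate stays near the true contrast of −2 because the propensity score model is correct, which is exactly the double robustness property.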
3.4. Super learner-based methods (SL)
As previously discussed, whether it is for the outcome model or the propensity score model, the essential requirement is to obtain the predicted outcome or predicted propensity score for each individual. Consequently, machine learning techniques can be employed to estimate these functions, offering the advantage of not being constrained by a specific parametric form. This flexibility is particularly valuable because we frequently lack a clear understanding of the true data-generating process for our data. While machine learning techniques can suffer from interpretability issues when used directly for making predictions of an outcome in clinical and policy applications, this does not pose the same challenge for estimation of the ATE in the causal inference framework. The causal conclusions are drawn from well-established identification assumptions, statistical estimation techniques, and the accurate prediction capabilities of the machine learning models.
Given the multitude of available machine learning methods, selecting a single one can be challenging. An alternative approach is to opt for an ensemble method, which combines several machine learning techniques. Yan et al. (2019) introduced an ensemble method for estimating the ATE in scenarios involving multiple treatments. This method determines optimal weights by minimising a weighted rank aggregation quantity. Another ensemble method, known as the super learner, is designed to generate robust predictions for both continuous and categorical outcomes (van der Laan et al. 2007).
Among the various candidate estimation methods, the optimal combination of weights for the super learner is determined by cross-validation with a given loss function. When dealing with continuous outcomes, the super learner defaults to using the least squares loss, while for categorical outcomes, the default choice is the negative log loss. The super learner is typically robust to model mis-specification because it integrates the outputs of different machine learning algorithms into a single result through minimisation of the risk function. Several R packages, such as SuperLearner and sl3, are readily available for implementing this approach (Coyle et al. 2021; Polley et al. 2021).
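To fix ideas, here is a toy Python sketch of the cross-validated weight selection principle with two hypothetical candidate learners and squared-error loss; this illustrates the idea only and is not the SuperLearner or sl3 API:

```python
import numpy as np

# Toy super learner principle: cross-validated predictions from each candidate,
# then the convex combination weight minimising cross-validated squared-error risk.
rng = np.random.default_rng(4)
n = 400
x = rng.uniform(-2, 2, size=n)
y = x ** 2 + rng.normal(scale=0.3, size=n)

def cv_preds(fit, k=5):
    """K-fold cross-validated predictions; fit(x_tr, y_tr) returns a predict function."""
    preds = np.empty(n)
    folds = np.arange(n) % k
    for f in range(k):
        tr, te = folds != f, folds == f
        preds[te] = fit(x[tr], y[tr])(x[te])
    return preds

# candidate 1: overall-mean learner; candidate 2: quadratic least-squares learner
p1 = cv_preds(lambda xt, yt: (lambda xv: np.full(len(xv), yt.mean())))
p2 = cv_preds(lambda xt, yt: (lambda xv, c=np.polyfit(xt, yt, 2): np.polyval(c, xv)))

alphas = np.linspace(0, 1, 101)
risks = [np.mean((y - (a * p2 + (1 - a) * p1)) ** 2) for a in alphas]
best = alphas[int(np.argmin(risks))]
# the correctly specified quadratic learner should receive nearly all the weight
```

Here the quadratic learner matches the data-generating process, so the cross-validated risk is minimised by putting essentially all the weight on it; with noisier candidates, the chosen weights spread across learners.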
We can utilise the super learner to initially predict the propensity score and the outcome. Subsequently, we can estimate the ATE using the various methods mentioned earlier by incorporating the estimated propensity score and outcome into the formulas for the RA, IPW, and DR methods. We refer to these methods as RA_SL, IPW_SL, and DR_SL. It is important to note that, in the case of a categorical response, the R package SuperLearner can handle this scenario by estimating the propensity score through a series of binomial models. This involves treating one specific group as the treatment group and all other groups as the control group. Therefore, we suggest normalising the individual treatment propensity scores to obtain a multinomial propensity score. Specifically, if $\tilde{r}(l, \mathbf{X}_i)$ represents the initial estimated propensity score for receiving treatment $l$, which is obtained from the R package SuperLearner by comparing treatment group $l$ to all other groups, the normalised propensity score is calculated as follows:
$$\hat{r}(l, \mathbf{X}_i) = \frac{\tilde{r}(l, \mathbf{X}_i)}{\sum_{k=1}^{L}\tilde{r}(k, \mathbf{X}_i)}.$$
This normalisation ensures that the estimated propensity scores sum to 1 across the treatment groups. On the other hand, the generalised propensity scores estimated by the R package sl3 are already normalised by that package, and hence do not need manual normalisation after being obtained from sl3.
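A quick numerical check of this normalisation, using made-up one-vs-rest scores rather than actual SuperLearner output:

```python
import numpy as np

# Hypothetical one-vs-rest binomial scores for two subjects across three groups;
# the rows need not sum to one, so divide each row by its row sum.
raw = np.array([[0.30, 0.50, 0.40],
                [0.10, 0.20, 0.10]])
ps = raw / raw.sum(axis=1, keepdims=True)
# each row of `ps` is now a proper multinomial propensity score (sums to 1)
```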
Based on the nonparametric estimation theory of statistical functions (Bickel et al. 1993; Chernozhukov et al. 2018; Hines et al. 2022), the RA_SL estimator exhibits bias when employing a data-driven process for identifying an outcome model. Additionally, the variance formula for the RA_SL estimator loses its validity in this context. On the other hand, the DR_SL estimator effectively mitigates bias and maintains the validity of the variance formula.
3.5. Targeted maximum likelihood estimation (TMLE)
Introduced by van der Laan and Rubin (2006) and van der Laan and Rose (2011), TMLE is a doubly robust, maximum-likelihood-based estimation method. TMLE features a ‘targeting step’ that optimises the bias-variance tradeoff for the average treatment effect. As mentioned earlier, extreme propensity scores can make the classic methods questionable. However, the targeting step in TMLE is an iterative process that drives the residual term involving the inverse propensity score towards zero, making TMLE more robust to near-violations of the positivity assumption. Next, we introduce the implementation of TMLE in the multiple treatment case.
As described earlier, the (non-normalised) double-robust estimator of the population mean of potential outcome $\mu_l$ is
$$\hat{\mu}_l^{DR} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{m}_l(\mathbf{X}_i) + \frac{D_{il}\big(Y_i - \hat{m}_l(\mathbf{X}_i)\big)}{\hat{r}(l, \mathbf{X}_i)}\right].$$
It can be viewed as obtaining the initial estimate $\hat{\mu}_l^{RA} = \frac{1}{n}\sum_{i=1}^{n}\hat{m}_l(\mathbf{X}_i)$ and then updating this initial estimate using
$$\hat{\mu}_l^{DR} = \hat{\mu}_l^{RA} + \frac{1}{n}\sum_{i=1}^{n}\frac{D_{il}\big(Y_i - \hat{m}_l(\mathbf{X}_i)\big)}{\hat{r}(l, \mathbf{X}_i)}.$$
The TMLE estimator also performs an update, and it does so by substituting the residual term with a tuning parameter $\hat{\epsilon}$ in the above equation, i.e.
$$\hat{\mu}_l^{TMLE} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{m}_l(\mathbf{X}_i) + \frac{\hat{\epsilon}}{\hat{r}(l, \mathbf{X}_i)}\right],$$
where $\hat{\epsilon}$ is obtained iteratively by setting the plug-in bias to zero (Hines et al. 2022).
Consequently, the TMLE estimator of the average treatment effect is
$$\hat{\tau}^{TMLE}_{jl} = \hat{\mu}_j^{TMLE} - \hat{\mu}_l^{TMLE}.$$
As described in van der Laan and Rose (2011), the super learner method can be employed to estimate both the outcome and the propensity score. To compute the variance of the TMLE estimator, one can calculate the sample mean of the squared efficient influence function, together with delta-method transformations, provided that sufficient conditions are met to guarantee the convergence of the relevant empirical process terms to zero. The method is implemented in several R packages such as tmle (Gruber and van der Laan 2012) and tmle3 (Coyle 2021). The R package tmle was mainly developed to implement the TMLE method for data with binary treatments, but the recently developed R package tmle3 can easily handle the multiple treatments scenario.
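The targeting step can be illustrated with a stylised Python sketch for a continuous outcome. With a linear fluctuation, a single update already sets the plug-in bias term exactly to zero, so no iteration is shown; this is a sketch of the idea under those simplifying assumptions, not the tmle3 implementation:

```python
import numpy as np

def tmle_mean(Y, T, ps, pred, l):
    """One-step targeted estimate of E[Y(l)] with a linear fluctuation.

    `pred` holds initial outcome-model predictions under treatment l for
    every subject; `ps` holds generalised propensity scores, one column per group.
    """
    h = (T == l) / ps[:, l]                       # "clever covariate" at the observed data
    # choose epsilon so the plug-in bias term (the weighted residual) is exactly zero
    eps = np.sum(h * (Y - pred)) / np.sum((T == l) / ps[:, l] ** 2)
    # targeted plug-in estimate using the updated predictions for every subject
    return np.mean(pred + eps / ps[:, l])

# Toy randomised design with a deliberately mis-specified initial model (all zeros);
# the targeting step corrects the estimate towards E[Y(1)] = 2
rng = np.random.default_rng(3)
n = 5000
T = rng.choice(2, size=n, p=[0.5, 0.5])
Y = 2 * T + rng.normal(scale=0.5, size=n)
ps = np.full((n, 2), 0.5)
mu1 = tmle_mean(Y, T, ps, np.zeros(n), 1)
```

Even though the initial outcome model predicts zero for everyone, solving for the fluctuation parameter recovers an estimate close to 2, mirroring how the targeting step removes the plug-in bias of the initial fit.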
4. Monte Carlo simulation
We perform a simulation study using R software to assess the finite sample performance of various causal effect estimation methods, including (1) RA, IPW, and DR methods with mis-specified linear models, (2) Oracle methods with correct covariates in the models: RA_Oracle, IPW_Oracle, and DR_Oracle, (3) SL-based methods: RA_SL, IPW_SL, and DR_SL, and (4) TMLE for estimating causal effects with multiple treatments. The TMLE is computed using the tmle3 package, while the SL-based methods utilise the super learner package sl3. Our simulation consists of 1000 iterations, with sample sizes of 1000 and 5000.
4.1. Data generating process (DGP)
In our simulation studies, we generate propensity scores using a multinomial logistic regression model and draw the outcome from a normal distribution. Both the propensity scores and the outcome depend on six covariates and their interactions. The outcome model is generated separately for each treatment level.
4.1.1. Covariates
The data generating process of the covariates is adapted from Yang et al. (2016). We generate the covariates from a multivariate normal distribution with a specified mean vector and covariance matrix, and collect them into the covariate vector $\mathbf{X}$.
In the propensity score model, we set the true covariates as
In the outcome model, we set the true covariates as
In Web Appendix D of the Supplemental online materials, we provide additional simulation results where the data generation process is similar except for the inclusion of five additional covariates as distractors which are unrelated to either the treatment or the outcome variable.
4.1.2. Propensity score
The treatment assignment follows the multinomial logistic regression model
$$\Pr(T_i = l \mid \mathbf{X}_i) = \frac{\exp\big(\mathbf{X}_i^{\top}\boldsymbol{\beta}_l\big)}{\sum_{k=1}^{4}\exp\big(\mathbf{X}_i^{\top}\boldsymbol{\beta}_k\big)}, \quad l = 1, \dots, 4,$$
which gives the true propensity scores, where $\mathbf{X}_i$ denotes the true covariate vector of the propensity score model and $\boldsymbol{\beta}_l$ denotes the parameter vector for group $l$ (with the reference group's coefficients set to zero).
Figure 1 shows the boxplot of the true propensity scores in one simulation run, indicating that many propensity scores fall below 0.2. Thus, in our simulation, some propensity scores approach zero, which poses a challenge for propensity score-based methods to perform well.
Figure 1.

Boxplot of the true propensity scores for different treatment groups in one simulation run.
In Web Appendix B of the Supplemental online materials, we provide additional simulation results where the data generation is very similar except that the parameters for the true propensity score model are smaller, leading to fewer extreme propensity scores.
4.1.3. Outcomes
The potential outcomes are generated from a normal linear model in the true outcome covariates, with a separate model for each treatment level.
The true ATEs for the pairwise comparisons of the four treatments are $\tau_{12} = -1.5$, $\tau_{13} = -4$, $\tau_{14} = -5.5$, $\tau_{23} = -2.5$, $\tau_{24} = -4$, and $\tau_{34} = -1.5$.
In the Web Appendix C of the Supplemental online materials, we provide additional simulation results where the data generation is very similar except that the error term is generated based on a log-normal distribution, leading to errors with a skewed distribution.
4.1.4. Super learner and TMLE libraries
As discussed in Section 3.4, the super learner software leverages a weighted combination of various machine learning methods, aiming to yield optimal results for a given loss function (e.g. non-negative least squares loss for continuous outcomes, negative log likelihood loss for categorical outcomes). Thus, we must supply a range of machine learning libraries when using the super learner package sl3. In our simulation study, the libraries used for the outcome model (a continuous variable) are: xgboost, lasso (i.e. glmnet with alpha = 1), gam, gbm, glm_fast, ranger, mean, ridge (i.e. glmnet with alpha = 0), earth. The outcome models are fitted within each treatment group separately. For the propensity score model (a categorical variable), we employ the following libraries: xgboost, lasso, ranger, mean, ridge. These same libraries are employed for TMLE estimation, as it relies on super learner libraries within its routine.
4.2. Monte Carlo simulation results
For the estimators RA, IPW and DR, we must assume a parametric model for either the outcome, the treatment, or both. In this simulation study, we evaluate their performance under two scenarios: (1) a mis-specified model that includes only the six covariates without any interaction or quadratic terms, and (2) a correctly specified model using the true covariate vectors. We refer to the estimators in the latter case as RA_Oracle, IPW_Oracle, DR_Oracle.
We simulate the data and compute ATE estimates using various methods over 1000 iterations. We then evaluate the ATE estimators in terms of bias (the average of all point estimates from 1000 simulations minus the true ATE), Estimated Standard Error (ESE, the sample average of the standard error estimates from 1000 simulations), Sample Standard Error (SSE, the square root of the sample variance of the ATE estimates from 1000 simulations), root mean square error (RMSE) of the estimates, and the coverage probability (CP) of the true ATE parameter by the 95% confidence intervals across all simulations. For all RA and DR methods, Equations (1) and (3) are used to estimate their variances, respectively. For the IPW_SL estimator, Equation (2) is used to estimate the variance. For the IPW and IPW_Oracle estimators, their variances are estimated using the empirical sandwich method to incorporate the parametric propensity score models, similar to equation (19) of Lunceford and Davidian (2004). The results of all pairwise comparisons are presented in Table 1, Figure 2, and Web Appendix A of the Supplemental online materials (for the table with sample size of 5000).
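For clarity, these performance metrics can be computed from simulation output as in the following Python sketch, using synthetic replicates (not the paper's results):

```python
import numpy as np

def performance(est, se, truth):
    """Bias, ESE, SSE, RMSE, and 95% CI coverage from simulation replicates."""
    bias = est.mean() - truth
    ese = se.mean()                                  # average of SE estimates
    sse = est.std(ddof=1)                            # SD of the point estimates
    rmse = np.sqrt(np.mean((est - truth) ** 2))
    lo, hi = est - 1.96 * se, est + 1.96 * se
    cp = np.mean((lo <= truth) & (truth <= hi))      # 95% CI coverage
    return bias, ese, sse, rmse, cp

# Toy replicates: an unbiased estimator with a correctly estimated standard error
rng = np.random.default_rng(5)
est = rng.normal(loc=-1.5, scale=0.3, size=10000)
se = np.full(10000, 0.3)
bias, ese, sse, rmse, cp = performance(est, se, -1.5)
# coverage should be close to the nominal 0.95, and ESE should track SSE
```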
Table 1.
Simulation study results for various causal methods for multiple treatment group comparisons with sample size of 1000.
| Comparison | T1 vs. T2 (True ATE = −1.5) | T1 vs. T3 (True ATE = −4) | T1 vs. T4 (True ATE = −5.5) |
| Method | Bias | ESE | SSE | RMSE | CP | Bias | ESE | SSE | RMSE | CP | Bias | ESE | SSE | RMSE | CP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RA | 0.600 | 0.152 | 0.469 | 0.762 | 0.215 | −0.300 | 0.151 | 0.411 | 0.509 | 0.427 | 0.284 | 0.135 | 0.564 | 0.631 | 0.323 |
| IPW | 0.426 | 0.469 | 0.482 | 0.643 | 0.830 | −0.423 | 0.408 | 0.420 | 0.596 | 0.822 | 0.144 | 0.565 | 0.587 | 0.604 | 0.930 |
| DR | 0.471 | 0.472 | 0.479 | 0.672 | 0.814 | −0.400 | 0.411 | 0.419 | 0.579 | 0.840 | 0.163 | 0.558 | 0.577 | 0.600 | 0.923 |
| RA_Oracle | −0.003 | 0.279 | 0.299 | 0.299 | 0.933 | 0.000 | 0.259 | 0.275 | 0.275 | 0.932 | 0.006 | 0.261 | 0.274 | 0.274 | 0.938 |
| IPW_Oracle | −0.005 | 0.531 | 0.543 | 0.543 | 0.948 | −0.001 | 0.425 | 0.438 | 0.438 | 0.942 | 0.053 | 0.501 | 0.523 | 0.526 | 0.935 |
| DR_Oracle | −0.003 | 0.295 | 0.299 | 0.299 | 0.941 | 0.001 | 0.274 | 0.274 | 0.274 | 0.946 | 0.006 | 0.277 | 0.273 | 0.273 | 0.956 |
| RA_SL | 0.069 | 0.272 | 0.353 | 0.360 | 0.867 | −0.249 | 0.253 | 0.323 | 0.408 | 0.775 | −0.248 | 0.253 | 0.401 | 0.471 | 0.713 |
| IPW_SL | 0.570 | 0.447 | 0.489 | 0.751 | 0.723 | −0.257 | 0.363 | 0.411 | 0.485 | 0.853 | 0.418 | 0.460 | 0.538 | 0.681 | 0.800 |
| DR_SL | 0.123 | 0.290 | 0.330 | 0.352 | 0.881 | 0.015 | 0.267 | 0.293 | 0.294 | 0.919 | 0.138 | 0.280 | 0.333 | 0.361 | 0.856 |
| TMLE | 0.440 | 0.448 | 0.441 | 0.623 | 0.827 | −0.064 | 0.381 | 0.376 | 0.382 | 0.957 | 0.317 | 0.487 | 0.463 | 0.561 | 0.904 |
| Comparison | T2 vs. T3 (True ATE = −2.5) | | | | | T2 vs. T4 (True ATE = −4) | | | | | T3 vs. T4 (True ATE = −1.5) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | Bias | ESE | SSE | RMSE | CP | Bias | ESE | SSE | RMSE | CP | Bias | ESE | SSE | RMSE | CP |
| RA | −0.900 | 0.181 | 0.496 | 1.028 | 0.119 | −0.316 | 0.225 | 0.662 | 0.733 | 0.456 | 0.584 | 0.101 | 0.565 | 0.813 | 0.176 |
| IPW | −0.849 | 0.497 | 0.513 | 0.992 | 0.581 | −0.282 | 0.648 | 0.684 | 0.740 | 0.914 | 0.567 | 0.571 | 0.587 | 0.816 | 0.797 |
| DR | −0.871 | 0.492 | 0.509 | 1.009 | 0.554 | −0.309 | 0.640 | 0.674 | 0.741 | 0.905 | 0.563 | 0.560 | 0.577 | 0.806 | 0.801 |
| RA_Oracle | 0.004 | 0.181 | 0.204 | 0.204 | 0.912 | 0.010 | 0.357 | 0.382 | 0.382 | 0.927 | 0.006 | 0.257 | 0.279 | 0.279 | 0.923 |
| IPW_Oracle | 0.004 | 0.494 | 0.502 | 0.502 | 0.952 | 0.058 | 0.637 | 0.662 | 0.665 | 0.944 | 0.054 | 0.540 | 0.564 | 0.567 | 0.935 |
| DR_Oracle | 0.004 | 0.202 | 0.205 | 0.205 | 0.941 | 0.009 | 0.369 | 0.382 | 0.382 | 0.937 | 0.005 | 0.272 | 0.279 | 0.279 | 0.942 |
| RA_SL | −0.317 | 0.193 | 0.285 | 0.426 | 0.594 | −0.316 | 0.342 | 0.501 | 0.593 | 0.754 | 0.001 | 0.238 | 0.409 | 0.409 | 0.738 |
| IPW_SL | −0.827 | 0.464 | 0.495 | 0.964 | 0.559 | −0.152 | 0.543 | 0.630 | 0.648 | 0.885 | 0.675 | 0.476 | 0.541 | 0.865 | 0.681 |
| DR_SL | −0.107 | 0.214 | 0.248 | 0.270 | 0.872 | 0.015 | 0.367 | 0.434 | 0.434 | 0.887 | 0.123 | 0.274 | 0.331 | 0.353 | 0.847 |
| TMLE | −0.504 | 0.397 | 0.373 | 0.627 | 0.766 | −0.123 | 0.542 | 0.535 | 0.548 | 0.954 | 0.382 | 0.445 | 0.424 | 0.571 | 0.861 |
Note: ‘ATE’ is the average treatment effect; ‘Bias’ is the difference between the mean estimate and the true ATE; ‘ESE’ is the sample average of the standard error estimates; ‘SSE’ is the empirical standard error of the estimates; ‘RMSE’ is the root mean square error of the estimates; ‘CP’ gives the proportion of 95% confidence intervals containing the true value; ‘RA’, ‘IPW’, and ‘DR’ use mis-specified models in regression adjustment, inverse probability weighting, and the double-robust estimator, respectively; ‘RA_SL’, ‘IPW_SL’, and ‘DR_SL’ are the corresponding super learner-based methods; ‘RA_Oracle’, ‘IPW_Oracle’, and ‘DR_Oracle’ use correctly specified models in the three methods; ‘TMLE’ is the targeted maximum likelihood estimator.
Figure 2.

Simulation results with sample sizes of 1000 and 5000. ‘Absolute Bias’ is the absolute value of the difference between the mean estimate and the true average treatment effect; ‘Rejection Rate’ is 100% minus the coverage probability of the 95% confidence interval; ‘RA’, ‘IPW’, and ‘DR’ use mis-specified models in regression adjustment, inverse probability weighting, and the double-robust estimator, respectively; ‘RA_SL’, ‘IPW_SL’, and ‘DR_SL’ are the corresponding super learner-based methods; ‘RA_Oracle’, ‘IPW_Oracle’, and ‘DR_Oracle’ use correctly specified models in the three methods; ‘TMLE’ is the targeted maximum likelihood estimator.
From the results, we can draw the following conclusions. (1) When the treatment model or the outcome model is correctly specified, the model-based estimators RA_Oracle, IPW_Oracle, and DR_Oracle perform well: they have small biases, their estimated standard errors are very close to the sample standard errors, and their 95% confidence interval coverage probabilities are very close to 0.95. The IPW_Oracle estimator has a larger SSE and RMSE than the other two oracle methods. (2) When both the outcome model and the propensity score model are mis-specified, the RA, IPW, and DR estimators perform poorly, with larger biases for all three and, for the RA estimator, an SSE much larger than the ESE; the SSE is similar to the ESE for the IPW and DR estimators. The coverage probabilities can fall well below 0.95. (3) For the super learner-based methods using an array of machine learning tools, the outcome-based estimator RA_SL shows improved bias compared to RA for most pairwise comparisons. However, as predicted by theory (Hines et al. 2022), its variance formula is incorrect, resulting in a large discrepancy between the ESE and SSE and reduced coverage of the 95% confidence interval. Although the IPW_SL estimator attempts to reduce bias through the super learner, its bias does not always improve over the naive IPW estimator. Its ESE tends to be smaller than its SSE, but the difference is not as dramatic as for the RA_SL estimator. (4) The double-robust estimator using machine learning methods (DR_SL) tends to have very small bias and an RMSE close to that of the corresponding oracle method (DR_Oracle). Its ESE is slightly smaller than its SSE, leading to a coverage probability slightly below 0.95. (5) The performance of TMLE is good for some group comparisons but poor for others compared to DR_SL. The source of this problem is difficult to pinpoint, since it depends on the algorithm used in the tmle3 package.
One possible reason is that the tmle3 package does not fit the outcome models separately within each treatment group, but rather fits a single outcome model using the treatment group and the covariates together. Given that the outcome models differ substantially between treatment groups in our simulation scenario, not stratifying the outcome models by treatment group may lead to poorer performance. The exact reason for this phenomenon still needs further investigation. (6) When the sample size increases from 1000 to 5000, the biases of both the DR_SL and TMLE methods decrease. The coverage probabilities of the DR_SL method remain similar in general, but the TMLE method sometimes produces worse coverage probabilities at a sample size of 5000 due to relatively larger bias for certain group comparisons compared to DR_SL. The biases of the other non-oracle methods do not improve with increased sample sizes, and their coverage probabilities deteriorate as the sample size increases.
In Web Appendix B of the Supplemental online materials, we provide simulation results when the data contain fewer extreme propensity scores. Under this scenario, owing to a better fit of the propensity score model, the IPW_SL and TMLE methods show improved performance, with smaller bias and RMSE and higher coverage of the 95% confidence intervals. RA_SL performs similarly, as expected, since it does not depend on propensity scores. The DR_SL method has a similarly small bias, and its ESE is closer to the SSE, leading to an improved coverage probability close to 0.95. In Web Appendix C, we provide simulation results when the error term is generated from a skewed distribution, and the results are similar. In Web Appendix D, we present simulation results with five additional covariates as distractors. Under this scenario, the performance of DR_SL remains unchanged. The TMLE method exhibits slightly poorer performance overall, with small increases in bias and RMSE and reduced coverage probabilities of the 95% confidence intervals. As predicted by theory, RA_SL and IPW_SL continue to perform poorly.
Hence, while our aim is to utilise the best model, the true model is often unknown. Machine learning methods allow us to approximate the true model, but we must exercise caution when employing machine learning for the outcome regression model, as bias may persist and a valid variance estimator can be difficult to obtain. Similarly, machine learning can be applied to the IPW estimators, but these estimators often exhibit greater variability. Therefore, the most favourable approach for leveraging machine learning methods is the double-robust estimator, which is characterised by lower biases, a valid variance estimator, and improved coverage probabilities for the 95% confidence interval.
5. A case study
Estimating resource utilisation for the care of HIV patients is crucial for shaping health policy. In the ongoing global battle against the HIV/AIDS pandemic, comprehending the allocation and use of healthcare resources is vital for effective policy formulation. This ensures that healthcare systems remain adequately prepared to meet the long-term needs of HIV patients without overwhelming the system. We will utilise data from Leon-Reyes et al. (2019), who conducted a study that evaluated the relationship between costs for HIV care and patient characteristics using linked data from insurance claims and the Swiss HIV Cohort Study (SHCS).
The SHCS, established in 1988, encompasses all HIV-infected individuals aged over 18 years (Schoeni-Affolter et al. 2010). During biannual visits, clinical, laboratory, and behavioural data are collected. The cohort is representative of the Swiss HIV epidemic and includes an estimated 70% of HIV-infected individuals receiving antiretroviral therapy (ART). In Switzerland, health insurance is mandatory for all residents and includes premium and copayment components. Swiss health insurance claims provide information about dates and reimbursed fees from healthcare providers. Cost assessments are from the health insurer’s perspective and represent charges to health insurers that were converted from Swiss francs to US dollars (Leon-Reyes et al. 2019). For illustrative purposes, we restrict our analysis to individuals with complete data and ignore the censoring issue for some costs. This combined dataset comprises 951 individuals and includes medical costs from the year 2012 (the outcome), CD4 levels at the initial presentation (the treatment variable with four levels: <100/uL, 100–350/uL, 350–500/uL, ≥500/uL), and 18 other variables that may influence costs and/or CD4 levels at presentation. A summary of the data is presented in Table 2. Our objective is to apply the causal inference methods to compare the costs of HIV patient care among different CD4 level groups, while controlling for all observed confounders. Although CD4 level is not a ‘treatment’ in the conventional sense of causal inference, as it is not manipulable, the causal methods discussed here are applicable in settings where biological (e.g. body weight, LDL-cholesterol, CD4 cell count) or social characteristics (e.g. socioeconomic status) are studied (Li et al. 2013; Decruyenaere et al. 2020; Hernán and Robins 2020), allowing for control of all observed confounders.
Table 2.
Summary of the HIV example data (N = 951).
| Patient characteristics | Mean (SD) or n (%) |
|---|---|
| Cost | 28,538.1 (14,841.0) |
| CD4 counts at presentation | |
| <100 | 148 (15.6%) |
| 100–350 | 336 (35.3%) |
| 350–500 | 224 (23.6%) |
| ≥ 500 | 243 (25.6%) |
| Age | |
| ≤ 40 | 147 (15.5%) |
| 40–60 | 652 (68.6%) |
| ≥ 60 | 152 (16.0%) |
| Drug-past | 796 (83.7%) |
| Drug-current | 221 (23.2%) |
| Past alcohol consumption | |
| Non-drinker | 219 (23.0%) |
| Light | 529 (55.6%) |
| Moderate | 120 (12.6%) |
| Severe | 83 (8.7%) |
| Smoking | |
| Never | 268 (28.2%) |
| Past | 222 (23.3%) |
| Current | 461 (48.5%) |
| Education | |
| Basic | 231 (24.3%) |
| Other | 720 (75.7%) |
| Risk | |
| Homo | 409 (43.0%) |
| Hetero | 367 (38.6%) |
| Other sources | 175 (18.4%) |
| Hep B | 63 (6.6%) |
| Hep C | 225 (23.7%) |
| Any prior cardiovascular event | 49 (5.2%) |
| Psychiatric co-morbidity | 339 (35.6%) |
| Non-adherence to ART regimen | 59 (6.2%) |
| Bipolar | 75 (7.9%) |
| Ethnicity | |
| White | 799 (84.0%) |
| Black | 93 (9.8%) |
| Hispanic | 28 (2.9%) |
| Asian or other | 31 (3.3%) |
| Female | 261 (27.4%) |
| Experienced past virological failure | 625 (65.7%) |
| Time since HIV diagnosis (years) | 15.5 (7.6) |
| Time on ART regimen (years) | 11.8 (5.7) |
We denote the four CD4 groups <100/uL, 100–350/uL, 350–500/uL, and ≥500/uL as groups 1, 2, 3, and 4, respectively, for ease of notation. The propensity scores for the four CD4 groups are denoted PS1, PS2, PS3, and PS4. Figure 3 displays boxplots of the propensity scores estimated using multinomial logistic regression with the R package nnet and using the super learner-based method from the R package sl3, both based on all 18 covariates. The libraries used for the super learner methods are the same as those employed in the simulation studies. The two approaches produce similar estimated propensity scores, although multinomial logistic regression tends to exhibit more extreme values.
Figure 3.

Boxplot of the propensity scores estimated by multinomial logistic regression (Multinomial) and super learner-based method (SL) for the HIV example data.
We use the various methods described in Section 3 to analyse this dataset. We estimate the cost differences comparing CD4 Groups 1, 2, and 3 with Group 4. Since the correct model for these data is unknown, we obtain the RA, IPW, and DR estimators by fitting parametric regression models for the outcome and the propensity score using all available covariates, and we also apply SL methods to these models to obtain the RA_SL, IPW_SL, and DR_SL estimators. The standard errors are obtained using the formulas given in Section 3 (as in the simulations). Finally, we obtain the TMLE estimator using the R package tmle3, with the same libraries as employed in the simulation studies. The results are displayed in Table 3.
Table 3.
Estimated average treatment effects for the HIV example study.
| Comparison | < 100 vs. ≥ 500 | | 100–350 vs. ≥ 500 | | 350–500 vs. ≥ 500 | |
|---|---|---|---|---|---|---|
| Method | ATE | SE | ATE | SE | ATE | SE |
| RA | 2132.95 | 201.92 | 1616.23 | 232.93 | −387.18 | 148.50 |
| IPW | 1197.36 | 1669.61 | 1040.67 | 1119.80 | −809.31 | 1098.87 |
| DR | 1460.91 | 1773.38 | 1441.18 | 1116.12 | −447.19 | 1083.69 |
| RA_SL | 602.29 | 125.11 | 1254.28 | 183.31 | −1234.33 | 137.78 |
| IPW_SL | 503.66 | 1172.98 | 1147.02 | 1059.25 | −1259.26 | 903.93 |
| DR_SL | 862.10 | 1027.83 | 1462.39 | 885.33 | −1160.77 | 681.74 |
| TMLE | 1744.11 | 1651.94 | 1792.73 | 1190.07 | −719.26 | 1116.43 |
Note: ‘ATE’ is the estimated average treatment effect (i.e. the cost difference); ‘SE’ is the estimated standard error; ‘RA’, ‘IPW’, and ‘DR’ fit linear regressions for the outcome models and a multinomial model for the propensity score models; ‘RA_SL’, ‘IPW_SL’, ‘DR_SL’, and ‘TMLE’ use super learner-based methods.
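For a pairwise comparison, the DR point estimate combines each group's fitted outcome regression with an inverse-propensity-weighted residual correction (the standard AIPW form). The following pure-Python sketch is illustrative only; the paper's analysis uses R, and all names and inputs here are hypothetical.

```python
def aipw_ate(y, t, ps, mu, g1, g0):
    """Pairwise AIPW (double-robust) ATE estimate: group g1 minus group g0.

    y: outcomes; t: treatment labels; ps[g][i]: estimated propensity of
    group g for unit i; mu[g][i]: fitted outcome regression for group g.
    Illustrative sketch of the standard AIPW estimator.
    """
    n = len(y)

    def counterfactual_mean(g):
        total = 0.0
        for i in range(n):
            ind = 1.0 if t[i] == g else 0.0
            # outcome-model prediction plus weighted residual correction
            total += mu[g][i] + ind * (y[i] - mu[g][i]) / ps[g][i]
        return total / n

    return counterfactual_mean(g1) - counterfactual_mean(g0)
```

The residual term vanishes in expectation when either the outcome model or the propensity model is correct, which is the source of the double-robustness property discussed in Section 3.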
Similar to the simulation studies, we obtained different results with different estimators. The RA and RA_SL methods produced smaller standard errors, but the theory and the simulation studies indicate that these standard errors tend to underestimate the true variability. Due to the large standard errors, most of the estimated ATEs (cost differences) are not statistically significantly different from zero. Nevertheless, the groups with CD4 counts <100/uL and 100–350/uL are estimated to incur higher costs than the group with CD4 counts ≥500/uL, while the group with CD4 counts 350–500/uL is estimated to incur lower costs than the ≥500/uL group.
6. Discussion and conclusions
In this paper, we systematically introduce the application of advanced causal inference methods for comparing multiple treatment groups. These methods involve modelling the outcome, the propensity score, or both. We extend these methods to situations where machine learning techniques, particularly the super learner, are employed to model these relationships. While machine learning methods offer the advantage of not requiring specific parametric relationships, thereby avoiding restrictive assumptions, they also present challenges in terms of statistical inference. Traditional methods, such as the outcome regression model, can exhibit bias that does not diminish with a large sample size. However, the double-robust estimator and the TMLE method are considered de-biased estimators and provide accurate variance estimates (van der Laan and Rose 2011; Chernozhukov et al. 2018; Hines et al. 2022).
In our simulation study, we observed the following results: (I) when both the propensity score model and the outcome model are correctly specified, RA, IPW, and DR methods exhibit negligible biases and achieve the correct coverage probability for the 95% confidence interval; (II) when both the propensity score model and the outcome model are incorrectly specified, RA, IPW, and DR may perform poorly with significant biases; (III) when the super learner method is employed, the DR estimator and TMLE estimator are preferred over the RA and IPW estimators; (IV) the TMLE estimator does not perform as well as the DR estimator, and the reasons for this discrepancy need further investigation.
While there are existing statistical software packages for conducting causal inference with multiple treatment group comparisons, we have identified areas where improvements are needed for broader applications. For instance, the Stata command teffects aipw can be utilised to estimate multivalued treatment effects, but we encountered convergence issues when applying it to our real data. Additionally, the R package npcausal, developed by Edward H. Kennedy, calculates the DR estimator while incorporating machine learning methods. However, we observed that it employs a ‘1 vs. other’ strategy when using the super learner to estimate the propensity scores and computes the propensity score for the last group (e.g. PS4 in our data) as 1 minus the sum of the other propensity scores (i.e. PS4 = 1 − PS1 − PS2 − PS3), which may occasionally make PS4 negative. The package currently does not support the use of different super learner libraries for the outcome and propensity score models. Similarly, we believe there is room for improvement in the TMLE packages as well.
In summary, our results suggest that the double-robust estimator, when combined with machine learning, is the most favourable approach, demonstrating lower biases, a valid variance estimator, and improved coverage probabilities for the 95% confidence interval. In practice, machine learning-based methods can be computationally intensive, particularly for large datasets. To address this, parallel processing can be adopted for larger-scale applications to reduce computation time.
For propensity score estimation in the presence of multiple treatments, we recommend using the super learner R package, which efficiently handles multiple treatments without requiring additional ‘1 vs. other’ estimation and manual normalisation steps.
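The advantage of a joint multinomial fit can be seen in a softmax-style construction: all group probabilities are positive and sum to one by design, so no ‘1 minus the others’ step is needed. The sketch below is a pure-Python illustration with hypothetical inputs (in the paper the fitting is done in R via nnet or sl3).

```python
import math

def softmax_propensities(scores):
    """Multinomial-logit propensity scores from per-group linear predictors.

    scores: one list of linear predictors per unit, one entry per treatment
    group (hypothetical values). The softmax guarantees each probability is
    in (0, 1) and that the probabilities sum to 1 for every unit.
    """
    out = []
    for row in scores:
        m = max(row)                          # subtract max for numerical stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out
```

By contrast, fitting separate ‘1 vs. other’ models and setting the last probability to 1 minus the sum of the others provides no such guarantee, which is the failure mode noted above.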
From a methodological perspective, all causal inference methods rely on certain assumptions. While doubly robust and TMLE estimators provide some protection against model misspecification, they still require at least one model (either the propensity score or the outcome model) to be correctly specified. Additionally, methods based on inverse probability weighting can be sensitive to extreme propensity scores, leading to high variance in the estimates. To address this, strategies such as restricting the covariate space (i.e. excluding covariate values where certain treatment options are not feasible), trimming, or using stabilised weights can help improve the robustness of the estimates.
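The trimming and stabilised-weight remedies mentioned above can be sketched in a few lines. This is an illustrative pure-Python helper with hypothetical names and inputs, not code from the paper: propensities are truncated to fixed bounds, and each unit's weight is stabilised by its group's marginal treatment probability.

```python
def stabilised_trimmed_weights(t, ps, bounds=(0.01, 0.99)):
    """Stabilised IPW weights with propensity truncation (hypothetical helper).

    t: treatment labels; ps[g][i]: estimated propensity of group g for unit i.
    Each unit's weight is P(T = g) / ps[g][i] for its own group g, with the
    propensity first truncated to `bounds` to limit the influence of extreme
    propensity scores.
    """
    n = len(t)
    # marginal treatment probabilities for the stabilising numerator
    marginal = {g: sum(1 for ti in t if ti == g) / n for g in set(t)}
    lo, hi = bounds
    weights = []
    for i, g in enumerate(t):
        p = min(max(ps[g][i], lo), hi)    # truncate extreme propensities
        weights.append(marginal[g] / p)
    return weights
```

Truncation trades a small amount of bias for a potentially large reduction in variance, which is why it is commonly recommended when propensity scores are near 0 or 1 (Sturmer et al. 2021).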
Causal inference has found applications across various fields, such as longitudinal data analysis, mediation studies, and dynamic treatment assessments. It is crucial to grasp the underlying theory and assumptions that support these methods. The integration of machine learning techniques into causal inference introduces additional statistical inference challenges, making it a domain that warrants further exploration and research.
Supplementary Material
Supplemental data for this article can be accessed online at http://dx.doi.org/10.1080/10485252.2025.2544936.
Acknowledgments
We are deeply appreciative of Dr. Heiner C. Bucher from Basel Institute for Clinical Epidemiology and Biostatistics, University Hospital Basel and University of Basel, for generously providing the de-identified HIV cost data for our use in this study. This research was partly supported by UC Davis Conte Center Biostatistics Core (P50 MH106438) for Dr. Chen. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Funding
This work was supported by National Institute of Mental Health [P50MH106438].
Footnotes
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- Abadie A, and Imbens GW (2016), ‘Matching on the Estimated Propensity Score’, Econometrica, 84(2), 781–807. [Google Scholar]
- Bang H, and Robins JM (2005), ‘Doubly Robust Estimation in Missing Data and Causal Inference Models’, Biometrics, 61(4), 962–973. [DOI] [PubMed] [Google Scholar]
- Bickel PJ, Klaassen CA, Bickel PJ, Ritov Y, Klaassen J, Wellner JA, and Ritov Y (1993), Efficient and Adaptive Estimation for Semiparametric Models (Vol. 4), New York: Springer. [Google Scholar]
- Cattaneo MD (2010), ‘Efficient Semiparametric Estimation of Multi-Valued Treatment Effects Under Ignorability’, Journal of Econometrics, 155(2), 138–154. [Google Scholar]
- Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, and Robins J (2018), ‘Double/Debiased Machine Learning for Treatment and Structural Parameters’, The Econometrics Journal, 21(1), C1–C68. 10.1111/ectj.12097. [DOI] [Google Scholar]
- Coyle JR (2021), ‘tmle3: The Extensible TMLE Framework’, R package version 0.2.0. Available at https://github.com/tlverse/tmle3.
- Coyle JR, Hejazi NS, Malenica I, Phillips RV, and Sofrygin O (2021), ‘sl3: Modern Pipelines for Machine Learning and Super Learning’, R package version 1.4.2. Available at https://github.com/tlverse/sl3. [Google Scholar]
- D’Agostino RB Jr. (1998), ‘Propensity Score Methods for Bias Reduction in the Comparison of a Treatment to a Non-Randomized Control Group’, Statistics in Medicine, 17(19), 2265–2281. [DOI] [PubMed] [Google Scholar]
- Decruyenaere A, Steen J, Colpaert K, Benoit DD, Decruyenaere J, and Vansteelandt S (2020), ‘The Obesity Paradox in Critically Ill Patients: A Causal Learning Approach to a Casual Finding’, Critical Care, 24, 485. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Feng P, Zhou XH, Zou QM, Fan MY, and Li XS (2012), ‘Generalized Propensity Score for Estimating the Average Treatment Effect of Multiple Treatments’, Statistics in Medicine, 31(7), 681–697. [DOI] [PubMed] [Google Scholar]
- Gruber S, and van der Laan M (2012), ‘tmle: An R Package for Targeted Maximum Likelihood Estimation’, Journal of Statistical Software, 51(13), 1–35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hahn J (1998), ‘On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects’, Econometrica, 66(2), 315–331. [Google Scholar]
- Hastie T, Tibshirani R, Friedman JH, and Friedman JH (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Vol. 2), New York: Springer. [Google Scholar]
- Hernán MA, and Robins JM (2020), Causal Inference, Boca Raton, FL: CRC Press. [Google Scholar]
- Hines O, Dukes O, Diaz-Ordaz K, and Vansteelandt S (2022), ‘Demystifying Statistical Learning Based on Efficient Influence Functions’, The American Statistician, 76(3), 292–304. [Google Scholar]
- Hirano K, Imbens GW, and Ridder G (2003), ‘Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score’, Econometrica, 71(4), 1161–1189. [Google Scholar]
- Horvitz DG, and Thompson DJ (1952), ‘A Generalization of Sampling Without Replacement From a Finite Universe’, Journal of the American Statistical Association, 47(260), 663–685. [Google Scholar]
- Ichimura H, and Newey WK (2022), ‘The Influence Function of Semiparametric Estimators’, Quantitative Economics, 13(1), 29–61. [Google Scholar]
- Imbens GW (2000), ‘The Role of the Propensity Score in Estimating Dose-Response Functions’, Biometrika, 87(3), 706–710. [Google Scholar]
- Imbens GW, and Rubin DB (2015), Causal Inference in Statistics, Social, and Biomedical Sciences, Cambridge, UK: Cambridge University Press. [Google Scholar]
- Jewell NP (2003), Statistics for Epidemiology, Boca Raton, FL: CRC Press. [Google Scholar]
- Kang JD, and Schafer JL (2007), ‘Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean From Incomplete Data’, Statistical Science, 22(4), 523–539. [Google Scholar]
- Leon-Reyes S, Schäfer J, Früh M, Schwenkglenks M, Reich O, Schmidlin K, Staehelin C, Battegay M, Cavassini M, Hasse B, Bernasconi E, Calmy A, Hoffmann M, Schoeni-Affolter F, Zhao H, and Bucher HC (2019), ‘Cost Estimates for Human Immunodeficiency Virus (HIV) Care and Patient Characteristics for Health Resource Use From Linkage of Claims Data With the Swiss HIV Cohort Study’, Clinical Infectious Diseases, 68(5), 827–833. [DOI] [PubMed] [Google Scholar]
- Li F, Zaslavsky AM, and Landrum MB (2013), ‘Propensity Score Weighting With Multilevel Data’, Statistics in Medicine, 32(19), 3373–3387. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Linden A, Uysal SD, Ryan A, and Adams JL (2016), ‘Estimating Causal Effects for Multivalued Treatments: A Comparison of Approaches’, Statistics in Medicine, 35(4), 534–552. [DOI] [PubMed] [Google Scholar]
- Lopez MJ, and Gutman R (2017), ‘Estimation of Causal Effects With Multiple Treatments: A Review and New Ideas’, Statistical Science, 32(3), 432–454. [Google Scholar]
- Lunceford JK, and Davidian M (2004), ‘Stratification and Weighting Via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study’, Statistics in Medicine, 23(19), 2937–2960. [DOI] [PubMed] [Google Scholar]
- McCaffrey DF, Ridgeway G, and Morral AR (2004), ‘Propensity Score Estimation With Boosted Regression for Evaluating Causal Effects in Observational Studies’, Psychological Methods, 9(4), 403–425. [DOI] [PubMed] [Google Scholar]
- Pearl J, Glymour M, and Jewell NP (2016), Causal Inference in Statistics: A Primer, Hoboken, NJ: John Wiley & Sons. [Google Scholar]
- Polley E, LeDell E, Kennedy C, and van der Laan M (2021), ‘SuperLearner: Super Learner Prediction’, R package version 2.0–28. Available at https://CRAN.R-project.org/package=SuperLearner. [Google Scholar]
- Robins JM, Rotnitzky A, and Zhao LP (1994), ‘Estimation of Regression Coefficients When Some Regressors Are Not Always Observed’, Journal of the American Statistical Association, 89(427), 846–866. [Google Scholar]
- Robins JM, Rotnitzky A, and Zhao LP (1995), ‘Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data’, Journal of the American Statistical Association, 90(429), 106–121. [Google Scholar]
- Rose S, and Normand SL (2019), ‘Double Robust Estimation for Multiple Unordered Treatments and Clustered Observations: Evaluating Drug-eluting Coronary Artery Stents’, Biometrics, 75(1), 289–296. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rosenbaum PR (1987), ‘Model-Based Direct Adjustment’, Journal of the American Statistical Association, 82(398), 387–394. [Google Scholar]
- Rosenbaum PR, and Rubin DB (1983), ‘The Central Role of the Propensity Score in Observational Studies for Causal Effects’, Biometrika, 70(1), 41–55. [Google Scholar]
- Rubin DB (2005), ‘Causal Inference Using Potential Outcomes’, Journal of the American Statistical Association, 100(469), 322–331. [Google Scholar]
- Scharfstein DO, Rotnitzky A, and Robins JM (1999), ‘Adjusting for Nonignorable Drop-Out Using Semiparametric Nonresponse Models’, Journal of the American Statistical Association, 94(448), 1096–1120. [Google Scholar]
- Schoeni-Affolter F, Ledergerber B, Rickenbach M, Rudin C, Günthard HF, Telenti A, Furrer H, Yerly S, and Francioli P (2010), ‘Cohort Profile: The Swiss HIV Cohort Study’, International Journal of Epidemiology, 39(5), 1179–1189. [DOI] [PubMed] [Google Scholar]
- Scotina AD, Beaudoin FL, and Gutman R (2020), ‘Matching Estimators for Causal Effects of Multiple Treatments’, Statistical Methods in Medical Research, 29(4), 1051–1066. [DOI] [PubMed] [Google Scholar]
- Stefanski LA, and Boos DD (2002), ‘The Calculus of M-Estimation’, The American Statistician, 56(1), 29–38. [Google Scholar]
- Sturmer T, Webster-Clark M, Lund JL, Wyss R, Ellis AR, Lunt M, Rothman KJ, and Glynn RJ (2021), ‘Propensity Score Weighting and Trimming Strategies for Reducing Variance and Bias of Treatment Effect Estimates: A Simulation Study’, American Journal of Epidemiology, 190(8), 1659–1670. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Uysal DS (2013), ‘Doubly Robust Estimation of Causal Effects with Multivalued Treatments’, Institute for Advanced Studies Economics Series, 297. [Google Scholar]
- Uysal DS (2015), ‘Doubly Robust Estimation of Causal Effects with Multivalued Treatments: An Application to the Returns to Schooling’, Journal of Applied Econometrics, 30(5), 763–786. [Google Scholar]
- van der Laan MJ, Polley EC, and Hubbard AE (2007), ‘Super Learner’, Statistical Applications in Genetics and Molecular Biology, 6(1), 25. [Google Scholar]
- van der Laan MJ, and Rose S (2011), Targeted Learning: Causal Inference for Observational and Experimental Data (Vol. 4), New York: Springer. [Google Scholar]
- van der Laan MJ, and Rubin D (2006), ‘Targeted Maximum Likelihood Learning’, The International Journal of Biostatistics, 2(1), 11. [Google Scholar]
- Yan X, Abdia Y, Datta S, Kulasekera K, Ugiliweneza B, Boakye M, and Kong M (2019), ‘Estimation of Average Treatment Effects Among Multiple Treatment Groups by Using An Ensemble Approach’, Statistics in Medicine, 38(15), 2828–2846. [DOI] [PubMed] [Google Scholar]
- Yang S, Imbens GW, Cui Z, Faries DE, and Kadziola Z (2016), ‘Propensity Score Matching and Subclassification in Observational Studies With Multi-Level Treatments’, Biometrics, 72(4), 1055–1065. [DOI] [PubMed] [Google Scholar]