Published in final edited form as: Stat Med. 2022;41(25):5016–5032. doi:10.1002/sim.9551

Flexible propensity score estimation strategies for clustered data in observational studies

Ting-Hsuan Chang 1, Trang Quynh Nguyen 2, Youjin Lee 3, John W Jackson 1,2,4, Elizabeth A Stuart 2,4,5

Abstract

Existing studies have suggested superior performance of nonparametric machine learning over logistic regression for propensity score estimation. However, it is unclear whether the advantages of nonparametric propensity score modeling carry to settings where there is clustering of individuals, especially when there is unmeasured cluster-level confounding. In this work we examined the performance of logistic regression (all main effects), Bayesian additive regression trees and generalized boosted modeling for propensity score weighting in clustered settings, with the clustering being accounted for by including either cluster indicators or random intercepts. We simulated data for three hypothetical observational studies of varying sample and cluster sizes. Confounders were generated at both levels, including a cluster-level confounder that is unobserved in the analyses. A binary treatment and a continuous outcome were generated based on seven scenarios with varying relationships between the treatment and confounders (linear and additive, non-linear/non-additive, non-additive with the unobserved cluster-level confounder). Results suggest that when the sample and cluster sizes are large, nonparametric propensity score estimation may provide better covariate balance, bias reduction, and 95% confidence interval coverage, regardless of the degree of non-linearity or non-additivity in the true propensity score model. When the sample or cluster sizes are small, however, nonparametric approaches may become more vulnerable to unmeasured cluster-level confounding and thus may not be a better alternative to multilevel logistic regression. We applied the methods to the National Longitudinal Study of Adolescent to Adult Health data, estimating the effect of team sports participation during adolescence on adulthood depressive symptoms.

Keywords: propensity score weighting, clustering, machine learning, observational studies, unmeasured confounder

1. INTRODUCTION

Propensity score methods are widely used to evaluate the causal effects of treatments in observational studies. The propensity score, which is the probability of receiving a treatment conditional on a set of covariates,1 is especially useful when there is a large number of confounders that need to be adjusted for. Conditional on the propensity score, the distribution of the covariates entered in the propensity score model is expected to be similar between treatment groups.1 Thus, the propensity scores can be used to reduce bias in the treatment effect estimate that arises from differences in the distribution of observed confounders between treatment groups. This bias reduction can be achieved using multiple strategies, including matching subjects on propensity scores, grouping subjects into strata with similar propensity scores, adjusting for propensity scores in the outcome model, or applying propensity score weights (for more details see, e.g., D’Agostino2; Hirano and Imbens3).

Despite the increasing use of propensity score methods in substantive studies over the past two decades,4 work on this topic in the context of clustered or multilevel data structures has been relatively limited. However, clustered data are common in many disciplines, especially in medical, behavioral, and educational research settings (e.g., students nested within schools, patients nested within hospitals). Here we consider the simplest case where the data are structured in two levels (individual-level and cluster-level) and treatment is administered at the individual level. The clustered structure adds complexity in conducting propensity score analysis. For example, there may be concerns regarding interference among individuals within clusters as well as possible heterogeneity in treatment effect or treatment implementation across clusters. Moreover, it can be challenging to identify and measure confounders at the cluster level, such as when using survey data that do not record cluster-level covariates. Unmeasured cluster-level confounding would create bias in the treatment effect estimate if unaccounted for. Even when all cluster-level confounders are measured, if the number of clusters is small, there may not be sufficient overlap on these confounders and residual confounding may remain. Treatment effect estimates obtained without consideration of these issues can be misleading, and propensity score methods should be adapted for specific contexts. For the purpose of our study, we will focus exclusively on the issue of unmeasured cluster-level confounding while assuming no unmeasured individual-level confounding, as well as the stable unit treatment value assumption (SUTVA),5 which assumes a single version of each treatment level (i.e., no heterogeneity in treatment implementation across clusters) and no interference between individuals, including those that belong to the same cluster. Although SUTVA is often questionable in practical settings with clustered observations, we will generate data based on SUTVA in order to focus on the robustness of propensity score models to unmeasured cluster-level confounding.

When the treatment administered to individuals is binary, to alleviate the concern for unmeasured cluster-level confounding, a logistic regression model with either fixed or random cluster intercepts is often used to estimate propensity scores with two-level clustered data. These models account for unobserved cluster heterogeneity – and thus potential cluster-level unobserved confounding – by allowing the intercept to vary by cluster.6 In fixed effects models, the intercept is a fixed constant for each cluster; thus, the number of parameters increases considerably when there are many clusters, and the estimation of parameters may fail if cluster sizes are too small. Random effects models typically assume that the cluster-specific intercept follows a normal distribution. However, unlike fixed cluster intercepts, random intercepts are shrunken toward the mean and thus may not fully capture the remaining effects of unobserved cluster-level covariates.7 Random effects models also require independence between the observed covariates and the random intercept.8 Both models acknowledge the possibility of unmeasured cluster-level confounding and assume no confounding within each cluster. Existing research has shown that propensity score models that account for the clustering – either through fixed or random cluster intercepts – lead to less biased estimates than models that do not.7,9,10

When there are many potential confounders, specification of the propensity score model can become extremely complicated, especially when there are potential interactions between confounders. To allow more flexibility in the propensity score model, Leite et al11 suggested adopting the parsimony principle in building random effects models (i.e., adding random slopes and cross-level interactions step by step until sufficient covariate balance is attained). This approach, though reasonable, is inefficient and still requires a certain level of knowledge of the functional form for the relationship between treatment assignment and covariates. Nonparametric machine learning methods are one promising solution to model specification challenges, because they can generate flexible models without requiring the functional form to be specified explicitly. Further, there has been evidence that nonparametric propensity score estimation achieves more efficient estimation of the average treatment effect, even when the true propensity score model is known to be parametric.12 This finding is likely related to results showing that estimated propensity scores are preferable to known treatment assignment probabilities for adjusting for chance imbalances.13 Hence, nonparametric methods have gained popularity in propensity score estimation with single-level data – a specific commonly used approach is generalized boosted modeling, which can be used to generate propensity score weights that eliminate most covariate imbalances.14

Some work has been done to compare parametric versus nonparametric approaches for estimating propensity scores in single-level settings. Setoguchi et al15 compared machine learning techniques such as recursive partitioning and neural networks to logistic regression with only main effects with respect to propensity score matching. Their simulation study found that neural networks generally yielded the least biased estimates under various scenarios differing by non-linear and/or non-additive relationships between treatment assignment and covariates. Following Setoguchi et al,15 Lee et al16 examined the performance of propensity score models based on classification and regression trees (CART) with respect to propensity score weighting. Their simulation results supported those of Setoguchi et al,15 showing that estimating propensity scores using nonparametric methods, especially boosted regression trees, may offer advantages in propensity score weighting when the relationship between treatment assignment and covariates is non-linear and/or non-additive (and therefore the logistic regression model with main effects only is misspecified). These improvements include better bias reduction and more consistent 95% confidence interval coverage.16 The simulation designs of Setoguchi et al15 and Lee et al,16 however, assume a single-level data structure and no unmeasured confounding.

Motivated by the limited research on nonparametric propensity score estimation with clustered data, and in particular how unobserved cluster-level confounders may influence the methods' performance, our goal is to examine whether the advantages of flexible modeling of propensity scores extend to clustered data settings. In this work, we conduct simulations to examine the performance of nonparametric and parametric propensity score models, when used to achieve covariate balance between treatment groups across clusters via weighting. Focus is on estimation of the overall average treatment effect (not cluster-specific effects) in a two-level clustered data context where a binary treatment is administered at the individual level. The paper is organized as follows: Section 2 provides a brief introduction to the statistical methods, including inverse probability of treatment weighting and the nonparametric methods that are used to estimate propensity scores in this work. Section 3 describes the simulation setup and the measures for evaluating the performance of different propensity score models. Section 4 presents the simulation results. In Section 5, we apply the methods to data from the National Longitudinal Study of Adolescent to Adult Health (Add Health),17 estimating the effect of team sports participation during adolescence on depressive symptoms in adulthood. Finally, Section 6 discusses the implications and limitations of our work, as well as potential directions for future research.

2. STATISTICAL METHODS

2.1. Inverse probability of treatment weighting

We first review the basics of treatment effect estimation using propensity score weights. Our definition of the treatment effect is based on the potential outcomes framework, including the SUTVA assumption mentioned in the introduction.18,19 The SUTVA assumption has two components: 1) an individual's outcome is unaffected by the level of treatment assigned to another individual; 2) there is only one version of each treatment level. Under this assumption, each individual, indexed by subscript k, has two potential outcomes associated with a binary treatment: $Y_k(1)$ (potential outcome under treatment) and $Y_k(0)$ (potential outcome under the control condition). The individual treatment effect is defined as the difference between the two potential outcomes, $Y_k(1) - Y_k(0)$. Our target estimand is the average treatment effect (ATE) in the population, defined as the expected value of the individual treatment effects, $\text{ATE} = E[Y_k(1) - Y_k(0)]$.

We use propensity score weighting to obtain covariate balance between the treated and control groups in the full sample, thereby reducing bias in the overall ATE estimate. Specifically, given the model-estimated propensity score for individual k, $\hat{e}_k$ (we add the hat symbol to differentiate the estimated propensity score from the true propensity score, $e_k$), the inverse probability weight $\hat{w}_k = 1/\hat{e}_k$ is assigned if the individual is in the treated group and $\hat{w}_k = 1/(1-\hat{e}_k)$ otherwise. To prevent extreme weights, we use stabilized weights, which replace the 1 in the numerator with the marginal probability of being treated, $P(Z=1)$, for treated individuals ($\hat{w}_k = P(Z=1)/\hat{e}_k$) and the marginal probability of being in the control group, $P(Z=0)$, for control individuals ($\hat{w}_k = P(Z=0)/(1-\hat{e}_k)$). The ATE can then be estimated by contrasting the weighted means of the outcome between the two treatment groups:

$$\widehat{\text{ATE}} = \frac{\sum_k Z_k Y_k \hat{w}_k}{\sum_k Z_k \hat{w}_k} - \frac{\sum_k (1-Z_k) Y_k \hat{w}_k}{\sum_k (1-Z_k) \hat{w}_k},$$

where $Z_k$ and $Y_k$ denote the assigned treatment level (1 if treated and 0 otherwise) and the observed outcome under the assigned treatment, respectively. Because the weights are a direct function of the propensity scores, this inverse probability weighted (IPW) estimator is particularly sensitive to misspecifications of the propensity score model.20
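As a concrete illustration, the estimator above takes only a few lines of R. The sketch below is a minimal implementation under assumed inputs (a treatment indicator z, outcome y, and model-estimated propensity scores ps_hat); it is not code from the paper:

```r
# Stabilized IPW weights and the marginal ATE estimator.
# z: 0/1 treatment; y: outcome; ps_hat: estimated propensity scores.
ipw_ate <- function(z, y, ps_hat) {
  p_treat <- mean(z)                          # marginal P(Z = 1)
  w <- ifelse(z == 1,
              p_treat / ps_hat,               # stabilized weight, treated
              (1 - p_treat) / (1 - ps_hat))   # stabilized weight, control
  sum(z * y * w) / sum(z * w) -               # weighted mean, treated,
    sum((1 - z) * y * w) / sum((1 - z) * w)   # minus weighted mean, control
}
```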

A common metric of covariate balance is the absolute standardized mean difference (ASMD):

$$\text{ASMD} = \frac{\left|\bar{X}_1 - \bar{X}_0\right|}{s},$$

where $\bar{X}_1$ and $\bar{X}_0$ are the weighted sample means of a covariate X (or prevalences if X is binary) for the treated and control groups, respectively, and s denotes its standard deviation (SD; usually the pooled SD from the treated and control groups combined). A lower ASMD indicates better covariate balance, and an ASMD of at most 0.1 is generally considered to indicate adequate balance.21–23 We assess the usefulness of a propensity score model by calculating the ASMD of each covariate after the model-estimated propensity score weights are applied.
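For a single covariate, the ASMD can be computed directly. The sketch below assumes a covariate vector x, treatment indicator z, and estimated weights w; note that software packages differ in how they define the pooled SD (here, the unweighted average of the two group variances):

```r
# Weighted ASMD for one covariate after propensity score weighting.
asmd <- function(x, z, w) {
  m1 <- weighted.mean(x[z == 1], w[z == 1])           # weighted mean, treated
  m0 <- weighted.mean(x[z == 0], w[z == 0])           # weighted mean, control
  s  <- sqrt((var(x[z == 1]) + var(x[z == 0])) / 2)   # pooled (unweighted) SD
  abs(m1 - m0) / s
}
```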

To estimate the standard error of the IPW estimator, one may either use a robust (or "sandwich") estimator or perform bootstrapping. A robust standard error estimator is needed to account for the within-subject correlation induced by the weighting, which effectively replicates units.24 In clustered data settings, cluster robust standard errors should be used to account for within-cluster correlation induced by the clustered structure. We note that robust standard errors do not account for the uncertainty in estimating the propensity scores, and some have recommended using bootstrap standard errors for IPW estimators.7,25 Because bootstrapping would be computationally demanding in our simulations (especially with machine learning involved), we will treat the cluster robust standard error as an approximation of the variability of the treatment effect estimates.
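One convenient way to obtain cluster robust standard errors in R is the survey package (which we also use in the simulations below). This sketch assumes a data frame dat with columns y, z, cluster, and estimated stabilized weights w:

```r
library(survey)

# Treat clusters as primary sampling units so that SEs are cluster robust.
des <- svydesign(ids = ~cluster, weights = ~w, data = dat)
fit <- svyglm(y ~ z, design = des)  # coefficient on z is the weighted mean difference
summary(fit)  # SEs account for the weights and within-cluster correlation
```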

The outcome analysis approach described above accounts for the clustered nature of the data in standard error estimation through a sandwich estimator but does not include cluster indicators or other covariates directly. In practice, a “doubly robust” estimator that incorporates the covariates and the clustered structure in both the propensity score and outcome models is recommended (see Li et al7 for a doubly robust IPW estimator). Because the goal of our simulations is to compare different strategies for estimating propensity scores, we retain focus on the simple IPW estimator, which does not use a doubly robust outcome model, in order to isolate the performance with respect to propensity score estimation.

2.2. Propensity score estimation using nonparametric methods

For nonparametric estimation of the propensity scores, we introduce two approaches: generalized boosted modeling (GBM) and Bayesian additive regression trees (BART). GBM is a popular method for estimating propensity scores, in part because its covariate balancing ability has been studied extensively and computing tools have been developed in this regard. BART is easy to implement and has great predictive ability, but its use in propensity score estimation is less explored. The mathematical details of these methods are outside the scope of this paper; we therefore provide only brief introductions below. For both, we describe how they can be adapted to the clustered data setting.

Both methods are built on decision trees. A decision tree is a nonparametric way of partitioning the covariate space into disjoint sets such that the observations within each set, which corresponds to a node of the tree, are as similar as possible.26 When the outcome is a class (e.g., treated or untreated), a decision tree is often referred to as a classification tree, and observations falling in the same node of the tree have similar probabilities of class membership. An ensemble method, such as GBM, fits a series of decision trees, each to a random subset of the data, and makes a prediction by averaging the predictions of the different trees. The idea of an ensemble method is to combine the predictions of multiple weak classifiers (i.e., trees), each constrained by a shrinkage parameter to prevent overfitting, to improve prediction accuracy. GBM is an ensemble method in which, in each iteration of tree fitting, observations that were incorrectly classified by previous trees are given a higher weight of being selected for the new tree.27 Propensity score estimation using GBM was first proposed by McCaffrey et al14 and is commonly implemented with the R package twang, which uses an algorithm aimed at achieving optimal covariate balance (e.g., minimizing the mean of the Kolmogorov-Smirnov test statistics).28

Similar to GBM, BART is also a nonparametric ensemble model, introduced by Chipman et al.29 As a Bayesian approach, BART incorporates regularization priors for the model’s residual standard deviation, the tree structure (including tree depth and splitting rules), and the values in the terminal nodes conditional on the corresponding tree. Sampling from the posterior is done by a Bayesian backfitting Markov Chain Monte Carlo approach29,30; the predicted value can be taken as the average of predictions over many draws from the posterior. The Bayesian framework spares the computational effort of cross-validation in determining model hyperparameters such as maximum tree depth and shrinkage parameter, which is commonly done with non-Bayesian ensemble methods. Although BART was developed for continuous outcomes, it can easily be extended for classification of binary outcomes by the probit or logit transformation and thus can be used to estimate propensity scores (see, e.g., Hill et al31; Dorie et al32). Normally, the estimated propensity score is the average of the values over a default number of posterior draws set by the specific statistical package. Chipman et al29 have demonstrated that BART outperforms several popular machine learning techniques, including GBM, random forest, and neural networks, in terms of in- and out-of-sample predictive performance. In our analysis, though, we are more interested in BART’s covariate balancing (rather than predictive) ability, which is the aim of propensity score methods.

To account for the clustered structure in our nonparametric propensity score models, indicators for cluster membership can be included in GBM and BART, which is analogous to fitting a parametric regression model with fixed cluster effects, although not all cluster indicators may end up being used by the nonparametric models. An appealing feature of BART is that random intercepts can easily be added to the model, with implementations in available statistical software,29,33 whereas GBM with random effects has not been fully developed. We therefore consider BART with additive random intercepts as a nonparametric counterpart of the logistic regression model with random cluster intercepts.
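The sketch below illustrates how these cluster-aware nonparametric models could be fit in R, following the packages' documented interfaces (pbart in BART, rbart_vi in dbarts, and ps in twang). The variable names (dat, x1–x6, v1, v2, cluster) are assumptions, and exact extractor names may differ across package versions:

```r
# Sketch: cluster-aware nonparametric propensity score models, assuming a
# data frame `dat` with 0/1 treatment z, confounders x1-x6, v1, v2, and a
# cluster identifier. Package defaults are used, as in the simulations.
library(BART)    # probit BART (pbart)
library(dbarts)  # BART with random intercepts (rbart_vi)
library(twang)   # GBM for propensity scores (ps)

# BART-FE: probit BART with cluster indicator columns appended
x_fe   <- cbind(model.matrix(~ x1 + x2 + x3 + x4 + x5 + x6 + v1 + v2 - 1, dat),
                model.matrix(~ factor(cluster) - 1, dat))
fit_fe <- pbart(x.train = x_fe, y.train = dat$z)
ps_fe  <- fit_fe$prob.train.mean   # posterior mean of P(Z = 1 | X, cluster)

# BART-RE: probit BART with additive random cluster intercepts
fit_re <- rbart_vi(z ~ x1 + x2 + x3 + x4 + x5 + x6 + v1 + v2, dat,
                   group.by = dat$cluster)
ps_re  <- fitted(fit_re)           # fitted values on the probability scale

# GBM-FE: twang's ps() with the cluster entered as a factor
fit_gbm <- ps(z ~ x1 + x2 + x3 + x4 + x5 + x6 + v1 + v2 + factor(cluster),
              data = dat, estimand = "ATE", stop.method = "ks.mean",
              verbose = FALSE)
w_gbm <- get.weights(fit_gbm, stop.method = "ks.mean")  # balance-optimized weights
```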

Standard errors of the IPW estimator with weights estimated from GBM and BART can be obtained via nonparametric bootstrapping, although further research is needed to confirm its validity in the context of propensity score estimation.34 As mentioned in section 2.1, due to the computational burden required to perform bootstrapping with GBM and BART, we estimate standard errors using a cluster adjustment instead, and compare them with empirical standard errors.

3. SIMULATION STUDY

3.1. Setup

Our simulation setup is motivated by that in Setoguchi et al15 and Lee et al,16 with extensions to a two-level clustered data structure where a binary treatment is administered at the individual level. Given this two-level structure, we use h to index clusters ($h = 1, 2, \dots, H$, where H is the number of clusters in the simulated data set) and k to index individuals within a cluster ($k = 1, 2, \dots, n_h$, where $n_h$ is the number of individuals in cluster h). The sample size for a given simulated data set is denoted as $N = \sum_{h=1}^{H} n_h$. We consider three cases: 1) a small number of large clusters ($H = 20$, $200 \le n_h \le 500$ for $h = 1, 2, \dots, 20$); 2) a large number of small clusters ($H = 100$, $n_h = 50$ for $h = 1, 2, \dots, 100$); 3) a small number of medium-sized clusters ($H = 20$, $n_h = 100$ for $h = 1, 2, \dots, 20$).

For each simulated data set under each case, six individual-level confounders ($X_i$, $i = 1, 2, \dots, 6$), two cluster-level confounders ($V_j$, $j = 1, 2$), and an unmeasured cluster-level confounder (U; unmeasured in the sense that it is excluded from both the propensity score models and the outcome analyses) are independently generated from standard normal distributions. Four of the confounders ($X_4$, $X_5$, $X_6$, $V_2$) are subsequently dichotomized, taking value 1 if the original value is greater than or equal to 0, and 0 otherwise.

The treatment probability $e_{hk}$, i.e., the true propensity score, for individual k in cluster h is generated from the following random effects logistic regression model, which is a function of individual- and cluster-level covariates, including U:

$$\operatorname{logit}(e^*_{hk}) = f(X_{1,hk}, X_{2,hk}, \dots, X_{6,hk}, V_{1,h}, V_{2,h}, U_h) + \beta_{0,h}, \quad \beta_{0,h} \sim N(0,1),$$

with further adjustment $e_{hk} = 0.7\,e^*_{hk} + 0.15$ to ensure that each cluster has an adequate number of individuals assigned to each treatment level. The specification of the function f in the propensity score generating model varies across scenarios that are described below and further detailed in the supplementary material. The treatment assignment $Z_{hk}$ is randomly sampled from a Bernoulli distribution with probability $e_{hk}$; we denote $Z_{hk} = 1$ as assignment to the treated group and $Z_{hk} = 0$ as assignment to the control group.

The continuous outcome Yhk is generated from the following random effects linear regression model (the coefficients are provided in the supplementary material):

$$Y_{hk} = \alpha_{0,h} + \alpha_1 X_{1,hk} + \alpha_2 X_{2,hk} + \cdots + \alpha_6 X_{6,hk} + \alpha_7 V_{1,h} + \alpha_8 V_{2,h} + \alpha_9 U_h + \tau_h Z_{hk} + \delta Z_{hk} U_h^2 + \varepsilon_{hk},$$
$$\alpha_{0,h} \sim N(0,1), \quad \tau_h \sim N(0,1), \quad \varepsilon_{hk} \sim N(0, 0.1).$$

The interaction term between treatment assignment and the square of U in the outcome model allows non-linear treatment effects in relation to U. We set $\delta = 2$ and $\alpha_9 = 3$. The value for $\alpha_9$ is purposefully chosen to be relatively large to magnify the issue of unmeasured confounding.

Similar to the setup in Setoguchi et al15 and Lee et al,16 we consider seven propensity score generating models (scenarios A-G) that differ in degrees of non-linearity or non-additivity (details in the supplementary material). The functional form of the propensity score generating model in each of the seven scenarios has the following components:

  • A: Main effects of $X_1, \dots, X_6$, $V_1$, $V_2$, and $U$

  • B: Main effects plus three two-way interaction terms between observed confounders ($X_1X_4$, $X_3V_2$, $X_5V_2$)

  • C: Main effects plus six two-way interaction terms between observed confounders ($X_1X_4$, $X_3V_2$, $X_5V_2$, $X_2X_5$, $X_4X_6$, $X_6V_2$)

  • D: Main effects plus three two-way interaction terms between U and observed confounders ($X_1U$, $X_4U$, $X_5U$)

  • E: Main effects plus six two-way interaction terms between U and observed confounders ($X_1U$, $X_2U$, $X_4U$, $X_5U$, $X_6U$, $V_2U$)

  • F: Main effects plus two cubic terms ($X_1^3$, $V_1^3$)

  • G: Main effects plus four cubic terms ($X_1^3$, $X_2^3$, $X_3^3$, $V_1^3$)

In reality, the functional form of the propensity score generating model is unknown. Scenarios D and E are included to investigate the performance of propensity score models when U serves both as an unmeasured confounder and as a key source of variation in the propensity score. A logistic regression model assuming linear and additive associations between the treatment and confounders (i.e., including only main effects) is misspecified in scenarios B to G. We therefore expect the nonparametric propensity score models to produce less biased effect estimates than the main-effects-only logistic regression models at least in scenarios B, C, F, and G, in which the nonparametric models have more flexibility to detect non-linear or non-additive associations between the treatment and observed confounders. For each of the seven propensity score scenarios, 1000 data sets are independently generated. All simulations are performed using R version 4.0.2.35
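To make the setup concrete, the following R sketch generates one data set under case 1 and scenario A. The coefficients inside f and in the outcome model are illustrative stand-ins; the actual values are given in the supplementary material:

```r
# A minimal sketch of the data-generating process (case 1, scenario A).
set.seed(1)
H  <- 20
nh <- sample(200:500, H, replace = TRUE)   # cluster sizes for case 1
id <- rep(seq_len(H), nh)                  # cluster membership
N  <- sum(nh)

X <- matrix(rnorm(N * 6), N, 6)            # individual-level confounders X1-X6
V <- matrix(rnorm(H * 2), H, 2)[id, ]      # cluster-level confounders V1, V2
U <- rnorm(H)[id]                          # unmeasured cluster-level confounder
X[, 4:6] <- (X[, 4:6] >= 0) + 0            # dichotomize X4-X6
V[, 2]   <- (V[, 2] >= 0) + 0              # dichotomize V2

b0     <- rnorm(H)[id]                     # random intercepts, beta_0h ~ N(0, 1)
f      <- 0.4 * rowSums(X) + 0.4 * rowSums(V) + 0.4 * U  # scenario A: main effects
                                           # (0.4 is an illustrative coefficient)
e_star <- plogis(f + b0)
e      <- 0.7 * e_star + 0.15              # keep both arms represented per cluster
z      <- rbinom(N, 1, e)

a0  <- rnorm(H)[id]                        # alpha_0h ~ N(0, 1)
tau <- rnorm(H)[id]                        # cluster-specific effects, tau_h ~ N(0, 1)
y   <- a0 + rowSums(X) + rowSums(V) +      # illustrative alpha_1 = ... = alpha_8 = 1
  3 * U + tau * z + 2 * z * U^2 +          # alpha_9 = 3, delta = 2 (as in the text)
  rnorm(N, 0, sqrt(0.1))                   # eps ~ N(0, 0.1), reading 0.1 as a variance

dat <- data.frame(y, z, X, V, U, cluster = id)
names(dat)[3:10] <- c(paste0("x", 1:6), "v1", "v2")
```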

3.2. Methods compared

Three general modeling approaches are used to estimate propensity scores: logistic regression, BART, and GBM. With each approach, we consider versions that either ignore or incorporate cluster information (through either fixed or random cluster effects). None of the analyses has access to the unmeasured cluster-level confounder U. The specific methods are:

  • Logistic regression model (abbreviated as PARAM): single-level logistic regression with a main effect for each observed confounder.

  • Logistic regression model with fixed cluster effects (PARAM-FE): logistic regression with a main effect for each observed confounder and a fixed intercept for each cluster.

  • Logistic regression model with random cluster effects (PARAM-RE): logistic regression with a main effect for each observed confounder and random cluster intercepts.

  • Probit BART ignoring clusters (BART): BART model with probit link is implemented using the pbart function in the R package BART with default settings.36 Although the logit version of BART is also available in the BART package, we opt for probit BART due to its computational efficiency.

  • Probit BART with cluster indicators (BART-FE): Same as above, except that indicator variables for clusters are used as predictors in addition to the observed confounders.

  • Probit BART with random effects (BART-RE): BART model with probit link and additive random intercepts is implemented using the rbart function in the R package dbarts with default settings.33

  • GBM ignoring clusters (GBM): Propensity score estimation using GBM is implemented using the ps function in the R package twang with default settings.28

  • GBM with cluster indicators (GBM-FE): Same as above, except that indicator variables for clusters are added to the model.

For each method, we weight individuals by their estimated stabilized propensity score weights, and estimate the ATE using the marginal IPW estimator (as described in section 2.1).
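For reference, the three parametric models can be fit with base R and lme4; the sketch below assumes a simulated data frame dat as in the data-generating sketch above (z, x1–x6, v1, v2, cluster):

```r
library(lme4)

rhs <- z ~ x1 + x2 + x3 + x4 + x5 + x6 + v1 + v2
fit_param    <- glm(rhs, family = binomial, data = dat)          # PARAM
fit_param_fe <- glm(update(rhs, . ~ . + factor(cluster)),
                    family = binomial, data = dat)               # PARAM-FE
fit_param_re <- glmer(update(rhs, . ~ . + (1 | cluster)),
                      family = binomial, data = dat)             # PARAM-RE

# Fitted propensity scores, then stabilized weights as in Section 2.1
ps_hat <- fitted(fit_param_re)
w <- ifelse(dat$z == 1, mean(dat$z) / ps_hat,
            (1 - mean(dat$z)) / (1 - ps_hat))
```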

3.3. Performance criteria

To evaluate the performance of the different propensity score estimation methods, we consider the following measures:

  • Absolute standardized mean difference (ASMD): a measure of covariate balance. In each of the 1000 simulations, we obtain the post-weighting absolute standardized difference of means between the treated and control groups for each individual-level covariate, using the R packages survey37 to apply the estimated weights and tableone38 to calculate the ASMD. The average ASMD is then taken across all six individual-level covariates. In the following sections, we refer to this average as "ASMD" for simplicity (i.e., "ASMD" refers to the average ASMD taken across covariates in the same category, e.g., individual-level covariates), and refer to the mean of the ASMDs over 1000 simulations as "mean ASMD". We do the same for the observed cluster-level covariates and the unobserved cluster-level covariate. The ASMD prior to propensity score weighting is also calculated to assess the initial covariate balance.

  • Bias: Both the difference between the estimated and true ATEs, $\widehat{\text{ATE}} - \text{ATE}$, and the absolute percentage difference from the true ATE, $\left|\frac{\widehat{\text{ATE}} - \text{ATE}}{\text{ATE}}\right|$, are calculated.

  • Cluster robust standard error: The ATE estimate and cluster robust standard error are obtained using the survey package.37

  • 95% confidence interval coverage: In each simulation, the estimated 95% confidence interval is based on the cluster robust standard error. The 95% confidence interval coverage is the percentage of the 1000 estimated 95% confidence intervals that cover the true ATE.

  • Weights: the distribution of the estimated stabilized propensity score weights for control individuals, $P(Z=0)/(1-\hat{e}_{hk})$. Of particular interest is the proportion of extreme weights, which may result in bias and large variance.
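The sketch below illustrates how these criteria might be computed in R: post-weighting ASMDs via survey and tableone (as in the text), followed by bias and coverage summaries across replicates. Object names (dat, w, res, ate_true) are placeholders:

```r
library(survey); library(tableone)

# Per-replicate balance: weighted ASMDs of the individual-level covariates.
des <- svydesign(ids = ~cluster, weights = ~w, data = dat)
tab <- svyCreateTableOne(vars = c("x1", "x2", "x3", "x4", "x5", "x6"),
                         strata = "z", data = des, test = FALSE)
asmd_x <- mean(ExtractSmd(tab))   # average post-weighting ASMD over X1-X6

# Across 1000 replicates: `res` holds est (ATE estimate) and se (cluster
# robust SE) for each replicate; ate_true is the true ATE.
bias      <- mean(res$est - ate_true)
abs_pbias <- 100 * mean(abs((res$est - ate_true) / ate_true))
covered   <- (res$est - 1.96 * res$se <= ate_true) &
             (ate_true <= res$est + 1.96 * res$se)
coverage  <- 100 * mean(covered)  # 95% CI coverage (%)
```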

4. RESULTS

Simulation results from case 1 ($H = 20$ and $200 \le n_h \le 500$)

Table 1 shows the initial covariate balance in each propensity score scenario, with the mean ASMDs all falling between 0.2 and 0.8, indicating substantial imbalance before propensity score adjustment.

Table 1. Pre-weighting absolute standardized mean difference (ASMD) averaged over 1000 simulations in case 1 ($H = 20$, $200 \le n_h \le 500$).

                    Scenario
              A     B     C     D     E     F     G
ASMD (X)*    0.29  0.31  0.32  0.26  0.23  0.28  0.34
ASMD (V)**   0.26  0.24  0.24  0.26  0.26  0.36  0.26
ASMD (U)***  0.44  0.39  0.35  0.55  0.72  0.29  0.20

* Average ASMD of the six individual-level covariates ($X_1, X_2, \dots, X_6$).
** Average ASMD of the two observed cluster-level covariates ($V_1$, $V_2$).
*** ASMD of the unobserved cluster-level covariate (U).

In the true propensity score model: (A) main effects only; (B) three two-way interaction terms between observed confounders; (C) six two-way interaction terms between observed confounders; (D) three two-way interaction terms between U and observed confounders; (E) six two-way interaction terms between U and observed confounders; (F) two cubic terms; (G) four cubic terms.

Propensity score weighting using propensity scores estimated from nonparametric models generally produced excellent balance of the observed covariates (X and V) across the seven scenarios (top and middle panels of Figure 1), though the covariate balancing performance (with respect to X and V) of BART and BART-FE declined slightly in scenario G (where the true propensity score model included several cubic terms) with a number of ASMD values greater than 0.1 (Supplementary Figures 1 and 2). When propensity scores were estimated using parametric models, the mean ASMDs of the observed covariates following weighting were mostly acceptable (e.g., the mean ASMD of X ranged from 0.05 in scenario E to 0.15 in scenario G for PARAM-FE; the mean ASMD of V ranged from 0.03 in scenario E to 0.13 in scenario F for PARAM), but large outliers (ASMD≥0.15) were observed across most scenarios, especially in scenarios F and G (where the true propensity score model included cubic terms), with the performance of PARAM being particularly poor (Supplementary Figures 1 and 2).

FIGURE 1. Post-weighting absolute standardized mean difference (ASMD) averaged over 1000 simulations by propensity score estimation model in each of seven propensity score scenarios (A: main effects only; B and C: interactions between observed covariates; D and E: interactions with an unobserved cluster-level covariate; F and G: cubic terms) in case 1 (20 clusters of size 200–500). ASMD (X) is the average ASMD of six individual-level covariates; ASMD (V) is the average ASMD of two observed cluster-level covariates; and ASMD (U) is the ASMD of an unobserved cluster-level covariate.

As to the unmeasured cluster-level covariate (U), estimating propensity scores using BART-FE yielded the smallest mean ASMD in scenarios A to F, while GBM-FE yielded the smallest mean ASMD in scenario G (bottom panel of Figure 1). With the clustered structure ignored, GBM on average produced worse balance on U than the other nonparametric models in all scenarios, and performed even worse than PARAM-FE and PARAM-RE in scenarios D and E, where the true propensity score model included interaction terms involving U (Supplementary Figure 3). However, the mean and distribution of the ASMDs of U for BART, which also ignored clustering, were similar to those for nonparametric models that accounted for the clustered structure (Supplementary Figure 3). As expected, PARAM failed to balance U, which remained substantially imbalanced with the mean ASMD of U ranging from 0.14 (in scenario G) to 0.58 (in scenario E) across the seven scenarios.

In terms of the ATE estimates, the nonparametric propensity score models outperformed the parametric models with smaller mean absolute biases in all scenarios (top right of Figure 2). Among the nonparametric models, GBM yielded larger mean biases than GBM-FE and the BART-based models except in scenario G, where the covariate balancing performance of BART and BART-FE with respect to the observed covariates declined slightly. The parametric propensity score models performed unsatisfactorily with large absolute biases (>40%) across all seven scenarios. Although PARAM exhibited less bias on average than PARAM-FE and PARAM-RE in many scenarios (top left and top right of Figure 2), the spread of the biases (or absolute biases) over 1000 simulations for PARAM was larger and more extreme values were observed (Supplementary Figures 4 and 5).

FIGURE 2. Results from case 1 (20 clusters of size 200–500). Bias (estimated minus true average treatment effect; top left), absolute bias (%; top right), and cluster robust SE (bottom left) averaged over 1000 simulations by propensity score estimation model in each of seven propensity score scenarios (A: main effects only; B and C: interactions between observed covariates; D and E: interactions with an unobserved cluster-level covariate; F and G: cubic terms). Bottom right: 95% confidence interval coverage (percentage of 1000 estimated 95% confidence intervals that cover the true average treatment effect).

The cluster robust standard errors did not differ greatly across methods, except for PARAM producing noticeably larger standard errors (bottom left of Figure 2). In addition, both PARAM-FE and PARAM-RE had larger standard errors on average than the nonparametric models in scenarios F and G.

The nonparametric propensity score models had higher 95% CI coverage rates than PARAM-FE and PARAM-RE in all scenarios (bottom right of Figure 2). For example, PARAM-FE and PARAM-RE had 54.9% and 58.7% coverage rates, respectively, in scenario F, and only 43.3% and 48.5% coverage rates, respectively, in scenario G. PARAM, meanwhile, had a decent coverage rate in all scenarios, ranging from 74.4% in scenario G to 93.1% in scenario A; note, however, that this coverage is likely attributable to its large standard errors, given the substantial bias it produced.

While bias increased with increasing non-additivity or non-linearity in scenarios B, C, F, and G for PARAM-FE and PARAM-RE, in scenarios D and E we saw slight improvements in both bias and coverage rate as the non-additivity involving U increased, which may be a result of U being a continuous variable. To explore this we repeated the simulations but with U dichotomized, thus increasing non-smoothness in the response surface – as expected, we found increasing absolute bias and decreasing coverage rate with increasing non-additivity involving U for PARAM-FE and PARAM-RE (results not shown).

Overall, parametric propensity score models produced a greater number of extreme weights than the nonparametric models (Supplementary Figure 6). For example, in scenario G, the parametric models produced several stabilized weights greater than 50; the proportion of stabilized weights greater than 5 for untreated subjects from 10 random simulated data sets was approximately 2.4% for the parametric models and <1.5% for the nonparametric models.

Simulation results from case 2 ($H = 100$ and $n_h = 50$)

In a setting with more clusters but each of smaller size, we observed fewer benefits of the nonparametric propensity score models. Weights estimated from the BART-based models and GBM provided considerably better balance on the observed covariates (X and V) than those from the parametric models in all scenarios other than D and E, in which the differences in the mean and spread of ASMDs across methods were smaller (top and middle panels of Figure 3, Supplementary Figures 7 and 8). Among the nonparametric models, the performance of GBM-FE was relatively poor: its mean ASMDs of X and of V were consistently larger than those of the other nonparametric models, and it provided better balance on X and V than the parametric models only in scenarios F and G. On the other hand, for the unobserved cluster-level covariate (U), weights generated using PARAM-FE, PARAM-RE, and GBM-FE provided better balance, especially in scenarios D and E (bottom panel of Figure 3; Supplementary Figure 9). U remained largely imbalanced when using PARAM, BART, BART-RE, and GBM in several scenarios, and slightly imbalanced for BART-FE. For example, the mean ASMD of U for BART ranged from 0.13 (in scenario G) to 0.48 (in scenario E) across the seven scenarios, while the mean ASMD of U for BART-FE ranged from 0.07 (in scenario G) to 0.23 (in scenario E).

FIGURE 3. Post-weighting absolute standardized mean difference (ASMD) averaged over 1000 simulations by propensity score estimation model in each of seven propensity score scenarios (A: main effects only; B and C: interactions between observed covariates; D and E: interactions with an unobserved cluster-level covariate; F and G: cubic terms) in case 2 (100 clusters of size 50). ASMD (X) is the average ASMD of six individual-level covariates; ASMD (V) is the average ASMD of two observed cluster-level covariates; and ASMD (U) is the ASMD of an unobserved cluster-level covariate.

With regard to bias, the nonparametric propensity score models as a whole outperformed the parametric ones in scenarios F and G only, with the BART-based models yielding the least bias on average under these scenarios (top left and top right of Figure 4; Supplementary Figures 10 and 11). In scenarios D and E, PARAM-RE had the smallest mean bias, which may relate to its ability to balance U.

FIGURE 4. Results from case 2 (100 clusters of size 50). Bias (estimated minus true average treatment effect; top left), absolute bias (%; top right), and cluster robust SE (bottom left) averaged over 1000 simulations by propensity score estimation model in each of seven propensity score scenarios (A: main effects only; B and C: interactions between observed covariates; D and E: interactions with an unobserved cluster-level covariate; F and G: cubic terms). Bottom right: 95% confidence interval coverage (percentage of 1000 estimated 95% confidence intervals that cover the true average treatment effect).

As seen in the previous case, the cluster robust standard errors did not differ greatly across methods, with the exception of PARAM yielding substantially larger standard errors (bottom left of Figure 4). The 95% coverage rates varied greatly, both across models and across scenarios (bottom right of Figure 4). The consistently low coverage rates of GBM-FE were likely the combined result of large bias and small standard errors; PARAM had a higher coverage rate than GBM-FE despite having larger bias in some scenarios, possibly due to its large standard errors. As in the previous case, the parametric models were more likely to produce extreme weights than the nonparametric models in all scenarios (Supplementary Figure 12).

Simulation results from case 3 ($H = 20$ and $n_h = 100$)

Finally, we provide a brief summary of results from the case with a small number of clusters (as in case 1) but where each cluster is smaller. The resulting figures are presented in the supplementary material. With regard to the observed covariates (X and V), the BART-based propensity score models yielded the smallest mean ASMD across the seven scenarios, and the GBM-based models yielded better balance than the parametric models in all scenarios except D and E (Supplementary Figure 13). Similar to case 2 ($H = 100$ and $n_h = 50$), PARAM-FE, PARAM-RE, BART-FE, and GBM-FE appeared to be better at balancing the unobserved cluster-level covariate U. BART-based models yielded the least mean bias across the seven scenarios, and PARAM-RE had comparably small biases in scenarios D and E due to its covariate balancing performance on U (top left and top right of Supplementary Figure 14). As seen in previous cases, the nonparametric models outperformed the parametric models in terms of bias when the true propensity score model included cubic terms of the observed covariates (scenarios F and G); they also produced fewer extreme weights than the parametric models in all scenarios (Supplementary Figure 15).

Assessing the standard error estimation

To assess the validity of the cluster robust standard errors, we compared them to the empirical standard errors. The two were generally similar – the ratio of the cluster robust standard error to the empirical standard error, averaged across all propensity score scenarios and methods, was 0.96, 1.01, and 0.98 in cases 1, 2, and 3, respectively (comparisons by method and by scenario are presented in Supplementary Table 1). Future work may further explore the performance of cluster robust standard errors – and other variance estimation strategies – under different scenarios and propensity score estimation methods.

5. APPLICATION

As an illustration, we apply the propensity score estimation methods used in the simulation study to the public-use data sets of the National Longitudinal Study of Adolescent to Adult Health (Add Health). A nationally representative sample of U.S. adolescents who participated in Add Health was followed into adulthood – the first wave was conducted during the 1994–1995 school year, when the respondents were in grades 7 through 12; the fourth and most recent wave was conducted in 2008, when the respondents were aged 24–32.17

Our application is based on the study by Easterlin et al,39 which used the Add Health data to investigate the association of team sports participation during adolescence with adult mental health outcomes among individuals exposed to adverse childhood experiences. For the purpose of demonstration, we use the wave 1 and wave 4 public-use data sets of Add Health to estimate the effect of team sports participation during adolescence on adulthood depressive symptoms. The Add Health public-use data sets contain limited survey data for a subset of the full Add Health sample and are available for access by the general public. The wave 1 and wave 4 public-use data sets contain data for 6,504 and 5,114 respondents, respectively, from 132 schools. To avoid convergence issues with small clusters, we restrict our analysis to the 10 largest schools, resulting in an analytic sample of 617 respondents with the school sizes ranging from 51 to 95 students.

As in Easterlin et al,39 the “treatment” is defined as whether respondents participated in at least one team sport during adolescence, which was captured by the wave 1 in-school questionnaire. Our outcome of interest is the total score on the 10-item subscale of the Center for Epidemiologic Studies Depression scale (CES-D-10) in the wave 4 survey, ranging from 0 to 25 in our analytic sample (the maximum possible score is 30).

We select six individual-level covariates based on components of the propensity score in Easterlin et al39: sex (female and male), race (White, Black, Native American/Indian, Asian, and other), ethnicity (Hispanic and non-Hispanic), parental education (a number between 0–8 where higher values indicate higher education attainment of whichever parent has the higher education level; education level of the mother is used if that of the father is missing, and vice versa), whether the respondent lived in an urban area, and neighborhood connectedness (0–2, the sum of responses to the questions “People in this neighborhood look out for one another” and “Do you usually feel safe in your neighborhood?”40; a positive response is coded as 1 and negative response as 0). These covariates are obtained from the wave 1 survey. We also calculate the respondents’ total scores on the Feelings Scale in the wave 1 survey, which mostly consists of items from CES-D (range in analytic sample: 0–38; maximum possible score: 57). School characteristics such as school size and region were also included in the propensity score model in Easterlin et al.39 However, school information is not available in the Add Health public-use data files. Therefore, only individual-level covariates are included in our propensity score models.

We estimate the propensity scores using the eight propensity score models listed in section 3.2. Components of the propensity score models include the aforementioned individual-level covariates, score on the wave 1 Feelings Scale, and school indicators for models with fixed or random cluster effects. We assume that participation in team sports does not affect responses to the Feelings Scale during wave 1, but note that this assumption should be carefully validated if the goal is to make substantive conclusions. The average treatment effect of team sports participation during wave 1 on CES-D-10 score during wave 4 is estimated via inverse probability of treatment weighting.

The left plot of Figure 5 shows the covariate balance of each individual-level covariate before and after weighting. All models yield decent balance (ASMD<0.1) on the individual-level covariates with a few minor exceptions (e.g., the ASMD of parental education from BART-RE is 0.11). Given the relatively small sample and moderate cluster sizes in this example, the covariate balancing performance of the nonparametric models may be more affected by unmeasured cluster-level confounding and potential cross-level interactions compared to the parametric models (similar to scenarios D and E in the simulation study). The right plot of Figure 5 shows the balance on school membership before and after weighting. Within each method, models that include fixed cluster effects (i.e., PARAM-FE, BART-FE, GBM-FE) tend to produce better balance on the school indicators, followed by models with random cluster effects (i.e., PARAM-RE and BART-RE).

FIGURE 5. Covariate balance of the individual-level covariates (left) and school indicators (right) before and after propensity score weighting by eight propensity score estimation models (PARAM: logistic regression; PARAM-FE: logistic regression with fixed cluster effects; PARAM-RE: logistic regression with random cluster effects; BART: Bayesian additive regression trees; BART-FE: BART with cluster indicators; BART-RE: BART with random effects; GBM: generalized boosted model; GBM-FE: GBM with cluster indicators).

All models yield broadly similar estimates of the average treatment effect (0.30–0.67) and suggest that team sports participation during adolescence may not have an impact on adulthood depressive symptoms in the general U.S. population (Figure 6). The intracluster correlation coefficient of the CES-D score was small (approximately 0.02), which may partly explain why the methods do not differ greatly. We note that the purpose of this application is to demonstrate the use of different propensity score estimation models on real data rather than to draw substantive conclusions. The unavailability of the complete Add Health sample and survey data, as well as unmeasured confounding, may hinder us from obtaining valid causal effect estimates.

FIGURE 6. Average treatment effect (ATE) estimate, with 95% confidence interval, of team sports participation during adolescence on CES-D-10 score during adulthood by propensity score estimation model (PARAM: logistic regression; PARAM-FE: logistic regression with fixed cluster effects; PARAM-RE: logistic regression with random cluster effects; BART: Bayesian additive regression trees; BART-FE: BART with cluster indicators; BART-RE: BART with random effects; GBM: generalized boosted model; GBM-FE: GBM with cluster indicators).

6. DISCUSSION

Our simulation study extends the findings of Lee et al16 to clustered data settings, supporting the potential usefulness of nonparametric machine learning techniques in improving propensity score weighting for causal inference. However, we also show that nonparametric propensity scores may lose their advantage in certain settings – in particular, when cluster sizes are not sufficiently larger than the number of clusters and there is strong unmeasured cluster-level confounding, propensity score estimation using BART or GBM should be approached with caution.

The goal of propensity score weighting is to make the treated and control groups as similar as possible with respect to pre-treatment covariates in order to reduce bias in the treatment effect estimate. However, it is essentially impossible to capture the full set of confounders in practice. Our simulation study thus assumes that an unobserved confounder exists at the cluster level, and for both parametric and nonparametric approaches, we consider models that either account for or ignore the clustered structure. Regardless of the approach used, our results show the need to incorporate the clustered structure (through either fixed or random cluster effects) in order to provide better balance on the unobserved cluster-level covariate. While the comparison between fixed versus random effects is not our current focus in this paper, we note that the choice between fixed and random effects should be based not only on the data at hand, but also on the propensity score estimation approach (e.g., logistic regression or BART).

Although the simple IPW estimator is used given the focus on propensity score estimation in this paper, incorporating the clustered structure in both the propensity score and outcome models is recommended in practice.7 This can be done by fitting a fixed or random cluster effects outcome model on the weighted dataset or implementing the augmented IPW estimator with outcome models that account for the clustering. To explore the benefits of such an approach, we additionally fit a fixed effects outcome model with a main effect for each observed covariate and found substantial improvements for all propensity score estimation strategies in all scenarios (bias plots are presented in the supplementary material, Supplementary Figure 16). The bias reduction resulted from having the outcome model correctly specified (in addition to, or instead of, the propensity score model) and from accounting for the cluster information.

In our simulation setting with 20 clusters of sizes 200 to 500 (case 1), we found that BART models with random intercepts (BART-RE) provided excellent covariate balance of the observed covariates regardless of the number of interactions or cubic terms in the true propensity score model, while BART and GBM models that included cluster indicators (BART-FE and GBM-FE) were slightly better at balancing the unobserved cluster-level covariate. Further, BART provided decent covariate balance for both the observed and the unobserved cluster-level covariates even with the clustered structure being ignored, whereas GBM without cluster indicators fell short of balancing the unobserved cluster-level covariate. Both the BART-based models (BART, BART-FE, and BART-RE) and GBM with cluster indicators (GBM-FE) provided better balance for all types of covariates than the parametric models on average; this was found not only in scenarios where the logistic regression models were misspecified, but also in the scenario where the true propensity score model was both linear and additive. These findings suggest that (in cases of large cluster and sample sizes):

  • BART and GBM models that account for clustering may be better alternatives to parametric models for propensity score estimation;

  • including cluster indicators in nonparametric propensity score models is recommended when a strong degree of unmeasured cluster-level confounding is likely;

  • BART models with random intercepts may be useful when there is presumed to be little unmeasured cluster-level confounding and the balance of observed covariates is to be prioritized.

Note that the above suggestions may pertain only to cases where the sample and cluster sizes are large (e.g., 20 clusters of sizes 200 to 500). In our simulation setting with 100 clusters of size 50 (case 2), nonparametric propensity score models – especially those without cluster indicators – failed to provide adequate balance of the unobserved cluster-level covariate U, which led to increased bias, especially in scenarios D and E, where U interacted with multiple covariates in the true propensity score model. A possible explanation is that because the nonparametric approaches do not force the cluster structure into the model even when clustering is considered, the unobserved cluster-level covariate is not prioritized when the clusters are small and provide little information for the nonparametric models to detect its importance, while more effort is spent on handling the observed covariates. We further examined the amount of confounding caused by U by including U in the BART model with random effects (i.e., BART-RE) – the performance of the model improved substantially, with the coverage rate ranging from 91.8% to 96.3% across the seven propensity score scenarios in case 2. The same issue arose when there were 20 clusters with a smaller sample size of 100 units per cluster, though to a lesser extent (e.g., the average ASMD of U for BART-RE over the seven scenarios was 0.248 in the case of 100 clusters of size 50 and 0.118 in the case of 20 clusters of size 100). However, with regard to the observed covariates, the covariate balancing performance of the BART-based models was generally better than that of the parametric models whenever there was no cross-level interaction involving U, in all clustering settings; this held even when the parametric model was correctly specified (per results not presented in this paper) and was likely related to the nonparametric models' ability to pick up chance imbalances. These results suggest that BART-based models may improve propensity score weighting under a variety of settings if unmeasured cluster-level confounding is minimal.

In our study, only two nonparametric approaches for estimating propensity scores are examined. Other methods such as random forest and neural networks may offer additional insight into nonparametric propensity score estimation in a multilevel context. We note that an ensemble machine learning algorithm called Super Learner has been developed as a method to automatically select among a “library” of candidate models via cross-validation in order to build an optimal model for a given setting; hence, Super Learner has the potential advantage of combining the strengths of a variety of machine learning strategies.41 It has been shown that estimating propensity scores using Super Learner can improve covariate balance and reduce bias when the main-effects logistic regression model is severely misspecified.34 Evaluation of the effectiveness of Super Learner for propensity score estimation under multilevel contexts could be an avenue for future research.

Our simulation design assumes that all covariates entered in the propensity score estimation model are related to both the treatment and the outcome. However, investigators generally do not know the actual set of confounders and may include redundant covariates in the propensity score estimation model. It has been shown that adding irrelevant covariates to GBM may lead to increased covariate imbalance and bias in the treatment effect estimates.42 BART appears to be effective at detecting important predictors when irrelevant ones are added, but its effectiveness in the context of propensity score estimation is unknown.29 Future work may assess the performance of GBM and BART compared to parametric propensity score modeling under more realistic scenarios where irrelevant covariates are included in addition to unmeasured cluster-level confounding.

Lastly, we point out a few properties regarding the implementation of GBM and BART. We used the default parameter settings of the R packages for BART (BART and dbarts) and GBM (twang) in our simulation experiment, but the performance of these machine learning algorithms may be enhanced by parameter tuning. As mentioned in Section 2.2, an advantage of BART is that it requires only minimal assumptions regarding the model parameters, as prior distributions are placed over the tree models; BART is highly robust to small changes in the priors and to the choice of the number of trees, and the defaults are usually adequate.29 Thoughtful specification of the GBM parameters may improve its performance to a greater extent. A disadvantage of GBM is that the twang package can be computationally demanding; in our experience implementing the two methods, BART ran markedly faster than GBM, consistent with reports by others.43
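
To make the default usage concrete, the sketch below shows how propensity scores might be obtained from each package while leaving the tuning parameters at their package defaults. The data frame dat and the variable names are illustrative, and these calls are not the exact code used in our simulations.

    library(twang)  # GBM-based propensity score estimation
    library(BART)   # probit BART for a binary treatment

    # GBM via twang, with n.trees, interaction.depth, and shrinkage at defaults
    gbm_fit <- ps(A ~ X1 + X2 + X3, data = dat, estimand = "ATE",
                  stop.method = "es.mean", verbose = FALSE)
    w_gbm <- get.weights(gbm_fit, stop.method = "es.mean")

    # BART via pbart, with default priors and number of trees;
    # prob.train.mean holds the posterior mean of P(A = 1 | X)
    bart_fit <- pbart(x.train = as.matrix(dat[, c("X1", "X2", "X3")]),
                      y.train = dat$A)
    ps_bart <- bart_fit$prob.train.mean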

In conclusion, our results suggest that in observational studies with clustered data, flexible modeling of the propensity score may offer advantages in terms of covariate balance and bias reduction, at least when the sample size is large and cluster sizes are considerably larger than the number of clusters (e.g., 20 clusters of sizes 200 to 500). However, when the cluster sizes are not sufficiently large (e.g., 100 clusters of size 50), nonparametric models may fail to adequately balance unmeasured cluster-level covariates and thus may not be particularly useful for propensity score estimation. A major limitation of our study, and of simulation studies in general, is that we cannot capture all possible cluster/sample sizes and assignment mechanisms that may occur in practice. In particular, we generated clusters such that each has an adequate number of treated and control units in all scenarios, whereas treatment allocation is often unbalanced in observational datasets. As seen in our simulations and real data application, results are highly dependent on the specific setup and data generating processes. The main contribution of our findings thus lies in offering insight into parametric versus nonparametric propensity score estimation with clustered data; they should not be regarded as definitive conclusions, and the choice of which model works best will largely depend on the specific data at hand. Future work may extend our study to explore the potential of nonparametric propensity score estimation in other situations (e.g., with extremely unbalanced treatment allocation within clusters).

Supplementary Material

REFERENCES

1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. doi: 10.1093/biomet/70.1.41
2. D’Agostino RB. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat Med. 1998;17(19):2265–2281.
3. Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: an application to data on right heart catheterization. Health Serv Outcomes Res Methodol. 2001;2(3):259–278. doi: 10.1023/A:1020371312283
4. Stürmer T, Joshi M, Glynn RJ, Avorn J, Rothman KJ, Schneeweiss S. A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. J Clin Epidemiol. 2006;59(5):437–447. doi: 10.1016/j.jclinepi.2005.07.004
5. Rubin DB. Randomization analysis of experimental data: the Fisher randomization test comment. J Am Stat Assoc. 1980;75(371):591–593. doi: 10.2307/2287653
6. Schuler MS, Chu W, Coffman D. Propensity score weighting for a continuous exposure with multilevel data. Health Serv Outcomes Res Methodol. 2016;16(4):271–292. doi: 10.1007/s10742-016-0157-5
7. Li F, Zaslavsky AM, Landrum MB. Propensity score weighting with multilevel data. Stat Med. 2013;32(19):3373–3387. doi: 10.1002/sim.5786
8. He Z. Inverse conditional probability weighting with clustered data in causal inference. arXiv. Preprint posted online August 5, 2018. doi: 10.48550/arXiv.1808.01647
9. Arpino B, Mealli F. The specification of the propensity score in multilevel observational studies. Comput Stat Data Anal. 2011;55(4):1770–1780.
10. Thoemmes FJ, West SG. The use of propensity scores for nonrandomized designs with clustered data. Multivar Behav Res. 2011;46(3):514–543. doi: 10.1080/00273171.2011.569395
11. Leite WL, Jimenez F, Kaya Y, Stapleton LM, MacInnes JW, Sandbach R. An evaluation of weighting methods based on propensity scores to reduce selection bias in multilevel observational studies. Multivar Behav Res. 2015;50(3):265–284. doi: 10.1080/00273171.2014.991018
12. Kim K. Efficiency of average treatment effect estimation when the true propensity is parametric. Econometrics. 2019;7(2):25. doi: 10.3390/econometrics7020025
13. Rubin DB, Thomas N. Matching using estimated propensity scores: relating theory to practice. Biometrics. 1996;52(1):249–264. doi: 10.2307/2533160
14. McCaffrey DF, Ridgeway G, Morral AR. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol Methods. 2004;9(4):403–425. doi: 10.1037/1082-989X.9.4.403
15. Setoguchi S, Schneeweiss S, Brookhart MA, Glynn RJ, Cook EF. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf. 2008;17(6):546–555. doi: 10.1002/pds.1555
16. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med. 2010;29(3):337–346. doi: 10.1002/sim.3782
17. Harris KM, Udry JR. National Longitudinal Study of Adolescent to Adult Health (Add Health), 1994–2018 [Public Use]. Carolina Population Center, University of North Carolina-Chapel Hill; Inter-university Consortium for Political and Social Research [distributor]; 2022-02-09. doi: 10.3886/ICPSR21600.v24
18. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688–701. doi: 10.1037/h0037350
19. Holland PW. Statistics and causal inference. J Am Stat Assoc. 1986;81(396):945–960. doi: 10.1080/01621459.1986.10478354
20. Schafer JL, Kang J. Average causal effects from nonrandomized studies: a practical guide and simulated example. Psychol Methods. 2008;13(4):279–313. doi: 10.1037/a0014268
21. Normand ST, Landrum MB, Guadagnoli E, et al. Validating recommendations for coronary angiography following acute myocardial infarction in the elderly: a matched analysis using propensity scores. J Clin Epidemiol. 2001;54(4):387–398. doi: 10.1016/s0895-4356(00)00321-8
22. Mamdani M, Sykora K, Li P, et al. Reader’s guide to critical appraisal of cohort studies: 2. assessing potential for confounding. BMJ. 2005;330(7497):960–962. doi: 10.1136/bmj.330.7497.960
23. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083–3107. doi: 10.1002/sim.3697
24. Xu S, Ross C, Raebel MA, Shetterly S, Blanchette C, Smith D. Use of stabilized inverse propensity scores as weights to directly estimate relative risk and its confidence intervals. Value Health. 2010;13(2):273–277. doi: 10.1111/j.1524-4733.2009.00671.x
25. Austin PC. Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Stat Med. 2016;35(30):5642–5655. doi: 10.1002/sim.7084
26. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. New York, NY: Routledge; 1984.
27. Elith J, Leathwick JR, Hastie T. A working guide to boosted regression trees. J Anim Ecol. 2008;77(4):802–813. doi: 10.1111/j.1365-2656.2008.01390.x
28. Ridgeway G, McCaffrey DF, Morral AR, Burgette LF, Griffin BA. Toolkit for weighting and analysis of nonequivalent groups: a tutorial for the R TWANG package. RAND Corporation. https://www.rand.org/pubs/tools/TL136z1.html. Accessed June 2, 2022.
29. Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Ann Appl Stat. 2010;4(1):266–298. doi: 10.1214/09-AOAS285
30. Chipman H, George E, McCulloch R. Bayesian ensemble learning. In: Schölkopf B, Platt J, Hoffman T, eds. Advances in Neural Information Processing Systems. Vol 19. MIT Press; 2006. https://proceedings.neurips.cc/paper/2006/file/1706f191d760c78dfcec5012e43b6714-Paper.pdf
31. Hill J, Weiss C, Zhai F. Challenges with propensity score strategies in a high-dimensional setting and a potential alternative. Multivar Behav Res. 2011;46(3):477–513. doi: 10.1080/00273171.2011.570161
32. Dorie V, Hill J, Shalit U, Scott M, Cervone D. Automated versus do-it-yourself methods for causal inference: lessons learned from a data analysis competition. Stat Sci. 2019;34(1):43–68. doi: 10.1214/18-STS667
33. Dorie V, Chipman H, McCulloch R. dbarts: Discrete Bayesian Additive Regression Trees Sampler. R package version 0.9-22; 2022.
34. Pirracchio R, Petersen ML, van der Laan M. Improving propensity score estimators’ robustness to model misspecification using Super Learner. Am J Epidemiol. 2015;181(2):108–119. doi: 10.1093/aje/kwu253
35. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.
36. McCulloch R, Sparapani R, Spanbauer C, et al. BART: Bayesian Additive Regression Trees. R package version 2.9; 2021.
37. Lumley T. survey: Analysis of Complex Survey Samples. R package version 4.1-1; 2021.
38. Yoshida K, Bartel A, Chipman JJ, et al. tableone: Create “Table 1” to Describe Baseline Characteristics with or without Propensity Score Weights. R package version 0.13.2; 2022.
39. Easterlin MC, Chung PJ, Leng M, Dudovitz R. Association of team sports participation with long-term mental health outcomes among individuals exposed to adverse childhood experiences. JAMA Pediatr. 2019;173(7):681–688. doi: 10.1001/jamapediatrics.2019.1212
40. Reese BM, Halpern CT. Attachment to conventional institutions and adolescent rapid repeat pregnancy: a longitudinal national study among adolescents in the United States. Matern Child Health J. 2017;21(1):58–67. doi: 10.1007/s10995-016-2093-y
41. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6(1). doi: 10.2202/1544-6115.1309
42. Griffin BA, McCaffrey DF, Almirall D, Burgette LF, Setodji CM. Chasing balance and other recommendations for improving nonparametric propensity score models. J Causal Inference. 2017;5(2). doi: 10.1515/jci-2015-0026
43. Parast L, McCaffrey DF, Burgette LF, et al. Optimizing variance-bias trade-off in the TWANG package for estimation of propensity scores. Health Serv Outcomes Res Methodol. 2017;17(3):175–197. doi: 10.1007/s10742-016-0168-2
