Abstract
Propensity score estimation is often used as a preliminary step to estimate the average treatment effect with observational data. Nevertheless, misspecification of propensity score models undermines the validity of effect estimates in subsequent analyses. Prediction-based machine learning algorithms are increasingly used to estimate propensity scores to allow for more complex relationships between covariates. However, these approaches may not necessarily achieve covariate balance. We propose a calibration-based method to better incorporate covariate balance properties in a general modeling framework. Specifically, we calibrate the loss function by adding a covariate imbalance penalty to standard parametric (e.g. logistic regression) or machine learning models (e.g. neural networks). Our approach may mitigate the impact of model misspecification by explicitly taking covariate balance into account in the propensity score estimation process. The empirical results show that the proposed method is robust to propensity score model misspecification. The integration of loss function calibration improves the balance of covariates and reduces the root-mean-square error of causal effect estimates. When the propensity score model is misspecified, the neural-network-based model yields the best estimator, with smaller bias and variance compared to the other methods considered.
Keywords: Causal inference, covariate balance, imbalance score, inverse propensity score weighting, model misspecification, neural network, observational studies
1. Introduction
In observational studies, the covariate distributions between treatment groups are usually imbalanced, which induces confounding bias when estimating the causal effect.1,2 The propensity score (PS), defined as the conditional probability of being treated given the covariates,3 plays an essential role in causal inference. Because of the balancing property, the covariates and treatment are conditionally independent given the PS.4 There are several PS methods to remove confounding bias when estimating causal effects, including weighting,5 stratification,6 matching,7 and regression adjustment.8
The standard procedure for propensity score estimation is to fit a logistic regression model, check covariate balance, and re-fit the model until balance is achieved.6 However, there are significant drawbacks to this approach. Firstly, the parametric form of a logistic regression can be easily misspecified.9,10 It has been found that even under mild model misspecification, logistic regression methods for PS estimation may fail, and the performance of weighting methods, including the doubly robust estimator, may be unsatisfactory.11 Secondly, the logistic regression may fail to capture complex non-linear relationships between covariates and the treatment.12 Moreover, the iterative process is cumbersome, and analysts often skip this step, which leads to unreliable conclusions.13
The goal of propensity score estimation is to balance covariates. Naïve PS estimation methods such as logistic regression may not achieve covariate balance.14 Empirical studies showed that the absolute standardized mean difference in covariates was strongly correlated with the bias of causal effect estimation,15 but prediction accuracy measures such as the c-statistic were not associated with the magnitude of bias.16 The best prediction model may not achieve covariate balance when there is model misspecification. Various approaches focusing on covariate balancing have been proposed,17 including entropy balancing (EB),18 stabilized balancing weights (SBWs),19 covariate balancing propensity score (CBPS),20 covariate balancing scoring rules (CBSRs),21 and propensity score local balance (PSLB).22 Both the EB and SBW methods directly model the weights required to balance covariates and bypass the propensity scores. Therefore, they are not applicable to PS-based methods, such as matching or stratification, for causal inference. The CBPS method estimates propensity scores using the logistic regression model, where the balance of covariates is optimized within the framework of the generalized method of moments.10 The CBSR method generalizes the concept of CBPS in various aspects, such as incorporating different PS link functions and applying non-parametric kernel methods. It can also handle different causal estimands while maintaining the covariate moment balancing constraints.21 PSLB, an extension of the CBPS method, considers a non-parametric kernel model and optimizes local balance within propensity-score-stratified (PS-stratified) sub-populations.22
Statistical or machine learning approaches are more flexible in incorporating non-linear relationships.23 Prediction-based machine learning algorithms, such as bagging or boosting,24 random forests, boosted classification and regression trees,12 and neural networks are increasingly used to predict propensity scores.15,16 However, these approaches may not necessarily achieve covariate balance.25 Incorporating covariate balance into the optimization process of machine learning methods can improve the estimation of causal effects.
Inspired by the dual characteristics of the PS, which is both a covariate balancing score and the conditional probability of treatment assignment, we calibrate the likelihood or loss function of a prediction model by adding a covariate-imbalance penalty. The calibrated loss function takes into account prediction accuracy and covariate balance at the same time. This approach may mitigate the impact of misspecification of the PS model and thus improve the performance of the ATE estimator. Furthermore, we leverage the flexibility of machine learning methods by configuring the base model as either a logistic model or a neural network architecture.
The remainder of this article is organized as follows. In Section 2, we begin with a brief review of the traditional logistic regression PS model. Then we present the proposed methods where the loss function is calibrated by adding an imbalance score. Optimization and tuning parameter selection methods are introduced with a comprehensive algorithm. In Section 3, we assess the performance of the proposed methods in comparison with other PS estimation methods in the simulation study. In Section 4, we apply the proposed methods to a real study that examined the causal effect of the utilization of right heart catheterization in the intensive care unit on the hospital length of stay. We provide conclusions and discussions in Section 5.
2. Methods
2.1. Notation and model specification
Suppose there is a random sample of $n$ individuals from a population. Let $T_i \in \{0, 1\}$ denote the binary treatment assignment and $X_i = (X_{i1}, \ldots, X_{ip})^\top$ be the $p$-dimensional covariate vector for the $i$th individual, where $X_{ij}$ denotes the $j$th covariate of the $i$th individual for $i = 1, \ldots, n$ and $j = 1, \ldots, p$. According to Rosenbaum and Rubin,3 we define the potential outcomes for each individual as $Y_i(t)$ under the treatment $t \in \{0, 1\}$. The observed outcome is $Y_i = T_i Y_i(1) + (1 - T_i) Y_i(0)$. We observe the covariate matrix $\mathbf{X}$, the treatment vector $\mathbf{T}$, and the observed outcome vector $\mathbf{Y}$ for the sample. Our proposed work focused on the estimation of the average treatment effect (ATE): $\tau = E\{Y(1) - Y(0)\}$. The same framework could be easily extended to other causal estimands, such as the ATE among the treated (ATT). To ensure the ATE is identifiable, we made the classic assumptions as discussed by Angrist et al.,26 including the stable unit treatment value assumption (SUTVA), no unmeasured confounders, and the positivity assumption.27
The PS of receiving the treatment of interest can be defined as $e(X_i) = P(T_i = 1 \mid X_i)$. The estimation of the PS can be conducted in many ways, including the frequently employed logistic regression or increasingly prevalent machine learning methods.10,28,29 We will start the illustration of our proposed method with logistic regression as the base model for simplicity, and extend it to the case when the neural network method is used as a base model.
The logistic PS model can take the form of:

$$e(X_i; \beta) = \frac{\exp(X_i^\top \beta)}{1 + \exp(X_i^\top \beta)} \tag{1}$$

where $\beta$ is a vector of the logistic regression coefficients. Then the inverse propensity weight for the $i$th individual can be defined as a function of $\beta$:

$$w_i(\beta) = \frac{T_i}{e(X_i; \beta)} + \frac{1 - T_i}{1 - e(X_i; \beta)} \tag{2}$$
To achieve the goal of balancing the covariates while accurately predicting the treatment assignment, we propose the calibrated loss function for the PS model that accounts for the imbalance of covariates between treatment groups. Specifically, we consider the sum of the cross entropy, which is equivalent to the negative log-likelihood for the logistic regression base model, and the penalty term, namely the imbalance score (IS), as defined below:

$$L(\beta) = -\sum_{i=1}^{n} \left[ T_i \log e(X_i; \beta) + (1 - T_i) \log\{1 - e(X_i; \beta)\} \right] + \lambda \, \mathrm{IS}(\beta) \tag{3}$$

where the overall IS is defined as the sum of the IS for each covariate:

$$\mathrm{IS}(\beta) = \sum_{j=1}^{p} \mathrm{IS}_j(\beta) \tag{4}$$

and $\lambda \geq 0$ is the tuning parameter.
There are several ways to measure the difference between two covariate distributions, such as differences in moments (mean, variance, etc.), Kullback–Leibler (KL) divergence, and Kolmogorov–Smirnov (KS) test statistics. 30 Here we describe one commonly used measure for an IS, the absolute standardized difference in mean10,31,32:
$$\mathrm{IS}_j(\beta) = \frac{\left| \bar{x}_{1j}^{\,w} - \bar{x}_{0j}^{\,w} \right|}{\sqrt{\left( s_{1j}^2 + s_{0j}^2 \right)/2}} \tag{5}$$

where $\bar{x}_{tj}^{\,w}$ is the weighted sample mean and $s_{tj}^2$ is the weighted sample variance of the $j$th covariate for group $t \in \{0, 1\}$, with weights $w_i(\beta)$.
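As a concrete illustration, the overall IS in equations (4) and (5) can be computed from a weighted sample as below (a minimal NumPy sketch; the function name and array layout are our own choices, not from the paper):

```python
import numpy as np

def imbalance_score(X, t, w):
    """Sum over covariates of the weighted absolute standardized mean difference.

    X : (n, p) covariate matrix; t : (n,) binary treatment; w : (n,) weights.
    """
    total = 0.0
    for j in range(X.shape[1]):
        means, variances = [], []
        for group in (1, 0):
            mask = t == group
            mu = np.average(X[mask, j], weights=w[mask])
            var = np.average((X[mask, j] - mu) ** 2, weights=w[mask])
            means.append(mu)
            variances.append(var)
        pooled_sd = np.sqrt((variances[0] + variances[1]) / 2.0)
        total += abs(means[0] - means[1]) / pooled_sd
    return total
```

During estimation, the weights passed in would be the inverse propensity weights $w_i(\beta)$ of equation (2), so the penalty depends on $\beta$ through both the means and the variances.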
In fact, the logistic regression can be viewed as the most basic form of a neural network, comprising solely one input and one output layer with a sigmoid activation function, denoted by $\sigma(z) = 1/(1 + e^{-z})$. The regression coefficient vector $\beta$ defines the weights of the covariates. The PS is then estimated by applying the weighted sum of the covariates to the sigmoid function, that is, $\hat{e}(X_i) = \sigma(X_i^\top \beta)$. Figure 1 shows the structure of a logistic regression model and Figure 2 showcases a simple structure of a neural network with one hidden layer. To estimate the parameters in the neural network base model using the proposed method, we substitute the conventional prediction-based loss function with our calibrated loss function. By minimizing this loss function, we can estimate all parameters involved in the neural network model.33 With these estimated parameters, the propensity scores can be estimated. A more complicated neural network can have multiple hidden layers with different structures, designed to model non-linear relationships and capture intricate patterns in the data.34,35 The structure of the neural network and the number of neurons in each layer should be tailored to the complexity of the dataset. Several methods have been proposed for designing prediction-based neural networks.36–38 These techniques can be used to initially determine the structure of the network. Once the structure is established, the calibrated loss function in equation (3) can be easily incorporated to optimize the parameters.
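To make the one-hidden-layer base model concrete, its forward pass can be sketched as follows (NumPy; the parameter names and the ReLU hidden activation are our own illustrative choices):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid output activation, as in the logistic model."""
    return 1.0 / (1.0 + np.exp(-z))

def nn_propensity(X, W1, b1, w2, b2):
    """Forward pass of a one-hidden-layer propensity model.

    X : (n, p) covariates; W1 : (p, h) and b1 : (h,) hidden-layer parameters;
    w2 : (h,) and b2 : scalar output-layer parameters.
    """
    hidden = np.maximum(X @ W1 + b1, 0.0)   # ReLU hidden activation
    return sigmoid(hidden @ w2 + b2)        # propensity scores in (0, 1)
```

The logistic model corresponds to skipping the hidden layer entirely and returning `sigmoid(X @ beta)`; the calibrated loss treats both cases identically, since it only needs the predicted propensity scores.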
Figure 1.
Neural network architecture for logistic regression.
Figure 2.
Neural network architecture with a hidden layer.
2.2. Optimization
To minimize the calibrated loss function, we use Algorithm 1 described below to estimate the parameter $\beta$. Specifically, we utilize the gradient descent algorithm to minimize the loss function and update the parameter estimates iteratively based on the following equation:

$$\beta^{(k+1)} = \beta^{(k)} - \alpha \nabla L\left(\beta^{(k)}\right)$$

where $\alpha$ is the learning rate to control the step size of each update and $\nabla L(\beta^{(k)})$ is the gradient of the loss function. We update the estimate of the parameter until the algorithm converges or the maximum number of iterations ($K_{\max}$) is reached.
The initial values of the parameter $\beta$ are selected through a prediction-based PS model, ensuring a reasonable starting point for optimization. While the algorithm may converge to a local minimum, that solution remains close to the prediction-based estimate. The primary objective of our optimization is not necessarily to find the global minimum but rather to enhance performance within an acceptable range. Several modified gradient descent algorithms help find a global minimum even for non-convex problems, including adaptive learning rate algorithms such as Adam39 and Adagrad,40 which adjust step sizes dynamically during the training stage, and stochastic gradient descent (SGD),41 which introduces randomness to escape local minima.
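As an illustration of the update rule above, the following self-contained sketch minimizes a calibrated loss (cross-entropy plus $\lambda$ times a standardized-mean-difference penalty) for a logistic base model. The finite-difference gradient and step-halving safeguard are our own simplifications, not the paper's Algorithm 1, which may use analytic gradients and a different stopping rule:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def calibrated_loss(beta, X, t, lam):
    """Cross-entropy plus lam times a mean-difference imbalance penalty."""
    e = np.clip(sigmoid(X @ beta), 1e-8, 1 - 1e-8)
    ce = -np.sum(t * np.log(e) + (1 - t) * np.log(1 - e))
    w = t / e + (1 - t) / (1 - e)          # inverse propensity weights
    penalty = 0.0
    for j in range(X.shape[1]):
        m = [np.average(X[t == g, j], weights=w[t == g]) for g in (1, 0)]
        v = [np.average((X[t == g, j] - m[k]) ** 2, weights=w[t == g])
             for k, g in enumerate((1, 0))]
        penalty += abs(m[0] - m[1]) / np.sqrt((v[0] + v[1]) / 2.0)
    return ce + lam * penalty

def fit_calibrated_ps(X, t, lam=1.0, alpha=0.05, max_iter=300):
    """Gradient descent with a finite-difference gradient of the loss."""
    beta = np.zeros(X.shape[1])
    current = calibrated_loss(beta, X, t, lam)
    eps = 1e-5
    for _ in range(max_iter):
        grad = np.zeros_like(beta)
        for j in range(len(beta)):
            step = np.zeros_like(beta); step[j] = eps
            grad[j] = (calibrated_loss(beta + step, X, t, lam)
                       - calibrated_loss(beta - step, X, t, lam)) / (2.0 * eps)
        cand = beta - alpha * grad
        cand_loss = calibrated_loss(cand, X, t, lam)
        if cand_loss > current:            # backtrack: halve the step on overshoot
            alpha *= 0.5
            if alpha < 1e-10:
                break
            continue
        converged = current - cand_loss < 1e-8
        beta, current = cand, cand_loss
        if converged:
            break
    return beta
```

Starting from $\beta = 0$ here stands in for the prediction-based initial values described above; in practice one would warm-start from a fitted logistic regression.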
2.3. Tuning parameter
Since the objective of PS estimation is not to achieve perfect model fitting, the traditional tuning parameter selection methods that emphasize prediction accuracy cannot be applied. Several researchers suggested that using the covariate balance measures such as the absolute standardized difference as tuning parameter selection criteria worked better in ATE estimation.15,32,42,43 Motivated by this idea, we use grid search and select the tuning parameter that gives the smallest IS as defined in equation (5).
Cross-validation (CV) is widely recognized as an effective technique for parameter tuning.44 In our context, the application of CV is also feasible, contingent upon available computational resources. It is important to note that if $\lambda$ is zero, the proposed method degenerates to its base model. Conversely, with a sufficiently large $\lambda$, the method trades prediction accuracy for covariate balance. Hence, we recommend exploring a broad range of values for $\lambda$ during the grid search process.
3. Simulation study
The simulation design introduced by Kang and Schafer 11 has become a widely adopted benchmark setting for evaluating the performance of PS methods.19,20 To test the robustness of our proposed method under model misspecification, we replicated this simulation setting and examined the empirical performance of the proposed method under different model misspecifications. In detail, we applied the calibrated loss function methods for PS estimation based on two base models, logistic regression and one-hidden-layer neural network in which the activation function is ReLU 45 for the hidden layer and the sigmoid function for the output layer.
3.1. Estimators of causal effect
In the following, we briefly review the three ATE estimators used in our simulation study. The first two are inverse probability weighting estimators that are unbiased when the PS model is correctly specified.31

(1) Horvitz–Thompson (HT) estimator46:

$$\hat{\tau}_{\mathrm{HT}} = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{T_i Y_i}{\hat{e}_i} - \frac{(1 - T_i) Y_i}{1 - \hat{e}_i} \right\} \tag{6}$$

(2) Hájek estimator:

$$\hat{\tau}_{\mathrm{Hajek}} = \left( \sum_{i=1}^{n} \frac{T_i}{\hat{e}_i} \right)^{-1} \sum_{i=1}^{n} \frac{T_i Y_i}{\hat{e}_i} - \left( \sum_{i=1}^{n} \frac{1 - T_i}{1 - \hat{e}_i} \right)^{-1} \sum_{i=1}^{n} \frac{(1 - T_i) Y_i}{1 - \hat{e}_i} \tag{7}$$

(3) Doubly robust (DR) estimator:

$$\hat{\tau}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{T_i Y_i - (T_i - \hat{e}_i)\, \hat{m}_1(X_i)}{\hat{e}_i} - \frac{(1 - T_i) Y_i + (T_i - \hat{e}_i)\, \hat{m}_0(X_i)}{1 - \hat{e}_i} \right\} \tag{8}$$

where $\hat{e}_i$ is the estimated PS for the $i$th individual and $\hat{m}_t(X_i)$ is the fitted outcome model under treatment $t$. The DR estimator combines a PS model with an outcome model to yield a consistent estimator when one of the models is correctly specified.49–52
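The three estimators translate directly into code. This NumPy sketch (helper names are our own) takes the estimated propensity scores `e` and, for the DR estimator, outcome-model predictions `m1` and `m0`:

```python
import numpy as np

def ht_estimator(y, t, e):
    """Horvitz-Thompson: simple inverse-probability-weighted mean difference."""
    return np.mean(t * y / e - (1 - t) * y / (1 - e))

def hajek_estimator(y, t, e):
    """Hajek: normalizes the inverse propensity weights within each group."""
    w1, w0 = t / e, (1 - t) / (1 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def dr_estimator(y, t, e, m1, m0):
    """Doubly robust: augments IPW with outcome-model predictions m1, m0."""
    return np.mean(t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e) + m1 - m0)
```

Note that the Hájek weights sum to one within each treatment group, which is what stabilizes the estimator against a few extreme inverse propensity weights.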
3.2. Simulation set-up
We first generated four latent pre-treatment covariates $Z_{i1}, Z_{i2}, Z_{i3}, Z_{i4}$ that followed independent standard normal distributions. The potential outcomes were then generated through two linear regressions:

$$Y_i(1) = 210 + 27.4 Z_{i1} + 13.7 Z_{i2} + 13.7 Z_{i3} + 13.7 Z_{i4} + \varepsilon_i \tag{9}$$

and

$$Y_i(0) = 200 + 27.4 Z_{i1} + 13.7 Z_{i2} + 13.7 Z_{i3} + 13.7 Z_{i4} + \varepsilon_i \tag{10}$$

where $\varepsilon_i \sim N(0, 1)$. Under this setting, the mean potential outcome if everyone was treated was 210, the mean potential outcome if everyone was not treated was 200, and the true ATE was 10.
The true PS was generated with a logistic regression model:

$$e(Z_i) = \frac{\exp(-Z_{i1} + 0.5 Z_{i2} - 0.25 Z_{i3} - 0.1 Z_{i4})}{1 + \exp(-Z_{i1} + 0.5 Z_{i2} - 0.25 Z_{i3} - 0.1 Z_{i4})} \tag{11}$$
The treatment assignment was then generated from $T_i \sim \mathrm{Bernoulli}\{e(Z_i)\}$. The observed outcome was $Y_i = T_i Y_i(1) + (1 - T_i) Y_i(0)$. To introduce model misspecification, we transformed the latent covariates $Z_i$ to the observed covariates $X_i$ as follows:

$$X_{i1} = \exp(Z_{i1}/2) \tag{12}$$

$$X_{i2} = Z_{i2} / \{1 + \exp(Z_{i1})\} + 10 \tag{13}$$

$$X_{i3} = (Z_{i1} Z_{i3}/25 + 0.6)^3 \tag{14}$$

$$X_{i4} = (Z_{i2} + Z_{i4} + 20)^2 \tag{15}$$
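For reference, one draw from a data-generating process of this kind can be sketched as follows. The covariate transformations and true-PS coefficients follow the original Kang–Schafer design; the outcome coefficients are an illustrative assumption on our part, chosen only so that $E\{Y(1)\} = 210$, $E\{Y(0)\} = 200$, and the true ATE is 10, as stated in the text:

```python
import numpy as np

def generate_kang_schafer(n, rng):
    """Simulate one dataset in the spirit of the Kang-Schafer benchmark.

    Returns latent covariates Z, observed covariates X, treatment t,
    observed outcome y, and the true propensity scores e.
    """
    Z = rng.normal(size=(n, 4))
    # True PS: logistic in the latent covariates
    lin = -Z[:, 0] + 0.5 * Z[:, 1] - 0.25 * Z[:, 2] - 0.1 * Z[:, 3]
    e = 1.0 / (1.0 + np.exp(-lin))
    t = (rng.random(n) < e).astype(int)
    # Potential outcomes (coefficients are an illustrative assumption)
    eps = rng.normal(size=n)
    base = 27.4 * Z[:, 0] + 13.7 * (Z[:, 1] + Z[:, 2] + Z[:, 3]) + eps
    y1, y0 = 210.0 + base, 200.0 + base
    y = t * y1 + (1 - t) * y0
    # Observed covariates: non-linear transforms that misspecify the PS model
    X = np.column_stack([
        np.exp(Z[:, 0] / 2.0),
        Z[:, 1] / (1.0 + np.exp(Z[:, 0])) + 10.0,
        (Z[:, 0] * Z[:, 2] / 25.0 + 0.6) ** 3,
        (Z[:, 1] + Z[:, 3] + 20.0) ** 2,
    ])
    return Z, X, t, y, e
```

Fitting a PS model to `X` corresponds to the misspecified scenario, while fitting to `Z` corresponds to the correctly specified one.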
We investigated the performance of our proposed methods for the three ATE estimators described above under different scenarios. Since only the PS model was required for the HT and Hájek estimators, we assessed the PS estimation methods under two scenarios: (1) the PS model was misspecified, using the observed covariates $X_1$ to $X_4$, and (2) the PS model was correctly specified, using the latent covariates $Z_1$ to $Z_4$. For the DR estimator, we also considered whether the outcome model was correctly specified under an incorrect or a correct PS model. Specifically, for the misspecified outcome model we applied linear regression with the observed covariates $X_1$ to $X_4$. In contrast, we used the latent covariates $Z_1$ to $Z_4$ for the correctly specified outcome model.
For each scenario, we conducted 1000 Monte Carlo simulations with sample sizes $n = 200$ and $n = 500$. We compared the bias and root-mean-square error (RMSE) of the estimated ATE in equations (6) to (8) based on the following PS estimation methods:
GLM: Standard logistic regression
GLM+IS: Proposed calibration method with logistic regression as the base model
NNet: Neural network
NNet+IS: Proposed calibration method with neural network as the base model
CBPS: CBPS algorithm that extends logistic regression to achieve covariate balance
The structure of the neural network in NNet and NNet+IS was chosen to be one hidden layer of four neurons, as described in Figure 2, considering that we have four covariates and the data are not very complex, so a single hidden layer should be sufficient. We assessed convergence of the proposed methods by checking the stability of the loss function over the final iterations.
Furthermore, we calculated the standardized mean difference in the weighted observed covariates $X_1$ to $X_4$, and in the latent covariates $Z_1$ to $Z_4$, to assess the covariate balance between the two groups. The inverse propensity weights were estimated using the methods described above.
3.3. Simulation results
Table 1 presents the bias and RMSE of ATE estimation for different effect estimators under various PS estimation methods. We also reported the Monte Carlo standard error of bias and RMSE in Table 2. 53 Figure 3 shows the boxplots of ATE estimates across 1000 simulation runs, truncated at the 0.1th and 99.9th percentiles.
Table 1.
Performance of three average treatment effect estimators based on different propensity score estimation methods.
GLM | GLM+IS | NNet | NNet+IS | CBPS | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Bias | RMSE | Bias | RMSE | Bias | RMSE | Bias | RMSE | Bias | RMSE | ||
n = 200 | PS model is misspecified | ||||||||||
HT | 22.469 | 149.114 | 1.652 | 13.000 | 1.076 | 25.503 | 0.895 | 12.466 | 4.986 | 5.940 | |
Hájek | 2.038 | 8.827 | 5.057 | 5.949 | 3.747 | 6.687 | 2.987 | 3.812 | 5.062 | 5.963 | |
DR (outcome model misspecified) | 7.780 | 31.260 | 5.008 | 5.924 | 3.769 | 4.692 | 2.925 | 3.788 | 5.028 | 5.951 | |
DR (outcome model correctly specified) | 0.006 | 0.386 | 0.010 | 0.178 | 0.009 | 0.206 | 0.010 | 0.180 | 0.010 | 0.177 | |
PS model is correctly specified | |||||||||||
HT | 0.029 | 15.735 | 0.109 | 14.844 | 0.284 | 13.342 | 1.187 | 12.492 | 0.100 | 0.239 | |
Hájek | 0.431 | 3.281 | 0.002 | 0.178 | 0.936 | 3.778 | 0.001 | 0.191 | 0.008 | 0.177 | |
DR (outcome model misspecified) | 0.020 | 2.345 | 0.242 | 1.822 | 0.656 | 2.028 | 0.014 | 1.358 | 0.218 | 1.830 | |
DR (outcome model correctly specified) | 0.007 | 0.179 | 0.007 | 0.177 | 0.009 | 0.186 | 0.010 | 0.184 | 0.008 | 0.177 | |
n = 500 | PS model is misspecified | ||||||||||
HT | 39.977 | 350.917 | 0.400 | 5.613 | 3.029 | 62.053 | 1.580 | 5.379 | 5.568 | 5.994 | |
Hájek | 0.230 | 10.551 | 5.574 | 5.988 | 1.393 | 5.526 | 2.018 | 2.485 | 5.583 | 5.999 | |
DR (outcome model misspecified) | 12.156 | 93.065 | 5.570 | 5.992 | 2.446 | 4.284 | 1.995 | 2.470 | 5.573 | 5.992 | |
DR (outcome model correctly specified) | 0.021 | 0.817 | 0.002 | 0.110 | 0.016 | 0.588 | 0.001 | 0.109 | 0.001 | 0.110 | |
PS model is correctly specified | |||||||||||
HT | 0.228 | 9.255 | 0.149 | 8.172 | 0.084 | 6.841 | 0.580 | 5.179 | 0.032 | 0.129 | |
Hájek | 0.078 | 2.125 | 0.006 | 0.113 | 0.611 | 2.281 | 0.010 | 0.124 | 0.001 | 0.107 | |
DR (outcome model misspecified) | 0.029 | 1.650 | 0.109 | 1.308 | 0.369 | 1.311 | 0.019 | 0.920 | 0.103 | 1.310 | |
DR (outcome model correctly specified) | 0.001 | 0.109 | 0.001 | 0.108 | 0.000 | 0.111 | 0.001 | 0.109 | 0.001 | 0.107 |
Note: The bias and RMSE are computed for the Horvitz–Thompson (HT) estimator, the Hájek estimator, and the doubly robust (DR) estimator to compare the ATE estimation results. The true ATE is 10. The sample sizes are n = 200 and n = 500. The number of simulations is 1000. GLM: generalized linear model; RMSE: root-mean-square error; ATE: average treatment effect; NNet: neural network; CBPS: covariate balancing propensity score.
Table 2.
Monte Carlo standard error (MCSE) for the performance measures of bias and RMSE. 53
GLM | GLM+IS | NNet | NNet+IS | CBPS | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Bias | RMSE | Bias | RMSE | Bias | RMSE | Bias | RMSE | Bias | RMSE | ||
n = 200 | PS model is misspecified | ||||||||||
HT | 4.664 | 126.595 | 0.408 | 3.599 | 0.806 | 13.220 | 0.393 | 3.181 | 0.102 | 1.052 | |
Hájek | 0.272 | 3.103 | 0.099 | 1.048 | 0.175 | 1.881 | 0.075 | 0.727 | 0.100 | 1.051 | |
DR (outcome model misspecified) | 0.958 | 27.965 | 0.100 | 1.048 | 0.088 | 1.021 | 0.076 | 0.719 | 0.101 | 1.051 | |
DR (outcome model correctly specified) | 0.012 | 0.239 | 0.006 | 0.038 | 0.007 | 0.073 | 0.006 | 0.039 | 0.006 | 0.038 | |
PS model is correctly specified | |||||||||||
HT | 0.498 | 4.249 | 0.470 | 3.519 | 0.422 | 3.789 | 0.393 | 3.798 | 0.007 | 0.054 | |
Hájek | 0.103 | 0.939 | 0.006 | 0.038 | 0.116 | 0.873 | 0.006 | 0.042 | 0.006 | 0.038 | |
DR (outcome model misspecified) | 0.074 | 0.745 | 0.057 | 0.425 | 0.061 | 0.545 | 0.043 | 0.320 | 0.057 | 0.423 | |
DR (outcome model correctly specified) | 0.006 | 0.039 | 0.006 | 0.038 | 0.006 | 0.043 | 0.006 | 0.040 | 0.006 | 0.038 | |
n = 500 | PS model is misspecified | ||||||||||
HT | 11.025 | 329.136 | 0.177 | 1.356 | 1.960 | 59.302 | 0.163 | 1.339 | 0.070 | 0.908 | |
Hájek | 0.334 | 4.550 | 0.069 | 0.904 | 0.169 | 3.121 | 0.046 | 0.471 | 0.069 | 0.907 | |
DR (outcome model misspecified) | 2.918 | 87.252 | 0.070 | 0.910 | 0.111 | 2.492 | 0.046 | 0.467 | 0.070 | 0.906 | |
DR (outcome model correctly specified) | 0.026 | 0.744 | 0.003 | 0.023 | 0.019 | 0.575 | 0.003 | 0.023 | 0.003 | 0.023 | |
PS model is correctly specified | |||||||||||
HT | 0.293 | 2.254 | 0.259 | 1.764 | 0.216 | 2.225 | 0.163 | 1.306 | 0.004 | 0.029 | |
Hájek | 0.067 | 0.608 | 0.004 | 0.024 | 0.070 | 0.513 | 0.004 | 0.027 | 0.003 | 0.023 | |
DR (outcome model misspecified) | 0.052 | 0.552 | 0.041 | 0.322 | 0.040 | 0.393 | 0.029 | 0.219 | 0.041 | 0.317 | |
DR (outcome model correctly specified) | 0.003 | 0.024 | 0.003 | 0.023 | 0.004 | 0.025 | 0.003 | 0.022 | 0.003 | 0.023 |
HT: Horvitz–Thompson; DR: doubly robust; PS: propensity score; GLM: generalized linear model; RMSE: root-mean-square error; ATE: average treatment effect; NNet: neural network; CBPS: covariate balancing propensity score.
Figure 3.
Boxplots of average treatment effect (ATE) estimates from 1000 simulation runs for three ATE estimators based on different propensity score methods. Note: The data for the plot were truncated at the 0.1th and 99.9th percentiles of each estimator.
When the PS model was misspecified, the proposed calibration methods for PS estimation (NNet+IS and GLM+IS) generally led to smaller bias and RMSE for all ATE estimators when compared to the corresponding standard estimation methods (NNet and GLM). When the PS model was correctly specified, the bias was small for all ATE estimators, but our proposed methods (NNet+IS and GLM+IS) had lower RMSE compared with the original (NNet and GLM) methods. Overall, the NNet+IS method performed the best, resulting in low bias and the lowest RMSE for the Hájek and DR estimators. The impact of sample size on RMSE depended on the PS model. When the PS model was correctly specified, the RMSE decreased with increasing sample size for all PS methods. However, when the PS model was misspecified, only NNet+IS consistently led to a smaller RMSE of the ATE with increasing sample sizes.
Under the simulated scenario, HT estimators appeared to have much larger RMSE and bias than the other estimators. The improvement in the HT estimators was especially substantial when the robust PS estimation methods were used under the misspecified PS model. Compared to the CBPS method, our proposed methods, especially the NNet+IS, resulted in much smaller biases in the HT estimator although the RMSEs were higher. For the Hájek estimator, the NNet+IS method reduced both the bias and RMSE, and the GLM+IS method reduced the RMSE whether PS was correctly specified or not. NNet+IS had overall the best performance in terms of the lowest bias and RMSE compared to other PS estimation methods including the CBPS.
For the DR estimator, when neither the PS model nor the outcome model was correctly specified, our robust PS methods, GLM+IS and NNet+IS, led to a moderate to significant reduction in the RMSE when compared to the original GLM and NNet methods. In addition, the flexible neural network (NNet) model appeared to perform better than the simple logistic regression (GLM) since it was more robust to the model misspecification. When either the PS model or the outcome model was misspecified, we still observed improvement in DR estimates using our robust PS estimation methods. The NNet+IS method outperformed all other methods including the CBPS method in terms of the bias and RMSE of DR estimates. This confirmed that even with the doubly robust property, the DR estimator can still benefit from the robust PS estimation methods in finite samples. When both models were correctly specified, all the PS methods gave similar results as expected, and the DR estimator had a smaller bias and variance than the HT and Hájek estimators that did not use outcome models. For the scenario when the PS model was correctly specified, our proposed PS methods (GLM+IS and NNet+IS) resulted in more improvement in the bias and RMSE for the Hájek estimator, and for the DR estimator under the misspecified outcome model, but less improvement for the HT estimator and the DR estimator under the correctly specified outcome model, when compared to the corresponding original methods (GLM and NNet). On the other hand, the CBPS performed the best when used with the HT estimator. Interestingly, the Hájek estimator even outperformed the DR estimator with a misspecified outcome model when the GLM+IS or NNet+IS method was used. It appeared that the misspecification of the outcome model may still negatively affect the DR estimator even though the PS model was correctly specified in this scenario.
Figure 4 shows the boxplots of the standardized mean difference in each covariate (observed or latent) when the PS model was misspecified, using the observed covariates. Each covariate was either unweighted or weighted by the inverse propensity scores estimated from the different PS estimation methods. For the observed covariates $X_1$ to $X_4$ (Figure 4(a)), logistic regression (GLM) and neural networks (NNet) resulted in poor covariate balance, while the proposed methods GLM+IS and NNet+IS showed significant improvements and CBPS had the best covariate balance. Note that the goal of the PS weighting method was to balance the latent covariates $Z_1$ to $Z_4$, which were actually used to define the true PS model. When examining the standardized mean difference in these latent covariates (Figure 4(b)), we found that NNet+IS achieved the best balance for all latent variables, followed by GLM+IS and CBPS.
Figure 4.
Propensity score (PS) weighted covariate balance under misspecified PS model. The sample size is : (a) covariate balance (observed covariates); and (b) covariate balance (latent covariates).
The mean absolute error (MAE), as a measure of the prediction performance of the different PS estimation methods when the PS model is misspecified, is reported in Table 3. In this case, GLM and NNet have the smallest MAE of PS estimation when the sample size is 200 and 500, respectively. However, NNet+IS has the smallest bias of ATE estimation using the Hájek estimator regardless of the sample size, as shown in Table 2. These results were consistent with previous findings that the prediction error was not strongly related to the bias of ATE estimation.2,10,16
Table 3.
MAE of estimated PS for different estimation methods when PS model is misspecified.
GLM | GLM+IS | NNet | NNet+IS | CBPS | |
---|---|---|---|---|---|
n = 200 | 0.081 | 0.088 | 0.088 | 0.095 | 0.089 |
n = 500 | 0.070 | 0.075 | 0.067 | 0.069 | 0.076 |
Note: Mean absolute error (MAE) is defined as MAE $= \frac{1}{n} \sum_{i=1}^{n} |\hat{e}_i - e_i|$, where $e_i$ and $\hat{e}_i$ denote the true and estimated PS. PS: propensity score; GLM: generalized linear model; NNet: neural network; CBPS: covariate balance propensity score.
4. Data application: The right heart catheterization data
We demonstrated our proposed PS methods using the right heart catheterization (RHC) dataset from Connors et al. 54 The objective of the analysis was to assess the effect of the utilization of RHC within the initial 24-hour period of care in the intensive care unit (ICU) on the hospital length of stay. The study cohort had 5734 patients in total, consisting of 2183 right heart catheterization users and 3551 non-users. The covariates identified by a panel of specialists in critical care included age, sex, race, years of education, income, medical insurance type, disease category, admission diagnosis, and other measurements.
In our analysis, we considered the covariates identified above for PS estimation. Two predictive base models, logistic regression and neural network, were used to estimate the PSs in the framework of the proposed calibrated loss function approach. We estimated the ATE of RHC with the HT, Hájek, and DR estimators and five PS estimation methods: GLM, GLM+IS, NNet, NNet+IS, and CBPS. We constructed the confidence intervals with bootstrap sampling. For the DR method, we applied linear regression with the treatment assignment and all covariates in the outcome model. The comprehensive ATE estimation results are provided in Table 4.
Table 4.
Average effect of right heart catheterization on length of hospital stay (days).
HT estimate of ATE | Hájek estimate of ATE | Doubly robust estimate of ATE | |
---|---|---|---|
Methods for PS estimation | ATE (95% CI) | ATE (95% CI) | ATE (95% CI) |
GLM | |||
GLM+IS | |||
NNet | |||
NNet+IS | |||
CBPS |
HT: Horvitz-Thompson; ATE: average treatment effect; PS: propensity score; GLM: generalized linear model; NNet: neural network; CBPS: covariate balance propensity score.
Overall, the Hájek and DR estimates of the ATE were nearly identical when using the calibrated loss approaches (i.e. GLM+IS and NNet+IS) or CBPS for PS estimation. In contrast, the HT estimates were consistently lower than the Hájek and DR estimates, except when CBPS was used for the PS estimation. Furthermore, the DR methods, whether calibrated or not, produced similar estimates, and the calibrated Hájek estimators yielded estimates that closely aligned with the DR estimates. Finally, the ATE estimates were larger when a neural network base model (NNet or NNet+IS) was used to estimate the PSs.
As shown in our simulation results, when the outcome model was correctly specified, the DR estimator was much less biased than the Hájek estimator. In practice, however, neither the PS model nor the outcome model may be perfectly specified. Given the minimal difference between these two estimates in our analysis, we suspected that the outcome model was unlikely to be specified correctly. Therefore, we considered the results from the NNet+IS method more reliable, given that this method demonstrated the best performance in our simulation study under misspecification of both the PS and outcome models.
We also examined the covariate balance achieved by the four PS estimation methods and the unweighted sample using the standardized mean difference defined by Austin et al.13,55 (Figure 5). Most covariates were better balanced via inverse PS weighting based on the standard logistic regression and neural network methods and fell within the 0.1 threshold.55 However, the standardized mean differences for some covariates fell outside the threshold.56,57 As we expected, the proposed calibration methods could further improve the covariate balance for these two prediction models and achieve almost exact balance.
Figure 5.
Standardized mean differences in inverse-propensity-score weighted covariates.
The dataset of Connors et al.54 has been analyzed with various approaches, each employing a slightly different set of covariates. We summarized the reported effect sizes of RHC use together with our estimates in Figure 6. Notice that some published results did not provide a confidence interval for the estimated effect size. Comparing our results with those reported previously, the overall conclusions for all methods were consistent. However, the proposed method using the neural network base model resulted in a larger effect of RHC on the length of stay than most of those previously reported, except the one presented by Li et al.22
Figure 6.
Forest plot of effect estimates from multiple studies. Note: Keele and Small 58 used targeted maximum likelihood estimation (TMLE) with super learner as well as generalized random forests (GRF). Frank et al. 59 applied TMLE. Li et al. 22 analyzed the right heart catheterization data with five methods: logistic regression, covariate balancing propensity score (CBPS), covariate balancing scoring rules (CBSR), propensity score with local balance 1 (PSLB 1), and propensity score with local balance 2 (PSLB 2).
5. Conclusions and discussions
The significance of the proposed methods lies in their ability to address critical challenges associated with PS-based causal effect estimation, particularly in the context of PS model misspecification. A correctly specified PS model theoretically ensures balanced covariate distributions between treatment groups.2,3,60,61 Consequently, when the sample size is very large, the estimated PS balances the covariates well and the imbalance score penalty term is negligible. When the sample size is limited, adding an imbalance penalty may correct the imbalance caused by sampling variation, thereby increasing precision. However, if the PS model is misspecified, the covariates may remain imbalanced even if the model accurately predicts treatment assignment. This misalignment can lead to biased estimates of the ATE. To address this issue, we developed robust PS estimation methods by calibrating the loss functions of predictive PS models. This adjustment accounts for the dual roles of the PS: predicting treatment assignment and balancing covariates. The proposed methods may improve upon prediction-focused methods when there is model misspecification due to complex, non-linear relationships among covariates. Our approach can mitigate bias and reduce RMSE in the estimation of the average treatment effect, particularly in scenarios involving model misspecification.
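The loss-calibration idea can be sketched as follows: the standard cross-entropy loss of the PS model is augmented with an imbalance score penalty. This is a minimal sketch, assuming the imbalance score is the sum of squared differences between the inverse-PS-weighted covariate means of the two arms; the paper's exact penalty and its PyTorch implementation may differ in detail:

```python
import numpy as np

def calibrated_loss(ps, t, X, lam=1.0):
    """Cross-entropy loss plus a covariate imbalance penalty.

    ps  : predicted propensity scores from the base model
    t   : binary treatment indicators (0/1)
    X   : covariate matrix (n x p)
    lam : tuning parameter controlling the imbalance penalty
    """
    ps = np.clip(ps, 1e-6, 1 - 1e-6)
    # standard prediction loss (binary cross-entropy)
    bce = -np.mean(t * np.log(ps) + (1 - t) * np.log(1 - ps))
    # inverse-PS-weighted covariate means in each arm
    w1 = t / ps
    w0 = (1 - t) / (1 - ps)
    m1 = (w1[:, None] * X).sum(axis=0) / w1.sum()
    m0 = (w0[:, None] * X).sum(axis=0) / w0.sum()
    # imbalance score: squared weighted mean differences
    imbalance = np.sum((m1 - m0) ** 2)
    return bce + lam * imbalance
```

A larger `lam` pushes the fitted PS toward exact balance of the penalized covariate moments, at the cost of pure prediction accuracy.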
As demonstrated in the simulation studies, when the model was misspecified, prediction-accuracy-based PS estimation methods such as logistic regression and neural networks could result in large bias and variance in the ATE estimation. This problem was mainly driven by a small number of highly influential inverse probability weights that led to biased estimates and excessive variance. By adding a penalty term for covariate imbalance to the loss function, the proposed methods for PS estimation can achieve better covariate balance and eliminate the highly influential inverse probability weights, which led to a substantial reduction in the bias and variance of the ATE estimators. The proposed method also enables a better or even exact balance of observed covariates, depending on the strength of the imbalance penalty. This type of balance shares a similar spirit with the entropy balancing method 18 and the CBPS method, 20 both of which achieve exact balance.
The DR methods have been developed to address potential misspecifications in either the PS or outcome models. These methods can provide consistent estimates even if one of the two models is misspecified, provided the other model is correctly specified. However, it has been reported that the DR estimator may not always outperform a single outcome model when both the PS and outcome models are slightly misspecified. 11 Through simulation studies (results not shown), we found that the DR estimator with NNet+IS as the PS model clearly outperformed the single misspecified outcome model, and the use of GLM+IS also showed improved performance. This highlights the ability of our proposed method to enhance the DR estimator's performance when both models are misspecified.
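The DR estimator discussed above can be sketched in its augmented inverse probability weighting (AIPW) form. This is an illustrative snippet, not the paper's code; `mu1` and `mu0` denote fitted outcome-model predictions under treatment and control:

```python
import numpy as np

def dr_ate(y, t, ps, mu1, mu0):
    """Doubly robust (AIPW) estimate of the ATE.

    The estimate is consistent if either the PS model (ps) or the
    outcome model (mu1, mu0) is correctly specified.
    """
    # outcome-model prediction plus an IPW-weighted residual correction
    aug1 = mu1 + t * (y - mu1) / ps
    aug0 = mu0 + (1 - t) * (y - mu0) / (1 - ps)
    return np.mean(aug1 - aug0)
```

When the outcome model is correct, the residual correction vanishes in expectation; when the PS model is correct, the correction removes the bias of a misspecified outcome model.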
Our methods may help researchers avoid the ad hoc manipulation of estimated PSs or weights, such as trimming or truncating the weights. As shown in the simulation study, the PS estimation methods based on the calibrated loss function were more robust to model misspecification. In cases where either the PS model or the outcome model was correctly specified, the proposed calibrated version outperformed the prediction-based base model even for the DR estimator, especially in small sample sizes. When the PS model was correctly specified, the proposed methods could still reduce the variance in ATE estimation. In comparison to the logistic regression model, the neural network model appeared to perform better under model misspecification. When logistic regression was used as the prediction model for PS estimation, our proposed loss function calibration method and the CBPS method yielded very similar results for the Hájek and DR estimators. Both methods aim to optimize prediction and covariate balance at the same time. While CBPS achieves covariate balancing within the framework of the generalized method of moments, our approach is based on the calibrated loss function, which is more flexible and can accommodate covariate imbalance measures beyond the marginal moment difference of each covariate. The penalty term accounting for covariate imbalance can also vary across covariates depending on the strength of the confounding effects. Lastly, our approach facilitates the use of other variable selection methods for predicting the PS, such as the least absolute shrinkage and selection operator (LASSO). 62 Specifically, when the dimension of the covariates is high, an additional penalty can be incorporated to shrink the coefficients, thereby facilitating variable selection in conjunction with the proposed imbalance score penalty.
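One way the LASSO-type shrinkage could be combined with the imbalance calibration is sketched below. This is a hypothetical illustration, not the paper's implementation: the function name, the specific imbalance score (squared weighted mean differences), and the penalty form are ours.

```python
import numpy as np

def lasso_calibrated_loss(beta, X, t, lam_imb=1.0, lam_l1=0.1):
    """Logistic loss + imbalance penalty + L1 (LASSO-type) penalty.

    beta    : PS-model coefficients (p,)
    X       : covariate matrix (n x p), t : treatment indicators
    lam_imb : imbalance penalty strength, lam_l1 : L1 strength
    """
    ps = 1.0 / (1.0 + np.exp(-(X @ beta)))        # logistic PS model
    ps = np.clip(ps, 1e-6, 1 - 1e-6)
    bce = -np.mean(t * np.log(ps) + (1 - t) * np.log(1 - ps))
    w1, w0 = t / ps, (1 - t) / (1 - ps)
    m1 = (w1[:, None] * X).sum(axis=0) / w1.sum()
    m0 = (w0[:, None] * X).sum(axis=0) / w0.sum()
    imbalance = np.sum((m1 - m0) ** 2)
    return bce + lam_imb * imbalance + lam_l1 * np.sum(np.abs(beta))
```

Minimizing this objective over `beta` would perform coefficient shrinkage (and hence variable selection) jointly with the balance calibration.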
The proposed approach offers several advantages. First, it is robust to mild-to-moderate model misspecification: even if the PS model is misspecified, the covariate balance is still controlled to a certain degree. Second, it is user-friendly, reducing the need for back-and-forth covariate-balance checks and model refits. Third, the framework is flexible in that a variety of prediction models can be chosen, from the logistic regression model to machine learning methods such as neural networks; the only modification required is a simple calibration of the loss function. Lastly, the robust PS estimates can be seamlessly integrated with a wide range of standard causal effect estimation methods, such as inverse probability weighting, matching, and stratification. Although the proposed method primarily focuses on enhancing PS estimation, integrating neural networks or other non-parametric approaches for outcome modeling may further enhance the performance of the DR estimators. This integration allows for more flexible modeling, especially when the true outcome model formulation is unclear and likely to be misspecified by parametric regression models. 63
This work has several limitations. First, we considered only a continuous outcome without censoring; however, extending our algorithms to other types of outcomes, such as binary and survival outcomes, is straightforward. Second, we investigated the most common scenario of a binary treatment; extensions to multi-level treatment groups warrant further study. Third, while the proposed method is robust to misspecification of the functional form of the PS model, it still requires the absence of unmeasured confounding, as does any observational study. Finally, we applied an equal penalty to the imbalance of all covariates. In high-dimensional settings, confounder selection methods for causal inference are needed to extend our proposed method to cases with high-dimensional covariates. 32 Moreover, achieving balance across all covariates may not always be feasible in such settings. An outcome-assisted modeling approach motivated by Shortreed et al. 43 may be considered to adjust the penalty level for each covariate imbalance depending on the strength of the covariate-outcome association.
6. Implementation and software
The computational complexity of a neural network depends primarily on the architecture of the network. We ran our simulation studies on an Intel Xeon E5-2680 v3 CPU (2.50 GHz) with 270 GB RAM. With a single thread, implementing the NNet+IS method for a given tuning parameter took roughly <40 seconds for a sample size of 200 and <50 seconds for a sample size of 500. A large tuning parameter search range and cross-validation also increase the computational cost. We used PyTorch's autograd package, which offers automatic differentiation for all tensor operations, enabling efficient gradient computation even with the imbalance penalty term in the loss function. Our code supports parallel computing to reduce the computation time, which is extremely helpful when running on high-performance computing machines. We implemented the simulations and the real data application in R 4.2.1 and Python 3.11.4. The source code is publicly available on GitHub: https://github.com/ys3298/RobustPSviaLossCalibration.
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.
ORCID iD: Yimeng Shang https://orcid.org/0000-0003-3770-4143
References
- 1. Faries DE, Leon AC, Haro JM, et al. Analysis of observational health care data using SAS. Vol. 452. Cary, NC: SAS Institute, 2010.
- 2. Hernán MA, Robins JM. Causal inference: what if. Boca Raton: Chapman & Hall/CRC, 2020.
- 3. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41–55.
- 4. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res 2011; 46: 399–424.
- 5. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med 2004; 23: 2937–2960.
- 6. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc 1984; 79: 516–524.
- 7. Caliendo M, Kopeinig S. Some practical guidance for the implementation of propensity score matching. J Econ Surv 2008; 22: 31–72.
- 8. Vansteelandt S, Daniel RM. On regression adjustment for the propensity score. Stat Med 2014; 33: 4053–4072.
- 9. Drake C. Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics 1993; 49: 1231–1236.
- 10. Wyss R, Ellis AR, Brookhart MA, et al. The role of prediction modeling in propensity score estimation: an evaluation of logistic regression, bCART, and the covariate-balancing propensity score. Am J Epidemiol 2014; 180: 645–655.
- 11. Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 2007; 22: 523–539.
- 12. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med 2010; 29: 337–346.
- 13. Zhang Z, Kim HJ, Lonjon G, et al. Balance diagnostics after propensity score matching. Ann Transl Med 2019; 7: 16.
- 14. Yang S, Lorenzi E, Papadogeorgou G, et al. Propensity score weighting for causal subgroup analysis. Stat Med 2021; 40: 4294–4309.
- 15. Cannas M, Arpino B. A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting. Biom J 2019; 61: 1049–1072.
- 16. Setoguchi S, Schneeweiss S, Brookhart MA, et al. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf 2008; 17: 546–555.
- 17. Li Y, Li L. Propensity score analysis methods with balancing constraints: a Monte Carlo study. Stat Methods Med Res 2021; 30: 1119–1142.
- 18. Hainmueller J. Entropy balancing for causal effects: a multivariate reweighting method to produce balanced samples in observational studies. Polit Anal 2012; 20: 25–46.
- 19. Zubizarreta JR. Stable weights that balance covariates for estimation with incomplete outcome data. J Am Stat Assoc 2015; 110: 910–922.
- 20. Imai K, Ratkovic M. Covariate balancing propensity score. J R Stat Soc Ser B Stat Methodol 2014; 76: 243–263.
- 21. Zhao Q. Covariate balancing propensity score by tailored loss functions. Ann Stat 2019; 47: 965–993.
- 22. Li Y, Li L. Propensity score analysis with local balance. Stat Med 2023; 42: 2637–2660.
- 23. Li L, Rong S, Wang R, et al. Recent advances in artificial intelligence and machine learning for nonlinear relationship analysis and process control in drinking water treatment: a review. J Chem Eng 2021; 405: 126673.
- 24. McCaffrey DF, Ridgeway G, Morral AR. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol Methods 2004; 9: 403–425.
- 25. Moodie EE, Stephens DA. Treatment prediction, balance, and propensity score adjustment. Epidemiology 2017; 28: e51–e53.
- 26. Angrist JD, Imbens GW, Rubin DB. Identification of causal effects using instrumental variables. J Am Stat Assoc 1996; 91: 444–455.
- 27. Imbens GW, Rubin DB. Causal inference in statistics, social, and biomedical sciences. New York: Cambridge University Press, 2015.
- 28. Westreich D, Lessler J, Funk MJ. Propensity score estimation: machine learning and classification methods as alternatives to logistic regression. J Clin Epidemiol 2010; 63: 826–833.
- 29. Ferri-García R, Rueda MdM. Propensity score adjustment using machine learning classification algorithms to control selection bias in online surveys. PLoS ONE 2020; 15: e0231500.
- 30. Sheskin DJ. Handbook of parametric and nonparametric statistical procedures. New York: Chapman & Hall/CRC, 2003.
- 31. Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med 2015; 34: 3661–3679.
- 32. Tang D, Kong D, Pan W, et al. Ultra-high dimensional variable selection for doubly robust causal inference. Biometrics 2023; 79: 903–914.
- 33. Popescu MC, Balas VE, Perescu-Popescu L, et al. Multilayer perceptron and neural networks. WSEAS Trans Circuits Syst 2009; 8: 579–588.
- 34. Livingstone DJ, Manallack DT, Tetko IV. Data modelling with neural networks: advantages and limitations. J Comput Aided Mol Des 1997; 11: 135–142.
- 35. Almeida JS. Predictive non-linear modeling of complex data by artificial neural networks. Curr Opin Biotechnol 2002; 13: 72–76.
- 36. Shin-ike K. A two phase method for determining the number of neurons in the hidden layer of a 3-layer neural network. In: Proceedings of SICE Annual Conference 2010, pp.238–242. IEEE.
- 37. Panchal G, Ganatra A, Kosta Y, et al. Behaviour analysis of multilayer perceptrons with multiple hidden neurons and hidden layers. Int J Comput Theory Eng 2011; 3: 332–337.
- 38. Kuri-Morales A. Closed determination of the number of neurons in the hidden layer of a multi-layered perceptron network. Soft Comput 2017; 21: 597–609.
- 39. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- 40. Ward R, Wu X, Bottou L. AdaGrad stepsizes: sharp convergence over nonconvex landscapes. J Mach Learn Res 2020; 21: 1–30.
- 41. Bottou L. Stochastic gradient descent tricks. In: Neural networks: tricks of the trade. 2nd ed. Springer, 2012, pp.421–436.
- 42. Griffin BA, McCaffrey DF, Almirall D, et al. Chasing balance and other recommendations for improving nonparametric propensity score models. J Causal Inference 2017; 5: 20150026.
- 43. Shortreed SM, Ertefaie A. Outcome-adaptive LASSO: variable selection for causal inference. Biometrics 2017; 73: 1111–1122.
- 44. Berrar D. Cross-validation. In: Ranganathan S, Gribskov M, Nakai K, et al. (eds) Encyclopedia of bioinformatics and computational biology. Vol. 1. Amsterdam: Elsevier, 2019, pp.542–545.
- 45. Banerjee C, Mukherjee T, Pasiliao E Jr. An empirical study on generalizations of the ReLU activation function. In: Proceedings of the 2019 ACM Southeast Conference, pp.164–167.
- 46. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 1952; 47: 663–685.
- 47. Hirano K, Imbens GW, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 2003; 71: 1161–1189.
- 48. Funk MJ, Westreich D, Wiesen C, et al. Doubly robust estimation of causal effects. Am J Epidemiol 2011; 173: 761–767.
- 49. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 1994; 89: 846–866.
- 50. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005; 61: 962–973.
- 51. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models. J Am Stat Assoc 1999; 94: 1096–1120.
- 52. Glynn AN, Quinn KM. An introduction to the augmented inverse propensity weighted estimator. Polit Anal 2010; 18: 36–56.
- 53. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38: 2074–2102.
- 54. Connors AF, Speroff T, Dawson NV, et al. The effectiveness of right heart catheterization in the initial care of critically ill patients. JAMA 1996; 276: 889–897.
- 55. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009; 28: 3083–3107.
- 56. What Works Clearinghouse standards handbook, version 4.1. https://ies.ed.gov/ncee/wwc/handbooks (2020).
- 57. Greifer N. Covariate balance tables and plots: a guide to the cobalt package, https://cran.r-project.org/web/packages/cobalt/vignettes/cobalt.html#ref-stuartPrognosticScorebasedBalance2013 (2024, accessed 15 August 2024).
- 58. Keele L, Small DS. Comparing covariate prioritization via matching to machine learning methods for causal inference using five empirical applications. Am Stat 2021; 75: 355–363.
- 59. Frank HA, Karim ME. Implementing TMLE in the presence of a continuous outcome. Res Methods Med Health Sci 2023; 5: 26320843231176662.
- 60. Chesnaye NC, Stel VS, Tripepi G, et al. An introduction to inverse probability of treatment weighting in observational research. Clin Kidney J 2022; 15: 14–20.
- 61. Li Y, Kuo YF, Li L. Propensity score analysis with guaranteed subgroup balance. arXiv preprint arXiv:2404.11713, 2024.
- 62. Osborne MR, Presnell B, Turlach BA. On the LASSO and its dual. J Comput Graph Stat 2000; 9: 319–337.
- 63. Chernozhukov V, Chetverikov D, Demirer M, et al. Double/debiased machine learning for treatment and structural parameters. Econom J 2018; 21: C1–C68.