Health Services Research. 2012 Dec 6;48(4):1487–1507. doi: 10.1111/1475-6773.12020

Squeezing the Balloon: Propensity Scores and Unmeasured Covariate Balance

John M Brooks 1, Robert L Ohsfeldt 2
PMCID: PMC3725536  PMID: 23216471

Abstract

Objective

To assess the covariate balancing properties of propensity score-based algorithms in which covariates affecting treatment choice are both measured and unmeasured.

Data Sources/Study Setting

A simulation model of treatment choice and outcome.

Study Design

Simulation.

Data Collection/Extraction Methods

Eight simulation scenarios varied with the values placed on measured and unmeasured covariates and the strength of the relationships between the measured and unmeasured covariates. The balance of both measured and unmeasured covariates was compared across patients either grouped or reweighted by propensity scores methods.

Principal Findings

Propensity score algorithms require unmeasured covariate variation that is unrelated to measured covariates, and they exacerbate the imbalance in this variation between treated and untreated patients relative to the full unweighted sample.

Conclusions

The balance of measured covariates between treated and untreated patients has opposite implications for unmeasured covariates in randomized and observational studies. Measured covariate balance between treated and untreated patients in randomized studies reinforces the notion that all covariates are balanced. In contrast, forced balance of measured covariates using propensity score methods in observational studies exacerbates the imbalance in the independent portion of the variation in the unmeasured covariates, which can be likened to squeezing a balloon. If the unmeasured covariates affecting treatment choice are confounders, propensity score methods can exacerbate the bias in treatment effect estimates.

Keywords: Propensity scores, covariate balance, matching, binning, assumptions, simulation


The strength of randomized controlled trials (RCTs) is the assumption that randomized treatment assignment yields a balanced distribution of covariates thought to be related to outcome between the treatment and control groups (Rubin 2001). Published studies of RCT results traditionally report a table displaying the balance in measured covariates (e.g., patient age, gender, baseline clinical conditions, etc.) between the treatment and control groups. Demonstrated balance of measured covariates across treatment groups is intended to lend credence that such balance extends to unmeasured covariates (Berk 2004). In the context of observational (nonrandomized) data, researchers have espoused designing treatment effect studies that mimic the measured covariate balancing properties of RCTs (Rosenbaum and Rubin 1983a,b; Rubin 1997, 2001, 2007; Joffe and Rosenbaum 1999; Shah et al. 2005). The use of a propensity score (PS)—the probability a patient received treatment given the patient's measured covariate values—has become a mainstay in efforts to find measured covariate balance in observational data studies to estimate treatment effects. It has been said that PS-based methods “can be used to design observational studies in a way analogous to the way randomized experiments are designed” (Rubin 2001) with a design attempting to “assemble groups of treated and control units such that within each group the distributions of covariates is balanced” (Rubin 2001). While methodologists are quick to qualify that achieving balance in measured covariates between groups of treated and untreated patients does not “guarantee” balance in unmeasured covariates across groups, measured covariate balance often creates an “expectation” of unmeasured covariate balance as in RCTs (Ward and Johnson 2008). 
Indeed, a review of the PS literature noted that “many of the articles in our review” imply that “propensity scores might also balance the unknown confounders between exposure groups” (Shah et al. 2005).

Several PS-based algorithms have been suggested to create patient samples that are balanced in measured covariates between treated and untreated patients. These algorithms range from stratification (D'Agostino 1998) and matching based on propensity scores (Hall, Summers, and Oberchain 2003; Frolich 2007; Stuart 2010) to using patient-specific propensity scores to weight observations (Rosenbaum 1987; Robins, Hernan, and Brunback 2000). Treatment effect inferences are then made by contrasting average outcomes between treated and untreated patients with similar propensity scores (and correspondingly similar distributions of measured covariates). These algorithms yield unbiased treatment effect estimates only if after balancing measured covariates, unmeasured covariates are “ignorable” or that the remaining unmeasured covariates that affected treatment choice are independent of outcome (Rosenbaum and Rubin 1983a,b; Joffe and Rosenbaum 1999). Unmeasured covariates affecting treatment choice are ignorable if either (1) they have no relationship (either directly or indirectly) with outcome, or (2) they are balanced between treatment and control groups after balancing measured covariates. Neither of these conditions can be verified directly with data available to researchers. The condition that the unmeasured covariates affecting treatment choice have no relationship with outcome is identical to the assumption required to yield unbiased estimates in standard multivariate regression-based treatment effect estimators—treatment is orthogonal to the error term in the outcome relationship after adjusting for the measured covariates included in the regression model (Angrist and Pischke 2009). Stated differently, this condition assumes that none of the unmeasured covariates affecting treatment choice confound the relationship between treatment and outcome. This orthogonal assumption requires theory-based persuasion by researchers for acceptance. 
Therefore, the conceptual advantage of PS-based methods relative to standard regression appears to hinge on the assumption that balancing measured covariates between treated and nontreated patients leads to unmeasured covariate balance between treated and nontreated patients. If this assumption holds, unbiased treatment effect estimates can be obtained without relying on theory to support the orthogonal assumption.

However, PS-based analyses of treatment effects using observational data largely ignore what seems to be a fundamental question—why did patients with the same or similar propensity scores receive different treatments? Intuitively, it would seem that unmeasured factors not accounted for in the PS model must be different between two patients with similar propensity scores for them to receive different treatments. Let patient utility associated with treatment U(T) and no treatment U(NT) be represented in terms of measured (XM) and unmeasured covariates (XU):

$$U(T) = \alpha_0 + \alpha_1 X_M + \alpha_2 X_U \tag{1}$$
$$U(NT) = \beta_0 + \beta_1 X_M + \beta_2 X_U \tag{2}$$

The measured and unmeasured covariates in equations (1) and (2) represent any factors affecting the utility of treatment versus no treatment for the patient. These covariates could represent factors related to patient preferences over the outcome changes induced by treatment choice (e.g., an actor may value facial changes from cosmetic surgery more than a construction worker) or factors affecting the relative effectiveness of treatment (e.g., a child with an ear infection and a high fever will expect more benefit from an antibiotic than a child with an ear infection and a low fever). A patient will choose treatment if the net utility gain from treatment—NG(T)—is positive:

$$NG(T) = U(T) - U(NT) = (\alpha_0 - \beta_0) + (\alpha_1 - \beta_1) X_M + (\alpha_2 - \beta_2) X_U > 0 \tag{3}$$

Based on equation (3), patient treatment choices depend on their respective values of XM and XU. If (α2 − β2) > 0, treated patients will tend to have higher average values of XU than untreated patients, but with XM also varying across patients it may be possible to find treated patients with low values of XU and untreated patients with high values of XU.

If, however, two patients A and B are matched to have identical values of the measured covariate ($X_M^A = X_M^B$), and patient A chooses treatment and patient B does not, it must be that:

$$NG(T)_A > 0 \geq NG(T)_B \tag{4}$$

where, for patient i, NG(T)i equals the net gain of treatment and $X_U^i$ equals i's value of XU. With a fixed value of XM, for equation (4) to hold it must be that $(\alpha_2 - \beta_2) X_U^A > (\alpha_2 - \beta_2) X_U^B$. If (α2 − β2) > 0, treated patients with matched XM values must have higher values of XU than untreated patients. Therefore, across a set of treated and nontreated patients matched on XM, we would expect greater average differences in XU than the average differences in XU between the population of treated and nontreated patients not matched by XM.
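The matched-pair argument can be written out in one step. The sketch below assumes that net gain is linear in the two covariates, with coefficient (α1 − β1) on XM and (α2 − β2) on XU; the intercept term and the superscript notation for patients A and B are assumptions made for illustration.

```latex
\begin{align*}
NG(T)_A - NG(T)_B
  &= (\alpha_1 - \beta_1)\,(X_M^A - X_M^B) + (\alpha_2 - \beta_2)\,(X_U^A - X_U^B) \\
  &= (\alpha_2 - \beta_2)\,(X_U^A - X_U^B)
     \qquad \text{because } X_M^A = X_M^B \\
  &> 0
     \qquad \text{because } NG(T)_A > 0 \geq NG(T)_B .
\end{align*}
% Hence, if (\alpha_2 - \beta_2) > 0, then X_U^A > X_U^B for every matched pair.
```

Any intercept cancels in the difference, so the conclusion does not depend on whether the utility equations include one.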

In this study, we demonstrate the covariate balancing properties of PS-based algorithms through the lens of a simple treatment choice simulation model in which covariates affecting treatment choice are both measured and unmeasured. Prior simulation-based research showed that imbalance in unmeasured covariates related to treatment assignment remains after using PS-based algorithms (Austin, Grootendorst, and Anderson 2007). Others have described the extent to which treatment effect estimates from propensity score-based approaches are sensitive to imbalance in unobserved covariates (Rosenbaum and Rubin 1983a,b; Lin, Psaty et al. 1998). However, it has not been shown how PS-based algorithms affect the balance of unmeasured covariates between treated and untreated patients.

In our simulations, we find properties that are problematic for researchers hoping to make treatment effect inferences relying only on the expectation that balanced measured covariates imply balanced unmeasured covariates. To yield treated and untreated patients with similar propensity scores, we find that PS algorithms require imbalance in the portion of the variation of the unmeasured covariates that affect treatment choice that is unrelated to the measured covariates. In addition, as compared with the full unweighted sample, PS algorithms exacerbate the imbalance in the portion of the unmeasured covariates unrelated to the measured covariates between treated and untreated patients. This result is directly counter to the assumption often relied on in applications of propensity score methods that balancing measured covariates implies balance in the unmeasured covariates that affected treatment choice (Shah et al. 2005).

Methods

Simulation Model Structure

We modified a simple simulation model of treatment choice and outcome that was used in previous research (Brooks and Fang 2009). In this model, covariates affecting treatment choice are divided between those measured and unmeasured by a subsequent researcher. A propensity score is estimated for each simulated patient using the measured covariate. In simulations, the unmeasured covariates affecting treatment choice are distinguished by their assumed relationship with the model outcome. For simplicity, outcome in the model is defined as being “cured” from a given condition. Patients can either choose the “treatment” or the “alternative” (e.g., another treatment, watchful waiting). Patients can be cured using the alternative, but the treatment increases the cure probability relative to the alternative. Patients value being cured, but treatment is more costly relative to the alternative. Treatment is specified as a binary variable—T, where T = 1 if the patient chooses treatment, and 0 if the patient chooses the alternative. The probability of a cure, P(C), is specified in the following manner:

$$P(C) = \beta_T T + \beta_M X_M + \beta_{U1} X_{U1} + \beta_{U2} X_{U2} + \varepsilon \tag{5}$$

βT equals the increase in the probability of cure relative to the alternative for a patient that chooses treatment. The vector β contains the parameters βT, βM, βU1, and βU2 in equation (5) that equal the changes in the probability of cure related to T, XM, XU1, and XU2, respectively. Covariates XM, XU1, XU2, and ε have direct effects on P(C) and are distinguished by the assumption that XM is measured and available for subsequent research, whereas XU1, XU2, and ε are not. Specifically, ε represents the accumulated other risk factors related to cure that are not related to treatment choice.

The model assumes that patients consult with their providers to gain knowledge of equation (5) and form a treatment valuation relative to the alternative—T*—that is based on the value patients associate with a cure, the cost of treatment, and the effect on treatment valuation from patient-specific covariates:

$$T^* = V \beta_T - S + \alpha_M X_M + \alpha_{U1} X_{U1} + \alpha_{U3} X_{U3} + \alpha_{U4} X_{U4} \tag{6}$$

where V is the value patients place on being cured; S equals the incremental costs associated with the treatment relative to the alternative; XM and XU1 are defined as in equation (5), and each was specified to have a direct effect on treatment valuation. XU3 and XU4 are additional unmeasured covariates affecting treatment valuation. XU3 is specified to have an indirect effect on cure through a correlation with the covariate XU2 that is specified in equation (5):

[Equation (7): image in the original, not reproduced in this text version]

An intuitive example of an XU2, XU3 combination could be XU3 (patient health preferences) and XU2 (healthy diet). A patient with higher health preferences may place greater value on treatment (αU3 > 0) and patient health preferences and healthy diet are likely positively correlated. XU4 affects treatment valuation but has no direct or indirect effect on cure and, given these properties, would be an instrumental variable if it was measured. The vector α contains the covariate value weights αM, αU1, αU3, and αU4 in equation (6) that represent changes in treatment valuation from changes in XM, XU1, XU3, and XU4, respectively.

Following standard discrete choice theory (Ben-Akiva and Lerman 1985), the patient chooses treatment (T) if T* is greater than zero, or:

$$T = \begin{cases} 1 & \text{if } T^* > 0 \\ 0 & \text{otherwise} \end{cases} \tag{8}$$
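The choice rule can be illustrated with a short simulation. This is a minimal sketch, not the authors' SAS code: it assumes the treatment valuation is the linear sum V·βT − S plus the weighted covariates, uses the scenario 2 parameter values from Table 1, and draws independent uniform(0, 1) covariates with NumPy. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Table 1, scenario 2 parameter values.
V, beta_T, S = 1500.0, 0.2, 500.0
alpha_M = alpha_U1 = alpha_U3 = alpha_U4 = 100.0

# Independent uniform(0, 1) covariates, as described for scenarios 1 to 3.
X_M, X_U1, X_U3, X_U4 = rng.uniform(0.0, 1.0, size=(4, n))

# Treatment valuation relative to the alternative (assumed linear form).
T_star = (V * beta_T - S + alpha_M * X_M + alpha_U1 * X_U1
          + alpha_U3 * X_U3 + alpha_U4 * X_U4)

# Discrete choice rule: treatment is chosen when the valuation is positive.
T = (T_star > 0).astype(int)

share_treated = T.mean()
diff_X_M = X_M[T == 1].mean() - X_M[T == 0].mean()
print(f"share treated: {share_treated:.3f}")
print(f"treated minus untreated mean of X_M: {diff_X_M:.3f}")
```

With these values roughly half of the simulated patients choose treatment, and treated patients have a visibly higher mean of XM than untreated patients in the full sample.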

Figure 1 contains a directed acyclic graph summarizing the model relationships.

Figure 1. Directed Acyclic Graph of Simulation Model Relationships

XU1 and XU2 represent unmeasured confounding variables in the estimation of the relationship between the likelihood of a cure C and T when only adjusting for XM:

$$P(C) = \beta_T T + \beta_M X_M + \text{error} \tag{9}$$

where error = (βU1XU1 + βU2XU2 + ε). Because XU1 and XU2 are in the error term of this relationship, T would not be orthogonal to the error term as XU1 is directly related to T and XU2 is indirectly related to T via its correlation with XU3. As the orthogonal assumption is not valid in this model, to make inferences on βT from equation (5) using PS algorithms, it must be assumed that balancing XM between treated and untreated patients yields balance in XU1 and XU3 (and indirectly XU2) between treated and untreated patients.

We found little discussion in the literature describing how the characteristics of the relationships among the covariates and outcomes affect the ability to balance unmeasured covariates between treated and untreated patients using PS algorithms. Here, we theorized that unmeasured covariate balance will be affected by the relative covariate value weights (α) placed on measured and unmeasured covariates in the treatment valuation equation (6). We suspected that smaller value weights on the unmeasured covariates (αU1, αU3, αU4) relative to the measured covariate (αM) will require greater differences in the actual unmeasured covariate values to yield treated and untreated patients with matched measured covariates. In addition, it has been suggested that relationships among the measured and unmeasured covariates may enable the measured covariates to serve as proxies for the unmeasured covariates and lead to balance in the unmeasured covariates (Schneeweiss et al. 2009; Stuart 2010). To evaluate this suggestion, the simulation model was constructed to specify relationships between the XM and the unmeasured covariates XU1, XU3, and XU4.

We suspected that the stronger the relationships between the measured and unmeasured covariates, the less independent influence unmeasured covariates will have on treatment choice, and the more difficult it will be to find matched treated and untreated patients based on propensity scores generated by the measured covariate. As a result, we expected that as the strength of the relationship between the measured and unmeasured covariates increases, we would find greater imbalance between matched treated and untreated patients in the variation in each unmeasured covariate that is unrelated to the measured covariate.

Simulation Approach

Eight simulation scenarios were specified that varied with the value weights (α) placed on the measured and unmeasured covariates in equation (6) and the level of relationship between the measured covariate and the unmeasured covariates. In each scenario, 100,000 simulated patient observations were generated and the values of the parameters V, S, α, and β were identical for all simulated patients. Table 1 contains the parameter values used in the eight simulation scenarios. Scenarios 1, 2, and 3 were constructed with no relationships among the measured and unmeasured covariates in equation (6), and the simulations varied only by the value weights (α) assigned to each covariate. Value weights were assigned to the covariates so that the value of the unmeasured covariates fell relative to that of the measured covariate moving from scenarios 1 through 3 (αU1, αU3, and αU4 equaled 200, 100, and 50 in scenarios 1, 2, and 3, respectively, whereas αM equaled 100 across all three scenarios). To better contrast the differences in the covariate balancing properties of propensity score methods across scenarios 1, 2, and 3, the treatment cost parameter (S) was adjusted to ensure that the expected treatment valuation (T*) in each scenario equaled zero, so that approximately 50 percent of the patients in each scenario chose treatment. In these scenarios, the values of the covariates themselves (XM, XU1, XU3, and XU4) for each simulated patient were randomly sampled from independent uniform distributions on (0, 1) using the RANUNI function within the SAS IML procedure. This approach ensured the independence of each covariate across patients. XU2 was then estimated to ensure a correlation with XU3 using the following linear relationship:

[Equation (10): image in the original, not reproduced in this text version]

where D is an independent uniform random variable (0,1).

Table 1. Model Parameters across Simulation Scenarios

| Parameter | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 | Scenario 5 | Scenario 6 | Scenario 7 | Scenario 8 |
|---|---|---|---|---|---|---|---|---|
| V | 1,500 | 1,500 | 1,500 | 1,500 | 1,500 | 1,500 | 1,500 | 1,500 |
| βT | .2 | .2 | .2 | .2 | .2 | .2 | .2 | .2 |
| S | 650 | 500 | 425 | 500 | 500 | 500 | 500 | 500 |
| αM | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| αU1 | 200 | 100 | 50 | 100 | 100 | 100 | 100 | 100 |
| αU3 | 200 | 100 | 50 | 100 | 100 | 100 | 100 | 100 |
| αU4 | 200 | 100 | 50 | 100 | 100 | 100 | 100 | 100 |
| Weight* | 0 | 0 | 0 | .1 | .3 | .5 | .7 | .9 |
| (R2)† | .042 | .168 | .434 | .289 | .559 | .691 | .735 | .750 |

* The value used in equation (11) relating unmeasured covariate i to the measured confounder via $X_{Ui} = \text{weight} \cdot X_M + (1 - \text{weight}) \cdot E_i$, where Ei is a uniform random variable (0,1).

† Proportion of the variation in T explained by XM in the propensity score equation (12).
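The cost values S listed in Table 1 for scenarios 1 to 3 can be checked by hand. The sketch below assumes that expected treatment valuation is V·βT − S plus each value weight times the covariate mean of 0.5 (the mean of a uniform(0, 1) draw), and solves for the S that sets that expectation to zero. The function name is hypothetical.

```python
V, beta_T = 1500.0, 0.2
alpha_M = 100.0
alpha_U_by_scenario = {1: 200.0, 2: 100.0, 3: 50.0}  # shared by XU1, XU3, XU4
MEAN_X = 0.5  # mean of a uniform(0, 1) covariate


def zero_mean_cost(alpha_U):
    """Cost S for which the expected valuation is zero (assumed linear form)."""
    return V * beta_T + MEAN_X * (alpha_M + 3.0 * alpha_U)


costs = {scenario: zero_mean_cost(a) for scenario, a in alpha_U_by_scenario.items()}
print(costs)
```

The three results reproduce the S row of Table 1 (650, 500, and 425), which supports reading the cost adjustment as a centering of the valuation at zero.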

In scenarios 4 through 8, the covariate value weights (α) were specified as in scenario 2, but the covariate values of XU1, XU3, and XU4 for each simulated patient were estimated with relationships with XM based on the following linear models:

$$X_{Ui} = \text{weight} \cdot X_M + (1 - \text{weight}) \cdot E_i, \quad i \in \{1, 3, 4\} \tag{11}$$

where each Ei was an independent uniform random variable (0,1). In scenarios 2, 4, 5, 6, 7, and 8, “weight” was set to 0, .1, .3, .5, .7, and .9, respectively, so that the strength of the relationship between the measured and unmeasured covariates increases across scenarios. In each of these scenarios, XU2 was then estimated using equation (10) and the covariate XU3.
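To see how the "weight" setting controls the strength of the relationship, the following sketch generates one unmeasured covariate per weight value and reports its correlation with XM. It assumes the weighted-average form XU = weight·XM + (1 − weight)·E, which is consistent with weight 0 giving independent covariates; the exact form used by the authors is not reproduced in this text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
X_M = rng.uniform(0.0, 1.0, n)

weights = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9]  # scenarios 2, 4, 5, 6, 7, 8
correlations = []
for w in weights:
    E = rng.uniform(0.0, 1.0, n)      # independent uniform noise term
    X_U = w * X_M + (1.0 - w) * E     # assumed weighted-average relationship
    correlations.append(np.corrcoef(X_M, X_U)[0, 1])

for w, r in zip(weights, correlations):
    print(f"weight={w:.1f}  corr(X_M, X_U)={r:.3f}")
```

Under this form the correlation is w / sqrt(w² + (1 − w)²), so it rises from zero at weight 0 to nearly one at weight .9, leaving little independent variation in the unmeasured covariate.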

Propensity Score Approaches Using the Simulated Data

Once treatment choices were generated for the simulated patients in each of the eight scenarios, a treatment propensity score was estimated for each patient as the predicted probability of treatment based on a linear probability model using the measured covariate XM, where XMi equals the value of the measured covariate for patient i:

[Equation (12): image in the original, not reproduced in this text version]
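A linear probability model of this kind is an ordinary least squares regression of the treatment indicator on the measured covariate. The sketch below fits it with NumPy on simulated scenario 2 data; the data-generating step assumes the linear valuation form and Table 1 values, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Scenario 2 data: independent uniform covariates, all value weights 100, S = 500.
X_M, X_U1, X_U3, X_U4 = rng.uniform(0.0, 1.0, size=(4, n))
T = (1500.0 * 0.2 - 500.0
     + 100.0 * (X_M + X_U1 + X_U3 + X_U4) > 0).astype(float)

# Linear probability model: regress T on an intercept and X_M by least squares.
design = np.column_stack([np.ones(n), X_M])
coef, *_ = np.linalg.lstsq(design, T, rcond=None)

# Fitted values are the estimated propensity scores.
ps_hat = design @ coef

# Share of the variation in T explained by X_M.
r2 = 1.0 - np.sum((T - ps_hat) ** 2) / np.sum((T - T.mean()) ** 2)
print(f"intercept={coef[0]:.3f}  slope={coef[1]:.3f}  R2={r2:.3f}")
```

The R2 from this sketch should land close to the .168 reported for scenario 2 in Table 1, although the exact value depends on the random draw.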

The fitted value from equation (12) equals the predicted propensity score for patient i given XMi. Table 1 contains the R2 value for each scenario showing the proportion of T variation described by XM. Using the estimated propensity scores, three distinct PS algorithms were used to balance XM between treated and untreated patients. First, simulated patients were stratified into five PS bins using the algorithm described by D'Agostino (1998). The full sample was grouped into quintiles of the propensity score, with “Bin 1” containing the fifth of the sample with the lowest probability of choosing treatment (lowest propensity scores), and “Bin 5” containing the fifth of the sample with the highest probability of choosing treatment (highest propensity scores). Second, the entire sample was reweighted using the inverse probability weighting algorithm based on Imbens (2000) and Robins et al. (2000). Third, treated patients were matched with untreated patients based on the predicted propensity score, using matching without replacement and three different criteria for the required “closeness” of propensity scores to be considered a match, sometimes referred to as the match “tolerance” or “caliper” value; specifically, the match tolerance was set to 0.1, 0.01, and 0.001 (Stuart 2010).
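The first two algorithms, quintile binning and inverse probability weighting, can be sketched as follows on simulated scenario 2 data. The weight formula is an assumption: the code uses the common form 1/PS for treated and 1/(1 − PS) for untreated patients, which may differ from the exact weights the authors applied. The helper `mean_diff` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
X_M, X_U1, X_U3, X_U4 = rng.uniform(0.0, 1.0, size=(4, n))
T = (1500.0 * 0.2 - 500.0
     + 100.0 * (X_M + X_U1 + X_U3 + X_U4) > 0).astype(int)

# Propensity score from a linear probability model of T on X_M.
slope, intercept = np.polyfit(X_M, T, 1)
ps_hat = intercept + slope * X_M


def mean_diff(x, t, w=None):
    """Treated minus untreated (optionally weighted) mean of x."""
    if w is None:
        w = np.ones_like(x)
    treated, untreated = t == 1, t == 0
    return (np.average(x[treated], weights=w[treated])
            - np.average(x[untreated], weights=w[untreated]))


full_diff = mean_diff(X_M, T)

# Binning: five groups defined by quintiles of the propensity score.
edges = np.quantile(ps_hat, [0.2, 0.4, 0.6, 0.8])
bins = np.digitize(ps_hat, edges)  # bin labels 0..4
bin_diffs = [mean_diff(X_M[bins == b], T[bins == b]) for b in range(5)]

# Weighting: assumed inverse probability weights.
ipw = np.where(T == 1, 1.0 / ps_hat, 1.0 / (1.0 - ps_hat))
ipw_diff = mean_diff(X_M, T, ipw)

print(f"full sample difference in X_M: {full_diff:.3f}")
print(f"average within-bin difference: {np.mean(bin_diffs):.3f}")
print(f"weighted difference:           {ipw_diff:.3f}")
```

Both adjustments shrink the difference in XM toward zero here. With a linear probability model the fitted scores can fall outside (0, 1) when XM explains more of treatment choice, in which case these weights would need additional handling.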

Within each scenario, the ability of the PS algorithms to balance covariates was assessed by comparing the means of the measured covariate XM and the unmeasured covariates for treated and untreated patients within (1) the full unweighted sample; (2) the full sample weighted by the inverse probability weights; (3) the samples within each of the five PS-stratified bins; and (4) patient samples matched by propensity scores. In addition, for each unmeasured covariate XUi, we estimated Ri as the residual of the regression of XUi on XM. Ri contains the portion of the XUi variance that is not “redundant” with XM. We assessed the impact on the balance of these residuals between treated and untreated patients using each PS sampling approach in each scenario.
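The residual comparison can be sketched for the matching approach as follows. The matching routine is an assumption: it is a simple one-to-one caliper match without replacement that sweeps through patients sorted by propensity score, and it is not necessarily the procedure the authors used. The sample is reduced to 20,000 patients so the pure-Python loop stays fast.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
X_M, X_U1, X_U3, X_U4 = rng.uniform(0.0, 1.0, size=(4, n))
T = (1500.0 * 0.2 - 500.0
     + 100.0 * (X_M + X_U1 + X_U3 + X_U4) > 0).astype(int)

slope, intercept = np.polyfit(X_M, T, 1)
ps_hat = intercept + slope * X_M

# R1: the part of X_U1 not explained by X_M (residual of X_U1 regressed on X_M).
b1, b0 = np.polyfit(X_M, X_U1, 1)
R1 = X_U1 - (b0 + b1 * X_M)


def caliper_match(ps, t, caliper):
    """1:1 matching without replacement by sweeping sorted propensity scores."""
    treated = np.flatnonzero(t == 1)
    untreated = np.flatnonzero(t == 0)
    treated = treated[np.argsort(ps[treated])]
    untreated = untreated[np.argsort(ps[untreated])]
    i = j = 0
    pairs = []
    while i < len(treated) and j < len(untreated):
        gap = ps[treated[i]] - ps[untreated[j]]
        if abs(gap) <= caliper:
            pairs.append((treated[i], untreated[j]))
            i += 1
            j += 1
        elif gap > 0:
            j += 1  # this untreated score is too low for all remaining treated
        else:
            i += 1  # this treated score is too low for all remaining untreated
    return np.array(pairs)


pairs = caliper_match(ps_hat, T, caliper=0.01)
m_t, m_u = pairs[:, 0], pairs[:, 1]

xm_full = X_M[T == 1].mean() - X_M[T == 0].mean()
xm_matched = X_M[m_t].mean() - X_M[m_u].mean()
r1_full = R1[T == 1].mean() - R1[T == 0].mean()
r1_matched = R1[m_t].mean() - R1[m_u].mean()

print(f"matched pairs: {len(pairs)}")
print(f"X_M difference: full {xm_full:.3f}, matched {xm_matched:.3f}")
print(f"R1 difference:  full {r1_full:.3f}, matched {r1_matched:.3f}")
```

In this sketch matching removes most of the difference in XM while the difference in R1 grows relative to the full sample, which is the pattern the simulations are designed to examine.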

Results

The results for all model covariates and cure rates for each simulation scenario are provided in the Appendices. Appendices A, B, C, D, and E contain results for XU1, XU2, XU3, XU4, and cure (C), respectively. For comparison purposes, each appendix contains the results for XM. The Appendices report the means of each covariate for the treated and untreated patients and the mean difference between the treated and untreated patients for each model scenario and sampling approach. Appendices A–D also contain results for each Ri, the portion of XUi unrelated to XM, which was measured as the residual of the regression of each XUi on XM.

General Results across Scenarios

The first row for each scenario in the appendices shows the imbalance for each model covariate between treated and untreated patients in the full unweighted sample. For example, in scenario 2 in Appendix A, the mean of XM is .617 for treated patients and .380 for untreated patients for a difference of .237; and the mean of R1 is .115 for treated patients and −.115 for untreated patients for a difference of .230. The remaining rows under each scenario show the means of each model covariate for treated and untreated patients for the respective PS algorithm. In scenario 2 in Appendix A, using the matching algorithm with .01 tolerance, the mean of XM is .494 for treated patients and .505 for untreated patients for a difference of .011, and the mean of R1 is .138 for treated patients and −.137 for untreated patients for a difference of .275. As a result, matching by propensity scores decreased the imbalance in the measured covariate XM by 95 percent (100*(.237 − .011)/.237) but increased the imbalance in R1, the portion of XU1 unrelated to XM, by 20 percent (100*(.230 − .275)/.230). Table 2 contains a summary of the change in imbalance between treated and untreated patients moving from the full unweighted sample to each PS-adjusted sample for XM and for the portion of the variation in each unmeasured covariate affecting treatment choice that is unrelated to XM: R1, R3, and R4 for the covariates XU1, XU3, and XU4, respectively. In every scenario, for every unmeasured covariate related to treatment choice, the imbalance in the portion of those covariates that is unrelated to XM always increases when moving from the full unweighted sample to PS-adjusted samples.
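The percent changes quoted for scenario 2 can be reproduced with a one-line calculation. The sketch below uses the full-sample difference as the denominator, which is the convention that reproduces the matching entries in Table 2; the function name is hypothetical.

```python
def percent_reduction(full_diff, adjusted_diff):
    """Percent reduction in the treated-untreated mean difference.

    A negative value means the imbalance increased after adjustment.
    """
    return 100.0 * (full_diff - adjusted_diff) / full_diff


# Scenario 2, matching with .01 tolerance (values quoted from Appendix A).
xm_change = percent_reduction(0.237, 0.011)
r1_change = percent_reduction(0.230, 0.275)
print(round(xm_change, 1), round(r1_change, 1))
```

The two results, 95.4 and −19.6, agree with the scenario 2 matching entries for XM and R1 in Table 2.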

Table 2. Percent Reduction (Increase) in Sample Mean Difference (Treated–Untreated) from PS Methods for the Measured Covariate XM, and the Independent Portions of the Unmeasured Covariates Affecting Treatment Choice R1, R3, and R4

| Covariate | Method | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 | Scenario 5 | Scenario 6 | Scenario 7 | Scenario 8 |
|---|---|---|---|---|---|---|---|---|---|
| XM | Binning* | 95.5% | 95.7% | 93.5% | 94.9% | 91.1% | n/a | n/a | n/a |
| XM | Matching† | 85.7 | 95.4 | 98.9 | 97.7 | 99.1 | 99.2% | 99.0% | 98.4% |
| XM | Weighting | 100.8 | 101.3 | 122.6 | 100.0 | 61.3 | 22.7 | 16.8 | 15.0 |
| R1 | Binning* | −4.4 | −20.0 | −81.5 | −42.0 | −157.4 | n/a | n/a | n/a |
| R1 | Matching† | −3.9 | −19.6 | −76.1 | −40.0 | −130.2 | −323.5 | −720.0 | −2,200.0 |
| R1 | Weighting | −4.6 | −21.9 | −117.9 | −53.3 | −88.6 | −75.0 | −33.3 | −66.7 |
| R3 | Binning* | −4.5 | −20.3 | −82.8 | −41.5 | −155.1 | n/a | n/a | n/a |
| R3 | Matching† | −3.8 | −19.7 | −75.6 | −28.0 | −134.9 | −343.8 | −730.0 | −2,300.0 |
| R3 | Weighting | −4.6 | −21.4 | −111.0 | −51.6 | −101.2 | −71.9 | −70.0 | −40.0 |
| R4 | Binning* | −4.2 | −19.9 | −83.3 | −41.3 | −155.3 | n/a | n/a | n/a |
| R4 | Matching† | −4.2 | −18.5 | −74.4 | −38.5 | −131.4 | −334.4 | −740.0 | −2,500.0 |
| R4 | Weighting | −4.9 | −21.6 | −111.3 | −53.8 | −95.3 | −68.8 | −60.0 | −70.0 |

* Average across 5 PS-quintile “bins,” n/a when not all five bins contained patients.

† For match tolerance value = 0.01 (for matched portion of total sample only).

Appendix E contains differences in cure rates for each scenario between treated and untreated patients for the full unweighted sample and each PS-adjusted sample. Table 3 summarizes the treatment effect estimates across scenarios using propensity score methods and standard regression estimators. In all scenarios, the true value of βT, the incremental effect of treatment on the probability of a cure, is 0.2. The unmeasured confounding covariates in each scenario, XU1 and XU2, are both positively related to treatment choice and the probability of cure. Therefore, the direct estimation of the effect of treatment on cure after controlling for XM alone should yield estimates of treatment effect that are biased high. Column I in Table 3 shows the difference in cure probabilities between treated and untreated patients without adjusting for XM. Column II contains the regression-based treatment effect estimate using the full unweighted sample after controlling for XM. Regression-based estimates only come close to the true value of βT when XM is highly correlated with the unmeasured confounders, as in scenarios 7 and 8. Columns III, IV, and V contain estimates using inverse probability weighting, binning, and matching, respectively. Although differences in the extent of bias across estimation approaches were generally small, the treatment effect estimates generated via the propensity score algorithms were generally larger (more biased) than regression estimates, with only one exception (the matching estimate in scenario 4 using .001 tolerance). Overall, these simulation results suggest that propensity score algorithms are unlikely to reduce the bias in estimated treatment effects compared with regression estimates in the presence of unmeasured confounders.

Table 3. Treatment Effect Estimates by Scenario Using (1) Regression and (2) Differences between the Cure Rate for the Treated and Untreated Patients from Propensity Score Algorithms

| Scenario | Truth | I: No Control* | II: Regression Estimate Controlling for XM† | III: Inverse Probability Weighting‡ | IV: Average across Bins§ | V: Matching, Tolerance .1¶ | V: Matching, Tolerance .01¶ | V: Matching, Tolerance .001¶ |
|---|---|---|---|---|---|---|---|---|
| 1 | .200 | .263 | .254 | .255 | .255 | .263 | .258 | .257 |
| 2 | .200 | .273 | .257 | .257 | .257 | .268 | .261 | .260 |
| 3 | .200 | .271 | .260 | .286 | .265 | .265 | .262 | .261 |
| 4 | .200 | .275 | .254 | .255 | .256 | .261 | .254 | .253 |
| 5 | .200 | .282 | .232 | .256 | .239 | .240 | .234 | .233 |
| 6 | .200 | .304 | .227 | .291 | .239 | .242 | .231 | .233 |
| 7 | .200 | .318 | .203 | .299 | .223 | .220 | .212 | .210 |
| 8 | .200 | .338 | .194 | .316 | .220 | .213 | .222 | .225 |

* Difference in cure rates between treated and untreated patients without control for XM.

† Linear multiple regression estimate of the effect of T on cure controlling for XM.

‡ Weighted difference in cure rates between treated and untreated patients with weights based on the propensity score.

§ Average of the difference in cure rates between treated and untreated patients across the five propensity score bins.

¶ Difference in cure rates between propensity score-matched treated and untreated patients (matching tolerance based on propensity score).
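The comparison between regression and PS-based estimates in Table 3 can be tabulated directly. The sketch below transcribes columns II to V from Table 3 and lists every PS-based estimate that falls strictly below the regression estimate for the same scenario; the dictionary layout and labels are illustrative.

```python
# Table 3 estimates by scenario: regression (column II) followed by the
# PS-based estimates (columns III, IV, and V at the three tolerances).
ps_labels = ("ipw", "bins", "match_.1", "match_.01", "match_.001")
table3 = {
    1: (0.254, (0.255, 0.255, 0.263, 0.258, 0.257)),
    2: (0.257, (0.257, 0.257, 0.268, 0.261, 0.260)),
    3: (0.260, (0.286, 0.265, 0.265, 0.262, 0.261)),
    4: (0.254, (0.255, 0.256, 0.261, 0.254, 0.253)),
    5: (0.232, (0.256, 0.239, 0.240, 0.234, 0.233)),
    6: (0.227, (0.291, 0.239, 0.242, 0.231, 0.233)),
    7: (0.203, (0.299, 0.223, 0.220, 0.212, 0.210)),
    8: (0.194, (0.316, 0.220, 0.213, 0.222, 0.225)),
}

# PS-based estimates that are strictly smaller than the regression estimate.
below_regression = [
    (scenario, label)
    for scenario, (regression, ps_estimates) in table3.items()
    for label, estimate in zip(ps_labels, ps_estimates)
    if estimate < regression
]
print(below_regression)
```

Only the scenario 4 matching estimate at .001 tolerance falls below its regression counterpart; a few other entries tie with the regression estimate at three decimals.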

Smaller Unmeasured Covariate Value Weights Increase Unmeasured Covariate Imbalance

Scenarios 1, 2, and 3 were constructed with variation in the value weights placed on each covariate in the treatment value relationship but with no relationships between the measured and unmeasured covariates. The treatment value weights applied to the unmeasured covariates fall relative to the value weight placed on the measured covariate moving from scenario 1 to scenario 3. Figures 2 and 3 summarize these results, focusing on XM and R1, the portion of XU1 variation unrelated to XM. Figures 2 and 3 show the percent change in difference in XM and R1 balance between treated and untreated patients as a result of propensity score binning and propensity score matching, respectively. The green lines represent the balance change for the scenario with the lowest relative treatment value weight placed on the unmeasured covariates. Regardless of propensity score method used, the smaller the relative treatment weight placed on the unmeasured covariates, the greater the increase in the imbalance in unmeasured covariates when propensity score methods are applied (Figures 2b and 3b). This occurs because smaller relative value weights placed on the unmeasured covariates require greater differences in these covariates between treated and untreated patients for them to match on the measured covariate XM. Interestingly, with respect to XM, the ability to balance XM falls with relative value weights placed on the unmeasured covariates using the propensity score binning approach (Figure 2a), whereas it increases with relative value weights placed on the unmeasured covariates using the matching approach (Figure 3a). The matching process appears to do a better job eliminating matches dissimilar in XM when the unmeasured covariates have less weight in treatment choice. However, the number of simulated treated and untreated patients that are matched falls with relative value weights placed on the unmeasured covariates.
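The effect of the relative value weights can be explored with a small re-simulation of scenarios 1 to 3 under quintile binning. This is a sketch under the same assumptions as before (linear valuation, Table 1 parameter values, independent uniform covariates, a linear probability model for the propensity score); the function name and seed are arbitrary, and the numbers will not match Table 2 exactly.

```python
import numpy as np


def binning_change_in_r1(alpha_U, S, n=100_000, seed=5):
    """Percent reduction (negative = increase) in the R1 difference from binning."""
    rng = np.random.default_rng(seed)
    X_M, X_U1, X_U3, X_U4 = rng.uniform(0.0, 1.0, size=(4, n))
    T = 1500.0 * 0.2 - S + 100.0 * X_M + alpha_U * (X_U1 + X_U3 + X_U4) > 0

    # Propensity score from a linear probability model of T on X_M.
    slope, intercept = np.polyfit(X_M, T.astype(float), 1)
    ps_hat = intercept + slope * X_M

    # R1: residual of X_U1 regressed on X_M.
    b1, b0 = np.polyfit(X_M, X_U1, 1)
    R1 = X_U1 - (b0 + b1 * X_M)

    full = R1[T].mean() - R1[~T].mean()
    bins = np.digitize(ps_hat, np.quantile(ps_hat, [0.2, 0.4, 0.6, 0.8]))
    per_bin = [R1[(bins == b) & T].mean() - R1[(bins == b) & ~T].mean()
               for b in range(5)]
    return 100.0 * (full - np.mean(per_bin)) / full


# (alpha_U, S) for scenarios 1, 2, and 3 from Table 1.
scenarios = {1: (200.0, 650.0), 2: (100.0, 500.0), 3: (50.0, 425.0)}
changes = {s: binning_change_in_r1(a, cost) for s, (a, cost) in scenarios.items()}
for s, change in changes.items():
    print(f"scenario {s}: percent reduction in R1 difference = {change:.1f}")
```

The values become more negative from scenario 1 to scenario 3, so the imbalance in R1 grows as the unmeasured covariates carry less weight in treatment choice.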

Figure 2. Percent Reduction (Increase) in Sample Mean Difference (Treated–Untreated) from PS Binning for XM and R1, by Unmeasured Covariate Effect Size (αU1)

Figure 3. Percent Reduction (Increase) in Sample Mean Difference (Treated–Untreated) from PS Matching for XM and R1, by Unmeasured Covariate Effect Size (αU1)

Greater Correlation between Measured and Unmeasured Covariates Increases Unmeasured Covariate Imbalance

Scenarios 2, 4, 5, 6, 7, and 8 have consistent treatment value weights across covariates, but they vary with the strength of the relationships between the measured and unmeasured covariates. The strength of the relationships between XM and the unmeasured covariates increases moving from scenarios 2 through 8. Figures 4 and 5 summarize these results, focusing on XM and R1, the portion of XU1 variation unrelated to XM. Figures 4 and 5 show the percent change in difference in XM and R1 balance between treated and untreated patients as a result of propensity score binning and propensity score matching, respectively. Both figures show that as XM explains more of the variation in the unmeasured covariates, the more difficult it is to find patients that made different treatment choices with similar propensity scores. For example, in scenario 8, no treated patients were found in propensity score bins 1 and 2, and no untreated patients were found in bins 4 and 5. Likewise, in scenario 8 only 1,112 treated patients were matched to an untreated patient using matching tolerance of .001. Regardless of the propensity score method used, the imbalance in R1 between treated and untreated patients increases with the strength of the relationships between XM and the unmeasured covariates. With respect to balancing XM, stronger relationships between XM and the unmeasured covariates moving from scenarios 2 through 8 have an effect similar to that of decreasing the treatment value weights moving from scenarios 1 through 3. The ability to balance XM falls with stronger relationships with the unmeasured covariates using the propensity score binning approach (Figure 4a), whereas it increases using the matching approach (Figure 5a). The matching process does a better job eliminating matches dissimilar in XM when the unmeasured covariates have less independent effect on treatment choice, but the number of matched simulated treated and untreated patients falls dramatically.
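The loss of overlap in scenario 8 can be illustrated by counting treated and untreated patients in each propensity score quintile. The sketch assumes the weighted-average relationship XU = weight·XM + (1 − weight)·E with weight .9, the linear valuation form, and the scenario 8 values in Table 1; these forms are reconstructions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
weight = 0.9  # scenario 8

X_M = rng.uniform(0.0, 1.0, n)
E1, E3, E4 = rng.uniform(0.0, 1.0, size=(3, n))
# Unmeasured covariates tied to X_M (assumed weighted-average form).
X_U1, X_U3, X_U4 = (weight * X_M + (1.0 - weight) * E for E in (E1, E3, E4))

T = 1500.0 * 0.2 - 500.0 + 100.0 * (X_M + X_U1 + X_U3 + X_U4) > 0

# Propensity score from a linear probability model, then quintile bins.
slope, intercept = np.polyfit(X_M, T.astype(float), 1)
ps_hat = intercept + slope * X_M
bins = np.digitize(ps_hat, np.quantile(ps_hat, [0.2, 0.4, 0.6, 0.8]))

treated_counts = [int(np.sum(T & (bins == b))) for b in range(5)]
untreated_counts = [int(np.sum(~T & (bins == b))) for b in range(5)]
print("treated per bin:  ", treated_counts)
print("untreated per bin:", untreated_counts)
```

Under these assumptions only the middle bin contains both treated and untreated patients, which matches the description of scenario 8 and explains why no bin average can be reported for it.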

Figure 4. Percent Reduction (Increase) in Sample Mean Difference (Treated–Untreated) from PS Binning for XM and R1, by Correlation between XM and XU1 (ρM,U1).

Note. [1] Empty cells in Bins 1 and 5; [2] Empty cells in Bins 1, 2, 4, and 5.

Figure 5. Percent Reduction (Increase) in Sample Mean Difference (Treated–Untreated) from PS Matching for XM and R1, by Match Tolerance Factor (±0.1, ±0.01, ±0.001).
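The percent reduction (increase) metric reported for propensity score binning can be illustrated with a small sketch. This is not the simulation model used in this study: the threshold treatment rule, the parameter rho (the correlation between XM and XU1), the bin-size weighting of within-bin differences, and all function names are assumptions made for illustration. The propensity score is computed analytically from the assumed rule rather than estimated, and rho must be below 1.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def simulate(n, rho, seed=0):
    """Hypothetical model: XU1 = rho*XM + sqrt(1 - rho^2)*R1; treated if XM + XU1 > 0."""
    rng = random.Random(seed)
    cdf = NormalDist().cdf
    w_m, w_r = 1 + rho, sqrt(1 - rho ** 2)    # implied weights on XM and R1
    rows = []
    for _ in range(n):
        xm, r1 = rng.gauss(0, 1), rng.gauss(0, 1)
        treated = w_m * xm + w_r * r1 > 0
        ps = cdf(w_m * xm / w_r)               # P(treated | XM) under this rule
        rows.append((ps, xm, r1, treated))
    return rows

def mean_diff(rows, col):
    """Treated minus untreated mean of column col; None if either group is empty."""
    t = [row[col] for row in rows if row[3]]
    u = [row[col] for row in rows if not row[3]]
    return mean(t) - mean(u) if t and u else None

def binned_diff(rows, col, n_bins=5):
    """Bin-size-weighted average of within-bin differences, skipping empty cells."""
    ordered = sorted(rows, key=lambda row: row[0])
    size = len(ordered) // n_bins              # any remainder rows are dropped
    total, weight = 0.0, 0
    for b in range(n_bins):
        chunk = ordered[b * size:(b + 1) * size]
        d = mean_diff(chunk, col)
        if d is not None:
            total += d * len(chunk)
            weight += len(chunk)
    return total / weight

def pct_reduction(rows, col):
    """Percent reduction (negative means increase) relative to the full sample."""
    return 100 * (1 - binned_diff(rows, col) / mean_diff(rows, col))

for rho in (0.0, 0.8):
    rows = simulate(20000, rho)
    # column 1 is XM, column 2 is R1
    print(rho, round(pct_reduction(rows, 1)), round(pct_reduction(rows, 2)))
```

Under these assumptions, binning should shrink the XM difference while enlarging the R1 difference, and the R1 increase should be larger at the higher correlation.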

Discussion

It has been suggested that researchers estimating treatment effects using observational data should use propensity score-based algorithms to mimic a randomized controlled trial design. The strength of randomization, however, is the assumption that it will evenly distribute covariates (both measured and unmeasured) across treated and untreated patients. The results of randomization are often reinforced in studies by demonstrating balance in measured covariates. In addition, many researchers using propensity score methods to estimate treatment effects with observational data imply that the measured covariate balance produced by these methods results in unmeasured covariate balance (Shah et al. 2005).

Our simple simulation model suggests that balancing measured covariates between treated and untreated patients actually has opposite implications for randomized and observational studies. We first showed that to balance measured covariates, PS algorithms require the existence of unmeasured covariates with variation unrelated to the measured covariates. This independent variation is needed to ensure that patients with similar propensity scores are observed making different treatment choices. Second, whereas demonstrated measured covariate balance between treated and untreated patients resulting from ex ante randomized treatment assignment reinforces the notion that all covariates are balanced, we showed that the forced balance of measured covariates using PS-based algorithms based on ex post treatment selection in observational studies exacerbates the imbalance in the variation of the unmeasured covariates that is unrelated to the measured covariates. In addition, the greater the impact that measured covariates have on treatment choice relative to unmeasured covariates, the more the forced balance of measured covariates increases unmeasured covariate imbalance. This can be likened to squeezing a balloon. When a set of patients making different treatment decisions is forced to be balanced on one set of covariates (measured), this must be compensated by increased imbalance in the remaining unmeasured covariates affecting treatment choice. These results have implications for the use of high-dimensional propensity scores for balancing treated and untreated patients (Rassen, Brookhart et al. 2009; Schneeweiss et al. 2009). The more of the variation in treatment choice that is explained by measured covariates, the harder it is to match treated and untreated patients, and the more imbalance in the unmeasured covariates will occur between the treated and untreated patients who are matched by propensity scores.
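A minimal sketch of the balloon effect under matching follows. It is an illustration under stated assumptions, not the study's model: treatment follows a hypothetical threshold rule with weight w_m on the measured covariate XM and weight w_u on an independent unmeasured covariate R, the propensity score is the analytic P(treated | XM) under that rule, and matching is a simple greedy 1:1 caliper match without replacement. All names and parameter values are invented for the example.

```python
import bisect
import random
from statistics import NormalDist, mean

def simulate(n, w_m, w_u=1.0, seed=0):
    """Hypothetical rule: treated if w_m*XM + w_u*R > 0, XM and R independent N(0, 1)."""
    rng = random.Random(seed)
    cdf = NormalDist().cdf
    treated, untreated = [], []
    for _ in range(n):
        xm, r = rng.gauss(0, 1), rng.gauss(0, 1)
        ps = cdf(w_m * xm / w_u)                  # P(treated | XM) under this rule
        group = treated if w_m * xm + w_u * r > 0 else untreated
        group.append((ps, r))
    return treated, untreated

def caliper_match(treated, untreated, tol):
    """Greedy 1:1 matching without replacement on the score, within +/- tol."""
    pool = sorted(untreated)                      # untreated patients ordered by score
    keys = [ps for ps, _ in pool]
    pairs = []
    for ps, r in treated:
        i = bisect.bisect_left(keys, ps)
        near = [j for j in (i - 1, i) if 0 <= j < len(keys)]
        if not near:
            break                                 # no untreated patients left
        j = min(near, key=lambda k: abs(keys[k] - ps))
        if abs(keys[j] - ps) <= tol:
            pairs.append((r, pool[j][1]))         # (treated R, matched untreated R)
            del keys[j], pool[j]
    return pairs

def r_imbalance(w_m, n=4000, tol=0.01):
    """Return (full-sample R difference, matched-sample R difference, number of pairs)."""
    treated, untreated = simulate(n, w_m)
    raw = mean(r for _, r in treated) - mean(r for _, r in untreated)
    pairs = caliper_match(treated, untreated, tol)
    matched = mean(rt - ru for rt, ru in pairs)
    return raw, matched, len(pairs)

for w_m in (1.0, 3.0):
    raw, matched, n_pairs = r_imbalance(w_m)
    print(w_m, n_pairs, round(raw, 2), round(matched, 2))
```

In this sketch, raising w_m relative to w_u should leave fewer matched pairs and a larger matched-to-raw ratio of the R difference, which is the pattern the balloon analogy describes.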

Because of these results, for PS methods to yield unbiased treatment effect estimates, the remaining factors affecting treatment choice must have the properties of XU4, which would be natural instruments if they were measured. Treatment variation caused by the class of variables XU4 can be thought of as “good” treatment variation that is tantamount to a natural experiment, but theory is required to justify the notion that all remaining treatment variation comes from the class of variables XU4. This is why researchers should not put measured covariates thought to be instruments in propensity score models: doing so reduces the amount of “good” treatment variation used to estimate treatment effects (Bhattacharya and Vogt 2007).
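The contrast between instrument-like and confounding unmeasured variation can be sketched with a hypothetical linear outcome. Everything here is an assumption for illustration rather than the study's model: the outcome equation, the true effect TAU, the coefficients b_m and b_u, the threshold treatment rule, and stratification into 20 propensity score bins. Setting b_u = 0 makes the unmeasured covariate R behave like XU4 (it affects treatment choice only), whereas b_u > 0 makes it a confounder.

```python
import random
from statistics import NormalDist, mean

TAU = 1.0  # hypothetical true treatment effect

def simulate(n, b_m, b_u, seed=0):
    """Treated if XM + R > 0; outcome Y = TAU*T + b_m*XM + b_u*R + noise."""
    rng = random.Random(seed)
    cdf = NormalDist().cdf
    rows = []
    for _ in range(n):
        xm, r = rng.gauss(0, 1), rng.gauss(0, 1)
        t = 1 if xm + r > 0 else 0
        y = TAU * t + b_m * xm + b_u * r + rng.gauss(0, 1)
        rows.append((cdf(xm), t, y))          # cdf(xm) is P(treated | XM) here
    return rows

def naive_estimate(rows):
    """Unadjusted treated minus untreated mean outcome."""
    return mean(y for _, t, y in rows if t) - mean(y for _, t, y in rows if not t)

def stratified_estimate(rows, n_bins=20):
    """Bin-size-weighted average of within-bin outcome differences."""
    ordered = sorted(rows)                    # sorts by propensity score first
    size = len(ordered) // n_bins             # any remainder rows are dropped
    total, weight = 0.0, 0
    for b in range(n_bins):
        chunk = ordered[b * size:(b + 1) * size]
        y1 = [y for _, t, y in chunk if t]
        y0 = [y for _, t, y in chunk if not t]
        if y1 and y0:                         # skip bins with an empty cell
            total += (mean(y1) - mean(y0)) * len(chunk)
            weight += len(chunk)
    return total / weight

# R instrument-like (b_u = 0) versus R confounding (b_u = 1)
for label, b_m, b_u in (("instrument-like", 1.0, 0.0), ("confounder", 0.0, 1.0)):
    rows = simulate(20000, b_m, b_u)
    print(label, round(naive_estimate(rows), 2), round(stratified_estimate(rows), 2))
```

When R is instrument-like, stratifying on the propensity score should bring the estimate close to TAU. When R is a confounder and XM is not, the stratified estimate should sit further from TAU than the unadjusted one in this setup.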

Conclusion

We used a simple model of treatment choice and outcome to assess the effect of propensity score algorithms that balance measured covariates between treated and untreated patients on the balance of unmeasured covariates between these patients. We found that propensity score algorithms that balance measured confounders between treated and untreated patients exacerbate imbalance for these same patients in the portion of the variation in unmeasured confounders that is unrelated to measured covariates. However, independent variation in the unmeasured covariates is required for propensity score algorithms to balance measured covariates between treated and untreated patients. Although our simulation model was simple, we challenge researchers to construct alternative models in which the independent variation in the unmeasured covariates that affect treatment choice becomes more balanced as a result of using propensity score algorithms to balance measured covariates. As in regression-based estimation, researchers using propensity score algorithms still must provide theoretical justification for the assumption that the unmeasured covariates affecting treatment choice have no direct or indirect effects on study outcomes. Indeed, based on the results here, acceptance of this assumption appears to be even more critical when using propensity score methods, because the imbalance between treated and untreated patients in the portion of the variation of the unmeasured covariates unrelated to the measured covariates will increase. As a result, if the unmeasured covariates affecting treatment choice are confounders, propensity score methods can exacerbate the bias in treatment effect estimates.

Acknowledgments

Joint Acknowledgment/Disclosure Statement: This project was supported in part by an Agency for Healthcare Research and Quality Centers for Education and Research on Therapeutics cooperative agreement (5UI8HS016094) and AHRQ grant (ROIHS018381). The authors would also like to acknowledge the extremely valuable input of Brian Dowd and the anonymous reviewers of the manuscript. Any remaining errors are the responsibility of the authors.

Disclosures: None.

Disclaimers: None.

SUPPORTING INFORMATION

Additional supporting information may be found in the online version of this article:

Appendix A: Mean Measured (XM) and Unmeasured Covariate (XU1) Values across Model Scenarios.

Appendix B: Mean Measured (XM) and Unmeasured Covariate (XU2) Values across Model Scenarios.

Appendix C: Mean Measured (XM) and Unmeasured Covariate (XU3) Values across Model Scenarios.

Appendix D: Mean Measured (XM) and Unmeasured Covariate (XU4) Values across Model Scenarios.

Appendix E: Mean Measured (XM) and Cure (C) Rates across Model Scenarios.

Appendix SA1: Author Matrix.

hesr0048-1487-SD1.pdf (1.5MB, pdf)
hesr0048-1487-SD2.doc (610KB, doc)

References

  1. Angrist JD, Pischke J-S. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton, New Jersey: Princeton University Press; 2009.
  2. Austin PC, Grootendorst P, Anderson GM. “A Comparison of the Ability of Different Propensity Score Models to Balance Measured Variables between Treated and Untreated Subjects: A Monte Carlo Study”. Statistics in Medicine. 2007;26(4):734–53. doi: 10.1002/sim.2580.
  3. Ben-Akiva M, Lerman SR. Discrete Choice Analysis. Cambridge, Massachusetts: The MIT Press; 1985.
  4. Berk RA. Regression Analysis: A Constructive Critique. Thousand Oaks, California: Sage Publications; 2004.
  5. Bhattacharya J, Vogt WB. “Do Instrumental Variables Belong in Propensity Scores?”. Technical Working Paper 343. Cambridge, MA: National Bureau of Economic Research; 2007. http://www.nber.org/papers/t0343.
  6. Brooks JM, Fang G. “Interpreting Treatment-Effect Estimates with Heterogeneity and Choice: Simulation Model Results”. Clinical Therapeutics. 2009;31(4):902–19. doi: 10.1016/j.clinthera.2009.04.007.
  7. D'Agostino RB. “Propensity Score Methods for Bias Reduction in the Comparison of a Treatment to a Non-Randomized Comparison Group”. Statistics in Medicine. 1998;17:2265–81. doi: 10.1002/(sici)1097-0258(19981015)17:19<2265::aid-sim918>3.0.co;2-b.
  8. Frolich M. “Propensity Score Matching without Conditional Independence Assumption—with an Application to the Gender Wage Gap in the United Kingdom”. Econometrics Journal. 2007;10(2):359–407.
  9. Hall JA, Summers KH, Obenchain RL. “Cost and Utilization Comparisons among Propensity Score-Matched Insulin Lispro and Regular Insulin Users”. Journal of Managed Care Pharmacy. 2003;9(3):263–8. doi: 10.18553/jmcp.2003.9.3.263.
  10. Imbens GW. “The Role of the Propensity Score in Estimating Dose-Response Functions”. Biometrika. 2000;87(3):706–10.
  11. Joffe MM, Rosenbaum PR. “Invited Commentary: Propensity Scores”. American Journal of Epidemiology. 1999;150(4):327–33. doi: 10.1093/oxfordjournals.aje.a010011.
  12. Lin DY, Psaty BM, Kronmal RA. “Assessing the Sensitivity of Regression Results to Unmeasured Confounders in Observational Studies”. Biometrics. 1998;54(3):948–63.
  13. Rassen JA, Brookhart MA, et al. “Observed Performance of the High-Dimensional Propensity Score (hd-PS) Algorithm in Small Samples”. Pharmacoepidemiology and Drug Safety. 2009;18:S15–S15.
  14. Robins JM, Hernan MA, Brumback B. “Marginal Structural Models and Causal Inference in Epidemiology”. Epidemiology. 2000;11(5):550–60. doi: 10.1097/00001648-200009000-00011.
  15. Rosenbaum PR. “Model-Based Direct Adjustment”. Journal of the American Statistical Association. 1987;82(398):387–94.
  16. Rosenbaum PR, Rubin DB. “Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome”. Journal of the Royal Statistical Society Series B-Methodological. 1983a;45(2):212–8.
  17. Rosenbaum PR, Rubin DB. “The Central Role of the Propensity Score in Observational Studies for Causal Effects”. Biometrika. 1983b;70(1):41–55.
  18. Rubin DB. “Estimating Causal Effects from Large Data Sets Using Propensity Scores”. Annals of Internal Medicine. 1997;127(8):757–63. doi: 10.7326/0003-4819-127-8_part_2-199710151-00064.
  19. Rubin DB. “Using Propensity Scores to Help Design Observational Studies: Application to Tobacco Litigation”. Health Services & Outcomes Research Methodology. 2001;2:169–88.
  20. Rubin DB. “The Design versus the Analysis of Observational Studies for Causal Effects: Parallels with the Design of Randomized Trials”. Statistics in Medicine. 2007;26(1):20–36. doi: 10.1002/sim.2739.
  21. Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. “High-dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data”. Epidemiology. 2009;20(4):512–22. doi: 10.1097/EDE.0b013e3181a663cc.
  22. Shah BR, Laupacis A, Hux JE, Austin PC. “Propensity Score Methods Gave Similar Results to Traditional Regression Modeling in Observational Studies: A Systematic Review”. Journal of Clinical Epidemiology. 2005;58(6):550–9. doi: 10.1016/j.jclinepi.2004.10.016.
  23. Stuart EA. “Matching Methods for Causal Inference: A Review and a Look Forward”. Statistical Science. 2010;25(1):1–21. doi: 10.1214/09-STS313.
  24. Ward A, Johnson PJ. “Addressing Confounding Errors When Using Non-Experimental, Observational Data to Make Causal Claims”. Synthese. 2008;163(3):419–32.

Articles from Health Services Research are provided here courtesy of Health Research & Educational Trust
