Abstract
Propensity score methods are increasingly used as a tool to control for confounding bias in pharmacoepidemiologic studies. The propensity score is a dimension reducing balancing score, creating treatment and reference groups that have comparable distributions of measured covariates. The purpose of this methods review is to provide an overview of the use of propensity score methods, including a summary of important data assumptions, various applications of the propensity score, and how to evaluate covariate balance. This article is intended for pharmacists and researchers who wish to receive an introduction to propensity score methods and be able to engage in high-level discussions on application and reporting.
Keywords: Pharmacoepidemiology, Evidence based medicine, Health services research
Purpose and estimation of propensity scores
Pharmacoepidemiology is the study of the utilization and effects of medical products in large numbers of people; it is a bridge science spanning both pharmacology and epidemiology.1 Often in pharmacoepidemiology, we want to investigate the potential impact of a treatment or therapy, holding all other external effects and noise constant. When the relationship between a treatment and outcome is mixed with the influence of a third variable, systematic error may result from a phenomenon called confounding bias. In general terms, confounders are pretreatment covariates that are risk factors for the outcome of interest and independently associated with the treatment.2 Confounders by definition affect treatment decisions.
Let us anchor these concepts with an example. Suppose we are conducting a retrospective observational new-user cohort study within commercial health insurance administrative billing data, examining the association of cyclooxygenase-2 (COX-2) inhibitors versus nonselective nonsteroidal anti-inflammatory drugs (NSAIDs) and hospital presentation for gastrointestinal (GI) bleeding. A simple comparison of crude GI bleeding rates between COX-2 inhibitor and NSAID users would be challenging to interpret, given a lack of adjustment for confounders, as treatment assignment is not random, and persons receiving these treatments may be very different from one another. In fact, numerous factors may distort the relationship between these analgesics and GI bleeding, e.g., patient age, comorbidities, prior history of GI symptoms or illness, and concomitant medications. Confounding can be addressed in study design (e.g., randomization, restriction) and/or study analysis (e.g., stratification, multivariable regression, propensity score methods).3 While this overview will focus on propensity score methods for nonexperimental studies, it is important to first briefly contextualize the impact of randomization in experimental studies.
Randomized clinical trials (RCTs) are considered the gold standard in research design. By randomizing treatment allocation, one can minimize or eliminate the impact of confounding from both measured and unmeasured (whether measurable or immeasurable) factors. However, nonexperimental studies are commonly conducted to examine the effectiveness or safety of treatments, often because a trial would be infeasible given limited resources or time, or would be unethical. In the absence of randomization, there may be systematic differences between treated and reference participants, where we use ‘reference’ to include untreated or alternatively treated groups; our anchoring example has the latter design. After all, real-world decisions involving treatment selection are not a random process. Prescribers use clinical parameters, prognosis, and several other factors to weigh risks and benefits in order to make an evidence-based treatment decision. Accounting for these pretreatment differences in effectiveness/safety comparisons, and therefore potential confounding bias, is critical to drawing valid inferences on the association between a treatment and outcome. Historically, accounting for pretreatment differences in nonexperimental designs has been accomplished by adjusting for individual confounding factors in multivariable regression models. Alternatively, propensity score methods summarize confounding variables into a single metric that is used for confounding control in subsequent analyses. In contrast with multivariable regression, propensity score methods address confounding bias by predicting treatment probability, prior to analyzing the outcome, essentially separating the study design from the study analysis. This separation can reduce actual or suspected influence from the researchers on the results.4 It is important to note that traditional multivariable regression models provide similar and more efficient confounder adjustment compared with propensity score applications in cohort studies with large sample sizes and a sufficient number of outcome events.5 Under certain scenarios, such as those with relatively small sample sizes, few outcome events, or a large number of potential confounders, propensity scores may have advantages over traditional regression adjustment.
The propensity score is the probability of a study participant being assigned to the treatment group, conditional on pretreatment covariates.6 In RCTs, the true propensity score is known and is defined by the study design; in observational studies, the true propensity score is unknown, but can be estimated using measured covariates.7 In our anchoring example, the propensity score for each patient would be the probability of that patient receiving a COX-2 inhibitor versus a nonselective NSAID, conditional on their potential confounding variables. Propensity score methods efficiently offer a single summary of multiple confounding covariates. For any given value of a propensity score, all covariates will be balanced between treated and reference participants, making participants who have the same propensity score comparable to each other. In other words, if a group of study participants have the same probability of getting the treatment of interest, conditional on their characteristics, comparing those that ultimately did to those who did not get the treatment (or received an alternative treatment) within that group will yield results that are not confounded (assuming sufficient confounders are captured in the data and were used to construct the propensity score). A conceptual, though unproven benefit of using a propensity score (vs. traditional regression) to address confounding bias is that balancing measured covariates may lead to balance in unmeasured and unmeasurable covariates.8 A realized advantage is that the propensity score acts as a dimension reducing balancing score, creating treatment and reference groups that have comparable distributions of measured pretreatment covariates. 
The use of propensity scores makes it clearer to determine the extent to which differences in outcomes are due to the treatment of interest versus other influencing factors.9 For our anchoring example, an approach to addressing confounding may be to balance measured pretreatment covariates using propensity score methods, instead of including individual confounders in a traditional multivariable regression model. Given we are using as our data source administrative billing data (which is high-dimensional, i.e., has many potential covariates relative to a modest number of GI bleeding outcomes), the estimation of a propensity score would reduce the dimensionality, allowing us to generate groups with balanced covariates without simultaneously considering a large number of attributes between groups, as would be required with traditional regression adjustment.
Assumptions
The use of propensity score methods to address confounding requires a few assumptions to be made about the data. First, treatment allocation must temporally precede the outcome of interest; this temporal sequence is fundamental to avoid reverse causation. Next, all participants must have the potential to either receive or not receive the treatment (or receive the alternative treatment), conditional on their characteristics; this is referred to as positivity.10 An example that would violate the positivity assumption is the inclusion of participants with an absolute contraindication to a treatment under study; the potential for these participants to receive the treatment is zero, making it impossible to find comparator patients for this subpopulation. A third assumption, known as the Stable Unit Treatment Value Assumption (SUTVA), is that the possible outcome of a participant must not be affected by the treatment of another participant.11 An example that would violate SUTVA is spillover effects from studying clinical outcomes at the hospital unit level; while one unit may have received an intervention, the outcomes of patients in other units may be affected because providers may also provide care in those units. The final assumption is that the possible outcome of a participant must not be associated with their probability of treatment assignment, conditional on all measured covariates; this is referred to as exchangeability and is equivalent to the notion that there must be no unmeasured confounding. Sufficient possible confounding covariates that affect treatment allocation and the outcome of interest must be captured in the data; this is not unique to propensity scores, but also applies to multivariable regression methods.12
When to consider propensity score methods
Propensity score methods have become increasingly common in non-randomized studies to address confounding.13 However, in studies that employed both propensity score methods and traditional regression adjustment to control for confounding, few found significant differences between the two approaches.14,15 Glynn et al.16 have outlined potential indications for the use of propensity scores in pharmacoepidemiology, including the ability to match or trim the study sample (described later), the ability to account for propensity score interactions, and use of the propensity score to correct for certain types of measurement error. Additionally, the application of propensity score methods can yield different treatment effect estimates than traditional regression adjustment; these estimates can also vary from method to method (described later).
Propensity score methods were originally developed using large sample sizes (thousands or more), yet more recently, these methods are being applied to clinical studies with smaller sample sizes (hundreds or fewer). Simulation studies have shown that in small samples (40–60 persons), propensity score methods still yield unbiased estimates of the treatment effect.17 Propensity score methods can also be helpful if the outcome is rare or the number of confounders is large, which is very common in pharmacoepidemiologic studies. The common rule of thumb of 1 covariate per 8–10 events can limit the number of confounders included in a traditional binary logistic model, resulting in inadequate confounding control,18 although this criterion has been debated.19
Care should be taken when determining if propensity score methods are appropriate for a given study question and design. Strong knowledge of model assumptions and the plausible causal structure are necessary to ensure that propensity score methods have been applied correctly. A deep understanding of how selection bias must be addressed is required in order to determine covariate selection and to correctly specify the propensity score model. Ultimately, there is no consensus on when propensity score methods should or should not be used, versus traditional regression modelling.20
Variable and model selection to generate propensity scores
Our subsequent discussion will focus on binary treatments (treated vs. reference) and outcomes (presence vs. absence of a GI bleed). Guidance on the creation of propensity scores for other categorical and for continuous treatments is described elsewhere.21,22
The purpose of the propensity score is to achieve a balance of pretreatment covariates between the treated and reference groups. Since it is assumed that every participant has the possibility of receiving the treatment or not receiving the treatment, propensity scores fall in the range (0,1) (non-inclusive). If the propensity score creates a balance and the aforementioned assumptions hold, then the reference group will yield the closest proxy for what may have happened had the treatment group not actually been treated. The difference in outcomes between the treated and reference groups will then be an estimate of the true effect of the treatment.
Covariate selection
In order to estimate a participant’s propensity score, consideration must be given to which covariates are selected to generate the score. Covariates can be related to the treatment, the outcome, or both. As previously noted, confounders are covariates that affect outcomes and treatment assignments; not all covariates are confounders or warrant selection for inclusion in the propensity score. The covariate selection process should be based on subject matter knowledge, data quality, and the number of available covariates.23 Furthermore, covariates can have complex relationships among themselves, making the relationship between the treatment and outcome complex to understand.
Confounders must be adjusted for in order to prevent bias, yet the relationships between confounders and other covariates can complicate the covariate selection process. A recently proposed strategy for covariate selection involves including variables that are defined more broadly than confounders, i.e., are causes of the treatment, causes of the outcome, or causes of both24 (only the last meets the definition of a confounder). Important qualifications of this strategy include two phenomena that may increase or decrease bias. First, if there is residual unmeasured confounding (for observational studies, this will likely be the case), some causes of the treatment that are fully unrelated to the outcome, known as instrumental variables, may amplify bias if they are selected and thus controlled for. Second, if there is residual unmeasured confounding, covariates that are known proxies of the unmeasured confounding variable can reduce bias if included. As the first phenomenon requires a deeper level of knowledge of how pretreatment covariates are related to the treatment and outcome, it may be more prudent to base covariate selection on how covariates are related to the outcome. The relationships between the treatment, outcome, confounders, and other covariates can become complicated; we recommend visualizing these relationships using a causal diagram, which serves as a supportive tool to justify covariate selection through visual identification of confounders, proxies, instrumental variables, and other covariate types.25,26 More information on covariate selection strategies is described elsewhere.23,24
Covariate selection is sometimes achieved through statistical methods alone, such as forward and backward selection or through machine learning techniques. These methods alone are not adequate to select variables to estimate a propensity score, as they cannot distinguish between what should and should not be controlled for conceptually (they merely assess how covariates predict treatment). For our anchoring example of COX-2 inhibitors versus NSAIDs and GI bleeding, we would select covariates for inclusion based on clinical knowledge and use of a causal diagram, excluding instrumental variables from the list of covariates to be selected.
Estimating the propensity score
Propensity scores are most commonly derived using a logistic regression model,7 where treatment selection is the dependent variable and selected covariates serve as independent variables. The purpose of this model is not to predict treatment, but to predict the probability of receiving the treatment of interest, conditional on observed covariates. The estimated propensity score from this model also serves as a balancing tool to adjust for confounding. As a point of distinction, the logistic regression model used to estimate the propensity score is separate from a subsequent model used to examine the association between treatment and outcome. For our anchoring example, we would use logistic regression to estimate the propensity score, but then use the propensity score in a Cox proportional hazards model to generate a hazard ratio for COX-2 inhibitor (vs. NSAID) use and GI bleeding.
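As a rough illustration of this estimation step, the sketch below fits a logistic regression of treatment on two covariates and returns each patient's estimated propensity score. The toy cohort, the covariate names, and the helper functions `fit_logistic` and `predict_ps` are hypothetical and introduced only for illustration; in practice, an established statistical package would be used rather than hand-written gradient ascent.

```python
import math

def predict_ps(beta, x):
    """Propensity score: modeled probability of treatment given covariates x."""
    linear = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-linear))

def fit_logistic(X, z, lr=0.5, n_iter=5000):
    """Fit a logistic regression of treatment z (0/1) on covariates X
    by plain gradient ascent on the average log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)  # intercept, then one coefficient per covariate
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for x, zi in zip(X, z):
            err = zi - predict_ps(beta, x)
            grad[0] += err
            for j in range(k):
                grad[j + 1] += err * x[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy cohort (not real data).
# Covariates: [age in decades, centered; prior GI history (0/1)]
X = [[-1.5, 0], [-1.0, 0], [-0.5, 0], [-0.5, 1], [0.0, 0], [0.0, 1],
     [0.5, 0], [0.5, 1], [1.0, 0], [1.0, 1], [1.5, 1], [-1.0, 1]]
# Treatment: 1 = COX-2 inhibitor, 0 = nonselective NSAID
z = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0]

beta = fit_logistic(X, z)
ps = [predict_ps(beta, x) for x in X]  # one estimated score per patient
```

The outcome (GI bleeding) does not appear anywhere in this step, which reflects the separation of propensity score estimation from the outcome analysis.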
High-dimensional propensity scores
Observational data infrequently contain all information on important confounders that should be included in a propensity score, as such data sources are not typically collected for research purposes. Especially when using such datasets that contain thousands of potential covariates (e.g., administrative billing data), the high-dimensional propensity score is an increasingly popular data adaptive approach to address measured and unmeasured confounding, and has been applied in studies of varying sizes. This method identifies commonly occurring measured covariates likely to introduce bias that can collectively serve as proxies for unmeasured confounders, then uses those covariates to generate a propensity score.27 For example, in the absence of a covariate measuring rheumatoid arthritis severity (an indication for COX-2 inhibitors), frequency of hospitalization may serve as a proxy, with higher frequency of hospitalization often indicating more severe disease. Importantly, these empirically identified covariates can be supplemented with investigator-selected variables of clinical relevance and measures of intensity of healthcare utilization—all of which are then used to generate propensity scores. In a set of seminal examples published by Schneeweiss et al,27 the authors reported that the application of the high-dimensional propensity score algorithm produced results closer to the expected findings based on RCTs, compared with propensity score adjustment that used a more limited number of investigator-selected covariates. Despite this, it would be a misstatement to conclude that the application of high-dimensional propensity score methods is almost as good a tool as randomization.3
Determining the treatment effect to be estimated
An important decision to make before applying the estimated propensity score (see below) is to define the group of individuals for whom we wish to estimate the average causal effect of the treatment. Different propensity score applications lend themselves more easily to different estimates of the causal effect, although with slight modification of standard implementation, some applications may be able to estimate additional causal effects. The average treatment effect (ATE) is the average effect of the treatment when the entire population moves from being untreated (or alternatively treated) to treated; of note, this is typically the treatment effect estimate obtained from RCTs. The average treatment effect for the treated (ATT) is the average effect of the treatment on those who ultimately received the treatment. It should be decided whether the ATE or ATT will be of primary interest, based on the research question. If the ATT were selected for our anchoring example, the final results of the study would reflect the rate of GI bleed incidence in the COX-2 inhibitor group relative to the rate in the nonselective NSAID group. However, if the ATE were selected, this result would reflect the change in rate of GI bleed incidence should all patients be switched from nonselective NSAIDs to COX-2 inhibitors. We are principally interested in estimating the ATT, since it is preferable when patients’ characteristics are more likely to determine the treatment received.4
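The two estimands can also be written compactly in potential-outcomes notation. This notation is not used elsewhere in this review and is introduced here only as an assumed shorthand: $Z$ denotes the treatment indicator (1 = treated, 0 = reference), and $Y(1)$ and $Y(0)$ denote the outcomes a participant would experience under treatment and under the reference, respectively.

```latex
\begin{align*}
\text{ATE} &= E\left[\, Y(1) - Y(0) \,\right] \\
\text{ATT} &= E\left[\, Y(1) - Y(0) \mid Z = 1 \,\right]
\end{align*}
```

The contrast being averaged is the same in both cases; the estimands differ only in the population over which it is averaged (the entire study population for the ATE, and only those who actually received the treatment for the ATT).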
Applications of propensity scores
Once the propensity score has been estimated, one needs to decide how to use it in a model examining the association between treatment and outcome. Four different types of propensity score application methods are described below: 1) propensity score matching; 2) stratification on the propensity score; 3) inverse probability of treatment weighting; and 4) covariate adjustment using the propensity score. The pros and cons of each application are summarized in Table 1. Those interested in further detail should consult work by Austin; in general, matching or inverse probability of treatment weighting applications may be preferable to stratification or covariate adjustment using the propensity score, especially when estimating the relative effect of treatment on time-to-event outcomes.28
Table 1.
Pros and cons of each propensity score application
| PS Application | Pros | Cons |
|---|---|---|
| Matching | Reliably balances measured covariates in most circumstances; separates study design from study analysis | Unmatched patients are excluded from the analysis, which may affect power and generalizability; less precise if matched pairs do not have the exact same score |
| Stratification | Retains all patients in the study; produces effect estimates for each stratum; separates study design from study analysis | Comparability of all strata may be more difficult in datasets with few outcomes, especially when dealing with a large number of strata; less precise if the propensity score is not the same within each stratum |
| Weighting | Retains all patients in the study; easy to implement; balances covariates perfectly; separates study design from study analysis | Weights can become unstable if there are extreme values; may be more sensitive to whether the propensity score has been accurately estimated |
| Covariate adjustment | Easy to implement; performs well in certain circumstances | May be more sensitive to whether the propensity score has been accurately estimated; does not separate study design from study analysis |
Matching
Propensity score matching, one of the more popular applications, involves the creation of matched sets (e.g., pairs) of treated and reference participants who have the same or a very similar propensity score. Differences in outcomes are calculated at the set level, and the results from each set are averaged to calculate the overall effect. Typical matching approaches will estimate the ATT, but certain matching techniques will yield the ATE.20
The operational process of creating matched sets has critical decision nodes, each of which must be carefully considered in the context of the research question at hand. First, will you match with (or without) replacement? That is, once a reference participant is selected to be matched to a treated participant, will you permit that reference participant to be considered as a potential match for subsequent treated participants? Replacement increases the likelihood that each participant will be matched, especially if certain propensity score values occur uncommonly, yet can adversely affect variance estimation. Furthermore, simulations suggest that matching with replacement does not result in less biased measures of association compared to the best-performing matching methods that do not permit replacement.29 For this and other reasons, matching with replacement has been discouraged and is therefore applied infrequently.29 The subsequent discussion therefore assumes the decision to match without replacement. Second, how will you order participants before initiating the matching algorithm? Rank ordering by highest to lowest propensity score, lowest to highest propensity score, or random order (listed as examples) can affect who is matched to whom and qualitatively alter conclusions.30 Third, will you form sets using optimal or greedy matching? In optimal matching, sets are formed to minimize the total within-set difference of the propensity score. In greedy matching, a type of nearest neighbor matching,31 participants are matched on decreasing levels of precision of the propensity score. Each treated participant is first matched to one (or more) reference participants whose score equals that of the treated participant to at least the fifth digit of the propensity score. When all matches at the fifth digit are exhausted, the process then moves to the fourth digit, and so on.31 In general, optimal and greedy matching approaches produce similar balance in matched samples; greedy matching tends to be less computationally intensive. Fourth, will you impose a caliper threshold? Stated differently, will you rule out forming a set if the closest propensity score for a reference participant exceeds a certain absolute distance from the propensity score of the treated participant? The larger the permitted caliper distance, the more likely you will find a match for a treated participant, but the less likely the set may be truly comparable. Inexact matches may lead to bias. Yet, excluding large numbers of participants that do not meet your matching specifications will decrease statistical precision and result in an analytic sample that deviates from the target population. Even the loss of a few participants due to a lack of a match may alter the results. There are a number of proposed approaches to address these issues.32 It is important to note that there is no universally agreed upon definition of what is an acceptable caliper threshold.
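To make these decision nodes concrete, the following minimal sketch performs 1:1 matching without replacement, processes treated participants from highest to lowest propensity score, and applies a caliper. It uses a simple nearest-neighbor rule rather than the digit-based greedy algorithm described above. The function name `greedy_match`, the participant identifiers, the scores, and the caliper of 0.05 are all hypothetical choices for illustration, not recommendations.

```python
def greedy_match(treated, reference, caliper):
    """1:1 nearest-neighbor matching without replacement.

    treated, reference: dicts mapping participant id -> propensity score.
    Returns a list of (treated_id, reference_id) pairs.
    """
    available = dict(reference)  # reference participants not yet matched
    pairs = []
    # Ordering decision: treated participants from highest to lowest score.
    for t_id, t_ps in sorted(treated.items(), key=lambda kv: kv[1], reverse=True):
        if not available:
            break
        # Closest remaining reference participant on the propensity score.
        r_id = min(available, key=lambda r: abs(available[r] - t_ps))
        # Caliper decision: only accept the match if it is close enough.
        if abs(available[r_id] - t_ps) <= caliper:
            pairs.append((t_id, r_id))
            del available[r_id]  # without replacement
    return pairs

# Hypothetical propensity scores (not real data).
treated = {"T1": 0.81, "T2": 0.62, "T3": 0.35}
reference = {"R1": 0.60, "R2": 0.33, "R3": 0.30, "R4": 0.10}

pairs = greedy_match(treated, reference, caliper=0.05)
matched_treated = {t for t, _ in pairs}
unmatched = [t for t in treated if t not in matched_treated]
```

In this toy example, T1 has no reference participant within the caliper and is therefore excluded, which illustrates how matching can shift the analytic sample away from the original treated population.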
Stratification
With stratification, the study sample is split into mutually exclusive groups in which the propensity score is roughly the same in each group. Most commonly, equally sized strata are created using quintiles or deciles of the propensity score; recent simulations suggest that, for binary outcomes, deciles further minimize bias and increase precision compared to quintiles.33 More sophisticated stratification approaches exist that may determine an optimal stratification solution (rather than simply relying on quintiles or deciles).34 Estimates of the association between exposure and outcome are calculated within each stratum and then pooled for an overall treatment effect. Typical approaches will estimate the ATE, but the ATT can also be estimated.20 Stratification can be appealing since it is generally easy to implement and interpret, convincing to non-technical audiences, and easily accommodates additional model-based adjustments.35 Simulations using propensity score stratification for time-to-event outcomes, though, have suggested that it yields biased estimates of hazard ratios, compared to matching (discussed above) and weighting (discussed below).28
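A minimal sketch of this procedure for a binary outcome is given below: participants are sorted on the propensity score, split into equally sized strata, and the within-stratum risk differences are pooled using stratum size as the weight (a choice that targets the ATE). The function name `stratified_risk_difference` and the toy records are hypothetical, and only two strata are used to keep the example readable.

```python
def stratified_risk_difference(records, n_strata=5):
    """Pool within-stratum risk differences, weighting by stratum size.

    records: list of (propensity_score, treated, outcome) tuples,
             with treated and outcome coded 0/1.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    total, used = 0.0, 0
    for s in range(n_strata):
        stratum = ordered[s * n // n_strata:(s + 1) * n // n_strata]
        treated_y = [y for _, z, y in stratum if z == 1]
        reference_y = [y for _, z, y in stratum if z == 0]
        if not treated_y or not reference_y:
            continue  # no within-stratum contrast is possible
        rd = sum(treated_y) / len(treated_y) - sum(reference_y) / len(reference_y)
        total += rd * len(stratum)
        used += len(stratum)
    return total / used

# Hypothetical records (not real data): (propensity score, treated, GI bleed)
records = [
    (0.10, 0, 0), (0.15, 1, 0), (0.20, 0, 1), (0.25, 1, 0),
    (0.60, 0, 1), (0.70, 1, 1), (0.80, 1, 0), (0.90, 0, 0),
]

pooled_rd = stratified_risk_difference(records, n_strata=2)
```

Weighting each stratum by its number of treated participants instead of its total size would be one way to target the ATT.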
Weighting
In propensity score weighting, the propensity score is used to calculate weights for each participant, resulting in a ‘pseudo-population’, where the distribution of baseline covariates is balanced across treatment groups. There are several weighting techniques (including standardized mortality ratio weights and fine stratification weights36), with the most common being inverse probability of treatment weighting (IPTW). The IPTW is calculated as the inverse of the participant’s probability of their observed treatment. For treated participants, the weight is the inverse of the propensity score (1/PS); for reference participants, the weight is calculated as 1/(1-PS), where 1-PS is the probability of receiving the reference treatment. The IPTW can easily be used when there are more than two treatment groups and can also handle longitudinal studies of time-varying treatments subject to time-varying confounding. All patients are retained in the analysis with weighting applications, and the weights yield a perfect balance between groups for all measured covariates. Typical approaches will estimate the ATE, but alternative weighting approaches will estimate the ATT.20
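The weight formulas above can be sketched as follows. The function `ate_weight` implements the IPTW as described; `att_weight` implements standardized mortality ratio weighting (a weight of 1 for treated participants and PS/(1-PS) for reference participants), which is one common way to target the ATT. The function names, the helper `weighted_risk`, and the toy records are hypothetical.

```python
def ate_weight(ps, treated):
    """IPTW: inverse of the probability of the treatment actually received."""
    return 1.0 / ps if treated else 1.0 / (1.0 - ps)

def att_weight(ps, treated):
    """Standardized mortality ratio weight, targeting the ATT."""
    return 1.0 if treated else ps / (1.0 - ps)

def weighted_risk(records, weight_fn, arm):
    """Weighted outcome risk in one treatment arm (1 = treated, 0 = reference)."""
    num = den = 0.0
    for ps, z, y in records:
        if z == arm:
            w = weight_fn(ps, z)
            num += w * y
            den += w
    return num / den

# Hypothetical records (not real data): (propensity score, treated, GI bleed)
records = [
    (0.10, 0, 0), (0.15, 1, 0), (0.20, 0, 1), (0.25, 1, 0),
    (0.60, 0, 1), (0.70, 1, 1), (0.80, 1, 0), (0.90, 0, 0),
]

risk_difference_ate = (weighted_risk(records, ate_weight, 1)
                       - weighted_risk(records, ate_weight, 0))
```

Note how a reference participant with a propensity score of 0.90 receives an IPTW of about 10; such large weights are the source of the instability listed among the cons in Table 1.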
Covariate adjustment
An intuitive application of the propensity score is to use the score as a covariate (instead of the individual covariates themselves) in the regression model, assessing the association between treatment and outcome. The outcome is regressed on the treatment variable and the propensity score variable. In order to estimate the ATE or the ATT, an interaction between the treatment and the propensity score should be included in the regression model, such that the treatment effect may be evaluated at mean values of the propensity score in the whole sample (ATE) or the treatment group (ATT). The use of covariate adjustment can be enhanced by including nonlinear terms from the propensity score model in the final outcome model.37 It is important to understand that covariate adjustment is sensitive to assumptions on how the covariates are distributed, how accurately the propensity score has been specified, and how accurately the final outcome regression model has been specified.9,37 An incorrectly specified model may lead to incorrect extrapolations of the results.38 Other propensity score applications have been shown to be more robust against model misspecification.20 For these reasons, using the propensity score as a covariate in regression adjustment has been discouraged.
Evaluating balance diagnostics
Once the propensity score is applied, it is critical to examine whether it adequately balanced pretreatment covariates across the treatment groups. If important differences between treatment groups remain, the model used to create the propensity score may have been inadequately specified and should be modified. Such changes can involve including additional covariates or interaction terms in the propensity score model. Another approach is to include covariates that exhibit residual imbalance in the outcome model used to evaluate treatment effects. The refinement of the propensity score model is usually an iterative process of estimation, application, and checking for balance.39 In this section, we will focus on diagnostics for pairwise propensity score matching. The methods for analyzing balance can readily be translated to stratification, weighting, and regression adjustment applications.40–42
First, as the propensity score is a balancing score, pretreatment covariates should be balanced between treated and reference groups in the matched sample. Standardized differences can be used to assess how similar pretreatment covariates are before and after the application of propensity scores. Standardized differences are preferable to p values because they show both the direction and the magnitude of the differences (p values can only show magnitude). Furthermore, p values are sensitive to sample size. As a rule of thumb, covariates with absolute standardized differences ≤ 0.1 are considered reasonably balanced.9
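As an illustration, the standardized difference for a continuous covariate is commonly computed as the difference in group means divided by the square root of the average of the two group variances; for a binary covariate, the group proportions and their binomial variances take the place of the means and variances. The sketch below assumes these common definitions, and the function names and age values are hypothetical.

```python
import math
from statistics import mean, variance

def standardized_difference(treated_values, reference_values):
    """Standardized difference for a continuous covariate."""
    pooled_sd = math.sqrt(
        (variance(treated_values) + variance(reference_values)) / 2
    )
    return (mean(treated_values) - mean(reference_values)) / pooled_sd

def standardized_difference_binary(p_treated, p_reference):
    """Standardized difference for a binary covariate, from group proportions."""
    pooled_sd = math.sqrt(
        (p_treated * (1 - p_treated) + p_reference * (1 - p_reference)) / 2
    )
    return (p_treated - p_reference) / pooled_sd

# Hypothetical ages in years (not real data).
age_treated_before = [70, 72, 68, 75, 71]
age_reference_before = [55, 60, 58, 62, 65]
age_treated_after = [70, 72, 68]      # treated participants retained after matching
age_reference_after = [69, 73, 68]    # their matched reference participants

d_before = standardized_difference(age_treated_before, age_reference_before)
d_after = standardized_difference(age_treated_after, age_reference_after)
balanced_after = abs(d_after) <= 0.1  # rule of thumb cited above
```

Because the standardized difference is expressed in units of the pooled standard deviation, the same threshold can be applied to covariates measured on different scales.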
Next, it should be ensured that there are participants in both treatment groups across all values of the propensity score; this is called checking for common support.7 It can be accomplished by visually inspecting the distribution of the propensity score (i.e., a histogram or other type of graphic plot), assessing for substantial overlap between the treated and reference groups in the matched sample. If there are participants with extremely low or high treatment probabilities in one treatment group but not the other, this can yield a number of unmatched participants, increasing the level of uncertainty in the results. Trimming the tails of the propensity score distributions, especially these areas of nonoverlap, should be done prior to any propensity score application, as this has been shown to reduce unmeasured confounding.43 If participants with these propensity scores are trimmed from the analysis, the study population is now a subset of the original sample, which may decrease precision and affect generalizability.
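Alongside visual inspection, the region of overlap can be checked numerically. The sketch below uses one simple rule, which is an assumption on our part rather than a prescribed standard: the region of common support is taken to run from the larger of the two group minimums to the smaller of the two group maximums, and participants outside that range are trimmed. The function names and scores are hypothetical.

```python
def common_support(treated_ps, reference_ps):
    """Range of propensity scores observed in both treatment groups."""
    lower = max(min(treated_ps), min(reference_ps))
    upper = min(max(treated_ps), max(reference_ps))
    return lower, upper

def trim(ps_values, lower, upper):
    """Keep only scores inside the region of common support."""
    return [p for p in ps_values if lower <= p <= upper]

# Hypothetical propensity scores (not real data).
treated_ps = [0.35, 0.50, 0.70, 0.95]
reference_ps = [0.05, 0.30, 0.45, 0.80]

lower, upper = common_support(treated_ps, reference_ps)
treated_kept = trim(treated_ps, lower, upper)
reference_kept = trim(reference_ps, lower, upper)

n_total = len(treated_ps) + len(reference_ps)
n_kept = len(treated_kept) + len(reference_kept)
percent_trimmed = 100 * (n_total - n_kept) / n_total  # worth reporting
```

Reporting the percentage trimmed makes the resulting change in the study population transparent to readers.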
Details surrounding the estimation, application, and balance diagnosis of the propensity score should be explicitly reported in a manuscript. Table 2 summarizes recommendations for reporting. Additional detail can be found elsewhere.44
Table 2.
Recommendations for reporting propensity score analyses37
| Description of the propensity score method | • Discuss why propensity score methods are appropriate for the study • Select a treatment effect appropriate for the research question • Provide justification for which propensity score method is selected |
| Propensity score development | • List all covariates to be included in the propensity score model • Describe how covariates were selected to estimate propensity score ○ Include image of causal diagram, if applicable • Describe the model selected to estimate propensity score |
| Propensity score application | • Stratification ○ State the number of strata used and describe how covariates were compared within strata • Matching ○ Describe the matching ratio, ordering, the use of replacement, caliper distance ○ Report the number of patients for each treatment group before and after matching • Weighting ○ Describe the weighting method used and how covariates were compared between weighted groups • Regression Adjustment ○ Report any other variables or interaction terms that are included in the final outcome model |
| Balance diagnostics | • Compare balance of baseline variables before and after propensity score use ○ Standardized differences are preferred to p values • Report percentage of sample trimmed after checking for common support • Discuss any remaining imbalance and implications for generalizability of results |
Potential limitations and controversy in the use of propensity scores
Despite common use of the propensity score to address confounding, there are pitfalls and controversy. First, the exchangeability (i.e., no unmeasured confounding) assumption is very strong and will rarely be met in observational studies for numerous reasons. Treatment decisions are based on clinical assessment and a balance of risks and benefits. Commonly, the data sources that are used for observational studies are not created for research purposes, leaving many important variables for a given study unmeasured. Meeting the exchangeability assumption is important, as this indicates that the treated and reference groups are comparable. When treatment allocation has not been randomized, it is almost impossible to be certain that the data have captured every possible source of confounding. Thus, it is recommended that sensitivity analyses be conducted to assess how sensitive conclusions are to violations of the exchangeability assumption.45
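One simple way to probe the exchangeability assumption is an external adjustment calculation for a single binary unmeasured confounder. The sketch below is an illustration only and is not the method of reference 45. It assumes a risk ratio scale, a confounder whose association with the outcome is the same in both treatment groups, and hypothetical input values chosen for the example.

```python
def bias_factor(prev_treated, prev_reference, rr_confounder_outcome):
    """Multiplicative bias from an unmeasured binary confounder.

    prev_treated / prev_reference: assumed prevalence of the confounder
    in the treated and reference groups.
    rr_confounder_outcome: assumed risk ratio linking confounder and outcome.
    """
    numerator = prev_treated * (rr_confounder_outcome - 1) + 1
    denominator = prev_reference * (rr_confounder_outcome - 1) + 1
    return numerator / denominator


def externally_adjusted_rr(rr_observed, prev_treated, prev_reference,
                           rr_confounder_outcome):
    """Observed risk ratio divided by the bias factor."""
    return rr_observed / bias_factor(
        prev_treated, prev_reference, rr_confounder_outcome
    )


# Hypothetical scenario: observed RR of 1.5, confounder present in 40% of
# treated and 20% of reference patients, and doubling the outcome risk.
rr_adjusted = externally_adjusted_rr(1.5, 0.4, 0.2, 2.0)
print(f"externally adjusted RR = {rr_adjusted:.2f}")

# Vary the assumed confounder strength to see how the conclusion moves.
for rr_cd in (1.5, 2.0, 3.0, 5.0):
    print(rr_cd, round(externally_adjusted_rr(1.5, 0.4, 0.2, rr_cd), 2))
```

Repeating the calculation over a grid of plausible prevalences and confounder strengths shows how strong an unmeasured confounder would need to be to move the estimate to the null.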
Next, missing covariate values are challenging and can lead to biased estimates of the propensity score as well as biased estimates of treatment effects. The reason why values are missing is key to deciding how best to handle the missing data.46 For example, a laboratory value missing due to an unfortunately timed power outage versus one missing because the patient was too ill to undergo an assessment can have very different implications in the data. One method to handle missing covariate data is a complete case analysis, which estimates the propensity score using only patients who have complete data; its primary assumption is that missingness is not associated with observed or unobserved data. Multiple imputation is another approach to handling missingness: multiple copies of the dataset are made, the missing values are replaced by imputed values, and the propensity scores estimated from each dataset are combined. Multiple imputation approaches assume that missingness may be explained by a participant’s observed data. Sensitivity analyses may be conducted to consider the impact of missingness explained by unobserved data.
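The combining step in multiple imputation can be sketched as follows. This example assumes one possible combination rule, averaging each patient's estimated propensity score across the imputed datasets; pooling treatment effect estimates across datasets is an alternative that is not shown. The function name and the scores are hypothetical, and the imputation and propensity score estimation steps are taken as already done.

```python
from statistics import mean


def average_propensity_scores(ps_by_imputation):
    """Average each patient's propensity score across imputed datasets.

    ps_by_imputation: list of m lists, one per imputed dataset, each
    holding one propensity score per patient in the same patient order.
    """
    n_patients = len(ps_by_imputation[0])
    if any(len(scores) != n_patients for scores in ps_by_imputation):
        raise ValueError("every imputed dataset must cover the same patients")
    # zip(*...) regroups the scores so each tuple belongs to one patient.
    return [mean(patient_scores) for patient_scores in zip(*ps_by_imputation)]


# Hypothetical propensity scores for 3 patients from m = 3 imputed datasets.
ps_by_imputation = [
    [0.2, 0.5, 0.8],
    [0.3, 0.5, 0.7],
    [0.4, 0.5, 0.9],
]

combined = average_propensity_scores(ps_by_imputation)
print([round(p, 2) for p in combined])
```

The second patient has identical scores in every dataset, as would be expected for someone with no missing covariates, while the other two vary with the imputed values.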
Finally, there is a concept known as the propensity score matching paradox, initially described by King and Nielsen.47 While the purpose of propensity score matching is to reduce the imbalance of covariates between treated and reference groups, the application of matching may inadvertently increase imbalance and statistical bias. This occurs when propensity score matching is applied to data where covariates were fairly balanced prior to matching, and covariates become imbalanced after pruning some of the worst-matched sets, leading to a biased estimate. In pharmacoepidemiology studies, it is very unlikely that covariates will be fairly balanced prior to matching. Ripollone et al48 further explored this phenomenon in two different pharmacoepidemiologic studies and found that propensity score matching did not harm covariate balance under standard approaches to matching and caliper criteria. The paradox became more of a concern when pre-matched data had very little imbalance and only a few matched sets needed to be trimmed before covariates became imbalanced.48
Summary
The propensity score can be an effective tool to control for confounding bias in an observational study, and in some scenarios has advantages over traditional multivariable regression model adjustment. How a researcher uses the propensity score to balance covariates and what causal effect will be targeted strongly depend on the research question at hand. The temporality, positivity, SUTVA, and exchangeability assumptions must be met to the best of the ability of the researcher. When selecting the covariates to be included in estimating the propensity score, we recommend this process be based on clinical knowledge, ideally illustrated with a causal diagram. Each propensity score application has pros and cons; we recommend that the application be selected based on the causal effect to be estimated and characteristics of the data. Assessing how well the propensity score balances pretreatment variables is an iterative process, and balance checking should be conducted through visualizations.
Footnotes
Potential Conflicts of Interest: CEL is an Executive Committee Member of the University of Pennsylvania’s Center for Pharmacoepidemiology Research and Training. The Center receives funding from Pfizer and Sanofi to support trainee education. CEL recently received honoraria from the American College of Clinical Pharmacy Foundation and University of Florida, unrelated to the topic of this paper. CEL is a Special Government Employee of the US Food and Drug Administration and currently consults for the Reagan-Udall Foundation for the FDA.
References
- 1. Strom BL, Kimmel SE, Hennessy S. Pharmacoepidemiology. Wiley; 2019. doi: 10.1002/9781119413431
- 2. Weinberg CR. Toward a Clearer Definition of Confounding. Am J Epidemiol. 1993;137(1):1–8. doi: 10.1093/oxfordjournals.aje.a117108
- 3. Kahlert J, Gribsholt SB, Gammelager H, Dekkers OM, Luta G. Control of confounding in the analysis phase – an overview for clinicians. Clin Epidemiol. 2017;9:195. doi: 10.2147/CLEP.S129886
- 4. Benedetto U, Head SJ, Angelini GD, Blackstone EH. Statistical primer: propensity score matching and its alternatives. Eur J Cardio-Thoracic Surg. 2018;53(6):1112–1117. doi: 10.1093/EJCTS/EZY167
- 5. Stürmer T, Schneeweiss S, Brookhart MA, Rothman KJ, Avorn J, Glynn RJ. Analytic strategies to adjust confounding using exposure propensity scores and disease risk scores: Nonsteroidal antiinflammatory drugs and short-term mortality in the elderly. Am J Epidemiol. 2005;161(9):891–898. doi: 10.1093/aje/kwi106
- 6. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. doi: 10.1093/biomet/70.1.41
- 7. Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behav Res. 2011;46(3):399. doi: 10.1080/00273171.2011.568786
- 8. Brooks JM, Ohsfeldt RL. Squeezing the Balloon: Propensity Scores and Unmeasured Covariate Balance. Health Serv Res. 2013;48(4):1487–1507. doi: 10.1111/1475-6773.12020
- 9. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083. doi: 10.1002/SIM.3697
- 10. Petersen ML, Porter KE, Gruber S, Wang Y, Van Der Laan MJ. Diagnosing and responding to violations in the positivity assumption. Stat Methods Med Res. 2012;21(1):31–54. doi: 10.1177/0962280210386207
- 11. Rubin DB. Bias Reduction Using Mahalanobis-Metric Matching. Biometrics. 1980;36(2):293. doi: 10.2307/2529981
- 12. Ali MS, Groenwold RHH, Klungel OH. Propensity Score Methods and Unobserved Covariate Imbalance: Comments on “Squeezing the Balloon.” Health Serv Res. 2014;49(3):1074. doi: 10.1111/1475-6773.12152
- 13. Weitzen S, Lapane KL, Toledano AY, Hume AL, Mor V. Principles for modeling propensity scores in medical research: a systematic literature review. Pharmacoepidemiol Drug Saf. 2004;13:841–853. doi: 10.1002/pds.969
- 14. Shah BR, Laupacis A, Hux JE, Austin PC. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. J Clin Epidemiol. 2005;58(6):550–559. doi: 10.1016/J.JCLINEPI.2004.10.016
- 15. Stürmer T, Joshi M, Glynn RJ, Avorn J, Rothman KJ, Schneeweiss S. A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. J Clin Epidemiol. 2006;59(5):437. doi: 10.1016/J.JCLINEPI.2005.07.004
- 16. Glynn RJ, Schneeweiss S, Stürmer T. Indications for Propensity Scores and Review of Their Use in Pharmacoepidemiology. Basic Clin Pharmacol Toxicol. 2006;98(3):253. doi: 10.1111/J.1742-7843.2006.PTO_293.X
- 17. Pirracchio R, Resche-Rigon M, Chevret S. Evaluation of the Propensity score methods for estimating marginal odds ratios in case of small sample size. BMC Med Res Methodol. 2012;12(1):1–10. doi: 10.1186/1471-2288-12-70
- 18. Cepeda MS, Boston R, Farrar JT, Strom BL. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol. 2003;158(3):280–287. doi: 10.1093/aje/kwg115
- 19. van Smeden M, de Groot JAH, Moons KGM, et al. No rationale for 1 variable per 10 events criterion for binary logistic regression analysis. BMC Med Res Methodol. 2016;16(1):1–12. doi: 10.1186/S12874-016-0267-3
- 20. Williamson E, Morley R, Lucas A, Carpenter J. Propensity scores: From naïve enthusiasm to intuitive understanding. Stat Methods Med Res. 2012;21:273–293. doi: 10.1177/0962280210394483
- 21. Hirano K, Imbens GW. The Propensity Score with Continuous Treatments. In: Gelman A, Meng X-L, eds. Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives. John Wiley & Sons Ltd; 2004:73–84. Accessed July 23, 2021. https://books.google.com/books?hl=en&lr=&id=hrPYDOLs9FEC&oi=fnd&pg=PR5&ots=Xo5XwYhllo&sig=QjhjdF2zD_hKTh4nUODEt-zFKjg#v=onepage&q&f=false
- 22. Imbens G. The role of the propensity score in estimating dose-response functions. Biometrika. 2000;87(3):706–710. doi: 10.1093/BIOMET/87.3.706
- 23. Sauer BC, Brookhart MA, Roy J, Vanderweele T. A review of covariate selection for nonexperimental comparative effectiveness research. Pharmacoepidemiol Drug Saf. 2013;22(11):1139–1145. doi: 10.1002/pds.3506
- 24. VanderWeele TJ. Principles of confounder selection. Eur J Epidemiol. 2019;34(3):211–219. doi: 10.1007/s10654-019-00494-6
- 25. Pearl J. Causal Diagrams for Empirical Research. Biometrika. 1995;82(4):669. doi: 10.2307/2337329
- 26. Hernán MA, Hernández-Diaz S, Werler MM, Mitchell AA. Causal knowledge as a prerequisite for confounding evaluation: An application to birth defects epidemiology. Am J Epidemiol. 2002;155(2):176–184. doi: 10.1093/aje/155.2.176
- 27. Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512–522. doi: 10.1097/EDE.0b013e3181a663cc
- 28. Austin PC. The performance of different propensity score methods for estimating marginal hazard ratios. Stat Med. 2013;32(16):2837. doi: 10.1002/SIM.5705
- 29. Austin PC. A comparison of 12 algorithms for matching on the propensity score. Stat Med. 2014;33(6):1057. doi: 10.1002/SIM.6004
- 30. Komen JJ, Belitser SV, Wyss R, et al. Greedy caliper propensity score matching can yield variable estimates of the treatment-outcome association: A simulation study. Pharmacoepidemiol Drug Saf. 2021;30(7):934–951. doi: 10.1002/PDS.5232
- 31. Rassen JA, Shelat AA, Myers J, Glynn RJ, Rothman KJ, Schneeweiss S. One-to-many propensity score matching in cohort studies. Pharmacoepidemiol Drug Saf. 2012;21(SUPPL.2):69–80. doi: 10.1002/PDS.3263
- 32. Rubin DB, Thomas N. Combining propensity score matching with additional adjustments for prognostic covariates. J Am Stat Assoc. 2000;95(450):573–585. doi: 10.1080/01621459.2000.10474233
- 33. Neuhäuser M, Thielmann M, Ruxton GD. The number of strata in propensity score stratification for a binary outcome. Arch Med Sci. 2018;14(3):695–700. doi: 10.5114/aoms.2016.61813
- 34. Linden A. A comparison of approaches for stratifying on the propensity score to reduce bias. J Eval Clin Pract. 2017;23(4):690–696. doi: 10.1111/JEP.12701
- 35. Adelson JL, McCoach DB, Rogers HJ, Adelson JA, Sauer TM. Developing and Applying the Propensity Score to Make Causal Inferences: Variable Selection and Stratification. Front Psychol. 2017;8:1413. doi: 10.3389/FPSYG.2017.01413
- 36. Desai RJ, Franklin JM. Alternative approaches for confounding adjustment in observational studies using weighting based on the propensity score: A primer for practitioners. BMJ. 2019;367. doi: 10.1136/bmj.l5657
- 37. Rubin DB. On principles for modeling propensity scores in medical research. Pharmacoepidemiol Drug Saf. 2004;13(12):855–857. doi: 10.1002/pds.968
- 38. Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med. 1997;127(8 II SUPPL.):757–763. doi: 10.7326/0003-4819-127-8_PART_2-199710151-00064
- 39. Rosenbaum PR, Rubin DB. Reducing Bias in Observational Studies Using Subclassification on the Propensity Score. J Am Stat Assoc. 1984;79(387):516. doi: 10.2307/2288398
- 40. Joffe MM, Ten Have TR, Feldman HI, Kimmel SE. Model Selection, Confounder Control, and Marginal Structural Models. Am Stat. 2004;58(4):272–279. doi: 10.1198/000313004X5824
- 41. Morgan SL, Todd JJ. A Diagnostic Routine for the Detection of Consequential Heterogeneity of Causal Effects. Sociol Methodol. 2008;38(1):231–281. doi: 10.1111/j.1467-9531.2008.00204.x
- 42. Austin PC. Goodness-of-fit diagnostics for the propensity score model when estimating treatment effects using covariate adjustment with the propensity score. Pharmacoepidemiol Drug Saf. 2008;17(12):1202–1217. doi: 10.1002/pds.1673
- 43. Stürmer T, Wyss R, Glynn RJ, Brookhart MA. Propensity scores for confounder adjustment when assessing the effects of medical interventions using nonexperimental study designs. J Intern Med. 2014;275(6):570–580. doi: 10.1111/joim.12197
- 44. Yao XI, Wang X, Speicher PJ, et al. Reporting and Guidelines in Propensity Score Analysis: A Systematic Review of Cancer and Cancer Surgical Studies. JNCI J Natl Cancer Inst. 2017;109(8). doi: 10.1093/JNCI/DJW323
- 45. Rosenbaum PR, Rubin DB. Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome. J R Stat Soc Ser B. 1983;45(2):212–218. doi: 10.1111/j.2517-6161.1983.tb01242.x
- 46. Choi J, Dekkers OM, le Cessie S. A comparison of different methods to handle missing data in the context of propensity score analysis. Eur J Epidemiol. 2019;34(1):23–36. doi: 10.1007/s10654-018-0447-z
- 47. King G, Nielsen R. Why Propensity Scores Should Not Be Used for Matching. Polit Anal. 2019;27(4):435–454. doi: 10.1017/pan.2019.11
- 48. Ripollone JE, Huybrechts KF, Rothman KJ, Ferguson RE, Franklin JM. Implications of the propensity score matching paradox in pharmacoepidemiology. Am J Epidemiol. 2018;187(9):1951–1961. doi: 10.1093/aje/kwy078
