In this issue of the Journal, Yao and colleagues (1) nicely document the explosive increase in the use of propensity score (PS) methods in studies of cancer risk and its treatments. The reporting guidelines they develop can help authors, reviewers, and editors enhance the consistency of reports from such studies. However, in their focus on guidelines for PS analysis, they pay limited attention to the fundamental role of the PS in study design and to the challenges to valid study design in observational settings. As a complement to their guidelines for reporting PS analysis, Box 1 lists some questions related to the use of a PS for study design.
Box 1.
Questions related to the appropriate use of propensity scores in study design
Is the causal question clearly articulated?
Are the target strategy of interest and its initiation date clearly articulated?
Are the comparator groups and their initiation dates clearly articulated?
Are important time scales, including time on treatment, disease duration, and stage, synchronized between compared treatments?
Are important confounders unmeasured or measured with meaningful error?
Is evaluation of sensitivity to unmeasured confounding planned?
Is there consideration of barriers to initiation of alternative strategies that might be associated with the outcome?
Is assessment of confounders different for those initiating the target vs comparator strategies?
Do all subjects included in the propensity score estimation have reasonable prior probabilities of receiving each of the compared strategies?
Is there considerable overlap in the propensity score distributions of those receiving the target vs comparator treatments?
If overlap is limited, can eligibility criteria be applied to identify a subgroup with greater equipoise?
Would trimming of subjects in the tails of the propensity score with limited overlap enhance equipoise?
Is treatment effect heterogeneity across levels of the propensity score likely?
If matching on the propensity score is implemented, is a tight caliper used?
Are the important predictors of the outcome balanced within propensity score strata?
The main challenge when using observational data to make valid causal comparisons is our typically limited understanding of the mechanisms that lead to the decision to initiate treatment and to the choices among alternative treatments. The PS seeks to characterize the probability of receiving each of the alternative treatments, but we often have incomplete knowledge and less-than-perfect measurement of the relevant risk factors that influence treatment. While the expanded availability of large administrative databases has increased opportunities for comparative effectiveness research, important confounding variables often are not measured adequately or have distributions that are imbalanced between compared groups. Careful consideration of PS distributions between treatment groups at the study design stage can reveal that a valid comparison may not be possible (2).
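To make this design-stage check concrete, the following minimal Python sketch estimates a PS by logistic regression and compares its distribution between treatment groups before any outcome is examined. The input file and covariate names (age, sex, stage) are hypothetical placeholders, and the model is deliberately simple.

```python
# Minimal sketch: estimate a propensity score and inspect its overlap at
# the design stage. The file name and the columns (treated, age, sex,
# stage) are hypothetical placeholders; covariates are assumed numeric-coded.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def estimate_ps(df: pd.DataFrame, covariates: list) -> pd.Series:
    """Fit a logistic model for treatment; return predicted probabilities."""
    model = LogisticRegression(max_iter=1000)
    model.fit(df[covariates], df["treated"])
    return pd.Series(model.predict_proba(df[covariates])[:, 1], index=df.index)

df = pd.read_csv("cohort.csv")                  # hypothetical analytic cohort
df["ps"] = estimate_ps(df, ["age", "sex", "stage"])

# Summarize the PS by treatment group; near-complete separation of the two
# distributions signals that a valid comparison may not be possible.
print(df.groupby("treated")["ps"].describe())
```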
In a series of papers on the key role of the PS in the design of observational studies seeking causal conclusions about alternative treatments, Rubin, the co-inventor of the PS, concludes that the PS is primarily a tool for study design (3–6). He argues that observational comparative effectiveness research should mimic randomized trials, with the PS used to model the unknown treatment assignment mechanism and then to identify study subjects who are balanced on important treatment determinants. Experience with the use of a PS in pharmacoepidemiologic studies supports its primary role as a study design tool. Analysis controlling for a PS, whether by matching, stratification, or modeling, often yields results quite similar to estimates of treatment effects obtained from multivariable regression methods (7,8). No analysis can correct intractable biases that arise when care is not taken to choose an appropriate comparison group and when attention is not paid at the design stage to synchronization of disease duration and time on treatment and to comparability of surveillance of potential confounding variables between compared groups (9–12). Bias can also occur if attention is not paid to influential observations in the tails of the PS distribution, where balance between compared groups is often poor (13–15).
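One design-stage use of the PS that requires no outcome data is stratification followed by a within-stratum balance check. A sketch, continuing from the hypothetical data frame above, with the quintile choice and the covariate examined offered as illustrative assumptions rather than recommendations:

```python
# Sketch, continuing from the data frame above: form PS quintiles and check
# whether a key treatment determinant ('age' here, standing in for any
# measured confounder) is balanced within strata. No outcome data are used.
df["ps_stratum"] = pd.qcut(df["ps"], q=5, labels=False)

within_stratum_means = (
    df.groupby(["ps_stratum", "treated"])["age"]
      .mean()
      .unstack("treated")       # columns: untreated (0) vs treated (1)
)
print(within_stratum_means)     # rows should be similar across the two columns
```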
Design of a causal comparison begins with clear articulation of the comparisons of interest, including specification of the referent comparison treatments and the time of their initiation. While a PS has frequently been used to describe determinants of prevalent treatments, the coherence of the PS is challenged in this setting because factors associated with treatment initiation often differ from those that predict persistence, and confounding by disease duration can be problematic (16,17). Clarity about time scales is necessary to ensure that treatment determinants included in the PS are measured before exposure status is determined, so that the PS accurately mirrors the treatment assignment decision. Clarity is also critical to avoid commonly occurring time-related biases in database studies, such as immortal time bias (18). For example, time-related biases in study design have likely played a role in the controversies regarding metformin use and its association with cancer risk in patients with diabetes (19).
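A minimal sketch of this timing discipline, assuming hypothetical dispensing and diagnosis tables rather than the design of any cited study, anchors each subject at treatment initiation and confines covariate assessment to a baseline window that precedes it:

```python
# Sketch of a new-user timing structure: the index date is the first
# dispensing of the study drug, and covariates entering the PS are
# restricted to a baseline window strictly before that date. Table and
# column names are hypothetical.
import pandas as pd

BASELINE_DAYS = 365

dispensings = pd.read_csv("dispensings.csv", parse_dates=["dispense_date"])
diagnoses = pd.read_csv("diagnoses.csv", parse_dates=["dx_date"])

# Index date = first dispensing per subject (a new-user design).
index_dates = (
    dispensings.groupby("subject_id")["dispense_date"]
    .min()
    .rename("index_date")
    .reset_index()
)

# Keep diagnoses recorded in the year before, and strictly preceding, the
# index date; anything on or after it must not enter the PS.
dx = diagnoses.merge(index_dates, on="subject_id")
baseline_dx = dx[
    (dx["dx_date"] < dx["index_date"])
    & (dx["dx_date"] >= dx["index_date"] - pd.Timedelta(days=BASELINE_DAYS))
]
print(baseline_dx.head())
```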
A well-fitting PS can do an excellent job of balancing measured confounders (20) but generally fails to balance those not measured or measured with substantial error (21). Recognition of limitations in covariate assessment at the design stage is critical, as are plans for sensitivity analysis. In addition to risk factors for the study outcome, barriers to treatment (eg, socioeconomic status, frailty, and unrelated comorbidity), which can affect treatment decisions (9,22), require consideration for inclusion in the PS. Another key concern is whether confounder surveillance and measurement are differential between treatment groups. Specifically, while observational studies of treatment effects have commonly employed nonuser (untreated) referent groups, surveillance and capture of covariate conditions are often less complete in individuals who passively choose not to start a treatment than in initiators. The assumption that a covariate such as a comorbid condition is absent if it is not noted in a health care database can be problematic if a comparison group has quite different surveillance.
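Many forms of planned sensitivity analysis exist; as one simple illustration of the genre, the sketch below computes the E-value of VanderWeele and Ding, a technique named here as an example rather than as the approach of any cited study.

```python
# Sketch: the E-value (VanderWeele and Ding), one simple sensitivity
# analysis for unmeasured confounding. For an observed risk ratio, it gives
# the minimum strength of association, on the risk-ratio scale, that an
# unmeasured confounder would need with both treatment and outcome to fully
# explain away the observed association.
import math

def e_value(rr: float) -> float:
    if rr < 1:                      # protective effects: use the inverse
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))

print(e_value(1.8))  # 3.0: a fairly strong confounder is needed to explain RR = 1.8
```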
Just as in the design of a randomized trial, careful consideration is needed for eligibility and exclusion criteria, with the goal of identifying study subjects for whom all compared treatments are reasonable alternatives (11). After specification of these criteria, the main value of a PS may lie in its ability to evaluate whether such equipoise exists and to direct further refinement of the study population before analysis. Rubin notes the importance of the PS in this regard because valid causal inference requires positivity, that is, restriction of the study population to the region of overlap in PS distributions between treatment groups (4). Several authors have provided evidence that greater restriction, through exclusion of subjects in the tails of the PS distribution where overlap between treatment groups is limited, can improve both the precision and the validity of effect estimates (13,23). Walker and colleagues further argue that valid comparative effectiveness research requires considerable patient and provider uncertainty regarding treatment choice, and they provide metrics based on the PS for evaluation of this uncertainty (24). Restrictions based on exclusion of potential subjects in the PS tails will reduce study size and affect representativeness. However, the potential to reduce the threat of bias by focusing on a more homogeneous population with treatment uncertainty will often outweigh this concern.
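In code, such restriction might look like the following sketch, which first imposes the region of common support and then trims the extreme tails; the percentile cutpoints are illustrative assumptions, not prescriptions from the cited work.

```python
# Sketch, continuing from the earlier data frame: restrict to the region of
# common PS support, then trim the extreme tails. The 2.5%/97.5% cutpoints
# are illustrative only.
lo = df.loc[df["treated"] == 1, "ps"].min()   # lowest PS among the treated
hi = df.loc[df["treated"] == 0, "ps"].max()   # highest PS among the untreated
support = df[(df["ps"] >= lo) & (df["ps"] <= hi)]

q_lo, q_hi = support["ps"].quantile([0.025, 0.975])
trimmed = support[(support["ps"] >= q_lo) & (support["ps"] <= q_hi)]
print(len(df), "subjects before restriction;", len(trimmed), "after")
```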
Finally, several of the important items included in the guidelines for PS analysis of Yao and colleagues could be considered part of the domain of study design. Specifically, the choice of caliper when matching on the PS is employed, and the balance of covariate distributions before and after matching or stratification, can be evaluated before any consideration of study outcomes.
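Both items can be checked without outcome data, as in the following sketch of greedy 1:1 caliper matching on the logit of the PS, followed by a standardized-difference balance check; the 0.2-standard-deviation caliper is a commonly used choice assumed here, not a requirement of the guidelines.

```python
# Sketch, continuing from the earlier data frame: greedy 1:1 matching on
# the logit of the PS within a caliper of 0.2 SD of that logit (a common
# choice), then standardized mean differences as a balance check that uses
# no outcome data.
import numpy as np

logit = np.log(df["ps"] / (1 - df["ps"]))
caliper = 0.2 * logit.std()

treated_ids = list(df.index[df["treated"] == 1])
control_ids = set(df.index[df["treated"] == 0])
pairs = []
for t in treated_ids:
    if not control_ids:
        break
    best = min(control_ids, key=lambda c: abs(logit[t] - logit[c]))
    if abs(logit[t] - logit[best]) <= caliper:
        pairs.append((t, best))
        control_ids.remove(best)       # match without replacement

matched = df.loc[[i for pair in pairs for i in pair]]

def std_diff(x1: pd.Series, x0: pd.Series) -> float:
    """Standardized mean difference of one covariate between two groups."""
    return (x1.mean() - x0.mean()) / np.sqrt((x1.var() + x0.var()) / 2)

for cov in ["age", "sex", "stage"]:
    d = std_diff(matched.loc[matched["treated"] == 1, cov],
                 matched.loc[matched["treated"] == 0, cov])
    print(f"{cov}: standardized difference after matching = {d:.3f}")
```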
Funding
Supported by National Institutes of Health grant AG023178.
Notes
The funder had no role in the writing of the editorial or the decision to submit it for publication. The author has received funding from grants to the Brigham and Women’s Hospital from AstraZeneca, Kowa, Novartis, and Pfizer Pharmaceuticals.
References
1. Yao XL, Wang X, Speicher PJ, et al. Reporting and guidelines in propensity score analysis: A systematic review of cancer and cancer surgical studies. J Natl Cancer Inst. 2017;109(8):djw323.
2. Longmore RB, Yeh RW, Kennedy KF, et al. Clinical referral patterns for carotid artery stenting versus carotid endarterectomy: Results from the Carotid Artery Revascularization and Endarterectomy Registry. Circ Cardiovasc Interv. 2011;4(1):88–94.
3. Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med. 1997;127(8 part 2):757–763.
4. Rubin DB. The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Stat Med. 2007;26(1):20–36.
5. Rubin DB. For objective causal inference, design trumps analysis. Ann Appl Stat. 2008;2(3):808–840.
6. Rubin DB. On the limitations of comparative effectiveness research. Stat Med. 2010;29(19):1991–1995.
7. Glynn RJ, Schneeweiss S, Stürmer T. Indications for propensity scores and review of their use in pharmacoepidemiology. Basic Clin Pharmacol Toxicol. 2006;98(3):253–259.
8. Stürmer T, Joshi M, Glynn RJ, et al. A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. J Clin Epidemiol. 2006;59(5):437–447.
9. Glynn RJ, Knight EL, Levin R, Avorn J. Paradoxical relations of drug treatment with mortality in older persons. Epidemiology. 2001;12(6):682–689.
10. Glynn RJ, Schneeweiss S, Wang PS, Levin R, Avorn J. Selective prescribing led to overestimation of the benefits of lipid-lowering drugs. J Clin Epidemiol. 2006;59(8):819–828.
11. Schneeweiss S, Patrick AR, Stürmer T, et al. Increasing levels of restriction in pharmacoepidemiologic database studies of elderly and comparison with randomized trial results. Med Care. 2007;45(10 suppl 2):S131–S142.
12. Lund JL, Richardson DB, Stürmer T. The active comparator, new user study design in pharmacoepidemiology: Historical foundations and contemporary application. Curr Epidemiol Rep. 2015;2(4):221–228.
13. Kurth T, Walker AM, Glynn RJ, et al. Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. Am J Epidemiol. 2006;163(3):262–270.
14. Lunt M, Solomon D, Rothman K, et al. Different methods of balancing covariates leading to different effect estimates in the presence of effect modification. Am J Epidemiol. 2009;169(7):909–917.
15. Stürmer T, Rothman KJ, Avorn J, Glynn RJ. Treatment effects in the presence of unmeasured confounding: Dealing with observations in the tails of the propensity score distribution—a simulation study. Am J Epidemiol. 2010;172(7):843–854.
16. Kramer MS, Lane DA, Hutchinson TA. Analgesic use, blood dyscrasias, and case-control pharmacoepidemiology: A critique of the International Agranulocytosis and Aplastic Anemia Study. J Chronic Dis. 1987;40(12):1073–1085.
17. Ray WA. Evaluating medication effects outside of clinical trials: New-user designs. Am J Epidemiol. 2003;158(9):915–920.
18. Suissa S. Immortal time bias in pharmaco-epidemiology. Am J Epidemiol. 2008;167(4):492–499.
19. Suissa S, Azoulay L. Metformin and the risk of cancer: Time-related biases in observational studies. Diabetes Care. 2012;35(12):2665–2673.
20. Rosenbaum PR. Propensity score. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics. 2nd ed. Chichester, UK: John Wiley and Sons; 2005:4267–4272.
21. Drake C. Effects of misspecification of the propensity score on estimators of treatment effect. Biometrics. 1993;49(4):1231–1236.
22. Redelmeier DA, Tan SH, Booth GL. The treatment of unrelated disorders in patients with chronic medical diseases. N Engl J Med. 1998;338(21):1516–1520.
23. Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Dealing with limited overlap in estimation of average treatment effects. Biometrika. 2009;96(1):187–199.
24. Walker AM, Patrick AR, Lauer MS, et al. A tool for assessing the feasibility of comparative effectiveness research. Comp Effect Res. 2013;3:11–20.
