Rand Health Quarterly. 2016 May 9;5(4):1.

A Methodological Critique of the ProPublica Surgeon Scorecard

Mark W Friedberg, Peter J Pronovost, David M Shahian, Dana Gelb Safran, Karl Y Bilimoria, Marc N Elliott, Cheryl L Damberg, Justin B Dimick, Alan M Zaslavsky
PMCID: PMC5158216  PMID: 28083411

Abstract

On July 14, 2015, ProPublica published its Surgeon Scorecard, which displays “Adjusted Complication Rates” for individual, named surgeons for eight surgical procedures performed in hospitals. Public reports of provider performance have the potential to improve the quality of health care that patients receive. A valid performance report can drive quality improvement and usefully inform patients' choices of providers. However, performance reports with poor validity and reliability are potentially damaging to all involved. This article critiques the methods underlying the Scorecard and identifies opportunities for improvement. Until these opportunities are addressed, the authors advise users of the Scorecard—most notably, patients who might be choosing their surgeons—not to consider the Scorecard a valid or reliable predictor of the health outcomes any individual surgeon is likely to provide. The authors hope that this methodological critique will contribute to the development of more-valid and more-reliable performance reports in the future.


On July 14, 2015, ProPublica published its Surgeon Scorecard (Wei, Pierce, and Allen, 2015), an online tool that displays “Adjusted Complication Rates” for individual, named surgeons for eight surgical procedures performed in hospitals.

Public reports of provider performance (or, performance reports) have the potential to improve the quality of health care that patients receive. Valid performance reports (i.e., reports that truly measure what they are advertised as measuring) can stimulate providers to make quality improvements and can help patients make better selections when choosing among health care providers. However, performance reports with poor measurement validity and reliability are potentially damaging to all involved. Therefore, it is important to critically examine the methods used to produce any performance report.

Measuring provider performance is challenging, but methods exist that can help ensure that performance reports are valid and display true differences in performance. This methodological critique of the ProPublica Surgeon Scorecard has three goals: to explain methodological issues in the Scorecard, to suggest ways in which the Scorecard can be improved, and to inform the public about these aspects of the Scorecard. An overview of our conclusions with respect to the first two goals follows. The third—to inform the public—exists because the Scorecard is currently available to the public, and, based on our critique, we hope patients who are choosing a surgeon will be better able to decide how much weight to give the data presented in the Scorecard.

Methodological Issues in the Scorecard:

  • The “Adjusted Complication Rates” reported in the Scorecard are not actually complication rates. Instead, the “Adjusted Complication Rate” is a combination of hospital readmissions for conditions plausibly related to surgery (93 percent of events) and deaths (approximately 7 percent of events) within 30 days. However, most serious complications occur during the index admission, and many complications occur within 30 days post-discharge but without a readmission, or occur beyond the 30-day period. Other than death, none of these complications—many of which represent the most significant surgical risks and greatest detriment to patient long-term quality of life (such as urinary incontinence or erectile dysfunction following radical prostatectomy)—is included in the Scorecard. Most importantly, failure to include complications occurring during the index hospitalization is an unprecedented and untested departure from usual practices in measuring surgical complications, and one that undoubtedly results in a large proportion of serious surgical complications being excluded from the ProPublica measure.

  • As currently constructed, the Scorecard masks hospital-to-hospital performance differences, thereby invalidating comparisons between surgeons in different hospitals. By setting the hospital random effects equal to 0 in calculating the “Adjusted Complication Rates,” the ProPublica Surgeon Scorecard masks hospital-to-hospital variation that actually is present (according to ProPublica's models), thereby misleading patients in a systematic, albeit unintended, fashion. Put another way, the current Scorecard methodology ignores any hospital-level performance variation that would reflect (a) critical aspects of care that are intrinsic to a hospital, such as the adequacy of anesthesia staff, nursing, infection control procedures, or equipment, or (b) systematic recruitment of surgeons with superior (or inferior) skills by hospitals.

  • The accuracy of the assignment of performance data to the correct surgeon in the ProPublica Surgeon Scorecard is questionable. Claims data, which form the basis of the Scorecard, are notoriously inaccurate in individual provider assignments, and the Scorecard, as originally published, included case assignments to nonsurgeons and to surgeons of the wrong subspecialty. There is reason to suspect that these readily detectable misattributions are symptoms of more pervasive misattributions of surgeries to individual surgeons, and that these errors are still present in the Scorecard.

  • The adequacy of the Scorecard's case-mix adjustment is questionable. The aggregate patient “Health Score,” which reflects ProPublica's overall estimate of inherent patient risk (and the only such estimate for three of the eight reported surgical procedures), has a coefficient estimate of 0, meaning this patient risk score has no effect in ProPublica's risk-adjustment models. A likely explanation is that ProPublica's case-mix adjustment method fails to capture important patient risk factors. None of ProPublica's methods accounts for the risk factors present in more-detailed surgical risk models derived from clinical data.

  • The Scorecard appears to have poor measurement reliability (i.e., it randomly misclassifies the performance of many surgeons). Measurement reliability, which assesses the ability to distinguish true differences in performance between providers, is a key determinant of provider performance misrepresentation due to chance and should be calculated for all performance reports. Calculating reliability is particularly critical when measuring the performance of individual providers, where the number of cases used to rate a provider's performance can be quite small. Based on the width of the confidence intervals presented in the Scorecard, measurement reliability appears to be quite low for the vast majority of surgeons, with random misclassification rates between the Scorecard's implied risk classes (low, medium, and high “Adjusted Complication Rates”) approaching 50 percent for some surgeons.
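The consequence of setting hospital random effects to 0 can be illustrated with a toy two-level logistic risk model. Everything below is an illustrative assumption (a generic nested-effects model with made-up effect sizes), not ProPublica's actual specification:

```python
import math

# Toy two-level risk model: logit(p) = intercept + hospital_effect + surgeon_effect.
# All effect sizes here are illustrative assumptions.
intercept = -3.0                               # baseline event rate of roughly 5%
hospital_effects = [-1.0, -0.5, 0.0, 0.5, 1.0] # five hospitals of genuinely varying quality
surgeon_effect = 0.0                           # five identical surgeons, one per hospital

def adjusted_rate(hospital, surgeon):
    """Predicted event probability from the logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + hospital + surgeon)))

# Keeping the hospital term: the five surgeons' rates spread out,
# reflecting the real differences between their hospitals.
with_hospital = [adjusted_rate(h, surgeon_effect) for h in hospital_effects]

# Setting the hospital term to 0: all five surgeons receive the same
# "adjusted" rate, and the hospital-to-hospital variation disappears.
without_hospital = [adjusted_rate(0.0, surgeon_effect) for _ in hospital_effects]

print([round(r, 3) for r in with_hospital])     # five different rates
print([round(r, 3) for r in without_hospital])  # five identical rates
```

Under this sketch, two surgeons with identical skill but practicing at hospitals of very different quality are presented to patients as indistinguishable, which is the masking described above.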

Ways to Improve the Scorecard:

  • Rename the “Adjusted Complication Rate” measures reported in the Scorecard. Using a name that is more indicative of what is actually being measured will reduce the risk that Scorecard users (e.g., patients and providers) will misinterpret data on display, for example, by believing that a surgeon with a relatively low “Adjusted Complication Rate” has a lower overall rate of complications (in-hospital, post-discharge, and long-term) for a given procedure than a surgeon with a higher “Adjusted Complication Rate.” In addition, ProPublica could attempt to perform scientifically credible validation of “Adjusted Complication Rates” as measures of true complication rates. If such efforts fail to validate the current measures, they might identify ways to improve these measures substantially.

  • Correct the statistical method for handling hospital contributions to the individual surgeon performance data presented in the Scorecard. Setting hospital random effects equal to 0 is a methodological decision with no good justification, and it should be corrected in the existing Scorecard.

  • Validate the assignment to individual surgeons of the surgeries, readmissions, and deaths that are counted in the “Adjusted Complication Rates.” A validation study, comparing claims-based surgeon assignments with those derived from medical records for a representative sample of surgeries, could determine the extent of misattributed events. An informed judgment about whether the rate of misattribution is acceptable for high-stakes public reporting could then be made.

  • Validate the case-mix adjustment methods used to generate “Adjusted Complication Rates” for each surgeon. Questions about the adequacy of risk adjustment could be addressed by methodologically rigorous validation, preferably using robust clinical registry data that exist for several of the ProPublica procedures. This exercise might also lead to the conclusion that stronger case-mix adjustment methods are needed to enable fair comparisons between providers.

  • Specify minimum acceptable thresholds for measurement reliability and abide by them. State-of-the-art performance reports require minimum reliability to be achieved before publicly reporting performance data, and such reports warn their users when reliability is low. Having no minimum measurement reliability criterion for performance reporting is a departure from best practice, and one that appears to impose a high risk of both misclassifying individual surgeons and misdirecting report users.

  • Eliminate the implicit categorization of surgeons as having low, medium, or high “Adjusted Complication Rates.” The distinctions between these categories lack inherent meaning and, as described above, appear to have exceedingly high random misclassification rates for many surgeons. As a consequence, the red exclamation points marking hospitals with one or more surgeons having high “Adjusted Complication Rates” also should be eliminated.
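One common way to quantify the measurement reliability discussed above is a signal-to-noise ratio: the variance of true performance differences between providers, divided by that variance plus sampling noise, where the noise shrinks with case volume. The variance values below are illustrative assumptions chosen only to show how low case volumes depress reliability; they are not estimates from the Scorecard's data:

```python
# Reliability ~ between-provider variance / (between + within/n).
# Variance values are illustrative assumptions, not Scorecard estimates.

def reliability(between_var, within_var, n_cases):
    """Signal-to-noise reliability of a provider's measured rate."""
    noise_var = within_var / n_cases  # sampling noise shrinks with volume
    return between_var / (between_var + noise_var)

p = 0.05                  # assumed event rate for a rare outcome
within = p * (1 - p)      # per-case binomial variance
between = 0.0005          # assumed true between-surgeon variance

# Reliability rises from roughly 0.21 at 25 cases to roughly 0.91 at 1,000 cases.
for n in (25, 50, 200, 1000):
    print(n, round(reliability(between, within, n), 2))
```

With rare outcomes and the small per-surgeon volumes typical of individual-level reporting, reliability in this sketch stays well below commonly proposed reporting thresholds, which is why minimum-reliability criteria matter.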

Conclusion

ProPublica's stated goals in producing the Surgeon Scorecard are laudable: “to provide patients, and the health care community, with reliable and actionable data points, at both the level of the surgeon and the hospital, in the form of a publicly available online searchable database” (Pierce and Allen, 2015). However, as with any performance report, the Scorecard's ability to achieve these goals is limited by the rigor of the methods and the adequacy of the underlying data. Our critique of the ProPublica Surgeon Scorecard has identified substantial opportunities for improvement. Until these opportunities are addressed, we would advise users of the Scorecard—most notably, patients who might be choosing their surgeons—not to consider the Scorecard a valid or reliable predictor of the health outcomes any individual surgeon is likely to provide.

It is important for patients to ask all prospective surgeons about the risks of poor surgical outcomes, for hospitals to monitor the quality of their staff members (including surgeons), and for providers to substantially improve their efforts to collect and share useful performance data publicly. We hope that publication of the ProPublica Surgeon Scorecard will contribute to these broader efforts, even though there is substantial reason to doubt the Scorecard's current usefulness as a source of information about individual surgeons' quality of care. We also hope that this critique of the methods underlying the ProPublica Surgeon Scorecard will contribute to the development of more-valid and more-reliable performance reports in the future.

Footnotes

The research described in this article was performed under the auspices of RAND Health.

References

  1. Pierce O, Allen M. Assessing surgeon-level risk of patient harm during elective surgery for public reporting (as of August 4, 2015). White paper. ProPublica, 2015. As of September 11, 2015: https://static.propublica.org/projects/patient-safety/methodology/surgeon-level-risk-methodology.pdf
  2. Wei S, Pierce O, Allen M. Surgeon Scorecard. Online tool. ProPublica, 2015. As of September 11, 2015: https://projects.ProPublica.org/surgeons/
