Bertsimas and co-authors1 should be commended for the development of the Predictive Optimal Trees in Emergency Surgery Risk (POTTER) tool and its most recent application in emergency general surgery.2 The use of a machine learning-based risk prediction tool as the basis for this study opens the door to a timely discussion about the role of machine learning in supporting surgical decision-making. Two critical concepts must be explored before POTTER and similar models are adopted in clinical settings: the strengths and limitations of machine learning methods, including mechanisms to fairly assess their performance; and the importance of including hospital or surgeon factors when modeling.
Machine learning-based tools can outperform traditional modeling methods, such as logistic regression.3–5 However, machine learning methods have limitations that can affect their utility in clinical use. For example, they often cannot provide principled measures of uncertainty around their predictions. Unlike regression, many machine learning methods do not rely on statistical models to generate predictions. As a consequence, these machine learning-based predictions lack CIs, SEs, or other interpretable measures of uncertainty. This makes it difficult for surgeons to put the predictions into context for patients who need more than percentages to inform decisions.
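To illustrate what an interpretable measure of uncertainty could look like for a tree-based prediction, the sketch below attaches a Wilson score interval to the event rate in a single leaf of a decision tree. This is our own simplified illustration, not a feature of POTTER: the leaf counts are hypothetical, and the interval treats the tree structure as fixed, so it ignores the additional uncertainty that comes from choosing the splits.

```python
import math

def wilson_interval(events, n, z=1.96):
    """Wilson score interval (about 95% for z=1.96) for the proportion events/n."""
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("need n > 0 and 0 <= events <= n")
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Hypothetical leaves: both report a 10% risk, but from very different sample sizes.
small_leaf = wilson_interval(events=2, n=20)
large_leaf = wilson_interval(events=200, n=2000)

for name, (low, high) in [("n=20", small_leaf), ("n=2000", large_leaf)]:
    print(f"{name}: predicted risk 10%, interval {low:.1%} to {high:.1%}")
```

Both leaves would present the same 10% risk to a surgeon, yet the interval for the smaller leaf is roughly ten times wider, which is the context a bare percentage cannot convey.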
In this case, the POTTER model relies on decision trees, a classical machine learning tool that does not come with uncertainty quantification, as evidenced by Tables 2 and 3.2 In this context, SEs or other measures of uncertainty would help surgeons, for instance, to distinguish a purely numerical improvement from one that is likely to be clinically relevant. In addition, POTTER has only been evaluated on datasets of real patients. Evaluating POTTER on datasets of simulated patients would also be beneficial. Evaluating machine learning methods on simulated data, where uncertainty can be controlled and modulated according to a specific design, helps to form expectations about model performance in an idealized setting that mirrors the target patient population in important ways. In particular, simulations allow developers to probe and identify clinical scenarios or patient cohorts in which the model can be expected to perform poorly.
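A minimal sketch of such a simulation follows. Everything in it is an assumption made for illustration: the data-generating design (`true_risk`), the frailty effect, and the deliberately incomplete `model_under_test` are hypothetical and have no connection to POTTER or to any real cohort. Because the true risk is set by design, a cohort in which the model performs poorly can be identified directly.

```python
import random

def baseline_risk(age):
    # Hypothetical age-only risk used by both the design and the model under test.
    return 0.02 + 0.002 * max(age - 50, 0)

def true_risk(age, frail):
    # Hypothetical design: frailty triples the risk of the adverse event.
    return min(0.95, baseline_risk(age) * (3.0 if frail else 1.0))

def model_under_test(age, frail):
    # Hypothetical model that ignores frailty entirely.
    return baseline_risk(age)

def simulate(n, seed=0):
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        age = rng.randint(18, 90)
        frail = rng.random() < (0.4 if age >= 70 else 0.05)
        event = rng.random() < true_risk(age, frail)
        patients.append((age, frail, event))
    return patients

def cohort_report(patients):
    """Observed event rate versus mean predicted risk, by frailty cohort."""
    totals = {}
    for age, frail, event in patients:
        key = "frail" if frail else "not frail"
        obs, pred, n = totals.get(key, (0, 0.0, 0))
        totals[key] = (obs + int(event), pred + model_under_test(age, frail), n + 1)
    return {k: {"n": n, "observed": obs / n, "predicted": pred / n}
            for k, (obs, pred, n) in totals.items()}

patients = simulate(20000)
report = cohort_report(patients)
for cohort, row in report.items():
    print(f"{cohort}: n={row['n']}, observed={row['observed']:.3f}, "
          f"predicted={row['predicted']:.3f}")
```

In this design the model tracks the observed event rate closely among patients who are not frail and underestimates it substantially among frail patients, a gap that could be masked in an evaluation that reports only overall performance.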
When further examining the POTTER tool, we noted that the model lacks information on the site of care. Although this is not unique to POTTER, the inability to factor in hospital-level variation in outcomes of emergency general surgery patients limits the accuracy of the predictions for specific patients who are treated in specific locations. Variation in hospital performance for older emergency general surgery patients has been demonstrated with a 2-fold difference in the relative risk of death and complications across hospitals.6,7 Among operative patients, standardized mortality rates have been found to vary by 3-fold across hospitals.8 The variation in outcomes is likely attributable to differences in the structures and processes of care within the hospitals.8,9 Using predictive analytics to counsel patients without adjustment for surgeon- or center-specific deviations in performance is akin to forecasting snow without knowledge of the temperature. At best, the predictions can give a general sense of what to expect from treatment. At worst, the predictions can grossly underestimate the risk and result in care that is discordant with the patient’s goals.
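The effect of ignoring the site of care can be made concrete with observed-to-expected (O/E) ratios, sketched below. The records and hospital labels are hypothetical and chosen only to reproduce a 3-fold spread; the function is an illustration of the general idea, not an analysis performed in the cited studies.

```python
from collections import defaultdict

def observed_to_expected(records):
    """records: iterable of (hospital_id, predicted_risk, died) tuples.

    Returns, per hospital, observed deaths divided by the sum of predicted risks.
    """
    observed = defaultdict(int)
    expected = defaultdict(float)
    for hospital, risk, died in records:
        observed[hospital] += int(died)
        expected[hospital] += risk
    return {h: observed[h] / expected[h] for h in expected if expected[h] > 0}

# Hypothetical data: every patient receives the same pooled prediction of 10%,
# but hospital A records 5 deaths per 100 patients and hospital B records 15.
records = (
    [("A", 0.10, i < 5) for i in range(100)]
    + [("B", 0.10, i < 15) for i in range(100)]
)
ratios = observed_to_expected(records)
for hospital, ratio in sorted(ratios.items()):
    print(f"Hospital {hospital}: O/E = {ratio:.2f}")
```

A patient counseled with the pooled 10% estimate would be given a risk that is twice the observed rate at hospital A and two-thirds of the observed rate at hospital B.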
Another important limitation of the POTTER tool is that only a limited set of performance measures was applied to the model predictions. The authors chose not to predict actual events; instead, they provide only a risk prediction in percentage terms.2 For rarer events, such as mortality, more nuanced performance metrics are needed to understand the tradeoffs between true-positive and true-negative predictions.10 In many settings like this one, we want the risk model to minimize false negatives, that is, cases in which a patient has an adverse event but the model erroneously flags the patient as not having one. This becomes more difficult when adverse events are rare. Measures of discrimination, such as the C-statistic, provide little information about the risk model’s performance in terms of false negatives. As such, these metrics can overstate perceived model performance. Without additional metrics, such as the false-negative rate, we cannot fully understand the model’s performance in this setting.
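The gap between the C-statistic and the false-negative rate can be seen in a small simulated example. The sketch below is an assumption-laden illustration: the risk distribution, the sample size, and the decision thresholds are all hypothetical, and the "model" is given the true risk of each patient, which is the most favorable case possible.

```python
import math
import random

def c_statistic(risks, outcomes):
    """Share of (event, non-event) pairs in which the event has the higher
    predicted risk; ties count as one half."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    score = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / (len(events) * len(non_events))

def false_negative_rate(risks, outcomes, threshold):
    """Share of patients with an event whose predicted risk is below threshold."""
    event_risks = [r for r, y in zip(risks, outcomes) if y == 1]
    return sum(1 for r in event_risks if r < threshold) / len(event_risks)

# Hypothetical rare-event cohort; predictions equal the true risks by design.
rng = random.Random(0)
risks, outcomes = [], []
for _ in range(5000):
    risk = 1 / (1 + math.exp(-rng.gauss(-4.5, 1.5)))
    risks.append(risk)
    outcomes.append(1 if rng.random() < risk else 0)

c_stat = c_statistic(risks, outcomes)
fnr_by_threshold = {t: false_negative_rate(risks, outcomes, t)
                    for t in (0.05, 0.10, 0.50)}
print(f"C-statistic: {c_stat:.2f}")
for t, fnr in fnr_by_threshold.items():
    print(f"flag if risk >= {t:.0%}: false-negative rate {fnr:.0%}")
```

Even with perfectly calibrated predictions and a C-statistic that would usually be reported as good, most patients who go on to have the event fall below a 50% threshold. The false-negative rate depends heavily on the threshold chosen, which the C-statistic alone does not reveal.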
POTTER represents a nice example of an early machine learning algorithm designed for use in surgery. The use of machine learning methods in surgery to guide advances and improve patient care will only become more sophisticated and common. As these algorithms gain in popularity, surgeons must be made aware of their strengths and weaknesses to evaluate when and how to use them in the context of clinical care. At some point in the not too distant future, these algorithms will be able to augment clinical judgment about diagnoses, appropriate treatment options, and optimal hospital selection. However, it will be a while before these algorithms are able to independently care for our patients. Therefore, just as we have mastered other new technologies (eg, laparoscopic and robotic-assisted surgical procedures) through the decades, surgeons must also learn how to safely and responsibly interpret and apply these new clinical tools when caring for our patients.
Footnotes
Disclosure Information: Nothing to disclose.
REFERENCES
- 1. Bertsimas D, Dunn J, Velmahos GC, Kaafarani HMA. Surgical risk is not linear: derivation and validation of a novel, user-friendly, and machine-learning-based Predictive OpTimal Trees in Emergency Surgery Risk (POTTER) calculator. Ann Surg 2018;268:574–583.
- 2. El Hechi MW, Maurer LR, Levine J, et al. Validation of the AI-based Predictive Optimal Trees in Emergency Surgery Risk (POTTER) calculator in emergency general surgery and emergency laparotomy patients. J Am Coll Surg 2021;232:912–919.
- 3. Goto T, Camargo CA Jr, Faridi MK, et al. Machine learning-based prediction of clinical outcomes for children during emergency department triage. JAMA Netw Open 2019;2:e186937.
- 4. Wong A, Young AT, Liang AS, et al. Development and validation of an electronic health record-based machine learning model to estimate delirium risk in newly hospitalized patients without known cognitive impairment. JAMA Netw Open 2018;1:e181018.
- 5. Elfiky AA, Pany MJ, Parikh RB, Obermeyer Z. Development and application of a machine learning approach to assess short-term mortality risk among patients with cancer starting chemotherapy. JAMA Netw Open 2018;1:e180926.
- 6. Ingraham AM, Cohen ME, Raval MV, et al. Variation in quality of care after emergency general surgery procedures in the elderly. J Am Coll Surg 2011;212:1039–1048.
- 7. Smith M, Hussain A, Xiao J, et al. The importance of improving the quality of emergency surgery for a regional quality collaborative. Ann Surg 2013;257:596–602.
- 8. Becher RD, DeWane MP, Sukumar N, et al. Evaluating mortality outlier hospitals to improve the quality of care in emergency general surgery. J Trauma Acute Care Surg 2019;87:297–306.
- 9. Ozdemir BA, Sinha S, Karthikesalingam A, et al. Mortality of emergency general surgical patients and associations with hospital structures and processes. Br J Anaesth 2016;116:54–62.
- 10. Pencina MJ, D’Agostino RB Sr. Evaluating discrimination of risk prediction models: the C statistic. JAMA 2015;314:1063–1064.
