Author manuscript; available in PMC 2026 Feb 11.
Published in final edited form as: J Med Syst. 2024 Sep 21;48(1):91. doi: 10.1007/s10916-024-02106-7

Unveiling the Future of Postoperative Outcomes Prediction: The Role of Machine Learning and Trust in Healthcare

Ira S Hofer, David B Wax
PMCID: PMC12890313  NIHMSID: NIHMS2129894  PMID: 39304559

Introduction

Machine learning (ML) has swiftly emerged as a transformative force in healthcare, particularly in predicting postoperative outcomes. Since the first paper in the anesthesiology literature by Lee et al. in 2018 [1], dozens of papers have been published leveraging electronic health record data to predict a variety of perioperative outcomes, such as mortality [2, 3], acute kidney injury (AKI) [2, 4, 5, 6], readmission [7, 8, 9], and others. However, despite the volume of papers describing predictive models, few have been implemented into clinical care. In this issue, Barker et al. provide a model to predict unplanned care escalations after surgery. Their manuscript is an excellent example of both the promise of predictive models and the gaps that must be bridged to deploy such models in clinical care.

Understanding Machine Learning in Healthcare

At its core, machine learning is a subset of artificial intelligence (AI) that enables systems to learn from data and make predictions or decisions without being explicitly programmed. ML models are specifically designed to be applied to new data (i.e., patients) that have not yet been seen in order to predict the future. In contrast, traditional statistical models are developed to best describe the source dataset; they are not intended to be extrapolated to new situations, though they are often used “off-label” in that way.

Machine learning models generally operate through a series of steps: data collection, data preprocessing, model training, and evaluation. Initially, a large volume of data is collected, often comprising electronic health records (EHRs), imaging data, or patient demographics. These data are then preprocessed to handle missing values, normalize features, and address any biases. Following preprocessing, the data are divided into training and testing subsets. During the training phase, the model learns to associate input features with outcomes through algorithms such as regression analysis, decision trees, or neural networks. Finally, the model’s performance is evaluated using metrics like accuracy, precision, recall, and area under the receiver operating characteristic curve (AUC-ROC). Together, these metrics estimate how well the model will perform when applied to new patients.
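The pipeline above can be sketched in a few lines. This is a minimal illustration using scikit-learn with synthetic data as a stand-in for real EHR records; the sample size, feature count, and choice of a gradient-boosted classifier are all arbitrary assumptions for demonstration, not a recipe from any of the papers discussed.

```python
# Sketch of the generic ML workflow: collect -> preprocess -> split -> train -> evaluate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for preprocessed EHR data: 1,000 "patients", 10 features.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=0)

# Hold out a test set so evaluation reflects patients the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Discrimination on held-out data, the metric most often reported in this literature.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AUC-ROC: {auc:.3f}")
```

The essential discipline is that the AUC-ROC is computed only on the held-out test set, which is what makes it an estimate of performance on future patients rather than a description of the training data.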

Previous ML Models in Postoperative Outcome Prediction

A growing body of literature has examined ML models for predicting postoperative outcomes. Overall, researchers have tried a variety of methods, including neural networks, logistic regression, and tree-based models such as random forests and gradient-boosted trees. By and large, the tree-based models seem to perform best, generally achieving AUC-ROCs in the 0.75–0.90 range.

Less commonly reported, but equally important, are the area under the precision-recall curve (AUC-PR), which is more informative than the AUC-ROC when the outcome is rare, and threshold-dependent measures such as precision (also called positive predictive value), negative predictive value, and sensitivity. These latter metrics allow readers to understand how the model will perform on individual patients, as opposed to the cohort as a whole.
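To make these metrics concrete, the sketch below computes the AUC-PR and the threshold-dependent measures for a handful of hypothetical patients; the outcomes and predicted probabilities are invented for illustration, as is the 0.5 cutoff.

```python
import numpy as np
from sklearn.metrics import average_precision_score, confusion_matrix

# Invented predicted risks and true outcomes for 8 hypothetical patients.
y_true  = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.7, 0.6, 0.8, 0.9])

# Summary of the precision-recall curve across all thresholds.
auc_pr = average_precision_score(y_true, y_score)

# Threshold-dependent metrics at a chosen cutoff of 0.5.
y_pred = (y_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ppv         = tp / (tp + fp)   # precision / positive predictive value
npv         = tn / (tn + fn)   # negative predictive value
sensitivity = tp / (tp + fn)   # recall
print(f"AUC-PR {auc_pr:.3f}  PPV {ppv:.2f}  NPV {npv:.2f}  Sens {sensitivity:.2f}")
```

Reporting PPV, NPV, and sensitivity at the threshold actually proposed for deployment tells a clinician what a positive or negative flag would mean for the patient in front of them.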

SHAP Scores and Feature Importance

To interpret ML models effectively, it is crucial to understand which features drive the predictions. SHAP (SHapley Additive exPlanations) scores are a valuable tool for this purpose. Developed from cooperative game theory, SHAP values decompose a prediction into contributions from individual features, providing insight into their importance.

For instance, in a model predicting postoperative infection risk, SHAP scores might reveal that preoperative glucose levels and the patient’s age are the most influential factors. This transparency allows clinicians to focus on key variables and integrate ML insights into their decision-making processes. By highlighting feature importance, SHAP scores facilitate the development of more interpretable and actionable models, ultimately enhancing their clinical utility [10].
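The game-theoretic idea is easiest to see on a toy model. The sketch below computes exact Shapley values by enumerating feature coalitions for a hypothetical additive risk score; the feature names, weights, and baseline values are invented, and a real analysis would use a library such as the shap package [10], which approximates this enumeration efficiently for large models.

```python
from itertools import combinations
from math import factorial

# Hypothetical additive risk model with three features.
FEATURES = ["glucose", "age", "asa_class"]
WEIGHTS  = {"glucose": 0.02, "age": 0.01, "asa_class": 0.5}
BASELINE = {"glucose": 100.0, "age": 50.0, "asa_class": 2.0}  # "average" patient

def risk(values):
    # Features absent from a coalition are replaced by their baseline value.
    return sum(WEIGHTS[f] * values.get(f, BASELINE[f]) for f in FEATURES)

def shapley(patient):
    # Exact Shapley values: average each feature's marginal contribution
    # over all coalitions of the remaining features.
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                present = {g: patient[g] for g in coalition}
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (risk({**present, f: patient[f]}) - risk(present))
        phi[f] = total
    return phi

patient = {"glucose": 180.0, "age": 70.0, "asa_class": 3.0}
phi = shapley(patient)
print(phi)
```

Because the toy model is additive, each Shapley value reduces to the feature's weight times its deviation from baseline, and the values sum to the difference between this patient's prediction and the baseline prediction, which is the decomposition property that makes SHAP plots interpretable.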

Challenges and the Need for External Validation

Despite promising results, ML models are not without limitations. One significant challenge is the need for external validation. Models developed on one dataset may not generalize well to different populations or clinical settings. For example, a model trained on data from a single hospital may perform poorly when applied to a different institution with different patient demographics or treatment protocols.

External validation involves testing the model on independent datasets to ensure its robustness and generalizability. Without such validation, there is a risk of overfitting—where the model performs well on training data but poorly on new data. To address this, researchers advocate for rigorous validation strategies, including cross-validation and independent cohort testing, to confirm the model’s predictive accuracy across diverse settings.
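Internal validation strategies such as cross-validation are straightforward to implement; the sketch below uses synthetic data and an arbitrary model choice to show five-fold cross-validation with scikit-learn. Note that this only estimates within-population performance, and true external validation would still require an independent cohort from a different institution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in cohort: 500 "patients", 8 features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each fold is held out once, approximating performance on unseen patients
# drawn from the same population.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("fold AUCs:", scores.round(3), "mean:", scores.mean().round(3))
```

Large variation across folds is itself a warning sign: a model whose AUC swings widely within its own source population is unlikely to transport cleanly to a new institution.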

Trust and Acceptance of ML in Healthcare

The integration of ML into healthcare presents a double-edged sword. On one hand, ML models offer the potential to enhance clinical decision-making and improve patient outcomes. On the other hand, there is a pressing need to build trust among healthcare professionals and patients.

Trust in ML systems hinges on several factors: model transparency, interpretability, and the demonstrated value of predictions. Clinicians need to understand how a model arrives at its recommendations and ensure that it aligns with their clinical judgment. Transparent models, such as those employing SHAP scores, contribute to this trust by making the decision-making process more understandable.

Furthermore, the effectiveness of ML models must be demonstrated through real-world applications. The adoption of ML technologies is likely to increase if these models consistently deliver accurate predictions and translate into tangible improvements in patient care. Collaboration between data scientists, clinicians, and policymakers is essential to facilitate the development of trustworthy ML systems and address ethical and practical concerns related to their use [11].

A Proposed Model

The paper by Barker et al. in this edition of the journal is an excellent example of both what is done well in ML research and the gaps that remain before implementation [12]. First, the authors are to be commended for identifying a clear clinical problem (unplanned upgrades of care) and producing a solidly performing model to predict it. They are also to be commended for using SHAP scores to provide visibility into the features driving model performance. Of note, many of these factors are highly correlated with patient complexity and illness (e.g., anesthesia time, Aldrete score) or are direct markers of illness (e.g., tachycardia and hypotension).

However, this model is also not ready for deployment. While the authors report an AUC-ROC of 0.752, they do not report the AUC-PR, which is critical for understanding how the model performs at the individual, as opposed to the global, level. Because most patients are not “average,” this metric will be important for evaluating performance before implementation. Similarly missing are values for sensitivity and precision at various thresholds. Additionally, the model lacks external validation and evaluation across subgroups (surgical procedures, demographic groups, etc.), making it impossible to assess or mitigate bias. Thus, the manuscript provides a good baseline, but work remains before the model can be implemented into care.

Another noteworthy aspect of their model is that three of the top influential factors were vital signs that were “missing data”. Thus, a patient without a blood pressure or pulse oximetry measurement recorded in the EHR would be stratified at higher risk of post-discharge deterioration. It does not matter to the model whether the blood pressure was unmeasurable, not measured, or measured but not recorded. However, to a human anesthesiologist supervising the recovery room, these are important distinctions and so a missing value would probably prompt an investigation prior to making a disposition decision. This raises the aforementioned issue of how much such a model should be trusted and relied upon for decision making/support.
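A minimal sketch of how such missingness typically enters a model during preprocessing, using invented vitals for four hypothetical patients: an indicator column records that a value is absent, but the reason it is absent is invisible to the model.

```python
import numpy as np
import pandas as pd

# Hypothetical recovery-room vitals; NaN marks a value absent from the EHR,
# whatever the underlying reason.
vitals = pd.DataFrame({
    "sbp":  [118.0, np.nan, 142.0, np.nan],
    "spo2": [97.0, 99.0, np.nan, np.nan],
})

# Common preprocessing: add an explicit missingness indicator per vital,
# then impute the median so the model can train. The indicator is identical
# whether the value was unmeasurable, not measured, or measured but not
# recorded -- the distinction a clinician would investigate is lost.
for col in ["sbp", "spo2"]:
    vitals[f"{col}_missing"] = vitals[col].isna().astype(int)
    vitals[col] = vitals[col].fillna(vitals[col].median())

print(vitals)
```

If the indicator columns carry high feature importance, as they did here, the model is partly predicting from documentation patterns rather than physiology, which is exactly the situation where a human check before acting on the prediction is warranted.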

Conclusion

Machine learning holds immense potential to revolutionize the prediction of postoperative outcomes, offering more precise and personalized risk assessments. While existing models have shown promising results, their successful integration into clinical practice requires careful consideration of feature importance, external validation, and the establishment of trust. By addressing these challenges and fostering collaboration, the healthcare community can harness the power of ML to enhance patient care and surgical outcomes.

Funding

No external funding was received for this work.

Footnotes

Conflicts of Interest

There are no conflicts of interest to report.

Institutional Review Board Approval

Institutional Review Board approval was not required as this is not human subjects research.


Data Availability

No datasets were generated or analysed during the current study.

References

1. Lee CK, Hofer I, Gabel E, Baldi P, Cannesson M. Development and Validation of a Deep Neural Network Model for Prediction of Postoperative In-hospital Mortality. Anesthesiology. 2018;129:649–662.
2. Bihorac A, Ozrazgat-Baslanti T, Ebadi A, Motaei A, Madkour M, Pardalos PM, Lipori G, Hogan WR, Efron PA, Moore F, Moldawer LL, Wang DZ, Hobson CE, Rashidi P, Li X, Momcilovic P. MySurgeryRisk: Development and Validation of a Machine-learning Risk Algorithm for Major Complications and Death After Surgery. Ann Surg. 2019;269:652–662.
3. Hill BL, Brown R, Gabel E, Rakocz N, Lee C, Cannesson M, Baldi P, Olde Loohuis L, Johnson R, Jew B, Maoz U, Mahajan A, Sankararaman S, Hofer I, Halperin E. An automated machine learning-based model predicts postoperative mortality using readily-extractable preoperative electronic health record data. Br J Anaesth. 2019;123:877–886.
4. Gameiro J, Branco T, Lopes JA. Artificial Intelligence in Acute Kidney Injury Risk Prediction. J Clin Med. 2020;9.
5. Hofer IS, Lee C, Gabel E, Baldi P, Cannesson M. Development and validation of a deep neural network model to predict postoperative mortality, acute kidney injury, and reintubation using a single feature set. NPJ Digit Med. 2020;3:58.
6. Tseng PY, Chen YT, Wang CH, Chiu KM, Peng YS, Hsu SP, Chen KL, Yang CY, Lee OK. Prediction of the development of acute kidney injury following cardiac surgery by machine learning. Crit Care. 2020;24:478.
7. Frizzell JD, Liang L, Schulte PJ, Yancy CW, Heidenreich PA, Hernandez AF, Bhatt DL, Fonarow GC, Laskey WK. Prediction of 30-Day All-Cause Readmissions in Patients Hospitalized for Heart Failure: Comparison of Machine Learning and Other Statistical Approaches. JAMA Cardiol. 2017;2:204–209.
8. Misic VV, Gabel E, Hofer I, Rajaram K, Mahajan A. Machine Learning Prediction of Postoperative Emergency Department Hospital Readmission. Anesthesiology. 2020;132:968–980.
9. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu X, Marcus J, Sun M, Sundberg P, Yee H, Zhang K, Zhang Y, Flores G, Duggan GE, Irvine J, Le Q, Litsch K, Mossin A, Tansuwan J, Wang D, Wexler J, Wilson J, Ludwig D, Volchenboum SL, Chou K, Pearson M, Madabushi S, Shah NH, Butte AJ, Howell MD, Cui C, Corrado GS, Dean J. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1:18.
10. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, CA: Curran Associates Inc.; 2017:4768–4777.
11. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366:447–453.
12. Barker AB, Melvin RL, Godwin RC, Benz D, Wagener BM. Machine Learning Predicts Unplanned Care Escalations for Post-Anesthesia Care Unit Patients during the Perioperative Period: A Single-Center Retrospective Study. J Med Syst. 2024;48:69.
