Given the substantial clinical and economic burden of postoperative complications and the current limitations in accurately predicting the risk of major adverse events through cause-and-effect principles [1], numerous studies have explored the potential of predictive analytics applications. The methodology used is varied and involves integrating clinical assessment elements with models of different mathematical complexity [2–6].
Among the set of artificial intelligence (AI) technologies, machine learning (ML) and related computational approaches (supervised, unsupervised, and reinforcement ML) can predict specific postoperative complications and events by dissecting large datasets that incorporate multiple variables, such as patient demographics, comorbidities, laboratory values, and surgical details. On the other hand, artificial neural networks and deep learning (DL) strategies can further enhance this predictive ability by processing high-dimensional data, including continuous time series of biosignals like heart rate, blood pressure, or oxygen saturation, collected during and after surgery [7, 8]. Interestingly, for multimodal DL modelling, natural language processing (NLP) and advanced transformer-derived large language models (LLMs) can be implemented to extract and analyze structured and unstructured clinical data, such as operative notes, discharge summaries, and radiology reports [9]. The ultimate goal of these approaches is to facilitate the detection of subtle trends for anticipating complications before they become clinically apparent, as well as for targeting interventions toward modifiable risk factors, and allocating pre- and postoperative resource use [6].
Developed data-driven models
Research provides valuable insights into postoperative risk stratification. Thanks to the availability of large-scale datasets for training, predictive models for risk calculation have been developed [6]. The MySurgeryRisk algorithm was designed and validated using records from more than 50,000 patients who underwent major inpatient surgery. The developers aimed to forecast key postoperative complications, including acute kidney injury (AKI), sepsis, venous thromboembolism, intensive care admission beyond 48 h, prolonged mechanical ventilation, wound complications, and neurological or cardiovascular events, as well as mortality within 24 months of surgery. The model reached an area under the curve (AUC) as high as 0.94, meaning that in 94% of randomly chosen pairs of patients, one who experienced the outcome and one who did not, the model correctly assigned a higher risk to the patient who experienced the outcome [10]. In a follow-up study, Brennan et al. [11] compared the usability and accuracy of the tool with physicians’ clinical judgment. Their results showed that the calculator outperformed clinicians’ initial risk estimates for nearly all postoperative complications (AUC 0.85 vs 0.69). Moreover, physicians’ assessments improved significantly after incorporating feedback from the ML model. Another research team developed the Predictive opTimal Trees in Emergency Surgery (POTTER) calculator. It is based on a decision-tree supervised ML algorithm trained on data from 382,960 patients in the National Surgical Quality Improvement Program (NSQIP) database. POTTER demonstrated strong discriminative ability for both morbidity (AUC 0.84) and mortality (AUC 0.92), surpassing the predictive performance of the ASA calculator, the Emergency Surgery Score, and the NSQIP risk tool [12]. A subsequent analysis conducted in the emergency setting showed that the calculator reliably predicted mortality and morbidity, particularly sepsis, respiratory failure, and AKI [13].
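The pairwise interpretation of the AUC quoted above can be checked directly. The sketch below, using made-up risk scores and outcomes, computes the AUC as the fraction of (event, non-event) pairs the model ranks correctly:

```python
import numpy as np
from itertools import product

# Hypothetical risk scores (higher = predicted higher risk) and
# observed outcomes (1 = complication occurred, 0 = it did not).
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
outcomes = np.array([1, 1, 0, 1, 0, 0])

def auc_by_pairs(scores, outcomes):
    """AUC as the fraction of (event, non-event) pairs the model ranks
    correctly, counting tied scores as half-correct."""
    pos = scores[outcomes == 1]
    neg = scores[outcomes == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

print(auc_by_pairs(scores, outcomes))  # 8 of 9 pairs ranked correctly
```

Here three of the six patients had the outcome, giving nine possible pairs; eight are ranked correctly, so the AUC is about 0.89.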
Other researchers developed and validated XGBoost ML models as part of the PERISCOPE AI system. The models were integrated into electronic health records (EHRs) to predict postoperative infections within 7 and 30 days. Specifically, the authors collected and processed EHR data across multiple surgical specialties, covering 253,010 procedures and 23,903 infections occurring within 30 days. The models were trained at a single hospital and then validated and updated at two different centers, achieving 30-day AUCs of 0.82 and 0.91, respectively [14].
Which artificial intelligence approach?
It remains to be determined whether models based on conventional ML algorithms or more complex neural networks provide superior performance. Anania et al. [5] evaluated ML models—based on decision trees and random forest algorithms—and DL architectures for predicting postoperative complications after laparoscopic surgery for colon cancer, using the multicenter CoDIG (ColonDx Italian Group) database with demographic, clinical, and surgical predictors. The DL approach achieved the best performance (accuracy 0.86, precision 0.84, recall 0.90, F1 0.87), and key predictors included intraoperative bleeding, blood transfusion, and fast-track recovery protocol adherence, with the absence of bleeding, intracorporeal anastomosis, and fast-track adherence linked to reduced risk.
Models with complex architectures offer undeniable advantages. In a recent study, Alba et al. [15] investigated the use of LLMs for perioperative risk prediction based on 84,875 preoperative clinical notes, complemented by external validation on the publicly available MIMIC-III dataset. Six major postoperative complications were considered: 30-day mortality, deep vein thrombosis, pulmonary embolism, pneumonia, AKI, and delirium. The authors implemented clinically oriented pretrained LLMs, including BioGPT, ClinicalBERT, and BioClinicalBERT, and benchmarked them against traditional NLP embeddings, including Word2Vec, Doc2Vec, GloVe, and FastText, which generate static word or document representations and lack the contextual understanding provided by transformer-based models. Pretrained LLMs significantly outperformed these embeddings, with absolute gains of up to 38.3% in AUC and up to 33.2% in the area under the precision–recall curve (AUPRC), and further improvements were obtained through fine-tuning strategies, including self-supervised learning, semi-supervised learning with label incorporation, and a unified foundation model using multi-task learning. Notably, the foundation model identified up to 39 additional high-risk patients per 100 compared to traditional NLP.
Other investigations have implemented LLMs. Chung et al. [16] evaluated OpenAI GPT-4 Turbo’s predictive performance across different clinical tasks, such as hospital and intensive care unit (ICU) admission, unplanned admission, hospital mortality, and durations of hospital and ICU stay. Using task-specific datasets derived from 2 years of retrospective EHRs at a quaternary care center, patient cases with at least one preoperative note were included. Various prompting strategies, including original notes, note summaries, few-shot, and chain-of-thought prompting, were compared. GPT-4 Turbo achieved moderate-to-high performance for categorical outcomes, with F1 scores reaching 0.86 for hospital mortality, although performance on duration prediction tasks was poor across all strategies. Convolutional neural networks (CNNs) are a class of DL architectures originally developed for image recognition that are increasingly applied in medicine to analyze complex, high-dimensional data. Through multiple layers of convolutional filters, these architectures can automatically identify and combine relevant patterns in structured or unstructured inputs, such as physiological signals, time-series intraoperative data, or imaging, without requiring manual feature engineering. Fritz et al. [17] applied a CNN framework to predict postoperative outcomes. In a cohort of nearly 96,000 surgical patients, a multipath CNN leveraging patient characteristics, comorbidities, laboratory values, and intraoperative data predicted 30-day mortality with high accuracy (AUC 0.867). The model outperformed a simple DL neural network and supervised ML algorithms such as random forest, support vector machine, and logistic regression.
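As an illustration of how a convolutional filter extracts patterns from a physiological time series, the sketch below applies a single hand-set downward-step kernel to a synthetic oxygen-saturation trace. In a real CNN these kernels are learned from data; all values here are invented:

```python
import numpy as np

# Synthetic oxygen-saturation trace (one reading per minute): stable
# around 98%, with a sudden desaturation event near minute 60.
rng = np.random.default_rng(0)
spo2 = 98 + rng.normal(0, 0.3, 120)
spo2[60:70] -= np.linspace(0, 8, 10)  # progressive drop

# One convolutional filter tuned to a "high before, low after" step.
kernel = np.array([1, 1, 1, -1, -1, -1])
# np.convolve flips its second argument, so reversing the kernel
# yields the sliding correlation a CNN layer computes.
response = np.convolve(spo2, kernel[::-1], mode="valid")

detected = int(np.argmax(response))
print("strongest downward-step response near minute", detected)
```

The filter response is near zero on the stable segments and peaks where the window straddles the desaturation, which is exactly the local pattern detection a learned convolutional layer performs at scale.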
Similarly, Schickel et al. [18] hypothesized that multi-task DL models would outperform traditional ML approaches in predicting postoperative complications, and that incorporating high-resolution intraoperative physiological time-series data would yield more granular and personalized health representations, thereby improving prognostication compared with models based solely on preoperative data. In a longitudinal cohort study of 56,242 patients undergoing 67,481 inpatient surgical procedures, the authors compared DL models with random forests and XGBoost (i.e., supervised ML models) for predicting postoperative complications using preoperative, intraoperative, and perioperative patient data. Their findings demonstrated several notable advantages of DL across multiple experimental settings, supporting its utility in generating more precise patient health representations for enhanced surgical decision support. According to their approach, multi-task learning can improve computational efficiency without sacrificing predictive performance. Additionally, interpretability mechanisms using integrated gradients can be implemented to identify potentially modifiable risk factors for each complication, while dropout methods provide quantitative measures of prediction uncertainty, features that may improve clinical trust.
Nonetheless, while DL-based multi-task learning offers interpretability and uncertainty quantification, ML strategies, if well conducted, can achieve comparable predictive performance with lower computational demands, greater transparency, and easier integration into existing clinical workflows. The choice, therefore, depends on the specific clinical context, the nature and volume of available data, the required balance between accuracy and interpretability, and the resources and expertise available for model development, deployment, and maintenance (Table 1).
Table 1.
Features of machine learning and artificial neural architectures
| Feature | Machine learning | Deep learning | Application for postoperative risk prediction |
|---|---|---|---|
| Definition | Algorithms that learn patterns from data to make predictions or decisions | Implement neural networks with multiple layers to learn complex patterns | Both can analyze patient data to predict the risk of postoperative complications |
| Data requirements | Works well with smaller datasets | Requires large datasets for optimal performance | Both can use hospital EHRs, although DL can leverage large, multi-center EHR datasets |
| Feature engineering | Manual preprocessing | Can automatically learn features from raw data | ML needs clinical variables chosen by experts; DL can extract patterns directly from raw EHR signals |
| Advantages | Provides interpretable risk factors | Captures complex nonlinear relationships, but is less interpretable | ML performs well on structured EHR features; DL can capture subtle interactions across multiple data modalities (labs, vitals, notes). Both can be integrated into hospital dashboards |
| Training time | Faster to train | Requires longer training and more computational power | ML can quickly retrain on new data; DL may need GPUs and more time |
| Interpretability | High (e.g., feature importance) | Low, requires specialized XAI techniques | DL needs extra tools to explain individual risk predictions |
Abbreviations: ML machine learning, DL deep learning, EHR electronic health record, XAI explainable AI
Building trust through uncertainty estimation and improving interpretability
Data quality and preprocessing
To this end, it is crucial to consider how to enhance the quality of the data to be processed before modelling and how the integration between clinicians and AI specialists should work in this direction. A crucial prerequisite for reliable modelling is the optimization of data quality before analysis. This complex process includes technical preprocessing steps such as cleaning, normalization, imputation of missing values, and dimensionality reduction. Nevertheless, it also requires active clinical validation to ensure that variables are accurately defined and clinically meaningful [19]. For example, data analysts must handle both continuous and categorical (often binary) variables, ensuring appropriate preprocessing to avoid bias and information loss. Categorical variables are easier to manage, as binary or categorical features can be directly encoded and integrated into predictive models. On the other hand, relying solely on binary encoding for comorbidities may oversimplify disease severity. For example, treating “diabetes” as a binary variable ignores glycemic control metrics, which are independently prognostic. This example introduces another pivotal methodological aspect. For risk assessment, comorbidities are often handled in a binary, linear, and sequential way, recorded simply as “present” or “absent” and tallied as individual risk factors. However, this oversimplifies reality. The impact of comorbidities is not always additive or proportional; instead, it can be non-linear, meaning that the combined effect of multiple conditions may be far greater (or, in some cases, less) than the sum of each condition’s risk. For instance, diabetes and chronic kidney disease each increase the likelihood of postoperative infection, but when present together, their combined effect can substantially exceed the sum of their individual risks. Notably, the POTTER score performed better when addressing non-linear relationships [13].
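The non-additive interaction described above can be made concrete with a small simulation. In this hypothetical example, two comorbidities each add little risk alone but are jointly far riskier; a logistic model with only additive terms miscalibrates the high-risk group, while adding an interaction term recovers it (all probabilities are invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(42)
n = 20000
diabetes = rng.integers(0, 2, n)
ckd = rng.integers(0, 2, n)

# Invented infection risks: each condition alone adds 2 percentage
# points, but the combination carries 40% risk (supra-additive).
p = np.where(diabetes & ckd, 0.40, 0.05 + 0.02 * diabetes + 0.02 * ckd)
infection = (rng.random(n) < p).astype(int)

X_add = np.column_stack([diabetes, ckd])                  # additive encoding
X_int = np.column_stack([diabetes, ckd, diabetes * ckd])  # + interaction term

m_add = LogisticRegression().fit(X_add, infection)
m_int = LogisticRegression().fit(X_int, infection)

ll_add = log_loss(infection, m_add.predict_proba(X_add)[:, 1])
ll_int = log_loss(infection, m_int.predict_proba(X_int)[:, 1])
p11 = m_int.predict_proba([[1, 1, 1]])[0, 1]  # risk with both present
print(f"log loss: additive {ll_add:.4f} vs interaction {ll_int:.4f}")
print(f"predicted risk with both comorbidities: {p11:.2f}")
```

The interaction model achieves a lower log loss and recovers the true 40% joint risk, whereas the purely additive encoding cannot represent it.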
This multidisciplinary collaboration is probably the real methodological key to enhancing subsequent modelling, and thereby system performance and clinical interpretability. For example, Krajnc et al. [20] proposed a clinically informed data preprocessing framework incorporating input from domain experts. Importantly, this integration of clinical knowledge into preprocessing strategies improved the accuracy of ML models by up to 20%.
Clinician acceptance of AI technology
Since acceptance is closely linked to the ability to understand a technology, it is of pivotal importance to address how well clinicians understand this methodology and how effectively computer scientists have optimized the clinical data. Multidisciplinary research uniting clinicians and data scientists is advancing quickly, producing interesting findings and commendable scientific contributions. Recently, Arina et al. [21] demonstrated an innovative application of multi-objective symbolic regression (MOSR) to predict 1-year mortality in patients undergoing complex non-cardiac surgery. They combined preoperative clinical and functional variables to address the gap in long-term perioperative risk prediction. Additionally, the authors introduced a methodological innovation, as MOSR, based on genetic programming, provides interpretable mathematical models and is naturally suited to unbalanced datasets. By optimizing metrics such as the F1 score and binary cross-entropy, they prioritized balanced performance across false positives and false negatives, avoiding over-reliance on accuracy and other metrics that can be misleading in skewed class distributions.
Explainable AI
Despite these attempts, methodological gaps persist and limit the widespread application of the results. The use of an explainable approach (i.e., explainable AI, XAI) should be the standard for all ML analyses and AI processes to ensure the reproducibility and generalizability of the model [22]. Most recent articles tend to offer this perspective. For example, in their study on LLMs, Alba et al. [15] used SHapley Additive exPlanations (SHAP) to explain how tokens such as words or symbols affected the outcome prediction. Beyond improving transparency, these XAI techniques also strengthen clinicians’ trust in AI systems by linking predictions to clinically interpretable phenomena.
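The idea behind SHAP can be illustrated with an exact, brute-force Shapley computation on a toy model. For a linear model the attributions should reduce to each weight times the deviation from the background, and they should sum to the gap between the prediction and the baseline prediction (the completeness property). The weights and inputs below are hypothetical:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley attributions for prediction f(x); features outside
    a coalition are replaced by background values."""
    d = len(x)
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(d - size - 1) / factorial(d)
                z_without = background.copy()
                for j in S:
                    z_without[j] = x[j]        # coalition S at observed values
                z_with = z_without.copy()
                z_with[i] = x[i]               # add feature i to the coalition
                phi[i] += weight * (f(z_with) - f(z_without))
    return phi

# Hypothetical linear risk score over three preoperative features.
w = np.array([0.8, -0.5, 0.3])
f = lambda z: float(w @ z)
x = np.array([2.0, 1.0, -1.0])
bg = np.array([0.5, 0.5, 0.5])

phi = shapley_values(f, x, bg)
print(phi)                       # equals w * (x - bg) for a linear model
print(phi.sum(), f(x) - f(bg))   # completeness: attributions sum to the gap
```

The exhaustive sum over coalitions is exponential in the number of features; the SHAP library's approximations exist precisely to make this tractable for real models.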
Real-world validation
After model building and internal validation, the crucial third step of external validation is often missing. Until it is achieved, these models remain purely theoretical, without genuine clinical translation. Simplified scoring tools or digital calculators derived from ML outputs could improve model usability. External validation was excellently performed in the PERISCOPE study. To enhance the implementation of locally valid AI systems in clinical practice, the authors developed a framework that allows the safe scaling of locally updated AI models. According to the authors, it is mandatory to move beyond the “one-size-fits-all” approach by implementing a methodology that provides local and external validation to address differences in patient populations across hospitals [14] (Fig. 1).
Fig. 1.
Artificial Intelligence workflow for postoperative risk prediction. Data sources include structured (laboratory results, demographics, comorbidities) and unstructured inputs (clinical notes, physiological time-series signals) derived from electronic health records (EHRs) and perioperative monitoring systems. Preprocessing involves multiple steps: data cleaning and imputation of missing values, handling of class imbalance, feature engineering and dimensionality reduction, normalization/standardization, encoding of categorical variables, and dataset partitioning for training, validation, and testing. This crucial step should require active clinical validation. Machine learning (ML) and deep learning (DL) models are utilized to predict postoperative complications, with support from explainable AI (XAI) methods that enhance interpretability and trust, as well as uncertainty quantification, to improve reliability. External validation across different centers and populations ensures robustness and generalizability of the models. Continuous dynamic updates from real-time EHR and monitoring feeds enable adaptive recalibration of predictions. Ultimately, this approach can allow personalized perioperative interventions, delivering patient-specific recommendations, optimizing triage, and guiding resource allocation
Beyond real-world investigation, it must also be underlined that, although some ML approaches, such as MOSR, are interpretable in principle, the clinical usability of the generated equations has not been fully demonstrated. Testing on hypothetical patient profiles (i.e., ML-based simulations) could improve transparency. For example, while automated variable selection allows the inclusion of all available features, it carries the risk of incorporating noisy or collinear predictors. Moreover, the “world” of variables is vast and complex. Complications following surgical procedures vary according to patient comorbidities, disease- or treatment-related factors, and the specific circumstances of the operation, such as trauma and emergency surgery. Dynamically changing intraoperative data should be carefully addressed [17]. Postoperative events can range from minor issues that resolve quickly without intervention to severe, potentially life-threatening problems that may require a return to the operating room or lead to prolonged hospitalization. Therefore, it remains to be understood which features are indispensable and which are marginal contributors, as well as how they should be managed during the preprocessing stage. For this aim, feature extraction techniques such as principal component analysis and linear discriminant analysis can be implemented to capture dependencies between features. In high-dimensional datasets, where the number of predictors can be very large compared with the number of observations, such approaches reduce redundancy, focusing the analysis on a smaller set of orthogonal components. This crucial step helps prevent overfitting, allowing the model to generalize better to unseen data [23].
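As a minimal sketch of this dimensionality reduction step, the example below simulates 30 correlated features driven by three latent factors, mimicking the redundancy of high-dimensional EHR variables, and lets PCA keep only the components needed to explain 95% of the variance (all data are synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n = 200
# 30 correlated "lab/vital" features generated from 3 latent factors
# plus a small amount of measurement noise.
latent = rng.normal(size=(n, 3))
loadings = rng.normal(size=(3, 30))
X = latent @ loadings + 0.1 * rng.normal(size=(n, 30))

# Keep the smallest set of orthogonal components explaining >= 95%
# of the total variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```

Because only three latent factors drive the data, PCA collapses the 30 redundant columns to a handful of orthogonal components, which is exactly the overfitting protection described above.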
Could predictive analytics refine validated tools, improving consistency across settings and institutions?
Adopting AI techniques can serve as a powerful tool to refine already validated instruments such as the Clavien–Dindo classification (CDC) and the Comprehensive Complication Index (CCI). Specifically, AI can enhance these tools by enabling more precise and individualized complications grading through automated data extraction from EHRs, analysis of clinical notes, and integration of time-series physiologic data. LLMs and other DL architectures could be effectively implemented for this aim. Furthermore, AI algorithms, particularly those using ML, can detect subtle patterns and non-linear relationships between patient characteristics, intraoperative factors, and postoperative outcomes that may not be evident through conventional scoring systems. This step can offer a game-changing solution for dynamic, real-time updates of complication severity scores as new clinical information becomes available. More importantly, this combined approach could improve consistency across institutions. In other words, since the issue of contextualization is a major obstacle to the generalization of AI results, an AI model trained on pre-existing scores (and their underlying big data) and supplemented with local data could make it possible to adapt predictions to the specific characteristics of the local patient population, healthcare practices, and resource availability. This top-down and calibrated strategy could significantly accelerate the process of delivering more personalized decision-making.
Why should clinicians be motivated to use predictive tools?
In order to adopt new technology in such a sensitive area, clinicians often need to move past an initial, understandable skepticism. The first step is to have at least a general understanding of how technology works, and to consider that its maturity and complexity can vary widely. It means that clinicians must recognize that just as there are different specialties and expertise within various areas of medicine, similarly, technical experts possess varying levels of proficiency with these methods. This awareness can help foster realistic expectations, encourage constructive dialogue between clinical and technical teams, and ultimately build the trust needed for safe and effective integration into practice.
Human-in-the-loop artificial intelligence
Human involvement is a key aspect of all these approaches. It encompasses variable selection, data mining, handling of class imbalance, imputation of missing data, choice of analytical tools, and the specific environments used (including software versions), all of which can influence the output of the analysis. Large datasets, for instance, often exhibit class imbalance (e.g., many more cases without complications than with complications), so techniques such as the Synthetic Minority Over-sampling Technique (SMOTE) should be applied to balance the data [23]. In other words, a dataset alone is not sufficient to define a model, and the model’s results must be interpreted with extreme caution in light of multiple possible methodological biases. Consequently, believing that ML or DL can perfectly predict postoperative complications by leveraging the full spectrum of preoperative and intraoperative patient-specific EHR data is a utopian notion.
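To make the oversampling idea concrete, here is a minimal numpy sketch of the core SMOTE mechanism, interpolating synthetic minority samples between nearest neighbours. Production work would typically use a maintained implementation such as imbalanced-learn; all data here are invented:

```python
import numpy as np

def smote(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE sketch: each synthetic sample lies on the segment
    between a minority sample and one of its k nearest minority neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude self-matches
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))          # pick a minority sample
        j = neighbours[i, rng.integers(k)]    # and one of its neighbours
        lam = rng.random()                    # random point on the segment
        synthetic[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic

# Imbalanced toy cohort: 8 complicated vs 95 uncomplicated cases.
rng = np.random.default_rng(7)
X_min = rng.normal(loc=2.0, size=(8, 4))
X_new = smote(X_min, n_new=87, rng=rng)
print(len(X_min) + len(X_new), "minority samples after oversampling")
```

Because every synthetic point is a convex combination of two real minority samples, oversampling stays within the observed range of each feature rather than inventing implausible values.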
Appropriate feature selection and processing are mandatory. Although model-agnostic interpretation methods can explain predictions regardless of the underlying algorithm, their application has often focused on statistical features, rather than highlighting clinically meaningful variables that physicians can directly translate into practice. This means that although models often explain which numerical features had the greatest weight, there is no explanation of which concrete clinical factors drove the prediction [24].
One critical issue is the quality and completeness of outcome data. Postoperative outcomes are often subject to misclassification. For example, adverse events may occur outside the primary institution and remain unrecorded, leading to “false negatives” in outcome ascertainment. This misclassification can systematically bias estimates of model accuracy. Simulation studies have demonstrated that such biases are particularly problematic when misclassification depends on patient-level predictors, a situation highly relevant in heterogeneous surgical populations. In this context, markers associated with both risk and data capture, such as socioeconomic status, hospital network fragmentation, or comorbidities, can impair the performance of predictive models, limiting their clinical utility. Several approaches can be adopted to address the issue of data complexity, preventing misclassified outcomes. The descriptive phase, for instance, should include sensitivity analyses, probabilistic outcome modeling, and linkage of multi-institutional datasets. Furthermore, models must be developed with an explicit awareness of data provenance and uncertainty, moving beyond simple accuracy metrics to incorporate reliability, calibration, and fairness [25].
Other important questions remain and must be addressed. For example, when is it appropriate to share prognostic information with patients and caregivers, and to make clinical management decisions regarding triage destination and resource allocation after surgery? Moreover, why should clinicians be motivated to use these predictive tools? One of the main barriers to adopting AI-based prediction models in clinical practice is the difficulty in interpreting their outputs [18, 26]. Patients, caregivers, and clinicians are far more likely to incorporate model-generated risk estimates into shared decision-making when they can understand the reasoning behind a prediction and when the results align with established medical knowledge and evidence. Techniques such as integrated gradients make these models more transparent by tracing how the model’s output changes as inputs move from a reference baseline to their observed values, and using this information to assign and communicate the relative importance of each feature [18]. By clarifying how predictions are generated, such methods can help reduce resistance to clinical adoption [26]. Furthermore, incorporating uncertainty metrics to show low variability across predictions can alleviate legitimate concerns, shared by both patients and clinicians, that an output might represent a rare but serious error, a risk for which AI systems are often criticized.
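For a differentiable model, integrated gradients can be sketched in a few lines: average the gradient along the straight path from a baseline to the observed input, then scale by the input difference. The logistic model and weights below are hypothetical; the completeness check confirms the attributions sum to the change in predicted risk:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def integrated_gradients(w, x, baseline, steps=1000):
    """Integrated gradients for a logistic model f(x) = sigmoid(w @ x):
    average the gradient along the straight path from baseline to x,
    then scale by the input difference (midpoint-rule approximation)."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    s = sigmoid(path @ w)
    grads = (s * (1 - s))[:, None] * w   # analytic gradient w.r.t. inputs
    return (x - baseline) * grads.mean(axis=0)

# Hypothetical logistic risk model with three preoperative features.
w = np.array([1.2, -0.7, 0.4])
x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)

attr = integrated_gradients(w, x, baseline)
gap = sigmoid(w @ x) - sigmoid(w @ baseline)
print(attr)
print(attr.sum(), gap)  # completeness: attributions sum to the risk change
```

For deep networks the gradients come from automatic differentiation rather than a closed form, but the path-averaging and the completeness property are the same.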
To address several methodological challenges, an optimal strategy may involve combining different techniques: ML models and neural networks of varying complexity, as well as models based on distinct ML approaches [27]. In one study, risk factors were first identified using logistic regression. Subsequently, a Bayesian network model, consisting of directed arcs and nodes, was applied to analyze the relationships between risk factors and complications. Probability ratios for complications, calculated for a given node state relative to the baseline probability, were used to quantify the potential effects of risk factors on complications or of complications on other complications. The authors recruited 19,223 participants and identified nodes, representing risk factors and complications, and direct relationships among them. Respiratory failure emerged as the central node, directly influenced by most risk factors and, in turn, directly affecting other complications. The AUC for the network’s ability to predict complications exceeded 0.7 [28]. However, Bayesian network approaches in this field also present certain limitations. They rely heavily on the quality and completeness of the input data, and missing or biased data can distort the inferred relationships. In these frameworks, the assumption of conditional independence among variables may oversimplify the complex and interdependent nature of clinical events. In addition, although Bayesian networks are useful for uncovering probabilistic associations, they cannot establish causality. Since this limitation is of pivotal importance and constrains interpretability and clinical applicability, integrating expert knowledge with data is needed [29] (Table 2).
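The probabilistic reasoning of such a network can be illustrated with a minimal two-node example using invented probabilities: forward inference gives the marginal complication risk, and Bayes' rule inverts the arc for diagnostic reasoning:

```python
# Hypothetical two-node network: RiskFactor -> Complication.
p_risk = 0.30                              # P(risk factor present)
p_comp_given = {True: 0.25, False: 0.05}   # P(complication | risk factor)

# Forward inference: marginal probability of the complication.
p_comp = p_risk * p_comp_given[True] + (1 - p_risk) * p_comp_given[False]

# Diagnostic inference by Bayes' rule: P(risk factor | complication).
p_risk_given_comp = p_risk * p_comp_given[True] / p_comp

print(f"P(complication) = {p_comp:.3f}")
print(f"P(risk factor | complication) = {p_risk_given_comp:.3f}")
```

Note that both directions of inference follow from the same conditional probability tables; the network encodes association, not causation, which is exactly the limitation discussed above.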
Table 2.
Key studies addressing AI-related applications for postoperative risk prediction
| Author | AI methods | Setting/dataset | Main findings | Notes |
|---|---|---|---|---|
| Bihorac et al. [10] | ML risk algorithm | > 50,000 inpatient surgeries | Predicted AKI, sepsis, VTE, ICU > 48 h, ventilation, wound, neuro/cardiovascular events, and mortality. AUC up to 0.94 | Outperformed physicians’ clinical judgment (Brennan et al. [11] follow-up study) |
| Bertsimas et al. [12] | Decision-tree supervised ML | NSQIP database, 382,960 patients | Strong discrimination for morbidity (AUC 0.84) and mortality (AUC 0.92), superior to ASA and NSQIP | External validation confirmed predictive accuracy, esp. for sepsis, respiratory failure, and AKI |
| van der Meijden et al. [14] | XGBoost ML, integrated in EHR | 253,010 procedures, 23,903 infections | Predicted 7- and 30-day postoperative infections. AUC 0.82–0.91 across centers | Successfully externally validated and updated across hospitals |
| Anania et al. [5] | Decision trees, random forest, DL neural networks | Multicenter ColonDx Italian Group (colon cancer laparoscopic surgery) | DL achieved the best performance (accuracy 0.86, F1 0.87). Key predictors: intraoperative bleeding, transfusion, and fast-track adherence | Highlighted DL’s superior ability over ML |
| Alba et al. [15] | Pretrained LLMs (BioGPT, ClinicalBERT, BioClinicalBERT) vs. Word2Vec, GloVe, etc | 84,875 preoperative notes + MIMIC-III | LLMs outperformed embeddings with AUC gains up to + 38.3%. Identified up to 39 more high-risk patients per 100 | Used SHAP for explainability; a foundation model with multi-task learning |
| Chung et al. [16] | GPT-4 Turbo, prompting strategies | 2 years of retrospective EHRs, quaternary center | Moderate-to-high performance on categorical tasks (F1 up to 0.86 for mortality), poor for duration prediction | Compared original notes, summaries, few-shot, and chain-of-thought prompts |
| Fritz et al. [17] | CNN (deep learning) | ~ 96,000 surgical patients | Predicted 30-day mortality with AUC 0.867 | Outperformed DL neural networks, RF, SVM, and logistic regression |
| Schickel et al. [18] | Multi-task DL, compared with RF, XGBoost | 56,242 patients, 67,481 surgeries | DL was superior across multiple settings, with better health representations | Used integrated gradients for interpretability and dropout for uncertainty |
| Arina et al. [21] | Multi-objective symbolic regression (MOSR, genetic programming) | Non-cardiac complex surgeries | Predicted 1-year mortality with interpretable models suited to unbalanced datasets | Focused on balanced metrics (F1, cross-entropy) to avoid bias |
| Yu et al. [28] | Bayesian network model | 19,223 participants, general surgery | Identified nodes/relationships among risk factors and complications. Respiratory failure central node. AUC > 0.7 | Limitations: conditional independence assumption, cannot establish causality |
Abbreviations: ML machine learning, DL deep learning, AI artificial intelligence, LLM large language model, AUC area under the curve, AUPRC area under the precision–recall curve, AKI acute kidney injury, VTE venous thromboembolism, ICU intensive care unit, EHR electronic health record, ASA American Society of Anesthesiologists, NSQIP National Surgical Quality Improvement Program, CNN convolutional neural network, SHAP SHapley Additive exPlanations, XAI explainable AI, MOSR multi-objective symbolic regression, CDC Clavien–Dindo classification, CCI Comprehensive Complication Index, RF random forest, SVM support vector machine
Preventing and forecasting adverse postoperative outcomes is a priority. However, considering current limitations, high-quality articles from multiprofessional research groups are urgently needed, as they must pave the way for a new generation of scientific research that merges clinical expertise and predictive analytics on topics of paramount importance. AI studies are often based on retrospective datasets and are primarily exploratory, which limits their strength of evidence. To ensure reliability, both internal and external validation are needed. When these steps are fulfilled, AI-driven models have the potential to support clinicians and healthcare providers in choosing operative strategies and anticipating postoperative outcomes. Currently, the path toward large-scale usability remains long; for now, we can mostly rely on descriptive models based on patient outcomes, and an osmosis between the descriptive and predictive approaches could improve both lines of research.
Acknowledgements
None.
Author’s contributions
Conceptualization, methodology, investigation, writing—review and editing, M.C.
Funding
This research received no external funding.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Dharap SB, Barbaniya P, Navgale S (2022) Incidence and risk factors of postoperative complications in general surgery patients. Cureus 14(11):e30975. 10.7759/cureus.30975
- 2.Wang J, Tozzi F, Ashraf Ganjouei A, Romero-Hernandez F, Feng J, Calthorpe L, Castro M, Davis G, Withers J, Zhou C, Chaudhary Z, Adam M, Berrevoet F, Alseidi A, Rashidian N (2024) Machine learning improves prediction of postoperative outcomes after gastrointestinal surgery: a systematic review and meta-analysis. J Gastrointest Surg 28(6):956–965. 10.1016/j.gassur.2024.03.006
- 3.Moll V, Khanna AK, Mathur P (2025) Artificial intelligence for the prediction of postoperative complications in the critically ill. Crit Care Sci 37:e20250025. 10.62675/2965-2774.20250025
- 4.Huang Z, Han Y, Zhuang H, Jiang J, Zhou C, Yu H (2025) Prediction models for postoperative pulmonary complications: a systematic review and meta-analysis. Br J Anaesth S0007-0912(25)00265-X. 10.1016/j.bja.2025.04.025
- 5.Anania G, Mascagni P, Chiozza M, Resta G, Campagnaro A, Pedon S, Silecchia G, Cuccurullo D, Bergamini C, Sica G, Nicola V, Alberti M, Ortenzi M, Reddavid R, Azzolina D; SICE CoDIG (ColonDx Italian Group) (2025) Deep learning neural network prediction of postoperative complications in patients undergoing laparoscopic right hemicolectomy with or without CME and CVL for colon cancer: insights from SICE (Società Italiana di Chirurgia Endoscopica) CoDIG data. Tech Coloproctol 29(1):135. 10.1007/s10151-025-03165-9
- 6.Hassan AM, Rajesh A, Asaad M, Nelson JA, Coert JH, Mehrara BJ, Butler CE (2023) Artificial intelligence and machine learning in prediction of surgical complications: current state, applications, and implications. Am Surg 89(1):25–30. 10.1177/00031348221101488
- 7.Razavian N, Knoll F, Geras KJ (2020) Artificial intelligence explained for nonexperts. Semin Musculoskelet Radiol 24(1):3–11. 10.1055/s-0039-3401041
- 8.Benke K, Benke G (2018) Artificial intelligence and big data in public health. Int J Environ Res Public Health 15(12):2796. 10.3390/ijerph15122796
- 9.Rashidi HH, Pantanowitz J, Hanna MG, Tafti AP, Sanghani P, Buchinsky A, Fennell B, Deebajah M, Wheeler S, Pearce T, Abukhiran I, Robertson S, Palmer O, Gur M, Tran NK, Pantanowitz L (2025) Introduction to artificial intelligence and machine learning in pathology and medicine: generative and nongenerative artificial intelligence basics. Mod Pathol 38(4):100688. 10.1016/j.modpat.2024.100688
- 10.Bihorac A, Ozrazgat-Baslanti T, Ebadi A, Motaei A, Madkour M, Pardalos PM, Lipori G, Hogan WR, Efron PA, Moore F, Moldawer LL, Wang DZ, Hobson CE, Rashidi P, Li X, Momcilovic P (2019) MySurgeryRisk: development and validation of a machine-learning risk algorithm for major complications and death after surgery. Ann Surg 269(4):652–662. 10.1097/SLA.0000000000002706
- 11.Brennan M, Puri S, Ozrazgat-Baslanti T, Feng Z, Ruppert M, Hashemighouchani H, Momcilovic P, Li X, Wang DZ, Bihorac A (2019) Comparing clinical judgment with the MySurgeryRisk algorithm for preoperative risk assessment: a pilot usability study. Surgery 165(5):1035–1045. 10.1016/j.surg.2019.01.002
- 12.Bertsimas D, Dunn J, Velmahos GC, Kaafarani HMA (2018) Surgical risk is not linear: derivation and validation of a novel, user-friendly, and machine-learning-based Predictive OpTimal Trees in Emergency Surgery Risk (POTTER) calculator. Ann Surg 268(4):574–583. 10.1097/SLA.0000000000002956
- 13.El Hechi MW, Maurer LR, Levine J, Zhuo D, El Moheb M, Velmahos GC, Dunn J, Bertsimas D, Kaafarani HM (2021) Validation of the artificial intelligence-based Predictive OpTimal Trees in Emergency Surgery Risk (POTTER) calculator in emergency general surgery and emergency laparotomy patients. J Am Coll Surg 232(6):912-919.e1. 10.1016/j.jamcollsurg.2021.02.009
- 14.van der Meijden SL, van Boekel AM, Schinkelshoek LJ, van Goor H, Steyerberg EW, Nelissen RGHH, Mesotten D, Geerts BF, de Boer MGJ, Arbous MS, PERISCOPE Group (2024) Development and validation of artificial intelligence models for early detection of postoperative infections (PERISCOPE): a multicentre study using electronic health record data. Lancet Reg Health Eur 49:101163. 10.1016/j.lanepe.2024.101163
- 15.Alba C, Xue B, Abraham J, Kannampallil T, Lu C (2025) The foundational capabilities of large language models in predicting postoperative risks using clinical notes. NPJ Digit Med 8(1):95. 10.1038/s41746-025-01489-2
- 16.Chung P, Fong CT, Walters AM, Aghaeepour N, Yetisgen M, O’Reilly-Shah VN (2024) Large language model capabilities in perioperative risk prediction and prognostication. JAMA Surg 159(8):928–937. 10.1001/jamasurg.2024.1621
- 17.Fritz BA, Cui Z, Zhang M, He Y, Chen Y, Kronzer A, Ben Abdallah A, King CR, Avidan MS (2019) Deep-learning model for predicting 30-day postoperative mortality. Br J Anaesth 123(5):688–695. 10.1016/j.bja.2019.07.025
- 18.Shickel B, Loftus TJ, Ruppert M, Upchurch GR Jr, Ozrazgat-Baslanti T, Rashidi P, Bihorac A (2023) Dynamic predictions of postoperative complications from explainable, uncertainty-aware, and multi-task deep neural networks. Sci Rep 13(1):1224. 10.1038/s41598-023-27418-5
- 19.Cascella M, Tracey MC, Petrucci E, Bignami EG (2023) Exploring artificial intelligence in anesthesia: a primer on ethics, and clinical applications. Surgeries 4(2):264–274. 10.3390/surgeries4020027
- 20.Krajnc D, Spielvogel CP, Ecsedi B, Ritter Z, Alizadeh H, Hacker M, Papp L (2025) Clinician-driven automated data preprocessing in nuclear medicine AI environments. Eur J Nucl Med Mol Imaging 52(9):3444–3454. 10.1007/s00259-025-07183-5
- 21.Arina P, Ferrari D, Tetlow N, Dewar A, Stephens R, Martin D, Moonesinghe R, Curcin V, Singer M, Whittle J, Mazomenos EB (2025) Mortality prediction after major surgery in a mixed population through machine learning: a multi-objective symbolic regression approach. Anaesthesia 80(5):551–560. 10.1111/anae.16538
- 22.Chicco D, Jurman G (2020) The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21(1):6. 10.1186/s12864-019-6413-7
- 23.Bellini V, Cascella M, Cutugno F, Russo M, Lanza R, Compagnone C, Bignami EG (2022) Understanding basic principles of artificial intelligence: a practical guide for intensivists. Acta Biomed 93(5):e2022297. 10.23750/abm.v93i5.13626
- 24.Lundberg SM, Nair B, Vavilala MS, Horibe M, Eisses MJ, Adams T, Liston DE, Low DK, Newman SF, Kim J, Lee SI (2018) Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng 2(10):749–760. 10.1038/s41551-018-0304-0
- 25.Wang LE, Shaw PA, Mathelier HM, Kimmel SE, French B (2016) Evaluating risk-prediction models using data from electronic health records. Ann Appl Stat 10(1):286–304. 10.1214/15-AOAS891
- 26.Balch JA, Loftus TJ (2023) Actionable artificial intelligence: overcoming barriers to adoption of prediction tools. Surgery 174(3):730–732. 10.1016/j.surg.2023.03.019
- 27.Cascella M, Coluccia S, Monaco F, Schiavo D, Nocerino D, Grizzuti M, Romano MC, Cuomo A (2022) Different machine learning approaches for implementing telehealth-based cancer pain management strategies. J Clin Med 11(18):5484. 10.3390/jcm11185484
- 28.Yu X, Chen W, Han W, Wu P, Shen Y, Huang Y, Xin S, Wu S, Zhao S, Sun H, Lei G, Wang Z, Xue F, Zhang L, Gu W, Jiang J (2023) Prediction of complications associated with general surgery using a Bayesian network. Surgery 174(5):1227–1234. 10.1016/j.surg.2023.07.022
- 29.Constantinou AC, Fenton N, Neil M (2016) Integrating expert knowledge with data in Bayesian networks: preserving data-driven expectations when the expert variables remain unobserved. Expert Syst Appl 56:197–208. 10.1016/j.eswa.2016.02.050