Trustworthy and Uncertainty-Aware AI for Predicting Respiratory Complications Following Total Hip and Knee Arthroplasty

Farnaz Rezvani*; Kate Towsen; Zoe Menezes; Anatea Einhorn; Jevaughn Davis; Puneet Gupta; Johannes F Plate; Chloe Fox; Nicole Myers*; Ahmad P Tafti*

. 2026 Feb 14;2025:1089–1099.

PMCID: PMC12919624 PMID: 41726476

Abstract

Total hip and knee arthroplasty (THA/TKA) are among the fastest-growing surgeries in the United States and are designed to restore mobility and improve quality of life in individuals with joint disorders. Despite their benefits, these procedures may carry significant risks, including major respiratory complications. Prompt identification of patients at increased risk is essential for optimizing preoperative treatment, reducing adverse outcomes, and increasing patient safety. In this study, we propose an uncertainty-aware and trustworthy artificial intelligence (AI) framework to predict the likelihood of major respiratory complications, including unplanned intubation, failure to wean from ventilation, and postoperative pneumonia, occurring during the index hospitalization and within 30 days following both primary and revision THA and TKA procedures. Unlike traditional risk models, our framework explicitly quantifies prediction uncertainty while maintaining high interpretability, enabling proactive and personalized clinical interventions. We assessed four machine learning (ML) models, namely Random Forest (RF), XGBoost (XGB), Logistic Regression (LR), and Artificial Neural Networks (ANNs), on three postoperative respiratory outcomes. The ML models demonstrated strong predictive performance, with RF achieving an F1-score of 0.87 for respiratory complications in THA, while ANNs outperformed other models in TKA, also attaining an F1-score of 0.87.

Keywords: Total Hip Arthroplasty (THA), Total Knee Arthroplasty (TKA), Respiratory Complications, Explainable AI (XAI), Uncertainty Quantification

1. Introduction

Total hip arthroplasty (THA) and total knee arthroplasty (TKA) are among the most frequently performed orthopedic procedures to treat osteoarthritis and other joint disorders¹,². While these surgeries substantially improve mobility and quality of life, they are associated with postoperative risks, including respiratory complications such as pneumonia, intubation, and ventilator dependence within 30 days post-surgery³,⁴. Early identification of high-risk patients is of paramount importance for optimizing preoperative planning, enhancing patient safety, reducing adverse outcomes, and minimizing prolonged hospitalization.

Artificial intelligence (AI) and machine learning (ML) have demonstrated strong potential in predicting postoperative complications in THA, TKA, and other orthopedic procedures⁴–⁸. These models enable data-driven clinical decision-making and risk stratification. However, their clinical utility is often limited by a lack of transparency and interpretability⁹–¹¹. Moreover, conventional ML models generally provide deterministic predictions, failing to convey the uncertainty inherent in complex clinical scenarios. Integrating uncertainty quantification into ML pipelines enables not only risk prediction but also assessment of prediction confidence, an essential factor in high-stakes settings such as healthcare. To address these challenges, we propose an uncertainty-aware and explainable AI framework to predict the risk of postoperative respiratory complications following THA and TKA. Our approach incorporates Monte Carlo Dropout (MC Dropout)¹² and out-of-bag (OOB)¹³ strategies to build uncertainty-aware ML models, while SHapley Additive exPlanations (SHAP)¹⁴ and Local Interpretable Model-Agnostic Explanations (LIME)¹⁵ enhance ML model interpretability and explainability. Using data from the American College of Surgeons National Surgical Quality Improvement Program (NSQIP)¹⁶, we evaluate ML model performance in identifying at-risk patients and key clinical predictors.

This work builds on recent studies applying ML in orthopedic and surgical settings. Shah et al.⁸ used an AutoPrognosis framework to outperform Logistic Regression (LR) in predicting complications after THA. Onishchenko et al.¹⁷ introduced a Cardiac Comorbidity Risk Score model for major cardiac events in THA/TKA, though sensitivity limitations highlighted the need for more robust and explainable methods. Di Matteo et al.¹⁸ developed models to estimate hospital length of stay, supporting resource allocation and care planning. More broadly, ML has enhanced diagnostic accuracy, treatment precision, and cost-efficiency in healthcare¹⁹, though persistent concerns around generalizability, transparency, and validation remain. Brnabic et al.²⁰ reviewed ML models in clinical decision-making, emphasizing the lack of external validation and limited model diversity in most studies. Other work, such as Gutierrez-Naranjo et al.²¹, demonstrated the utility of ML in predicting surgical site infections, while Li et al.²² showed that XGBoost (XGB) outperformed other models in predicting surgical risk across operative stages. Our framework extends this body of work by explicitly integrating uncertainty and interpretability into the AI pipeline. The clinical and technical significance of our contributions is as follows.

Clinical Significance: By building interpretable and uncertainty-aware predictions of postoperative respiratory risk, our framework supports more informed and personalized clinical decisions, enhances patient safety, and facilitates early intervention for high-risk individuals in THA and TKA settings.
Technical Significance: This study advances the field of explainable AI by combining uncertainty quantification with ML model interpretability, thus improving reliability and promoting clinician trust, both of which are key steps in AI implementation.

2. Materials and Methods

2.1. Data Sources and Patient Cohort

This study leverages patient data from the NSQIP¹⁶, focusing on individuals who underwent THA or TKA as their primary procedure between 2021 and 2023. To construct the dataset, a systematic extraction of current procedural terminology (CPT) codes was performed. Initially, all CPT codes available in the database were analyzed, and the most frequently recorded procedures were identified. From this set, only the codes specifically linked to primary and revision THA or TKA were selected and aggregated to form the final dataset. Below is the list of CPT codes included in this study:

27125: Under Repair, Revision, and/or Reconstruction Procedures [pelvis and hip joint]
27130: Under Repair, Revision, and/or Reconstruction Procedures [pelvis and hip joint]
27134: Under Repair, Revision, and/or Reconstruction Procedures [pelvis and hip joint]
27236: Under Fracture and/or Dislocation Procedures [pelvis and hip joint]
27245: Under Fracture and/or Dislocation Procedures [pelvis and hip joint]
27447: Under Repair, Revision, and/or Reconstruction Procedures [knee joint]
27446: Under Repair, Revision, and/or Reconstruction Procedures [knee joint]
27487: Under Repair, Revision, and/or Reconstruction Procedures [knee joint]

Patients were included if their records had complete demographic, clinical, and surgical information, as well as documented outcomes within 30 days after surgery. Records with missing or incomplete key data were excluded to ensure reliable analysis.

2.2. Outcome Measures

This study examines three postoperative respiratory complications, including (1) postoperative pneumonia (OUPNEUMO), (2) unplanned intubation (REINTUB), and (3) failure to wean from ventilation (FAILWEAN), which were tracked based on NSQIP-defined postoperative evaluation criteria. To ensure data accuracy and consistency, all reported cases underwent clinical validation, reinforcing dataset reliability and maintaining alignment with established research methodologies. This rigorous verification process enhanced the credibility of outcome identification, reducing the risk of misclassifications, and ensuring robustness in subsequent analyses.

Since respiratory complications occurred less frequently than uneventful recoveries, the dataset exhibited a natural imbalance in outcome distribution. To prevent this skewed representation from influencing risk factor analysis, undersampling techniques were applied to address class imbalance and mitigate bias during model training. This approach ensured a more balanced representation of each class. As a result, the model was better equipped to learn from both outcomes equally, supporting more reliable and objective predictions of adverse events. To maintain uniformity in evaluation and comparison, patient outcomes were classified into two distinct categories: “1” for cases involving respiratory complications and “0” for patients who experienced an uneventful recovery.

2.3. Data Preprocessing and Feature Selection

The dataset included demographic attributes (e.g., sex, race, and age), patient history (e.g., hypertension, diabetes, and functional status), preoperative variables (e.g., anesthesia type, ASA classification, and transfusion history), and clinical and physiological markers (e.g., creatinine, WBC count, and partial thromboplastin time). To address missing values, continuous variables were either imputed with the median or removed, while categorical variables (e.g., orthopedic age-based classification, smoking status, diabetes classification, and anesthesia type) were numerically encoded for compatibility with ML models. Additionally, variable standardization was conducted using adaptive data mapping. Table 1 shows the predictive variables, while Table 2 enumerates the three outcome/target classes used for prediction: OUPNEUMO, REINTUB, and FAILWEAN.
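The preprocessing steps described above can be sketched as follows. This is a minimal illustration on hypothetical toy values, not the authors' pipeline: the helper names are invented for this example, and z-score scaling is an assumption standing in for the "adaptive data mapping" step, which the text does not specify.

```python
import numpy as np

def impute_median(col):
    """Replace NaN entries of a 1-D array with the median of the observed values."""
    col = np.asarray(col, dtype=float).copy()
    col[np.isnan(col)] = np.nanmedian(col)
    return col

def encode_categories(values):
    """Map each distinct label to an integer code (codes follow sorted label order)."""
    labels, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes, labels

def standardize(col):
    """Z-score scaling (assumed here; the paper's exact mapping is not specified)."""
    col = np.asarray(col, dtype=float)
    std = col.std()
    return (col - col.mean()) / std if std > 0 else np.zeros_like(col)

# Hypothetical toy records (not NSQIP data).
creatinine = [0.9, np.nan, 1.4, 1.1]
smoking = ["no", "yes", "no", "no"]

creat_filled = impute_median(creatinine)                 # NaN replaced by the median
smoke_codes, smoke_labels = encode_categories(smoking)   # "no" -> 0, "yes" -> 1
creat_scaled = standardize(creat_filled)                 # zero mean, unit variance
```

In a real pipeline the median and scaling statistics would be estimated on the training folds only and then applied to held-out data, to avoid leakage.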

Table 1: Feature definitions and dataset structure.

Table 2: Outcome (OUPNEUMO, REINTUB, and FAILWEAN).

To explore variable distributions, Exploratory Data Analysis (EDA)²³ was conducted, focusing on the three respiratory complications (OUPNEUMO, REINTUB, and FAILWEAN). Since these complications occurred less frequently than uneventful recoveries, the dataset exhibited a natural class imbalance. To counteract it, a random undersampling strategy was applied separately to the hip and knee cohorts, yielding an equal number of complication (class 1) and non-complication (class 0) cases.

After undersampling, 4,292 cases for OUPNEUMO (2,146 with no pneumonia and 2,146 with pneumonia), 1,288 cases for REINTUB (644 without unplanned reintubation and 644 with reintubation), and 604 cases for FAILWEAN (302 without prolonged ventilator dependence and 302 with prolonged dependence) were included in the final balanced dataset for the hip surgery cohort. Similarly, the knee surgery cohort was balanced to contain 860 cases for OUPNEUMO (430 as class 0 and 430 as class 1), 342 cases for REINTUB (171 as class 0 and 171 as class 1), and 146 cases for FAILWEAN (73 as class 0 and 73 as class 1). Additionally, sex-based distribution analysis showed that females had a slightly higher proportion of cases across all respiratory complications in both hip and knee cohorts (Figure 1).
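Random undersampling of the majority class, as described above, might look like the following sketch. The function name, the seed, and the toy cohort sizes are assumptions made for illustration; the paper does not report its sampling code.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Downsample the larger class at random so both classes have equal size."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(y == 1)   # complication cases
    idx_neg = np.flatnonzero(y == 0)   # uneventful recoveries
    n = min(len(idx_pos), len(idx_neg))
    keep = np.concatenate([
        rng.choice(idx_pos, size=n, replace=False),
        rng.choice(idx_neg, size=n, replace=False),
    ])
    rng.shuffle(keep)                  # avoid class-ordered rows
    return X[keep], y[keep]

# Hypothetical imbalanced toy cohort: 950 uneventful, 50 complications.
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)

X_bal, y_bal = random_undersample(X, y)
print(len(y_bal), int(y_bal.sum()))    # 100 rows, 50 of them class 1
```

Because undersampling discards majority-class records, metrics measured on the balanced set reflect a 50% event rate rather than the event rate of the full registry.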

Figure 1: Study cohort and sex-based distribution.

Building on the preprocessing steps and study cohort selection, we performed an exploratory analysis to assess the distribution of age groups across different race and sex categories for both hip and knee datasets. The percentage distribution of age, race, and sex groups in hip and knee patients is shown in Figure 2 and Figure 3.

Figure 2: Respiratory risk distribution by age, race, and sex in THA.

Figure 3: Respiratory risk distribution by age, race, and sex in TKA.

2.4. Machine Learning Models

To construct a robust predictive framework, we implemented and compared multiple supervised learning algorithms of varying complexity and interpretability: Random Forest (RF)¹³, XGB²⁴, Artificial Neural Networks (ANNs)²⁵, and LR²⁶. To fine-tune the performance of our ML models, we employed Optuna²⁷, a flexible and efficient framework for automated hyperparameter optimization. Optuna leverages advanced search strategies, including the Tree-structured Parzen Estimator (TPE)²⁸, to systematically identify well-performing parameter configurations while minimizing computational overhead. This approach contributes to improved model generalization by mitigating overfitting and maintaining an effective trade-off between complexity and predictive performance.

2.5. Uncertainty Awareness and AI Explainability Analysis

To address predictive uncertainty, we incorporated model-specific strategies for quantifying confidence in outputs. Within the ANNs, we applied MC Dropout¹², performing multiple stochastic forward passes to estimate the variability in predictions. For the RF model, we utilized OOB¹³ estimation, which offers an internal measure of model reliability by evaluating performance on the data subsets left out of each bootstrap sample. In terms of interpretability, we employed two complementary explainability methods. SHAP¹⁴ was used to assess both global and local feature contributions, providing insight into how input variables influence predictions. Additionally, LIME¹⁵ was used to generate instance-level explanations, providing transparency for individual cases. These techniques collectively supported model interpretability and clinical trust. Full details are presented in the Experimental Validation and Results section.
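To make the MC Dropout idea concrete, the sketch below keeps dropout active at prediction time and summarizes repeated stochastic forward passes by their mean and standard deviation. The two-layer network, its random weights, the dropout rate, and the number of passes are all hypothetical; the paper does not report its architecture or these settings.

```python
import numpy as np

def mc_dropout_predict(x, W1, b1, W2, b2, p_drop=0.5, n_passes=100, seed=0):
    """Run n_passes stochastic forward passes with dropout left on.

    Returns the per-sample mean probability and its standard deviation,
    the latter serving as the uncertainty estimate.
    """
    rng = np.random.default_rng(seed)
    probs = np.empty((n_passes, x.shape[0]))
    for t in range(n_passes):
        h = np.maximum(0.0, x @ W1 + b1)                 # hidden layer (ReLU)
        mask = rng.random(h.shape) >= p_drop             # fresh dropout mask per pass
        h = h * mask / (1.0 - p_drop)                    # inverted-dropout scaling
        logits = h @ W2 + b2
        probs[t] = 1.0 / (1.0 + np.exp(-logits[:, 0]))   # sigmoid output
    return probs.mean(axis=0), probs.std(axis=0)

# Hypothetical weights standing in for a trained network.
rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(5, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
x = rng.normal(size=(8, 5))                              # 8 toy patients, 5 features

mean_p, std_p = mc_dropout_predict(x, W1, b1, W2, b2)
```

A larger standard deviation flags a prediction the network is less consistent about; setting `p_drop=0.0` recovers an ordinary deterministic forward pass with zero spread.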

3. Experimental Validation and Results

This section presents the experimental evaluation and performance comparison of the ML models. To ensure robustness and generalizability, and to minimize overfitting, a 5-fold Cross-Validation (CV) strategy was implemented. The evaluation metrics were accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (ROC-AUC). All analyses were conducted using Python in the Google Colab environment to ensure reproducibility and scalability. We assessed four ML models (RF, XGB, LR, and ANNs) on three major postoperative respiratory complications: OUPNEUMO, REINTUB, and FAILWEAN. Tables 3 and 4 report results for hip and knee arthroplasty, respectively. RF yielded the best predictive accuracy in the THA cohort, where its uncertainty was quantified using OOB estimation. The ANN model, however, performed best in the TKA cohort, where we applied MC Dropout to estimate predictive uncertainty through stochastic forward passes.
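A 5-fold evaluation with the metrics listed above could be set up along these lines with scikit-learn. The data are synthetic, and the use of stratified, shuffled folds and of Logistic Regression as the example estimator are assumptions; the text specifies only that 5-fold CV was used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic balanced data standing in for an undersampled cohort (not NSQIP data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           weights=[0.5, 0.5], random_state=0)

# Stratified folds keep the class ratio similar in every split (assumed choice).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=cv, scoring=metrics)

# Average each metric over the five held-out folds.
summary = {m: float(np.mean(scores[f"test_{m}"])) for m in metrics}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Reporting the fold-to-fold spread alongside the mean gives a rough sense of how stable each metric is on cohorts as small as the FAILWEAN subsets.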

Table 3: ML models performance in predicting respiratory complications in THA.

Table 4: ML models performance in predicting respiratory complications in TKA.

Figures 4 and 5 show ROC curves that assess ML model efficacy in forecasting postoperative respiratory complications following THA and TKA. These figures illustrate how well each model balances sensitivity and specificity at different classification thresholds. Overall, RF and ANNs consistently exhibited the most favorable trade-offs. In the THA cohort, ANNs achieved the strongest performance overall, with AUC values of 0.92 (OUPNEUMO), 0.95 (REINTUB), and 0.95 (FAILWEAN). LR also performed well, reaching AUCs of 0.88, 0.92, and 0.92 for the respective outcomes. RF maintained reliable discrimination with AUCs of 0.86 (OUPNEUMO), 0.90 (REINTUB), and 0.90 (FAILWEAN), while XGB demonstrated modest performance, scoring AUCs of 0.77, 0.83, and 0.88 for the same outcomes. In the TKA cohort, model performance was more variable. ANNs continued to yield favorable results, especially for OUPNEUMO (AUC = 0.90), though performance was lower for REINTUB (AUC = 0.83) than for FAILWEAN (AUC = 0.95). LR maintained strong and consistent outcomes with AUCs of 0.85, 0.90, and 0.91, reinforcing its robustness. RF models demonstrated competitive effectiveness with AUCs of 0.82 (OUPNEUMO), 0.91 (REINTUB), and 0.83 (FAILWEAN). In contrast, XGB showed the weakest results in this group, particularly for FAILWEAN (AUC = 0.62) and OUPNEUMO (AUC = 0.76), although performance for REINTUB (AUC = 0.88) remained respectable. These comparative results highlight the superior and stable performance of ANNs and LR across both surgical populations, emphasizing their utility for clinical prediction of respiratory complications. Meanwhile, the sensitivity of XGB to dataset complexity suggests limited generalizability in certain preoperative contexts.

Figure 4: ROC curves for the ML models, predicting respiratory risks in THA.

Figure 5: ROC curves for the ML models, predicting respiratory risks in TKA.
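For readers less familiar with the metric, ROC-AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, so it always lies between 0 and 1, with 0.5 corresponding to chance. The sketch below computes it directly from that definition on hypothetical scores; the function name is illustrative.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]        # every positive-negative pair
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Hypothetical labels and predicted risks.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75: three of the four positive-negative pairs are ordered correctly
```

The pairwise form is quadratic in sample size and is meant only to illustrate the definition; library implementations use a sorted, threshold-based computation.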

To enhance ML model transparency, we applied both SHAP and LIME to interpret the behavior of our ML models across the THA and TKA cohorts. In the THA cohort (Figure 6), SHAP analysis identified CDARREST, PRHCT, and INOUT as the most influential predictors for combined respiratory complications. Higher values in these features were associated with increased predicted risk. Additional contributors, such as ASACLAS, ANESTHES, and PRCREAT, highlighted the relevance of preoperative condition and laboratory indicators. This global interpretability supports clinical trust by aligning model behavior with established medical knowledge. In the TKA cohort (Figure 7), SHAP analysis revealed that INOUT, CDMI, and Age had the strongest impact on model output, with higher feature values linked to elevated respiratory risk. Other important variables, such as PULEMBOL, ASACLAS, and HYPERMED, also aligned with known clinical risk factors. The consistent pattern of SHAP values across features (with red indicating higher risk and blue indicating lower risk) reinforces the model’s clinical relevance in identifying high-risk patients undergoing knee arthroplasty.

Figure 6: SHAP summary of the RF model for combined respiratory complications prediction in THA.

Figure 7: SHAP summary of the ANNs model for combined respiratory complication prediction in TKA.
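SHAP attributions are estimates of Shapley values. As an illustration of what those values represent, the sketch below computes them exactly by enumerating feature subsets for a hypothetical three-feature risk score. This brute-force approach is not the SHAP library and is feasible only for a handful of features; the model, the feature names, and the baseline patient are invented for the example.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are set to their baseline values.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Weight of a coalition of size k that excludes feature i.
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

# Hypothetical risk score with an interaction term (not the paper's model).
def toy_risk(z):
    age, asa_class, inpatient = z
    return 0.02 * age + 0.1 * asa_class + 0.3 * inpatient + 0.05 * asa_class * inpatient

x = [70, 3, 1]          # patient being explained
baseline = [60, 2, 0]   # reference patient
phi = shapley_values(toy_risk, x, baseline)
```

The attributions sum to the difference between the prediction for `x` and the prediction for `baseline`, which is the additivity property that SHAP summary plots rely on.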

In the THA cohort, the LIME visualization in Figure 8 illustrates how specific features influenced an individual prediction. CDARREST, CDMI, and PULEMBOL were major contributors to increased predicted risk, while ASACLAS > 3.0 and higher age values showed modest protective effects. The absence of comorbidities such as HXCHF and normal lab results (e.g., PRWBC, PRHCT) further supported a lower-risk outcome. Similarly, in the TKA cohort, the LIME visualization in Figure 9 illustrates how individual features influenced the model’s prediction. Risk-increasing variables such as OTHDVT, CDARREST, CDMI, PULEMBOL, and HXCHF shifted the prediction toward the event class. In contrast, features such as DISCANCR, ASACLAS, and Age acted as protective factors, lowering the predicted risk. Additional variables such as HXCOPD and PRHCT had moderate effects. This local explanation highlights the model’s ability to incorporate clinical variables into interpretable, patient-specific risk assessments in knee arthroplasty.

Figure 9: LIME explanation of the ANNs model using a random sample for combined respiratory complications in TKA.
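The core idea behind LIME can be sketched as a locally weighted linear surrogate: perturb the instance, query the black-box model, weight each perturbed sample by its proximity to the instance, and fit a linear model whose coefficients act as the local explanation. The version below is a simplified illustration and omits steps the LIME library performs for tabular data, such as feature discretization and feature selection; the model, kernel, and parameter values are assumptions.

```python
import numpy as np

def lime_like_explanation(predict, x, n_samples=500, scale=1.0,
                          kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around x.

    Returns (intercept, weights); weights[i] is the local effect of feature i.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    Z = x + rng.normal(scale=scale, size=(n_samples, x.size))   # perturbed neighbours
    y = predict(Z)                                              # black-box outputs
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)                # closer samples count more
    A = np.hstack([np.ones((n_samples, 1)), Z - x])             # intercept + centred features
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)  # weighted least squares
    return coef[0], coef[1:]

def toy_model(Z):
    """Hypothetical black-box risk model (sigmoid of a linear score)."""
    return 1.0 / (1.0 + np.exp(-(0.8 * Z[:, 0] - 0.5 * Z[:, 1])))

intercept, weights = lime_like_explanation(toy_model, x=[0.2, -0.1])
```

Positive weights push the local prediction toward the event class and negative weights away from it, which is how the bar directions in a LIME plot are read.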

Figure 10 presents the OOB-based uncertainty estimation for the RF model across respiratory complications in the THA cohort. The chart compares OOB accuracy scores for REINTUB, FAILWEAN, and OUPNEUMO predictions, along with the weighted OOB and combined CV accuracy benchmarks. The model achieved consistent OOB scores: 0.81 (REINTUB), 0.79 (FAILWEAN), and 0.77 (OUPNEUMO), indicating relatively high confidence in its predictions. The weighted OOB score (0.79) closely aligns with the Combined CV Accuracy (0.81), suggesting stability and good generalizability.
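OOB accuracy of the kind reported above can be obtained from scikit-learn's Random Forest by enabling `oob_score`. The sketch below uses synthetic data and arbitrary hyperparameters; it illustrates the mechanism only and does not reproduce the reported scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic balanced data (not NSQIP data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)

# Each tree is fit on a bootstrap sample; rows left out of that sample are
# "out-of-bag" for the tree and act as its internal validation set.
rf = RandomForestClassifier(n_estimators=300, bootstrap=True,
                            oob_score=True, random_state=0)
rf.fit(X, y)

oob_accuracy = rf.oob_score_                     # accuracy from OOB votes only
oob_proba = rf.oob_decision_function_[:, 1]      # per-sample OOB class-1 probability
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

Comparing `oob_accuracy` with a cross-validated accuracy, as the paragraph above does, is a quick consistency check: a large gap between the two would suggest unstable estimates.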

Figure 11 illustrates the distribution of predictive uncertainty derived from MC Dropout in the ANNs model for the TKA cohort. The x-axis represents the standard deviation of the model’s output across multiple stochastic forward passes, reflecting uncertainty in individual predictions. The distribution is concentrated around lower standard deviation values, indicating that the majority of predictions were made with high confidence. Only a small fraction of cases exhibited elevated uncertainty levels, suggesting that the model was generally consistent in its output. This insight supports the use of ANNs in clinical decision support by highlighting where the ML model’s predictions are most reliable.

4. Conclusion and Outlook

In this study, we developed and evaluated an uncertainty-aware and explainable AI framework for predicting major respiratory complications following THA and TKA. Our findings highlight the importance of integrating uncertainty quantification and explainability into ML-driven clinical decision support systems. By leveraging MC Dropout and OOB estimation, we enhanced ML model reliability, allowing surgeons and clinicians to assess prediction confidence. Furthermore, the incorporation of SHAP and LIME facilitated the interpretability and transparency of our models, offering insights into key risk factors and their contributions to individual patient outcomes.

Explainability and uncertainty are especially important in high-risk clinical settings like THA and TKA, where treatment decisions can affect patient safety and recovery. While traditional ML models provide deterministic predictions, our framework accounts for the inherent variability in patient data, thus fostering clinicians' trust in the ML models and improving their usability. The ability to communicate not only the likelihood of complications but also the confidence in these predictions empowers healthcare providers to make informed and patient-centered decisions. This can help clinicians decide when to act, when to monitor, or when to collect more information. In practice, these outputs could be integrated into the electronic health record (EHR) and shown through a simple interface, such as a dashboard. Surgeons, clinicians, and anesthesiologists could view a patient’s risk score, a confidence level (e.g., high, medium, or low), and a list of top contributing factors or features to help guide decisions before and after surgery. Despite these advancements, our study carries some limitations. First, although we employed robust ML methodologies, our dataset was derived from a single national registry (NSQIP), which may introduce biases related to patient demographics, surgical techniques, and institutional practices. Additionally, while we applied uncertainty quantification techniques, further exploration is needed to evaluate their impact on real-world clinical workflows and decision-making. Lastly, our ML models were trained on retrospective data, and prospective validation in diverse clinical environments remains an essential next step.

Future work will focus on facilitating the clinical adoption of AI-powered decision support tools. This includes integrating our models into EHR systems for real-time risk prediction, conducting external validation across multiple healthcare institutions, and assessing the impact of AI-assisted decision-making on patient outcomes. Furthermore, enhancing model robustness through federated learning approaches could improve generalizability. By prioritizing transparency, uncertainty awareness, and clinical relevance, we aim to advance the responsible implementation of AI in orthopedic surgery, ultimately improving patient safety and surgical outcomes in THA and TKA settings.

Acknowledgment

The authors thank the American College of Surgeons, particularly the National Surgical Quality Improvement Program (NSQIP), for the datasets used in this research.

References

[1] Gademan Maaike GJ, Hofstede Stefanie N, Vliet Vlieland Thea PM, Nelissen Rob GHH, Marang-Van de Mheen Perla J. “Indication criteria for total hip or knee arthroplasty in osteoarthritis: a state-of-the-science overview.” BMC Musculoskeletal Disorders. 2016;17:1–11.
[2] Liu Jiabin, Wilson Lauren, Poeran Jashvant, Fiasconaro Megan, Kim David H, Yang Elaine, Memtsoudis Stavros. “Trends in total knee and hip arthroplasty recipients: a retrospective cohort study.” Regional Anesthesia & Pain Medicine. 2019;44(9):854–859.
[3] Waterman Brian R, Belmont Julia O, Jr, Bader Philip J, Schoenfeld Andrew J. “The total joint arthroplasty cardiac risk index for predicting perioperative myocardial infarction and cardiac arrest after primary total knee and hip arthroplasty.” The Journal of Arthroplasty. 2016;31(6):1170–1174.
[4] Kataria Rahul, Iniguez Reniell, Foy Michael, Sood Anshum, Gonzalez Mark E. “Preoperative risk factors for postoperative cardiac arrest following primary total hip and knee arthroplasty: A large database study.” Journal of Clinical Orthopaedics and Trauma. 2021;16:244–248.
[5] Devana Sai K, Shah Akash A, Lee Changhee, Roney Andrew R, van der Schaar Mihaela, SooHoo Nelson F. “A novel, potentially universal machine learning algorithm to predict complications in total knee arthroplasty.” Arthroplasty Today. 2021;10:135–143.
[6] Lee Lok Sze, Chan Ping Keung, Wen Chunyi, Fung Wing Chiu, Cheung Amy, Chan Vincent Wai Kwan, Cheung Man Hong, Fu Henry, Yan Chun Hoi, Chiu Kwong Yuen. “Artificial intelligence in diagnosis of knee osteoarthritis and prediction of arthroplasty outcomes: a review.” Arthroplasty. 2022;4(1):16.
[7] Kim Annie, Wang Hongtao, Myers Nicole, Gupta Puneet, Steuer Fritz, Kann Michael R, Cong Ting, Liu Hongfang, Tafti Ahmad P. “Predicting unplanned return to operating room following primary total shoulder arthroplasty: Insights from fair and explainable ensemble machine learning.” Studies in Health Technology and Informatics. 2024;318:156–160.
[8] Shah Akash A, Devana Sai K, Lee Changhee, Kianian Reza, van der Schaar Mihaela, SooHoo Nelson F. “Development of a novel, potentially universal machine learning algorithm for prediction of complications after total hip arthroplasty.” The Journal of Arthroplasty. 2021;36(5):1655–1662.
[9] Amann Julia, Blasimme Alessandro, Vayena Effy, Frey Dietmar, Madai Vince I, Precise4Q Consortium. “Explainability for artificial intelligence in healthcare: a multidisciplinary perspective.” BMC Medical Informatics and Decision Making. 2020;20:1–9.
[10] Amirian Soheyla, Carlson Luke A, Gong Matthew F, Lohse Ines, Weiss Kurt R, Plate Johannes F, Tafti Ahmad P. “Explainable AI in orthopedics: Challenges, opportunities, and prospects.” 2023.
[11] Saraswat Deepti, Bhattacharya Pronaya, Verma Ashwin, Prasad Vivek Kumar, Tanwar Sudeep, Sharma Gulshan, Bokoro Pitshou N, Sharma Ravi. “Explainable AI for healthcare 5.0: opportunities and challenges.” IEEE Access. 2022;10:84486–84517.
[12] Gal Yarin, Ghahramani Zoubin. “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.” International Conference on Machine Learning. PMLR. 2016:1050–1059.
[13] Breiman Leo. “Random forests.” Machine Learning. 2001;45:5–32.
[14] Lundberg Scott M, Lee Su-In. “A unified approach to interpreting model predictions.” Advances in Neural Information Processing Systems. 2017;30.
[15] Ribeiro Marco Tulio, Singh Sameer, Guestrin Carlos. “Why should I trust you? Explaining the predictions of any classifier.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:1135–1144.
[16] American College of Surgeons. “National Surgical Quality Improvement Program (NSQIP).” 2025. Accessed: 2025-02-19. https://www.facs.org/quality-programs/acs-nsqip.
[17] Onishchenko Dmytro, Rubin Daniel S, van Horne James R, Parker Ward R, Chattopadhyay Ishanu. “Cardiac comorbidity risk score: Zero-burden machine learning to improve prediction of postoperative major adverse cardiac events in hip and knee arthroplasty.” Journal of the American Heart Association. 2022;11(15):e023745.
[18] Di Matteo Vincenzo, Tommasini Tobia, Morandini Pierandrea, Savevski Victor, Grappiolo Guido, Loppini Mattia. “Machine learning prediction model to predict length of stay of patients undergoing hip or knee arthroplasties: Results from a high-volume single-center multivariate analysis.” Journal of Clinical Medicine. 2024;13(17):5180.
[19] Naskar Sweet, Sharma Suraj, Kuotsu Ketousetuo, Halder Suman, Pal Goutam, Saha Subhankar, Mondal Shubhadeep, Biswas Ujjwal Kumar, Jana Mayukh, Bhattacharjee Sunirmal. “The biomedical applications of artificial intelligence: an overview of decades of research.” Journal of Drug Targeting. 2024:1–32.
[20] Brnabic Alan, Hess Lisa M. “Systematic literature review of machine learning methods used in the analysis of real-world data for patient-provider decision making.” BMC Medical Informatics and Decision Making. 2021;21:1–19.
[21] Gutierrez-Naranjo Jose M, Moreira Alvaro, Valero-Moreno Eduardo, Bullock Travis S, Ogden Liliana A, Zelle Boris A. “A machine learning model to predict surgical site infection after surgery of lower extremity fractures.” International Orthopaedics. 2024;48(7):1887–1896.
[22] Li Ben, Eisenberg Naomi, Beaton Derek, Lee Douglas S, Aljabri Badr, Verma Raj, Wijeysundera Duminda N, Rotstein Ori D, de Mestral Charles, Mamdani Muhammad, et al. “Using machine learning (XGBoost) to predict outcomes after infrainguinal bypass for peripheral artery disease.” Annals of Surgery. 2024;279(4):705–713.
[23] Tukey John Wilder, et al. Exploratory data analysis. Vol. 2. Springer; 1977.
[24] Chen Tianqi, Guestrin Carlos. “XGBoost: A scalable tree boosting system.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:785–794.
[25] Rosenblatt Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological Review. 1958;65(6):386.
[26] Cox David R. “The regression analysis of binary sequences.” Journal of the Royal Statistical Society Series B: Statistical Methodology. 1958;20(2):215–232.
[27] Akiba Takuya, Sano Shotaro, Yanase Toshihiko, Ohta Takeru, Koyama Masanori. “Optuna: A next-generation hyperparameter optimization framework.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019:2623–2631.
[28] Bergstra James, Bardenet Rémi, Bengio Yoshua, Kégl Balázs. “Algorithms for hyper-parameter optimization.” Advances in Neural Information Processing Systems. 2011;24.

