Current Reviews in Musculoskeletal Medicine. 2024 Apr 8;17(6):185–206. doi: 10.1007/s12178-024-09893-z

Artificial Intelligence for Clinically Meaningful Outcome Prediction in Orthopedic Research: Current Applications and Limitations

Seong Jun Jang 1, Jake Rosenstadt 2, Eugenia Lee 3, Kyle N Kunze 1
PMCID: PMC11091035  PMID: 38589721

Abstract

Purpose of Review

Patient-reported outcome measures (PROMs) play a critical role in evaluating the success of treatment interventions for musculoskeletal conditions. However, predicting which patients will benefit from treatment interventions is complex and influenced by a multitude of factors. Artificial intelligence (AI) may better anticipate the propensity to achieve clinically meaningful outcomes by leveraging complex predictive analytics that allow for personalized medicine. This article provides a contemporary review of current applications of AI developed to predict clinically significant outcome (CSO) achievement after musculoskeletal treatment interventions.

Recent Findings

The highest volume of literature exists in the subspecialties of total joint arthroplasty, spine, and sports medicine, with only three studies identified in the remaining orthopedic subspecialties combined. Performance is widely variable across models, with most studies only reporting discrimination as a performance metric. Given the complexity inherent in predictive modeling for this task, including data availability, data handling, model architecture, and outcome selection, studies vary widely in their methodology and results. Importantly, the majority of studies have not been externally validated or demonstrate important methodological limitations, precluding their implementation into clinical settings.

Summary

A substantial body of literature has accumulated demonstrating variable internal validity, limited scope, and low potential for clinical deployment. The majority of studies attempt to predict the MCID—the lowest bar of clinical achievement. Though a small proportion of models demonstrate promise and highlight the utility of AI, important methodological limitations need to be addressed moving forward to leverage AI-based applications for clinical deployment.

Keywords: Machine learning, Minimal clinically important difference, Patient-reported outcome measures, Artificial intelligence, Orthopedic surgery

Introduction

Patient-reported outcome measures (PROMs) play a critical role in determining the success of treatment interventions and are a central focus for evaluating the value of care rendered [1]. Indeed, the Centers for Medicare & Medicaid Services (CMS) have recently finalized a landmark national policy to standardize and expand the collection and reporting of PROMs following total joint arthroplasty (TJA), with the overall goals of enhancing clinical care, shared decision-making, and quality measurement for these common elective procedures [1]. These metrics generally include evaluations of patient quality of life, pain, satisfaction, and function, thereby providing important information for clinicians regarding the effectiveness, quality, and value of various surgical and non-surgical treatment interventions [2–4]. As such, the ability to identify which patients will experience clinically important changes in their health state after an intervention, as assessed through PROMs, has been a recent focus in the orthopedic surgery literature and among policy makers, as it may aid in clinical decision-making [5–8].

A notable limitation to the utilization of PROMs in clinical and research settings is their interpretability. The overall meaning of raw numeric scores can be challenging both to communicate to patients and to interpret across different centers [9]. Perhaps most importantly, standard PROMs inherently fail to capture patient perception and experience in terms of whether the achieved total score is meaningful at a personal level. Therefore, a shift towards utilizing clinically significant outcome (CSO) measures has occurred in recent years. These metrics are statistical transformations of standard PROMs, utilizing distribution- or anchor-based methods to integrate a form of patient-perceived change into their interpretation [9]. Emphasis on quantifying clinically meaningful outcome thresholds for various treatments in the form of the minimal clinically important difference (MCID), patient acceptable symptom state (PASS), and substantial clinical benefit (SCB) has presented an excellent opportunity for better understanding treatment response [10].
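These distribution-based transformations are straightforward to compute. The sketch below (a minimal Python illustration with entirely hypothetical scores; the 0.5-SD fraction and the function names are assumptions for demonstration, not taken from any study reviewed here) estimates an MCID as half the standard deviation of baseline PROM scores and flags which patients achieve it:

```python
import numpy as np

def distribution_based_mcid(baseline_scores, fraction=0.5):
    """Distribution-based MCID estimate: a fraction (commonly 0.5)
    of the standard deviation of baseline PROM scores."""
    return fraction * np.std(baseline_scores, ddof=1)

def achieved_mcid(baseline, followup, mcid):
    """Flag patients whose PROM improvement meets or exceeds the MCID."""
    return (np.asarray(followup) - np.asarray(baseline)) >= mcid

# Hypothetical preoperative and 1-year postoperative scores (0-100 scale)
pre = np.array([40.0, 55.0, 48.0, 62.0, 35.0, 50.0])
post = np.array([70.0, 60.0, 75.0, 64.0, 55.0, 52.0])

mcid = distribution_based_mcid(pre)
print(round(mcid, 2))            # estimated MCID threshold
print(achieved_mcid(pre, post, mcid))  # per-patient achievement flags
```

Anchor-based methods instead tie the threshold to an external patient-reported anchor question, which is why they are generally preferred for capturing patient-perceived change.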

The interpretation of CSOs, which represent discrete entities capturing a health change during a treatment-related experience, is inherently influenced by a complex interaction among various medical and social determinants such as demographic and clinical profiles, treatment techniques, geographic location, rehabilitation, patient expectations, and surgeon–patient interactions [11, 12]. The ability to appropriately consider increasingly complex and large amounts of data representing these unique combinations of factors may unlock the ability to more accurately predict health-state changes and distinguish which specific factors most influence the achievement of a CSO. This, in turn, holds the potential to optimize patient management at each stage of musculoskeletal episodes of care. The complexity of this task has positioned artificial intelligence (AI) as a promising means of addressing these challenges. Machine learning (ML), a subset of AI that leverages unique statistical techniques to learn from complex data patterns and optimize the prediction of a specific output, may better handle large and complex data interactions and yield insight into the achievement of CSOs [13, 14]. Numerous recent studies have applied ML techniques to develop prognostic clinical prediction models to this end [15–18].

The current review synthesizes contemporary literature from the orthopedic surgical subspecialties to critically assess the applications, performance, and limitations of ML techniques leveraged for the prediction of CSOs after treatment interventions. In doing so, it also aims to determine how future efforts should be focused to increase the quality and value of clinical prediction models developed for CSO achievement, such that implementation into clinical settings may become feasible.

Total Joint Arthroplasty

Considerable effort has been directed towards applying AI to accomplish a variety of tasks pertaining to TJA including predicting clinical outcomes [19], identifying surgical implants [20], generating synthetic radiographs [21], and automating measurements for clinical analysis [22]. Notably, a large body of literature has specifically aimed to use ML to predict PROMs at various time points after both total hip and knee arthroplasty. The following section provides a concise overview of this literature, highlighting the applications, performance, and limitations of these prediction models.

Model Applications

Seven studies were identified that developed prognostic clinical prediction models for PROMs after TJA (Table 1). Of the seven studies, three (42.9%) investigated combined cohorts of patients undergoing THA and TKA [15, 16, 23], three (42.9%) focused on isolated TKA populations [24–26], and one (14.3%) considered an isolated THA population [27]. The greater proportion of literature leveraging AI to predict CSO achievement after TKA may reflect the traditionally higher dissatisfaction rate observed after TKA compared to THA [28]. Indeed, this discrepancy in patient-perceived health changes could certainly elicit interest in better identifying which patients would ultimately benefit from TKA, in order to guide patient selection and enhance treatment decision-making in this more challenging cohort. Cohort sizes used for algorithm training and testing varied widely across studies, ranging between 587 [24] and 64,634 patients [23].

Table 1.

Machine learning literature for predicting patient-reported outcomes in hip and knee replacement surgery

Article Surgery Cohort (N) Models PROM of interest CSO Time Top results
Fontana et al. (2019) [16] THA /TKA

7239 THA

6480 TKA

LASSO, SVM, RF

SF-36 physical component scores (PCS)

SF-36 mental component scores (MCS)

Hip and knee disability and osteoarthritis outcome scores (HOOS/KOOS JR)

MCID 2 years

PCS: LASSO, AUROC 0.78

MCS: LASSO, AUROC 0.89

HOOS JR: LASSO, AUROC 0.77

KOOS JR: LASSO, AUROC 0.75

Huber et al. (2019) [23] THA /TKA

30,524 THA

34,110 TKA

LR, XGB, MSAENET, RF, NNET, NB, KNN, LB

Visual analog scale (VAS)

Q score (sum of Oxford hip and knee scores)

MCID NR

THA:

- VAS: XGB, AUROC 0.87

- Q: XGB, AUROC 0.78

TKA:

- VAS: XGB/MSAENET, AUROC 0.83

- Q: XGB/MSAENET, AUROC 0.71

Kunze et al. (2020) [27] THA 616 THA SGB, RF, SVM, NNET, ENP Patient-reported health state (PRHS) MCID 2 years

PRHS: RF, AUROC 0.97

Calibration intercept 0.05,

Calibration slope 1.45,

Brier score 0.054

Harris et al. (2021) [24] TKA 587 TKA LR, LASSO, GBM, QDA

KOOS Total

KOOS Jr

KOOS sub-scales

MCID 1 year

KOOS JR: QDA, AUROC 0.60

KOOS Total: QDA AUROC 0.66

- ADL: LR, AUROC 0.76

- Pain: All ML, AUROC 0.72

- Symptom: LASSO, AUROC 0.72

- QoL: GBM, AUROC 0.71

- Recreation: GBM, AUROC 0.52

Katakam et al. (2022) [25] TKA 744 TKA SGB, RF, SVM, NN, ENP KOOS-PS MCID 1 year

KOOS-PS: ENP, AUROC 0.77

Calibration intercept -0.02

Calibration slope 1.15

Brier score 0.14

Zhang et al. (2022) [26] TKA 2840 TKA RF, XGB, SVM, LASSO

SF-36 PCS

SF-36 MCS

WOMAC

MCID 2 years

PCS: XGB, AUROC 0.77

MCS: XGB/SVM, AUROC 0.95

WOMAC: RF/LASSO, AUROC 0.89

Langenberger et al. (2023) [15] THA/TKA

1843 THA

1546 TKA

NNet, GBM, LASSO, RF, LR, Ridge regression, Elastic net

EuroQol five-dimension five-level questionnaire (EQ-5D-5L)

EQ visual analogue scale (EQ-VAS)

HOOS-PS

KOOS-PS

MCID 1 year

THA:

-EQ-5D-5L: GBM, AUROC 0.81

-EQ-VAS: LASSO, AUROC 0.84

-HOOS-PS: Ridge, AUROC 0.71

TKA:

-EQ-5D-5L: RF, AUROC 0.80

-EQ-VAS: Elastic net, AUROC 0.76

-KOOS-PS: Elastic net, AUROC 0.76

LASSO logistic least absolute shrinkage and selection operator; SVM support vector machine; RF random forest; LR logistic regression; XGB extreme gradient boosting; MSAENET multi-step adaptive elastic-net; ENP elastic-net penalized logistic regression; NNET neural network; LB logistic boost; NB naïve Bayes; KNN k-nearest neighbors; SGB stochastic gradient boosting; GBM gradient boosting machine; QDA quadratic discriminant analysis; WOMAC Western Ontario and McMaster Universities Osteoarthritis Index; MLP multi-layer perceptron

All seven studies developed clinical prediction models with the labeled dependent outcome being the MCID [15, 16, 23–27]. Interestingly, no studies developed prediction models for the PASS or SCB for any outcome. Critically, no two studies utilized the same types of PROMs, which may reflect variance in data accessibility, differences in institutional or single-surgeon PROM collection patterns, or a strategic attempt to add a novel contribution to the literature. Follow-up ranged from "initial follow-up" to a minimum of 2 years, while follow-up in one study was not reported. No study reported a minimum follow-up time exceeding 2 years.

Model Architectures

Commonly used model architectures included random forest (RF), support vector machines (SVM), least absolute shrinkage and selection operators (LASSO), and gradient boosting machines (GBM). All but one study [29] compared the performance of multiple models and subsequently determined the best-performing model based on comparative metrics. Similarly, all but one study utilized a non-ML comparison model (i.e., traditional logistic regression) to determine whether the ML model demonstrated any discernible predictive advantage over a non-ML technique utilized within the same dataset (Table 2). Three (42.9%) studies demonstrated that ML techniques were not superior to non-ML techniques. Langenberger et al. demonstrated that the performance of ML models was not superior to logistic regression models for various PROMs [15]. Interestingly, Langenberger et al. [15] and Zhang et al. [26] also demonstrated that simply applying thresholds to preoperative PROMs was superior or equivalent to ML models in predicting which patients would achieve the MCID after surgery, highlighting that preoperative PROMs are often the most important variables providing insight into the propensity for a patient to achieve a CSO.
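The head-to-head comparison pattern these studies describe, one or more ML models against a traditional logistic regression baseline on the same dataset judged by test-set AUROC, can be sketched as follows. The data are synthetic (scikit-learn's make_classification standing in for a tabular PROM dataset), and the model choices and hyperparameters are illustrative assumptions rather than any study's actual pipeline:

```python
# Sketch: compare a random forest against a logistic regression baseline
# on the same train/test split, scored by AUROC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a binary MCID-achievement outcome
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
aurocs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aurocs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUROC = {aurocs[name]:.2f}")
```

If the baseline's AUROC matches the ML model's, the added complexity of ML offers no discernible advantage on that dataset, which was the conclusion reached by several of the studies above.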

Table 2.

Study characteristics and limitations of studies in hip and knee replacement surgery

Article Validation Non-ML comparisons Interpretability/feature analysis Code available Missing data/adjustments Unique study characteristics Other limitations
Fontana et al. (2019) [16] Internal Comparison against dummy models Yes, feature analysis No Imputation

Addition of data from four different time points

High pre-op MCID exclusion

Large feature space

Single institution

Huber et al. (2019) [23] Internal Comparison against National Health Services (NHS) linear regression model Yes, variable importance No Removal of missing data, synthetic minority over-sampling technique

Sensitivity analysis

Testing cohort separated by time

Missing variables (BMI)
Kunze et al. (2020) [27] Internal Null model Yes, variable importance No

 > 30%, exclusion

 < 30%, imputation

Web application Small cohort, limited data
Harris et al. (2021) [24] Internal Logistic regression No No NR VA hospitals from three separate regions Only VA data
Katakam et al. (2022) [25] Internal Null model Yes, partial dependence plots No  < 30%, imputation

Web application

Five sites

Small cohort
Zhang et al. (2022) [26] Internal Preoperative PROMs thresholds Yes, variable importance No Imputation, upsampling Determination against preoperative PROM cutoffs Single institution
Langenberger et al. (2023) [15] Internal Logistic regression, preoperative PROM scores Yes, SHAP analysis No

 > 30%, exclusion

 < 30%, imputation

9 hospitals represented Limited follow-up

SHAP SHapley Additive exPlanations

Model Performance

The performance of the models varied across both THA and TKA studies. For THA, the AUROC of models ranged from 0.71 to 0.97, whereas for TKA, the AUROC ranged from 0.52 to 0.87. Importantly, every study differed in the type of PROMs investigated, the definitions of the MCID and satisfaction, the variables input for modeling, and the models selected for training and validation. These methodological differences likely contribute to the wide range in performance identified among these studies. Despite these differences, the majority of studies did include a form of feature analysis or interpretation of the variables contributing most to each model's predictions for interpretability.
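Beyond discrimination (AUROC), several of the studies in Table 1 also report a calibration intercept, calibration slope, and Brier score. The snippet below is a hedged sketch of how these metrics are commonly computed (the predictions are made up, and the logistic-recalibration approach shown is one standard method, not necessarily the one each study used); a slope near 1 and an intercept near 0 indicate good calibration, while a lower Brier score indicates more accurate probability estimates:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_and_brier(y_true, y_prob):
    """Return (calibration intercept, calibration slope, Brier score)."""
    y_true = np.asarray(y_true)
    y_prob = np.clip(np.asarray(y_prob), 1e-6, 1 - 1e-6)
    logit = np.log(y_prob / (1 - y_prob)).reshape(-1, 1)
    # Logistic recalibration: regress observed outcomes on the logit of
    # the predicted probabilities (large C approximates no penalty).
    lr = LogisticRegression(C=1e6).fit(logit, y_true)
    slope = float(lr.coef_[0, 0])
    intercept = float(lr.intercept_[0])
    brier = float(np.mean((y_prob - y_true) ** 2))  # mean squared error
    return intercept, slope, brier

# Hypothetical predicted MCID probabilities and observed achievement
y_prob = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
print(calibration_and_brier(y_true, y_prob))
```

Reporting calibration alongside AUROC matters because a model can rank patients well (high AUROC) while systematically over- or under-estimating the probability of CSO achievement.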

Model Limitations

Critically, no study was externally validated, and none published its code for external confirmation and research use. However, two studies deployed open-access online tools based on their developed models for public use [25, 27]; it is recommended that interactive applications derived from clinical prediction tools remain educational demonstrations and not be implemented into clinical workflows until external validation is achieved. Other limitations in this body of literature include non-standardized selection of CSO thresholds or use of thresholds from external populations that may not generalize to the study population, variation in methods of data handling and imputation, small sample sizes in several studies, and a lack of consensus as to which PROM most accurately captures CSO achievement.

Sports Medicine

Within sports medicine, ML has primarily been applied to predicting achievement of the MCID and to elucidating the major factors associated with MCID achievement.

Model Applications

Across the 15 identified papers, study cohorts included patients undergoing hip arthroscopy for femoroacetabular impingement syndrome (FAIS) (n = 7) [17, 30–34•, 35], anterior cruciate ligament (ACL) reconstruction (n = 3) [36–38], anatomic and reverse total shoulder arthroplasty (TSA) (n = 2) [39, 40], osteochondral allograft (OCA) transplantation (n = 2) [41, 42], and rotator cuff repair (RCR) (n = 2) [43, 44]. Of the 15 studies, 11 (73.3%) developed clinical prediction models for the MCID, three (20%) for both the MCID and SCB, and one (6.7%) for the MCID, PASS, and SCB (Table 3). All but one study [31] applied feature selection or variable importance during model development in order to determine the most predictive variables for achievement of the CSO of interest. Overall, data were obtained from various sources, including single-surgeon registries, multi-institution combined registries, and administrative databases, resulting in patient cohort sizes between 135 [41] and 5774 [39]. Follow-up for included studies ranged between 3 months and 2 years.

Table 3.

Machine learning literature for predicting patient-reported outcomes in sports medicine surgeries

Article Surgery Cohort (N) Models PROM of interest CSO Time Top results
Kunze et al. (2021) [32] ACLR 442 SGM, RF, NNET, SVM, AGB, ENP International Knee Documentation Committee (IKDC) score MCID 2 years minimum

IKDC: ENP, AUROC 0.82

Calibration intercept 0.10

Calibration slope: 1.15

Brier score 0.07

Ye et al. (2022) [37] ACLR 432 LR, NB, RF, XGBs

Lysholm score

IKDC score

MCID 2 years minimum

Lysholm score nonachievement: XGB, AUROC 0.93

Accuracy 0.91

IKDC Nonachievement: XGB, AUROC 0.94

Accuracy 0.95

Nwachukwu et al. (2020) [17] FAIS 898 LASSO Modified Harris Hip Score (mHHS), Hip Outcome Score–Activities of Daily Living (HOS-ADL), HOS–Sport Specific (HOS-SS) MCID 2 years

HOS-ADL: AUROC 0.89

HOS-SS: AUROC 0.85

mHHS: AUROC 0.84

Kunze et al. (2021) [32] FAIS 1,118 SGM, RF, AGB, NNET, ENP HOS-SS MCID 2 years minimum

HOS-SS: ENP, AUROC 0.77

Calibration intercept 0.07

Calibration slope 1.22

Brier score 0.14

Kunze et al. (2022) [45••] FAIS 859 RF, ENP, NNET, SGB, XGB, SVM International Hip Outcome Tool-12 (IHOT-12) MCID, SCB, PASS 2 years minimum

MCID: ENP, AUROC 0.78

- Calibration intercept 0.02

- Calibration slope 0.85

- Brier score 0.14

PASS: RF, AUROC 0.74

- Calibration intercept 0.03

- Calibration slope 0.79

- Brier score 0.21

SCB: XGB, AUROC 0.71

- Calibration intercept -0.03

- Calibration slope 0.95

- Brier score 0.22

Kunze et al. (2021) [32] FAIS 818 RF, SGB, NNET, ENP, SVM HOS-ADL MCID 2 years

HOS-ADL: SGB, AUROC 0.84

- Calibration intercept 0.20

- Calibration slope 0.83

- Brier score 0.13

Kunze et al. (2021) [32] FAIS 935 RF, SGB, NNET, ENP, SVM Visual analog scale (VAS) score MCID 2 years minimum

VAS: NNET, AUROC 0.94

- Calibration intercept -0.43

- Calibration slope 1.07

- Brier score: 0.050

Ramkumar et al. (2020) [31] FAIS 1735 RF

mHHS

HOS-ADL

HOS-SS

International Hip Outcome Tool (iHOT-33)

MCID 1 year, 2 years

mHHS [1 year AUROC] | [2 years AUROC]

- Alpha: 0.490 | 0.539, Coronal: 0.503 | 0.572, Femoral: 0.532 | 0.495, McKibbin: 0.520 | 0.511, Hip impingement: 0.533 | 0.486

HOS-ADL

- Alpha: 0.487 | 0.538, Coronal: 0.502 | 0.568, Femoral: 0.504 | 0.489, McKibbin: 0.509 | 0.538, Hip impingement: 0.540 | 0.516

HOS-SS

- Alpha: 0.515 | 0.540, Coronal: 0.500 | 0.569, Femoral: 0.510 | 0.544, McKibbin: 0.517 | 0.546, Hip impingement: 0.521 | 0.516

iHOT-33

- Alpha: 0.534 | 0.529, Coronal: 0.512 | 0.570, Femoral: 0.522 | 0.519, McKibbin: 0.524 | 0.554, Hip impingement: 0.521 | 0.505

Pettit et al. (2023) [30] FAIS 1917 RF, LR, NNET, SVM, GBM iHOT-12 MCID 6 months minimum iHOT-12: RF, AUROC 0.75
Ramkumar et al. (2021) [41] OCA 135 1 year, 153 2 years NB, XGBs, RF, LR, ensemble classifier

IKDC

KOS-ADL

SF-36 MCS

SF-36 PCS

MCID, SCB 1 year, 2 years

IKDC

- MCID: 1 year—NB, AUROC 0.72, 2 years—LR, AUROC 0.88

- SCB: 1 year—NB, 0.74, 2 years—LR, 0.92

KOS-ADL

- MCID: 1 year—Isotonic, 0.74, 2 years—RF, 0.86

- SCB: 1 year—RF, 0.84, 2 years—NB, 0.92

SF-36 MCS

- MCID: 1 year—LR, 0.73, 2 years—Sigmoid, 0.94

- Not applicable for SCB

SF-36 PCS

- MCID: 1 year—Ensemble, 0.81, 2 years—LR, 0.90

- Not applicable for SCB

Ramkumar et al. (2021) [42] OCA 153 NB, XGBs, RF, LR, ensemble classifier

IKDC

KOS-ADL

SF-36 MCS

SF-36 PCS

MCID, SCB 2 years

IKDC

- MCID: LR, AUROC 0.84

- SCB: NB, AUROC 0.90

KOS-ADL

- MCID: RF, AUROC 0.88

- SCB: NB, AUROC 0.86

SF-36 MCS

- MCID: RF, AUROC 0.60

- Not applicable for SCB

SF-36 PCS

- MCID: LR, AUROC 0.68

- Not applicable for SCB

Alaiti et al. (2023) [43] RCR 474 RF, LightGBM, decision tree classifier, extra trees classifier, XGB, KNN, CatBoost, LR American Shoulder and Elbow Surgeons (ASES) MCID 2 years ASES: RF, AUROC 0.68
Potty et al. (2023) [44] RCR 631 Linear regression, ridge regression, LASSO, SVM, KNN, RF, XGB ASES MCID, SCB 3 mo, 6 mo, 12 mo Predicted change in outcome scores that fell within MCID and SCB thresholds from literature for ASES score; percentage of patients predicted within the MCID for 3-, 6-, and 12-months post operation was 52%, 54%, and 69%, respectively. The percentage of patients predicted within SCB for 3, 6, and 12 months was 73%, 78%, and 87%, respectively
Kumar et al. (2020) [40] aTSA, rTSA 4782 Linear regression, XGB, and wide and deep

ASES

University of California Los Angeles (UCLA)

CONSTANT pain score

Global shoulder function (GSF) score

VAS pain

MCID 1 year minimum

Mean absolute error (MAE), wide and deep

- ASES: ± 10.1 to 11.3 points

- UCLA: ± 2.5 to 3.4

- Constant: ± 7.3 to 7.9

- GSF: ± 1.0 to 1.4

- VAS ± 1.2 to 1.4

MCID, XGBoost

- ASES: AUROC 0.88

- UCLA: AUROC 0.88

- Constant: AUROC 0.94

- GSF: AUROC 0.91

- VAS: AUROC 0.86

SCB, XGBoost

- ASES: AUROC 0.80

- UCLA: AUROC 0.83

- Constant: AUROC 0.88

- GSF: AUROC 0.85

- VAS: AUROC 0.86

Kumar et al. (2021) [39] aTSA, rTSA 5774 XGB

ASES

CONSTANT pain score

GSF score

VAS pain

MCID 3 months minimum

MAE, XGB

- ASES: ± 11.7 full model, ± 12.0 abbreviated model

- Constant: ± 8.9, ± 9.8

- GSF ± 1.4, ± 1.5

- VAS: ± 1.3, ± 1.4

MCID, XGB AUROC

- ASES: 0.90 full model, 0.88 abbreviated model

- Constant: 0.95, 0.94

- GSF: 0.88, 0.87

- VAS pain score: 0.87, 0.87

SCB, XGB AUROC

- ASES: 0.84 full model, 0.82 abbreviated model

- Constant: 0.90, 0.89

- GSF: 0.84, 0.83

- VAS: 0.82, 0.85

LASSO logistic least absolute shrinkage and selection operator; SVM support vector machine; RF random forest; LR logistic regression; XGB extreme gradient boosting; ENP elastic-net penalized logistic regression; NNET neural network; NB naïve Bayes; KNN k-nearest neighbors; SGB stochastic gradient boosting; GBM gradient boosting machine; AGB adaptive gradient boosting; GAM generalized additive model

Model Architectures and Performance

Of the 15 studies, three (20%) utilized a single ML model [17, 31, 39], while the remaining 12 compared multiple models. The most common ML algorithms used were random forest (RF), XGBoost (XGB), neural networks (NN), SVM, stochastic gradient boosting (SGB), and elastic-net penalized logistic regression (ENPLR) (Table 3). Overall, model performance upon internal validation ranged from fair to excellent across the various treatment interventions and algorithm architectures. For example, when considering achievement of the MCID for the International Knee Documentation Committee (IKDC) score, Kunze et al. [36] determined that an ENPLR model (AUROC: 0.82) demonstrated the best relative performance, whereas Ye et al. [37] reported that an XGB model (AUROC: 0.94) performed best. Given that patient populations and model training often vary considerably, there is currently no method of directly comparing model performance across studies.

Notably, Kunze et al. [34•] have performed the only investigation to date that developed independent prediction models for the MCID, PASS, and SCB, providing insight into the propensity to achieve a wide range of magnitudes of clinical health change. This group reported that the best-performing algorithm for predicting achievement of the MCID for the IHOT-12 score after hip arthroscopy for FAIS was an ENPLR model (AUROC: 0.78), while an RF model performed best for predicting PASS achievement (AUROC: 0.74), and an XGB model performed best for predicting SCB achievement (AUROC: 0.71) [34•].

Overall, model performance varied considerably across ACLR, hip arthroscopy for FAIS, OCA, RCR, and aTSA/rTSA interventions. For ACLR, AUROC values ranged between 0.82 and 0.94. Of the seven FAIS studies, all but one used a combination of demographic and imaging-based features to develop clinical prediction models. Interestingly, Ramkumar et al. [31] investigated the ability of radiographic angles to predict achievement of the MCID for the modified Harris hip score (mHHS), Hip Outcome Score–Activities of Daily Living (HOS-ADL), HOS–Sport Specific (HOS-SS), and International Hip Outcome Tool (iHOT-33). AUROC values for this study ranged between 0.49 and 0.57, suggesting that these features did not predict CSO achievement any better than random chance [31]. The other six studies that developed ML models on FAIS datasets reported AUROC values ranging between 0.71 and 0.94, suggesting fair to excellent performance.

The two studies that developed models to predict CSO achievement after OCA transplantation for chondral and osteochondral defects of the knee were derived from the same single-institution patient population. These two studies reported AUROC values ranging between 0.60 and 0.90 for the best models for both MCID and SCB achievement [41, 42]. For RCR, one study reported an AUROC of 0.68 for its best model [43]. Potty et al. [44] utilized ML techniques to predict the mean change in ASES score at several postoperative time points relative to pre-established MCID and SCB values from the literature. They reported that the percentage of patients predicted within the MCID at 3, 6, and 12 months was 52%, 54%, and 69%, respectively, while the percentage predicted within the SCB at 3, 6, and 12 months was 73%, 78%, and 87%, respectively [44].

For aTSA/rTSA, one study reported AUROC values ranging between 0.80 and 0.94 across five different PROMs [40]. The other aTSA/rTSA study evaluated a full- versus an abbreviated-input model (291 versus 19 variables, respectively) and found no significant difference in AUROC between the two [39].

Model Limitations

Major limitations of sports medicine studies that have developed prediction models for CSO achievement include lack of generalizability (single-surgeon or homogeneous cohorts), exclusion of potentially significant untested variables, selection bias due to low follow-up rates, and missing data (Table 4). Although most studies used a form of imputation to fill in missing data points, the proportion of patients lost to follow-up was often substantial. Furthermore, each study chose different input variables, and even within the same interventions the most predictive variables did not overlap across models, making it difficult to compare studies and draw conclusions. Only one paper [41] published its code for open-source use. Finally, only a single study externally validated its algorithm on an independent patient population, a gap that limits the generalizability of the remaining models. Kunze et al. [45••] performed external validation of their ML clinical prediction model for MCID achievement for the HOS-SS after hip arthroscopy utilizing a geographically unique and independent population of 154 patients. This group demonstrated good performance on external validation, with an AUROC of 0.80 and appropriate calibration metrics. Such efforts represent essential steps toward introducing prognostic models into clinical workflows.

Table 4.

Study characteristics and limitations of studies in sports medicine surgeries

Article Validation Non-ML comparisons Interpretability/feature analysis Code available Missing data/adjustments Unique study characteristics Other limitations
Kunze et al. (2021) [32] Internal Null model Yes, variable importance No Imputation Web application Single institution
Ye et al. (2022) [37] Internal NR Yes, SHAP analysis No Removal of missing data Top models shared same top predictors

Single surgeon

Missing variables (preoperative psychological evaluation, timing of return to sports)

Nwachukwu et al. (2020) [17] Internal NR Yes, feature selection No Removal of missing data Used PatientIQ to analyze data

Single surgeon

Used one ML model

Selection bias from removing patients with incomplete data

Kunze et al. (2021) [32] Internal Null model Yes, feature selection, LIME plot, variable importance No Imputation

Web application

Included data from community hospitals

Only data from 4 hospitals
Kunze et al. (2022) [45••] Internal Null model Yes, variable importance No Imputation

Web application

Evaluated PASS

Selection bias from removing patients with incomplete data
Kunze et al. (2021) [32] External Null model Yes, feature selection, LIME plot No Imputation Web application Single institution
Kunze et al. (2021) [32] Internal Null model Yes, feature selection, LIME plot No Imputation Web application

Large proportion of missing data

Missing variable (hip osteoarthritis)

Ramkumar et al. (2020) [31] Internal Preoperative PROM No No NR Radiographic indices as main variable 45% 2-year follow-up rate
Pettit et al. (2023) [30] Internal NR Yes, feature importance No Imputed Used nationwide registry (UK) Limited patient data for primary outcome (37%)
Ramkumar et al. (2021) [41] Internal NR Yes, SHAP analysis Yes  > 80%, exclusion Measures at multiple timepoints

Small cohort

Single institution

Ramkumar et al. (2021) [42] Internal NR Yes, SHAP analysis No  > 80%, exclusion Includes preoperative imaging as variable

Small cohort

Single institution

Alaiti et al. (2023) [43] Internal NR Yes, SHAP analysis No Removal of missing data at 12 and 24 months Web application Single institution
Potty et al. (2023) [44] Internal NR Yes, SHAP analysis No NR Measures at multiple timepoints

Single institution

Variable surgical techniques

Kumar et al. (2020) [40] Internal Mean absolute error Yes, feature importance No Imputation 30 sites Excluded patients with revisions, humeral fractures, and hemiarthroplasty
Kumar et al. (2021) [39] Internal Full versus abbreviated input variables Yes, feature selection No Imputation 30 sites

Excluded patients with revisions, humeral fractures, and hemiarthroplasty

One ML algorithm

SHAP SHapley Additive exPlanations

Spine

The current body of AI research concerning CSO achievement after spine surgery encompasses a variety of surgical procedures and pathologies, from disc herniations to spinal metastases. As highlighted below, this variety has led to heterogeneity in the literature concerning model applications, architecture, and performance.

Model Applications

A total of eleven studies used ML algorithms to predict CSO achievement in patients undergoing spine surgery (Table 5). Of these, four concerned outcomes after surgical treatment of degenerative cervical myelopathy (DCM) [46–49], two after lumbar disc herniation (LDH) [50, 51], two after lumbar fusion (LF) [52, 53], and the remainder included combined cohorts that underwent multiple lumbar procedures [54–56]. All eleven studies used the MCID as their primary outcome [46–56], while none attempted to build models predicting the PASS or SCB. A wide variety of outcomes were measured, the most common being the numeric rating scale for back pain (NRS-BP) and leg pain (NRS-LP), the visual analog scale (VAS), the Oswestry Disability Index (ODI), and the Core Outcome Measures Index (COMI). Sample sizes varied considerably and were generally small, ranging between 40 [48] and 4307 [55•]. Mean follow-up ranged between 6 months and 2 years (Table 5).

Table 5.

Machine learning literature for predicting patient-reported outcomes in orthopedic spine surgery

Article Surgery Cohort (N) Models PROM of interest CSO Time Top results
Merali et al. (2019) [46] DCM 757 RF

SF-6D

Modified Japanese Orthopaedic Association (mJOA) scale

MCID 6 months, 1 year, 2 years

*SF-6D: RF, AUROC 0.73

mJOA: RF, AUROC 0.67

Siccoli et al. (2019) [56] LSS 635 RF, XGB, Bayesian GLM, BT, KNN, simple GLM, NNET

Numeric rating scale for back pain (NRS-BP)

Numeric rating scale for leg pain (NRS-LP)

Dutch Oswestry Disability Index (ODI)

MCID 6 weeks, 1 year

*NRS-BP: XGB, AUROC 0.79

Brier score 0.18

F1-score 0.87

NRS-LP: RF, AUROC 0.72

Brier score 0.20

F1-score 0.80

ODI: BGLM, AUROC 0.68

Brier score 0.20

F1-score 0.67

Staartjes et al. (2019) [50] LDH 422 NNET, LR

NRS-BP

NRS-LP

Dutch ODI

MCID 1 year

NRS-BP: NNET, AUROC 0.90

NRS-LP: NNET, AUROC 0.87

Dutch ODI: NNET, AUROC 0.84

Karhade et al. (2021) [54] LDH/LSS

532 LDH

374 LSS

SGB, RF, SVM, NNET, ENP

PROMIS physical function

Pain interference

Pain intensity

MCID  < 1 year

PROMIS Physical function: ENP, AUROC 0.79

Calibration intercept 0.00

Calibration slope 1.36

Brier score 0.15

Pain interference: NNET/ENP, AUROC 0.74

Calibration intercept, ENP 0.05

Calibration slope, ENP 1.07

Brier score, ENP 0.16

Pain intensity: NNET, AUROC 0.70

Calibration intercept 0.01

Calibration slope 1.08

Brier score 0.14

Berjano et al. (2021) [52] LF 1243 RF

Good early outcome (ODI)

Excellent early outcome (combined ODI, SF-36, COMI back)

MCID 6 months

Good early outcome: RF, AUROC 0.842

Excellent early outcome: RF, AUROC 0.808

Khan et al. (2021) [47] DCM 173 GBM, MARS (Earth)

SF-36 mental component summary (MCS)

SF-36 physical component summary (PCS)

MCID 1 year

SF-36 MCS: GBM, AUROC 0.77

Calibration intercept 0.38

Calibration slope 0.71

SF-36 PCS: Earth, AUROC 0.78

Calibration intercept 0.11

Calibration slope 1.06

Staartjes et al. (2022) [53] LF 1115 ENETGLM

ODI/Core Outcome Measures Index (COMI) functional impairment

NRS-BP

NRS-LP

MCID 1 year

ODI/COMI: ENETGLM, AUROC 0.67

Calibration intercept − 0.07

Calibration slope 0.63

NRS-BP: ENETGLM, AUROC 0.72

Calibration intercept − 0.38

Calibration slope 1.10

NRS-LP: ENETGLM, AUROC 0.64

Calibration intercept 0.14

Calibration slope 0.49

Zhang et al. (2022) [26] DCM 40 SVM

SF-36 MCS

SF-36 PCS

Neck Disability Index (NDI)

North American Spine Society (NASS)

mJOA

MCID 2 years

SF-36 MCS: SVM, AUROC 0.90

F1-score 0.90

SF-36 PCS: SVM, AUROC 0.86

F1-score 0.86

NDI: SVM, AUROC 0.65

F1-score 0.67

NASS: SVM, AUROC 0.98

F1-score 0.94

mJOA: SVM, AUROC 0.98

F1-score 0.86

Pedersen et al. (2022) [51] LDH 1968 DL, DT, RF, BT, SVM

EuroQol-5D (EQ-5D)

ODI

VAS-BP

VAS-LP

MCID 1 year

EQ-5D: SVM, AUROC 0.89

ODI: BT/SVM, AUROC 0.77

VAS-LP: SVM, AUROC 0.78

VAS-BP: RF, AUROC 0.87

Park et al. (2023) [49] DCM 916 LR, SVM, DT, RF, ET, GNB, KNN, MLP, XGB VAS-NP MCID 3 months, 2 years

*VAS-NP: LR, AUROC 0.77

F1-score 0.78

Halicka et al. (2023) [55•] LDH or LSS

1041 LDH

2413 LSS

853 both

RF

Back pain

Leg pain

COMI

MCID 3–24 months

Back pain: RF, AUROC 0.68

Calibration intercept − 0.06

Calibration slope 1.20

Brier score 0.22

Leg pain: RF, AUROC 0.66

Calibration intercept 0.01

Calibration slope 0.96

Brier score 0.20

COMI: RF, AUROC 0.62

Calibration intercept 0.06

Calibration slope 0.94

Brier score 0.23

RF Random forest; XGB extreme gradient boosting; GLM generalized linear models; BT boosted trees; KNN K-nearest neighbors; NNET neural network; LR logistic regression; DT decision tree; SGB stochastic gradient boosting; SVM support vector machine; ENP elastic-net penalized logistic regression; GBM gradient boosting machines; MARS multivariate adaptive regression splines; ENETGLM elastic net generalized linear models; LASSO logistic least absolute shrinkage and selection operator; Ridge CV ridge cross-validation; DL deep learning; GLMM generalized linear mixed model; CHAID chi-square automatic interaction detection; ET extra trees; GNB Gaussian naive Bayes; MLP multi-layer perceptron. *Only reporting results for the last time point

Model Architectures and Performance

Of the various ML architectures used for model training, the most common were RF, SVM, and variations of gradient boosting. In studies investigating outcomes after LDH, AUROC for MCID prediction ranged between 0.77 and 0.90; in studies investigating outcomes after LF, between 0.64 and 0.84; for MCID after LSS, the AUROC was 0.79 for NRS-BP, 0.72 for NRS-LP, and 0.68 for ODI; for MCID after DCM, AUROC ranged between 0.65 and 0.98; and for multiple lumbar procedures, between 0.62 and 0.79. Only four (36.4%) studies provided model calibration metrics. Other metrics, including the F1-score and Brier score, were variably reported.
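The metrics that recur across these studies (AUROC, Brier score, F1-score) are straightforward to compute. A minimal sketch in pure Python, using synthetic predictions rather than data from any study reviewed here: y is whether each patient achieved the MCID (1) or not (0), and p is the model's predicted probability.

```python
def auroc(y_true, y_prob):
    """AUROC via the rank-based (Mann-Whitney U) formulation: the
    probability a random MCID achiever is ranked above a non-achiever."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, y_prob):
    """Brier score: mean squared error of the predicted probabilities
    (lower is better; a constant 0.5 prediction scores 0.25)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def f1(y_true, y_prob, threshold=0.5):
    """F1-score (harmonic mean of precision and recall) at a fixed threshold."""
    pred = [int(p >= threshold) for p in y_prob]
    tp = sum(y == 1 and q == 1 for y, q in zip(y_true, pred))
    fp = sum(y == 0 and q == 1 for y, q in zip(y_true, pred))
    fn = sum(y == 1 and q == 0 for y, q in zip(y_true, pred))
    return 2 * tp / (2 * tp + fp + fn)

# Synthetic example: MCID achievement and predicted probabilities
y = [1, 0, 1, 1, 0, 0, 1, 0]
p = [0.9, 0.2, 0.7, 0.45, 0.4, 0.3, 0.8, 0.5]
print(round(auroc(y, p), 2), round(brier(y, p), 2), round(f1(y, p), 2))
# prints: 0.94 0.12 0.75
```

Note that AUROC captures only discrimination (ranking); the Brier score also penalizes miscalibrated probabilities, which is why reporting discrimination alone, as most studies did, gives an incomplete picture.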

Two studies externally validated their algorithms, which demonstrated comparable performance in independent populations (Table 6). Staartjes et al. [53] used a large multinational, multicenter dataset (11 centers across 7 countries) to predict MCID achievement for the ODI, COMI, NRS-BP, and NRS-LP at 1 year after LF surgery. The authors reported that discrimination for all outcomes was poor to fair, with AUROC ranging between 0.64 and 0.72. The positive predictive value of the model was relatively high (0.81–0.90), but the negative predictive value was low (0.23–0.39) [53]. Furthermore, model calibration performance was lower than that of the original model. Halicka et al. [55•] also performed external validation on a unique single-institution cohort, with AUROC of 0.68 for NRS-BP, 0.66 for NRS-LP, and 0.62 for COMI. Model calibration metrics were comparable to those of the internally validated model.
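The calibration intercept and slope reported in these studies are conventionally obtained by logistic recalibration: regressing the observed outcomes on the log-odds of the predicted probabilities. A minimal sketch of that procedure, assuming synthetic data and a plain Newton-Raphson fit (an illustration of the general method, not any study author's implementation):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(t):
    # numerically stable logistic function
    return 1 / (1 + math.exp(-t)) if t >= 0 else math.exp(t) / (1 + math.exp(t))

def calibration_intercept_slope(y_true, y_prob, iters=50):
    """Logistic recalibration: fit y ~ a + b * logit(p) by Newton-Raphson.
    An intercept (a) near 0 and a slope (b) near 1 indicate good calibration;
    b < 1 suggests predictions that are too extreme (overfitting)."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = sigmoid(a + b * xi)        # predicted probability
            w = mu * (1 - mu)               # IRLS weight
            g0 += yi - mu                   # gradient wrt intercept
            g1 += (yi - mu) * xi            # gradient wrt slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det    # Newton step
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Synthetic observed outcomes and predicted probabilities
y = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
p = [0.8, 0.3, 0.6, 0.5, 0.4, 0.6, 0.2, 0.7, 0.45, 0.35]
a, b = calibration_intercept_slope(y, p)
```

A slope well below 1 on an external cohort typically signals that the original model's probabilities are too extreme for the new population, which is one way the calibration deterioration noted above can manifest.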

Table 6.

Study characteristics and limitations of studies in spine surgery

Article Validation Non-ML comparisons Interpretability/feature analysis Code available Missing data/adjustments Unique study characteristics Other limitations
Merali et al. (2019) [46] Internal NR Yes, relative importance No

 > 5%, exclusion

 < 5%, imputation

International cohort

Large feature space

Follow-up rate 71.2%

One ML model

Siccoli et al. (2019) [56] Internal NR Yes, variable importance Yes Imputation Extensive number of outcomes predicted Single center, single surgeon
Staartjes et al. (2019) [50] Internal Logistic regression Yes, variable importance No Demographic data < 20% imputation Trained ML on radiological data (MRI) Small cohort, single surgeon
Karhade et al. (2021) [54] Internal Null model Yes, variable importance No  < 30% imputation

Web application

Individual explanations

Small sample size
Berjano et al. (2021) [52] Internal NR Yes, mean decrease Gini index No Imputation Leveraged baseline self-reported variables for prediction

Single center

Short follow-up

One ML model

Khan et al. (2021) [47] Internal Logistic regression Yes, relative importance No Imputation Multicenter dataset Small cohort
Staartjes et al. (2022) [53] External NR Yes, variable importance No Imputation

Multinational, multicenter dataset

Web application

No sample size calculation for external validation
Zhang et al. (2022) [26] Internal NR Yes, feature selection No  < 0.5% imputation Trained ML on radiological data (MRI)

Small cohort

One ML model

Pedersen et al. (2022) [51] Internal Comparison against MARS and logistic regression No No NR Relatively large sample size Single institution
Park et al. (2023) [49] Internal NR Yes, SHAP analysis No Imputation Large number of algorithms compared Small sample size
Halicka et al. (2023) [55•] External Logistic regression, linear regression Yes, relative importance Yes Imputation (multivariate imputation by chained equations (MICE)) Individual explanations

Single center

One ML model

SHAP Shapley additive explanations

Model Limitations

The most significant limitation among the studies was the lack of external validation, as observed with the majority of existing literature in other orthopedic subspecialties. As discussed above, only 2 of 11 studies (18.2%) externally validated their models [53, 55•]. Thus, the conclusions of the majority of the studies regarding model performance cannot currently be generalized. A second observed limitation was small sample sizes, with some models developed on study populations as small as 40 patients. Developing models without appropriately calculating the minimum sample size required for clinical prediction modeling may bias estimates of model performance. Other limitations consistent with the body of literature concerning CSO achievement include incomplete analysis or disclosure of important modeling metrics such as calibration and the Brier score, failure to disclose full source code, and lack of transparency and consistency in data handling.

Trauma; Oncology; Pediatric; Hand, Foot, and Ankle; and Other Disciplines

Outside of the sports medicine, spine, and TJA subspecialties, literature concerning the development of prognostic clinical prediction models for CSO achievement is scarce. Indeed, no articles were identified in pediatric orthopedic surgery, foot and ankle surgery, or musculoskeletal oncology. One article was identified within orthopedic trauma [57], and two articles were identified concerning hand surgery [58, 59]. The study interventions included upper extremity fracture management, treatment of thumb carpometacarpal osteoarthritis (CMC OA), and carpal tunnel decompression (CTD) surgery. All three studies evaluated the MCID for various PROMs, including the QuickDASH, the patient-reported outcomes measurement information system (PROMIS), and the Michigan hand outcomes questionnaire (MHQ). Patient sample sizes ranged from 734 [57] to 1489 [58], with data drawn from single institutions and national registries (Table 7).

Table 7.

Machine learning literature for predicting patient-reported outcomes in trauma, hand, and wrist surgeries

Article Surgery Cohort (N) Models PROM of interest CSO Time Top results
Brinkman et al. (2023) [57] Upper extremity fracture management 734 linear regression, NNET QuickDASH, PROMIS-UE-PF, PROMIS-PI MCID 9 months

Quick-DASH

- M-20: ANN, 64%

- M-23: ANN, 65%

- M-29: ANN, 80%

- M-34: ANN, 81%

- M-54: ANN, 85%

PROMIS UE:

- M-20: LR, 59%

- M-23: ANN, 66%

- M-29: ANN, 67%

- M-34: ANN, 74%

- M-54: ANN, 71%

Loos et al. (2022) [58] Thumb CMC OA (carpometacarpal osteoarthritis) 1489 GBM, RF MHQ pain, MHQ function MCID 12 months

MHQ:

- Pain: RF, AUROC 0.59

- Function: GBM, AUROC 0.74

Harrison et al. (2022) [59] Carpal tunnel decompression (CTD) 1093 ENP, KNN, SVM, XGB, NNET QuickDASH function, QuickDASH symptoms MCID  > 6 months

QuickDASH:

- Function: XGB, AUROC 0.79

- Symptoms: EN, AUROC 0.75

SVM Support vector machine; RF random forest; XGB extreme gradient boosting; ENP elastic-net penalized logistic regression; NNET neural network; KNN K-nearest neighbors; GBM gradient boosting machines

Brinkman et al. [57] developed a prediction model for MCID achievement at nine months after upper extremity fracture treatment, though the type of treatment was not disclosed. They compared the accuracy of 20-, 23-, 29-, 34-, and 54-variable models using artificial neural networks and linear regression. Neural networks outperformed linear regression, especially for models with a greater number of variables. Specifically, for QuickDASH scores, neural networks achieved 64%, 65%, 80%, 81%, and 85% accuracy for the 20-, 23-, 29-, 34-, and 54-variable models, respectively, while linear regression achieved 61%, 58%, 72%, 75%, and 77%. When examining PROMIS-UE scores, there was a trend toward increased accuracy with an increasing number of variables: accuracies for the neural networks ranged between 54 and 74%, while accuracies for linear regression ranged from 57 to 70% [57].

Loos et al. [58] developed GBM and RF algorithms for the prediction of MCID on the Michigan hand questionnaire (MHQ) for function and pain at 12 months after trapeziectomy with ligament reconstruction and tendon interposition for thumb CMC OA. Patient data was derived from 25 locations in the Netherlands. They reported that the AUROC for the MCID for function was 0.74, while the AUROC for the MCID for pain was 0.59, suggesting limited performance and clinical utility. Harrison et al. [59] developed five ML algorithms to predict improvement in QuickDASH score after CTD, with AUROC ranging between 0.74 and 0.79. Of note, the algorithm with the highest accuracy across both outcomes in this study was XGBoost. Interestingly, Harrison et al. also developed heuristic models using chi-squared automatic interaction detection (CHAID) to provide novel perspectives on treatment decision-making. Their CHAID trees suggested that the most appealing candidates for surgery were those with high overall baseline health and more severe hand symptoms. These intuitive diagrams can help to narrow clinical benefit down to just a few preoperative questions and may be valuable for incorporating ML predictions into clinical decision-making.
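The CHAID approach described above grows its trees by repeatedly selecting, via a chi-square test of association, the predictor most strongly related to the outcome. A minimal sketch of that split-selection step on binary variables, using hypothetical screening-question names and synthetic data (not taken from Harrison et al., and omitting CHAID's Bonferroni-adjusted p-values and multiway splits):

```python
def chi_square(table):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]]
    (rows = predictor level 0/1, columns = outcome 0/1)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    stat = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        expected = row * col / n
        stat += (obs - expected) ** 2 / expected
    return stat

def best_chaid_split(rows, predictors, outcome):
    """Return the binary predictor most associated with the outcome,
    plus the chi-square score of every candidate."""
    scores = {}
    for pred in predictors:
        table = [[0, 0], [0, 0]]
        for r in rows:
            table[r[pred]][r[outcome]] += 1
        scores[pred] = chi_square(table)
    return max(scores, key=scores.get), scores

# Hypothetical preoperative screening variables (illustrative only)
rows = [
    {"severe_symptoms": 1, "good_baseline_health": 1, "mcid": 1},
    {"severe_symptoms": 1, "good_baseline_health": 1, "mcid": 1},
    {"severe_symptoms": 1, "good_baseline_health": 0, "mcid": 1},
    {"severe_symptoms": 0, "good_baseline_health": 1, "mcid": 0},
    {"severe_symptoms": 0, "good_baseline_health": 0, "mcid": 0},
    {"severe_symptoms": 0, "good_baseline_health": 1, "mcid": 1},
]
best, scores = best_chaid_split(
    rows, ["severe_symptoms", "good_baseline_health"], "mcid")
print(best)  # prints: severe_symptoms
```

Because each split is a single chi-square test on one question, the resulting tree reads as a short sequence of yes/no screening questions, which is what makes this family of models attractive for bedside decision-making.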

The largest limitation of all three of these studies was the lack of external validation (Table 8). This is unsurprising given how few studies exist in trauma and in hand and wrist interventions and how rarely studies of other interventions have been externally validated, but it remains a noteworthy shortcoming. Beyond external validity, other limitations include arbitrary variable selection [57] and low follow-up rates: Loos et al. [58] reported a 55.4% follow-up rate and Harrison et al. [59] a 58.7% follow-up rate, which may have introduced selection bias into the data.

Table 8.

Study characteristics and limitations of studies in trauma, hand, and wrist surgeries

Article Validation Non-ML comparisons Interpretability/feature analysis Code available Missing data/adjustments Unique study characteristics Other limitations
Brinkman et al. (2023) [57] Internal Linear regression Yes, variable importance No Not mentioned Measure at multiple timepoints

Single center

Arbitrary variable selection

Trained model using variables 2–4 weeks post-injury

Loos et al. (2022) [58] Internal Logistic regression Yes, feature selection Yes

Imputation

Performed non-responder analysis

Web application for hand function

Excluded patients who underwent revision surgery

Follow-up rate 55.4%

Only 1 variable used in RF model

Harrison et al. (2022) [59] Internal None reported Yes, SHAP analysis Yes Performed missing data analysis

Web application

CHAID decision tree analysis

Single center

Follow-up rate 58.7%

Discussion

The main findings of the current review are as follows: (1) within the TJA literature, applications of ML have been directed at optimizing predictive performance for achieving CSO across a wide range of metrics, with most studies reporting reasonable internal validity despite a wide range in performance; (2) within the sports medicine literature, applications of ML for CSO prediction have targeted ACLR, hip arthroscopy, cartilage restoration, RCR, and TSA, with narrow use cases and variable performance; (3) within the spine literature, applications of ML for CSO prediction have involved the degree of change in pain and function, predominantly after treatment interventions for lumbar spine pathology and cervical myelopathy, with a wide range of performance and reporting conduct; (4) a paucity of effort has been directed towards applying AI to anticipate CSO achievement in the orthopedic surgical subspecialties of foot and ankle, hand, trauma, pediatrics, and musculoskeletal oncology; (5) the transparency of reporting and methodological conduct of studies that have developed prognostic clinical prediction models for CSO prediction is generally poor and demonstrates considerable inter-study variability, with information concerning data handling, code availability, and external validation lacking.

The ML literature for prediction of CSO after hip and knee arthroplasty considered an extensive variety of outcome assessment tools, likely highlighting a trend of interest in applying ML that exceeds the clinical translation and utility of the models developed. This is supported by the overwhelming proportion of these models that were not externally validated, precluding their implementation into clinical pathways. When considering internal validation and performance, several models demonstrated good to excellent discrimination. However, only two studies included other important performance metrics, such as model calibration, to further support the validity of their models. A scoping assessment of the methodological conduct of ML studies within this literature is exceedingly important as regulatory bodies begin to emphasize clinically significant outcome metrics to define value and quality while payment models shift towards value-based care. Advanced predictive modeling, including ML, can provide insight into personalized trajectories of outcome improvement, which should be considered when defining value-based care targets and treatment quality. Additionally, understanding the current state of the ML literature for predicting CSO is important as AI-assisted diagnosis and decision-making become adopted as billable interventions in reimbursement models. Based on the current review, published ML models in the hip and knee arthroplasty literature remain imperfect and unfocused, with variable reporting. To move forward, consensus on the important metrics to be used is needed, as well as increased methodological transparency and external validation.

Within the sports medicine literature, a narrow scope of applications was investigated; similar to the TJA literature, studies demonstrated variable internal validity, with only a single study achieving external validation of a previously developed model. Importantly, duplicative efforts in developing ML clinical prediction tools are not clinically useful unless a substantial increase in performance is observed, and this narrow scope of direction hinders advancement in the field. Indeed, Oeding et al. [60•] recently identified a similar trend among 55 reviewed studies concerning computer vision tasks within orthopedic sports medicine. They found that 65% of studies developed computer vision applications aimed at detecting similar pathologies, and although diagnostic performance demonstrated excellent internal validity, only 7% of studies were externally validated. These findings suggest that, similar to ML studies within this subspecialty, current models have a narrow conceptual framework and limited clinical applicability, and they require improved methodological conduct (such as external validation and prospective assessment) to confer important clinical utility. This review highlights key knowledge gaps in the current literature that should be used to drive future research and create holistic AI strategies. Prior to initiating the development of additional clinical prognostic models, collaborators should determine several important criteria: (1) is the metric being predicted clinically relevant, and is there a current or future need for a prediction model by providers or payers? (2) where will patient data be obtained, what limitations and assumptions exist in the data, and what methods will be implemented to address these shortcomings? (3) what ethical and statistical biases may be present in the data, and how will these be accounted for? (4) is it feasible to update and reinforce model training? (5) is it possible to externally validate the model? (6) what are the plans and infrastructure for model deployment, and what resources are available to facilitate and maintain deployment? Failure to consider, plan for, and address these considerations will result in the continued proliferation of ML models without potential for clinical utility.

As with most of the current literature pertaining to prediction models for CSO achievement in orthopedic surgery, the spine literature also demonstrated variable transparency, reporting conduct, and performance evaluation. The scope of applications for ML was wide, though a high degree of overlap was observed in PROM selection. Furthermore, all studies investigated the ability to optimize prediction of the MCID, whereas no studies predicted the PASS or SCB. This is a recurrent theme identified throughout this review and a concern raised by those performing clinical outcomes research. Indeed, Rossi et al. [61] caution that reliance on the MCID for clinical significance may not be sufficient, as it neither addresses whether patients are satisfied nor whether the perceived benefit gained is substantial. Future studies should shift their focus towards clinical significance metrics that aim to capture higher thresholds of improvement. Additionally, in the spine literature, internal validity for various use cases ranged from poor to excellent in terms of AUROC, with variable reporting of additional performance metrics and only two studies pursuing external validation. Future research is needed to accomplish external validation of these models, while increased awareness of reporting guidelines and consensus on model development, determination of MCID thresholds, and PROM use are imperative to confer meaningful clinical applications.

Interestingly, musculoskeletal subspecialties outside of sports medicine, spine, and TJA have not experienced the same recent period of heightened expectations, hype, and academic productivity. It is unclear why these subspecialties have not devoted similar effort to investigating the utility of AI for predicting CSO achievement. It is plausible that they lack the same access to readily available registries that can facilitate AI research, or that their patient populations are more difficult to evaluate using PROMs [62]. For example, subspecialties such as orthopedic trauma surgery, oncology, and pediatrics are challenged with treating diverse patient populations, conditions, and disease presentations. In pediatric orthopedics, PROM collection may be limited by patients' inability to complete PROMs or difficulty conveying important perceptions of their health state given their young age [62]. In orthopedic oncology, the true effect of orthopedic treatment interventions may be confounded by prior or ongoing medical treatments, palliative circumstances, or unique oncologic pathologies, among many other factors. This heterogeneity presents a considerable barrier to creating a homogenous registry with data large enough to facilitate AI studies.

Among all disciplines and studies reviewed in this synthesis, substantial methodological limitations were observed, which may stem from a mismatch between the excitement surrounding a new area of AI research and a lack of domain expertise in model development and data analysis. Indeed, the accessibility of coding and software has enabled those without formal data science backgrounds to conduct AI analyses, which may predispose some authors to overlook key model components or fail to adhere to model reporting guidelines. Authors should be encouraged to collaborate with members with complementary skill sets when performing AI research so that appropriate scientific conduct is not omitted and clinically relevant use cases are pursued. Unfortunately, current prognostic clinical prediction models developed for CSO are narrow in scope and lack the foundation to be deployed for meaningful clinical use.

Conclusion

A proliferation of prognostic clinical prediction models for achievement of CSO after musculoskeletal treatment interventions has been observed over a short period of time, with several models demonstrating high internal performance. However, there remains a mismatch between the considerable production of new literature and hype surrounding the topic on the one hand and translational capability on the other: published prediction models often demonstrate applications of limited clinical meaning, poor methodological conduct, and no viable path to deployment in clinical settings. Furthermore, the scope of applications is often narrow, with a considerable proportion of studies attempting to predict the MCID, the lowest bar of clinical significance. Future studies are required to expand the scope and clinical importance of applications, while there is a critical need for expert consensus and formal AI guidelines to avoid perpetuating models with poor methodological conduct and limited potential for clinical impact.

Author contributions

KNK: supervision, methodology, writing of initial manuscript, revision of initial manuscript. SJ: writing of initial manuscript, revision of initial manuscript. JR: writing of initial manuscript, revision of initial manuscript. EL: writing of initial manuscript, revision of initial manuscript.

Declarations

Human and Animal Rights and Informed Consent

This article does not contain any studies with human or animal subjects performed by any of the authors.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Papers of particular interest, published recently, have been highlighted as: • Of importance •• Of major importance

1. Pasqualini I, Piuzzi NS. New CMS policy on the mandatory collection of patient-reported outcome measures for total hip and knee arthroplasty by 2027: what orthopaedic surgeons should know. J Bone Joint Surg Am. 2024.
2. Makhni EC. Meaningful clinical applications of patient-reported outcome measures in orthopaedics. J Bone Joint Surg Am. 2021;103(1):84–91. doi: 10.2106/JBJS.20.00624.
3. Porter ME. What is value in health care? N Engl J Med. 2010;363(26):2477–2481. doi: 10.1056/NEJMp1011024.
4. Makhni EC, Baumhauer JF, Ayers D, Bozic KJ. Patient-reported outcome measures: how and why they are collected. Instr Course Lect. 2019;68:675–680.
5. Chung AS, Copay AG, Olmscheid N, Campbell D, Walker JB, Chutkan N. Minimum clinically important difference: current trends in the spine literature. Spine (Phila Pa 1976). 2017;42(14):1096–105.
6. Copay AG, Chung AS, Eyberg B, Olmscheid N, Chutkan N, Spangehl MJ. Minimum clinically important difference: current trends in the orthopaedic literature, Part I: upper extremity: a systematic review. JBJS Rev. 2018;6(9):e1. doi: 10.2106/JBJS.RVW.17.00159.
7. Copay AG, Eyberg B, Chung AS, Zurcher KS, Chutkan N, Spangehl MJ. Minimum clinically important difference: current trends in the orthopaedic literature, Part II: lower extremity: a systematic review. JBJS Rev. 2018;6(9):e2. doi: 10.2106/JBJS.RVW.17.00160.
8. Baumhauer JF, Bozic KJ. Value-based healthcare: patient-reported outcomes in clinical decision making. Clin Orthop Relat Res. 2016;474(6):1375–1378. doi: 10.1007/s11999-016-4813-4.
9. Kunze KN, Bart JA, Ahmad M, Nho SJ, Chahla J. Large heterogeneity among minimal clinically important differences for hip arthroscopy outcomes: a systematic review of reporting trends and quantification methods. Arthroscopy. 2021;37(3):1028–37.e6. doi: 10.1016/j.arthro.2020.10.050.
10. Menendez ME, Sudah SY, Cohn MR, Narbona P, Ladermann A, Barth J, et al. Defining minimal clinically important difference and patient acceptable symptom state after the Latarjet procedure. Am J Sports Med. 2022;50(10):2761–2766. doi: 10.1177/03635465221107939.
11. Bernstein DN, Karhade AV, Bono CM, Schwab JH, Harris MB, Tobert DG. Sociodemographic factors are associated with patient-reported outcome measure completion in orthopaedic surgery: an analysis of completion rates and determinants among new patients. JB JS Open Access. 2022;7(3):e22.00026.
12. Jolback P, Rolfson O, Mohaddes M, Nemes S, Karrholm J, Garellick G, et al. Does surgeon experience affect patient-reported outcomes 1 year after primary total hip arthroplasty? Acta Orthop. 2018;89(3):265–271. doi: 10.1080/17453674.2018.1444300.
13. Langlotz CP, Allen B, Erickson BJ, Kalpathy-Cramer J, Bigelow K, Cook TS, et al. A roadmap for foundational research on artificial intelligence in medical imaging: from the 2018 NIH/RSNA/ACR/The Academy Workshop. Radiology. 2019;291(3):781–791. doi: 10.1148/radiol.2019190613.
14. Padash S, Mickley JP, Vera Garcia DV, Nugen F, Khosravi B, Erickson BJ, et al. An overview of machine learning in orthopedic surgery: an educational paper. J Arthroplasty. 2023;38(10):1938–1942. doi: 10.1016/j.arth.2023.08.043.
15. Langenberger B, Schrednitzki D, Halder AM, Busse R, Pross CM. Predicting whether patients will achieve minimal clinically important differences following hip or knee arthroplasty. Bone Joint Res. 2023;12(9):512–521. doi: 10.1302/2046-3758.129.BJR-2023-0070.R2.
16. Fontana MA, Lyman S, Sarker GK, Padgett DE, MacLean CH. Can machine learning algorithms predict which patients will achieve minimally clinically important differences from total joint arthroplasty? Clin Orthop Relat Res. 2019;477(6):1267–1279. doi: 10.1097/CORR.0000000000000687.
17. Nwachukwu BU, Beck EC, Lee EK, Cancienne JM, Waterman BR, Paul K, et al. Application of machine learning for predicting clinically meaningful outcome after arthroscopic femoroacetabular impingement surgery. Am J Sports Med. 2020;48(2):415–423. doi: 10.1177/0363546519892905.
18. Kunze KN, Krivicich LM, Clapp IM, Bodendorfer BM, Nwachukwu BU, Chahla J, et al. Machine learning algorithms predict achievement of clinically significant outcomes after orthopaedic surgery: a systematic review. Arthroscopy. 2022;38(6):2090–2105. doi: 10.1016/j.arthro.2021.12.030.
19. El-Othmani MM, Zalikha AK, Shah RP. Comparative analysis of the ability of machine learning models in predicting in-hospital postoperative outcomes after total hip arthroplasty. J Am Acad Orthop Surg. 2022;30(20):e1337–e1347. doi: 10.5435/JAAOS-D-21-00987.
20. Rouzrokh P, Mickley JP, Khosravi B, Faghani S, Moassefi M, Schulz WR, et al. THA-AID: deep learning tool for total hip arthroplasty automatic implant detection with uncertainty and outlier quantification. J Arthroplasty. 2023;9(4):966–73.
21. Khosravi B, Rouzrokh P, Mickley JP, Faghani S, Larson AN, Garner HW, et al. Creating high fidelity synthetic pelvis radiographs using generative adversarial networks: unlocking the potential of deep learning models without patient privacy concerns. J Arthroplasty. 2023;38(10):2037–43.e1. doi: 10.1016/j.arth.2022.12.013.
22. Jang SJ, Fontana MA, Kunze KN, Anderson CG, Sculco TP, Mayman DJ, et al. An interpretable machine learning model for predicting 10-year total hip arthroplasty risk. J Arthroplasty. 2023;38(7S):S44–50.
23. Huber M, Kurz C, Leidl R. Predicting patient-reported outcomes following hip and knee replacement surgery using supervised machine learning. BMC Med Inform Decis Mak. 2019;19(1):3. doi: 10.1186/s12911-018-0731-6.
24. Harris AHS, Kuo AC, Bowe TR, Manfredi L, Lalani NF, Giori NJ. Can machine learning methods produce accurate and easy-to-use preoperative prediction models of one-year improvements in pain and functioning after knee arthroplasty? J Arthroplasty. 2021;36(1):112–7.e6. doi: 10.1016/j.arth.2020.07.026.
25. Katakam A, Karhade AV, Collins A, Shin D, Bragdon C, Chen AF, et al. Development of machine learning algorithms to predict achievement of minimal clinically important difference for the KOOS-PS following total knee arthroplasty. J Orthop Res. 2022;40(4):808–815. doi: 10.1002/jor.25125.
26. Zhang S, Lau BPH, Ng YH, Wang X, Chua W. Machine learning algorithms do not outperform preoperative thresholds in predicting clinically meaningful improvements after total knee arthroplasty. Knee Surg Sports Traumatol Arthrosc. 2022;30(8):2624–2630. doi: 10.1007/s00167-021-06642-4.
27. Kunze KN, Karhade AV, Sadauskas AJ, Schwab JH, Levine BR. Development of machine learning algorithms to predict clinically meaningful improvement for the patient-reported health state after total hip arthroplasty. J Arthroplasty. 2020;35(8):2119–2123. doi: 10.1016/j.arth.2020.03.019.
28. Bourne RB, Chesworth BM, Davis AM, Mahomed NN, Charron KD. Patient satisfaction after total knee arthroplasty: who is satisfied and who is not? Clin Orthop Relat Res. 2010;468(1):57–63. doi: 10.1007/s11999-009-1119-9.
29. Farooq H, Deckard ER, Ziemba-Davis M, Madsen A, Meneghini RM. Predictors of patient satisfaction following primary total knee arthroplasty: results from a traditional statistical model and a machine learning algorithm. J Arthroplasty. 2020;35(11):3123–3130. doi: 10.1016/j.arth.2020.05.077.
  • 29.Farooq H, Deckard ER, Ziemba-Davis M, Madsen A, Meneghini RM. Predictors of patient satisfaction following primary total knee arthroplasty: results from a traditional statistical model and a machine learning algorithm. J Arthroplasty. 2020;35(11):3123–3130. doi: 10.1016/j.arth.2020.05.077. [DOI] [PubMed] [Google Scholar]
  • 30.Pettit MH, Hickman SHM, Malviya A, Khanduja V. Development of machine-learning algorithms to predict attainment of minimal clinically important difference after hip arthroscopy for femoroacetabular impingement yield fair performance and limited clinical utility. Arthroscopy. 2023;40(4):1153–63. [DOI] [PubMed]
  • 31.Ramkumar PN, Karnuta JM, Haeberle HS, Sullivan SW, Nawabi DH, Ranawat AS, et al. Radiographic indices are not predictive of clinical outcomes among 1735 patients indicated for hip arthroscopic surgery: a machine learning analysis. Am J Sports Med. 2020;48(12):2910–2918. doi: 10.1177/0363546520950743. [DOI] [PubMed] [Google Scholar]
  • 32.Kunze KN, Polce EM, Rasio J, Nho SJ. Machine learning algorithms predict clinically significant improvements in satisfaction after hip arthroscopy. Arthroscopy. 2021;37(4):1143–1151. doi: 10.1016/j.arthro.2020.11.027. [DOI] [PubMed] [Google Scholar]
  • 33.Kunze KN, Polce EM, Nwachukwu BU, Chahla J, Nho SJ. Development and internal validation of supervised machine learning algorithms for predicting clinically significant functional improvement in a mixed population of primary hip arthroscopy. Arthroscopy. 2021;37(5):1488–1497. doi: 10.1016/j.arthro.2021.01.005. [DOI] [PubMed] [Google Scholar]
  • 34.Kunze KN, Polce EM, Clapp IM, Alter T, Nho SJ. Association between preoperative patient factors and clinically meaningful outcomes after hip arthroscopy for femoroacetabular impingement syndrome: a machine learning analysis. Am J Sports Med. 2022;50(3):746–756. doi: 10.1177/03635465211067546. [DOI] [PubMed] [Google Scholar]
  • 35.Kunze KN, Polce EM, Clapp I, Nwachukwu BU, Chahla J, Nho SJ. Machine learning algorithms predict functional improvement after hip arthroscopy for femoroacetabular impingement syndrome in athletes. J Bone Joint Surg Am. 2021;103(12):1055–1062. doi: 10.2106/JBJS.20.01640. [DOI] [PubMed] [Google Scholar]
  • 36.Kunze KN, Polce EM, Ranawat AS, Randsborg PH, Williams RJ, 3rd, Allen AA, et al. Application of machine learning algorithms to predict clinically meaningful improvement after arthroscopic anterior cruciate ligament reconstruction. Orthop J Sports Med. 2021;9(10):23259671211046575. doi: 10.1177/23259671211046575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Ye Z, Zhang T, Wu C, Qiao Y, Su W, Chen J, et al. Predicting the objective and subjective clinical outcomes of anterior cruciate ligament reconstruction: a machine learning analysis of 432 patients. Am J Sports Med. 2022;50(14):3786–3795. doi: 10.1177/03635465221129870. [DOI] [PubMed] [Google Scholar]
  • 38.Martin RK, Wastvedt S, Pareek A, Persson A, Visnes H, Fenstad AM, et al. Predicting subjective failure of ACL reconstruction: a machine learning analysis of the Norwegian Knee Ligament Register and patient reported outcomes. J ISAKOS. 2022;7(3):1–9. doi: 10.1016/j.jisako.2021.12.005. [DOI] [PubMed] [Google Scholar]
  • 39.Kumar V, Roche C, Overman S, Simovitch R, Flurin PH, Wright T, et al. Using machine learning to predict clinical outcomes after shoulder arthroplasty with a minimal feature set. J Shoulder Elbow Surg. 2021;30(5):e225–e236. doi: 10.1016/j.jse.2020.07.042. [DOI] [PubMed] [Google Scholar]
  • 40.Kumar V, Roche C, Overman S, Simovitch R, Flurin PH, Wright T, et al. What is the accuracy of three different machine learning techniques to predict clinical outcomes after shoulder arthroplasty? Clin Orthop Relat Res. 2020;478(10):2351–2363. doi: 10.1097/CORR.0000000000001263. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Ramkumar PN, Karnuta JM, Haeberle HS, Owusu-Akyaw KA, Warner TS, Rodeo SA, et al. Association between preoperative mental health and clinically meaningful outcomes after osteochondral allograft for cartilage defects of the knee: a machine learning analysis. Am J Sports Med. 2021;49(4):948–957. doi: 10.1177/0363546520988021. [DOI] [PubMed] [Google Scholar]
  • 42.Ramkumar PN, Karnuta JM, Haeberle HS, Rodeo SA, Nwachukwu BU, Williams RJ. Effect of preoperative imaging and patient factors on clinically meaningful outcomes and quality of life after osteochondral allograft transplantation: a machine learning analysis of cartilage defects of the knee. Am J Sports Med. 2021;49(8):2177–2186. doi: 10.1177/03635465211015179. [DOI] [PubMed] [Google Scholar]
  • 43.Alaiti RK, Vallio CS, Assunção JH, Andrade e Silva FBd, Gracitelli MEC, Neto AAF, et al. Using machine learning to predict nonachievement of clinically significant outcomes after rotator cuff repair. Orthop J Sports Med. 2023;11(10):23259671231206180. [DOI] [PMC free article] [PubMed]
  • 44.Potty AG, Potty ASR, Maffulli N, Blumenschein LA, Ganta D, Mistovich RJ, et al. Approaching artificial intelligence in orthopaedics: predictive analytics and machine learning to prognosticate arthroscopic rotator cuff surgical outcomes. J Clin Med. 2023;12(6). [DOI] [PMC free article] [PubMed]
  • 45.Kunze KN, Kaidi A, Madjarova S, Polce EM, Ranawat AS, Nawabi DH, et al. External validation of a machine learning algorithm for predicting clinically meaningful functional improvement after arthroscopic hip preservation surgery. Am J Sports Med. 2022;50(13):3593–3599. doi: 10.1177/03635465221124275. [DOI] [PubMed] [Google Scholar]
  • 46.Merali ZG, Witiw CD, Badhiwala JH, Wilson JR, Fehlings MG. Using a machine learning approach to predict outcome after surgery for degenerative cervical myelopathy. PLoS ONE. 2019;14(4):e0215133. doi: 10.1371/journal.pone.0215133. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Khan O, Badhiwala JH, Witiw CD, Wilson JR, Fehlings MG. Machine learning algorithms for prediction of health-related quality-of-life after surgery for mild degenerative cervical myelopathy. Spine J. 2021;21(10):1659–1669. doi: 10.1016/j.spinee.2020.02.003. [DOI] [PubMed] [Google Scholar]
  • 48.Zhang JK, Jayasekera D, Javeed S, Greenberg JK, Blum J, Dibble CF, et al. Diffusion basis spectrum imaging predicts long-term clinical outcomes following surgery in cervical spondylotic myelopathy. Spine J. 2023;23(4):504–512. doi: 10.1016/j.spinee.2022.12.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Park C, Mummaneni PV, Gottfried ON, Shaffrey CI, Tang AJ, Bisson EF, et al. Which supervised machine learning algorithm can best predict achievement of minimum clinically important difference in neck pain after surgery in patients with cervical myelopathy? A QOD study. Neurosurg Focus. 2023;54(6):E5. doi: 10.3171/2023.3.FOCUS2372. [DOI] [PubMed] [Google Scholar]
  • 50.Staartjes VE, de Wispelaere MP, Vandertop WP, Schroder ML. Deep learning-based preoperative predictive analytics for patient-reported outcomes following lumbar discectomy: feasibility of center-specific modeling. Spine J. 2019;19(5):853–861. doi: 10.1016/j.spinee.2018.11.009. [DOI] [PubMed] [Google Scholar]
  • 51.Pedersen CF, Andersen MO, Carreon LY, Eiskjaer S. Applied machine learning for spine surgeons: predicting outcome for patients undergoing treatment for lumbar disc herniation using PRO data. Global Spine J. 2022;12(5):866–876. doi: 10.1177/2192568220967643. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Berjano P, Langella F, Ventriglia L, Compagnone D, Barletta P, Huber D, et al. The influence of baseline clinical status and surgical strategy on early good to excellent result in spinal lumbar arthrodesis: a machine learning approach. J Pers Med. 2021;11(12). [DOI] [PMC free article] [PubMed]
  • 53.Staartjes VE, Stumpo V, Ricciardi L, Maldaner N, Eversdijk HAJ, Vieli M, et al. FUSE-ML: development and external validation of a clinical prediction model for mid-term outcomes after lumbar spinal fusion for degenerative disease. Eur Spine J. 2022;31(10):2629–2638. doi: 10.1007/s00586-022-07135-9. [DOI] [PubMed] [Google Scholar]
  • 54.Karhade AV, Fogel HA, Cha TD, Hershman SH, Doorly TP, Kang JD, et al. Development of prediction models for clinically meaningful improvement in PROMIS scores after lumbar decompression. Spine J. 2021;21(3):397–404. doi: 10.1016/j.spinee.2020.10.026. [DOI] [PubMed] [Google Scholar]
  • 55.Halicka M, Wilby M, Duarte R, Brown C. Predicting patient-reported outcomes following lumbar spine surgery: development and external validation of multivariable prediction models. BMC Musculoskelet Disord. 2023;24(1):333. doi: 10.1186/s12891-023-06446-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Siccoli A, de Wispelaere MP, Schroder ML, Staartjes VE. Machine learning-based preoperative predictive analytics for lumbar spinal stenosis. Neurosurg Focus. 2019;46(5):E5. doi: 10.3171/2019.2.FOCUS18723. [DOI] [PubMed] [Google Scholar]
  • 57.Brinkman N, Shah R, Doornberg J, Ring D, Gwilym S, Jayakumar P. Artificial neural networks outperform linear regression in estimating 9-month patient-reported outcomes after upper extremity fractures with increasing number of variables. OTA Int. 2023;6(5 Suppl):e284. doi: 10.1097/OI9.0000000000000284. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Loos NL, Hoogendam L, Souer JS, Slijper HP, Andrinopoulou ER, Coppieters MW, et al. Machine learning can be used to predict function but not pain after surgery for thumb carpometacarpal osteoarthritis. Clin Orthop Relat Res. 2022;480(7):1271–1284. doi: 10.1097/CORR.0000000000002105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Harrison CJ, Geoghegan L, Sidey-Gibbons CJ, Stirling PHC, McEachan JE, Rodrigues JN. Developing machine learning algorithms to support patient-centered, value-based carpal tunnel decompression surgery. Plast Reconstr Surg Glob Open. 2022;10(4):e4279. doi: 10.1097/GOX.0000000000004279. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.•Oeding JF, Krych AJ, Pearle AD, Kelly BT, Kunze KN. Medical imaging applications developed using artificial intelligence demonstrate high internal validity yet are limited in scope and lack external validation. Arthroscopy. 2024. This study is of importance to this topic as it parallels the themes identified in this review pertaining to clinically significant outcome achievement: repetitive use cases of current statistical methods and methodological shortcomings are currently impeding progress in this domain. [DOI] [PubMed]
  • 61.Rossi MJ, Brand JC, Lubowitz JH. Minimally clinically important difference (MCID) is a low bar. Arthroscopy. 2023;39(2):139–141. doi: 10.1016/j.arthro.2022.11.001. [DOI] [PubMed] [Google Scholar]
  • 62.Kunze KN, Madjarova S, Jayakumar P, Nwachukwu BU. Challenges and opportunities for the use of patient-reported outcome measures in orthopaedic pediatric and sports medicine surgery. J Am Acad Orthop Surg. 2023;31(20):e898–e905. doi: 10.5435/JAAOS-D-23-00087. [DOI] [PubMed] [Google Scholar]

Articles from Current Reviews in Musculoskeletal Medicine are provided here courtesy of Humana Press