Current Reviews in Musculoskeletal Medicine. 2024 Apr 8;17(6):185–206. doi: 10.1007/s12178-024-09893-z

Artificial Intelligence for Clinically Meaningful Outcome Prediction in Orthopedic Research: Current Applications and Limitations

Seong Jun Jang 1, Jake Rosenstadt 2, Eugenia Lee 3, Kyle N Kunze 1
PMCID: PMC11091035  PMID: 38589721

Abstract

Purpose of Review

Patient-reported outcome measures (PROMs) play a critical role in evaluating the success of treatment interventions for musculoskeletal conditions. However, predicting which patients will benefit from treatment interventions is complex and influenced by a multitude of factors. Artificial intelligence (AI) may better anticipate the propensity to achieve clinically meaningful outcomes by leveraging complex predictive analytics that allow for personalized medicine. This article provides a contemporary review of current applications of AI developed to predict clinically significant outcome (CSO) achievement after musculoskeletal treatment interventions.

Recent Findings

The highest volume of literature exists in the subspecialties of total joint arthroplasty, spine, and sports medicine, with only three studies identified in the remaining orthopedic subspecialties combined. Performance is widely variable across models, with most studies only reporting discrimination as a performance metric. Given the complexity inherent in predictive modeling for this task, including data availability, data handling, model architecture, and outcome selection, studies vary widely in their methodology and results. Importantly, the majority of studies have not been externally validated or demonstrate important methodological limitations, precluding their implementation into clinical settings.

Summary

A substantial body of literature has accumulated demonstrating variable internal validity, limited scope, and low potential for clinical deployment. The majority of studies attempt to predict the MCID—the lowest bar of clinical achievement. Though a small proportion of models demonstrate promise and highlight the utility of AI, important methodological limitations need to be addressed moving forward to leverage AI-based applications for clinical deployment.

Keywords: Machine learning, Minimal clinically important difference, Patient-reported outcome measures, Artificial intelligence, Orthopedic surgery

Introduction

Patient-reported outcome measures (PROMs) play a critical role in determining the success of treatment interventions and are a central focus for evaluating the value of care rendered [1]. Indeed, the Centers for Medicare & Medicaid Services (CMS) have recently finalized a landmark national policy to standardize and expand the collection and reporting of PROMs following total joint arthroplasty (TJA), with the overall goals of enhancing clinical care, shared decision-making, and quality measurement for these common elective procedures [1]. These metrics generally include evaluations of patient quality of life, pain, satisfaction, and function, thereby providing important information for clinicians regarding the effectiveness, quality, and value of various surgical and non-surgical treatment interventions [2–4]. As such, the ability to identify which patients will experience clinically important changes in their health state after an intervention, as assessed through PROMs, has been a recent focus in the orthopedic surgery literature and among policy makers, as it may aid in clinical decision-making [5–8].

A notable limitation to the utilization of PROMs in clinical and research settings is their interpretability. The overall meaning of raw numeric scores can be challenging both to communicate to patients and to interpret across different centers [9]. Perhaps most importantly, standard PROMs inherently fail to capture patient perception and experience in terms of whether the achieved total score is meaningful at a personal level. Therefore, a shift towards utilizing clinically significant outcome (CSO) measures has occurred in recent years. These metrics are statistical transformations of standard PROMs, utilizing distribution- or anchor-based methods to integrate a form of patient-perceived change into their interpretation [9]. Emphasis on quantifying clinically meaningful outcome thresholds for various treatments in the form of the minimal clinically important difference (MCID), patient acceptable symptom state (PASS), and substantial clinical benefit (SCB) has presented an excellent opportunity for better understanding treatment response [10].
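These distribution-based transformations are straightforward to compute. The sketch below (a minimal Python illustration with entirely hypothetical scores; the 0.5-SD fraction and the function names are assumptions for demonstration, not taken from any study reviewed here) estimates an MCID as half the standard deviation of baseline PROM scores and flags which patients achieve it:

```python
import numpy as np

def distribution_based_mcid(baseline_scores, fraction=0.5):
    """Distribution-based MCID estimate: a fraction (commonly 0.5)
    of the standard deviation of baseline PROM scores."""
    return fraction * np.std(baseline_scores, ddof=1)

def achieved_mcid(baseline, followup, mcid):
    """Flag patients whose PROM improvement meets or exceeds the MCID."""
    return (np.asarray(followup) - np.asarray(baseline)) >= mcid

# Hypothetical preoperative and 1-year postoperative scores (0-100 scale)
pre = np.array([40.0, 55.0, 48.0, 62.0, 35.0, 50.0])
post = np.array([70.0, 60.0, 75.0, 64.0, 55.0, 52.0])

mcid = distribution_based_mcid(pre)
print(round(mcid, 2))            # estimated MCID threshold
print(achieved_mcid(pre, post, mcid))  # per-patient achievement flags
```

Anchor-based methods instead tie the threshold to an external patient-reported anchor question, which is why they are generally preferred for capturing patient-perceived change.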

The interpretation of CSOs, which represent discrete entities capturing a health change during a treatment-related experience, is inherently influenced by a complex interaction among various medical and social determinants such as demographic and clinical profiles, treatment techniques, geographic location, rehabilitation, patient expectations, and surgeon–patient interactions [11, 12]. The ability to appropriately consider increasingly complex and large amounts of data representing these unique combinations of factors may unlock the ability to more accurately predict health-state changes and distinguish which specific factors most influence the achievement of a CSO. This, in turn, holds the potential to optimize patient management at each stage of musculoskeletal episodes of care. The complexity of this task has positioned artificial intelligence (AI) as a promising means of addressing these challenges. Machine learning (ML), a subset of AI that leverages unique statistical techniques to learn from complex data patterns and optimize the prediction of a specific output, may better handle large and complex data interactions and yield insight into the achievement of CSOs [13, 14]. Numerous recent studies have applied ML techniques to develop prognostic clinical prediction models to this end [15–18].

The current review synthesizes contemporary literature from the orthopedic surgical subspecialties to critically assess the applications, performance, and limitations of ML techniques leveraged for the prediction of CSOs after treatment interventions. In doing so, it also aims to determine how future efforts should be focused to increase the quality and value of clinical prediction models developed for CSO achievement, such that implementation into clinical settings may become feasible.

Total Joint Arthroplasty

Considerable effort has been directed towards applying AI to accomplish a variety of tasks pertaining to TJA including predicting clinical outcomes [19], identifying surgical implants [20], generating synthetic radiographs [21], and automating measurements for clinical analysis [22]. Notably, a large body of literature has specifically aimed to use ML to predict PROMs at various time points after both total hip and knee arthroplasty. The following section provides a concise overview of this literature, highlighting the applications, performance, and limitations of these prediction models.

Model Applications

Seven studies were identified that developed prognostic clinical prediction models for PROMs after TJA (Table 1). Of the seven studies, three (42.9%) investigated combined cohorts of patients undergoing THA and TKA [15, 16, 23], three (42.9%) focused on isolated TKA populations [24–26], and one (14.3%) considered an isolated THA population [27]. The greater proportion of literature leveraging AI to predict CSO achievement after TKA may reflect the traditionally higher dissatisfaction rate observed after TKA compared to THA [28]. Indeed, this discrepancy in patient-perceived health changes could certainly elicit interest in better identifying which patients would ultimately benefit from TKA, in order to guide patient selection and enhance treatment decision-making in this more challenging cohort. Cohort sizes used for algorithm training and testing varied widely across studies, ranging between 587 [24] and 64,634 patients [23].

Table 1.

Machine learning literature for predicting patient-reported outcomes in hip and knee replacement surgery

Article Surgery Cohort (N) Models PROM of interest CSO Time Top results
Fontana et al. (2019) [16] THA /TKA

7239 THA

6480 TKA

LASSO, SVM, RF

SF-36 physical component scores (PCS)

SF-36 mental component scores (MCS)

Hip and knee disability and osteoarthritis outcome scores (HOOS/KOOS JR)

MCID 2 years

PCS: LASSO, AUROC 0.78

MCS: LASSO, AUROC 0.89

HOOS JR: LASSO, AUROC 0.77

KOOS JR: LASSO, AUROC 0.75

Huber et al. (2019) [23] THA /TKA

30,524 THA

34,110 TKA

LR, XGB, MSAENET, RF, NNET, NB, KNN, LB

Visual analog scale (VAS)

Q score (sum of Oxford hip and knee scores)

MCID NR

THA:

- VAS: XGB, AUROC 0.87

- Q: XGB, AUROC 0.78

TKA:

- VAS: XGB/MSAENET, AUROC 0.83

- Q: XGB/MSAENET, AUROC 0.71

Kunze et al. (2020) [27] THA 616 THA SGB, RF, SVM, NNET, ENP Patient-reported health state (PRHS) MCID 2 years

PRHS: RF, AUROC 0.97

Calibration intercept 0.05,

Calibration slope 1.45,

Brier score 0.054

Harris et al. (2021) [24] TKA 587 TKA LR, LASSO, GBM, QDA

KOOS Total

KOOS Jr

KOOS sub-scales

MCID 1 year

KOOS JR: QDA, AUROC 0.60

KOOS Total: QDA AUROC 0.66

- ADL: LR, AUROC 0.76

- Pain: All ML, AUROC 0.72

- Symptom: LASSO, AUROC 0.72

- QoL: GBM, AUROC 0.71

- Recreation: GBM, AUROC 0.52

Katakam et al. (2022) [25] TKA 744 TKA SGB, RF, SVM, NN, ENP KOOS-PS MCID 1 year

KOOS-PS: ENP, AUROC 0.77

Calibration intercept -0.02

Calibration slope 1.15

Brier score 0.14

Zhang et al. (2022) [26] TKA 2840 TKA RF, XGB, SVM, LASSO

SF-36 PCS

SF-36 MCS

WOMAC

MCID 2 years

PCS: XGB, AUROC 0.77

MCS: XGB/SVM, AUROC 0.95

WOMAC: RF/LASSO, AUROC 0.89

Langenberger et al. (2023) [15] THA/TKA

1843 THA

1546 TKA

NNet, GBM, LASSO, RF, LR, Ridge regression, Elastic net

EuroQol five-dimension five-level questionnaire (EQ-5D-5L)

EQ visual analogue scale (EQ-VAS)

HOOS-PS

KOOS-PS

MCID 1 year

THA:

-EQ-5D-5L: GBM, AUROC 0.81

-EQ-VAS: LASSO, AUROC 0.84

-HOOS-PS: Ridge, AUROC 0.71

TKA:

-EQ-5D-5L: RF, AUROC 0.80

-EQ-VAS: Elastic net, AUROC 0.76

-KOOS-PS: Elastic net, AUROC 0.76

LASSO logistic least absolute shrinkage and selection operator; SVM support vector machine; RF random forest; LR logistic regression; XGB extreme gradient boosting; MSAENET multi-step adaptive elastic-net; ENP elastic-net penalized logistic regression; NNET neural network; LB logistic boost; NB naïve Bayes; KNN k-nearest neighbors; SGB stochastic gradient boosting; GBM gradient boosting machine; QDA quadratic discriminant analysis; WOMAC Western Ontario and McMaster Universities Osteoarthritis Index; MLP multi-layer perceptron

All seven studies developed clinical prediction models with the labeled dependent outcome being the MCID [15, 16, 23–27]. Interestingly, no studies developed prediction models for the PASS or SCB for any outcome. Critically, no two studies utilized the same types of PROMs, which may reflect variance in data accessibility, differences in institutional or single-surgeon PROM collection patterns, or a strategic attempt to add a novel contribution to the literature. Follow-up ranged from "initial follow-up" to a minimum of 2 years, while follow-up in one study was not reported. No study reported a minimum follow-up time exceeding 2 years.

Model Architectures

Commonly used model architectures included random forest (RF), support vector machines (SVM), least absolute shrinkage and selection operators (LASSO), and gradient boosting machines (GBM). All but one study [29] compared the performance of multiple models and subsequently determined the best-performing model based on comparative metrics. Similarly, all but one study utilized a non-ML comparison model (i.e., traditional logistic regression) to determine whether the ML model demonstrated any discernible predictive advantage over a non-ML technique utilized within the same dataset (Table 2). Three (42.9%) studies demonstrated that ML techniques were not superior to non-ML techniques. Langenberger et al. demonstrated that the performance of ML models was not superior to logistic regression models for various PROMs [15]. Interestingly, Langenberger et al. [15] and Zhang et al. [26] also demonstrated that simply applying thresholds to preoperative PROMs was superior or equivalent to ML models in predicting which patients would achieve the MCID after surgery, highlighting that preoperative PROMs are often the most important variables providing insight into the propensity for a patient to achieve a CSO.
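The head-to-head comparison pattern these studies describe, one or more ML models against a traditional logistic regression baseline on the same dataset judged by test-set AUROC, can be sketched as follows. The data are synthetic (scikit-learn's make_classification standing in for a tabular PROM dataset), and the model choices and hyperparameters are illustrative assumptions rather than any study's actual pipeline:

```python
# Sketch: compare a random forest against a logistic regression baseline
# on the same train/test split, scored by AUROC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a binary MCID-achievement outcome
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
aurocs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aurocs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUROC = {aurocs[name]:.2f}")
```

If the baseline's AUROC matches the ML model's, the added complexity of ML offers no discernible advantage on that dataset, which was the conclusion reached by several of the studies above.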

Table 2.

Study characteristics and limitations of studies in hip and knee replacement surgery

Article Validation Non-ML comparisons Interpretability/feature analysis Code available Missing data/adjustments Unique study characteristics Other limitations
Fontana et al. (2019) [16] Internal Comparison against dummy models Yes, feature analysis No Imputation

Addition of data from four different time points

High pre-op MCID exclusion

Large feature space

Single institution

Huber et al. (2019) [23] Internal Comparison against National Health Services (NHS) linear regression model Yes, variable importance No Removal of missing data, synthetic minority over-sampling technique

Sensitivity analysis

Testing cohort separated by time

Missing variables (BMI)
Kunze et al. (2020) [27] Internal Null model Yes, variable importance No

 > 30%, exclusion

 < 30%, imputation

Web application Small cohort, limited data
Harris et al. (2021) [24] Internal Logistic regression No No NR VA hospitals from three separate regions Only VA data
Katakam et al. (2022) [25] Internal Null model Yes, partial dependence plots No  < 30%, imputation

Web application

Five sites

Small cohort
Zhang et al. (2022) [26] Internal Preoperative PROMs thresholds Yes, variable importance No Imputation, upsampling Determination against preoperative PROM cutoffs Single institution
Langenberger et al. (2023) [15] Internal Logistic regression, preoperative PROM scores Yes, SHAP analysis No

 > 30%, exclusion

 < 30%, imputation

9 hospitals represented Limited follow-up

SHAP SHapley Additive exPlanations

Model Performance

The performance of the models varied across both THA and TKA studies. For THA, the AUROC of models ranged from 0.71 to 0.97, whereas for TKA, the AUROC ranged from 0.52 to 0.87. Importantly, every study differed in the type of PROMs investigated, the definitions of the MCID and satisfaction, the variables input for modeling, and the models selected for training and validation. These methodological differences likely contribute to the wide range in performance identified among these studies. Despite these differences, the majority of studies did include a form of feature analysis or interpretation of the variables contributing most to each model's predictions for interpretability.
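Beyond discrimination (AUROC), several of the studies in Table 1 also report a calibration intercept, calibration slope, and Brier score. The snippet below is a hedged sketch of how these metrics are commonly computed (the predictions are made up, and the logistic-recalibration approach shown is one standard method, not necessarily the one each study used); a slope near 1 and an intercept near 0 indicate good calibration, while a lower Brier score indicates more accurate probability estimates:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_and_brier(y_true, y_prob):
    """Return (calibration intercept, calibration slope, Brier score)."""
    y_true = np.asarray(y_true)
    y_prob = np.clip(np.asarray(y_prob), 1e-6, 1 - 1e-6)
    logit = np.log(y_prob / (1 - y_prob)).reshape(-1, 1)
    # Logistic recalibration: regress observed outcomes on the logit of
    # the predicted probabilities (large C approximates no penalty).
    lr = LogisticRegression(C=1e6).fit(logit, y_true)
    slope = float(lr.coef_[0, 0])
    intercept = float(lr.intercept_[0])
    brier = float(np.mean((y_prob - y_true) ** 2))  # mean squared error
    return intercept, slope, brier

# Hypothetical predicted MCID probabilities and observed achievement
y_prob = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
print(calibration_and_brier(y_true, y_prob))
```

Reporting calibration alongside AUROC matters because a model can rank patients well (high AUROC) while systematically over- or under-estimating the probability of CSO achievement.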

Model Limitations

Critically, no study was externally validated, and none published its code for external confirmation and research use. However, two studies deployed open-access online tools based on their developed models for public use [25, 27]; it is recommended that interactive applications derived from clinical prediction tools remain educational demonstrations and not be implemented into clinical workflows until external validation is achieved. Other limitations in this body of literature include non-standardized selection of CSO thresholds or use of thresholds from external populations that may not generalize to the study population, variation in methods of data handling and imputation, small sample sizes in several studies, and a lack of consensus as to which PROM most accurately captures CSO achievement.

Sports Medicine

Within sports medicine, ML has primarily been applied to predicting achievement of the MCID and to elucidating the major factors associated with MCID achievement.

Model Applications

Across the 15 identified papers, study cohorts included patients undergoing hip arthroscopy for femoroacetabular impingement syndrome (FAIS) (n = 7) [17, 30–34•, 35], anterior cruciate ligament (ACL) reconstruction (n = 3) [36–38], anatomic and reverse total shoulder arthroplasty (TSA) (n = 2) [39, 40], osteochondral allograft (OCA) transplantation (n = 2) [41, 42], and rotator cuff repair (RCR) (n = 2) [43, 44]. Of the 15 studies, 11 (73.3%) developed clinical prediction models for the MCID, three (20%) for both the MCID and SCB, and one (6.7%) for the MCID, PASS, and SCB (Table 3). All but one study [31] applied feature selection or variable importance during model development in order to determine the most predictive variables for achievement of the CSO of interest. Overall, data were obtained from various sources, including single-surgeon registries, multi-institution combined registries, and administrative databases, resulting in patient cohort sizes between 135 [41] and 5774 [39]. Follow-up for included studies ranged between 3 months and 2 years.

Table 3.

Machine learning literature for predicting patient-reported outcomes in sports medicine surgeries

Article Surgery Cohort (N) Models PROM of interest CSO Time Top results
Kunze et al. (2021) [32] ACLR 442 SGM, RF, NNET, SVM, AGB, ENP International Knee Documentation Committee (IKDC) score MCID 2 years minimum

IKDC: ENP, AUROC 0.82

Calibration intercept 0.10

Calibration slope: 1.15

Brier score 0.07

Ye et al. (2022) [37] ACLR 432 LR, NB, RF, XGBs

Lysholm score

IKDC score

MCID 2 years minimum

Lysholm score nonachievement: XGB, AUROC 0.93

Accuracy 0.91

IKDC Nonachievement: XGB, AUROC 0.94

Accuracy 0.95

Nwachukwu et al. (2020) [17] FAIS 898 LASSO Modified Harris Hip Score (mHHS), Hip Outcome Score–Activities of Daily Living (HOS-ADL), HOS–Sport Specific (HOS-SS) MCID 2 years

HOS-ADL: AUROC 0.89

HOS-SS: AUROC 0.85

mHHS: AUROC 0.84

Kunze et al. (2021) [32] FAIS 1,118 SGM, RF, AGB, NNET, ENP HOS-SS MCID 2 years minimum

HOS-SS: ENP, AUROC 0.77

Calibration intercept 0.07

Calibration slope 1.22

Brier score 0.14

Kunze et al. (2022) [45••] FAIS 859 RF, ENP, NNET, SGB, XGB, SVM International Hip Outcome Tool-12 (IHOT-12) MCID, SCB, PASS 2 years minimum

MCID: ENP, AUROC 0.78

- Calibration intercept 0.02

- Calibration slope 0.85

- Brier score 0.14

PASS: RF, AUROC 0.74

- Calibration intercept 0.03

- Calibration slope 0.79

- Brier score 0.21

SCB: XGB, AUROC 0.71

- Calibration intercept -0.03

- Calibration slope 0.95

- Brier score 0.22

Kunze et al. (2021) [32] FAIS 818 RF, SGB, NNET, ENP, SVM HOS-ADL MCID 2 years

HOS-ADL: SGB, AUROC 0.84

- Calibration intercept 0.20

- Calibration slope 0.83

- Brier score 0.13

Kunze et al. (2021) [32] FAIS 935 RF, SGB, NNET, ENP, SVM Visual analog scale (VAS) score MCID 2 years minimum

VAS: NNET, AUROC 0.94

- Calibration intercept -0.43

- Calibration slope 1.07

- Brier score: 0.050

Ramkumar et al. (2020) [31] FAIS 1735 RF

mHHS

HOS-ADL

HOS-SS

International Hip Outcome Tool (iHOT-33)

MCID 1 year, 2 years

mHHS [1 year AUROC] | [2 years AUROC]

- Alpha: 0.490 | 0.539, Coronal: 0.503 | 0.572, Femoral: 0.532 | 0.495, McKibbin: 0.520 | 0.511, Hip impingement: 0.533 | 0.486

HOS-ADL

- Alpha: 0.487 | 0.538, Coronal: 0.502 | 0.568, Femoral: 0.504 | 0.489, McKibbin: 0.509 | 0.538, Hip impingement: 0.540 | 0.516

HOS-SS

- Alpha: 0.515 | 0.540, Coronal: 0.500 | 0.569, Femoral: 0.510 | 0.544, McKibbin: 0.517 | 0.546, Hip impingement: 0.521 | 0.516

iHOT-33

- Alpha: 0.534 | 0.529, Coronal: 0.512 | 0.570, Femoral: 0.522 | 0.519, McKibbin: 0.524 | 0.554, Hip impingement: 0.521 | 0.505

Pettit et al. (2023) [30] FAIS 1917 RF, LR, NNET, SVM, GBM iHOT-12 MCID 6 months minimum iHOT-12: RF, AUROC 0.75
Ramkumar et al. (2021) [41] OCA 135 1 year, 153 2 years NB, XGBs, RF, LR, ensemble classifier

IKDC

KOS-ADL

SF-36 MCS

SF-36 PCS

MCID, SCB 1 year, 2 years

IKDC

- MCID: 1 year—NB, AUROC 0.72, 2 years—LR, AUROC 0.88

- SCB: 1 year—NB, 0.74, 2 years—LR, 0.92

KOS-ADL

- MCID: 1 year—Isotonic, 0.74, 2 years—RF, 0.86

- SCB: 1 year—RF, 0.84, 2 years—NB, 0.92

SF-36 MCS

- MCID: 1 year—LR, 0.73, 2 years—Sigmoid, 0.94

- Not applicable for SCB

SF-36 PCS

- MCID: 1 year—Ensemble, 0.81, 2 years—LR, 0.90

- Not applicable for SCB

Ramkumar et al. (2021) [42] OCA 153 NB, XGBs, RF, LR, ensemble classifier

IKDC

KOS-ADL

SF-36 MCS

SF-36 PCS

MCID, SCB 2 years

IKDC

- MCID: LR, AUROC 0.84

- SCB: NB, AUROC 0.90

KOS-ADL

- MCID: RF, AUROC 0.88

- SCB: NB, AUROC 0.86

SF-36 MCS

- MCID: RF, AUROC 0.60

- Not applicable for SCB

SF-36 PCS

- MCID: LR, AUROC 0.68

- Not applicable for SCB

Alaiti et al. (2023) [43] RCR 474 RF, LightGBM, decision tree classifier, extra trees classifier, XGB, KNN, CatBoost, LR American Shoulder and Elbow Surgeons (ASES) MCID 2 years ASES: RF, AUROC 0.68
Potty et al. (2023) [44] RCR 631 Linear regression, ridge regression, LASSO, SVM, KNN, RF, XGB ASES MCID, SCB 3 mo, 6 mo, 12 mo Predicted change in outcome scores that fell within MCID and SCB thresholds from literature for ASES score; percentage of patients predicted within the MCID for 3-, 6-, and 12-months post operation was 52%, 54%, and 69%, respectively. The percentage of patients predicted within SCB for 3, 6, and 12 months was 73%, 78%, and 87%, respectively
Kumar et al. (2020) [40] aTSA, rTSA 4782 Linear regression, XGB, and wide and deep

ASES

University of California Los Angeles (UCLA)

CONSTANT pain score

Global shoulder function (GSF) score

VAS pain

MCID 1 year minimum

Mean absolute error (MAE), wide and deep

- ASES: ± 10.1 to 11.3 points

- UCLA: ± 2.5 to 3.4

- Constant: ± 7.3 to 7.9

- GSF: ± 1.0 to 1.4

- VAS ± 1.2 to 1.4

MCID, XGBoost

- ASES: AUROC 0.88

- UCLA: AUROC 0.88

- Constant: AUROC 0.94

- GSF: AUROC 0.91

- VAS: AUROC 0.86

SCB, XGBoost

- ASES: AUROC 0.80

- UCLA: AUROC 0.83

- Constant: AUROC 0.88

- GSF: AUROC 0.85

- VAS: AUROC 0.86

Kumar et al. (2021) [39] aTSA, rTSA 5774 XGB

ASES

CONSTANT pain score

GSF score

VAS pain

MCID 3 months minimum

MAE, XGB

- ASES: ± 11.7 full model, ± 12.0 abbreviated model

- Constant: ± 8.9, ± 9.8

- GSF ± 1.4, ± 1.5

- VAS: ± 1.3, ± 1.4

MCID, XGB AUROC

- ASES: 0.90 full model, 0.88 abbreviated model

- Constant: 0.95, 0.94

- GSF: 0.88, 0.87

- VAS pain score: 0.87, 0.87

SCB, XGB AUROC

- ASES: 0.84 full model, 0.82 abbreviated model

- Constant: 0.90, 0.89

- GSF: 0.84, 0.83

- VAS: 0.82, 0.85

LASSO logistic least absolute shrinkage and selection operator; SVM support vector machine; RF random forest; LR logistic regression; XGB extreme gradient boosting; ENP elastic-net penalized logistic regression; NNET neural network; NB naïve Bayes; KNN k-nearest neighbors; SGB stochastic gradient boosting; GBM gradient boosting machine; AGB adaptive gradient boosting; GAM generalized additive model

Model Architectures and Performance

Of the 15 studies, three (20%) utilized a single ML model [17, 31, 39], while the remaining 12 compared multiple models. The most common ML algorithms used were random forest (RF), XGBoost (XGB), neural networks (NN), SVM, stochastic gradient boosting (SGB), and elastic-net penalized logistic regression (ENPLR) (Table 3). Overall, model performance upon internal validation ranged from fair to excellent across the various treatment interventions and algorithm architectures. For example, when considering achievement of the MCID for the International Knee Documentation Committee (IKDC) score, Kunze et al. [36] determined that an ENPLR model (AUROC: 0.82) demonstrated the best relative performance, whereas Ye et al. [37] reported that an XGB model (AUROC: 0.94) performed best. Given that patient populations and model training often vary considerably, there is currently no method of directly comparing model performance across studies.

Notably, Kunze et al. [34•] have performed the only investigation to date that developed independent prediction models for the MCID, PASS, and SCB, providing insight into the propensity to achieve a wide range of magnitudes of clinical health change. This group reported that the best-performing algorithm for predicting achievement of the MCID for the IHOT-12 score after hip arthroscopy for FAIS was an ENPLR model (AUROC: 0.78), while an RF model performed best for predicting PASS achievement (AUROC: 0.74), and an XGB model performed best for predicting SCB achievement (AUROC: 0.71) [34•].

Overall, model performance varied considerably across ACLR, hip arthroscopy for FAIS, OCA, RCR, and aTSA/rTSA interventions. For ACLR, AUROC values ranged between 0.82 and 0.94. Of the seven FAIS studies, all but one used a combination of demographic and imaging-based features to develop clinical prediction models. Interestingly, Ramkumar et al. [31] investigated the ability of radiographic angles to predict achievement of the MCID for the modified Harris hip score (mHHS), Hip Outcome Score–Activities of Daily Living (HOS-ADL), HOS–Sport Specific (HOS-SS), and International Hip Outcome Tool (iHOT-33). AUROC values for this study ranged between 0.49 and 0.57, suggesting that these features did not predict CSO achievement any better than random chance [31]. The other six studies that developed ML models on FAIS datasets reported AUROC values ranging between 0.71 and 0.94, suggesting fair to excellent performance.

The two studies that developed models to predict CSO achievement after OCA transplantation for chondral and osteochondral defects of the knee were derived from the same single-institution patient population. These two studies reported AUROC values ranging between 0.60 and 0.90 for the best models for both MCID and SCB achievement [41, 42]. For RCR, one study reported an AUROC of 0.68 for its best model [43]. Potty et al. [44] utilized ML techniques to predict the mean change in ASES score at several postoperative time points relative to pre-established MCID and SCB values from the literature. They reported that the percentage of patients predicted within the MCID at 3, 6, and 12 months was 52%, 54%, and 69%, respectively, while the percentage predicted within the SCB at 3, 6, and 12 months was 73%, 78%, and 87%, respectively [44].

For aTSA/rTSA, one study reported AUROC values ranging between 0.80 and 0.94 across five different PROMs [40]. The other aTSA/rTSA study evaluated a full- versus an abbreviated-input model (291 versus 19 variables, respectively) and found no significant difference in AUROC between the two [39].

Model Limitations

Major limitations of sports medicine studies that have developed prediction models for CSO achievement include lack of generalizability (single-surgeon or homogeneous cohorts), exclusion of potentially significant untested variables, selection bias due to low follow-up rates, and missing data (Table 4). Although most studies used a form of imputation to fill in missing data points, the proportion of patients lost to follow-up was often substantial. Furthermore, each study chose different input variables, and even within the same interventions the most predictive variables did not overlap across models, making it difficult to compare studies and draw conclusions. Only one paper [41] published its code for open-source use. Finally, only a single study externally validated its algorithm on an independent patient population, a gap that limits the generalizability of the remaining models. Kunze et al. [45••] performed external validation of their ML clinical prediction model for MCID achievement for the HOS-SS after hip arthroscopy utilizing a geographically unique and independent population of 154 patients. This group demonstrated good performance on external validation, with an AUROC of 0.80 and appropriate calibration metrics. Such efforts represent essential steps toward introducing prognostic models into clinical workflows.

Table 4.

Study characteristics and limitations of studies in sports medicine surgeries

Article Validation Non-ML comparisons Interpretability/feature analysis Code available Missing data/adjustments Unique study characteristics Other limitations
Kunze et al. (2021) [32] Internal Null model Yes, variable importance No Imputation Web application Single institution
Ye et al. (2022) [37] Internal NR Yes, SHAP analysis No Removal of missing data Top models shared same top predictors

Single surgeon

Missing variables (preoperative psychological evaluation, timing of return to sports)

Nwachukwu et al. (2020) [17] Internal NR Yes, feature selection No Removal of missing data Used PatientIQ to analyze data

Single surgeon

Used one ML model

Selection bias from removing patients with incomplete data

Kunze et al. (2021) [32] Internal Null model Yes, feature selection, LIME plot, variable importance No Imputation

Web application

Included data from community hospitals

Only data from 4 hospitals
Kunze et al. (2022) [45••] Internal Null model Yes, variable importance No Imputation

Web application

Evaluated PASS

Selection bias from removing patients with incomplete data
Kunze et al. (2021) [32] External Null model Yes, feature selection, LIME plot No Imputation Web application Single institution
Kunze et al. (2021) [32] Internal Null model Yes, feature selection, LIME plot No Imputation Web application

Large proportion of missing data

Missing variable (hip osteoarthritis)

Ramkumar et al. (2020) [31] Internal Preoperative PROM No No NR Radiographic indices as main variable 45% 2-year follow-up rate
Pettit et al. (2023) [30] Internal NR Yes, feature importance No Imputed Used nationwide registry (UK) Limited patient data for primary outcome (37%)
Ramkumar et al. (2021) [41] Internal NR Yes, SHAP analysis Yes  > 80%, exclusion Measures at multiple timepoints

Small cohort

Single institution

Ramkumar et al. (2021) [42] Internal NR Yes, SHAP analysis No  > 80%, exclusion Includes preoperative imaging as variable

Small cohort

Single institution

Alaiti et al. (2023) [43] Internal NR Yes, SHAP analysis No Removal of missing data at 12 and 24 months Web application Single institution
Potty et al. (2023) [44] Internal NR Yes, SHAP analysis No NR Measures at multiple timepoints

Single institution

Variable surgical techniques

Kumar et al. (2020) [40] Internal Mean absolute error Yes, feature importance No Imputation 30 sites Excluded patients with revisions, humeral fractures, and hemiarthroplasty
Kumar et al. (2021) [39] Internal Full versus abbreviated input variables Yes, feature selection No Imputation 30 sites

Excluded patients with revisions, humeral fractures, and hemiarthroplasty

One ML algorithm

SHAP SHapley Additive exPlanations

Spine

The current body of AI research concerning CSO achievement after spine surgery encompasses a variety of surgical procedures and pathologies, from disc herniations to spinal metastases. As highlighted below, this variety has led to heterogeneity in the literature concerning model applications, architecture, and performance.

Model Applications

A total of eleven studies used ML algorithms to predict CSO achievement in patients undergoing spine surgery (Table 5). Of these, four concerned outcomes after surgical treatment of degenerative cervical myelopathy (DCM) [46–49], two after lumbar disc herniation (LDH) [50, 51], two after lumbar fusion (LF) [52, 53], and the remainder included combined cohorts that underwent multiple lumbar procedures [54–56]. All eleven studies used the MCID as their primary outcome [46–56], while none attempted to build models predicting the PASS or SCB. A wide variety of outcomes were measured, the most common being the numeric rating scale for back pain (NRS-BP) and leg pain (NRS-LP), the visual analog scale (VAS), the Oswestry Disability Index (ODI), and the Core Outcome Measures Index (COMI). Sample sizes varied considerably and were generally small, ranging between 40 [48] and 4307 [55•]. Mean follow-up ranged between 6 months and 2 years (Table 5).

Table 5.

Machine learning literature for predicting patient-reported outcomes in orthopedic spine surgery

Article Surgery Cohort (N) Models PROM of interest CSO Time Top results
Merali et al. (2019) [46] DCM 757 RF

SF-6D

Modified Japanese Orthopaedic Association (mJOA) scale

MCID 6 months, 1 year, 2 years

*SF-6D: RF, AUROC 0.73

mJOA: RF, AUROC 0.67

Siccoli et al. (2019) [56] LSS 635 RF, XGB, Bayesian GLM, BT, KNN, simple GLM, NNET

Numeric rating scale for back pain (NRS-BP)

Numeric rating scale for leg pain (NRS-LP)

Dutch Oswestry Disability Index (ODI)

MCID 6 weeks, 1 year

*NRS-BP: XGB, AUROC 0.79

Brier score 0.18

F1-score 0.87

NRS-LP: RF, AUROC 0.72

Brier score 0.20

F1-score 0.80

ODI: BGLM, AUROC 0.68

Brier score 0.20

F1-score 0.67

Staartjes et al. (2019) [50] LDH 422 NNET, LR

NRS-BP

NRS-LP

Dutch ODI

MCID 1 year

NRS-BP: NNET, AUROC 0.90

NRS-LP: NNET, AUROC 0.87

Dutch ODI: NNET, AUROC 0.84

Karhade et al. (2021) [54] LDH/LSS

532 LDH

374 LSS

SGB, RF, SVM, NNET, ENP

PROMIS physical function

Pain interference

Pain intensity

MCID  < 1 year

PROMIS Physical function: ENP, AUROC 0.79

Calibration intercept 0.00

Calibration slope 1.36

Brier score 0.15

Pain interference: NNET/ENP, AUROC 0.74

Calibration intercept, ENP 0.05

Calibration slope, ENP 1.07

Brier score, ENP 0.16

Pain intensity: NNET, AUROC 0.70

Calibration intercept 0.01

Calibration slope 1.08

Brier score 0.14

Berjano et al. (2021) [52] LF 1243 RF

Good early outcome (ODI)

Excellent early outcome (combined ODI, SF-36, COMI back)

MCID 6 months

Good early outcome: RF, AUROC 0.842

Excellent early outcome: RF, AUROC 0.808

Khan et al. (2021) [47] DCM 173 GBM, MARS (Earth)

SF-36 mental component summary (MCS)

SF-36 physical component summary (PCS)

MCID 1 year

SF-36 MCS: GBM, AUROC 0.77

Calibration intercept 0.38

Calibration slope 0.71

SF-36 PCS: Earth, AUROC 0.78

Calibration intercept 0.11

Calibration slope 1.06

Staartjes et al. (2022) [53] LF 1115 ENETGLM

ODI/Core Outcome Measures Index (COMI) functional impairment

NRS-BP

NRS-LP

MCID 1 year

ODI/COMI: ENETGLM, AUROC 0.67

Calibration intercept − 0.07

Calibration slope 0.63

NRS-BP: ENETGLM, AUROC 0.72

Calibration intercept − 0.38

Calibration slope 1.10

NRS-LP: ENETGLM, AUROC 0.64

Calibration intercept 0.14

Calibration slope 0.49

Zhang et al. (2022) [26] DCM 40 SVM

SF-36 MCS

SF-36 PCS

Neck Disability Index (NDI)

North American Spine Society (NASS)

mJOA

MCID 2 years

SF-36 MCS: SVM, AUROC 0.90

F1-score 0.90

SF-36 PCS: SVM, AUROC 0.86

F1-score 0.86

NDI: SVM, AUROC 0.65

F1-score 0.67

NASS: SVM, AUROC 0.98

F1-score 0.94

mJOA: SVM, AUROC 0.98

F1-score 0.86

Pedersen et al. (2022) [51] LDH 1968 DL, DT, RF, BT, SVM

EuroQol-5D (EQ-5D)

ODI

VAS-BP

VAS-LP

MCID 1 year

EQ-5D: SVM, AUROC 0.89

ODI: BT/SVM, AUROC 0.77

VAS-LP: SVM, AUROC 0.78

VAS-BP: RF, AUROC 0.87

Park et al. (2023) [49] DCM 916 LR, SVM, DT, RF, ET, GNB, KNN, MLP, XGB VAS-NP MCID 3 months, 2 years

*VAS-NP: LR, AUROC 0.77

F1-score 0.78

Halicka et al. (2023) [55•] LDH or LSS

1041 LDH

2413 LSS

853 both

RF

Back pain

Leg pain

COMI

MCID 3–24 months

Back pain: RF, AUROC 0.68

Calibration intercept − 0.06

Calibration slope 1.20

Brier score 0.22

Leg pain: RF, AUROC 0.66

Calibration intercept 0.01

Calibration slope 0.96

Brier score 0.20

COMI: RF, AUROC 0.62

Calibration intercept 0.06

Calibration slope 0.94

Brier score 0.23

RF Random forest; XGB extreme gradient boosting; GLM generalized linear models; BT boosted trees; KNN K-nearest neighbors; NNET neural network; LR logistic regression; DT decision tree; SGB stochastic gradient boosting; SVM support vector machine; ENP elastic-net penalized logistic regression; GBM gradient boosting machines; MARS multivariate adaptive regression splines; ENETGLM elastic net generalized linear models; LASSO logistic least absolute shrinkage and selection operator; Ridge CV ridge cross-validation; DL deep learning; GLMM generalized linear mixed model; CHAID chi-square automatic interaction detection; ET extra trees; GNB Gaussian naive Bayes; MLP multi-layer perceptron. *Only reporting results for the last time point

Model Architectures and Performance

Of the various ML architectures used for model training, the most common were RF, SVM, and variations of gradient boosting. In studies investigating outcomes after LDH, AUROC for MCID prediction ranged between 0.77 and 0.90; in studies investigating outcomes after LF, between 0.64 and 0.84; for MCID after LSS, the AUROC was 0.79 for NRS-BP, 0.72 for NRS-LP, and 0.68 for ODI; for MCID after DCM, AUROC ranged between 0.65 and 0.98; and for multiple lumbar procedures, between 0.62 and 0.79. Only four (36.4%) studies provided model calibration metrics. Other metrics, including the F1-score and Brier score, were variably reported.
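The metrics that recur across these studies (AUROC, Brier score, F1-score) are straightforward to compute. A minimal sketch in pure Python, using synthetic predictions rather than data from any study reviewed here: y is whether each patient achieved the MCID (1) or not (0), and p is the model's predicted probability.

```python
def auroc(y_true, y_prob):
    """AUROC via the rank-based (Mann-Whitney U) formulation: the
    probability a random MCID achiever is ranked above a non-achiever."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, y_prob):
    """Brier score: mean squared error of the predicted probabilities
    (lower is better; a constant 0.5 prediction scores 0.25)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def f1(y_true, y_prob, threshold=0.5):
    """F1-score (harmonic mean of precision and recall) at a fixed threshold."""
    pred = [int(p >= threshold) for p in y_prob]
    tp = sum(y == 1 and q == 1 for y, q in zip(y_true, pred))
    fp = sum(y == 0 and q == 1 for y, q in zip(y_true, pred))
    fn = sum(y == 1 and q == 0 for y, q in zip(y_true, pred))
    return 2 * tp / (2 * tp + fp + fn)

# Synthetic example: MCID achievement and predicted probabilities
y = [1, 0, 1, 1, 0, 0, 1, 0]
p = [0.9, 0.2, 0.7, 0.45, 0.4, 0.3, 0.8, 0.5]
print(round(auroc(y, p), 2), round(brier(y, p), 2), round(f1(y, p), 2))
# prints: 0.94 0.12 0.75
```

Note that AUROC captures only discrimination (ranking); the Brier score also penalizes miscalibrated probabilities, which is why reporting discrimination alone, as most studies did, gives an incomplete picture.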

Two studies externally validated their algorithms, which demonstrated comparable performance in independent populations (Table 6). Staartjes et al. [53] used a large multinational, multicenter dataset (11 centers across 7 countries) to predict MCID achievement for the ODI, COMI, NRS-BP, and NRS-LP at 1 year after LF surgery. The authors reported that discrimination for all outcomes was poor to fair, with AUROC ranging between 0.64 and 0.72. The positive predictive value of the model was relatively high (0.81–0.90), but the negative predictive value was low (0.23–0.39) [53]. Furthermore, model calibration performance was lower than that of the original model. Halicka et al. [55•] also performed external validation on a unique single-institution cohort, with AUROC of 0.68 for NRS-BP, 0.66 for NRS-LP, and 0.62 for COMI. Model calibration metrics were comparable to those of the internally validated model.
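The calibration intercept and slope reported in these studies are conventionally obtained by logistic recalibration: regressing the observed outcomes on the log-odds of the predicted probabilities. A minimal sketch of that procedure, assuming synthetic data and a plain Newton-Raphson fit (an illustration of the general method, not any study author's implementation):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(t):
    # numerically stable logistic function
    return 1 / (1 + math.exp(-t)) if t >= 0 else math.exp(t) / (1 + math.exp(t))

def calibration_intercept_slope(y_true, y_prob, iters=50):
    """Logistic recalibration: fit y ~ a + b * logit(p) by Newton-Raphson.
    An intercept (a) near 0 and a slope (b) near 1 indicate good calibration;
    b < 1 suggests predictions that are too extreme (overfitting)."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = sigmoid(a + b * xi)        # predicted probability
            w = mu * (1 - mu)               # IRLS weight
            g0 += yi - mu                   # gradient wrt intercept
            g1 += (yi - mu) * xi            # gradient wrt slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det    # Newton step
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Synthetic observed outcomes and predicted probabilities
y = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
p = [0.8, 0.3, 0.6, 0.5, 0.4, 0.6, 0.2, 0.7, 0.45, 0.35]
a, b = calibration_intercept_slope(y, p)
```

A slope well below 1 on an external cohort typically signals that the original model's probabilities are too extreme for the new population, which is one way the calibration deterioration noted above can manifest.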

Table 6.

Study characteristics and limitations of studies in spine surgery

Article Validation Non-ML comparisons Interpretability/feature analysis Code available Missing data/adjustments Unique study characteristics Other limitations
Merali et al. (2019) [46] Internal NR Yes, relative importance No

 > 5%, exclusion

 < 5%, imputation

International cohort

Large feature space

Follow-up rate 71.2%

One ML model

Siccoli et al. (2019) [56] Internal NR Yes, variable importance Yes Imputation Extensive number of outcomes predicted Single center, single surgeon
Staartjes et al. (2019) [50] Internal Logistic regression Yes, variable importance No Demographic data < 20% imputation Trained ML on radiological data (MRI) Small cohort, single surgeon
Karhade et al. (2021) [54] Internal Null model Yes, variable importance No  < 30% imputation

Web application

Individual explanations

Small sample size
Berjano et al. (2021) [52] Internal NR Yes, mean decrease Gini index No Imputation Leveraged baseline self-reported variables for prediction

Single center

Short follow-up

One ML model

Khan et al. (2021) [47] Internal Logistic regression Yes, relative importance No Imputation Multicenter dataset Small cohort
Staartjes et al. (2022) [53] External NR Yes, variable importance No Imputation

Multinational, multicenter dataset

Web application

No sample size calculation for external validation
Zhang et al. (2022) [26] Internal NR Yes, feature selection No  < 0.5% imputation Trained ML on radiological data (MRI)

Small cohort

One ML model

Pedersen et al. (2022) [51] Internal Comparison against MARS and logistic regression No No NR Relatively large sample size Single institution
Park et al. (2023) [49] Internal NR Yes, SHAP analysis No Imputation Large number of algorithms compared Small sample size
Halicka et al. (2023) [55•] External Logistic regression, linear regression Yes, relative importance Yes Imputation (multivariate imputation by chained equations (MICE)) Individual explanations

Single center

One ML model

SHAP Shapley additive explanations

Model Limitations

The most significant limitation among the studies was the lack of external validation, as observed with the majority of existing literature in other orthopedic subspecialties. As discussed above, only 2 of 11 studies (18.2%) externally validated their models [53, 55•]. Thus, the conclusions of the majority of the studies regarding model performance cannot currently be generalized. A second observed limitation was small sample sizes, with some models developed on study populations as small as 40 patients. Developing models without appropriately calculating the minimum sample size required for clinical prediction modeling may bias estimates of model performance. Other limitations consistent with the body of literature concerning CSO achievement include incomplete analysis or disclosure of important modeling metrics such as calibration and the Brier score, failure to disclose full source code, and lack of transparency and consistency in data handling.

Trauma; Oncology; Pediatric; Hand, Foot, and Ankle; and Other Disciplines

Outside of the sports medicine, spine, and TJA subspecialties, literature concerning the development of prognostic clinical prediction models for CSO achievement is scarce. Indeed, no articles were identified in pediatric orthopedic surgery, foot and ankle surgery, or musculoskeletal oncology. One article was identified within orthopedic trauma [57], and two articles were identified concerning hand surgery [58, 59]. The study interventions included upper extremity fracture management, treatment of thumb carpometacarpal osteoarthritis (CMC OA), and carpal tunnel decompression (CTD) surgery. All three studies evaluated the MCID for various PROMs, including the QuickDASH, the patient-reported outcomes measurement information system (PROMIS), and the Michigan hand outcomes questionnaire (MHQ). Patient sample sizes ranged from 734 [57] to 1489 [58], with data drawn from single institutions and national registries (Table 7).

Table 7.

Machine learning literature for predicting patient-reported outcomes in trauma, hand, and wrist surgeries

Article Surgery Cohort (N) Models PROM of interest CSO Time Top results
Brinkman et al. (2023) [57] Upper extremity fracture management 734 linear regression, NNET QuickDASH, PROMIS-UE-PF, PROMIS-PI MCID 9 months

Quick-DASH

- M-20: ANN, 64%

- M-23: ANN, 65%

- M-29: ANN, 80%

- M-34: ANN, 81%

- M-54: ANN, 85%

PROMIS UE:

- M-20: LR, 59%

- M-23: ANN, 66%

- M-29: ANN, 67%

- M-34: ANN, 74%

- M-54: ANN, 71%

Loos et al. (2022) [58] Thumb CMC OA (carpometacarpal osteoarthritis) 1489 GBM, RF MHQ pain, MHQ function MCID 12 months

MHQ:

- Pain: RF, AUROC 0.59

- Function: GBM, AUROC 0.74

Harrison et al. (2022) [59] Carpal tunnel decompression (CTD) 1093 ENP, KNN, SVM, XGB, NNET QuickDASH function, QuickDASH symptoms MCID  > 6 months

QuickDASH:

- Function: XGB, AUROC 0.79

- Symptoms: EN, AUROC 0.75

SVM Support vector machine; RF random forest; XGB extreme gradient boosting; ENP elastic-net penalized logistic regression; NNET neural network; KNN K-nearest neighbors; GBM gradient boosting machines

Brinkman et al. [57] developed a prediction model for MCID achievement at nine months after upper extremity fracture treatment, though the type of treatment was not disclosed. They compared the accuracy of 20-, 23-, 29-, 34-, and 54-variable models using artificial neural networks and linear regression. Neural networks outperformed linear regression, especially for models with a greater number of variables. Specifically, for QuickDASH scores, neural networks achieved 64%, 65%, 80%, 81%, and 85% accuracy for the 20-, 23-, 29-, 34-, and 54-variable models, respectively, while linear regression achieved 61%, 58%, 72%, 75%, and 77%. When examining PROMIS-UE scores, there was a trend toward increased accuracy with an increasing number of variables: accuracies for the neural networks ranged between 54 and 74%, while accuracies for linear regression ranged from 57 to 70% [57].

Loos et al. [58] developed GBM and RF algorithms for the prediction of MCID on the Michigan hand questionnaire (MHQ) for function and pain at 12 months after trapeziectomy with ligament reconstruction and tendon interposition for thumb CMC OA. Patient data was derived from 25 locations in the Netherlands. They reported that the AUROC for the MCID for function was 0.74, while the AUROC for the MCID for pain was 0.59, suggesting limited performance and clinical utility. Harrison et al. [59] developed five ML algorithms to predict improvement in QuickDASH score after CTD, with AUROC ranging between 0.74 and 0.79. Of note, the algorithm with the highest accuracy across both outcomes in this study was XGBoost. Interestingly, Harrison et al. also developed heuristic models using chi-squared automatic interaction detection (CHAID) to provide novel perspectives on treatment decision-making. Their CHAID trees suggested that the most appealing candidates for surgery were those with high overall baseline health and more severe hand symptoms. These intuitive diagrams can help to narrow clinical benefit down to just a few preoperative questions and may be valuable for incorporating ML predictions into clinical decision-making.
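The CHAID approach described above grows its trees by repeatedly selecting, via a chi-square test of association, the predictor most strongly related to the outcome. A minimal sketch of that split-selection step on binary variables, using hypothetical screening-question names and synthetic data (not taken from Harrison et al., and omitting CHAID's Bonferroni-adjusted p-values and multiway splits):

```python
def chi_square(table):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]]
    (rows = predictor level 0/1, columns = outcome 0/1)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    stat = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        expected = row * col / n
        stat += (obs - expected) ** 2 / expected
    return stat

def best_chaid_split(rows, predictors, outcome):
    """Return the binary predictor most associated with the outcome,
    plus the chi-square score of every candidate."""
    scores = {}
    for pred in predictors:
        table = [[0, 0], [0, 0]]
        for r in rows:
            table[r[pred]][r[outcome]] += 1
        scores[pred] = chi_square(table)
    return max(scores, key=scores.get), scores

# Hypothetical preoperative screening variables (illustrative only)
rows = [
    {"severe_symptoms": 1, "good_baseline_health": 1, "mcid": 1},
    {"severe_symptoms": 1, "good_baseline_health": 1, "mcid": 1},
    {"severe_symptoms": 1, "good_baseline_health": 0, "mcid": 1},
    {"severe_symptoms": 0, "good_baseline_health": 1, "mcid": 0},
    {"severe_symptoms": 0, "good_baseline_health": 0, "mcid": 0},
    {"severe_symptoms": 0, "good_baseline_health": 1, "mcid": 1},
]
best, scores = best_chaid_split(
    rows, ["severe_symptoms", "good_baseline_health"], "mcid")
print(best)  # prints: severe_symptoms
```

Because each split is a single chi-square test on one question, the resulting tree reads as a short sequence of yes/no screening questions, which is what makes this family of models attractive for bedside decision-making.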

The largest limitation of all three of these studies was the lack of external validation (Table 8). This is unsurprising given how few studies exist in trauma and in hand and wrist interventions and how rarely studies of other interventions have been externally validated, but it remains a noteworthy shortcoming. Beyond external validity, other limitations include arbitrary variable selection [57] and low follow-up rates: Loos et al. [58] reported a 55.4% follow-up rate and Harrison et al. [59] a 58.7% follow-up rate, which may have introduced selection bias into the data.

Table 8.

Study characteristics and limitations of studies in trauma, hand, and wrist surgeries

Article Validation Non-ML comparisons Interpretability/feature analysis Code available Missing data/adjustments Unique study characteristics Other limitations
Brinkman et al. (2023) [57] Internal Linear regression Yes, variable importance No Not mentioned Measure at multiple timepoints

Single center

Arbitrary variable selection

Trained model using variables 2–4 weeks post-injury

Loos et al. (2022) [58] Internal Logistic regression Yes, feature selection Yes

Imputation

Performed non-responder analysis

Web application for hand function

Excluded patients who underwent revision surgery

Follow-up rate 55.4%

Only 1 variable used in RF model

Harrison et al. (2022) [59] Internal None reported Yes, SHAP analysis Yes Performed missing data analysis

Web application

CHAID decision tree analysis

Single center

Follow-up rate 58.7%

Discussion

The main findings of the current review are as follows: (1) within the TJA literature, applications of ML have been directed at optimizing predictive performance for achieving CSO across a wide range of metrics, with most studies reporting reasonable internal validity despite a wide range in performance; (2) within the sports medicine literature, applications of ML for CSO prediction have targeted ACLR, hip arthroscopy, cartilage restoration, RCR, and TSA, with narrow use cases and variable performance; (3) within the spine literature, applications of ML for CSO prediction have involved the degree of change in pain and function, predominantly after treatment interventions for lumbar spine pathology and cervical myelopathy, with a wide range of performance and reporting conduct; (4) a paucity of effort has been directed towards applying AI to anticipate CSO achievement in the orthopedic surgical subspecialties of foot and ankle, hand, trauma, pediatrics, and musculoskeletal oncology; (5) the transparency of reporting and methodological conduct of studies that have developed prognostic clinical prediction models for CSO prediction is generally poor and demonstrates considerable inter-study variability, with information concerning data handling, code availability, and external validation lacking.

The ML literature for prediction of CSO after hip and knee arthroplasty considered an extensive variety of outcome assessment tools, likely highlighting a trend of interest in applying ML that exceeds the clinical translation and utility of the models developed. This is supported by the overwhelming proportion of these models that were not externally validated, precluding their implementation into clinical pathways. When considering internal validation and performance, several models demonstrated good to excellent discrimination. However, only two studies included other important performance metrics, such as model calibration, to further support the validity of their models. A scoping assessment of the methodological conduct of ML studies within this literature is exceedingly important as regulatory bodies begin to emphasize clinically significant outcome metrics to define value and quality while payment models shift towards value-based care. Advanced predictive modeling, including ML, can provide insight into personalized trajectories of outcome improvement, which should be considered when defining value-based care targets and treatment quality. Additionally, understanding the current state of the ML literature for predicting CSO is important as AI-assisted diagnosis and decision-making become adopted as billable interventions in reimbursement models. Based on the current review, published ML models in the hip and knee arthroplasty literature remain imperfect and unfocused, with variable reporting. To move forward, consensus on the important metrics to be used is needed, as well as increased methodological transparency and external validation.

Within the sports medicine literature, a narrow scope of applications was investigated; similar to the TJA literature, studies demonstrated variable internal validity, with only a single study achieving external validation of a previously developed model. Importantly, duplicative efforts in developing ML clinical prediction tools are not clinically useful unless a substantial increase in performance is observed, and this narrow scope of direction hinders advancement in the field. Indeed, Oeding et al. [60•] recently identified a similar trend among 55 reviewed studies concerning computer vision tasks within orthopedic sports medicine. They found that 65% of studies developed computer vision applications aimed at detecting similar pathologies, and although diagnostic performance demonstrated excellent internal validity, only 7% of studies were externally validated. These findings suggest that, similar to ML studies within this subspecialty, current models have a narrow conceptual framework and limited clinical applicability, and they require improved methodological conduct (such as external validation and prospective assessment) to confer important clinical utility. This review highlights key knowledge gaps in the current literature that should be used to drive future research and create holistic AI strategies. Prior to initiating the development of additional clinical prognostic models, collaborators should determine several important criteria: (1) is the metric being predicted clinically relevant, and is there a current or future need for a prediction model by providers or payers? (2) where will patient data be obtained, what limitations and assumptions exist in the data, and what methods will be implemented to address these shortcomings? (3) what ethical and statistical biases may be present in the data, and how will these be accounted for? (4) is it feasible to update and reinforce model training? (5) is it possible to externally validate the model? (6) what are the plans and infrastructure for model deployment, and what resources are available to facilitate and maintain deployment? Failure to consider, plan for, and address these considerations will result in the continued proliferation of ML models without potential for clinical utility.

As with most of the current literature pertaining to prediction models for CSO achievement in orthopedic surgery, the spine literature also demonstrated variable transparency, reporting conduct, and performance evaluation. The scope of applications for ML was wide, though a high degree of overlap was observed in PROM selection. Furthermore, all studies investigated the ability to optimize prediction of the MCID, whereas no studies predicted the PASS or SCB. This is a recurrent theme identified throughout this review and a concern raised by those performing clinical outcomes research. Indeed, Rossi et al. [61] caution that reliance on the MCID for clinical significance may not be sufficient, as it neither addresses whether patients are satisfied nor whether the perceived benefit gained is substantial. Future studies should shift their focus towards clinical significance metrics that aim to capture higher thresholds of improvement. Additionally, in the spine literature, internal validity for various use cases ranged from poor to excellent in terms of AUROC, with variable reporting of additional performance metrics and only two studies pursuing external validation. Future research is needed to accomplish external validation of these models, while increased awareness of reporting guidelines and consensus on model development, determination of MCID thresholds, and PROM use are imperative to confer meaningful clinical applications.

Interestingly, musculoskeletal subspecialties outside of sports medicine, spine, and TJA have not experienced the same recent period of heightened expectations, hype, and academic productivity. It is unclear why these subspecialties have not devoted similar effort to investigating the utility of AI for predicting CSO achievement. It is plausible that they lack the same access to readily available registries that can facilitate AI research, or that their patient populations are more difficult to evaluate using PROMs [62]. For example, subspecialties such as orthopedic trauma surgery, oncology, and pediatrics are challenged with treating diverse patient populations, conditions, and disease presentations. In pediatric orthopedics, PROM collection may be limited by patients' inability to complete PROMs or difficulty conveying important perceptions of their health state given their young age [62]. In orthopedic oncology, the true effect of orthopedic treatment interventions may be confounded by prior or ongoing medical treatments, palliative circumstances, or unique oncologic pathologies, among many other factors. This heterogeneity presents a considerable barrier to creating a homogenous registry with data large enough to facilitate AI studies.

Among all disciplines and studies reviewed in this synthesis, substantial methodological limitations were observed, which may stem from a mismatch between the excitement surrounding a new area of AI research and a lack of domain expertise in model development and data analysis. Indeed, the accessibility of coding and software has enabled those without formal data science backgrounds to conduct AI analyses, which may predispose some authors to overlook key model components or fail to adhere to model reporting guidelines. Authors should be encouraged to collaborate with members with complementary skill sets when performing AI research so that appropriate scientific conduct is not omitted and clinically relevant use cases are pursued. Unfortunately, current prognostic clinical prediction models developed for CSO are narrow in scope and lack the foundation to be deployed for meaningful clinical use.

Conclusion

A proliferation of prognostic clinical prediction models for achievement of CSO after musculoskeletal treatment interventions has been observed over a short period of time, with several models demonstrating high internal performance. However, there remains a mismatch between the considerable production of new literature and hype surrounding the topic on the one hand and translational capability on the other: published prediction models often demonstrate applications of limited clinical meaning, poor methodological conduct, and no viable path to deployment in clinical settings. Furthermore, the scope of applications is often narrow, with a considerable proportion of studies attempting to predict the MCID, the lowest bar of clinical significance. Future studies are required to expand the scope and clinical importance of applications, while there is a critical need for expert consensus and formal AI guidelines to avoid perpetuating models with poor methodological conduct and limited potential for clinical impact.

Author contributions

KNK: supervision, methodology, writing of initial manuscript, revision of initial manuscript. SJ: writing of initial manuscript, revision of initial manuscript. JR: writing of initial manuscript, revision of initial manuscript. EL: writing of initial manuscript, revision of initial manuscript.

Declarations

Human and Animal Rights and Informed Consent

This article does not contain any studies with human or animal subjects performed by any of the authors.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Papers of particular interest, published recently, have been highlighted as: • Of importance •• Of major importance

1. Pasqualini I, Piuzzi NS. New CMS policy on the mandatory collection of patient-reported outcome measures for total hip and knee arthroplasty by 2027: what orthopaedic surgeons should know. J Bone Joint Surg Am. 2024.
2. Makhni EC. Meaningful clinical applications of patient-reported outcome measures in orthopaedics. J Bone Joint Surg Am. 2021;103(1):84–91. doi: 10.2106/JBJS.20.00624.
3. Porter ME. What is value in health care? N Engl J Med. 2010;363(26):2477–2481. doi: 10.1056/NEJMp1011024.
4. Makhni EC, Baumhauer JF, Ayers D, Bozic KJ. Patient-reported outcome measures: how and why they are collected. Instr Course Lect. 2019;68:675–680.
5. Chung AS, Copay AG, Olmscheid N, Campbell D, Walker JB, Chutkan N. Minimum clinically important difference: current trends in the spine literature. Spine (Phila Pa 1976). 2017;42(14):1096–105.
6. Copay AG, Chung AS, Eyberg B, Olmscheid N, Chutkan N, Spangehl MJ. Minimum clinically important difference: current trends in the orthopaedic literature, Part I: upper extremity: a systematic review. JBJS Rev. 2018;6(9):e1. doi: 10.2106/JBJS.RVW.17.00159.
7. Copay AG, Eyberg B, Chung AS, Zurcher KS, Chutkan N, Spangehl MJ. Minimum clinically important difference: current trends in the orthopaedic literature, Part II: lower extremity: a systematic review. JBJS Rev. 2018;6(9):e2. doi: 10.2106/JBJS.RVW.17.00160.
8. Baumhauer JF, Bozic KJ. Value-based healthcare: patient-reported outcomes in clinical decision making. Clin Orthop Relat Res. 2016;474(6):1375–1378. doi: 10.1007/s11999-016-4813-4.
9. Kunze KN, Bart JA, Ahmad M, Nho SJ, Chahla J. Large heterogeneity among minimal clinically important differences for hip arthroscopy outcomes: a systematic review of reporting trends and quantification methods. Arthroscopy. 2021;37(3):1028–37.e6. doi: 10.1016/j.arthro.2020.10.050.
10. Menendez ME, Sudah SY, Cohn MR, Narbona P, Ladermann A, Barth J, et al. Defining minimal clinically important difference and patient acceptable symptom state after the Latarjet procedure. Am J Sports Med. 2022;50(10):2761–2766. doi: 10.1177/03635465221107939.
11. Bernstein DN, Karhade AV, Bono CM, Schwab JH, Harris MB, Tobert DG. Sociodemographic factors are associated with patient-reported outcome measure completion in orthopaedic surgery: an analysis of completion rates and determinants among new patients. JB JS Open Access. 2022;7(3):e22.00026.
12. Jolback P, Rolfson O, Mohaddes M, Nemes S, Karrholm J, Garellick G, et al. Does surgeon experience affect patient-reported outcomes 1 year after primary total hip arthroplasty? Acta Orthop. 2018;89(3):265–271. doi: 10.1080/17453674.2018.1444300.
13. Langlotz CP, Allen B, Erickson BJ, Kalpathy-Cramer J, Bigelow K, Cook TS, et al. A roadmap for foundational research on artificial intelligence in medical imaging: from the 2018 NIH/RSNA/ACR/The Academy Workshop. Radiology. 2019;291(3):781–791. doi: 10.1148/radiol.2019190613.
14. Padash S, Mickley JP, Vera Garcia DV, Nugen F, Khosravi B, Erickson BJ, et al. An overview of machine learning in orthopedic surgery: an educational paper. J Arthroplasty. 2023;38(10):1938–1942. doi: 10.1016/j.arth.2023.08.043.
15. Langenberger B, Schrednitzki D, Halder AM, Busse R, Pross CM. Predicting whether patients will achieve minimal clinically important differences following hip or knee arthroplasty. Bone Joint Res. 2023;12(9):512–521. doi: 10.1302/2046-3758.129.BJR-2023-0070.R2.
16. Fontana MA, Lyman S, Sarker GK, Padgett DE, MacLean CH. Can machine learning algorithms predict which patients will achieve minimally clinically important differences from total joint arthroplasty? Clin Orthop Relat Res. 2019;477(6):1267–1279. doi: 10.1097/CORR.0000000000000687.
17. Nwachukwu BU, Beck EC, Lee EK, Cancienne JM, Waterman BR, Paul K, et al. Application of machine learning for predicting clinically meaningful outcome after arthroscopic femoroacetabular impingement surgery. Am J Sports Med. 2020;48(2):415–423. doi: 10.1177/0363546519892905.
18. Kunze KN, Krivicich LM, Clapp IM, Bodendorfer BM, Nwachukwu BU, Chahla J, et al. Machine learning algorithms predict achievement of clinically significant outcomes after orthopaedic surgery: a systematic review. Arthroscopy. 2022;38(6):2090–2105. doi: 10.1016/j.arthro.2021.12.030.
19. El-Othmani MM, Zalikha AK, Shah RP. Comparative analysis of the ability of machine learning models in predicting in-hospital postoperative outcomes after total hip arthroplasty. J Am Acad Orthop Surg. 2022;30(20):e1337–e1347. doi: 10.5435/JAAOS-D-21-00987.
20. Rouzrokh P, Mickley JP, Khosravi B, Faghani S, Moassefi M, Schulz WR, et al. THA-AID: deep learning tool for total hip arthroplasty automatic implant detection with uncertainty and outlier quantification. J Arthroplasty. 2023;9(4):966–73.
21. Khosravi B, Rouzrokh P, Mickley JP, Faghani S, Larson AN, Garner HW, et al. Creating high fidelity synthetic pelvis radiographs using generative adversarial networks: unlocking the potential of deep learning models without patient privacy concerns. J Arthroplasty. 2023;38(10):2037–43.e1. doi: 10.1016/j.arth.2022.12.013.
22. Jang SJ, Fontana MA, Kunze KN, Anderson CG, Sculco TP, Mayman DJ, et al. An interpretable machine learning model for predicting 10-year total hip arthroplasty risk. J Arthroplasty. 2023;38(7S):S44–50.
23. Huber M, Kurz C, Leidl R. Predicting patient-reported outcomes following hip and knee replacement surgery using supervised machine learning. BMC Med Inform Decis Mak. 2019;19(1):3. doi: 10.1186/s12911-018-0731-6.
24. Harris AHS, Kuo AC, Bowe TR, Manfredi L, Lalani NF, Giori NJ. Can machine learning methods produce accurate and easy-to-use preoperative prediction models of one-year improvements in pain and functioning after knee arthroplasty? J Arthroplasty. 2021;36(1):112–7.e6. doi: 10.1016/j.arth.2020.07.026.
25. Katakam A, Karhade AV, Collins A, Shin D, Bragdon C, Chen AF, et al. Development of machine learning algorithms to predict achievement of minimal clinically important difference for the KOOS-PS following total knee arthroplasty. J Orthop Res. 2022;40(4):808–815. doi: 10.1002/jor.25125.
26. Zhang S, Lau BPH, Ng YH, Wang X, Chua W. Machine learning algorithms do not outperform preoperative thresholds in predicting clinically meaningful improvements after total knee arthroplasty. Knee Surg Sports Traumatol Arthrosc. 2022;30(8):2624–2630. doi: 10.1007/s00167-021-06642-4.
27. Kunze KN, Karhade AV, Sadauskas AJ, Schwab JH, Levine BR. Development of machine learning algorithms to predict clinically meaningful improvement for the patient-reported health state after total hip arthroplasty. J Arthroplasty. 2020;35(8):2119–2123. doi: 10.1016/j.arth.2020.03.019.
28. Bourne RB, Chesworth BM, Davis AM, Mahomed NN, Charron KD. Patient satisfaction after total knee arthroplasty: who is satisfied and who is not? Clin Orthop Relat Res. 2010;468(1):57–63. doi: 10.1007/s11999-009-1119-9.
29. Farooq H, Deckard ER, Ziemba-Davis M, Madsen A, Meneghini RM. Predictors of patient satisfaction following primary total knee arthroplasty: results from a traditional statistical model and a machine learning algorithm. J Arthroplasty. 2020;35(11):3123–3130. doi: 10.1016/j.arth.2020.05.077.
  • 29.Farooq H, Deckard ER, Ziemba-Davis M, Madsen A, Meneghini RM. Predictors of patient satisfaction following primary total knee arthroplasty: results from a traditional statistical model and a machine learning algorithm. J Arthroplasty. 2020;35(11):3123–3130. doi: 10.1016/j.arth.2020.05.077. [DOI] [PubMed] [Google Scholar]
  • 30.Pettit MH, Hickman SHM, Malviya A, Khanduja V. Development of machine-learning algorithms to predict attainment of minimal clinically important difference after hip arthroscopy for femoroacetabular impingement yield fair performance and limited clinical utility. Arthroscopy. 2023;40(4):1153–63. [DOI] [PubMed]
  • 31.Ramkumar PN, Karnuta JM, Haeberle HS, Sullivan SW, Nawabi DH, Ranawat AS, et al. Radiographic indices are not predictive of clinical outcomes among 1735 patients indicated for hip arthroscopic surgery: a machine learning analysis. Am J Sports Med. 2020;48(12):2910–2918. doi: 10.1177/0363546520950743. [DOI] [PubMed] [Google Scholar]
  • 32.Kunze KN, Polce EM, Rasio J, Nho SJ. Machine learning algorithms predict clinically significant improvements in satisfaction after hip arthroscopy. Arthroscopy. 2021;37(4):1143–1151. doi: 10.1016/j.arthro.2020.11.027. [DOI] [PubMed] [Google Scholar]
  • 33.Kunze KN, Polce EM, Nwachukwu BU, Chahla J, Nho SJ. Development and internal validation of supervised machine learning algorithms for predicting clinically significant functional improvement in a mixed population of primary hip arthroscopy. Arthroscopy. 2021;37(5):1488–1497. doi: 10.1016/j.arthro.2021.01.005. [DOI] [PubMed] [Google Scholar]
  • 34.Kunze KN, Polce EM, Clapp IM, Alter T, Nho SJ. Association between preoperative patient factors and clinically meaningful outcomes after hip arthroscopy for femoroacetabular impingement syndrome: a machine learning analysis. Am J Sports Med. 2022;50(3):746–756. doi: 10.1177/03635465211067546. [DOI] [PubMed] [Google Scholar]
  • 35.Kunze KN, Polce EM, Clapp I, Nwachukwu BU, Chahla J, Nho SJ. Machine learning algorithms predict functional improvement after hip arthroscopy for femoroacetabular impingement syndrome in athletes. J Bone Joint Surg Am. 2021;103(12):1055–1062. doi: 10.2106/JBJS.20.01640. [DOI] [PubMed] [Google Scholar]
  • 36.Kunze KN, Polce EM, Ranawat AS, Randsborg PH, Williams RJ, 3rd, Allen AA, et al. Application of machine learning algorithms to predict clinically meaningful improvement after arthroscopic anterior cruciate ligament reconstruction. Orthop J Sports Med. 2021;9(10):23259671211046575. doi: 10.1177/23259671211046575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Ye Z, Zhang T, Wu C, Qiao Y, Su W, Chen J, et al. Predicting the objective and subjective clinical outcomes of anterior cruciate ligament reconstruction: a machine learning analysis of 432 patients. Am J Sports Med. 2022;50(14):3786–3795. doi: 10.1177/03635465221129870. [DOI] [PubMed] [Google Scholar]
  • 38.Martin RK, Wastvedt S, Pareek A, Persson A, Visnes H, Fenstad AM, et al. Predicting subjective failure of ACL reconstruction: a machine learning analysis of the Norwegian Knee Ligament Register and patient reported outcomes. J ISAKOS. 2022;7(3):1–9. doi: 10.1016/j.jisako.2021.12.005. [DOI] [PubMed] [Google Scholar]
  • 39.Kumar V, Roche C, Overman S, Simovitch R, Flurin PH, Wright T, et al. Using machine learning to predict clinical outcomes after shoulder arthroplasty with a minimal feature set. J Shoulder Elbow Surg. 2021;30(5):e225–e236. doi: 10.1016/j.jse.2020.07.042. [DOI] [PubMed] [Google Scholar]
  • 40.Kumar V, Roche C, Overman S, Simovitch R, Flurin PH, Wright T, et al. What is the accuracy of three different machine learning techniques to predict clinical outcomes after shoulder arthroplasty? Clin Orthop Relat Res. 2020;478(10):2351–2363. doi: 10.1097/CORR.0000000000001263. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Ramkumar PN, Karnuta JM, Haeberle HS, Owusu-Akyaw KA, Warner TS, Rodeo SA, et al. Association between preoperative mental health and clinically meaningful outcomes after osteochondral allograft for cartilage defects of the knee: a machine learning analysis. Am J Sports Med. 2021;49(4):948–957. doi: 10.1177/0363546520988021. [DOI] [PubMed] [Google Scholar]
  • 42.Ramkumar PN, Karnuta JM, Haeberle HS, Rodeo SA, Nwachukwu BU, Williams RJ. Effect of preoperative imaging and patient factors on clinically meaningful outcomes and quality of life after osteochondral allograft transplantation: a machine learning analysis of cartilage defects of the knee. Am J Sports Med. 2021;49(8):2177–2186. doi: 10.1177/03635465211015179. [DOI] [PubMed] [Google Scholar]
  • 43.Alaiti RK, Vallio CS, Assunção JH, Andrade e Silva FBd, Gracitelli MEC, Neto AAF, et al. Using machine learning to predict nonachievement of clinically significant outcomes after rotator cuff repair. Orthop J Sports Med. 2023;11(10):23259671231206180. [DOI] [PMC free article] [PubMed]
  • 44.Potty AG, Potty ASR, Maffulli N, Blumenschein LA, Ganta D, Mistovich RJ, et al. Approaching artificial intelligence in orthopaedics: predictive analytics and machine learning to prognosticate arthroscopic rotator cuff surgical outcomes. J Clin Med. 2023;12(6). [DOI] [PMC free article] [PubMed]
  • 45.Kunze KN, Kaidi A, Madjarova S, Polce EM, Ranawat AS, Nawabi DH, et al. External validation of a machine learning algorithm for predicting clinically meaningful functional improvement after arthroscopic hip preservation surgery. Am J Sports Med. 2022;50(13):3593–3599. doi: 10.1177/03635465221124275. [DOI] [PubMed] [Google Scholar]
  • 46.Merali ZG, Witiw CD, Badhiwala JH, Wilson JR, Fehlings MG. Using a machine learning approach to predict outcome after surgery for degenerative cervical myelopathy. PLoS ONE. 2019;14(4):e0215133. doi: 10.1371/journal.pone.0215133. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Khan O, Badhiwala JH, Witiw CD, Wilson JR, Fehlings MG. Machine learning algorithms for prediction of health-related quality-of-life after surgery for mild degenerative cervical myelopathy. Spine J. 2021;21(10):1659–1669. doi: 10.1016/j.spinee.2020.02.003. [DOI] [PubMed] [Google Scholar]
  • 48.Zhang JK, Jayasekera D, Javeed S, Greenberg JK, Blum J, Dibble CF, et al. Diffusion basis spectrum imaging predicts long-term clinical outcomes following surgery in cervical spondylotic myelopathy. Spine J. 2023;23(4):504–512. doi: 10.1016/j.spinee.2022.12.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Park C, Mummaneni PV, Gottfried ON, Shaffrey CI, Tang AJ, Bisson EF, et al. Which supervised machine learning algorithm can best predict achievement of minimum clinically important difference in neck pain after surgery in patients with cervical myelopathy? A QOD study. Neurosurg Focus. 2023;54(6):E5. doi: 10.3171/2023.3.FOCUS2372. [DOI] [PubMed] [Google Scholar]
  • 50.Staartjes VE, de Wispelaere MP, Vandertop WP, Schroder ML. Deep learning-based preoperative predictive analytics for patient-reported outcomes following lumbar discectomy: feasibility of center-specific modeling. Spine J. 2019;19(5):853–861. doi: 10.1016/j.spinee.2018.11.009. [DOI] [PubMed] [Google Scholar]
  • 51.Pedersen CF, Andersen MO, Carreon LY, Eiskjaer S. Applied machine learning for spine surgeons: predicting outcome for patients undergoing treatment for lumbar disc herniation using PRO data. Global Spine J. 2022;12(5):866–876. doi: 10.1177/2192568220967643. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Berjano P, Langella F, Ventriglia L, Compagnone D, Barletta P, Huber D, et al. The influence of baseline clinical status and surgical strategy on early good to excellent result in spinal lumbar arthrodesis: a machine learning approach. J Pers Med. 2021;11(12). [DOI] [PMC free article] [PubMed]
  • 53.Staartjes VE, Stumpo V, Ricciardi L, Maldaner N, Eversdijk HAJ, Vieli M, et al. FUSE-ML: development and external validation of a clinical prediction model for mid-term outcomes after lumbar spinal fusion for degenerative disease. Eur Spine J. 2022;31(10):2629–2638. doi: 10.1007/s00586-022-07135-9. [DOI] [PubMed] [Google Scholar]
  • 54.Karhade AV, Fogel HA, Cha TD, Hershman SH, Doorly TP, Kang JD, et al. Development of prediction models for clinically meaningful improvement in PROMIS scores after lumbar decompression. Spine J. 2021;21(3):397–404. doi: 10.1016/j.spinee.2020.10.026. [DOI] [PubMed] [Google Scholar]
  • 55.Halicka M, Wilby M, Duarte R, Brown C. Predicting patient-reported outcomes following lumbar spine surgery: development and external validation of multivariable prediction models. BMC Musculoskelet Disord. 2023;24(1):333. doi: 10.1186/s12891-023-06446-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Siccoli A, de Wispelaere MP, Schroder ML, Staartjes VE. Machine learning-based preoperative predictive analytics for lumbar spinal stenosis. Neurosurg Focus. 2019;46(5):E5. doi: 10.3171/2019.2.FOCUS18723. [DOI] [PubMed] [Google Scholar]
  • 57.Brinkman N, Shah R, Doornberg J, Ring D, Gwilym S, Jayakumar P. Artificial neural networks outperform linear regression in estimating 9-month patient-reported outcomes after upper extremity fractures with increasing number of variables. OTA Int. 2023;6(5 Suppl):e284. doi: 10.1097/OI9.0000000000000284. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Loos NL, Hoogendam L, Souer JS, Slijper HP, Andrinopoulou ER, Coppieters MW, et al. Machine learning can be used to predict function but not pain after surgery for thumb carpometacarpal osteoarthritis. Clin Orthop Relat Res. 2022;480(7):1271–1284. doi: 10.1097/CORR.0000000000002105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Harrison CJ, Geoghegan L, Sidey-Gibbons CJ, Stirling PHC, McEachan JE, Rodrigues JN. Developing machine learning algorithms to support patient-centered, value-based carpal tunnel decompression surgery. Plast Reconstr Surg Glob Open. 2022;10(4):e4279. doi: 10.1097/GOX.0000000000004279. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.•Oeding JF, Krych AJ, Pearle AD, Kelly BT, Kunze KN. Medical imaging applications developed using artificial intelligence demonstrate high internal validity yet are limited in scope and lack external validation. Arthroscopy. 2024. This study is of importance to this topic as it parallels the themes identified in this review pertaining to clinically significant outcome achievement: repetitive use cases of current statistical methods and methodological shortcomings are currently impeding progress in this domain. [DOI] [PubMed]
  • 61.Rossi MJ, Brand JC, Lubowitz JH. Minimally clinically important difference (MCID) is a low bar. Arthroscopy. 2023;39(2):139–141. doi: 10.1016/j.arthro.2022.11.001. [DOI] [PubMed] [Google Scholar]
  • 62.Kunze KN, Madjarova S, Jayakumar P, Nwachukwu BU. Challenges and opportunities for the use of patient-reported outcome measures in orthopaedic pediatric and sports medicine surgery. J Am Acad Orthop Surg. 2023;31(20):e898–e905. doi: 10.5435/JAAOS-D-23-00087. [DOI] [PubMed] [Google Scholar]

Articles from Current Reviews in Musculoskeletal Medicine are provided here courtesy of Humana Press