Abstract
Background
Machine learning is a subset of artificial intelligence that uses algorithmic modeling to progressively learn and create predictive models. Clinical application of machine learning can aid physicians by identifying risk factors and predicting patient outcomes.
Aims
The aim of this study was to compare patient-specific and situational perioperative variables, through optimized machine learning models, in predicting postoperative outcomes.
Methods
Data from the 2016–2017 National Inpatient Sample (NIS) were used to identify 177,442 discharges undergoing primary total hip arthroplasty, which were included in the training, testing, and validation of 10 machine learning models. Fifteen predictive variables, consisting of 8 patient-specific and 7 situational variables, were utilized to predict 3 outcome variables: length of stay, discharge disposition, and mortality. The machine learning models were assessed for responsiveness, via area under the receiver operating characteristic curve, and for reliability (accuracy).
Results
For all outcomes, Linear Support Vector Machine had the highest responsiveness among all models when using all variables. When utilizing patient-specific variables only, responsiveness of the top 3 models ranged between 0.639 and 0.717 for length of stay, 0.703–0.786 for discharge disposition, and 0.887–0.952 for mortality. The top 3 models utilizing situational variables only produced responsiveness of 0.552–0.589 for length of stay, 0.543–0.574 for discharge disposition, and 0.469–0.536 for mortality.
Conclusions
Linear Support Vector Machine was the most responsive machine learning model of the 10 algorithms trained, while Decision List was the most reliable. Responsiveness was consistently higher with patient-specific variables than with situational variables, emphasizing the predictive capacity and value of patient-specific variables. The current practice in the machine learning literature, which generally deploys a single model, is suboptimal for developing optimized models for application in clinical practice, as excluding other algorithms may forgo potentially more reliable and responsive models.
Level of Evidence III.
1. Introduction
1.1. Background
Osteoarthritis of the hip joint is a progressive disease that substantially impacts quality of life through pain and functional limitation, with the definitive treatment being total hip arthroplasty (THA).1,2 Total joint arthroplasty (TJA), including knee and hip arthroplasty, accounted for the highest number of procedures in the United States in 2018, and hence collectively constituted the largest disbursement for Medicare.1 Given the projected increase of the arthritis burden to a total of 78.4 million adults aged 18 years or older by 2040, there has been a marked shift in focus toward establishing value-based care models for TJA.3 In 2016, the Comprehensive Care for Joint Replacement (CJR) model was developed in line with this value-based philosophy, aiming to incentivize provider-driven savings by financially rewarding care delivered at a cost lower than historical regional averages across a 90-day care episode.4 Although such models ultimately aim at improving quality in a cost-conscious environment, they have been accompanied by concerns about limiting patient access to care.5 As providers are penalized for postoperative complications within these reimbursement models, they are indirectly incentivized to steer higher-risk patients away from potentially life-altering TJA.5 These concerns drove the call for patient-specific payment models that account for patient-specific risk factors in allocating reimbursement. More recently, studies have implemented machine learning (ML) algorithms to forecast outcomes, with the potential for use in fairer patient-specific reimbursement models.6
ML, a subset of artificial intelligence (AI), has recently gained traction in the medical literature as a robust tool that uses algorithms to create predictive models that progressively learn and improve through experience.7 ML can utilize previously trained or new data to predict and estimate outcomes with substantial accuracy by focusing on generalizable patterns.8 Through a recurrent and systematic approach to generating “learned” decisions, these algorithms have the potential to play an important role in the clinical setting, aiding physicians in diagnosing, classifying, and identifying risk factors that can impact postoperative outcomes.9 Specifically within the orthopaedic literature, previous studies have applied ML to predict mortality, readmission rates, complication rates, length of stay (LOS), and patient-reported postoperative outcomes.10, 11, 12, 13, 14 Despite these important early efforts, there has been limited delineation and investigation of the predictive capacity of various ML models using patient-specific risk factors compared to situational risk factors.
The aim of this study was to compare the capacity of patient-specific and situational perioperative variables in predicting postoperative outcomes, when utilized with optimal machine learning models. To achieve this aim, we first 1) developed, internally validated, and compared the performance of ten different ML algorithmic models using a total of fifteen available variables, and then 2) deployed the top three performing algorithmic methods to compare performance when using patient-specific versus situational variables. We hypothesized that patient-specific perioperative variables have superior predictive capacity compared to situational variables.
2. Methods
2.1. Predictive and outcome variables selection
All available variables in the National Inpatient Sample (NIS) database were considered for inclusion in this study. These variables are independently measured by the NIS and were obtained directly from the database. Fifteen predictive variables were initially included in building and assessing ten different ML models. The second step of the study subsequently divided the variables into 8 patient-specific variables (Age, Sex, Race, Total number of diagnoses, All Patient Refined Diagnosis Related Groups (APRDRG) Severity of illness, APRDRG Mortality risk, Income zip quartile, and Primary payer) and 7 situational variables (Location, Month of the procedure, Hospital Division, Hospital Region, Hospital Teaching status, Hospital Bed size, and Hospital Control). The outcome variables were in-hospital mortality (binary yes/no outcome), discharge disposition (home vs facility), and length of stay (≤2 vs >2 days) among primary THA recipients. The LOS cutoff was determined by calculating the average LOS for the entire cohort and taking the closest lower integer to create the binary outcome. Patient discharge destination was coded as either home (discharge to home or home health care) or facility (any other disposition to a facility, such as a skilled nursing facility or inpatient rehabilitation center). Records missing information on any of the 15 variables were removed from the study sample.
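The authors implemented this preprocessing in SPSS Modeler; purely as an illustrative sketch, the same dichotomization could be expressed in Python with pandas as follows, where the file and column names (`nis_tha.csv`, `los`, `disposition`, `died`) are hypothetical stand-ins for the actual NIS fields:

```python
import pandas as pd

# Hypothetical extract of NIS discharge records; the file and column
# names are illustrative, not the actual NIS variable names.
df = pd.read_csv("nis_tha.csv")

# Complete-case analysis: drop records missing any of the 15 predictive variables.
df = df.dropna()

# LOS: the cohort's average LOS rounds down to the closest lower integer (2),
# giving a binary outcome of <=2 vs >2 days.
df["los_gt2"] = (df["los"] > 2).astype(int)

# Discharge disposition: home (routine discharge or home health care)
# vs any facility (e.g., skilled nursing, inpatient rehabilitation).
df["facility"] = (~df["disposition"].isin(["home", "home_health"])).astype(int)

# In-hospital mortality is already a binary yes/no outcome.
df["mortality"] = df["died"].astype(int)
```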
2.2. Data source and study sample
This retrospective analysis and ML model development utilized the NIS, a public database of more than 7 million inpatient stays within the US, for the years 2016 and 2017. The International Classification of Disease, Tenth Revision, Procedure Coding System (ICD-10-PCS) code for THA was used to identify the study population, as this was the coding system indexed within the database during the study period. Patients undergoing revision or conversion THA, patients younger than 18 years of age, and those missing age information were excluded from the study population. This approach yielded a total of 177,442 discharges that were included in the current study.
SPSS Modeler (IBM, Armonk, NY, USA), a data mining and predictive analytics software, was utilized to develop the models based on commonly used ML techniques. The algorithmic methods implemented included Random Forest (RF), Neural Network (NN), Extreme Gradient Boost Tree (XGBoost Tree), Extreme Gradient Boost Linear (XGBoost Linear), Linear Support Vector Machine (LSVM), Chi-square Automatic Interaction Detector (CHAID), Decision List, Linear Discriminant Analysis (Discriminant), Logistic Regression, and Bayesian Networks. These specific ML techniques were selected as they are well-studied, commonly used ML methods in the medical literature and are distinct in their pattern recognition methods. A description of these methods is provided in Table 2.6, 7, 8, 9
Table 2.
Different machine learning models and their descriptions.
| Machine Learning Models | Description |
|---|---|
| Random Forest (RF) | Qualitative algorithm using individual decision trees to generate a collective prediction. The strength of this model is based on randomness, utilizing methods such as bootstrapping (creating individual datasets through sampling) and bootstrap aggregating, otherwise known as bagging, to shuffle the variables on which each tree is trained. The algorithm works in a voting manner, so that the collective decision is supported by the number of individual trees that cast a vote. |
| Neural Network (NN) | Network based on the working layers of neurons programmed to interpret data based on the channels and their corresponding weight in the forward propagation of decision making. Backpropagation trains the neurons by comparing the output with the correct output to generate the appropriate weight of each channel. |
| Extreme Gradient Boost Tree (XGBoost Tree) | Expands on existing tree algorithms by further training each tree on smaller subsets of the data. The integration of small-batch training strengthens each individual tree, while the gradient boosting process uses the collective output from the trees. Gradient boosting sequentially minimizes a loss function to build the next generation of trees; this continues until the boosted ensemble can no longer improve upon the previous generation. |
| Extreme Gradient Boost Linear (XGBoost Linear) | Similar to XGBoost Tree, but most useful for smaller datasets or data with low noise. The algorithm acts as a linear solution model, with gradient boosting building each successive rule until a rule can no longer improve upon the previous generation. Its speed is generally faster than that of XGBoost Tree, but accuracy decreases if noise is high. |
| Linear Support Vector Machine (LSVM) | Classifies a dataset using a regression-style algorithm suited to small learning datasets. The model aims to divide the dataset into two classes: each data point represents a distinct point in N-dimensional space, and the LSVM maximizes the distance between the data points and the separating hyperplane to determine the margin and predict outcomes. |
| Chi-square Automatic Interaction Detector (CHAID) | Model based on the statistical differences between parent and child nodes given qualitative descriptors. Development requires large datasets to determine how best to identify patterns and generate accurate predictions. |
| Decision List | Boolean function model based on “if-then-else” statements, with all subsets having either a true or false functional value; this is also known as an ordered rule set. Rules in this form are usually learned with a covering algorithm, learning one rule at a time. The rules are tried in order; if no rule applies, a default rule is invoked. |
| Linear Discriminant Analysis (Discriminant) | Calculates summary statistics of the data by means and standard deviations. Using a training data source, new predictions are made as data are added, and class labels are assigned based on each input feature. This method assumes input variables are normally distributed and share the same overall variance. |
| Logistic Regression | Similar to other linear regression models, but solves for classification rather than regression. The input data yield a discrete binary probability based on the independent variables of a given set. The benefit of logistic regression is its ability to classify observations and determine the most efficient observation group for classification, which can then be used to estimate the probability that new data fit into that classification. |
| Bayesian Networks | Probabilistic graphical model of machine learning. Bayesian networks use a data source to identify probabilities for predictions and anomaly detection. The data are computed into nodes representing the variables, with links between nodes indicating their influence on one another. These links are part of the structural learning and are identified automatically from the data. The resulting network can be represented graphically, making the relationships easy to understand following calculation. |
A new algorithm was developed for each technique and outcome variable. The overall dataset was split into three separate groups: a training, a testing, and a validation cohort, distributed as 64% for training, 16% for testing, and 20% for model validation. In total, 80% of the data was used to train and test the models, while the remaining 20% was used to validate the model parameters. These mutually exclusive sets were used to train, test, and then validate each predictive algorithm without leakage between the datasets.
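As a minimal sketch of this 64/16/20 split, assuming scikit-learn and a feature matrix `X` with binary labels `y` (the stratification and fixed seed are added for reproducibility and are not described in the source):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the sample as the validation cohort.
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Split the remaining 80% into training and testing cohorts:
# 20% of the 80% development set equals 16% of the full sample.
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.20, stratify=y_dev, random_state=0
)
```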
When predicting outcomes with low incidence rates, classifiers exhibit a bias against the minority outcome, leading to an imbalance in predictive capacity.15 To avoid such implications when imbalanced outcome frequencies were encountered, the Synthetic Minority Oversampling Technique (SMOTE) was deployed to resample the training set before the ML classifiers were trained.16,17 Although SMOTE is a validated measure for minimizing the impact of this bias, the classifier's predictive ability for minority outcomes, while improved, remains imperfect.
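A minimal sketch of this resampling step, assuming the open-source imbalanced-learn package rather than the authors' SPSS Modeler pipeline; note that SMOTE is applied to the training set only, so synthetic records never reach the testing or validation cohorts:

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority outcome (e.g., in-hospital mortality, 0.1% incidence)
# by synthesizing new minority-class points between existing neighbors.
smote = SMOTE(random_state=0)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
```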
Each ML model's responsiveness and reliability were assessed through comparative analysis across all models. Reliability is defined as overall performance accuracy, quantified as the percentage of correct predictions achieved by the model. Responsiveness is the successful prediction of variable outcomes, quantified as the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. AUCROC measurements were generated by plotting true positive rates against false positive rates during the training, testing, and validation phases of each model; the AUCROC aggregates the performance of a model in predicting the specified outcome. For this study, responsiveness was defined as excellent for an AUCROC of 0.90–1.00, good for 0.80–0.90, fair for 0.70–0.80, poor for 0.60–0.70, and fail for 0.50–0.60. Fig. 1 details the AUCROC during training, testing, and validation of a machine learning model.
Fig. 1.
Training, testing and validation of LSVM
Legend: LSVM machine learning algorithm AUCROC curve during training, testing, and validation phases in predicting discharge, mortality, and LOS.
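A sketch of how both metrics and the qualitative AUCROC bands defined above could be computed, assuming scikit-learn and a fitted classifier `model` that exposes `predict` and `predict_proba` (names hypothetical):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def assess(model, X, y):
    """Reliability = accuracy of hard predictions; responsiveness = AUCROC of scores."""
    reliability = accuracy_score(y, model.predict(X))
    responsiveness = roc_auc_score(y, model.predict_proba(X)[:, 1])
    return reliability, responsiveness

def grade(auc):
    """Map an AUCROC value to the study's qualitative bands."""
    for cutoff, label in [(0.90, "excellent"), (0.80, "good"),
                          (0.70, "fair"), (0.60, "poor"), (0.50, "fail")]:
        if auc >= cutoff:
            return label
    return "worse than chance"
```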
All 10 ML models were trained, tested, and validated to assess responsiveness and reliability. The first step of the study analyzed and compared the predictive performance of these ML models in identifying the outcome variables after primary THA: in-hospital mortality, discharge disposition, and LOS. The validation phase, utilizing 20% of the sample, was considered the main assessment metric and was quantified with responsiveness and reliability. Once the development and comparative assessment of the different ML models was completed, the three algorithmic methodologies with the highest accuracy for each outcome variable were identified. The second step of the study consisted of developing and comparing the predictive performance of the 3 top ML methodologies for the same set of outcome measures while using patient-specific and situational variables separately. All statistical analyses were performed with SPSS Modeler version 18.2.2 (IBM, Armonk, NY, USA).
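For illustration only, this first-step comparison could be sketched with open-source scikit-learn analogues of a subset of the methods in Table 2 (the analogues are approximate, and exact XGBoost, CHAID, Decision List, and Bayesian network counterparts would require additional packages):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Approximate open-source analogues of six of the ten methods in Table 2.
candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
    "Boosted Trees": GradientBoostingClassifier(random_state=0),
    "LSVM": LinearSVC(),
    "Discriminant": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

validation_auc = {}
for name, clf in candidates.items():
    clf.fit(X_train_bal, y_train_bal)  # SMOTE-balanced training set
    # LinearSVC exposes decision_function rather than predict_proba.
    if hasattr(clf, "predict_proba"):
        scores = clf.predict_proba(X_val)[:, 1]
    else:
        scores = clf.decision_function(X_val)
    validation_auc[name] = roc_auc_score(y_val, scores)

# Retain the three highest-AUC models for the patient-specific vs
# situational variable comparison in the second step.
top3 = sorted(validation_auc, key=validation_auc.get, reverse=True)[:3]
```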
3. Results
This study included a total of 177,442 patients with an average age of 65.52 years. Descriptive statistics for the distributions of the predictive variables are included in Table 1. The study population had an in-hospital mortality rate of 0.1%, a 79.5% rate of discharge to home, and an average LOS of 2.4 days.
Table 1.
Descriptive statistics for all variables included in analysis.
| | THA (n = 177,442) |
|---|---|
| Age of Patient in Years - Mean (Standard Error) | 65.52 (0.027) |
| Biological Sex of Patient | |
| Male | 78,600 (44.3%) |
| Female | 98,806 (55.7%) |
| Primary Payor | |
| Medicaid | 9,062 (5.1%) |
| Private insurance | 66,300 (37.4%) |
| Other | 5,902 (3.4%) |
| Race of Patient | |
| White | 144,857 (81.6%) |
| African American | 13,150 (7.4%) |
| Hispanic | 6,175 (3.5%) |
| Asian or Pacific Islander | 1,611 (0.9%) |
| Native American | 480 (0.3%) |
| Other or Unknown | 11,169 (6.3%) |
| Median household income national quartile for patient ZIP Code | |
| 0-25th percentile | 35,003 (19.7%) |
| 26th to 50th percentile (median) | 44,024 (24.8%) |
| 51st to 75th percentile | 47,255 (26.6%) |
| 76th to 100th percentile | 48,465 (27.3%) |
| Unknown | 2,695 (1.5%) |
| Bed size of Hospital | |
| Small | 50,803 (28.6%) |
| Medium | 49,306 (27.8%) |
| Large | 77,333 (43.6%) |
| Location/Teaching Status | |
| Rural | 14,185 (8%) |
| Urban Nonteaching | 47,539 (26.8%) |
| Urban Teaching | 115,718 (65.2%) |
| Region of hospital | |
| Northeast | 35,052 (19.8%) |
| Midwest | 45,610 (25.7%) |
| South | 59,620 (33.6%) |
| West | 37,160 (20.9%) |
| Control/ownership of hospital (STRATA) | |
| Government, nonfederal | 14,913 (8.4%) |
| Private, not-for-profit | 136,945 (77.2%) |
| Private, investor-owned | 25,584 (14.4%) |
| Census Division of hospital | |
| New England | 10,399 (5.9%) |
| Middle Atlantic | 24,653 (13.9%) |
| East North Central | 31,139 (17.5%) |
| West North Central | 14,471 (8.2%) |
| South Atlantic | 34,003 (19.2%) |
| East South Central | 11,170 (6.3%) |
| West South Central | 14,447 (8.1%) |
| Mountain | 13,275 (7.5%) |
| Pacific | 23,885 (13.5%) |
| Patient Location: NCHS Urban-Rural Code | |
| Central counties of metro areas of ≥1 million population | 42,625 (24%) |
| Fringe counties of metro areas of ≥1 million population | 47,289 (26.7%) |
| Counties in metro areas of 250,000–999,999 population | 38,599 (21.8%) |
| Counties in metro areas of 50,000–249,999 population | 17,767 (10%) |
| Micropolitan counties | 17,960 (10.1%) |
| Not metropolitan or micropolitan counties | 12,963 (7.3%) |
| Unknown | 239 (0.1%) |
| APRDRG Risk Mortality | |
| 1- Minor likelihood of dying | 142,718 (80%) |
| 2- Moderate likelihood of dying | 28,094 (15.83%) |
| 3- Major likelihood of dying | 5,528 (3.12%) |
| 4- Extreme likelihood of dying | 1,099 (0.62%) |
| APRDRG Severity | |
| 1- Minor loss of function (includes cases with no comorbidity or complications) | 78,223 (44.08%) |
| 2- Moderate loss of function | 87,655 (49.4%) |
| 3- Major loss of function | 10,522 (5.93%) |
| 4- Extreme loss of function | 1,039 (0.59%) |
| Number of Diagnoses - Mean (Standard Error) | 8.435 (0.012) |
| Month of Procedure | |
| January | 14,301 (8.06%) |
| February | 14,652 (8.26%) |
| March | 15,254 (8.6%) |
| April | 13,985 (7.88%) |
| May | 15,300 (8.62%) |
| June | 15,064 (8.49%) |
| July | 13,163 (7.42%) |
| August | 15,530 (8.75%) |
| September | 13,394 (7.55%) |
| October | 15,929 (8.98%) |
| November | 15,845 (8.93%) |
| December | 14,929 (8.41%) |
| Died during hospitalization | 149 (0.1%) |
| Disposition of patient | |
| Discharged to Home | 65,102 (36.7%) |
| Transfer to Short-term Hospital | 445 (0.3%) |
| Transfer to Facility | 35,523 (20%) |
| Home Health Care (HHC) | 75,969 (42.8%) |
| Against Medical Advice (AMA) and Unknown | 254 (0.2%) |
| Length of Stay in Days - Mean (Standard Error) | 2.4 (0.005) |
The three most responsive models for LOS were LSVM, Neural Network, and CHAID, with fair results measuring 0.744, 0.723, and 0.719, respectively. Additionally, Decision List had good reliability, while LSVM and CHAID had fair reliability, with accuracies of 84.36%, 72.21%, and 71.32%, respectively. The three most responsive models for discharge disposition were LSVM, XGBoost Tree, and CHAID, with values of 0.80, 0.776, and 0.776, respectively. Correspondingly, the three most reliable models, each yielding good reliability, were Decision List, LSVM, and CHAID, measuring 88.38%, 82.37%, and 81.74%, respectively. The top 3 models yielding excellent responsiveness for in-hospital mortality were LSVM, Neural Network, and Logistic Regression, with values of 0.973, 0.97, and 0.968, respectively. Additionally, the most accurate models, with excellent reliability, were LSVM at 99.88%, Decision List at 99.87%, and XGBoost Tree, XGBoost Linear, and CHAID, all at 99.84% (Table 3).
Table 3.
Machine learning models development and performance assessment with fifteen variables in predicting length of stay, discharge disposition, and mortality.
| THA | | | | | | |
|---|---|---|---|---|---|---|
| LOS | | | | | | |
| | Reliability (Accuracy) | | | Responsiveness (AUC) | | |
| | Training | Testing | Validation | Training | Testing | Validation |
| Random Forest | 91.74% | 65.78% | 66.06% | 0.948 | 0.684 | 0.683 |
| Neural Network | 67.75% | 67.59% | 68.01% | 0.711 | 0.716 | 0.723 |
| XGBoost Tree | 65.41% | 64.85% | 65.55% | 0.636 | 0.634 | 0.632 |
| XGBoost Linear | 65.41% | 64.85% | 65.55% | 0.623 | 0.626 | 0.629 |
| LSVM | 72.10% | 71.85% | 72.21% | 0.742 | 0.745 | 0.744 |
| CHAID | 71.15% | 70.62% | 71.32% | 0.721 | 0.72 | 0.719 |
| Decision List | 84.39% | 83.88% | 84.36% | 0.65 | 0.654 | 0.651 |
| Discriminant | 65.29% | 65.72% | 65.23% | 0.69 | 0.695 | 0.689 |
| Logistic Regression | 67.66% | 67.64% | 67.86% | 0.71 | 0.715 | 0.713 |
| Bayesian Network | 67.47% | 67.24% | 67.49% | 0.711 | 0.714 | 0.712 |
| Discharge | | | | | | |
| | Reliability (Accuracy) | | | Responsiveness (AUC) | | |
| | Training | Testing | Validation | Training | Testing | Validation |
| Random Forest | 91.93% | 76.02% | 76.04% | 0.957 | 0.73 | 0.735 |
| Neural Network | 77.12% | 77.32% | 77.14% | 0.763 | 0.768 | 0.764 |
| XGBoost Tree | 79.58% | 79.41% | 79.68% | 0.78 | 0.778 | 0.776 |
| XGBoost Linear | 79.58% | 79.41% | 79.68% | 0.758 | 0.757 | 0.761 |
| LSVM | 82.24% | 82.33% | 82.37% | 0.802 | 0.805 | 0.801 |
| CHAID | 81.71% | 81.67% | 81.74% | 0.781 | 0.78 | 0.776 |
| Decision List | 88.20% | 88% | 88.38% | 0.707 | 0.71 | 0.704 |
| Discriminant | 70.69% | 70.60% | 70.28% | 0.765 | 0.77 | 0.763 |
| Logistic Regression | 77.21% | 77.41% | 77.28% | 0.767 | 0.77 | 0.768 |
| Bayesian Network | 76.52% | 76.46% | 76.47% | 0.763 | 0.763 | 0.763 |
| Mortality | | | | | | |
| | Reliability (Accuracy) | | | Responsiveness (AUC) | | |
| | Training | Testing | Validation | Training | Testing | Validation |
| Random Forest | 93.67% | 93.71% | 93.58% | 0.964 | 0.848 | 0.724 |
| Neural Network | 93.59% | 93.71% | 93.58% | 0.909 | 0.951 | 0.97 |
| XGBoost Tree | 99.84% | 99.86% | 99.84% | 0.956 | 0.944 | 0.929 |
| XGBoost Linear | 99.84% | 99.86% | 99.84% | 0.982 | 0.987 | 0.939 |
| LSVM | 99.84% | 99.86% | 99.88% | 0.982 | 0.959 | 0.973 |
| CHAID | 99.84% | 99.86% | 99.84% | 0.957 | 0.938 | 0.922 |
| Decision List | 99.87% | 99.90% | 99.87% | 0.899 | 0.909 | 0.899 |
| Discriminant | 88.50% | 88.41% | 88.62% | 0.937 | 0.935 | 0.94 |
| Logistic Regression | 93.31% | 93.36% | 93.26% | 0.917 | 0.988 | 0.968 |
| Bayesian Network | 93.60% | 93.70% | 93.55% | 0.948 | 0.826 | 0.836 |
Following this analysis, the performance of ML models in predicting outcomes using different sets of variables was assessed. The top three models from the initial analysis were compared, using patient-specific versus situational variables, in terms of reliability and responsiveness in the prediction of LOS, discharge disposition, and mortality (Table 4). LSVM, CHAID, and Decision List were the top three ML models in the prediction of LOS. When using patient variables for LOS prediction, reliability was 85.08%, 70.34%, and 69.78% for Decision List, LSVM, and CHAID, respectively. Responsiveness was fair for LSVM and CHAID at 0.717 and 0.704, and poor for Decision List at 0.638. When using situational variables for LOS prediction, reliability was 86.35%, 65.88%, and 65.64% for Decision List, CHAID, and LSVM, respectively. In the same order, all models demonstrated failed responsiveness, with AUCROC values of 0.552, 0.589, and 0.579.
Table 4.
Top three most reliable machine learning models performance comparison with patient-specific and situational variables in predicting length of stay, discharge disposition, and in-patient mortality.
| THA | | | | | | |
|---|---|---|---|---|---|---|
| LOS | | | | | | |
| | Reliability (Accuracy) | | | Responsiveness (AUC) | | |
| | Training | Testing | Validation | Training | Testing | Validation |
| Patient Variables | ||||||
| LSVM | 70.36% | 70.02% | 70.34% | 0.717 | 0.72 | 0.717 |
| CHAID | 69.97% | 69.79% | 69.78% | 0.708 | 0.708 | 0.704 |
| Decision List | 84.92% | 84.69% | 85.08% | 0.64 | 0.64 | 0.638 |
| Situational Variables | ||||||
| LSVM | 65.51% | 64.99% | 65.64% | 0.578 | 0.578 | 0.579 |
| CHAID | 65.79% | 65.22% | 65.88% | 0.587 | 0.588 | 0.589 |
| Decision List | 86.29% | 86.28% | 86.35% | 0.552 | 0.55 | 0.552 |
The top three models for discharge disposition were LSVM, CHAID, and Decision List. In the prediction of discharge with patient-specific variables, reliability was 81.53%, 81.64%, and 88.26% for LSVM, CHAID, and Decision List, respectively. Responsiveness was fair for LSVM, CHAID, and Decision List, with values of 0.786, 0.775, and 0.703. Situational variables produced reliability values of 79.68%, 79.68%, and 90.59% for LSVM, CHAID, and Decision List. All three models had failed responsiveness when using situational variables, with values of 0.565, 0.574, and 0.543.
XGBoost Tree, LSVM, and Decision List were the top three ML models for the prediction of mortality. Mortality prediction with patient-specific variables had reliabilities of 99.84%, 99.84%, and 99.87% for XGBoost Tree, LSVM, and Decision List, respectively. LSVM demonstrated excellent responsiveness at 0.952, while XGBoost Tree and Decision List had good responsiveness at 0.887 and 0.899. When situational variables were utilized to predict mortality, reliability for XGBoost Tree, LSVM, and Decision List was 99.85%, 99.85%, and 62.61%, respectively. Their respective AUCROC values were 0.5, 0.536, and 0.469, demonstrating failed responsiveness.
4. Discussion
Hip arthroplasty accounted for 4.2% of all procedures performed in the United States in 2018, with future projections predicting a continued rise in procedural demand.1 As volume increases, THA remains a target for continuous improvement efforts to minimize wasteful resource utilization in this value-based care delivery era.18 The CJR model is an example of such efforts, using bundled payments to incentivize the shift towards value-focused care models. Limitation of access to care for at-risk populations remains a concerning byproduct of such models, as value-centric reimbursement systems couple quality of care to the cost of each care episode. Patient-specific risk factors have been noted to correlate with complication rates and total costs after TJA, with average costs of $25,568 and $37,575 for primary procedures in patients without and with comorbidities, respectively.19 As providers are financially penalized for postoperative complications, the consequential tendency shifts towards “cherry-picking” patients with no medical comorbidities and “lemon-dropping”, or denying access to care for, high-risk patients.20 Driven by these concepts, recent literature has aimed to identify and delineate risk-stratification tools to guide preoperative optimization efforts, with potential application in the implementation of risk-adjusted reimbursement models. In this study, we assessed the performance of optimized machine learning algorithmic models in predicting a set of postoperative outcomes when using patient-specific variables in comparison to situational variables. We found that patient-specific variables outperformed situational variables, with performance mirroring that of models using all available variables.
The initial step of this study consisted of building and assessing the performance of ten different ML models in predicting LOS, inpatient mortality, and discharge disposition postoperatively, using a total of fifteen available variables. We noted substantial variance in the performance of the various models with respect to reliability and responsiveness in predicting these outcomes. In this analysis with all fifteen variables, the LSVM algorithm consistently generated the most responsive and most accurate model for predicting the postoperative outcomes. The LSVM algorithm's method of training in small batches prior to predicting outcomes may explain the observed responsiveness and accuracy. This approach, determining the optimal ML methodologies with a standard set of predictive variables prior to developing a model to assess and compare the predictive capacity of a select group of variables, should be emphasized. Clinical research into the application of ML is still in its infancy. The current practice in the ML literature consists of focusing on a single algorithmic methodology, which is generally deployed to build a single model that is subsequently assessed through analysis of predictive accuracy and responsiveness. While such an approach might be appropriate for assessing the predictive capacity of a single methodology, it might be suboptimal for developing optimized models for clinical application, as it forgoes other ML methodologies that could potentially yield more reliable and responsive predictions.
The second step of this study compared the performance of patient-specific and situational variables when used in the top 3 ML models for the prediction of the outcome variables of interest. The predictive capacity of these variables was then compared against the full set of variables identified in the first step. When using the 8 patient-specific variables, the responsiveness of every ML model was higher than with the 7 situational variables, emphasizing the predictive value of patient-specific risk factors. Previous studies using simple statistical methods assessed isolated primary THA situational characteristics, such as month of the procedure, hospital bed size, location, and teaching status, with varying conclusions regarding postoperative outcomes. While these correlations remain potentially applicable in the clinical setting, the findings of this study highlight a stronger predictive capacity of ML models when utilizing patient-specific factors compared to situational variables. The findings of this study are concordant with recently published reports using deep learning models for risk stratification of primary THA.21 Ramkumar et al. developed a NN model with acceptable predictive responsiveness and reliability in predicting several value metrics and called for more equitable arbitration between payors and providers. In another study, Ramkumar et al. also developed a NN model for total knee arthroplasty and demonstrated similar results in the prediction of LOS and inpatient charges.13 The authors proposed a patient-specific payment model that factored in patient-specific variables. In another study, Kumar et al. compared 4 ML models incorporating 291 features in predicting postoperative outcomes after anatomic and reverse total shoulder arthroplasty.22 They noted consistency in the top features, including follow-up time, surgery on the dominant hand, and gender. Throughout the literature, patient-specific factors are consistently observed to impact outcome prediction, and this study's findings add to that collective evidence. Rather than relying on a single or pre-determined set of ML models, the strength of the current study stems from assessing multiple different algorithms to identify the best-performing models, and subsequently deploying those to compare the effect of different sets of variables on the predictive capacity of the models.
This study has limitations inherent to retrospective studies. The outcomes of this study were dichotomized rather than treated as continuous, in order to simplify outcomes and provide a more accurate analysis. While such dichotomized outcomes provide utility in ML prediction, this comes at the cost of prediction precision, especially for LOS; when quality-improvement efforts target LOS in the clinical setting, the continuous nature of LOS is generally more informative than a binary cutoff. Another limitation is that the strength of ML models relies on the quality of the data used to train, test, and validate the algorithms, and administrative databases may be prone to incompleteness and errors. However, the NIS is an appropriate database with consistently demonstrated utility for predictive large population-based studies, and this study utilized a recent version of this validated database. Another limitation is our inability to externally validate the findings of this study. Although external validation was not within the scope of the study, efforts were made to internally validate the results, as the dataset was split into 64% training, 16% testing, and 20% validation groups. Internal validity was supported by the grossly similar results observed for every model across each phase. Direct comparison with another database would be useful for assessing the generalizability and replicability of each ML algorithm in this study.
5. Conclusions
In conclusion, this study trained, tested, and validated 10 different machine learning algorithms to predict mortality, discharge disposition, and length of stay following primary total hip arthroplasty, with subsequent analysis of the predictive capacity of machine learning models developed using patient-specific versus situational variables. Our study demonstrated the value of analyzing multiple machine learning methods to identify the optimal model for specific outcomes, and showed that models trained using patient-specific factors consistently outperformed those trained using situational variables. The consideration of patient factors in a tiered reimbursement system would allow for the optimization of care delivery and avoid disincentivizing providers from providing care to high-risk patients.
Funding/sponsorship
Not applicable.
Informed consent (patient/guardian), mandatory only for case reports/clinical images
Not applicable.
Institutional ethical committee approval (for all human studies)
Not applicable.
Authors contribution
All authors (FN, TC, AZ, ME, RS) contributed to analyzing and interpreting data as well as manuscript preparation. All authors read and approved the final manuscript.
Availability of data and materials
The datasets generated and/or analysed during the current study are available in the National Inpatient Sample repository, https://www.hcup-us.ahrq.gov/db/nation/nis/nisdbdocumentation.jsp.
Declaration of competing interest
The authors declare that they have no competing interests.
Acknowledgements
Not applicable.
References
- 1. McDermott K.W., Liang L. Overview of operating room procedures during inpatient stays in U.S. hospitals, 2018: statistical brief #281. Healthcare Cost and Utilization Project (HCUP) Statistical Briefs. Rockville (MD): Agency for Healthcare Research and Quality (US); 2006. http://www.ncbi.nlm.nih.gov/books/NBK574416/
- 2. Tsertsvadze A., Grove A., Freeman K., et al. Total hip replacement for the treatment of end stage arthritis of the hip: a systematic review and meta-analysis. PLoS One. 2014;9. doi:10.1371/journal.pone.0099804.
- 3. Hootman J.M., Helmick C.G., Barbour K.E., et al. Updated projected prevalence of self-reported doctor-diagnosed arthritis and arthritis-attributable activity limitation among US adults, 2015-2040. Arthritis Rheumatol. 2016;68:1582–1587. doi:10.1002/art.39692.
- 4. Crawford A.M., Karhade A.V., Agaronnik N.D., et al. Development of a machine learning algorithm to identify surgical candidates for hip and knee arthroplasty without in-person evaluation. Arch Orthop Trauma Surg. 2023:1–8. doi:10.1007/s00402-023-04827-9.
- 5. El-Othmani M.M., Zalikha A.K., Shah R.P. Comparative analysis of the ability of machine learning models in predicting in-hospital postoperative outcomes after total hip arthroplasty. J Am Acad Orthop Surg. 2022;30:e1337–e1347. doi:10.5435/JAAOS-D-21-00987.
- 6. Navarro S.M., Wang E.Y., Haeberle H.S., et al. Machine learning and primary total knee arthroplasty: patient forecasting for a patient-specific payment model. J Arthroplasty. 2018;33:3617–3623. doi:10.1016/j.arth.2018.08.028.
- 7. Bini S.A. Artificial intelligence, machine learning, deep learning, and cognitive computing: what do these terms mean and how will they impact health care? J Arthroplasty. 2018;33:2358–2361. doi:10.1016/j.arth.2018.02.067.
- 8. Bzdok D., Altman N., Krzywinski M. Statistics versus machine learning. Nat Methods. 2018;15:233–234. doi:10.1038/nmeth.4642.
- 9. Haeberle H.S., Helm J.M., Navarro S.M., et al. Artificial intelligence and machine learning in lower extremity arthroplasty: a review. J Arthroplasty. 2019;34:2201–2203. doi:10.1016/j.arth.2019.05.055.
- 10. Carr C.J., Mears S.C., Barnes C.L., et al. Length of stay after joint arthroplasty is less than predicted using two risk calculators. J Arthroplasty. 2021;36:3073–3077. doi:10.1016/j.arth.2021.04.010.
- 11. Endo A., Baer H.J., Nagao M., et al. Prediction model of in-hospital mortality after hip fracture surgery. J Orthop Trauma. 2018;32:34–38. doi:10.1097/BOT.0000000000001026.
- 12. Harris A.H.S., Kuo A.C., Weng Y., et al. Can machine learning methods produce accurate and easy-to-use prediction models of 30-day complications and mortality after knee or hip arthroplasty? Clin Orthop. 2019;477:452–460. doi:10.1097/CORR.0000000000000601.
- 13. Ramkumar P.N., Karnuta J.M., Navarro S.M., et al. Deep learning preoperatively predicts value metrics for primary total knee arthroplasty: development and validation of an artificial neural network model. J Arthroplasty. 2019;34:2220–2227.e1. doi:10.1016/j.arth.2019.05.034.
- 14. Sniderman J., Stark R.B., Schwartz C.E., et al. Patient factors that matter in predicting hip arthroplasty outcomes: a machine-learning approach. J Arthroplasty. 2021;36:2024–2032. doi:10.1016/j.arth.2020.12.038.
- 15. Japkowicz N., Stephen S. The class imbalance problem: a systematic study. Intell Data Anal. 2002;6:429–449.
- 16. Chawla N.V., Bowyer K.W., Hall L.O., et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–357.
- 17. Ho K.C., Speier W., El-Saden S., et al. Predicting discharge mortality after acute ischemic stroke using balanced data. AMIA Annu Symp Proc. 2014;2014:1787–1796.
- 18. Schwartz A.J., Bozic K.J., Etzioni D.A. Value-based total hip and knee arthroplasty: a framework for understanding the literature. J Am Acad Orthop Surg. 2019;27:1–11. doi:10.5435/JAAOS-D-17-00709.
- 19. Bozic K.J., Ward L., Vail T.P., et al. Bundled payments in total joint arthroplasty: targeting opportunities for quality improvement and cost reduction. Clin Orthop. 2014;472:188–193. doi:10.1007/s11999-013-3034-3.
- 20. McLawhorn A.S., Buller L.T. Bundled payments in total joint replacement: keeping our care affordable and high in quality. Curr Rev Musculoskelet Med. 2017;10:370–377. doi:10.1007/s12178-017-9423-6.
- 21. Ramkumar P.N., Navarro S.M., Haeberle H.S., et al. Development and validation of a machine learning algorithm after primary total hip arthroplasty: applications to length of stay and payment models. J Arthroplasty. 2019;34:632–637. doi:10.1016/j.arth.2018.12.030.
- 22. Kumar V., Roche C., Overman S., et al. What is the accuracy of three different machine learning techniques to predict clinical outcomes after shoulder arthroplasty? Clin Orthop. 2020;478:2351–2363. doi:10.1097/CORR.0000000000001263.