Clinical Orthopaedics and Related Research. 2020 Jun 10;478(9):2088–2101. doi: 10.1097/CORR.0000000000001343

Can Machine-learning Algorithms Predict Early Revision TKA in the Danish Knee Arthroplasty Registry?

Anders El-Galaly, Clare Grazal, Andreas Kappel, Poul Torben Nielsen, Steen Lund Jensen, Jonathan A Forsberg
PMCID: PMC7431253  PMID: 32667760

Abstract

Background

Revision TKA is a serious adverse event with substantial consequences for the patient. As the demand for TKA rises, reducing the risk of revision TKA is becoming increasingly important. Predictive tools based on machine-learning algorithms could reform clinical practice. Few attempts have been made to combine machine-learning algorithms with data from nationwide arthroplasty registries and, to the authors’ knowledge, none have tried to predict the likelihood of early revision TKA.

Question/purposes

We used the Danish Knee Arthroplasty Registry to build models to predict the likelihood of revision TKA within 2 years of primary TKA and asked: (1) Which preoperative factors were the most important features behind these models’ predictions of revision? (2) Can a clinically meaningful model be built on the preoperative factors included in the Danish Knee Arthroplasty Registry?

Methods

The Danish Knee Arthroplasty Registry collects patients’ characteristics and surgical information from all arthroplasties conducted in Denmark and thus provides a large nationwide cohort of patients undergoing TKA. As the training dataset, we retrieved all preoperative variables of 25,104 primary TKAs from 2012 to 2015. The same variables were retrieved from 6170 TKAs conducted in 2016, which was used as a hold-out year for temporal external validation. If a patient received bilateral TKA, only the first knee to receive surgery was included. All patients were followed for 2 years, with removal, exchange, or addition of an implant defined as TKA revision. To find the best-performing model, we created four different predictive models: a regression-based model using logistic regression with the least absolute shrinkage and selection operator (LASSO), two classification tree models (a random forest and a gradient boosting model), and a supervised neural network. For comparison, we created a noninformative model predicting that all observations were unrevised. The four machine-learning models were trained using 10-fold cross-validation on the training dataset after adjusting for the low percentage of revisions by over-sampling revised observations and under-sampling unrevised observations. In the validation dataset, the models’ performance was evaluated and compared by density plot, calibration plot, accuracy, Brier score, receiver operating characteristic (ROC) curve, and area under the curve (AUC). The density plot depicts the distribution of predicted probabilities, and the calibration plot graphically depicts whether the predicted probability resembled the observed probability. The accuracy indicates how often the models’ predictions were correct, and the Brier score is the mean squared distance from the predicted probability to the observed outcome. The ROC curve is a graphical output of the models’ sensitivity and specificity, from which the AUC is calculated. The AUC can be interpreted as the likelihood that a model ranks a randomly chosen revised observation above a randomly chosen unrevised one and thus, a priori, an AUC of 0.7 was chosen as the threshold for a clinically meaningful model.

Results

Based on the model training, age, post-fracture osteoarthritis, and weight were deemed the most important preoperative factors within the machine-learning models. During validation, the models’ performance was no different from that of the noninformative model, and with AUCs ranging from 0.57 to 0.60, no model reached the predetermined AUC threshold for clinically useful discriminative capacity.

Conclusion

Although several well-known presurgical risk factors for revision were coupled with four different machine learning methods, we could not develop a clinically useful model capable of predicting early TKA revisions in the Danish Knee Arthroplasty Registry based on preoperative data.

Clinical relevance

The inability to predict early TKA revision highlights that predicting revision based on preoperative information alone is difficult. Future models might benefit from including medical comorbidities and an anonymous surgeon identifier variable, or may attempt to build a postoperative predictive model including intra- and postoperative factors, as these may have a stronger association with early TKA revision.

Introduction

Revision TKA is a devastating and costly adverse event; thus, the risk of revision must be considered before offering TKA to a patient. A recent study forecast a rise in the demand for TKA in the United States, estimating a prevalence of 1.5 million primary TKAs in 2050 [22]. If the rate of revision within 2 years remains stable at 2% to 3%, 30,000 to 45,000 US citizens are predicted to undergo early TKA revision after 2050 [7, 36]. Revision TKA is more complicated and more expensive than primary TKA, and it is associated with reduced implant survival and inferior patient-reported outcomes [8, 18]. Preoperative identification of patients with an increased risk of early revision might reduce the future prevalence of revision TKA and would therefore benefit both patients and the overall cost of healthcare.

Predicting the risk of certain outcomes by applying advanced statistics (such as machine-learning algorithms) to robust databases is expected to transform medicine [34] and has already shown promise in orthopaedic research [1, 13, 19]. Machine-learning algorithms might expose unseen associations in large datasets, making them capable of producing qualified predictions of future events. The field of hip and knee arthroplasty has a long tradition of data collection through nationwide registries that provide readily accessible, large datasets from population-based cohorts [32]. Yet, to the authors’ knowledge, only a few researchers have attempted to use data like these in predictive models, and none have attempted to predict TKA revision [16, 19, 21]. As a nationwide registry, the Danish Knee Arthroplasty Registry has collected information on all knee arthroplasties in Denmark (estimated population of 5.8 million in 2019) since 1997 [35]. Applying machine-learning methods to the Danish Knee Arthroplasty Registry offers the potential to develop predictive models readily usable in Denmark and possibly transferable to countries with similar registries (such as other Scandinavian countries).

We used the Danish Knee Arthroplasty Registry to build models predicting the likelihood of revision TKA within 2 years of primary TKA, and asked: (1) Which preoperative factors are the most important features behind these models’ predictions of revision? and (2) Can a clinically meaningful model be built on the preoperative factors included in the Danish Knee Arthroplasty Registry?

Patients and Methods

The study was reported in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines [6] and the Strengthening the Reporting of Observational studies in Epidemiology (STROBE) statement [45]. The study was approved by the Danish Data Protection Agency before data collection (entry no. 2008-58-0028).

Data Source

The Danish Knee Arthroplasty Registry longitudinally maintains information on all knee arthroplasties performed in Denmark; since 2011, the completeness of the database has been above 96% for primary arthroplasties and 93% for revision arthroplasties [7]. The Danish Knee Arthroplasty Registry is linked with the Danish Civil Registration Registry, providing vital and emigration status for all observations. This linkage enables complete follow-up of the observations in the Danish Knee Arthroplasty Registry [39].

Study Cohort

Data on patients undergoing TKA before 2012 were omitted because of a lack of information on patient height, which was introduced to the registry in 2011 [7]. Patients younger than 30 years were excluded, as they do not represent the typical patient undergoing TKA. Observations with death or emigration within the first 2 years of the index surgery were excluded because of incomplete follow-up. If a patient received bilateral TKA during the study period, we included only the first knee to receive surgery in the study cohort. Data from primary TKAs conducted from 2012 to 2015 were used to build the models (the training dataset), while data on primary TKAs from 2016 were retrieved separately and used to validate the models (the validation dataset) (Fig. 1). The use of a holdout year (2016) allows for temporal external validation of the models, as opposed to internal validation, in which a percentage of the training dataset (for example, 20%) is randomly selected and used for validation. Internal validation is sensitive to temporal changes in surgical practice (such as trends in patellar resurfacing or implant fixation), and thus an internally validated model is only valid within the dataset used. By conducting temporal external validation, the models are evaluated in light of temporal changes, and reliable models will therefore be readily applicable to TKAs conducted in Denmark.

Fig. 1.


This flowchart illustrates the training (2012-2015) and validation (2016) datasets.

Variables

All variables were obtained from the Danish Knee Arthroplasty Registry. Patient demographics included sex, age, weight, height, and BMI. Indications for TKA were classified as primary osteoarthritis; secondary osteoarthritis (such as that following a traumatic meniscus tear or an ACL injury); post-fracture osteoarthritis following tibia, femur, or patella fractures; rheumatoid arthritis; other types of inflammatory arthritis (such as psoriatic arthritis); and other indications. Prior knee procedures included arthroscopy, meniscus or cruciate ligament repair, high tibial osteotomy, open reduction and internal fixation of distal femur and/or proximal tibia fractures, patellectomy, and other procedures without subclassification. The Danish Knee Arthroplasty Registry describes patient comorbidities by Charnley class, divided into Class A (unilateral arthritis), B1 (bilateral arthritis), B2 (arthroplasty of the opposite knee), and C (other conditions affecting walking capacity) [3]. Knee functionality is described by the original (1989) American Knee Society Score [23], divided into clinical and functional subscores ranging from 0 to 100, with higher scores representing less pain and better function. In addition, the clinical information underlying the American Knee Society Score was retrieved individually and included pain (ranging from none to severe), knee instability (AP and mediolateral), coronal alignment, walking ability (ranging from unlimited to unable), stair-walking ability (ranging from normal to unable), and need for a walking aid (ranging from none to walker). We only retrieved surgical information considered part of the surgical planning, which included choice of implant constraint (cruciate-retaining, posterior-stabilized, constrained condylar, and hinged), whether patellar resurfacing was done, need for additional components (stems, augments, and cones), choice of fixation (cemented, uncemented, or hybrid), and use of intraoperative navigation. In the Danish Knee Arthroplasty Registry, data are collected at the hospital level and the individual surgeon is not registered. Thus, the administrative registrations consisted of the hospital’s annual arthroplasty volume and the hospital’s geographic location in Denmark.

Missing Data

In total, 1% of data in the training dataset (2012 to 2015) were missing, with the highest percentages of missing data points in weight (5%), height (5%), and mediolateral knee instability (4%). Similarly, 1% of data in the validation dataset (2016) were missing, with the highest percentages of missing data points in walking aid (5%), stair-walking ability (5%), and mediolateral knee instability (5%). The missing data points were imputed separately for the training and validation datasets. Before imputation, we omitted the outcome variable (revision) to prevent the outcome from affecting the imputed values. Imputation was conducted through the R-package “missForest” with 50 trees and a maximum of five iterations to limit the computation time (other parameters were left at their defaults) [42].
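A minimal R sketch of this imputation step follows; it is not the authors’ code, and the data frame `tka_train` and outcome column `revised` are hypothetical names.

```r
# A minimal sketch of the imputation, assuming the training extract is a data
# frame `tka_train` with a factor outcome column `revised` (hypothetical names).
library(missForest)

set.seed(42)
# Omit the outcome before imputing so it cannot influence the imputed values
predictors <- tka_train[, setdiff(names(tka_train), "revised")]

# 50 trees and at most five iterations, as described above; other parameters
# are left at their defaults
imputed <- missForest(predictors, ntree = 50, maxiter = 5)

# Reattach the outcome to the imputed predictors
tka_train_imputed <- cbind(imputed$ximp, revised = tka_train$revised)
```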

Study Outcome

The models aimed to predict early revision TKA, which was characterized as revision for any indication within 2 years of the primary TKA. In accordance with the Danish Knee Arthroplasty Registry, revision was defined as the removal, substitution, or addition of an implant [7]. The Danish Knee Arthroplasty Registry allows multiple indications for a single revision, and therefore, a previously defined clinical hierarchy was used to present only the most important clinical indication for each revision TKA [10].

Predictive Models

We used the training dataset to fit the predictive models. The low percentage of revisions (3%) constituted a class imbalance within our dataset. To generate a more balanced dataset suitable for machine learning, we over-sampled revised observations and under-sampled unrevised observations using the R-package “ROSE” (default settings, except p = 0.1) [31]. In short, ROSE draws a synthetic sample by a smoothed bootstrap approach to create a modified training dataset with a more even distribution of revised and unrevised observations. The modified training dataset is created from a conditional kernel density estimate, meaning that the modified dataset is drawn from the neighborhood of the revised/unrevised observations in the original dataset. By default, ROSE creates a 50-50 distribution, but in the present study we over-sampled revised observations to 10% and under-sampled unrevised observations to 90% to keep the proportion of revisions within clinical reason.
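A minimal sketch of the rebalancing step, continuing from the hypothetical `tka_train_imputed` above:

```r
# A minimal sketch of the class rebalancing with ROSE; variable names are
# hypothetical and carried over from the imputation sketch.
library(ROSE)

set.seed(42)
# p = 0.1 over-samples revised knees to roughly 10% of the synthetic sample
# and under-samples unrevised knees to roughly 90%; other settings at default
balanced <- ROSE(revised ~ ., data = tka_train_imputed, p = 0.1)$data

table(balanced$revised)   # roughly a 90/10 unrevised/revised split
```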

All machine-learning models were trained using 10-fold cross-validation to optimize the models’ hyperparameters. In short, we randomly split the modified training dataset into 80/20 fractions and fitted the models’ hyperparameters to 80% of the data, followed by testing the models on the remaining 20%. The tested hyperparameters were picked through random search, and the parameters resulting in the best-performing model were chosen. This procedure, including the random split of the training dataset, was repeated 10 times (that is, 10-fold cross-validation) to mitigate the risk of fitting the models’ hyperparameters to random noise within the training dataset (overfitting). Finally, the best-performing hyperparameters over the 10-fold cross-validation were chosen to train the final model on the entire training dataset.
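The following sketch illustrates this tuning setup in caret, shown for the random forest classifier described below; it is illustrative rather than the authors’ code, and `balanced` and `revised` are the hypothetical names from the sketches above.

```r
# A minimal sketch of 10-fold cross-validation with random hyperparameter search.
library(caret)

# twoClassSummary needs class probabilities and outcome factor levels that are
# valid R names (for example, "unrevised" and "revised")
ctrl <- trainControl(
  method = "cv",                      # k-fold cross-validation
  number = 10,                        # 10 folds, as described in the text
  search = "random",                  # random search over hyperparameters
  classProbs = TRUE,
  summaryFunction = twoClassSummary   # optimize by ROC rather than accuracy
)

set.seed(42)
rf_fit <- train(
  revised ~ .,
  data = balanced,
  method = "rf",
  ntree = 250,                        # predetermined forest size, per the text
  tuneLength = 10,                    # number of random candidates to try
  metric = "ROC",
  trControl = ctrl
)
```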

We created four different predictive models using the following machine-learning methods: logistic regression with the least absolute shrinkage and selection operator (LASSO), the random forest classifier, the gradient boosting model, and the neural network. The LASSO regression was created through the R-package “glmnet” with default settings except shrinkage, which was optimized through cross-validation [14]. Briefly, LASSO regression starts from a simple logistic regression including all available covariates. This regression is then modified through L1-penalization, which adds a penalty to each covariate, reducing or eliminating covariates with less importance. This procedure guards against inclusion of noninformative covariates in the final model, resulting in a simplified model with optimized performance [44].

Second, we built a random forest classifier through the R-package “Caret” [27, 30]. The random forest classifier builds a “forest” of decision trees, each randomly selecting a subset of covariates used to split the training dataset into revised and unrevised observations. The final prediction is then averaged across all trees. In this study, the random forest classifier was predetermined to consist of 250 trees, while the number of features (covariates) included in each tree and the minimum number of observations in each prediction class were optimized through cross-validation.

Third, we fitted a gradient boosting model, which builds a group of decision trees like those in the random forest classifier. In contrast to the random forest, however, the gradient boosting model grows the trees sequentially, so each new tree builds on information from the previous tree. Because the method boosts weak associations, it is prone to fitting the final model to random noise within the training dataset (overfitting). To counter this, the gradient boosting model was preceded by feature selection through the R-package “Boruta” (default settings). Boruta uses a random forest classifier to eliminate unimportant features (covariates) by comparing their importance with randomly constructed features, thus reducing the number of variables before fitting the gradient boosting model [28]. The gradient boosting model consisted of 2000 trees, with interaction depth, shrinkage, minimal observations in terminal nodes, and the fraction of the training dataset used in the next tree expansion (bag fraction) determined through cross-validation [17].

Finally, we built a supervised neural network through the R-package “Caret” [27]. A neural network is a “black-box” classifier consisting of layers and nodes loosely modeled on the human brain. The input layer is mathematically transformed into nodes interconnected through multiple “deep” layers, resulting in a final output layer [11]. After cross-validation, the input layer consisted of 76 variables processed through five hidden layers to construct the final prediction (the output layer). To depict how the four models calculated their predictions, we used the varImp function within the R-package “Caret”, which ranked the features by their importance in the final models [27].

For model comparison, we created a fifth model that a priori estimated a random probability of revision between 0 and 0.1 for all observations; the probability range was determined from the percentage of revisions in the modified training dataset. This model is noninformative because it is not based on any of the included preoperative factors and predicts that all observations remain unrevised within the follow-up.

The four machine-learning models and the noninformative model were validated and compared on the holdout year (2016). The validation dataset was left unadjusted and thus remained imbalanced, with a low percentage of TKA revisions. The models’ predictions were evaluated by density plot, calibration plot, accuracy, and Brier score, while their discriminative capacity (that is, how well they separate revised from unrevised observations) was depicted by a receiver operating characteristic (ROC) curve and the area under the curve (AUC). The density plot depicts the distribution of predicted probabilities divided by the actual outcome (revised/unrevised). The calibration plot graphically depicts the predicted probability against the observed probability; a well-calibrated model will approximate a diagonal line, indicating agreement between the predicted and observed probabilities. The accuracy depicts the percentage of correct predictions, and the Brier score is the mean squared distance from the predicted probabilities to the observed outcomes across the observations. Thus, a high accuracy and a low Brier score indicate a precise model. The ROC curve plots sensitivity against 1 - specificity across a range of probability thresholds. From the ROC curve, the AUC can be calculated and used to rank the models’ discriminative capacity as excellent (0.9 to 1), good (0.8 to 0.89), fair (0.7 to 0.79), poor (0.6 to 0.69), and failed (0.5 to 0.59) [12]. We defined an AUC of 0.7 as the cutoff for a clinically meaningful model, indicating that the model would rank a randomly chosen revised observation above a randomly chosen unrevised observation in 70% of cases.
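A minimal sketch of these validation metrics follows, assuming the fitted `rf_fit` from the tuning sketch and an imputed, unadjusted 2016 data frame `tka_valid` with outcome levels "unrevised"/"revised" (all names hypothetical).

```r
# A minimal sketch of validation on the hold-out year.
library(pROC)

prob_revised <- predict(rf_fit, newdata = tka_valid, type = "prob")[, "revised"]
obs <- as.numeric(tka_valid$revised == "revised")   # 1 = revised, 0 = unrevised

pred_class <- ifelse(prob_revised >= 0.5, "revised", "unrevised")
accuracy   <- mean(pred_class == as.character(tka_valid$revised))
brier      <- mean((prob_revised - obs)^2)          # mean squared distance

roc_obj <- roc(response = tka_valid$revised, predictor = prob_revised,
               levels = c("unrevised", "revised"))
auc(roc_obj)                            # area under the ROC curve
ci.auc(roc_obj, method = "bootstrap")   # bootstrapped 95% CI, as in the text
```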

Statistical Analysis

Categorical variables are presented by their distribution and continuous variables by their mean, SD, and range. Statistical significance in the comparison of the training and validation datasets is depicted with the Bayes factor, calculated in the R-package “BayesFactor” [33]. The Bayes factor enables an assessment of how well the data fit either the null hypothesis (H0) or the alternative hypothesis (H1). In contrast, the traditional p value only depicts the likelihood of observing a similar (or more extreme) result given that H0 is true [2]. The Bayes factor requires an a priori assumption of whether H0 or H1 is expected, and in the current study, we expected H0 for all comparisons (that is, we expected no substantial difference between variables in the training and validation datasets). The Bayes factor is the ratio between the marginal likelihood of H1 and the marginal likelihood of H0 given the observed data; thus, a high Bayes factor indicates evidence for H1 (meaningful differences between the training and validation datasets) and against H0. Conversely, a low Bayes factor indicates evidence for H0 and against H1 [2]. As guidance, the Bayes factor can be scored as < 0.01 (extreme evidence for H0), < 0.1 (strong evidence for H0), < 0.33 (moderate evidence for H0), < 1 (anecdotal evidence for H0), > 1 (anecdotal evidence for H1), > 3 (moderate evidence for H1), > 10 (strong evidence for H1), and > 100 (extreme evidence for H1) [37]. The models’ accuracy, Brier score, ROC curve, and AUC are presented with 95% confidence intervals calculated by stratified bootstrapping. All analyses were conducted in R version 3.5.1 (R Foundation for Statistical Computing, Vienna, Austria).
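A minimal sketch of one such comparison, assuming `age` is a continuous variable in both (hypothetical) data frames:

```r
# A minimal sketch of a Bayes factor comparison between datasets.
library(BayesFactor)

bf <- ttestBF(x = tka_train_imputed$age, y = tka_valid$age)
extractBF(bf)$bf   # ratio of evidence for H1 over H0; values < 1 favor H0
```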

Results

Training Versus Validation Dataset

The training and validation datasets differed statistically on several preoperative factors, but clinically the differences were limited (Table 1). Patient characteristics were largely unchanged between the two datasets; however, there were some surgical differences. In the training dataset, more knees received posterior-stabilized implants (15% [3736 of 25,104] versus 4% [248 of 6170]), fewer had the patella resurfaced (78% [19,683 of 25,104] versus 83% [5130 of 6170]), and more implants were fully cemented (71% [17,892 of 25,104] versus 63% [3887 of 6170]) compared with the validation dataset. Similarly, more surgeons used a tourniquet during surgery in the training dataset than in the validation dataset (72% [18,011 of 25,104] versus 54% [3354 of 6170]). In both datasets, the rate of revision within 2 years was 3% (Table 1), with infection and instability as the most frequent indications (Table 2). Regarding patellar resurfacing during primary TKA, more revisions were indicated by secondary insertion of a patellar component in the training dataset than in the validation dataset (8% [61 of 812] versus 2% [3 of 175]).

Table 1.

Patient and surgical information for the predictive models


Table 2.

Indications for revision within 2 years


Which Preoperative Factors Are the Most Important Features in These Models?

Age was considered an important feature across all models, while other top-10 features varied (Table 3). Use of navigation and hemophilia received the highest coefficients in the LASSO regression (that is, they were the most important factors in the regression model), but they did not appear in the top 10 of the other models (Table 3). Similarly, the neural network was the only model to consider the different indications for primary TKA (such as post-fracture osteoarthritis) as top-10 features. The agreement between the two classification tree models (the random forest classifier and the gradient boosting model) was higher, with American Knee Society Scores and patient characteristics such as height, weight, and BMI considered important features in both models.

Table 3.

The top 10 most-important factors in each model


Can a Clinically Meaningful Model Be Built on the Preoperative Factors Included in the Danish Knee Arthroplasty Registry?

All four machine-learning models predicted a low probability of revision for both revised and unrevised observations (Fig. 2A-D). This resulted in high accuracy and low Brier scores across the four models (Table 4). However, as depicted by the density plots, none of the models were able to differentiate between revised and unrevised observations (Fig. 2A-E). The inability to predict revision resulted in poor calibration during validation for all models (Fig. 2A-E). Consequently, none of the machine-learning models surpassed an AUC of 0.7, predetermined as the threshold for reliable discriminative capacity (Table 4). In fact, the ROC curves of the machine-learning models resembled that of the noninformative model (Fig. 2A-E), and thus the AUCs of the machine-learning models were only slightly better than the AUC of the noninformative model (Table 4).

Fig. 2 A-E.


The density plot, calibration plot, and ROC curve for the models’ predictions during validation are shown here and on the next page: (A) the LASSO regression, (B) the random forest, (C) the gradient boosting model, (D) the neural network, and (E) the noninformative model, which a priori predicted no revision for all observations (D and E shown on the next page).

Table 4.

Performance parameters for the models


Discussion

In orthopaedic surgery, machine-learning algorithms have shown promise in predicting the survival of patients treated for bone metastases [1], opioid use after spinal surgery [25, 26], and a range of outcomes related to total joint arthroplasty, such as patient-reported outcomes [13, 21]. In the present study, we attempted to combine machine learning with data from a nationwide arthroplasty registry to predict revision TKA. Although none of the models reached the threshold for clinical utility, the Danish Knee Arthroplasty Registry has strong potential as a data source for prediction models.

First, the Danish healthcare system is financed through taxes with resident-based entitlement, and each Danish citizen has a unique social security number by which every healthcare contact is registered. This combination enabled us to build the models on population-based data with complete follow-up and thereby ensure high external validity [40]. Secondly, the Danish Knee Arthroplasty Registry is part of the Danish Clinical Quality Program, which is used to measure the quality of treatment in all Danish hospitals. Consequently, reporting is mandatory, which leads to high completeness of registered arthroplasties and few missing data points [35]. Thirdly, the longitudinal maintenance of the Danish Knee Arthroplasty Registry enabled the inclusion of a contemporary hold-out year as the validation dataset. This ensured that the models were validated in a different setting from the training dataset, as depicted by the differences in implant constraint, fixation, and patellar resurfacing presented in this study. Such differences might occur at random or reflect changes in surgical practice, and by incorporating them in a temporal external validation, the model validation resembled the real-world use of diagnostic tools. Lastly, by including variables currently collected by the Danish Knee Arthroplasty Registry, a final clinically usable model would have been readily applicable in the Danish healthcare system, providing risk estimates to the surgeons reporting to the registry.

Limitations

The study also had several limitations hindering the development of clinically useful models. First, the aim to create a preoperative predictive model guiding patients and surgeons in the decision to undergo surgery may be the main restriction on a clinically usable model. Early revision might be a consequence of intraoperative circumstances or happen at random, and would thus be hard to predict from preoperative factors. The Danish Knee Arthroplasty Registry includes some intraoperative variables, such as duration of surgery and surgical complications (such as iatrogenic fractures or rupture of the patellar tendon). Yet some complications, like suboptimal implant positioning, might not be identified intraoperatively. Postoperative variables, such as postoperative coronal alignment or component overhang, might have improved the models’ capability of predicting early revisions, but these are currently not registered in the Danish Knee Arthroplasty Registry. Preoperatively, the risk of intraoperative errors might be associated with surgeon experience, as low surgeon volume seems to be associated with an increased risk of surgical complications [29]. However, information regarding the surgeon(s) performing the procedure is not captured within the Danish Knee Arthroplasty Registry. Instead, we included each hospital’s volume of TKAs as an approximation, which has also been associated with the rate of complications following total joint arthroplasties [38].

Secondly, although the Danish Knee Arthroplasty Registry contains a wide range of preoperative factors, it does not include medical comorbidities such as diabetes. Diabetes is a recognized risk factor for deep infection after joint arthroplasty [4]. As more than 25% of the early revisions in this study were due to infection, the inclusion of diabetes might have improved the models’ discriminative capacity. Diabetes status can be acquired through the International Classification of Diseases, 10th Revision, codes registered in the Danish National Patient Registry [41]. However, as with most diseases, diabetes lies on a continuum ranging from well controlled to uncontrolled. Therefore, it might be more relevant for future models to acquire the perioperative hemoglobin A1c level (glycated hemoglobin, which depicts the patient’s glycemic status over a 2- to 3-month period) as a measure of diabetes status [5].

Thirdly, missing data points should be considered a potential limitation, as important information can be lost. However, the high completeness of the Danish Knee Arthroplasty Registry resulted in just 1% missing data points in this study. To avoid losing valuable information by including only complete cases, we used missForest to impute the missing data points. As a random forest-based method, missForest is a rather novel approach to missing data, but it has been reported to be superior to more traditional imputation methods such as k-nearest neighbor [43]. The combination of a high-performing imputation method and a low percentage of missing data points makes this study rather robust to bias from missing data.

Finally, although we used four different machine-learning methods, other methods might have resulted in a better model. In this study, we built one regression algorithm (LASSO regression), two classification tree algorithms (the random forest classifier and the gradient boosting model), and a neural network. All of these have previously resulted in reliable predictive models in related fields [13, 26].
The support vector machine is another method, based on linear division of the data points: it seeks a multidimensional plane (a hyperplane) dividing the data into a binary outcome (such as revised/unrevised) [20]. A support vector machine might have resulted in different predictions compared with the presented models, but it is questionable whether those predictions would be of clinical use, given the low performance of the presented models.

Which Preoperative Factors Are the Most Important Features in These Models?

Previously identified risk factors for revision, such as age and weight, were among the top 10 important features in both the LASSO regression and the classification tree models (the random forest classifier and the gradient boosting model) [15, 24]. In contrast, the neural network assigned more importance to other known risk factors, such as post-fracture osteoarthritis [9]. Yet the most noticeable difference was that the LASSO regression ranked use of navigation and hemophilia as the most important features. This highlights fundamental differences between the models’ methodologies that should be considered when interpreting the assigned importance. In a regression model, assigning a high coefficient (that is, high importance) to a factor will not affect observations without this factor, whereas a classification model will split observations based on this factor. For example, nine observations in the training dataset had hemophilia, and splitting the dataset by this factor results in two subsamples: nine observations with hemophilia and 25,095 without. As none of the observations with hemophilia were revised in the training dataset (data not shown), this split would add little value to a classification tree (the random forest), as illustrated in the sketch below. However, these nine observations might be strongly related to other presurgical factors associated with revision (collinearity), leading the LASSO regression to assign hemophilia a high coefficient. This exemplifies that in predictive models, feature importance does not reflect causality or unbiased associations, even less so in low-performing models like those presented in this study. Instead, feature importance should be regarded as a reflection of how the predictions were calculated and thus serve as a reality check of predictive models.
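A toy calculation, using the numbers quoted above, shows why such a split contributes almost nothing to a classification tree:

```r
# Why a split on a rare, all-unrevised feature barely reduces node impurity.
gini <- function(p) 2 * p * (1 - p)   # impurity of a binary node

n      <- 25104                 # observations in the training dataset
n_rev  <- round(0.03 * n)       # ~3% revised
parent <- gini(n_rev / n)

# Hemophilia node: 9 observations, none revised (pure, but tiny)
left  <- gini(0)
right <- gini(n_rev / (n - 9))  # remaining node: nearly identical to the parent

child <- (9 / n) * left + ((n - 9) / n) * right
parent - child                  # impurity reduction is essentially zero
```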

Can a Clinically Meaningful Model Be Built on the Preoperative Factors Included in the Danish Knee Arthroplasty Registry?

Although all four models included variables previously associated with TKA revision as important features, none of them reached the predetermined AUC threshold of 0.7 defining a clinically meaningful model. With AUCs ranging from 0.57 to 0.60, the machine-learning models were only slightly better than the noninformative model, which predetermined all observations to be unrevised. The low AUCs indicate that the associations between the broad range of preoperative factors captured in the Danish Knee Arthroplasty Registry and early TKA revision were too weak to produce a clinically usable model. Although it would defeat the concept of a preoperative model, the inclusion of intra- and postoperative covariates might improve the models’ performance, as highlighted earlier. In a recent study aiming to predict the minimum clinically important difference after arthroplasty surgery, Fontana et al. [13] reported substantially better model performance when including information available before surgery as opposed to relying only on information available before the decision to conduct surgery. However, even with the inclusion of intra- and postoperative information, building a clinically meaningful model predicting early revision might be more difficult than predicting the minimum clinically important difference. With an early TKA revision rate of 3%, the noninformative model reached an accuracy of 0.97 and a Brier score of 0.03 by predicting that none of the observations were revised (a worked example follows below). The seemingly high performance of the noninformative model highlights the inherent difficulty in predicting rare outcomes. We tried to overcome this difficulty by over-sampling the revised observations and under-sampling the unrevised observations in the training dataset, thereby “inflating” potential associations between the preoperative factors and early revision. The modified training dataset consisted of 10% early TKA revisions to keep the model predictions within clinical range. Consequently, the predicted probabilities of revision in the machine-learning models were reasonably low during validation. However, the similar probabilities for revised and unrevised observations resulted in the models’ poor discriminative capacities (Fig. 2A-E). Future models might benefit from including a longer follow-up, as more implants will be revised and the association between the collected information and later revision might be stronger.
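A worked example of the noninformative baseline, using the rates quoted in the text (illustrative numbers, not the actual registry data):

```r
# Why a constant "no revision" predictor looks deceptively good at a 3% event rate.
n     <- 6170                      # 2016 validation observations
n_rev <- round(0.03 * n)           # ~3% revised within 2 years
obs   <- c(rep(1, n_rev), rep(0, n - n_rev))

pred <- rep(0, n)                  # always predict "no revision" (probability 0)

mean((pred >= 0.5) == obs)         # accuracy = 0.97: right for every unrevised knee
mean((pred - obs)^2)               # Brier score = 0.03: penalized only on the 3% revised
```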

Conclusions

In this study, the combination of machine-learning algorithms and preoperative information from the nationwide Danish Knee Arthroplasty Registry did not result in clinically useful predictions of TKA revision within the first 2 years after surgery. However, this study did show the difficulty of predicting early revision based solely on preoperative information, and it can serve as guidance for future models aiming to predict TKA revision. First, future models might benefit from including more preoperative clinical information, such as medical comorbidities, or from predicting revision over a longer follow-up period. Secondly, arthroplasty registries might aid future predictive models by providing a (preferably anonymous) surgeon identification variable, making the inclusion of surgeon-level information possible. Lastly, the inclusion of intra- and postoperative information might lead to a clinically meaningful postoperative predictive model, which could serve as guidance for personalized follow-up and rehabilitation.

Acknowledgments

We thank Danish knee surgeons for thoroughly reporting their surgeries to the Danish Knee Arthroplasty Registry and the registry’s steering committee for their goodwill in providing the data behind this study.

Footnotes

Each author certifies that neither he nor she, nor any member of his or her immediate family, has funding or commercial associations (consultancies, stock ownership, equity interest, patent/licensing arrangements, etc) that might pose a conflict of interest in connection with the submitted article.

All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.

Each author certifies that his or her institution approved the human protocol for this investigation and that all investigations were conducted in conformity with ethical principles of research.

This work was performed at Uniformed Services University-Walter Reed Department of Surgery, Bethesda, MD, USA.

References

1. Anderson AB, Wedin R, Fabbri N, Boland P, Healey J, Forsberg JA. External Validation of PATHFx Version 3.0 in Patients Treated Surgically and Nonsurgically for Symptomatic Skeletal Metastases. Clin Orthop Relat Res. 2020;478:808–818.
2. Benjamin DJ, Berger JO. Three Recommendations for Improving the Use of p-Values. Am Stat. 2019;73:186–191.
3. Bjorgul K, Novicoff WM, Saleh KJ. Evaluating comorbidities in total hip and knee arthroplasty: available instruments. J Orthop Traumatol. 2010;11:203–209.
4. Bolognesi MP, Marchant MH Jr, Viens NA, Cook C, Pietrobon R, Vail TP. The impact of diabetes on perioperative patient outcomes after total hip and total knee arthroplasty in the United States. J Arthroplasty. 2008;23:92–98.
5. Cancienne JM, Werner BC, Browne JA. Is There an Association Between Hemoglobin A1C and Deep Postoperative Infection After TKA? Clin Orthop Relat Res. 2017;475:1642–1649.
6. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD statement. J Clin Epidemiol. 2015;68:112–121.
7. Danish Knee Arthroplasty Registry. Annual Report 2019. Available at: https://www.sundhed.dk/content/cms/99/4699_dkr-aarsrapport-2019_til-offentliggoerelse.pdf. Accessed October 14, 2019.
8. Delanois RE, Mistry JB, Gwam CU, Mohamed NS, Choksi US, Mont MA. Current Epidemiology of Revision Total Knee Arthroplasty in the United States. J Arthroplasty. 2017;32:2663–2668.
9. El-Galaly A, Haldrup S, Pedersen AB, Kappel A, Jensen MU, Nielsen PT. Increased risk of early and medium-term revision after post-fracture total knee arthroplasty: Results from the Danish Knee Arthroplasty Register. Acta Orthop. 2017:1–6.
10. El-Galaly A, Nielsen PT, Jensen SL, Kappel A. Prior High Tibial Osteotomy Does Not Affect the Survival of Total Knee Arthroplasties: Results From the Danish Knee Arthroplasty Registry. J Arthroplasty. 2018;33:2131–2135.
11. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, Cui C, Corrado G, Thrun S, Dean J. A guide to deep learning in healthcare. Nat Med. 2019;25:24–29.
12. Fischer JE, Bachmann LM, Jaeschke R. A readers’ guide to the interpretation of diagnostic test properties: Clinical example of sepsis. Intensive Care Med. 2003;29:1043–1051.
13. Fontana MA, Lyman S, Sarker GK, Padgett DE, MacLean CH. Can machine learning algorithms predict which patients will achieve minimally clinically important differences from total joint arthroplasty? Clin Orthop Relat Res. 2019;477:1267–1279.
14. Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw. 2010;33:1–22.
15. Gøttsche D, Gromov K, Viborg PH, Bräuner EV, Pedersen AB, Troelsen A. Weight affects survival of primary total knee arthroplasty: study based on the Danish Knee Arthroplasty Register with 67,810 patients and a median follow-up time of 5 years. Acta Orthop. 2019;90:60–66.
16. Gowd AK, Agarwalla A, Amin NH, Romeo AA, Nicholson GP, Verma NN, Liu JN. Construct validation of machine learning in the prediction of short-term postoperative complications following total shoulder arthroplasty. J Shoulder Elbow Surg. 2019;28:410–421.
17. Greenwell B, Boehmke B, Cunningham J. Package ‘gbm’: Generalized Boosted Regression Models. Available at: https://cran.r-project.org/web/packages/gbm/gbm.pdf. Accessed September 13, 2019.
18. Greidanus NV, Peterson RC, Masri BA, Garbuz DS. Quality of Life Outcomes in Revision Versus Primary Total Knee Arthroplasty. J Arthroplasty. 2011;26:615–620.
19. Harris AHS, Kuo AC, Weng Y, Trickey AW, Bowe T, Giori NJ. Can Machine Learning Methods Produce Accurate and Easy-to-use Prediction Models of 30-day Complications and Mortality after Knee or Hip Arthroplasty? Clin Orthop Relat Res. 2019;477:452–460.
20. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2017. Available at: https://web.stanford.edu/∼hastie/Papers/ESLII.pdf. Accessed September 1, 2019.
21. Huber M, Kurz C, Leidl R. Predicting patient-reported outcomes following hip and knee replacement surgery using supervised machine learning. BMC Med Inform Decis Mak. 2019;19.
22. Inacio MCS, Paxton EW, Graves SE, Namba RS, Nemes S. Projected increase in total knee arthroplasty in the United States – an alternative projection model. Osteoarthr Cartil. 2017;25:1797–1803.
23. Insall JN, Dorr LD, Scott RD, Scott WN. Rationale of the Knee Society clinical rating system. Clin Orthop Relat Res. 1989:13–14.
24. Julin J, Jämsen E, Puolakka T, Konttinen YT, Moilanen T. Younger age increases the risk of early prosthesis failure following primary total knee replacement for osteoarthritis. A follow-up study of 32,019 total knee replacements in the Finnish Arthroplasty Register. Acta Orthop. 2010;81:413–419.
25. Karhade AV, Ogink PT, Thio QCBS, Cha TD, Gormley WB, Hershman SH, Smith TR, Mao J, Schoenfeld AJ, Bono CM, Schwab JH. Development of machine learning algorithms for prediction of prolonged opioid prescription after surgery for lumbar disc herniation. Spine J. 2019;19:1764–1771.
26. Karhade AV, Ogink PT, Thio QCBS, Broekman MLD, Cha TD, Hershman SH, Mao J, Peul WC, Schoenfeld AJ, Bono CM, Schwab JH. Machine learning for prediction of sustained opioid prescription after anterior cervical discectomy and fusion. Spine J. 2019;19:976–983.
27. Kuhn M. Building predictive models in R using the caret package. J Stat Softw. 2008;28:1–26.
28. Kursa MB, Rudnicki WR. Feature selection with the Boruta package. J Stat Softw. 2010;36:1–13.
29. Lau RL, Perruccio AV, Gandhi R, Mahomed NN. The role of surgeon volume on patient outcome in total knee arthroplasty: A systematic review of the literature. BMC Musculoskelet Disord. 2012;13:250.
30. Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2003;3:18–22.
31. Lunardon N, Menardi G, Torelli N. ROSE: A package for binary imbalanced learning. R J. 2014;6:79–89.
32. Malchau H, Garellick G, Berry D, Harris WH, Robertsson O, Kärrholm J, Lewallen D, Bragdon CR, Lidgren L, Herberts P. Arthroplasty implant registries over the past five decades: Development, current, and future impact. J Orthop Res. 2018;36:2319–2330.
33. Morey RD, Rouder JN, Jamil T, Urbanek S, Forner K, Ly A. BayesFactor: Computation of Bayes Factors for Common Designs. Available at: https://cran.r-project.org/web/packages/BayesFactor/index.html. Accessed October 15, 2019.
34. Obermeyer Z, Emanuel EJ. Predicting the Future - Big Data, Machine Learning, and Clinical Medicine. N Engl J Med. 2016;375:1216–1219.
35. Pedersen AB, Mehnert F, Odgaard A, Schroder HM. Existing data sources for clinical epidemiology: The Danish Knee Arthroplasty Register. Clin Epidemiol. 2012;4:125–135.
36. Pitta M, Esposito CI, Li Z, Lee Y-Y, Wright TM, Padgett DE. Failure After Modern Total Knee Arthroplasty: A Prospective Study of 18,065 Knees. J Arthroplasty. 2018;33:407–414.
37. Quintana DS, Williams DR. Bayesian alternatives for common null-hypothesis significance tests in psychiatry: A non-technical guide using JASP. BMC Psychiatry. 2018;18.
38. Ricciardi BF, Liu AY, Qiu B, Myers TG, Thirukumaran CP. What Is the Association between Hospital Volume and Complications after Revision Total Joint Arthroplasty: A Large-database Study. Clin Orthop Relat Res. 2019;477:1221–1231.
39. Schmidt M, Pedersen L, Sorensen HT. The Danish Civil Registration System as a tool in epidemiology. Eur J Epidemiol. 2014;29:541–549.
40. Schmidt M, Schmidt SAJ, Adelborg K, Sundbøll J, Laugesen K, Ehrenstein V, Sørensen HT. The Danish health care system and epidemiological research: from health care contacts to database records. Clin Epidemiol. 2019;11:563–591.
41. Schmidt M, Schmidt SAJ, Sandegaard JL, Ehrenstein V, Pedersen L, Sørensen HT. The Danish National Patient Registry: A review of content, data quality, and research potential. Clin Epidemiol. 2015;7:449–490.
42. Stekhoven DJ, Bühlmann P. MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28:112–118.
43. Tang F, Ishwaran H. Random forest missing data algorithms. Stat Anal Data Min. 2017;10:363–377.
44. Tibshirani R. Regression Shrinkage and Selection Via the Lasso. J R Stat Soc Ser B. 1996;58:267–288.
45. Vandenbroucke JP, von Elm E, Altman DG, Gotzsche PC, Mulrow CD, Pocock SJ, Poole C, Schlesselman JJ, Egger M. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. Epidemiology. 2007;18:805–835.
