Clinical Orthopaedics and Related Research. 2020 Jun 4;478(10):2364-2366. doi: 10.1097/CORR.0000000000001344

CORR Insights®: What Is the Accuracy of Three Different Machine Learning Techniques to Predict Clinical Outcomes After Shoulder Arthroplasty?

Jonathan A. Forsberg
PMCID: PMC7491900  PMID: 32511144

Where Are We Now?

Machine learning is at the heart of a variety of applications in orthopaedic surgery. Guided by increasingly stringent statistical standards, such as the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement [2] and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) initiative (www.strobe-statement.org), investigators with good-quality clinical data can produce models trained to estimate the likelihood of any number of outcomes.

In this issue of Clinical Orthopaedics and Related Research®, Kumar et al. [3] conducted an experiment designed to determine whether three machine-learning techniques could predict surgical outcomes in patients undergoing total shoulder arthroplasty. Rather than designing classifiers, which estimate the likelihood of a particular event, they used regression, extreme gradient boosting, and a convolutional artificial neural network to generate point estimates of pain scores, ROM, and a variety of outcome measures, including the American Shoulder and Elbow Surgeons (ASES), University of California, Los Angeles (UCLA), Constant, and global shoulder function scores. In an impressive application of this technology, the team was able to predict both short- and long-term outcomes up to 5 years postoperatively while considering the minimal clinically important difference (MCID) thresholds for each measure.
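To make that distinction concrete, here is a minimal Python sketch of my own (not the authors' code): three off-the-shelf regressors predict a continuous outcome score, and each prediction is judged against an assumed MCID. The synthetic data, the MCID of 6.4, and the use of scikit-learn's MLPRegressor as a stand-in for the neural network are all illustrative assumptions.

    # Illustrative sketch only: synthetic data, invented MCID, and MLPRegressor
    # standing in for the neural network described in the study.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neural_network import MLPRegressor
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))                                 # placeholder preoperative features
    y = X @ rng.normal(size=8) + rng.normal(scale=2.0, size=500)  # placeholder outcome scores

    models = {
        "regression": LinearRegression(),
        "extreme gradient boosting": XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1),
        "neural network": MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
    }

    MCID = 6.4  # hypothetical minimal clinically important difference on this scale

    for name, model in models.items():
        model.fit(X, y)          # in-sample fit for brevity; held-out validation is sketched below
        pred = model.predict(X)
        within = np.mean(np.abs(pred - y) < MCID)
        print(f"{name}: {within:.0%} of predictions fall within the MCID of the observed score")

A classifier, by contrast, would return the probability of crossing some threshold rather than a score on the instrument's own scale.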

Kumar et al. [3] applied what is called supervised machine learning. They allowed the algorithms to consider only specific preoperative data, then evaluated the output to ensure it made sense clinically. Their input data were meticulously collected from 1895 patients who underwent total shoulder arthroplasty by one of 30 surgeons using a single implant system.
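In practice, that supervised framing amounts to handing the algorithm a matrix of preoperative inputs and a vector of observed outcomes to learn from. The toy example below uses invented field names and values, not the authors' registry or preprocessing.

    # Illustrative only: invented variables standing in for registry fields.
    import pandas as pd

    df = pd.DataFrame({
        "age": [66, 72, 58],
        "sex": ["F", "M", "F"],
        "preop_ases": [35, 41, 28],
        "diagnosis": ["OA", "cuff tear arthropathy", "OA"],
        "ases_score_2yr": [78, 65, 82],   # the outcome the model is asked to learn
    })

    preop_features = ["age", "sex", "preop_ases", "diagnosis"]  # the only inputs the model "sees"
    X = pd.get_dummies(df[preop_features], drop_first=True)     # encode categorical inputs
    y = df["ases_score_2yr"]                                    # the supervised target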

The algorithms learned patterns within the data that may not be readily apparent. In some cases, the patterns may be unique to the training data and may not generalize to a broader patient population. However, Kumar et al. [3] mitigated this risk by starting with data provided by a large group of surgeons. In addition, they validated the model's performance on a subset of their data that had been sequestered and hidden from the training process. In doing so, they required each model to predict outcomes using data it had never before encountered. Using good-quality data and rigorous validation methods is a recipe for success and increases the likelihood that the models developed in this experiment are generalizable to a wider patient population.
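A minimal sketch of that safeguard, using scikit-learn's train_test_split and XGBoost as tooling of my own choosing (the 80/20 split and synthetic data are assumptions, not the authors' design):

    # Illustrative only: synthetic data, assumed 80/20 split.
    import numpy as np
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1895, 10))                                           # placeholder features
    y = 70 + 5 * (X @ rng.normal(size=10)) + rng.normal(scale=8, size=1895)   # placeholder scores

    # Sequester 20% of patients; the model never sees them during training.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = XGBRegressor(n_estimators=300, max_depth=3, learning_rate=0.05)
    model.fit(X_train, y_train)

    print("held-out mean absolute error:", mean_absolute_error(y_test, model.predict(X_test)))

Reporting performance only on the sequestered set is what distinguishes internal validation from a simple goodness-of-fit check on the training data.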

Where Do We Need To Go?

Machine learning and artificial intelligence are creeping into our lives from all directions. However, as my wife reminds me, not all of us are ready for a self-driving car. I imagine fewer of us are ready for “self-driving” or unsupervised algorithms in health care. Nevertheless, Kumar et al. [3] opine that machine-learning applications will soon be ubiquitous in medicine, as well as in the management of healthcare resources. I tend to agree. As an alternative to unsupervised methods, supervised machine learning, as exemplified by Kumar et al. [3], may be able to balance the speed and accuracy of machine-based solutions with the transparency required in peer-reviewed studies. Still, with this increase in computing power comes a responsibility to ensure that bias does not creep into the models we build.

How Do We Get There?

One of the most important reasons we publish in a peer-reviewed manner is to allow others to evaluate our methods and repeat our experiment, if necessary. In translational or basic science, all of the information necessary to replicate the work is usually described in intricate detail in the methods section. Just as the vendor, reagent, and concentration are necessary to reproduce preclinical work, the type of statistical software, library or package information, and modeling parameters are essential pieces of information for data scientists.
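A small habit goes a long way here. The Python snippet below is my own suggestion, not a requirement of any journal: it records the software versions and the modeling parameters a reader would need to reproduce the fit.

    # Capture the environment and model configuration alongside the results.
    import sys

    import sklearn
    import xgboost
    from xgboost import XGBRegressor

    model = XGBRegressor(n_estimators=300, max_depth=3, learning_rate=0.05)

    print("python:", sys.version.split()[0])
    print("scikit-learn:", sklearn.__version__)
    print("xgboost:", xgboost.__version__)
    print("model parameters:", model.get_params())  # report these in the methods or a supplement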

Most machine-learning algorithms, including those used by Dr. Kumar and his team, can be quite complicated. The process of generating models includes a series of steps designed to impute missing data, improve the learning process, and minimize error. A number of “tuning” parameters must be specified by the user, resulting, in some cases, in hundreds of thousands of possible combinations. In an effort to make model building more efficient, a number of algorithms, including the XGBoost library used by Dr. Kumar and his team, may also offer automated hyperparameter-tuning functions designed to select the optimal parameters based on the training data. Once the tuning parameters have been specified, each algorithm then produces thousands of trial models, varying each in some way, in an attempt to minimize erroneous predictions. The iteration producing the best-performing model is recorded so that a unique model can be generated quickly for later use. Given the diversity and complexity of the modeling process, it is impossible to fully understand and repeat experiments that use machine-learned methods unless we are able to examine the coded instructions used to produce these one-of-a-kind models.
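The sketch below illustrates the kind of procedure being described, using XGBoost's native training interface with a small, user-specified tuning grid and early stopping. The grid, the synthetic data, and the file name are invented, and real search spaces are far larger.

    # Illustrative only: a tiny hyperparameter grid with early stopping; the best
    # iteration is recorded and the winning model is saved for later use.
    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = 5 * X[:, 0] + rng.normal(size=1000)

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dval = xgb.DMatrix(X_val, label=y_val)

    best = None
    for max_depth in (2, 3, 4):
        for eta in (0.03, 0.1):
            params = {"objective": "reg:squarederror", "max_depth": max_depth, "eta": eta}
            booster = xgb.train(params, dtrain, num_boost_round=2000,
                                evals=[(dval, "val")], early_stopping_rounds=50,
                                verbose_eval=False)
            if best is None or booster.best_score < best[0]:
                best = (booster.best_score, params, booster.best_iteration, booster)

    rmse, params, best_iteration, booster = best
    print("best parameters:", params, "| best iteration:", best_iteration, "| validation RMSE:", rmse)
    booster.save_model("shoulder_outcome_model.json")  # the one-of-a-kind model referred to above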

Another important function of the peer review process is to allow others to perform external or independent validation using new sources of data. To facilitate this process, some machine-learned methods, such as the algorithms used by Dr. Kumar and his team, produce an equation of sorts. These equations contain all of the information necessary to derive a prediction and can be published or otherwise made available to those seeking to perform external validation studies. Other algorithms require the training data to be used in conjunction with the output file to produce estimates. In that case, further validation is impossible without also publishing the training data used to generate the final models.
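The contrast can be seen with two familiar algorithms of my own choosing: a fitted linear regression reduces to an intercept and coefficients that can be printed in a paper and validated externally, whereas a k-nearest-neighbors model cannot make a single prediction without the training records it stores.

    # Illustrative only: synthetic data and my own choice of algorithms.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 4))
    y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(size=300)

    lin = LinearRegression().fit(X, y)
    print("publishable 'equation':", lin.intercept_, lin.coef_)  # enough for external validation

    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    # The fitted k-NN model is just the stored training set plus a distance rule;
    # without the original data, no one else can reproduce its predictions.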

The input data are the foundation of any machine-learning technique. Readers must be given the opportunity to examine the distributions of demographics, other prognostic features, and outcomes in the training and validation sets. Doing so allows one to assess whether the models may generalize to a broader patient population, arguably a prerequisite to considering whether independent validation studies are desirable. Further, making the training and validation data available allows readers to evaluate the degree of missing data, as well as unbalanced or skewed distributions that could bias the machine-learning process. I am surprised by how often this information is missing from studies that apply machine-learning techniques. Each of these considerations is important for the reader to determine whether the imputation strategy or modeling method was appropriate for the available data and whether the resultant models may contain hidden biases.
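A “Table 1”-style check along these lines takes only a few lines of code. The example below uses invented variables and simulated missingness to show the comparisons a reader should be able to make between the training and validation sets.

    # Illustrative only: invented variables and a simulated 10% missingness rate.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    df = pd.DataFrame({
        "age": rng.normal(68, 9, 500),
        "sex": rng.choice(["F", "M"], 500, p=[0.55, 0.45]),
        "preop_ases": rng.normal(38, 12, 500),
        "split": rng.choice(["train", "validation"], 500, p=[0.8, 0.2]),
    })
    df.loc[rng.random(500) < 0.1, "preop_ases"] = np.nan  # simulate missing data

    print(df.groupby("split")["age"].describe())                               # continuous distributions
    print(df.groupby("split")["sex"].value_counts(normalize=True))             # categorical balance
    print(df.groupby("split")["preop_ases"].apply(lambda s: s.isna().mean()))  # missingness by split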

As computer-intensive methods become more popular, we have an opportunity to standardize reporting above and beyond the recommendations of the TRIPOD statement and the STROBE initiative. In fact, the unique nature of machine-learned models has prompted several journals to publish guidelines and requirements for submissions. Nature Medicine, for instance, requires authors to submit the source code and test data if “the custom algorithm is central to the paper, and has not been reported previously” in peer-reviewed form [4]. Bioinformatics, one of the leading journals on this subject, requires that authors who publish “novel algorithms” make the code and test data available for 2 years after the date of publication [1]. It appears that the top journals in this field recognize that the training data and coding instructions are required to fully understand the methods.

As orthopaedic researchers, let’s get used to submitting the code. As readers and referees, let’s insist on it. Remember, before we can determine whether the conclusions are supported by the methods and the data, we need access to this important information.

Footnotes

This CORR Insights® is a commentary on the article “What Is the Accuracy of Three Different Machine Learning Techniques to Predict Clinical Outcomes After Shoulder Arthroplasty?” by Kumar and colleagues available at: DOI: 10.1097/CORR.0000000000001263.

The author certifies that he, or a member of his immediate family, has no funding or commercial associations (eg, consultancies, stock ownership, equity interest, patent/licensing arrangements, etc.) that might pose a conflict of interest in connection with the submitted article.

All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.

The opinions expressed are those of the writer, and do not reflect the opinion or policy of CORR® or The Association of Bone and Joint Surgeons®.

References

