Where Are We Now?
In the current study, Bongers and colleagues [3] present an external validation of a prognostic machine-learning algorithm. In doing so, the authors have tapped into the hot topic of the imminent artificial-intelligence (AI) revolution.
Machine learning is a subset of AI, and essentially is the combination of statistics, which determines relationships from data, and computer science, with its emphasis on efficient algorithms [5]. The efficiency of computer algorithms allows for more-refined modelling that considers nonlinear relationships and interactions, with the goal of generating more-accurate predictions. However, like any other prediction tool, a robust assessment and reporting of its validity is still required [6].
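As a purely illustrative example of that point, consider an outcome driven by an interaction between two risk factors: a standard logistic regression on the raw features cannot capture it, whereas a tree-based learner can. The sketch below uses synthetic data and the open-source scikit-learn library; it is not a clinical model, and the features and outcome are stand-ins.

```python
# A minimal, illustrative sketch (synthetic data, not a clinical model):
# the outcome depends on an interaction between two "risk factors," which
# a plain logistic regression misses but a random forest learns directly.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # pure interaction (XOR-like) effect

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, model in [("logistic regression", LogisticRegression()),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.2f}")  # ~0.5 for logistic, near 1.0 for forest
```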
Observing the proliferation of machine-learning abstracts presented at surgical conferences, it is my impression that many clinicians view machine learning as a “black box” that provides a magical solution to our clinical challenges. In reality, however, the underlying statistical principles are decades old, so the same threats to validity still apply: inadequate risk adjustment, a dearth of quality data, overfitting of the models, and questionable generalizability to settings other than the ones studied.
Altman and Royston [2] list five considerations when evaluating the validity of prediction models. The ideal model should (1) be validated against external datasets, (2) provide wide-ranging prognostic outcomes, (3) show good accuracy and calibration between predicted and actual outcomes, (4) present a quantification of the model’s performance, and (5) prespecify adequate performance, including an unbiased estimate of the prediction error. Many of these principles have been summarized in the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement, which should serve as the minimum requirement for reporting these studies [4].
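To make points (3) and (4) concrete, the sketch below shows one common way to quantify a model’s performance on an external dataset: discrimination (area under the ROC curve), overall prediction error (Brier score), and calibration intercept and slope. It assumes scikit-learn, and the predictions and outcomes are simulated stand-ins, not data from the study under discussion.

```python
# A minimal sketch (simulated data, scikit-learn) of quantifying external
# validity: discrimination (AUC), overall error (Brier score), and calibration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
p_hat = rng.uniform(0.05, 0.95, 500)       # model's predicted risks
y = (rng.random(500) < p_hat).astype(int)  # simulated observed outcomes

# Discrimination: does the model rank events above nonevents?
print(f"AUC:   {roc_auc_score(y, p_hat):.2f}")

# Overall error: mean squared difference between prediction and outcome
print(f"Brier: {brier_score_loss(y, p_hat):.3f}")

# Calibration: regress outcomes on the logit of predicted risk (a simplified
# joint fit; intercept near 0 and slope near 1 suggest good calibration)
logit = np.log(p_hat / (1 - p_hat)).reshape(-1, 1)
cal = LogisticRegression(C=1e12).fit(logit, y)
print(f"Calibration intercept: {cal.intercept_[0]:.2f}, "
      f"slope: {cal.coef_[0][0]:.2f}")
```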
At its core, surgical decision-making involves collaborative and informed conversations between the patient and the surgeon regarding the expected risks and benefits of the procedure. Traditionally, the prediction of benefits and harms is based on the surgeon’s experience, supported by his or her understanding of the available evidence. Clinical Decision Support (CDS) tools, based on prediction models, might provide more-accurate assessments of risk and reward, so as to better support the decision process.
Several CDS tools are available for use by surgeons, including the American College of Surgeons NSQIP Surgical Risk Calculator (https://riskcalculator.facs.org/RiskCalculator/) and the SORG algorithm (https://sorg-apps.shinyapps.io/chondrosarcoma). The latter was assessed in the study here by Bongers and colleagues [3], who found that the SORG algorithm accurately predicts 5-year survival in patients surgically treated for chondrosarcoma. Because the tool is freely available online, any surgeon can use it to guide these important conversations with patients who have this condition.
Where Do We Need To Go?
I predict that we soon will see an explosion of machine-learning CDS tools. We need to ensure they are developed, validated, and reported robustly, using standards like the TRIPOD statement. However, the development and validation of machine-learning CDS tools in surgery and orthopaedics pose additional challenges that should be considered.
Most CDS tools predict binary outcomes, such as whether a patient with cancer will survive 5 years. Binary outcomes are relatively easy to analyze and interpret. However, compared with other surgical disciplines, orthopaedic surgery is unique in that survival is not the only goal; almost equally important is restoration of quality of life (QOL). This outcome is nonbinary and multidimensional, and different domains within the construct of QOL carry different weights of importance for the individual patient. Further research is needed into how we predict these types of outcomes and how they should be presented for meaningful CDS.
Another issue is that most outcomes in surgical trials are “unbalanced.” The receiver operating characteristic (ROC) curve [1], which is commonly used to measure discrimination, may be less appropriate in datasets that have a disproportionate ratio of events to nonevents. Common unbalanced surgical outcomes include complications and poor functional results. Because the ROC curve plots the true-positive rate (sensitivity) against the false-positive rate (1 − specificity), the abundance of true negatives in an unbalanced dataset keeps the false-positive rate low, and the curve can appear deceptively favorable. In the setting of unbalanced outcomes, events are rare relative to nonevents, and this can lead to a misleading report of accuracy [7].
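A minimal sketch of this phenomenon (synthetic data with an assumed 5% event rate, scikit-learn; not the authors’ analysis): the ROC summary and the precision-recall summary (average precision) of the very same model can diverge sharply when events are rare.

```python
# Illustrative sketch: on an imbalanced dataset (~5% events), ROC AUC can look
# reassuring while the precision-recall summary reveals weaker performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic surgical-outcome stand-in: ~5% of patients have the event
X, y = make_classification(n_samples=20000, n_features=20, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]

print(f"ROC AUC:           {roc_auc_score(y_te, p):.2f}")
print(f"Average precision: {average_precision_score(y_te, p):.2f}")
# Typically the ROC AUC sits much closer to 1.0 than the average precision,
# because the many true negatives keep the false-positive rate low [7].
```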
Saito and Rehmsmeier [7] consider the precision-recall plot (summarized by measures such as the F1 score) more informative with unbalanced data. Precision is the proportion of positive predictions that are true positives; because it does not use true negatives, it is less prone to bias in datasets dominated by them. Simply put, any model (or surgeon) would appear relatively accurate if it simply predicted that no patient with a moderate to severe spinal cord injury would walk again, because most of these patients do not walk (so most predictions would be true negatives). In their analysis, precision-recall-based measures were more robust than the ROC at revealing the true discriminatory power of prognostic models. Regardless, measures of model performance need to be weighed against the clinical importance of the model’s sensitivity, specificity, and positive predictive value.
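The same point in miniature (the 10% rate of regaining ambulation below is hypothetical, chosen only for illustration): a degenerate model that predicts “will not walk” for every patient achieves roughly 90% accuracy yet has zero sensitivity, precision, and F1 for the outcome we actually care about.

```python
# Hedged illustration of the "no one walks" example above (hypothetical 10%
# base rate). Predicting "no" for everyone looks accurate overall but is
# useless for identifying patients who will regain ambulation.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

rng = np.random.default_rng(0)
y_true = rng.random(1000) < 0.10       # 1 = regains walking (rare, ~10%)
y_pred = np.zeros(1000, dtype=bool)    # degenerate model: "no" for everyone

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")  # ~0.90, looks good
print(f"Recall:    {recall_score(y_true, y_pred, zero_division=0):.2f}")   # 0
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.2f}")  # 0
print(f"F1 score:  {f1_score(y_true, y_pred, zero_division=0):.2f}")       # 0
```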
Because these validation approaches are unfamiliar to many readers, it is important that these studies be written for relatively naïve audiences, and that they be tightly edited by journals. Elements of these studies, and eventually of the CDS tools themselves, that make interpretation more intuitive include confidence limits on the predictions offered, calibration results, and slider tools that allow quick adjustment of risk factors so that changes in the prediction reflect the clinician’s understanding.
How Do We Get There?
A developed and validated CDS tool does not automatically translate into something that is clinically useful. “If you build it … will they come?” Creating an interactive web-based tool may be a first step, but research into the presentation and use of CDS to inform better care is still in its infancy. The National Academy of Medicine lists seven items that should also be considered and studied in the development of CDS tools [8]: A CDS tool should (1) provide measurable value in addressing a problem area, (2) incorporate multiple data types, including the most-current evidence, (3) produce actionable insights for the clinician and patient, (4) deliver information that is transparent and allows the user to make a practice decision, (5) demonstrate good usability, including clear, relevant, and user-friendly displays, (6) be testable in small settings yet scalable, and (7) support participation in quality-improvement initiatives.
Despite the futuristic appeal, machine learning and AI alone should not be considered a panacea for predictive CDS. The algorithms and formulas behind them have been around for decades. Their current limitations remain the need to develop and study meaningful user interfaces and the requirement for large amounts of accurate data. Currently, the majority of “Big Data” used for machine learning is derived from administrative sources. Although this type of data is good for billing and other purposes, it lacks the nuances that are important for clinical decision-making. Furthermore, functional and QOL information usually is not captured at all. These outcomes, clinically relevant baseline risk factors, and accurate descriptions of the invasiveness of the surgery need to be collected to a high standard and made available for machine-learning models to train on. It is incumbent on us as surgeons, along with our patient partners, to take the lead in ensuring that the correct data are accurately collected and used to design meaningful CDS tools and interfaces to assist in the improvement of patient care.
Footnotes
This CORR Insights® is a commentary on the article “Does the SORG Algorithm Predict 5-year Survival in Patients with Chondrosarcoma? An External Validation” by Bongers and colleagues available at: DOI: 10.1097/CORR.0000000000000748.
The author certifies that neither he, nor any members of his immediate family, have any commercial associations (such as consultancies, stock ownership, equity interest, patent/licensing arrangements, etc.) that might pose a conflict of interest in connection with the submitted article.
All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.
The opinions expressed are those of the writer, and do not reflect the opinion or policy of CORR® or The Association of Bone and Joint Surgeons®.
References
- 1. Alba AC, Agoritsas T, Walsh M, Hanna S, Iorio A, Devereaux PJ, McGinn T, Guyatt G. Discrimination and calibration of clinical prediction models: Users’ guides to the medical literature. JAMA. 2017;318:1377-1384.
- 2. Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med. 2000;19:453-473.
- 3. Bongers MER, Thio QCBS, Karhade AV, Stor ML, Raskin KA, Lozano Calderon SA, DeLaney TF, Ferrone ML, Schwab JH. Does the SORG algorithm predict 5-year survival in patients with chondrosarcoma? An external validation. Clin Orthop Relat Res. [Published online ahead of print]. DOI: 10.1097/CORR.0000000000000748.
- 4. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. Ann Intern Med. 2015;162:55-63.
- 5. Deo RC. Machine learning in medicine. Circulation. 2015;132:1920-1930.
- 6. Hashimoto DA, Rosman G, Rus D, Meireles OR. Artificial intelligence in surgery: Promises and perils. Ann Surg. 2018;268:70-76.
- 7. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10:e0118432.
- 8. Tcheng JE. Optimizing Strategies for Clinical Decision Support. Washington, DC: National Academy of Medicine; 2017.
