Abstract
This cross-sectional study uses the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline to assess 120 published studies of surgical prediction models.
Despite the promise of machine learning and prediction models, many studies have failed to demonstrate a clear clinical benefit of predictive model-augmented medicine over standard practice.1,2 The limited clinical application of published prediction models has been attributed to multiple factors, such as limited interpretability of the underlying algorithm, differences in data between retrospective and prospective applications, and limited external validity of the model.1-3
To our knowledge, the quality with which surgical prediction models are designed and reported has not been investigated as a factor. Suboptimal adherence to best practices for development, validation, reporting, and transparency is a barrier to reproducibility and meaningful use of new tools and findings in clinical settings.1,4 We aimed to evaluate the quality of surgical prediction models using established guidelines and identify opportunities to improve their potential clinical utility.
Methods
We included original articles describing the development or validation of a surgical prediction model, published from January 1, 2018, to August 31, 2021, in the 4 surgery journals with the highest Scimago Journal Rank (SJR) indicators. Articles were manually screened and assessed according to the TRIPOD reporting guideline.5 The Cohen κ was used to measure interrater reliability of adherence assessments for individual TRIPOD items. Complete details are available in the eMethods of the Supplement. This study followed the STROBE reporting guideline and was exempt from institutional review board review and the need for informed consent in accordance with 45 CFR §46.102 because no human participants were involved.
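The Cohen κ corrects observed agreement for agreement expected by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e the chance-expected proportion. As a minimal illustrative sketch (the ratings below are hypothetical, not data from this study), item-level agreement between 2 reviewers could be computed as follows:

```python
# Minimal sketch: Cohen kappa for 2 raters' adherence calls on a single TRIPOD item.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the
# agreement expected by chance given each rater's marginal rating rates.
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary ratings (1 = item adequately reported, 0 = not reported)
# for the same 10 studies, one list per rater.
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen kappa: {kappa:.2f}")  # 1 indicates perfect agreement; 0, chance-level
```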
Results
We included 120 studies describing a surgical prediction model (Table 1). Most studies (66 [55%]) described model development with internal validation; fewer studies (37 [31%]) externally validated the model on an independent cohort. Eighty-three models (69%) used conventional statistical methods (eg, logistic regression without regularization), and 37 (31%) used machine learning. Most models (97 [81%]) performed prognostic prediction tasks, most commonly prediction of postoperative mortality (47 [49%]). Fewer models performed diagnostic prediction tasks (23 [19%]), most commonly diagnosis of disease metastasis (6 [26%]). Across all represented surgical specialties, the largest share of models was designed for hepatopancreaticobiliary prediction tasks (32 [27%]).
Table 1. Baseline Characteristics of Surgical Prediction Model Original Research Studies Included in This Study.
Characteristic | No. (%) |
---|---|
No. | 120 (100) |
Journal | |
JAMA Surgery | 15 (12.5) |
British Journal of Surgery | 34 (28.3) |
Annals of Surgery | 49 (40.8) |
Journal of the American College of Surgeons | 22 (18.3) |
Year | |
Jan-Dec 2018 | 30 (25.0) |
Jan-Dec 2019 | 32 (26.7) |
Jan-Dec 2020 | 34 (28.3) |
Jan-Aug 2021 | 24 (20.0) |
Study type | |
Development and internal validation | 66 (55.0) |
External validation | 11 (9.2) |
Incremental value | 6 (5.0) |
Development and external validation | 37 (30.8) |
Model type | |
Conventional statistical learning^a | 83 (69.2) |
Machine learning^b | 31 (25.8) |
Deep learning^c | 6 (5.0) |
Prediction type | |
Diagnostic | 23 (19.2) |
Metastasis | 6 (26.1) |
Malignancy/recurrence | 5 (21.7) |
Postoperative complication | 4 (17.4) |
Intraoperative phase | 4 (17.4) |
Other^d | 4 (17.4) |
Prognostic | 97 (80.8) |
Postoperative | |
Mortality | 47 (48.5) |
Outcomes^e | 31 (32.0) |
Complications | 16 (16.5) |
Other^f | 3 (3.1) |
Surgical specialty | |
Hepatopancreaticobiliary | 32 (26.7) |
General | 15 (12.5) |
Surgical oncology | 14 (11.7) |
Multispecialty | 12 (10.0) |
Thoracic | 10 (8.3) |
Colorectal | 10 (8.3) |
Trauma/burn | 9 (7.5) |
Vascular | 6 (5.0) |
Transplant | 5 (4.2) |
Endocrine | 3 (2.5) |
Minimally invasive surgery | 3 (2.5) |
Orthopedics | 1 (0.8) |
^a Conventional learning included Cox proportional hazards models and logistic regression without regularization.
^b Machine learning models included penalized regression techniques such as least absolute shrinkage and selection operator, support vector machines, decision trees, random forests, and ensemble techniques such as extreme gradient boosting.
^c Deep learning models included deep neural networks.
^d Other diagnostic prediction tasks included diagnosis of noncancerous surgical conditions such as appendicitis or diagnosis of a relevant comorbidity.
^e Postoperative outcomes prognostic prediction tasks included predicting nonmortality outcomes such as disease recurrence, progression, and remission.
^f Other prognostic prediction tasks included predicting health care costs, surgeon decisions about patient management, and operative time.
Median compliance with the TRIPOD guideline across all 120 studies was 53.8% (IQR, 47.1%-61.5%) (Table 2). The lowest compliance for any single published study was 19%, and no study exceeded 85% compliance. Models that used conventional statistical methods exhibited higher median compliance (56%; IQR, 50%-63%) than machine learning–based models (50%; IQR, 42%-60%) (P = .002).
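The article does not name the statistical test behind this P value; one plausible choice for comparing 2 distributions of per-study compliance percentages is a Mann-Whitney U test, sketched here on simulated data (the values are illustrative, not the study's):

```python
# Illustrative sketch: comparing per-study TRIPOD compliance between model types.
# The compliance values are simulated; the study does not name its test, and the
# Mann-Whitney U test is an assumption for comparing the two groups.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
conventional = rng.normal(loc=56, scale=8, size=83).clip(19, 85)      # 83 conventional models
machine_learning = rng.normal(loc=50, scale=9, size=37).clip(19, 85)  # 37 ML models

stat, p_value = mannwhitneyu(conventional, machine_learning, alternative="two-sided")
print(f"Median compliance, conventional: {np.median(conventional):.1f}%")
print(f"Median compliance, machine learning: {np.median(machine_learning):.1f}%")
print(f"Mann-Whitney U P value: {p_value:.3f}")
```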
Table 2. Performance of 120 Studies in the Cohort on the TRIPOD Statement Criteria.
TRIPOD criterion^a | No. (%) |
---|---|
All criteria, median (IQR), % | 53.8 (47.1-61.5) |
Methods and results criteria, median (IQR), % | 52.9 (44.3-58.8) |
Informative title (1) | 29 (24.2) |
Informative abstract (2) | 9 (7.5) |
Informative background (3a) | 41 (34.2) |
Clear objectives stated (3b) | 112 (93.3) |
Source of data specified (4a) | 114 (95.0) |
Key dates specified (4b) | 108 (90.0) |
Setting for cohorts specified (5a) | 102 (85.0) |
Eligibility criteria stated (5b) | 92 (76.7) |
Outcome defined (6a) | 65 (54.2) |
Predictors defined (7a) | 62 (51.7) |
Study size stated (8) | 33 (27.5) |
Handling of missing data explained (9) | 35 (29.2) |
Handling of predictors explained (10a) | 56 (46.8) |
Model building steps described (10b)^b | 3 (2.8) |
Validation predictions described (10c) | 43 (35.8) |
Performance measures reported (10d) | 65 (54.2) |
Differences between development and validation cohorts shown (12) | 34 (28.3) |
Participant flow shown (13a) | 100 (83.3) |
Demographic and missing data in all cohorts described (13b)^b | 14 (11.7) |
Distribution of predictors and outcomes shown (13c) | 63 (52.8) |
Model development described (14a) | 96 (79.8) |
Model specifications and parameters shown (15a)^b | 18 (14.7) |
Model fully presented for use (15b) | 65 (54.2) |
Model performance fully described (16) | 48 (40.0) |
Limitations described (18) | 118 (98.3) |
Validation performance interpreted (19a) | 77 (64.2) |
Interpretation of results described (19b) | 120 (100) |
Implications for clinical use and research (20) | 96 (80.0) |
Supplemental information provided (21) | 93 (77.5) |
Funding explicitly stated (22) | 21 (17.5) |
Abbreviation: TRIPOD, Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis.
^a The parenthetical numbers enumerate the corresponding TRIPOD criteria.
^b Describing all model-building steps, reporting missing data and demographic details in all cohorts, and reporting all model parameters were among the criteria with the lowest compliance.
TRIPOD criteria with the lowest compliance included reporting all model building procedures (3 [2.8%]), displaying full model specifications (18 [14.7%]), and describing characteristics of the included cohorts (14 [11.7%]). Only 65 studies (54.2%) fully presented their model so that readers could use it to make future individualized predictions.
Discussion
We found that surgical prediction models exhibited suboptimal compliance with established guidelines for model development, validation, and reporting, consistent with typical TRIPOD compliance rates for clinical prediction models in other medical fields.6 Notably, most researchers developed their models without external validation on an independent cohort. TRIPOD items with the lowest rates of compliance were those that addressed critical elements of model design and performance (Table 2); addressing these gaps may remove barriers to translating models into clinical practice and may help improve patient care. A study limitation is that this was not a formal systematic review; studies published in nonsurgical or surgical subspecialty journals were not included. The reproducibility and interpretability of these models may be improved by closing gaps in the reporting of model parameters and in explanations of how the models can be applied.
Findings of this study suggest that models trained to deliver critical information to surgeons—such as a patient’s risk of mortality or serious morbidity—should be designed with utmost attention to transparency and proper methods. Compliance with established best practices may facilitate the translation of surgical prediction models to the bedside and give them a better chance of improving patient care.
eMethods. Cohort Selection, Data Collection, Methodologic Limitations, and Data Availability
eReferences
References
- 1. Sounderajah V, Ashrafian H, Karthikesalingam A, et al; STARD-AI Study Group. Developing specific reporting standards in artificial intelligence centred research. Ann Surg. 2022;275(3):e547-e548. doi:10.1097/SLA.0000000000005294
- 2. Zhou Q, Chen ZH, Cao YH, Peng S. Clinical impact and quality of randomized controlled trials involving interventions evaluating artificial intelligence prediction tools: a systematic review. NPJ Digit Med. 2021;4(1):154. doi:10.1038/s41746-021-00524-2
- 3. Marwaha JS, Kvedar JC. Crossing the chasm from model performance to clinical impact: the need to improve implementation and evaluation of AI. NPJ Digit Med. 2022;5(1):1-7. doi:10.1038/s41746-022-00572-2
- 4. Casas JP, Kwong J, Ebrahim S. Telemonitoring for chronic heart failure: not ready for prime time. Cochrane Database Syst Rev. 2010;2011(8):ED000008. doi:10.1002/14651858.ED000008
- 5. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. doi:10.1136/bmj.g7594
- 6. Andaur Navarro CL, Damen JAA, Takada T, et al. Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review. BMC Med Res Methodol. 2022;22(1):12. doi:10.1186/s12874-021-01469-6