PLOS Medicine
PLOS Med. 2020 Oct 15;17(10):e1003253. doi: 10.1371/journal.pmed.1003253

Developing and validating subjective and objective risk-assessment measures for predicting mortality after major surgery: An international prospective cohort study

Danny J N Wong 1,2, Steve Harris 3, Arun Sahni 1,2, James R Bedford 1,2, Laura Cortes 2, Richard Shawyer 4, Andrew M Wilson 5, Helen A Lindsay 5, Doug Campbell 5, Scott Popham 6, Lisa M Barneto 7, Paul S Myles 8; SNAP-2: EPICCS collaborators, S Ramani Moonesinghe 1,2,*
Editor: David Menon
PMCID: PMC7561094  PMID: 33057333

Abstract

Background

Preoperative risk prediction is important for guiding clinical decision-making and resource allocation. Clinicians frequently rely solely on their own clinical judgement for risk prediction rather than objective measures. We aimed to compare the accuracy of freely available objective surgical risk tools with subjective clinical assessment in predicting 30-day mortality.

Methods and findings

We conducted a prospective observational study in 274 hospitals in the United Kingdom (UK), Australia, and New Zealand. For 1 week in 2017, prospective risk, surgical, and outcome data were collected on all adults aged 18 years and over undergoing surgery requiring at least a 1-night stay in hospital. Recruitment bias was avoided through an ethical waiver to patient consent; a mixture of rural, urban, district, and university hospitals participated. We compared subjective assessment with 3 previously published, open-access objective risk tools for predicting 30-day mortality: the Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality (P-POSSUM), Surgical Risk Scale (SRS), and Surgical Outcome Risk Tool (SORT). We then developed a logistic regression model combining subjective assessment and the best objective tool and compared its performance to each constituent method alone. We included 22,631 patients in the study: 52.8% were female, median age was 62 years (interquartile range [IQR] 46 to 73 years), median postoperative length of stay was 3 days (IQR 1 to 6), and inpatient 30-day mortality was 1.4%. Clinicians used subjective assessment alone in 78.9% of cases. All methods overpredicted risk, but visual inspection of plots showed the SORT to have the best calibration. The SORT demonstrated the best discrimination of the objective tools (SORT Area Under Receiver Operating Characteristic curve [AUROC] = 0.90, 95% confidence interval [CI]: 0.88–0.92; P-POSSUM = 0.89, 95% CI: 0.88–0.91; SRS = 0.85, 95% CI: 0.82–0.87). Subjective assessment demonstrated good discrimination (AUROC = 0.89, 95% CI: 0.86–0.91) that was not different from the SORT (p = 0.309). Combining subjective assessment and the SORT improved discrimination (bootstrap optimism-corrected AUROC = 0.92, 95% CI: 0.90–0.94) and demonstrated continuous Net Reclassification Improvement (NRI = 0.13, 95% CI: 0.06–0.20, p < 0.001) compared with subjective assessment alone. Decision-curve analysis (DCA) confirmed the superiority of the SORT over other previously published models, and the SORT–clinical judgement model again performed best overall. Our study is limited by the low mortality rate, by the lack of blinding in the ‘subjective’ risk assessments, and because we only compared the performance of clinical risk scores as opposed to other prediction tools such as exercise testing or frailty assessment.

Conclusions

In this study, we observed that the combination of subjective assessment with a parsimonious risk model improved perioperative risk estimation. This may be of value in helping clinicians allocate finite resources such as critical care and to support patient involvement in clinical decision-making.


Danny Wong and colleagues reveal measures for predicting mortality after surgery.

Author summary

Why was this study done?

  • Over 3 million postoperative deaths occur worldwide per year.

  • Some of these may be avoidable through risk-assessment–based modification of treatment pathways, such as postoperative critical care admission.

  • There are multiple methods for predicting which patients are at high risk of death or complications from surgery, but these are not widely used, with clinicians instead usually relying on their subjective clinical judgement alone.

  • Before this study, there was little information about whether clinical judgement was of better, worse, or equivalent accuracy to objective risk scores.

What did the researchers do and find?

  • We conducted a 1-week cohort study in 274 hospitals in the UK, Australia, and New Zealand, during which we collected data on risk and surgical outcome on every patient who had an operation requiring an overnight stay in hospital.

  • The clinical team (surgeons, anaesthetists) looking after the patient were asked to provide a subjective assessment of risk. We compared these assessments with the results of 3 freely available objective risk-assessment tools.

  • We included data from 22,631 patients in our analyses and found that subjective assessment was as accurate as the best of the objective risk tools (the Surgical Outcome Risk Tool or SORT) for predicting death in hospital within 30 days of surgery.

  • However, combining subjective and objective measurement using the SORT provided an even more accurate estimate.

What do these findings mean?

  • The new SORT–clinical judgement calculator can be used by clinicians to risk-stratify patients and so identify which individuals are most likely to benefit from limited resources such as access to postoperative critical care.

  • At the policy level, the tool can be used to plan surgical services, including the number of critical care beds required to serve a surgical population.

  • This study is limited by being conducted solely in high-income countries, limiting its global generalisability.

Introduction

The provision of safe surgery is an international healthcare priority [1]. Guidelines recommend that preoperative risk estimation should guide treatment decisions and facilitate shared decision-making [2,3]. Furthermore, there is an ethical imperative (and in the United Kingdom [UK], a legal requirement) to provide an individualised assessment of a patient’s risk of adverse outcomes [4]. Increasing evidence suggests that postoperative mortality in both high and low/middle-income settings is due less to what happens in the operating theatre and more to our ‘failure to rescue’ patients who develop postoperative complications [5,6]. These observations also point towards opportunity: once a patient has been identified as high risk, mitigation strategies such as pre-emptive admission to critical care or enhanced postoperative surveillance may prevent adverse outcomes [2]. However, critical care is a finite resource, with competition for beds between surgical and emergency medical admissions. To that end, the requirement for a postoperative critical care bed is itself a risk factor for last-minute cancellation, with consequent potential for disruption and harm for both patients and healthcare providers [7]. Thus, there is a need to accurately stratify patient risk so as to make the most of limited resources and improve perioperative outcomes. This is especially true given the scale of demand; more than 300 million operations take place annually worldwide [8]. With a major postoperative morbidity rate of around 15% [9,10], a short-term mortality rate between 1 and 3% [11], and a reproducible association between short-term morbidity and long-term survival [9,12,13], the impact of surgical complications on individual patients, healthcare resources, and society at large is clearly evident. Furthermore, if resources permitted, substantially larger numbers of patients would be considered for surgical intervention [1].

There are numerous methods available to help clinicians estimate perioperative risk, including frailty indices [14], functional capacity assessments such as cardiopulmonary exercise testing (CPET) [15], and dozens of risk prediction scores and models, many of which are open-source, are easily applied, and have been validated in multiple heterogeneous surgical cohorts [16]. Despite this myriad of choices, data from national Quality Improvement (QI) programmes indicate that clinicians do not routinely document an individualised risk assessment before surgery [10,17]. In part, this may relate to the availability of complex investigations and equipoise over which method is most accurate, particularly when the accuracy of objective methods compared with subjective assessment alone is disputed [15]. We therefore performed a prospective cohort study with the following objectives: to describe how clinicians assess risk in routine practice, to externally validate and compare the performance of 3 open-access risk models with subjective assessment, and to investigate whether objective risk tools add value to subjective assessment.

Methods

This is a planned analysis of the Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery (SNAP-2: EPICCS) study, a prospective observational cohort study conducted in 274 hospitals from the UK, Australia, and New Zealand [18]. We report our findings in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE; S1 Text) and the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD; S2 Text) statements [19,20]. National research networks, including trainee-led networks, were used to maximise recruitment from public hospitals in all countries. All adult (≥18 years) patients undergoing inpatient surgery and meeting our criteria (see ‘Data set’, below) during a 1-week period were included in our analyses for this paper. Patients were recruited between 21–27 March 2017 in the UK, 21–27 June 2017 in Australia, and 6–13 September 2017 in New Zealand.

Ethical and governance approvals

UK-wide ethical approval for the study was obtained from the Health Research Authority (South Central–Berkshire B REC, reference number: 16/SC/0349); additional permission to collect patient-identifiable data without consent was granted through Section 251 exemption from the Confidentiality Advisory Group for England and Wales (CAG reference: 16/CAG/0087), the NHS Scotland Public Benefit and Privacy Panel for Health and Social Care (PBPP reference: 1617–0126), and individual Health and Social Care Trust research and development departments for each site in Northern Ireland (Belfast, Northern, South Eastern, and Western Health and Social Care Trusts, IRAS reference number: 154486). In Australia, each state had different regulatory approval processes, and approvals were received from the following ethics committees: New South Wales—Hunter New England and Greater Western Human Research Ethics Committee; Queensland—Metro South Hospital and Health Service Human Research Ethics Committee; South Australia—Southern Adelaide Clinical Human Research Ethics Committee; Tasmania—Tasmania Health and Medical Human Research Ethics Committee; Victoria—Alfred Health, Eastern Health, Goulburn Valley, Mercy Health, Monash Health, Peter MacCallum Cancer Centre Research Ethics Committees; Western Australia—South Metropolitan Health Service Human Research Ethics Committee. In New Zealand, the study received national approval from the Health and Disability Ethics Committees (Ethics ref: 17/NTB/139).

Data set

All data (S3 Text) were collected prospectively. In this study, we defined objective risk assessment as the use of a risk calculation model, equation, or tool that supplies a prediction of risk on a probability scale. Before surgery, perioperative teams answered the following question for each patient: ‘What is the estimate of the perioperative team of the risk of death within 30 days?’, with 6 categorical response options (<1%, 1%–2.5%, 2.6%–5%, 5.1%–10%, 10.1%–50%, and >50%). These thresholds were decided by expert consensus within the study steering group and study authors. Teams were then asked to record how they arrived at this estimate (for example, clinical judgement and/or an objective risk tool). The patient data for this study were collected from a wide range of participating publicly funded hospitals in the UK (n = 245), Australia (n = 21), and New Zealand (n = 8). These were a heterogeneous mix of secondary (42%) and tertiary care (58%) institutions and likely reflective of the general composition of hospitals in these countries. We have previously described the hospitals and their available facilities for providing perioperative care [21].

Patients included in the study were adults (≥18 years) undergoing surgery or other interventions that required the presence of an anaesthetist and who were expected to require overnight stay in hospital. We included all procedures taking place in an operating theatre, radiology suite, endoscopy suite, or catheter laboratory for which inpatient (overnight) stay was planned, including both planned and emergency/urgent surgery of all types, endoscopy, and interventional radiology procedures.

Patients were excluded if they indicated they did not want to participate in the study. We also excluded ambulatory surgery, obstetric procedures (for example, caesarean sections and surgery for complications of childbirth), procedures on ASA-PS (American Society of Anesthesiologists Physical Status score) grade VI patients, noninterventional diagnostic imaging (for example, CT or MRI scanning without interventions), and emergency department or critical care interventions requiring anaesthesia or sedation but no interventional procedure.

Statistical analysis

The protocol for SNAP-2: EPICCS was previously published with aims, objectives, and research questions outlined [18]. Our primary outcome for the study described in this paper was inpatient 30-day mortality, recorded prospectively by local collaborators. We conducted 3 inferential analyses, the first using the entire patient data set and the second and third omitting the patients for whom an objective tool was used to predict perioperative risk (Fig 1). For the first analysis, we evaluated performance of the Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality (P-POSSUM), Surgical Risk Scale (SRS), and Surgical Outcome Risk Tool (SORT) [16,22–24]. The calibration and discrimination of all models were assessed in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) recommendations [20]. Calibration was assessed by graphical inspection of observed versus expected mortality and by the Hosmer–Lemeshow goodness-of-fit test [25]. Discrimination was assessed by calculating the Area Under Receiver Operating Characteristic curve (AUROC) [26]. AUROCs were compared using DeLong’s test for 2 correlated ROC curves [27]. ROC curves can be constructed for both continuous predictions (for example, P-POSSUM, SRS, and SORT) and ordinal categorical predictions (for example, ASA-PS or the 6-category subjective predictions that clinicians were asked to make): in the former, sensitivities and specificities are calculated for every value in the probability range of 0 to 1, and then each point is plotted to obtain a smooth curve; in the latter, sensitivities and specificities are computed for each category, and the points form a polygon on the ROC plot.
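The study's statistical code is available on request (see below), but a minimal R sketch of how these discrimination and calibration metrics can be computed with the pROC and ResourceSelection packages is shown here; the simulated data frame and its column names (dat, died30, sort_risk, ppossum_risk) are illustrative stand-ins, not the study's data or code.

```r
library(pROC)               # roc(), ci.auc(), roc.test()
library(ResourceSelection)  # hoslem.test()

set.seed(1)                                            # toy data for illustration
n   <- 5000
dat <- data.frame(sort_risk = rbeta(n, 1, 40))         # skewed, mostly low risks
dat$ppossum_risk <- pmin(dat$sort_risk * runif(n, 0.8, 2), 1)  # correlated prediction
dat$died30       <- rbinom(n, 1, dat$sort_risk)        # 30-day mortality (0/1)

roc_sort    <- roc(dat$died30, dat$sort_risk,    levels = c(0, 1), direction = "<")
roc_ppossum <- roc(dat$died30, dat$ppossum_risk, levels = c(0, 1), direction = "<")

ci.auc(roc_sort)                                 # AUROC with 95% CI
roc.test(roc_sort, roc_ppossum,
         method = "delong")                      # DeLong's test, 2 correlated ROCs
hoslem.test(dat$died30, dat$sort_risk, g = 10)   # Hosmer-Lemeshow goodness of fit
```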

Fig 1. Participant flowchart.


The second analysis compared the performance of subjective assessment (defined as using clinical judgement and/or ASA-PS) against the best-performing risk tool. For this, we included only patients for whom subjective assessment alone was used to predict the risk of 30-day mortality. Subjective assessment was then evaluated on calibration and discrimination. Point estimates of risk prediction were taken as the midpoint of the predicted risk intervals provided by clinicians (i.e., 0.5% for the interval <1%, 1.75% for the interval 1%–2.5%, and so on), and the proportion of observed mortality in each of these risk categories was calculated. Calibration was then assessed by plotting the observed mortality proportions against the midpoints of clinician-predicted risk intervals. We then compared the performance of subjective assessment against the best-performing risk model, using AUROC and the continuous Net Reclassification Improvement (NRI) statistic [25]. The NRI quantifies the proportion of individuals whose predictions improved in accuracy (positive reclassification) minus the proportion whose predictions worsened in accuracy (negative reclassification) when using one prediction model versus another [28]. An NRI >0 indicates an overall improvement, <0 an overall deterioration, and zero no difference in prediction accuracy.
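As a concrete illustration of this reclassification metric, the following base R sketch computes the continuous NRI and encodes the interval midpoints described above; the text specifies only the first two midpoints, so the remaining values (including 75% for the >50% category) are our assumptions, and the function is not the study's own code.

```r
## Continuous (category-free) NRI: y is the 0/1 outcome, p_old and p_new are
## predicted risks from the reference and comparator models, respectively.
continuous_nri <- function(y, p_old, p_new) {
  up   <- p_new > p_old
  down <- p_new < p_old
  nri_events    <- mean(up[y == 1])   - mean(down[y == 1])  # events moved up
  nri_nonevents <- mean(down[y == 0]) - mean(up[y == 0])    # non-events moved down
  nri_events + nri_nonevents
}

## Midpoints of the clinician risk categories (only the first two are stated
## in the text; the rest, including 75% for ">50%", are assumed here).
midpoints <- c("<1%" = 0.005, "1%-2.5%" = 0.0175, "2.6%-5%" = 0.038,
               "5.1%-10%" = 0.0755, "10.1%-50%" = 0.3005, ">50%" = 0.75)
```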

The third analysis evaluated the added value of combining subjective assessment with the best-performing risk tool by creating a logistic regression model with variables from both sources.

For this, we fitted a logistic regression model with 2 variables: the subjective assessment of risk and the mortality prediction from the best objective risk tool according to the following logit formula: ln(R/(1 − R)) = β0 + β1Xsubjective + β2Xobjective, where R is the probability of 30-day mortality; β0, β1, and β2 are the model coefficients; Xsubjective is the subjective clinical assessment (6 ordered categories, as above); and Xobjective is the risk of mortality as predicted using the most accurate risk model. An optimism-corrected performance estimate of the combined model was obtained using bootstrapped internal validation with 1,000 repetitions; this was then compared with subjective assessment and the most accurate risk model alone.
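A sketch of how such a model fit and bootstrap optimism correction might be carried out with the rms package, continuing the illustrative toy data frame from the sketch above (the simulated clinician categories and all variable names are assumptions, not the study's code):

```r
library(rms)  # lrm(), validate()

## Simulate the clinicians' 6-category judgement as a noisy version of the
## true risk; droplevels() removes any category empty in the toy data.
subj_pct     <- 100 * pmin(dat$sort_risk * runif(n, 0.5, 2), 1)
dat$subj_cat <- droplevels(cut(subj_pct,
                               breaks = c(-Inf, 1, 2.5, 5, 10, 50, Inf),
                               labels = c("<1%", "1%-2.5%", "2.6%-5%",
                                          "5.1%-10%", "10.1%-50%", ">50%")))

fit <- lrm(died30 ~ subj_cat + I(100 * sort_risk),  # SORT risk per 1%, as in Table 3
           data = dat, x = TRUE, y = TRUE)
val <- validate(fit, B = 1000)                 # 1,000 bootstrap resamples
0.5 * (val["Dxy", "index.corrected"] + 1)      # Dxy -> optimism-corrected AUROC
```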

We used decision-curve analysis (DCA) to describe and compare the clinical implications of using each risk model. In DCA, a model is considered to have clinical value if it has the highest net benefit across the whole range of thresholds for which a patient would be labelled as ‘high risk’. The net benefit is defined as the difference between the proportion of true positives (labelled as high risk and then going on to die within 30 days of surgery) and the proportion of false positives (labelled as high risk but not going on to die within 30 days) weighted by the odds of the selected threshold for the high-risk label. At any given threshold, the model with the higher net benefit is the preferred model [20,25,29].
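The net-benefit calculation underlying each decision curve is simple enough to sketch directly; the function below illustrates the definition above (again using the toy data), and is not the study's DCA code.

```r
## Net benefit at a single high-risk threshold pt, per the definition above.
net_benefit <- function(y, p, pt) {
  n  <- length(y)
  tp <- sum(p >= pt & y == 1) / n        # true-positive proportion
  fp <- sum(p >= pt & y == 0) / n        # false-positive proportion
  tp - fp * pt / (1 - pt)                # weight false positives by threshold odds
}

thresholds <- seq(0.01, 0.50, by = 0.01)
dca_sort   <- sapply(thresholds, net_benefit,
                     y = dat$died30, p = dat$sort_risk)  # one decision curve
```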

Missing data

The P-POSSUM requires biochemical and haematological data for calculation; however, fit patients may not have preoperative blood tests [30], and in other cases, there may be no time for blood analysis before surgery. Therefore, in cases for which these data were missing, normal physiological ranges were imputed because this most closely follows what clinicians might reasonably do in practice when tests are not indicated or not feasible or results are missing. Following imputation, we performed a complete case analysis because we considered the proportion of cases with missing data in the remaining variables to be low (1.08%) [31].
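A sketch of this imputation step is shown below; the mid-normal constants and variable names are illustrative assumptions rather than the study's exact imputation values.

```r
## Impute mid-normal values where P-POSSUM blood results are missing.
## These constants are illustrative normal-range values, not the study's.
normals <- c(sodium = 140, potassium = 4.0, urea = 5.0,
             haemoglobin = 140, white_cell_count = 8.0)
for (v in names(normals)) {
  if (v %in% names(dat)) dat[[v]][is.na(dat[[v]])] <- normals[[v]]
}
```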

Sensitivity analyses

We conducted a number of sensitivity analyses to examine the potential effects of differences in population characteristics on our main study findings. First, we repeated our analyses in a full cohort of patients, including those undergoing obstetric procedures. Second, we repeated the analysis in a subgroup of high-risk patients, defined according to previously published criteria based on age, type of surgery, and comorbidities [15,32]. Third, we evaluated the impact on the accuracy of subjective assessment of using objective tools by comparing discrimination and calibration of subjective assessment in the subgroup of patients whose risk estimates were not solely informed by clinical judgement. Fourth, we repeated our analyses separately in the UK and Australian/New Zealand cohorts to investigate the potential for geographical influences on our findings. Fifth, we examined the potential impact of normal value imputation on missing P-POSSUM values by repeating the analysis on only cases in which no missing P-POSSUM variables were present. Finally, we conducted analyses on surgical specialty subgroups to evaluate the accuracy of the new model created on different subcohorts.

Analyses were performed using R version 3.5.2; p < 0.05 was considered statistically significant. Statistical code is available on request.

Results

Patient data were collected on 26,502 surgical episodes in 274 hospitals across the UK, Australia, and New Zealand (Table 1). A total of 3,871 cases were excluded from all analyses: 3,660 obstetric cases in which there were no deaths, plus a further 211 cases with missing values. This left 22,631 cases with adequate data for external validation of the P-POSSUM, SRS, and SORT models, the first part of our analyses (Fig 1). For the second and third analyses, in which we compared subjective assessment against the best-performing objective risk tool and combined these measures to create a new model for internal validation, we excluded 4,786 cases in which clinician prediction was aided by the use of other risk tools. This left 17,845 cases for these analyses. There were 317 inpatient deaths within 30 days of surgery (1.40%). In most cases, subjective assessment alone was used to estimate risk (n = 17,845, 78.9%; Table 2). No patients were lost to follow-up.

Table 1. Patient demographics stratified by 30-day mortality.

30-Day Mortality
Overall Survived Died
N 22,631 22,314 317
Male sex (%) 10,671 (47.2) 10,481 (47.0) 190 (59.9)
Female sex (%) 11,960 (52.8) 11,833 (53.0) 127 (40.1)
Age (median [IQR]) 62 [46–73] 62 [45–73] 76 [64–83]
Operative urgency (%)
 Elective 12,061 (53.3) 12,029 (53.9) 32 (10.1)
 Expedited 3,311 (14.6) 3,270 (14.7) 41 (12.9)
 Urgent 6,617 (29.2) 6,460 (29.0) 157 (49.5)
 Immediate 642 (2.8) 555 (2.5) 87 (27.4)
ASA-PS class (%)
 I 4,462 (19.7) 4,458 (20.0) 4 (1.3)
 II 10,192 (45.0) 10,168 (45.6) 24 (7.6)
 III 6,574 (29.0) 6,454 (28.9) 120 (37.9)
 IV 1,337 (5.9) 1,206 (5.4) 131 (41.3)
 V 66 (0.3) 28 (0.1) 38 (12.0)
Procedure severity (%)*
 Minor 1,951 (8.6) 1,919 (8.6) 32 (10.1)
 Intermediate 4,523 (20.0) 4,476 (20.1) 47 (14.8)
 Major 7,478 (33.0) 7,369 (33.0) 109 (34.4)
 Xmajor 5,281 (23.3) 5,218 (23.4) 63 (19.9)
 Complex 3,398 (15.0) 3,332 (14.9) 66 (20.8)
Surgical specialty (%)
 Gastrointestinal surgery 4,472 (19.8) 4,384 (19.6) 88 (27.8)
 Gynaecology/urology 4,309 (19.0) 4,297 (19.3) 12 (3.8)
 Neuro/spinal surgery 1,208 (5.3) 1,181 (5.3) 27 (8.5)
 Orthopaedics 6,772 (29.9) 6,688 (30.0) 84 (26.5)
 Thoracic/cardiac surgery 1,033 (4.6) 1,015 (4.5) 18 (5.7)
 Vascular 674 (3.0) 645 (2.9) 29 (9.1)
 Other 4,163 (18.4) 4,104 (18.4) 59 (18.6)
Past medical history: coronary artery disease (%) 3,029 (13.4) 2,923 (13.1) 106 (33.4)
Past medical history: congestive cardiac failure (%) 893 (3.9) 839 (3.8) 54 (17.0)
Past medical history: metastatic cancer (active) (%) 825 (3.6) 799 (3.6) 26 (8.2)
Past medical history: dementia (%) 676 (3.0) 644 (2.9) 32 (10.1)
Past medical history: COPD (%) 1,955 (8.6) 1,909 (8.6) 46 (14.5)
Past medical history: pulmonary fibrosis (%) 180 (0.8) 173 (0.8) 7 (2.2)
Past medical history: liver cirrhosis (%) 224 (1.0) 206 (0.9) 18 (5.7)
Past medical history: renal disease (%) 381 (1.7) 362 (1.6) 19 (6.0)
Past medical history: diabetes (%)
 Type 1 274 (1.2) 265 (1.2) 9 (2.8)
 Type 2 (dietary-controlled) 614 (2.7) 598 (2.7) 16 (5.0)
 Type 2 (insulin-controlled) 761 (3.4) 743 (3.3) 18 (5.7)
 Type 2 (oral hypoglycaemic medication) 1,570 (6.9) 1,522 (6.8) 48 (15.1)
 No diabetes 19,399 (85.8) 19,173 (86.0) 226 (71.3)
Postoperative length of stay in days (median [IQR]) 3 [1–6] 3 [1–6] 7 [2–13]
SORT-calculated mortality risk % (median [IQR]) 0.4 [0.2–1.6] 0.4 [0.2–1.6] 8.9 [4.2–20.6]
P-POSSUM-calculated mortality risk % (median [IQR]) 1.1 [0.6–2.9] 1.1 [0.6–2.9] 18.1 [5.7–41.6]
SRS-calculated mortality risk % (median [IQR]) 1.9 [0.8–4.4] 1.9 [0.8–4.4] 19.6 [4.4–36.1]
Subjective clinical assessment made on clinical judgement and/or ASA-PS grading alone (%) 17,845 (78.9) 17,657 (79.1) 188 (59.3)

*Procedure severity classification (minor, intermediate, major, Xmajor, and complex: ordinal scale).

Abbreviations: ASA-PS, American Society of Anesthesiologists Physical Status; COPD, Chronic Obstructive Pulmonary Disease; IQR, interquartile range; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

Table 2. Methods used by clinicians to estimate 30-day mortality.

Clinicians could select one or more categories; therefore, the total percentages (in parentheses) exceed 100%.

Overall
n 22,631
Clinical judgement (%) 20,064 (88.7)
ASA-PS score (%) 8,622 (38.1)
Duke Activity Status Index or other activity index (%) 515 (2.3)
Six-minute walk test or incremental shuttle walk test (%) 48 (0.2)
Cardiopulmonary exercise testing (%) 215 (1.0)
Formal frailty assessment (for example, Edmonton Frail Scale) (%) 48 (0.2)
SRS (%) 315 (1.4)
SORT (%) 750 (3.3)
EuroSCORE (%) 442 (2.0)
POSSUM (%) 287 (1.3)
P-POSSUM (%) 1,397 (6.2)
Surgery-specific POSSUM (for example, Vasc-POSSUM) (%) 192 (0.8)
Other risk scoring system (%) 651 (2.9)

Abbreviations: ASA-PS, American Society of Anesthesiologists Physical Status; EuroSCORE, European System for Cardiac Operative Risk Evaluation; POSSUM, Physiology and Operative Severity Score for the enUmeration of Mortality; P-POSSUM, Portsmouth-POSSUM; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

External validation of existing risk prediction models

The SORT was the best calibrated of the pre-existing models; however, all overpredicted risk (Fig 2A–2C; Hosmer–Lemeshow p-values all <0.001 for the SORT, P-POSSUM, and SRS). All models exhibited good-to-excellent discrimination (Fig 2D; AUROC SORT = 0.90, 95% confidence interval [CI]: 0.88–0.92; P-POSSUM = 0.89, 95% CI: 0.88–0.91; SRS = 0.85, 95% CI: 0.82–0.87). The AUROC for the SORT was significantly better than SRS (p < 0.001), but not P-POSSUM (p = 0.298).

Fig 2. Calibration plots for the SORT (A), P-POSSUM (B), SRS (C), and ROC curves for the 3 models (D).


In the calibration plots (A–C), nonparametric smoothed best-fit curves (blue) are shown along with the point estimates for predicted versus observed mortality (black dots) and their 95% CIs (black lines) within each decile of predicted mortality. External validation of all 3 models was performed on the entire patient data set (n = 22,631). ASA-PS, American Society of Anesthesiologists Physical Status; CI, confidence interval; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; ROC, Receiver Operating Characteristic; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

Subjective assessment

There were 188 deaths (1.05%) within 30 days of surgery in the subset of 17,845 patients who had mortality estimates based on clinical judgement and/or ASA-PS alone. Subjective assessment overpredicted risk (Fig 3A, Hosmer–Lemeshow test p < 0.001) but demonstrated good discrimination (Fig 3B and Table 3, AUROC = 0.89, 95% CI: 0.86–0.91), which was not significantly different from the SORT (p = 0.309). Continuous NRI analysis did not show improvement in classification when using the SORT compared with subjective assessment (Table 3 and S4 Text). The 30-day mortality outcomes at each level of clinician risk prediction were cross-tabulated, showing that clinician predictions correlated well with actual mortality outcomes (S2 Table).

Fig 3. Calibration plots and ROC curves for subjective clinical assessments (A, B) and the logistic regression model combining clinician and SORT predictions (C, D), validated on the subset of patients in whom clinicians estimated risk based on clinical judgement alone (n = 17,845).


For (A), a nonparametric smoothed best-fit curve (blue) is shown along with the point estimates for predicted versus observed mortality (black dots) and their 95% CIs (black lines) within each range of clinician-predicted mortality. For (C), the apparent (blue) and optimism-corrected (red) nonparametric smoothed calibration curves are shown; the latter was generated from 1,000 bootstrapped resamples of the data set. CI, confidence interval; ROC, Receiver Operating Characteristic; SORT, Surgical Outcome Risk Tool.

Table 3. Coefficients of the logistic regression model combining subjective clinical assessment with SORT-predicted risk; p-values in a logistic regression model test the null hypothesis that the estimated coefficient is equal to zero using a z-test.

Coefficient Standard Error Z-Statistic p-Value
Intercept −6.403 0.2135 −30 <0.001
SORT-predicted risk (per 1% risk) 0.04028 0.007049 5.714 <0.001
Clinical assessment of risk
Clinical assessment of risk < 1% Reference
Clinical assessment of risk = 1%–2.5% 1.487 0.2962 5.021 <0.001
Clinical assessment of risk = 2.6%–5% 2.365 0.3177 7.444 <0.001
Clinical assessment of risk = 5.1%–10% 3.074 0.2976 10.33 <0.001
Clinical assessment of risk = 10.1%–50% 4.156 0.2852 14.57 <0.001
Clinical assessment of risk > 50% 5.028 0.3186 15.78 <0.001

Abbreviations: SORT, Surgical Outcome Risk Tool.
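As a worked example, the published coefficients in Table 3 can be substituted into the logit formula from the Methods; the small helper function below is hypothetical (the authors' implementation is the online SORT calculator), but the arithmetic follows directly from the table.

```r
## Apply the Table 3 coefficients to the logit formula from the Methods.
predict_combined <- function(sort_risk_pct, clin_cat_coef) {
  lp <- -6.403 + 0.04028 * sort_risk_pct + clin_cat_coef  # linear predictor
  plogis(lp)                                              # inverse logit
}

## A patient with a SORT-predicted risk of 5% whose team judged the risk to be
## in the 2.6%-5% category (coefficient 2.365); the reference category <1% is 0:
predict_combined(5, 2.365)   # ~0.021, i.e., a combined estimate of about 2.1%
```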

Combining subjective and objective risk assessment

Bootstrapped internal validation yielded an optimism-corrected AUROC of 0.92 for a combined model using both subjective assessment and SORT predictions as independent variables (Table 3); this was better than subjective assessment alone (p < 0.001) and SORT alone (p = 0.021) (Table 4). The model also significantly (p < 0.001) improved reclassification compared with subjective assessment alone in continuous NRI analysis (S4 Text). The improved NRI was largely attributable to the correct downgrading of patient risks; that is, a large proportion of patients were correctly reclassified as lower risk using the combined model compared with subjective assessment. The DCA also favoured SORT over the other previously published models, but the combined clinician judgement–SORT model again performed best (Fig 4). The effect of combining information from subjective assessment and the SORT is further demonstrated by computing the conditional probabilities of 30-day mortality using the combined model over a full range of predictor values (Fig 5). When assessing the decision curves across all risk thresholds, the combined model outperformed P-POSSUM and SRS, and beyond approximately the 10% risk threshold, P-POSSUM and SRS demonstrated negative net benefit. The decision curve for our combined model incorporating both subjective assessment and SORT showed increased net benefit across almost the entire range of risk thresholds versus SORT alone.

Table 4. Performance metrics for clinician prediction versus SORT and versus a logistic regression model combining clinician and SORT prediction.

Calculations based on the subset of patients in whom clinician judgement alone was used to estimate risk (n = 17,845). The reported AUROC for the combined model is the optimism-corrected value from bootstrapped internal validation.

ROC Continuous NRI
Model AUROC 95% CI p-Value1 NRI 95% CI p-Value2
Clinical 0.886 0.858–0.914 Reference Reference
SORT 0.900 0.877–0.923 0.309 0.073 −0.062 to 0.208 0.288
Combined 0.920 0.899–0.940 <0.001 0.130 0.057–0.202 <0.001

1Differences between AUROCs are tested using DeLong’s test for 2 correlated ROC curves with a null hypothesis of no difference.

2Differences between continuous NRI statistics are tested using a z-test with a null hypothesis of no difference.

Abbreviations: AUROC, Area Under the Receiver Operating Characteristic curve; CI, confidence interval; NRI, Net Reclassification Improvement.

Fig 4. DCA.


DCA, decision-curve analysis; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

Fig 5. Predicted risks from combined model, stratified by clinical assessments.


We model the changes to risk predictions (y-axis) based on subjective clinical assessments (coloured lines) as SORT-predicted risks (x-axis) change, to illustrate the change in risk predictions when information from both is combined. SORT, Surgical Outcome Risk Tool.

Sensitivity analyses

A summary of the different sensitivity analyses is provided in S5 Text. In the first sensitivity analysis (S6 Text), we repeated the main study analyses using the full cohort of patients available from SNAP-2: EPICCS, including those undergoing obstetric procedures, and found minimal differences from our main study findings. The SORT was again the best calibrated of the pre-existing models in this larger cohort, and all objective risk tools again overpredicted risk (S1 Fig; Hosmer–Lemeshow p-values all <0.001 for the SORT, P-POSSUM, and SRS). The estimates for AUROC were minimally affected (S1 Fig; AUROC SORT = 0.91, 95% CI: 0.90–0.93; P-POSSUM = 0.90, 95% CI: 0.88–0.92; SRS = 0.85, 95% CI: 0.83–0.88). The AUROC for the SORT was still significantly better than SRS (p < 0.001), but not P-POSSUM (p = 0.121). Subjective assessment in this first sensitivity analysis demonstrated similar overprediction of risk (S2 Fig, Hosmer–Lemeshow test p < 0.001) and similar discrimination (S2 Fig, AUROC = 0.89, 95% CI: 0.87–0.92) to the main study analysis. Discrimination of subjective assessment and the SORT again did not differ significantly (p = 0.216). Continuous NRI analysis again did not show improvement in classification when using the SORT compared with subjective assessment in this larger group of patients.

For the second sensitivity analysis (S7 Text), we used previously defined, more restrictive inclusion criteria to identify high-risk patients [15,32]. This yielded a subgroup of 12,985 patients in whom the 30-day mortality rate was 2.01%. In this subgroup, calibrations of P-POSSUM, SRS, and SORT predictions were similar to the full cohort (S3 Fig). The AUROCs were lower in this subgroup (SORT = 0.88, 95% CI: 0.86–0.90; P-POSSUM = 0.86, 95% CI: 0.84–0.89; SRS = 0.81, 95% CI: 0.78–0.84, S3 Fig). The calibration of subjective assessment was again similar to that of the full cohort, and discrimination was reduced but still good (AUROC = 0.85, 95% CI: 0.82–0.89, S3 Fig). The discrimination of subjective assessment in this subgroup was not significantly different from that in the full cohort (p = 0.155).

The third sensitivity analysis (S8 Text) used the subgroup whose mortality estimate was based on clinical judgement in conjunction with any objective risk tool (n = 4,786, S4 Fig). The AUROC for subjective assessment in this subgroup was 0.88, which was not significantly different from the AUROC in the main cohort (p = 0.769). The calibration of subjective assessment in this subgroup was similar to that in the main cohort, again with a tendency to overpredict risk.

In the fourth sensitivity analysis (S9 Text), we looked for differences in performance of subjective clinical assessment and objective risk tools between the UK and the Australia/New Zealand cohorts (S5 Fig and S3 Table). The 30-day mortality in the Australia/New Zealand cohort (1.09%) was comparable to that of the UK (1.45%, p = 0.127). Visual inspection of calibration plots showed the SORT to be worse calibrated in Australasia than in the UK. AUROCs for the objective tools in the Australasian subset (P-POSSUM = 0.90, SRS = 0.81, SORT = 0.87) were not significantly different from the AUROCs in the UK subset (P-POSSUM = 0.89, SRS = 0.85, and SORT = 0.90, p > 0.05 for all). The calibration of subjective clinical assessment was comparable in the 2 geographical subgroups, and there were also no significant differences in AUROCs (Australasia: 0.88, UK: 0.89, p = 0.860, S6 Fig).

For the fifth sensitivity analysis (S10 Text), we used the subgroup of patients who had no missing P-POSSUM variables (n = 18,362; see S1 Table for patient characteristics). Patients with complete P-POSSUM variables appeared to be older, have higher ASA-PS grades, and undergo higher-severity surgery in comparison with those with missing P-POSSUM variables. The AUROC for clinical assessments in the subgroup with full P-POSSUM variables was 0.90, which was not significantly different from the AUROC obtained for clinical assessments in the main study analysis (p = 0.587), and the predictions were similarly calibrated to clinical assessments in the main study analysis, again with a tendency to overpredict risk (S7 Fig). When comparing the performance of P-POSSUM (AUROC = 0.89), SRS (AUROC = 0.84), and SORT (AUROC = 0.90) in this subgroup, the performance was again similar to that of the main study cohort (p > 0.05 for all comparisons).

In the sixth and final sensitivity analysis (S11 Text), we evaluated the AUROC and calibration of the SORT–clinical judgement model in subgroups of patients according to surgical specialty (S4 Table). We found that the AUROC remained high within these subgroups (ranging from 0.87, 95% CI 0.75–0.98 in 1,033 cardiothoracic surgical patients through to 0.95, 95% CI 0.90–0.99 in 4,309 gynaecology and urology patients). Calibration was also good across different specialties, with the exception of vascular surgery (674 patients, AUROC 0.88, 95% CI 0.82–0.94; Hosmer–Lemeshow p-value = 0.009).

Discussion

We present data from an international cohort of patients undergoing inpatient surgery with a low risk of recruitment bias. Despite a plethora of options for objective risk assessment, in nearly 80% of patients, subjective assessment alone was used to predict 30-day mortality risk. All previously published risk models were poorly calibrated for this cohort of patients, reflecting the common problem of calibration drift over time. However, the combination of subjective clinical assessment with the parsimonious SORT model provides an accurate prediction of 30-day mortality, which is significantly better than any of the individual methods we evaluated. These findings should give confidence to clinicians that the combined SORT–clinical judgement model can be used to support the appropriate allocation of finite resources and to inform discussions with patients about the risks of surgery. The combined model accurately downgraded predicted risk compared with other methods; therefore, application of this approach may result in fewer low-risk patients inappropriately admitted to critical care (thus easing system pressures) and may result in fewer patients having their surgery cancelled for the lack of a critical care bed [7]. Finally, application of the SORT–clinical judgement model may assist hospital managers and policy makers in determining the likely demand for postoperative critical care, thus supporting best practice at the hospital, regional, or national level. This new model will now be incorporated into an open-access risk-assessment system (http://www.sortsurgery.com/), enabling clinicians to combine their clinical estimation of risk and the SORT model to evaluate patient risk from major surgery.

To our knowledge, this is the first study comparing subjective and objective assessment for predicting perioperative mortality risk in a large multicentre international cohort. The highest-quality previous studies in this field have been challenged by recruitment bias because of the predominant participation of research-active centres and the need for patient consent. For example, the METS (Measurement of Exercise Tolerance before Surgery) study [15], which compared clinical assessment of functional capacity with exercise testing, self-assessment, and a serum biomarker in 1,401 patients, and the VISION (Vascular Events in Non-cardiac Surgery patients cohort) study [32], which evaluated postoperative biomarkers in 15,133 patients, had screening-to-recruitment rates of 27% and 68%, respectively. One way of overcoming such biases would be to study the accuracy of prognostic models using routinely collected or administrative data; however, this is unlikely to enable the evaluation of subjective assessments in multiple centres. Our study avoided these issues through prospective data collection in an unselected cohort with an ethical waiver for patient consent. The mortality in our sample closely matches that recorded in UK administrative data of patients undergoing major or complex surgery [11], therefore supporting our assertion that our cohort was representative of the ‘real-world’ perioperative population.

Our observation that the majority of risk assessments conducted for perioperative patients do not involve objective measures is also noteworthy because subjective assessment is currently almost never incorporated into risk prediction tools for surgery. One exception is the American College of Surgeons National Surgical Quality Improvement Program Surgical Risk Calculator [33], which incorporates a 3-point scale of clinically assessed surgical risk (normal, high, or very high) to supplement a calculated prediction of mortality and various short-term outcomes. However, this system is proprietary, has rarely been evaluated outside the US, and is substantially more complex than the SORT–clinical judgement model, with 21 input variables compared with 8. Furthermore, their methodology for developing this ‘uplift’ was quite different from ours, using a panel of 80 surgeons to evaluate 10 case scenarios and grade them in retrospect.

We recognise some limitations to our study. First, models predicting rare events may appear optimistically accurate, as a model that identifies every patient as being at low risk of mortality in a group in which the probability of death approaches 0% would almost always appear to be correct. For this reason, we undertook several sensitivity analyses, including one that evaluated the performance of the various risk-assessment methods in a subgroup of patients who have been defined as high risk in previous studies of prognostic indicators and in whom the mortality rate was higher. We found that the performance of the SORT and subjective assessment remained good and compared favourably with previous evaluations of more complex risk-assessment methods [15,32]. Second, whilst we assumed that subjective assessments were truly clinically based judgements, because this was a pragmatic unblinded study, it was possible that information from other sources may have subconsciously influenced these assessments. For this reason, we undertook the third sensitivity analysis, which did not support this concern. Third, the very act of estimating mortality risk may lead clinicians to take actions that improve that risk, therefore biasing the outcome of the assessments made and in particular affecting the calibration of subjective risk estimates. The only way to avoid this risk would be to have used subjective assessments made by clinicians independent of the clinical management of individual patients, and this may be an interesting opportunity for future research. Fourth, since we undertook this study, other promising risk-assessment methods have been developed, including the Combined Assessment of Risk Encountered in Surgery (CARES) system, which was developed using electronic health records; unfortunately, we were unable to externally validate this system because we did not collect all the required variables [34]. We also did not evaluate the accuracy of other risk prediction methods such as frailty assessment or cardiopulmonary exercise testing. However, this was not an a priori objective of our study [18]; furthermore, our observation of the lack of ‘real-world’ use of these types of predictors is in itself an important finding, particularly given the substantial interest in such measures (some of which carry considerable cost) in the research literature [15,35]. Fifth, the UK cohort was substantially larger than the Australasian cohort; however, we found no significant differences in mortality or accuracy of the various risk-assessment methods between the 2 geographical groups. Finally, the study was conducted entirely in high-income countries; therefore, our findings should now be tested in low- and middle-income nations in order to evaluate global generalisability.

Our finding that the combination of subjective and SORT-based assessment is the best approach is important because it is likely to have face validity with clinicians, thereby improving the likelihood that our new model will be incorporated into clinical practice. There is a sound rationale for this finding, as it is likely that clinicians consider otherwise unmeasured factors that they recognise as important, such as severity of comorbid diseases, frailty, socioeconomic status, patient motivation, and anticipated technical challenges. Modern approaches to risk assessment using machine learning [36] provide promise for automation of risk prediction and incorporating data and calculations that clinicians may subconsciously consider when making subjective decisions; however, even these methods do not substantially outperform our simpler approach and are currently limited by recruitment biases and lack of availability. Future research could evaluate the benefits of incorporating clinical judgement into risk-assessment methods in medicine more generally.

Implementation of a widely available, parsimonious, and free-to-use risk-assessment tool to guide clinical decision-making about critical care allocations and other aspects of perioperative care may now be considered particularly important in view of the likely prevalence of endemic COVID-19 leading to an increased demand for critical care facilities. Therefore, now more than ever, risk-based allocation of these resources is important for the benefit of individual patients and the hospitalised population as a whole. Further to this, application of either the SORT or the SORT–clinical judgement model to perioperative population data may assist healthcare policy makers and managers in modelling the likely demand for postoperative critical care, thus improving system level planning and resource utilisation. Based on the results of this large generalisable cohort study, the focus of the perioperative academic community could now shift from evaluation of which risk prediction method might be best to testing the impact of SORT–clinical judgement-based decision-making on perioperative outcomes.

In conclusion, the combination of subjective and objective risk assessment using the SORT calculator provides a more accurate estimate of 30-day postoperative mortality than subjective assessment alone. Implementation of the SORT–clinical judgement model should lead to better clinical decision-making and improved allocation of resources such as critical care beds to patients who are most likely to benefit.

Supporting information

S1 Text. STROBE checklist.

STROBE, Strengthening the Reporting of Observational Studies in Epidemiology.

(PDF)

S2 Text. TRIPOD checklist.

TRIPOD, Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis.

(DOCX)

S3 Text. Case report form.

(DOCX)

S4 Text. Continuous net reclassification index analysis and reclassification tables.

(DOCX)

S5 Text. Sensitivity analyses overview.

(DOCX)

S6 Text. Sensitivity analysis 1.

(DOCX)

S7 Text. Sensitivity analysis 2.

(DOCX)

S8 Text. Sensitivity analysis 3.

(DOCX)

S9 Text. Sensitivity analysis 4.

(DOCX)

S10 Text. Sensitivity analysis 5.

(DOCX)

S11 Text. Sensitivity analysis 6.

(DOCX)

S12 Text. Acknowledgments and full list of SNAP2: EPICCS collaborators.

SNAP2: EPICCS, Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery.

(DOCX)

S1 Table. Characteristics of the patient subgroups used in all sensitivity analyses.

ASA-PS, American Society of Anesthesiologists Physical Status; COPD, Chronic Obstructive Pulmonary Disease; IQR, interquartile range; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

(DOCX)

S2 Table. Confusion matrix of patients’ 30-day mortality outcomes versus clinician predictions.

(%) represents row percentage.

(DOCX)

S3 Table. AUROCs of the objective risk tools and subjective assessment, compared between the UK and Australian/New Zealand data subsets.

We found no significant difference in discrimination using any of the risk prediction tools or using subjective assessment when comparing their performance in the UK and Australian/New Zealand data sets. AUROC, Area Under Receiver Operating Characteristic curve; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

(DOCX)

S4 Table. Discrimination and calibration performance of the new combined prediction model in different specialty subgroups.

(DOCX)

S1 Fig. Calibration plots for the SORT (A), P-POSSUM (B), SRS (C), and ROC curves for the 3 models (D) validated in the whole patient cohort, including those undergoing obstetric procedures.

In the calibration plots (A–C), nonparametric smoothed best-fit curves (blue) are shown along with the point estimates for predicted versus observed mortality (black dots) and their 95% CIs (black lines) within each decile of predicted mortality. External validation of all 3 models was performed on the entire SNAP-2: EPICCS patient data set (n = 25,854). CI, confidence interval; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; ROC, Receiver Operating Characteristic; SNAP-2: EPICCS, Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

(PDF)

S2 Fig. Calibration plots and ROC curves for subjective clinical assessments (A, B) and the logistic regression model combining clinician and SORT predictions (C, D), validated on the subset of patients in whom clinicians estimated risk based on clinical judgement alone, drawn from the full SNAP-2: EPICCS data set, including patients who underwent obstetric surgery (n = 21,325).

For (A), a nonparametric smoothed best-fit curve (blue) is shown along with the point estimates for predicted versus observed mortality (black dots) and their 95% CIs (black lines) within each range of clinician-predicted mortality. For (C), the apparent (blue) and optimism-corrected (red) nonparametric smoothed calibration curves are shown; the latter was generated from 1,000 bootstrapped resamples of the data set. CI, confidence interval; ROC, Receiver Operating Characteristic; SNAP-2: EPICCS, Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery; SORT, Surgical Outcome Risk Tool.

(PDF)

S3 Fig. Calibration plots for SORT (A), P-POSSUM (B), SRS (C), and clinical assessments (E) and ROC curves for the 3 models (D) and clinical assessments (F), validated in the sensitivity analysis patient subset with restricted inclusion criteria (n = 12,985).

The AUROCs for P-POSSUM, SRS, SORT, and clinical assessments were 0.863, 0.810, 0.875, and 0.853 in this subgroup, respectively. AUROC, Area Under Receiver Operating Characteristic curve; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

(PDF)

S4 Fig. Calibration plot (A) and ROC curve (B) for clinical assessments, validated in the sensitivity analysis patient subgroup in which clinical assessments were made in conjunction with 1 or more other risk prediction tools (n = 4,786).

The AUROC for clinical assessments was 0.880 in this subgroup. AUROC, Area Under Receiver Operating Characteristic curve.

(PDF)

S5 Fig. Calibration plots (A to F) and ROC curves (G & H) for objective risk tools, validated in patients stratified by their country groups.

There was minimal difference between countries. ROC, Receiver Operating Characteristic.

(PDF)

S6 Fig. Calibration plots (A & B) and ROC curves (C & D) for clinical assessments, validated in patients stratified by their country groups.

There was minimal difference between countries. ROC, Receiver Operating Characteristic.

(PDF)

S7 Fig. Calibration plots for SORT (A), P-POSSUM (B), SRS (C), and clinical assessments (E) and ROC curves for the 3 models (D) and clinical assessments (F), validated in the sensitivity analysis patient subset with complete P-POSSUM variables (n = 18,362).

The AUROCs for P-POSSUM, SRS, SORT, and clinical assessments were 0.893, 0.838, 0.899, and 0.896 in this subgroup, respectively. AUROC, Area Under Receiver Operating Characteristic curve; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

(PDF)

S1 Data. Raw data.

(ZIP)

Abbreviations

ASA-PS

American Society of Anesthesiologists Physical Status

AUROC

Area Under Receiver Operating Characteristic curve

CARES

Combined Assessment of Risk Encountered in Surgery

CI

confidence interval

COPD

Chronic Obstructive Pulmonary Disease

CPET

cardiopulmonary exercise testing

DCA

decision-curve analysis

IQR

interquartile range

METS

Measurement of Exercise Tolerance before Surgery

NRI

Net Reclassification Improvement

P-POSSUM

Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality

QI

Quality Improvement

SNAP-2: EPICCS

Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery

SORT

Surgical Outcome Risk Tool

SRS

Surgical Risk Scale

STROBE

Strengthening the Reporting of Observational Studies in Epidemiology

TRIPOD

Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis

VISION

Vascular Events in Non-cardiac Surgery patients cohort study

Data Availability

All relevant data are within the manuscript and its supporting information files.

Funding Statement

National Institute for Academic Anaesthesia (NIAA) Association of Anaesthetists project grant awarded to SRM: www.niaa.org.uk. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Meara JG, Leather AJ, Hagander L, Alkire BC, Alonso N, Ameh EA, et al. Global Surgery 2030: evidence and solutions for achieving health, welfare, and economic development. Lancet. 2015;386:569–624. doi: 10.1016/S0140-6736(15)60160-X
  • 2.Lees N, Peden CJ, Dhesi J, Quiney N, Lockwood S, Symons NR, et al. The High-Risk General Surgical Patient: Raising the Standard. Updated recommendations on the Perioperative Care of the High-Risk General Surgical Patient [Internet]. 2018 [cited 2020 Jun 22]. https://www.rcseng.ac.uk/-/media/files/rcs/news-and-events/media-centre/2018-press-releases-documents/rcs-report-the-highrisk-general-surgical-patient—raising-the-standard—december-2018.pdf
  • 3.Levine GN, O’Gara PT, Beckman JA, Al-Khatib SM, Birtcher KK, Cigarroa JE, et al. Recent Innovations, Modifications, and Evolution of ACC/AHA Clinical Practice Guidelines: An Update for Our Constituencies: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Circulation. 2019;139:e879–e886. doi: 10.1161/CIR.0000000000000651
  • 4.Mchale JV. Innovation, informed consent, health research and the Supreme Court: Montgomery v Lanarkshire—a brave new world. Health Econ Policy Law. 2017;12:435–452. doi: 10.1017/S174413311700010X
  • 5.Biccard BM, Madiba TE, Kluyts HL, Munlemvo DM, Madzimbamuto FD, Basenero A, et al. Perioperative patient outcomes in the African Surgical Outcomes Study: a 7-day prospective observational cohort study. Lancet. 2018;391:1589–1598. doi: 10.1016/S0140-6736(18)30001-1
  • 6.Ghaferi AA, Birkmeyer JD, Dimick JB. Complications, failure to rescue, and mortality with major inpatient surgery in Medicare patients. Ann Surg. 2009;250:1029–1034. doi: 10.1097/sla.0b013e3181bef697
  • 7.Wong DJN, Harris SK, Moonesinghe SR, SNAP-2: EPICCS Collaborators. Cancelled operations: a 7-day cohort study of planned adult inpatient surgery in 245 UK National Health Service hospitals. Br J Anaesth. 2018;121:730–738. doi: 10.1016/j.bja.2018.07.002
  • 8.Weiser TG, Haynes AB, Molina G, Lipsitz SR, Esquivel MM, Uribe-Leitz T, et al. Estimate of the global volume of surgery in 2012: an assessment supporting improved health outcomes. Lancet. 2015;385(Suppl 2):S11.
  • 9.Moonesinghe SR, Harris S, Mythen MG, Rowan KM, Haddad FS, Emberton M, et al. Survival after postoperative morbidity: a longitudinal observational cohort study. Br J Anaesth. 2014;113:977–984. doi: 10.1093/bja/aeu224
  • 10.PQIP Project Team. 1st annual report of the Perioperative Quality Improvement Programme. 2018 [cited 2019 December 13]. https://pqip.org.uk/pages/ar2018
  • 11.Abbott TEF, Fowler AJ, Dobbs TD, Harrison EM, Gillies MA, Pearse RM. Frequency of surgical treatment and related hospital procedures in the UK: a national ecological study using hospital episode statistics. Br J Anaesth. 2017;119:249–257. doi: 10.1093/bja/aex137
  • 12.Khuri SF, Henderson WG, DePalma RG, Mosca C, Healey NA, Kumbhani DJ. Determinants of long-term survival after major surgery and the adverse effect of postoperative complications. Ann Surg. 2005;242:326–341.
  • 13.Toner A, Hamilton M. The long-term effects of postoperative complications. Curr Opin Crit Care. 2013;19:364–368. doi: 10.1097/MCC.0b013e3283632f77
  • 14.Partridge JS, Harari D, Dhesi JK. Frailty in the older surgical patient: a review. Age Ageing. 2012;41:142–147. doi: 10.1093/ageing/afr182
  • 15.Wijeysundera DN, Pearse RM, Shulman MA, Abbott TEF, Torres E, Ambosta A, et al. Assessment of functional capacity before major non-cardiac surgery: an international, prospective cohort study. Lancet. 2018;391:2631–2640. doi: 10.1016/S0140-6736(18)31131-0
  • 16.Moonesinghe SR, Mythen MG, Das P, Rowan KM, Grocott MP. Risk stratification tools for predicting morbidity and mortality in adult patients undergoing major surgery: qualitative systematic review. Anesthesiology. 2013;119:959–981. doi: 10.1097/ALN.0b013e3182a4e94d
  • 17.Peden CJ, Stephens T, Martin G, Kahan BC, Thomson A, Rivett K, et al. Effectiveness of a national quality improvement programme to improve survival after emergency abdominal surgery (EPOCH): a stepped-wedge cluster-randomised trial. Lancet. 2019;393:2213–2221. doi: 10.1016/S0140-6736(18)32521-2
  • 18.Moonesinghe SR, Wong DJN, Farmer L, Shawyer R, Myles PS, Harris SK, et al. SNAP-2 EPICCS: the second Sprint National Anaesthesia Project-EPIdemiology of Critical Care after Surgery: protocol for an international observational cohort study. BMJ Open. 2017;7:e017690. doi: 10.1136/bmjopen-2017-017690
  • 19.von Elm E, Altman DG, Egger M, Pocock SJ, Gotzsche PC, Vandenbroucke JP. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. J Clin Epidemiol. 2008;61:344–349. doi: 10.1016/j.jclinepi.2007.11.008
  • 20.Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD Statement. Br J Surg. 2015;102:148–158. doi: 10.1002/bjs.9736
  • 21.Wong DJN, Popham S, Wilson AM, Barneto LM, Lindsay HA, Farmer L, et al. Postoperative critical care and high-acuity care provision in the United Kingdom, Australia, and New Zealand. Br J Anaesth. 2019;122:460–469. doi: 10.1016/j.bja.2018.12.026
  • 21.Wong DJN, Popham S, Wilson AM, Barneto LM, Lindsay HA, Farmer L et al. Postoperative critical care and high-acuity care provision in the United Kingdom, Australia, and New Zealand. Br J Anaesth. 2019;122: 460–469. 10.1016/j.bja.2018.12.026 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Prytherch DR, Whiteley MS, Higgins B, Weaver PC, Prout WG, Powell SJ. POSSUM and Portsmouth POSSUM for predicting mortality. Physiological and Operative Severity Score for the enUmeration of Mortality and morbidity. Br J Surg. 1998;85: 1217–1220. [DOI] [PubMed] [Google Scholar]
  • 23.Sutton R, Bann S, Brooks M, Sarin S. The Surgical Risk Scale as an improved tool for risk-adjusted analysis in comparative surgical audit. Br J Surg. 2002;89: 763–768. 10.1046/j.1365-2168.2002.02080.x [DOI] [PubMed] [Google Scholar]
  • 24.Protopapa KL, Simpson JC, Smith NC, Moonesinghe SR. Development and validation of the Surgical Outcome Risk Tool (SORT). Br J Surg. 2014;101: 1774–1783. 10.1002/bjs.9638 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21: 128–138. 10.1097/EDE.0b013e3181c30fb2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Swets JA. Measuring the Accuracy of Diagnostic Systems. Science. 1988;240: 1285–1293. 10.1126/science.3287615 [DOI] [PubMed] [Google Scholar]
  • 27.DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44: 837–845. [PubMed] [Google Scholar]
  • 28.Pencina MJ, D’Agostino RB, D’Agostino RB, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27: 157–72; discussion 207. 10.1002/sim.2929 [DOI] [PubMed] [Google Scholar]
  • 29.Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26: 565–574. 10.1177/0272989X06295361 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.National Institute for Health and Care Excellence. Routine preoperative tests for elective surgery (NICE guideline NG45). 2016 [cited 2019 December 13]. https://www.nice.org.uk/guidance/ng45 [PubMed]
  • 31.Graham JW. Analysis of Missing Data In: Graham JW, ed. Missing Data: Analysis and Design. New York: Springer; 2012. pp. 47–69. [Google Scholar]
  • 32.Devereaux PJ, Chan MT, Alonso-Coello P, Walsh M, Berwanger O, Villar JC et al. Association between postoperative troponin levels and 30-day mortality among patients undergoing noncardiac surgery. JAMA. 2012;307: 2295–2304. 10.1001/jama.2012.5502 [DOI] [PubMed] [Google Scholar]
  • 33.Bilimoria KY, Liu Y, Paruch JL, Zhou L, Kmiecik TE, Ko CY et al. Development and evaluation of the universal ACS NSQIP surgical risk calculator: a decision aid and informed consent tool for patients and surgeons. J Am Coll Surg. 2013;217: 833–42.e1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Chan DXH, Sim YE, Chan YH, Poopalalingam R, Abdullah HR. Development of the Combined Assessment of Risk Encountered in Surgery (CARES) surgical risk calculator for prediction of postsurgical mortality and need for intensive care unit admission risk: a single-center retrospective study. BMJ Open. 2018;8: e019427 10.1136/bmjopen-2017-019427 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Moran J, Wilson F, Guinan E, McCormick P, Hussey J, Moriarty J. Role of cardiopulmonary exercise testing as a risk-assessment method in patients undergoing intra-abdominal surgery: a systematic review. Br J Anaesth. 2016;116: 177–191. 10.1093/bja/aev454 [DOI] [PubMed] [Google Scholar]
  • 36.Corey KM, Kashyap S, Lorenzi E, Lagoo-Deenadayalan SA, Heller K, Whalen K et al. Development and validation of machine learning models to identify high-risk surgical patients using automatically curated electronic health record data (Pythia): A retrospective, single-site study. PLoS Med. 2018;15: e1002701 10.1371/journal.pmed.1002701 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Louise Gaynor-Brook

10 Mar 2020

Dear Dr. Moonesinghe,

Thank you very much for submitting your manuscript "Comparing subjective and objective risk assessment for predicting mortality after major surgery - an international prospective cohort study" (PMEDICINE-D-19-04548) for consideration at PLOS Medicine.

Your paper was evaluated by a senior editor and discussed among the editors here. It was also discussed with an academic editor with relevant expertise, and sent to independent reviewers, including a statistical reviewer. The reviews are appended at the bottom of this email and any accompanying reviewer attachments can be seen via the link below:

[LINK]

In light of these reviews, I am afraid that we will not be able to accept the manuscript for publication in the journal in its current form, but we would like to consider a revised version that addresses the reviewers' and editors' comments. Obviously we cannot make any decision about publication until we have seen the revised manuscript and your response, and we plan to seek re-review by one or more of the reviewers.

In revising the manuscript for further consideration, your revisions should address the specific points made by each reviewer and the editors. Please also check the guidelines for revised papers at http://journals.plos.org/plosmedicine/s/revising-your-manuscript for any that apply to your paper. In your rebuttal letter you should indicate your response to the reviewers' and editors' comments, the changes you have made in the manuscript, and include either an excerpt of the revised text or the location (eg: page and line number) where each change can be found. Please submit a clean version of the paper as the main article file; a version with changes marked should be uploaded as a marked up manuscript.

In addition, we request that you upload any figures associated with your paper as individual TIF or EPS files with 300dpi resolution at resubmission; please read our figure guidelines for more information on our requirements: http://journals.plos.org/plosmedicine/s/figures. While revising your submission, please upload your figure files to the PACE digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at PLOSMedicine@plos.org.

We expect to receive your revised manuscript by Mar 31 2020 11:59PM. Please email us (plosmedicine@plos.org) if you have any questions or concerns.

***Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.***

We ask every co-author listed on the manuscript to fill in a contributing author statement, making sure to declare all competing interests. If any of the co-authors have not filled in the statement, we will remind them to do so when the paper is revised. If all statements are not completed in a timely fashion this could hold up the re-review process. If new competing interests are declared later in the revision process, this may also hold up the submission. Should there be a problem getting one of your co-authors to fill in a statement we will be in contact. YOU MUST NOT ADD OR REMOVE AUTHORS UNLESS YOU HAVE ALERTED THE EDITOR HANDLING THE MANUSCRIPT TO THE CHANGE AND THEY SPECIFICALLY HAVE AGREED TO IT. You can see our competing interests policy here: http://journals.plos.org/plosmedicine/s/competing-interests.

Please use the following link to submit the revised manuscript:

https://www.editorialmanager.com/pmedicine/

Your article can be found in the "Submissions Needing Revision" folder.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/plosmedicine/s/submission-guidelines#loc-methods.

Please ensure that the paper adheres to the PLOS Data Availability Policy (see http://journals.plos.org/plosmedicine/s/data-availability), which requires that all data underlying the study's findings be provided in a repository or as Supporting Information. For data residing with a third party, authors are required to provide instructions with contact information for obtaining the data. PLOS journals do not allow statements supported by "data not shown" or "unpublished results." For such statements, authors must provide supporting data or cite public sources that include it.

We look forward to receiving your revised manuscript.

Sincerely,

Louise Gaynor-Brook, MBBS PhD

Associate Editor

PLOS Medicine

plosmedicine.org

-----------------------------------------------------------

Requests from the editors:

General comment: Please cite reference numbers in square brackets, leaving a space before the reference bracket, and removing spaces between reference numbers where more than one reference is cited e.g. '... postoperative complications [5,6].'

General comment: Please refer to a more specific part of your appendix to make clear to readers to what you are referring e.g. Figure S1, Table S1, etc.

General comment: Throughout the ms, please quote exact p values or "p<0.001", unless there is a specific statistical justification for quoting smaller exact numbers.

General comment: Please add line numbers to your revised manuscript.

Data Availability Statement: PLOS Medicine requires that the de-identified data underlying the specific results in a published article be made available, without restrictions on access, in a public repository or as Supporting Information at the time of article publication, provided it is legal and ethical to do so. Please provide the URL and any other details that will be needed to access data stored in the public repository.

Title: Please revise your title according to PLOS Medicine's style, placing the study design in the subtitle (ie, after a colon). We suggest “Subjective and objective risk assessment for predicting mortality after major surgery: an international prospective cohort study” or similar

Abstract Background: Please expand upon the context of why the study is important. The final sentence should clearly state the study question, and clarify the aims of the study e.g. to compare the accuracy of freely-available objective surgical risk tools with subjective clinical assessment in predicting 30-day mortality

Please combine the Methods and Findings components of your Abstract into one subsection with the heading ‘Methods and Findings’

Please include the study design, brief demographic details of the population studied (e.g. age, sex, types of surgery, etc) and further details of the study setting (e.g. what type of hospitals - secondary / tertiary care? Rural / urban? etc.), and main outcome measures.

In your abstract and elsewhere, please quote p values alongside 95% CI where available.

Please revise the sentence beginning ‘We included consecutive adults…’ to clarify. Please revise 'consecutive adults’ and ‘undergoing inpatient surgery for one week’

In the last sentence of the Abstract Methods and Findings section, please describe the main limitation(s) of the study's methodology.

Please begin your Abstract Conclusions with "In this study, we observed ..." or similar. Please address the study implications, emphasising what is new without overstating your conclusions.

Please tone down subjective language such as ‘highly accurate’

Please avoid vague statements such as " Clinicians can use this….", to instead highlight that ‘This may be of value in helping to stratify… ‘

At this stage, we ask that you include a short, non-technical Author Summary of your research to make findings accessible to a wide audience that includes both scientists and non-scientists. The Author Summary should immediately follow the Abstract in your revised manuscript. This text is subject to editorial change and should use non-identical language distinct from the scientific abstract. Please see our author guidelines for more information: https://journals.plos.org/plosmedicine/s/revising-your-manuscript#loc-author-summary Please ensure to include a final bullet point under ‘What do these findings mean?’ to describe the main limitation(s) of the study.

Introduction

Please explain the need for and potential importance of your study. Indicate whether your study is novel and how you determined that.

Please define QI

Methods

Please include the completed STROBE / TRIPOD checklists as Supporting Information and refer to the relevant supplementary files in your Methods section. When completing the checklists, please use section and paragraph numbers, rather than page numbers.

Please adapt "STROBE diagram" to "participant flowchart", or similar.

Did your study have a prospective protocol or analysis plan? Please state this (either way) early in the Methods section. If a prospective analysis plan was used in designing the study, please include the relevant prospectively written document with your revised manuscript as a Supporting Information file to be published alongside your study, and cite it in the Methods section. If no such document exists, please make sure that the Methods section transparently describes when analyses were planned, when/why any data-driven changes to analyses took place, and what those changes were.

Please provide more detail on the hospitals included in the study e.g. secondary / tertiary care, how many from each country, etc.

Please provide detail on the inclusion criteria, as this is notably missing from your supplementary material.

Please provide further detail on the institutional research and development department approvals for Northern Ireland, Australia and New Zealand as you have for the UK. This should include the names of the institutional review board(s) that provided ethical approval.

Results

Please mention in the main text of your Results that 4,891 surgical episodes were excluded from analysis (as indicated in your STROBE diagram)

Please quote p values alongside 95% CI where available.

Please refer to supplementary figures for the results presented for the sensitivity analyses.

Please consider whether additional sensitivity analyses could be performed to at least partially address comments about inclusion of certain specialties that might bias the overall findings in a particular direction. Please ensure that it is made clear in your Methods that any such additional analyses are non-prespecified.

Table 1 - please revise ‘Xmajor’.

Tables 3 & 4 - please define all abbreviations used in the table legend. When a p value is given, please specify the statistical test used to determine it in the table legend.

Tables S3 & S4 - please define all abbreviations used in the table legend.

Figure 2 - please define all abbreviations used in the figure legend

Figure 4 - please provide a figure legend

Supplementary Figures 3 and 4 - please provide a figure legend

Discussion

Please present and organize the Discussion as follows: a short, clear summary of the article's findings; what the study adds to existing research and where and why the results may differ from previous research; strengths and limitations of the study; implications and next steps for research, clinical practice, and/or public policy; one-paragraph conclusion.

Please tone down subjective language such as ‘highly accurate’

Please avoid vague statements such as " Clinicians can use this….", to instead highlight that ‘This may be of value in helping to stratify… ‘

References

Please provide names of the first 6 co-authors for each paper referenced, before ‘et al’

Noting reference 11, please ensure that all references have full access details.

Supplementary Files

Please clearly label each supplementary file / table / figure with a distinct name and ensure that a reference is made to each component in the main text of the manuscript.

Please ensure that all components of your supplementary files e.g. inclusion criteria are included in your resubmitted manuscript.

Comments from the reviewers:

Reviewer #1: "Comparing subjective and objective risk assessment for predicting mortality after major surgery - an international prospective cohort study of 26,216 patients" evaluates three publicly-available objective risk tools against clinicians' subjective judgment on mortality prediction, and further integrates the best-performing objective risk tool with said subjective judgment to produce an even better-performing logistic regression model.

The prospective nature of the study, diverse demographics involved (over 26,000 patients from 274 hospitals in the United Kingdom, Australia and New Zealand) and open availability of the examined objective risk tools (P-POSSUM, SRS and SORT) are particular strengths of the manuscript. Overall, the work seems to have the potential to broadly benefit surgical management practice. However, there are a number of points that might be expanded upon.

Firstly, on the comparison methods: the authors may wish to clarify their definition of objective vs. subjective for risk-assessment procedures. In particular, the "objective" Surgical Risk Scale (SRS) appears to comprise a summation of ASA-PS scores together with the CEPOD and BUPA scores ("The surgical risk scale as an improved tool for risk-adjusted analysis in comparative surgical audit", Sutton et al., 2002), but the ASA-PS is itself considered "subjective" assessment (Page 8). Moreover, it is not clear whether CEPOD and BUPA are any less "subjective" than ASA-PS, from their descriptions.

Also, for P-POSSUM, it is noted that a number of biochemical/haematological parameters were missing in practice ("Missing data" section, Page 9), and in these cases, normal data was assumed/imputed. Moreover, there remained a small number of cases with further missing data beyond these bio/haemo parameters (as understood from "Following imputation, we performed a complete case analysis as we considered the proportion of cases with missing data in *the remaining variables* to be low [1.08%]"). This practice, however, seems problematic, in that close to half of the variables used for P-POSSUM appear to be biochemical/haematological (i.e. haemoglobin, WBC, urea, sodium, potassium). Simply assuming normal values for these variables would then appear to rob P-POSSUM of quite a bit of its utility, since these assumptions constitute unsubstantiated evidence towards better patient outcomes. The authors might clarify as to exactly how many cases were affected by missing data for P-POSSUM, and consider an analysis of only those cases where full data was available for P-POSSUM, for a fair comparison.
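
A minimal sketch of the kind of "assume normal" imputation under discussion is shown below; the column names and reference values are illustrative assumptions, not the authors' actual implementation, but the imputation flags show how the complete-case comparison requested above could be supported.

```python
import pandas as pd

# Hypothetical mid-normal reference values for the P-POSSUM laboratory
# inputs (illustrative only, not the values used in the study).
NORMAL_VALUES = {
    "haemoglobin": 14.0,   # g/dL
    "wbc": 7.5,            # x10^9/L
    "urea": 5.0,           # mmol/L
    "sodium": 140.0,       # mmol/L
    "potassium": 4.2,      # mmol/L
}

def impute_normal(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing laboratory values with mid-normal defaults,
    flagging imputed rows so a complete-case sensitivity analysis
    remains possible."""
    out = df.copy()
    for col, normal in NORMAL_VALUES.items():
        out[col + "_was_imputed"] = out[col].isna()
        out[col] = out[col].fillna(normal)
    return out
```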

In general, the authors might consider summarizing the parameters & criteria used for the various objective methods as supplementary material.

Secondly, on the use of AUROC as a main quantitative assessment metric of the various methods - it is stated that the 30-day risk of death was predicted as one of six categorical responses, for all methods used (Page 7). Then, although the continuous net reclassification improvement statistic (NRI) is cited (Page 8), the authors might clarify in greater detail as to how this relates to the construction of ROC curves (i.e. Figure 2 & 3). For example, if the prediction is 0.5% and the patient indeed survives for 30 days, what sensitivity/specificity does this amount to as opposed to if the patient had not survived (we assume a binary outcome)? While readers might be familiar with the usual binary ROC formulation, the implementation of multiple categories might warrant more description.
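
For reference, the continuous (category-free) NRI of Pencina and colleagues (reference 28) can be computed directly from paired predictions and the binary outcome; the sketch below is a generic illustration, not the authors' code.

```python
import numpy as np

def continuous_nri(p_old: np.ndarray, p_new: np.ndarray, y: np.ndarray) -> float:
    """Continuous NRI: events should be assigned higher risk by the new
    model and non-events lower risk; returns the sum of the event and
    non-event components."""
    up, down = p_new > p_old, p_new < p_old
    events, nonevents = y == 1, y == 0
    nri_events = up[events].mean() - down[events].mean()
    nri_nonevents = down[nonevents].mean() - up[nonevents].mean()
    return float(nri_events + nri_nonevents)
```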

Further on the categorical prediction, the calibration graphs in Figures 2 and 3 appear to suggest that the various methods output different numbers of point estimates. For example, P-POSSUM has 10, SRS has 6, SORT has 9 and clinicians have 4 (as opposed to the six categories implied in the Dataset section). While it seems that a greater number of point estimates improves the level of detail of the corresponding ROC curve, the effect seems exaggerated for SORT and SRS; in particular, the ASA-PS ROC curve in Figure 2 seems to be a piecewise construction from about 3 datapoints, while the SRS curve likewise seems to be piecewise constructed from about 6 datapoints. However, the SORT and SRS curves have much finer detail (i.e. have an independent value for each 0.01 change in specificity or less), despite only having slightly more point estimates than SRS. The authors might wish to explain this discrepancy.
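
One likely explanation is that the number of vertices on an empirical ROC curve is governed by the number of distinct predicted values, not by the number of calibration point estimates. The toy simulation below (all values hypothetical) shows a continuous score yielding thousands of ROC thresholds while a 4-band categorical version of the same score yields only a handful.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
n = 5000
y = rng.binomial(1, 0.05, size=n)                  # rare binary outcome

# A continuous score has ~n distinct values, a banded assessment only a
# few; the empirical ROC curve has roughly one vertex per distinct
# value, hence the different levels of detail.
continuous = rng.beta(2, 20, size=n) + 0.05 * y
band_edges = [0.05, 0.10, 0.15]                    # hypothetical cut points
categorical = np.digitize(continuous, band_edges)  # 4 risk bands: 0..3

for name, score in [("continuous", continuous), ("categorical", categorical)]:
    fpr, tpr, thresholds = roc_curve(y, score)
    print(f"{name}: {len(thresholds)} ROC thresholds")
```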

Thirdly, while there is detailed analysis of the aggregate statistics for the various methods (including on a high-risk patient subset), an additional inter-model analysis at finer granularity would seem to be appropriate. In particular, do the various objective and subjective models tend to agree/disagree on specific patients, and in cases where they strongly disagree (i.e. one method predicts a very low mortality risk, while another method predicts a significant risk, for the same patient), what are the factors that might have led to this disagreement? Such an analysis would help to determine whether the choice of particular models (e.g. the combined logistic regression model vs. SORT alone) is strictly beneficial for all patients (strictly dominating method), or if it may shift the risks of inaccurate prediction from one group of patients onto another group.

A confusion matrix of the six prediction categories vs. binary outcomes at a reasonable ROC operating point for each method would also be illuminating.
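
Such a matrix is straightforward to tabulate once outcomes are paired with banded predictions; a generic sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical example: each patient's predicted risk band versus the
# observed binary 30-day outcome, cross-tabulated with row/column totals.
df = pd.DataFrame({
    "risk_band": ["<1%", "1-2.5%", "2.6-5%", ">5%", "<1%", "2.6-5%"],
    "outcome":   ["Alive", "Alive", "Dead", "Dead", "Alive", "Alive"],
})
print(pd.crosstab(df["risk_band"], df["outcome"], margins=True))
```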

Fourthly, the authors may wish to comment further on the practical implications of accurate mortality prediction (it is noted in the Discussion section that accurate prediction may result in fewer inappropriate admissions to critical care, though there remains a lack of "real-world" clinical uptake); will such predictions be used by clinicians/patients in deciding whether to commence surgery? Indeed, given that mortality estimates were made by the perioperative team beforehand for participating patients, did these estimates have any impact on the treatment offered (i.e. the abovementioned reduction in inappropriate admissions)?

Finally, while perhaps somewhat out of the scope of this manuscript, the authors might consider exploring data-centric machine learning methods in the future (i.e., training models directly from the available patient demographic features).

Reviewer #2: In this international prospective cohort study, the authors address the very important issue of preoperative risk assessment for mortality.

The article is clear and well written, but there are important methodological questions that have to be considered.

1) It is certainly a very good point to test the subjective risk assessment. However, a main limitation is that it cannot be ensured that the physicians did not use objective assessment scores to help them answer the question proposed. These cases should have been excluded from all analyses initially (flowchart 1). Additionally, patients for whom the ASA score was used were kept in the other analyses and classified as belonging to the 'subjective analysis' group. Although the ASA score is the oldest and has a lot of subjectivity in itself, it cannot be considered that no score was used. Also, the numbers in Flowchart 1 and Table 2 are a bit confusing (23540 patients with clinical judgment only and 9928 with ASA score in Table 2, and 21325 patients included in the flowchart). The main analysis should have been done only with patients classified by clinical judgment according to the answer to the question proposed, without using any additional tool, including ASA.

2) It is stated in the methods that data imputation was performed for laboratory values that were missing from the POSSUM score. Please state how many imputations were performed. Additionally, assuming that the values are normal may hold for patients undergoing elective surgery without a blood draw available, but it may not be accurate for emergency patients, for whom there was no time to perform a blood test.

3) Although there is no consensus on what the best risk stratification tool is, in the discussion the authors mentioned that a limitation of the NSQIP risk calculator is that it was never validated outside the US. This would have been a good opportunity to do so because, although it involves more variables, it is freely available online and could easily be implemented in an electronic chart, for example.

Reviewer #3: Dear authors

Thank you for providing me the opportunity to read your manuscript; it reports a comparison of objective and subjective predictive tools for predicting postoperative mortality after major surgery.

While the objectives of the study are interesting and would have some potentially important clinical applications, several major limitations constrain the interpretation of the observed results.

My key suggestions to improve the impact of this study are:

To select preoperative scores (not P-POSSUM) that do not already include a subjective component (neither SORT nor SRS).

To not attempt to create a universal predictive tool that works for any type of surgery.

To better describe the subjective score introduced: inter-rater variability, distributions, and possibly a reduction in the number of strata to limit variability.

To emphasize the results provided by the decision curves rather than focussing on AUC ROC in models with poor calibration.

I summarized the most important methodological concerns below

1 - Population: The study population includes a non-representative mix of cases, including some surgical specialties that are associated with high postoperative mortality and in which general predictive tools don't work well (i.e., cardiac surgery: 3.9% of the cohort, 5.7% of the observed deaths).

The objective of the study is to evaluate the predictive performance of scores already described elsewhere. Therefore we are not seeing representative samples, and we are simply evaluating the performance of some predictive models. Unfortunately, the combination of the clinical prediction and the other scores is dependent on the study population case-mix, because a new model has been fit. Whether this combined model works on other populations with a different case-mix is unknown and must be discussed properly as an important limitation.

2 - Population (2): The inclusion of obstetric patients is a usual mistake we can find in similar studies. It increased the number of patients in the cohort, but it is not associated with any deaths. Therefore, it lowered the average apparent mortality (no deaths after obstetrics in the study we are discussing today) and makes the methodological tools used to compare the predictive models inaccurate and potentially biased. Further, with no deaths and a clearly identified subgroup (which is included in the calculation of the scores), the inclusion of OB patients artificially increases discrimination performance but is not clinically relevant. Which clinician(s) would use the same score to accurately predict the outcome of OB and cardiac patients?

3 - Major methodological mistake:

We are dealing with a low-frequency primary outcome (i.e., 30-day mortality = 1.2%; 317 deaths in 26,616 patients).

P-POSSUM, SRS and SORT showed terrible calibration in Figure 2, where we observe dramatic overestimations of the risk of death (the calibration curves are way down in the lower right part of the calibration plots). This common methodological concern is poorly detected with ROC curves. Combining infrequent outcomes and poorly calibrated predictive models constantly produces very high AUC while the models have no clinical value. That is what we observed here, and in many other studies dealing with similar population characteristics; including OB patients amplified the phenomenon.
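
This point is easy to demonstrate: AUROC is purely rank-based, so a model that overstates every risk fivefold (and is therefore badly miscalibrated) retains exactly the same AUROC. A small simulation with illustrative parameters:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
true_risk = rng.beta(1, 60, size=20000)   # rare outcome, mean risk ~1.6%
y = rng.binomial(1, true_risk)

well_calibrated = true_risk
overestimating = 5 * true_risk            # overstates every risk 5-fold

# The transformation is monotone, so the ranking (and hence AUROC)
# is unchanged despite gross miscalibration.
print(roc_auc_score(y, well_calibrated))
print(roc_auc_score(y, overestimating))
```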

4 - Major methodological mistake (2):

SORT: Truly preoperative; includes ASA (subjective, with wide inter-rater variability); fine description of the surgical procedures.

P-POSSUM: A more objective risk score with less inter-rater variability, but includes intraoperative variables not available preoperatively.

SRS: Truly preoperative; includes ASA (subjective, with wide inter-rater variability) and a subjective stratification associated with the planned surgery.

The new subjective assessment introduced in this study: Preoperative; very subjective; 6 categories, somewhat like ASA. No quantification of the inter-rater variability, which is suspected to be very high in the non-extreme categories.

2 of the 3 scores evaluated already include a clinical subjective evaluation of the preoperative patients' characteristics (i.e., ASA, with all the limitations we know).

1 score includes intraoperative characteristics, which makes the comparison impossible.

The authors finally combined SORT and the clinical prediction they introduced. They demonstrated in the study population (after model fitting) that the predictive performance is somewhat better, and that the calibration is much better (they fit a new model...).

SORT already includes ASA; how does ASA interact with the clinical prediction, which looks very similar?

While it is recognized that the inter-rater variability of ASA is huge, what do we know about the inter-rater variability of the clinical prediction described here?

5 - The use of decision curves is a very strong methodological part of this study. Unfortunately, their description and interpretation are somewhat neglected compared to the useless ROC comparison. The reviewer strongly recommends expanding the description, interpretation, and discussion of the decision curves, which actually provide the information that clinicians are seeking.
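
For readers following this recommendation, the net benefit underlying a decision curve is, per Vickers and Elkin (reference 29), TP/n - (FP/n) x pt/(1 - pt) at threshold probability pt; the sketch below is generic, not the authors' implementation.

```python
import numpy as np

def net_benefit(y: np.ndarray, p: np.ndarray, pt: float) -> float:
    """Net benefit at threshold probability pt (Vickers & Elkin):
    TP/n - FP/n * pt / (1 - pt), where 'treated' means p >= pt."""
    treat = p >= pt
    n = len(y)
    tp = np.sum(treat & (y == 1)) / n
    fp = np.sum(treat & (y == 0)) / n
    return tp - fp * pt / (1 - pt)

def decision_curve(y, p, thresholds=np.linspace(0.01, 0.5, 50)):
    """Model net benefit across thresholds, with the 'treat all'
    reference strategy ('treat none' is 0 everywhere)."""
    prevalence = np.mean(y)
    model = [net_benefit(y, p, t) for t in thresholds]
    treat_all = [prevalence - (1 - prevalence) * t / (1 - t) for t in thresholds]
    return thresholds, model, treat_all
```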

Some specific comments:

Page 18 line 1:

"The SORT was the best-calibrated of the pre-existing models, however all over-predicted risk (Figure 2A-2C; Hosmer-Lemeshow p-values all "

The HL test is not an appropriate approach to evaluate calibration in large cohorts where the frequency of the outcomes is low (see the TRIPOD statements). Further, an HL statistic probability lower than 0.10 suggests miscalibration. In Figure 2, the HL probabilities are all lower than 0.0001. This somewhat confirms the visual analysis of the curves, suggesting that the calibration of these scores is terrible (results already widely described elsewhere).
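
As an aside, TRIPOD-aligned alternatives to the HL test, such as the calibration slope and intercept, can be estimated directly. The sketch below is generic and, for simplicity, estimates slope and intercept jointly rather than fixing the slope at 1 when estimating the intercept, as is conventional.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_slope_intercept(y: np.ndarray, p: np.ndarray):
    """Regress the outcome on the logit of the predicted risk; a slope
    of 1 and intercept of 0 indicate good calibration. Uses a large C
    to approximate an unpenalised fit."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    lp = np.log(p / (1 - p)).reshape(-1, 1)
    m = LogisticRegression(C=1e6).fit(lp, y)
    return float(m.coef_[0, 0]), float(m.intercept_[0])
```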

Page 18 line 6:

"All models exhibited good-to-excellent discrimination (Figure 2D; AUROC SORT=0.91 (95% confidence interval (CI): 0.90-0.93); P-POSSUM=0.90, (95% CI: 0.88-0.92); SRS=0.85 (95% CI: 0.83-0.88)."

High observed discriminations are the consequence of:

1/ poor calibration of the models, and

2/ a primary outcome frequency of approximately 1.2%.

The presented CIs seem to have been produced assuming a binormal distribution (the usual approach); this is not likely to be appropriate in this setting, and the reader can guess that these CIs are widely underestimated.

More importantly, no discrimination should be interpreted with such a poor calibration.

Figure 5:

This figure looks great, but confidence intervals would probably show that there are some major overlaps between the clinical prediction strata.

Thank you again for your work, and I hope my comments will be useful to you.

There is a true need for preoperative stratification/predictive tools, and this work has the potential to fill a part of this need.

Yannick Le Manach MD PhD

Reviewer #4: The manuscript entitled "Comparing subjective and objective risk assessment for predicting mortality after major surgery - an international prospective cohort study" was reviewed. In this study, the authors aimed to compare subjective and objective risk assessment tools, which are utilized in predicting the probability of postoperative mortality. The study was a well-designed, prospective, multicentric, observational study, with great effort to reduce bias as much as possible in terms of the recruitment phase and data analysis. However, there are some concerns regarding the manuscript, which should be addressed.

Comments:

1. In a recent study, Chan et al. (1) developed and validated the Combined Assessment of Risk Encountered in Surgery (CARES) surgical risk calculator for mortality in derivation and validation cohorts, which is also comparable to the present study's outcomes. The authors should also discuss the findings of this study.

2. How did the authors define the needed cut-off values for classification of preoperative estimation of mortality risk?

3. I would suggest the authors perform subgroup and validation analyses by classifying the patients based on the type of surgery.

4. How would it be possible to combine the two (objective and subjective) assessments in clinical practice? The combination of the tools has certainly improved the outcomes, but the authors did not suggest any combined approach to establish an estimate using both risk assessments simultaneously.
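
One generic way such a combination could be operationalised, echoing the paper's approach of refitting a logistic regression on the two inputs, is sketched below; the variable names are hypothetical and the published model's exact specification may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logit(p: np.ndarray) -> np.ndarray:
    """Convert probabilities to log-odds, clipping to avoid infinities."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def fit_combined(sort_risk, clinician_band, y):
    """Refit a logistic regression on the SORT log-odds plus the
    clinician's ordinal risk band, yielding a single combined
    probability of 30-day mortality."""
    X = np.column_stack([logit(np.asarray(sort_risk)),
                         np.asarray(clinician_band)])
    return LogisticRegression().fit(X, np.asarray(y))

# model.predict_proba(X_new)[:, 1] would then give the combined risk.
```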

Reference:

1. Chan DXH, Sim YE, Chan YH, Poopalalingam R, Abdullah HR. Development of the Combined Assessment of Risk Encountered in Surgery (CARES) surgical risk calculator for prediction of postsurgical mortality and need for intensive care unit admission risk: a single-center retrospective study. BMJ Open. 2018;8(3):e019427.

Any attachments provided with reviews can be seen via the following link:

[LINK]

Decision Letter 1

Clare Stone

18 Jun 2020

Dear Dr. Moonesinghe,

Thank you very much for re-submitting your manuscript "Developing and validating subjective and objective risk assessment measures for predicting mortality after major surgery: an international prospective cohort study" (PMEDICINE-D-19-04548R1) for review by PLOS Medicine.

I have discussed the paper with my colleagues and the academic editor and it was also seen again by previous reviewers. I am pleased to say that provided the remaining editorial and production issues are dealt with we are planning to accept the paper for publication in the journal.

The remaining issues that need to be addressed are listed at the end of this email. Any accompanying reviewer attachments can be seen via the link below. Please take these into account before resubmitting your manuscript:

[LINK]

Our publications team (plosmedicine@plos.org) will be in touch shortly about the production requirements for your paper, and the link and deadline for resubmission. DO NOT RESUBMIT BEFORE YOU'VE RECEIVED THE PRODUCTION REQUIREMENTS.

***Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.***

In revising the manuscript for further consideration here, please ensure you address the specific points made by each reviewer and the editors. In your rebuttal letter you should indicate your response to the reviewers' and editors' comments and the changes you have made in the manuscript. Please submit a clean version of the paper as the main article file. A version with changes marked must also be uploaded as a marked up manuscript file.

Please also check the guidelines for revised papers at http://journals.plos.org/plosmedicine/s/revising-your-manuscript for any that apply to your paper. If you haven't already, we ask that you provide a short, non-technical Author Summary of your research to make findings accessible to a wide audience that includes both scientists and non-scientists. The Author Summary should immediately follow the Abstract in your revised manuscript. This text is subject to editorial change and should be distinct from the scientific abstract.

We expect to receive your revised manuscript within 1 week. Please email us (plosmedicine@plos.org) if you have any questions or concerns.

We ask every co-author listed on the manuscript to fill in a contributing author statement. If any of the co-authors have not filled in the statement, we will remind them to do so when the paper is revised. If all statements are not completed in a timely fashion this could hold up the re-review process. Should there be a problem getting one of your co-authors to fill in a statement we will be in contact. YOU MUST NOT ADD OR REMOVE AUTHORS UNLESS YOU HAVE ALERTED THE EDITOR HANDLING THE MANUSCRIPT TO THE CHANGE AND THEY SPECIFICALLY HAVE AGREED TO IT.

Please ensure that the paper adheres to the PLOS Data Availability Policy (see http://journals.plos.org/plosmedicine/s/data-availability), which requires that all data underlying the study's findings be provided in a repository or as Supporting Information. For data residing with a third party, authors are required to provide instructions with contact information for obtaining the data. PLOS journals do not allow statements supported by "data not shown" or "unpublished results." For such statements, authors must provide supporting data or cite public sources that include it.

If you have any questions in the meantime, please contact me or the journal staff on plosmedicine@plos.org.

We look forward to receiving the revised manuscript by Jun 25 2020 11:59PM.

Sincerely,

Clare Stone, PhD

Managing Editor

PLOS Medicine

plosmedicine.org

------------------------------------------------------------

Requests from Editors:

Please provide summary demographic information to the abstract as previously requested.

We suggest quoting AUROC/95% CI in the abstract for all the objective models.

Please add a new final sentence to the "methods and findings" subsection of your abstract to quote 2-3 of the study's main limitations.

Please remove the instructions from the "author summary".

Please convert p<0.0001 to p<0.001 throughout.

Please move the reference call-outs to precede punctuation (e.g., "... decision making [2,3].").

Is reference 2 missing full access details?

Comments from Reviewers:

Reviewer #1: We thank the authors for addressing most of the points raised previously, and particularly appreciate the addition of a fourth sensitivity analysis on P-POSSUM, and an informative confusion matrix at 5% mortality. On the confusion matrix (Supplementary Table S6), we agree with the authors that a single cut-off ROC value is inadequate to characterize the ROC profile; the intention was chiefly to gain some additional perspective on the performance of the tool, which does appear to fulfil expectations. The authors might however note that some entries in the matrix appear slightly off-by-one (e.g. the Dead column is stated to total 189, but the sum of the six categories appears to be 188; 2.6-5% is stated to total 891, but the Alive+Dead for that row appears to be 892)

On part of the second point raised in the previous review on different methods having different numbers of point estimates in Figure 2 and 3, the authors' clarification on clinicians having 4 points in their calibration graph was much appreciated. However, given that it is stated that the predictions for P-POSSUM, SRS and SORT are continuous variables, the authors may wish to briefly comment on the different number of sampled points for these methods, as shown in Figure 2.

On the third point raised in the previous review ("additional inter-model analysis at finer granularity"), the major concern was whether the various risk models tend to have similar risk predictions for the same patients (in aggregate), and if not, what were the characteristics of patients that tended to be assigned different risks by different models. We agree with the authors that this may not be critical to the main thrust of the manuscript; it was proposed largely as the data appears available, and since it might be of interest given that multiple models are already being compared. As such, we would respect the authors' preference on whether to include such an analysis.

Minor issue: On page 7 line 10, "and equipoise over method is most accurate" might be "...which method is most accurate"

Reviewer #3: Dear Authors

Thank you for the responses to my comments. Regarding the limitations associated with the datasets and the objectives of your study, I believe you significantly improved your manuscript.

I still disagree with the value of the subjective score in some countries where these scores are used for billing purposes. It seems this was not the case in your cohort. However, I would be prudent about any tentative generalization. A. Sankar et al. (BJA, Volume 113, Issue 3, September 2014, Pages 424-432) provided a report of this problem in a jurisdiction where ASA is used for billing.

While I will not ask for any revision for this version of the manuscript, I would be delighted if the authors would consider adding a sentence about the need for further research, since objective quantifications of the risk did not perform as well as when a subjective component is included (i.e., current models and approaches do not capture this part of the information provided by clinicians… more research is needed to capture it in a more objective manner).

Thanks for considering

Yannick Le Manach MD PhD

Any attachments provided with reviews can be seen via the following link:

[LINK]

Decision Letter 2

Clare Stone

3 Sep 2020

Dear Prof. Moonesinghe,

On behalf of my colleagues and the academic editor, Dr. David Menon, I am delighted to inform you that your manuscript entitled "Developing and validating subjective and objective risk assessment measures for predicting mortality after major surgery: an international prospective cohort study" (PMEDICINE-D-19-04548R2) has been accepted for publication in PLOS Medicine.

PRODUCTION PROCESS

Before publication you will see the copyedited word document (in around 1-2 weeks from now) and a PDF galley proof shortly after that. The copyeditor will be in touch shortly before sending you the copyedited Word document. We will make some revisions at the copyediting stage to conform to our general style, and for clarification. When you receive this version you should check and revise it very carefully, including figures, tables, references, and supporting information, because corrections at the next stage (proofs) will be strictly limited to (1) errors in author names or affiliations, (2) errors of scientific fact that would cause misunderstandings to readers, and (3) printer's (introduced) errors.

If you are likely to be away when either this document or the proof is sent, please ensure we have contact information of a second person, as we will need you to respond quickly at each point.

PRESS

A selection of our articles each week are press released by the journal. You will be contacted nearer the time if we are press releasing your article in order to approve the content and check the contact information for journalists is correct. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact.

PROFILE INFORMATION

Now that your manuscript has been accepted, please log into EM and update your profile. Go to https://www.editorialmanager.com/pmedicine, log in, and click on the "Update My Information" link at the top of the page. Please update your user information to ensure an efficient production and billing process.

Thank you again for submitting the manuscript to PLOS Medicine. We look forward to publishing it.

Best wishes,

Clare Stone, PhD

Managing Editor

PLOS Medicine

plosmedicine.org

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. STROBE checklist.

    STROBE, Strengthening the Reporting of Observational Studies in Epidemiology.

    (PDF)

    S2 Text. TRIPOD checklist.

    TRIPOD, Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis.

    (DOCX)

    S3 Text. Case report form.

    (DOCX)

    S4 Text. Continuous net reclassification index analysis and reclassification tables.

    (DOCX)

    S5 Text. Sensitivity analyses overview.

    (DOCX)

    S6 Text. Sensitivity analysis 1.

    (DOCX)

    S7 Text. Sensitivity analysis 2.

    (DOCX)

    S8 Text. Sensitivity analysis 3.

    (DOCX)

    S9 Text. Sensitivity analysis 4.

    (DOCX)

    S10 Text. Sensitivity analysis 5.

    (DOCX)

    S11 Text. Sensitivity analysis 6.

    (DOCX)

    S12 Text. Acknowledgments and full list of SNAP2: EPICCS collaborators.

    SNAP2: EPICCS, Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery.

    (DOCX)

    S1 Table. Characteristics of the patient subgroups used in all sensitivity analyses.

    ASA-PS, American Society of Anesthesiologists Physical Status; COPD, Chronic Obstructive Pulmonary Disease; IQR, interquartile range; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

    (DOCX)

    S2 Table. Confusion matrix of patients' 30-day mortality outcomes versus clinician predictions.

    (%) represents row percentage.

    (DOCX)

    S3 Table. AUROCs of the objective risk tools and subjective assessment, compared between the UK and Australian/New Zealand data subsets.

    We found no significant difference in discrimination using any of the risk prediction tools or using subjective assessment when comparing their performance in the UK and Australian/New Zealand data sets. AUROC, Area Under Receiver Operating Characteristic curve; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

    (DOCX)

    S4 Table. Discrimination and calibration performance of the new combined prediction model in different specialty subgroups.

    (DOCX)

    S1 Fig. Calibration plots for the SORT (A), P-POSSUM (B), SRS (C), and ROC curves for the 3 models (D) validated in the whole patient cohort, including those undergoing obstetric procedures.

    In the calibration plots (A–C), nonparametric smoothed best-fit curves (blue) are shown along with the point estimates for predicted versus observed mortality (black dots) and their 95% CIs (black lines) within each decile of predicted mortality. External validation of all 3 models was performed on the entire SNAP-2: EPICCS patient data set (n = 25,854). CI, confidence interval; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; ROC, Receiver Operating Characteristic; SNAP-2: EPICCS, Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

    (PDF)

    S2 Fig. Calibration plots and ROC curves for subjective clinical assessments (A, B) and the logistic regression model combining clinician and SORT predictions (C, D), validated on the subset of patients in whom clinicians estimated risk based on clinical judgement alone, drawn from the full SNAP-2: EPICCS data set, including patients who underwent obstetric surgery (n = 21,325).

    For (A), a nonparametric smoothed best-fit curve (blue) is shown along with the point estimates for predicted versus observed mortality (black dots) and their 95% CIs (black lines) within each range of clinician-predicted mortality. For (C), the apparent (blue) and optimism-corrected (red) nonparametric smoothed calibration curves are shown; the latter was generated from 1,000 bootstrapped resamples of the data set. CI, confidence interval; ROC, Receiver Operating Characteristic; SNAP-2: EPICCS, Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery; SORT, Surgical Outcome Risk Tool.

    (PDF)

    S3 Fig. Calibration plots for SORT (A), P-POSSUM (B), SRS (C), and clinical assessments (E) and ROC curves for the 3 models (D) and clinical assessments (F), validated in the sensitivity analysis patient subset with restricted inclusion criteria (n = 12,985).

    The AUROCs for P-POSSUM, SRS, SORT, and clinical assessments were 0.863, 0.810, 0.875, and 0.853 in this subgroup, respectively. AUROC, Area Under Receiver Operating Characteristic curve; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

    (PDF)

    S4 Fig. Calibration plot (A) and ROC curve (B) for clinical assessments, validated in the sensitivity analysis patient subgroup in which clinical assessments were made in conjunction with 1 or more other risk prediction tools (n = 4,786).

    The AUROC for clinical assessments was 0.880 in this subgroup. AUROC, Area Under Receiver Operating Characteristic curve.

    (PDF)

    S5 Fig. Calibration plots (A to F) and ROC curves (G & H) for objective risk tools, validated in patients stratified by their country groups.

    There was minimal difference between countries. ROC, Receiver Operating Characteristic.

    (PDF)

    S6 Fig. Calibration plots (A & B) and ROC curves (C & D) for clinical assessments, validated in patients stratified by their country groups.

    There was minimal difference between countries. ROC, Receiver Operating Characteristic curve.

    (PDF)

    S7 Fig. Calibration plots for SORT (A), P-POSSUM (B), SRS (C), and clinical assessments (E) and ROC curves for the 3 models (D) and clinical assessments (F), validated in the sensitivity analysis patient subset with complete P-POSSUM variables (n = 18,362).

    The AUROCs for P-POSSUM, SRS, SORT, and clinical assessments were 0.893, 0.838, 0.899, and 0.896 in this subgroup, respectively. AUROC, Area Under Receiver Operating Characteristic curve; P-POSSUM, Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality; SORT, Surgical Outcome Risk Tool; SRS, Surgical Risk Scale.

    (PDF)

    S1 Data. Raw data.

    (ZIP)

    Attachment

    Submitted filename: Reviewer responses SRM 01062020.docx


