PLOS One. 2021 Jan 19;16(1):e0245157. doi: 10.1371/journal.pone.0245157

A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis

William P T M van Doorn 1,2, Patricia M Stassen 3,4, Hella F Borggreve 3, Maaike J Schalkwijk 3, Judith Stoffers 3, Otto Bekers 1,2, Steven J R Meex 1,2,*
Editor: Ivan Olier
PMCID: PMC7815112  PMID: 33465096

Abstract

Introduction

Patients with sepsis who present to an emergency department (ED) have highly variable underlying disease severity, and can be categorized from low to high risk. Development of a risk stratification tool for these patients is important for appropriate triage and early treatment. The aim of this study was to develop machine learning models predicting 31-day mortality in patients presenting to the ED with sepsis and to compare these to internal medicine physicians and clinical risk scores.

Methods

A single-center, retrospective cohort study was conducted amongst 1,344 emergency department patients fulfilling sepsis criteria. Laboratory and clinical data available within the first two hours of presentation were randomly partitioned into a development (n = 1,244) and a validation (n = 100) dataset. Machine learning models were trained and evaluated on the development dataset and compared with internal medicine physicians and clinical risk scores in the independent validation dataset. The primary outcome was 31-day mortality.

Results

A total of 1,344 patients were included, of whom 174 (13.0%) died. Machine learning models trained with laboratory data or with a combination of laboratory + clinical data achieved an area under the ROC curve of 0.82 (95% CI: 0.80–0.84) and 0.84 (95% CI: 0.81–0.87), respectively, for predicting 31-day mortality. In the validation set, models outperformed internal medicine physicians and clinical risk scores in sensitivity (92% vs. 72% vs. 78%; p<0.001, all comparisons) while retaining comparable specificity (78% vs. 74% vs. 72%; p>0.02). The model had higher diagnostic accuracy, with an area under the ROC curve of 0.85 (95% CI: 0.78–0.92), compared with abbMEDS (0.63, 0.54–0.73), mREMS (0.63, 0.54–0.72) and internal medicine physicians (0.74, 0.65–0.82).

Conclusion

Machine learning models outperformed internal medicine physicians and clinical risk scores in predicting 31-day mortality. These models are a promising tool to aid in risk stratification of patients presenting to the ED with sepsis.

Introduction

Among emergency department (ED) presentations, a substantial number of patients present with symptoms of sepsis [1]. Sepsis is defined as a systemic inflammatory response syndrome (SIRS) to an infection and is associated with a wide variety of risks including septic shock and death [2]. Mortality rates of sepsis are as high as 16%, increasing to as much as 40% in patients with septic shock [2, 3]. Novel clinical decision support (CDS) systems capable of identifying low- or high-risk patients could become important for early treatment and triage of ED patients, but also for preventing unnecessary referrals to the intensive care unit (ICU). EDs are among the most overcrowded units of a modern hospital, highlighting the importance of proper allocation and management of resources [1]. Development of a risk stratification tool for patients with sepsis may improve health outcomes in this group, but may also contribute to resolving the problem of overcrowded EDs.

Currently, a wide variety of clinical risk scores are used in routine clinical care to facilitate risk stratification of patients with sepsis [4]. These include the relatively simple (quick) sequential organ failure assessment ((q)SOFA) score [5, 6], but also more complex scores such as the abbreviated Mortality in Emergency Department Sepsis (abbMEDS) score and the modified Rapid Emergency Medicine Score (mREMS) [7, 8]. These traditional risk scores have shown varying performance for predicting 28-day mortality (area under the receiver operating characteristic curve (AUC) for abbMEDS: 0.62–0.85, mREMS: 0.62–0.84 and SOFA: 0.61–0.82) [3, 8–11]. In addition, clinical judgment of the attending physician in the ED plays an important role in risk stratification. The judgment of physicians was found to be a moderate to good predictor (AUC of 0.68–0.81) of mortality in the ED [12, 13].

Interestingly, a new group of CDS systems is being developed based on machine learning (ML) technology [14]. Machine learning can extract information from complex, non-linear data and provide insights to support clinical decision making. Accordingly, the first studies have emerged that report machine learning-based mortality prediction models using data from patients with sepsis presenting to the ED [15–26]. Unfortunately, these studies did not provide a comparison with physicians in terms of prognostic performance. Recently, a new class of machine learning algorithms termed gradient boosting trees has emerged, showing superior performance compared with other ML models in some problems within the medical domain [27, 28]. Exploring whether these models can outperform clinical risk scores and the clinical judgment of physicians in their ability to identify low- or high-risk patients is a necessary step in assessing the potential value of machine learning models in clinical practice.

The aim of this study was to develop machine learning-based prediction models for all-cause mortality at 31 days based on available laboratory and clinical data from patients presenting to the ED with sepsis. Subsequently, we compared the performance of these machine learning models with the judgment of internal medicine physicians and with clinical risk scores: abbMEDS, mREMS and SOFA.

Methods

Study design and setting

We performed a retrospective cohort study among all patients who presented to the ED at the Maastricht University Medical Centre+ between January 1, 2015 and December 31, 2016. All patients aged ≥18 years who were referred to the internal medicine physician with sepsis, defined as a proven or suspected infection plus two or more SIRS and/or qSOFA criteria (S1 File), were included in this study [2, 5, 29]. Patients with missing clinical data or with fewer than four laboratory results were excluded. Patients who refused to give consent were also excluded. This study was approved by the medical ethical committee (METC 2019–1044) and the hospital board of the Maastricht University Medical Centre+. Furthermore, the study follows the STROBE guidelines and was conducted according to the principles of the Declaration of Helsinki [30]. The ethics committee waived the requirement for informed consent.

Data collection and processing

We collected clinical and laboratory data available within two hours after initial ED presentation for all patients included in the study. Clinical data were manually extracted from the electronic health record of the patient and included characteristics such as vital signs, hemodynamic parameters, and medical history (S1 Table). Biomarkers requested for standard clinical care were acquired through the laboratory information system. Biomarkers ordered in fewer than 1 in 1,000 patients were excluded from the analysis. A list of included biomarkers is provided in S1 Table. Missing values did not require any processing, as our machine learning model can handle missing data natively. Instead of imputation, we created an additional variable for each biomarker, a discrete 'absence'/'presence' indicator, to enable the model to distinguish between the absence and presence of a laboratory test for a patient. These features were included in both datasets. Finally, we derived two datasets from the processed data:

  1. Laboratory dataset: this dataset consisted of age, sex, time of laboratory request and all requested laboratory biomarkers within two hours after the initial laboratory request

  2. Laboratory + clinical dataset: this dataset contained all variables from the laboratory dataset, and additionally clinical, vital and physical (e.g. height and weight) characteristics of the patient

A full overview of all variables present in each dataset is provided in S1 Table. Datasets were anonymized and randomly divided into two subsets: 1) a development subset (n = 1,244), used for model training and evaluation, and 2) an independent validation subset (n = 100), used for final validation and for comparing models with the judgment of acute internal medicine physicians and clinical risk scores. A schematic overview of the study design and model development is depicted in Fig 1. Data processing and manipulation were performed in Python (version 3.7.1) using the numpy (version 1.17) and pandas (version 0.24) packages.
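As an illustration of this preprocessing, the sketch below shows how the absence/presence indicators and the random split could be constructed in pandas. This is a minimal sketch under assumed column names (e.g. 'crp', 'urea') and an assumed random seed, not the study's actual code.

import numpy as np
import pandas as pd

def add_presence_indicators(df, biomarker_cols):
    # For each biomarker, keep the raw value (missing values stay NaN, which
    # XGBoost handles natively) and add a discrete absence/presence indicator.
    out = df.copy()
    for col in biomarker_cols:
        out[col + "_present"] = out[col].notna().astype(int)
    return out

# Hypothetical example rows; the real variable list is given in S1 Table.
lab = pd.DataFrame({
    "age": [71, 64, 83],
    "sex": [0, 1, 1],
    "crp": [180.0, np.nan, 45.0],   # mg/L, not requested for the second patient
    "urea": [12.1, 7.4, np.nan],    # mmol/L, not requested for the third patient
})
lab = add_presence_indicators(lab, ["crp", "urea"])

# Random partition into development and validation subsets (the study used
# n = 1,244 and n = 100; here one of three rows is held out for illustration).
rng = np.random.default_rng(seed=42)
validation_idx = rng.choice(lab.index, size=1, replace=False)
validation = lab.loc[validation_idx]
development = lab.drop(index=validation_idx)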

Fig 1. Overview of study design and model development.


(A) We included 1,344 patients with a diagnosis of sepsis who presented to the ED. Patients were randomly partitioned into a development subset (n = 1,244), used to train and evaluate the performance of machine learning models, and a validation subset (n = 100), used to compare models with internal medicine physicians and clinical risk scores. Cross-validation was used to obtain a robust estimate of model performance in the development subset. (B) The machine learning model with the highest cross-validation performance was compared with internal medicine physicians and clinical risk scores for predicting 31-day mortality.

Outcome measure

Septic shock during presentation was defined as systolic blood pressure (SBP) ≤90 mmHg and mean arterial pressure (MAP) ≤65 mmHg despite adequate fluid resuscitation. The outcome measure for this study was death within 31 days (1 month) after initial ED presentation. All-cause mortality information was acquired through electronic health records.
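For illustration only, the 31-day outcome label could be derived from presentation and death dates as in the sketch below; the column names are assumptions, since the study extracted mortality from electronic health records.

import pandas as pd

# Hypothetical columns: date of ED presentation and date of death (NaT if alive).
records = pd.DataFrame({
    "ed_presentation": pd.to_datetime(["2015-03-01", "2015-06-10"]),
    "date_of_death": pd.to_datetime(["2015-03-20", None]),
})
days_to_death = (records["date_of_death"] - records["ed_presentation"]).dt.days
# Death within 31 days of initial ED presentation -> positive outcome (1).
records["mortality_31d"] = ((days_to_death >= 0) & (days_to_death <= 31)).astype(int)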

Model training and evaluation

Our proposed predictive model uses individual patient data available within two hours after initial ED presentation and generates a probability of mortality within 31 days. This prediction task can be addressed by a variety of statistical and machine learning models. In the current study we evaluated logistic regression, random forest, multi-layer perceptron neural networks and XGBoost (S2 File and S2 Table) on the laboratory dataset. We selected XGBoost as our machine learning model of choice because it showed the highest baseline performance (S2 Table). XGBoost is a recent implementation of gradient tree boosting, which combines the predictions of many "weak" decision trees into a strong predictor [27]. This implementation is characterized by native support for missing data and by regularization mechanisms that help prevent overfitting [27]. XGBoost models and their development can be altered by adjusting the parameters of the technique, referred to as "hyperparameters". Because of sample size limitations and the scope of our study, we decided not to optimize the hyperparameters and predefined them as described in S3 Table.
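A hedged sketch of such a predefined XGBoost classifier is given below. The hyperparameter values shown are placeholders for illustration; the actual predefined values are listed in S3 Table.

from xgboost import XGBClassifier

# Gradient-boosted trees with fixed, untuned hyperparameters. Missing biomarker
# values (NaN) are handled natively by XGBoost, so no imputation is required.
model = XGBClassifier(
    n_estimators=5000,           # upper bound on boosting rounds; early stopping ends training earlier
    max_depth=4,                 # placeholder value
    learning_rate=0.05,          # placeholder value
    reg_lambda=1.0,              # L2 regularization term, one of XGBoost's overfitting safeguards
    objective="binary:logistic", # outputs a probability of 31-day mortality
)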

We employed stratified K-fold cross-validation to assess the generalizability of our prediction models. Briefly, we randomly partitioned the development subset (n = 1,244) into five equally sized folds. During each round of cross-validation, four of these folds were used to train our models ("train set") and the fifth was used to evaluate performance ("test set"), such that every fold served as the test set exactly once. We monitored training and test set errors to ensure that training improved performance on the test set. Accordingly, training was terminated after 5,000 rounds or when performance on the test set did not improve for 10 consecutive rounds. We evaluated models trained with (i) the laboratory dataset or (ii) the laboratory + clinical dataset, resulting in a total of two independent cross-validations.
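A minimal sketch of this cross-validation loop, assuming a NumPy feature matrix X and a binary outcome vector y together with the placeholder hyperparameters above, could look as follows (early-stopping behaviour as in XGBoost 0.90, the version used in this study):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def cross_validate_xgb(X, y, n_splits=5, seed=0):
    # Stratified K-fold cross-validation with early stopping on the held-out fold.
    aucs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        model = XGBClassifier(n_estimators=5000, learning_rate=0.05,  # placeholder hyperparameters
                              objective="binary:logistic")
        model.fit(
            X[train_idx], y[train_idx],
            eval_set=[(X[test_idx], y[test_idx])],
            early_stopping_rounds=10,   # stop when the test-fold score has not improved for 10 rounds
            verbose=False,
        )
        pred = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], pred))
    return np.array(aucs)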

Model explanation

To explain the output of our XGBoost models, we used the SHapley Additive exPlanations (SHAP) algorithm, which helps us understand how a single feature affects the output of the model [31–33]. SHAP uses a game-theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions [34, 35]. A Shapley value states, given the current set of variables, how much a variable, in the context of its interactions with the other variables, contributes to the difference between the actual prediction and the mean prediction. That is, the mean prediction plus the sum of the Shapley values for all variables equals the actual prediction. It is important to understand that this is fundamentally different from the direct variable effects known from, for example, (generalized) linear models. The SHAP value for a variable should not be seen as its direct and isolated effect, but as its aggregated effect when interacting with other variables in the model. In our specific case, positive Shapley values contribute towards a positive prediction (death), whereas low or negative Shapley values contribute towards a negative prediction (survival). ML training and evaluation were done in Python using the packages Keras (version 2.2.2), XGBoost (version 0.90), SHAP (version 0.34.0) and scikit-learn (version 0.22.1). The analysis code for this study is available on reasonable request.
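As a sketch of this analysis, assuming a fitted XGBoost classifier named model and a pandas feature matrix X, per-feature SHAP values and a summary plot of the kind shown in Fig 3 can be obtained as follows:

import shap

# TreeExplainer computes exact Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one value per patient and per feature

# For any single patient, the expected model output plus the sum of that patient's
# SHAP values equals the actual model output; positive values push the prediction
# towards "death", negative values towards "survival".

# Beeswarm summary plot ranking features by the sum of absolute SHAP values.
shap.summary_plot(shap_values, X)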

Comparison of machine learning with internal medicine physicians and clinical risk scores

The performance of the machine learning model was compared with the clinical judgment of acute internal medicine physicians (n = 4) and with clinical risk scores in a validation subset of patients with sepsis (n = 100) to which the ML model had not previously been exposed. We selected the best-performing machine learning model from cross-validation and trained it on the full development subset with the hyperparameters described previously. A machine learning prediction higher than 0.50 was considered a positive prediction. Next, we calculated the mREMS, abbMEDS and SOFA clinical risk scores as described previously (S1 File) [8, 36]. Acute internal medicine physicians (n = 4; 2 experienced consultants in acute internal medicine and 2 experienced residents in acute internal medicine) were asked to predict 31-day mortality in the validation subset, based on retrospectively collected clinical and laboratory data. These data were presented in the form of a simulated electronic health record.
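A minimal sketch of how sensitivity and specificity follow from this 0.50 cut-off is given below, using synthetic stand-in arrays rather than the study's validation data:

import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data for illustration only (100 validation patients).
rng = np.random.default_rng(seed=1)
y_true = rng.integers(0, 2, size=100)   # observed 31-day mortality (0/1)
y_prob = rng.random(size=100)           # model-predicted probabilities

y_pred = (y_prob > 0.50).astype(int)    # probabilities above 0.50 count as a positive ("death") prediction

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")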

Statistical analysis

Descriptive analysis of baseline characteristics was performed using IBM SPSS Statistics for Windows (version 24.0). Continuous variables were reported as means with standard deviations (SD) or medians with interquartile ranges (IQR), depending on the distribution of the data. Categorical variables were reported as proportions. Cross-validated models were assessed by receiver operating characteristic (ROC) curves and compared by their AUC using the Wilcoxon matched-pairs signed-rank test. Besides diagnostic performance, we assessed calibration in the cross-validations with reliability curves [37] and Brier scores [38]. In the final validation subset, we compared the predictive performance of our best-performing ML model to the judgment of acute internal medicine physicians and clinical risk scores with respect to sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy and AUC. Differences in AUC were tested using the method of DeLong et al. [39]. Confidence intervals for proportions (e.g. sensitivity) were calculated using binomial testing and compared using McNemar's test. To analyze individual differences between internal medicine physicians, we performed two additional sensitivity analyses. First, the Cohen κ statistic was used to measure inter-observer agreement between the internal medicine physicians. The level of agreement was interpreted as nil if κ was 0 to 0.20; minimal, 0.21 to 0.39; weak, 0.40 to 0.59; moderate, 0.60 to 0.79; strong, 0.80 to 0.90; and almost perfect, 0.90 to 1 [40]. Second, we compared the machine learning model against alternating groups of internal medicine physicians, removing one physician in each comparison.
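Several of these evaluation steps map onto standard library calls, sketched below with synthetic stand-in data. The DeLong AUC comparison and the binomial confidence intervals are omitted because they are not part of scikit-learn, and the McNemar construction shown (correct vs. incorrect predictions) is only one illustrative way to set up the paired table.

import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss, cohen_kappa_score
from sklearn.calibration import calibration_curve
from statsmodels.stats.contingency_tables import mcnemar

# Synthetic stand-in predictions for illustration only.
rng = np.random.default_rng(seed=2)
y_true = rng.integers(0, 2, size=100)          # observed 31-day mortality (0/1)
y_prob = rng.random(size=100)                  # model probabilities
y_pred = (y_prob > 0.50).astype(int)           # model predictions at the 0.50 cut-off
physician_a = rng.integers(0, 2, size=100)     # one physician's predictions
physician_b = rng.integers(0, 2, size=100)     # another physician's predictions

# Discrimination and calibration of the probabilistic predictions.
auc = roc_auc_score(y_true, y_prob)
brier = brier_score_loss(y_true, y_prob)
frac_positives, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)  # reliability curve points

# McNemar's test on the paired correct/incorrect table of model vs. one physician
# (to compare sensitivities, the same table would be restricted to patients who died).
model_correct = (y_pred == y_true).astype(int)
physician_correct = (physician_a == y_true).astype(int)
table = np.zeros((2, 2), dtype=int)
for m, p in zip(model_correct, physician_correct):
    table[m, p] += 1
print(mcnemar(table, exact=True).pvalue)

# Inter-observer agreement between two physicians (Cohen's kappa).
print(cohen_kappa_score(physician_a, physician_b))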

Results

Study population and characteristics

During the study period, 5,967 patients who presented to the ED were referred to an internal medicine physician in our hospital. Of these, we included 1,420 patients with a suspected or proven infection who fulfilled the SIRS and/or qSOFA criteria. A total of 76 patients were excluded because of missing clinical data (n = 23) or an insufficient number of laboratory results (n = 53), yielding a final cohort of 1,344 patients (S1 Fig). Among all patients, 102 (7.6%) suffered from septic shock during ED presentation and 174 (13.0%) died within 31 days after initial ED presentation. Baseline characteristics of the study patients in the development and validation datasets are shown in Table 1.

Table 1. Baseline characteristics of patients in the development and validation datasets.

Characteristics Development N = 1,244 Validation N = 100
Demographics
    Age 71.3 (58.8–82.3) 70.8 (58.4–82.8)
    Sex, female 567 (45.6) 58 (58.0)
Comorbidity
    Cancer 446 (35.9) 28 (28.0)
    Cardiopulmonary 381 (30.6) 30 (30.0)
    Diabetes 264 (21.2) 19 (19.0)
    Renal disease 128 (10.3) 9 (9.0)
    Liver disease 42 (3.4) 7 (7.0)
    Neuropsychiatric 65 (5.2) 2 (2.0)
Focus of infection at ED
    Respiratory tract 421 (33.8) 34 (34.0)
    Urinary tract 218 (17.5) 18 (18.0)
    Gastrointestinal tract 415 (33.4) 37 (37.0)
    Others 75 (6.0) 6 (6.0)
    Skin 115 (9.2) 5 (5.0)
Severity scores
    abbMEDSa 5.5 (3–8) 6 (3–8)
    mREMSb 7 (6–9) 7 (6–9)
    SOFAc 7 (5–9) 6 (5–8)
Outcomes
    Septic shock 94 (7.6) 8 (8.0)
    31-day mortality 161 (12.9) 13 (13.0)

a AbbMEDS, Abbreviated Mortality in ED Sepsis, was calculated as described by Vorwerk et al [8].

b mREMS, modified Rapid Emergency Medicine Score, was calculated as described by Chang et al [36].

c SOFA, Sepsis-related Organ Failure Assessment, was calculated as described by Vincent et al [6].

Machine learning development and evaluation

To assess the generalizability of our developed XGBoost models, we employed five-fold cross-validation on the development dataset (n = 1,244). XGBoost models trained with laboratory data achieved an AUC of 0.82 (95% CI: 0.80–0.84) for predicting 31-day mortality (Fig 2). Performance improved, although not statistically significantly, when clinical data were added to the laboratory data, to an AUC of 0.84 (95% CI: 0.81–0.87) for predicting mortality (compared with laboratory data only; p = 0.25). Individual cross-validation results of each model are depicted in S2 Fig. Calibration curves show well-calibrated models with Brier scores between 0.08 and 0.10 (S3 Fig).

Fig 2. XGBoost model performance for predicting all-cause mortality at 31 days in the development dataset.


Models trained with laboratory data achieved a mean AUC of 0.82 (95% CI: 0.80–0.84) for predicting 31-day mortality. Predictive performance increased to a mean AUC of 0.84 (95% CI: 0.81–0.87) when models were trained with laboratory + clinical data, but this difference was not statistically significant (p = 0.25).

Model explanation

To identify which laboratory and clinical features contributed most to the performance of our models, we calculated SHAP values for the (i) laboratory and (ii) laboratory + clinical models (Fig 3). Among the highest ranked features, we observe features that are also often used in risk scores, including urea, platelet count, Glasgow Coma Score (GCS) and blood pressure. Interestingly, we also observe features such as glucose, lipase, and GCS which are less commonly associated with mortality in sepsis patients. An extended analysis of the correlation between important features in our models and risk scores is provided in S4 Table. Moreover, these SHAP plots allow us to examine the individual impact of laboratory and clinical features on the predictions of our models. For example, higher urea and C-reactive protein (CRP) levels (represented by red points) have high SHAP values and thus a positive effect on the model outcome (death).

Fig 3. Analysis of parameter importance in the XGBoost models.


Models with laboratory data (left) and with laboratory + clinical data (right) were analyzed using SHAP values. Individual parameters are ranked by importance in descending order based on the sum of the SHAP values over all the samples. Negative or low SHAP values contribute towards a negative model outcome (survival), whereas high SHAP values contribute towards a positive model outcome (death).

Machine learning versus internal medicine physicians and clinical risk scores

To explore the potential value of machine learning models in clinical practice, we compared the model trained with laboratory + clinical data with acute internal medicine physicians and with the clinical risk scores abbMEDS, mREMS and SOFA for predicting 31-day mortality. In an independent validation subset (n = 100), to which the model had never been exposed before, it achieved a sensitivity of 0.92 (95% CI: 0.87–0.95, Fig 4A) and a specificity of 0.78 (95% CI: 0.70–0.86, Fig 4B). In terms of sensitivity, the machine learning model significantly outperformed internal medicine physicians (0.72, 95% CI: 0.62–0.81; p<0.001), abbMEDS (0.54, 95% CI: 0.44–0.64; p<0.0001), mREMS (0.62, 95% CI: 0.52–0.72; p<0.001) and SOFA (0.77, 95% CI: 0.69–0.85; p = 0.003). At the same time, the model retained a specificity comparable to that of internal medicine physicians (0.74, 95% CI: 0.64–0.82; p = 0.509), abbMEDS (0.72, 95% CI: 0.64–0.81; p = 0.327) and SOFA (0.74, 95% CI: 0.65–0.82; p = 0.447), while still outperforming mREMS (0.64, 95% CI: 0.55–0.74; p = 0.02). Additionally, the model had higher overall diagnostic accuracy, with an AUC of 0.852 (95% CI: 0.783–0.922), compared with abbMEDS (0.631, 0.537–0.726; p = 0.021), mREMS (0.630, 0.535–0.724; p = 0.016), SOFA (0.752, 0.667–0.836; p = 0.042) and internal medicine physicians (0.735, 0.648–0.821; p = 0.032–0.189) (S4 Fig and S5 Table). Similar observations were made for additional evaluation metrics such as positive predictive value (PPV), negative predictive value (NPV) and accuracy (S5 Table). Individually, consultants were more sensitive than residents (S5 Fig), with weak to moderate agreement between the internists (Cohen's kappa 0.46 to 0.67) (S6 Table). A sensitivity analysis with four additional comparisons, in which one physician was excluded at a time, confirmed that the results are robust and that the outperformance of the machine learning model was not due to an outlier in the physician group (S7 Table).

Fig 4. Comparison of XGBoost model with internal medicine physicians and clinical risk scores.


The XGBoost model achieved a sensitivity (A) of 0.92 (95% CI: 0.87–0.95) and a specificity (B) of 0.78 (95% CI: 0.70–0.86) for predicting mortality. This was significantly better than the mean prediction of internal medicine physicians for sensitivity (0.72, 0.62–0.81; p<0.001) as well as abbMEDS (0.54, 0.44–0.64; p<0.0001), mREMS (0.62, 0.52–0.72; p<0.001) and SOFA (0.77, 95% CI: 0.69–0.85; p = 0.003). In terms of specificity, internal medicine physicians (0.74, 0.64–0.82; p = 0.509), abbMEDS (0.72, 0.64–0.81; p = 0.327) and SOFA (0.74, 95% CI: 0.65–0.82; p = 0.447) achieved performance similar to the XGBoost model, in contrast to mREMS (0.64, 0.55–0.74; p = 0.02), which was significantly worse than the machine learning predictions. * = p<0.05; ** = p<0.001; *** = p<0.0001; NS = not significant.

Discussion

In the present study we demonstrate the application of machine learning models to predict 31-day mortality in patients presenting to the ED with sepsis. Our study reports several important findings.

First, we show that machine learning-based models can accurately predict 31-day mortality in patients with sepsis. The highest diagnostic accuracy was obtained with the model trained with both laboratory and clinical data. Patient characteristics used in traditional risk scores, such as blood pressure and heart rate, were also found to be among the most important variables for model predictions. Second, machine learning models outperformed the judgment of internal medicine physicians and the commonly used clinical risk scores abbMEDS, mREMS and SOFA. Specifically, machine learning was more sensitive than the risk scores and internal medicine physicians, while retaining identical or slightly higher specificity. These preliminary data support the development and implementation of machine learning-based models as clinical decision support tools, for example for risk stratification of sepsis patients presenting to the ED.

We are aware of several studies describing machine learning-based prediction of mortality in sepsis populations presenting to the ED [15–17]. Taylor et al. described a random forest model that outperformed clinical risk scores in an ED population [17]. Despite their larger population, our XGBoost model appears to achieve performance similar to their random forest model, which corroborates and extends the power of this machine learning technique. Two recent studies by Barnaby et al. and Chiew et al. focused on using heart rate variability (HRV) for risk prediction in sepsis patients and reported predictive performance similar to our findings [15, 16]. Interestingly, their populations were smaller, which suggests that adding HRV to our models could be worthwhile. Nonetheless, Chiew et al. demonstrated that performance decreased significantly for models without laboratory data, emphasizing the importance of laboratory data in these machine learning models. To the best of our knowledge, this is the first study to report a direct comparison of machine learning models with internal medicine physicians. Although we do not present prospective results, we demonstrate that machine learning outperforms the clinical judgment of internal medicine physicians and clinical risk scores, implying that the current XGBoost models could potentially aid in risk stratification of ED patients. As an example, implementation of these models could revolve around identifying high-risk patients, e.g. those with ≥50% predicted mortality within 31 days, who would then be re-evaluated once more before being discharged from the ED. This kind of implementation was demonstrated in a recent randomized clinical trial by Shimabukuro et al. [41], showing that average length of stay and in-hospital mortality decreased when an ML-based sepsis detection model was used in the ICU. Although that trial was carried out in a small ICU population rather than in the ED, it clearly shows the potential of ML-based risk-stratification models.

The current study has several strengths and limitations. Strengths include (i) the comparison of laboratory versus laboratory + clinical models, (ii) the analysis of features contributing to model predictions and (iii) the comparison with internal medicine specialists. We are also aware of several limitations. First, the present study was a single-center study with a relatively small sample size, at least from a machine learning perspective. Nearly all machine learning models scale exceptionally well with data, and therefore substantial further improvement of diagnostic accuracy is likely with a larger sample size. We also limited ourselves to sepsis patients presenting to the ED, and it is therefore unknown to what degree these models translate to a broader, general ED population. Second, the results presented in this study are based on retrospective data from a single center, limiting the external validity of the model. Unfortunately, this limitation currently applies to most studies applying ML in medicine. Third, the present study focused on model development and the subsequent performance comparison with clinical judgment and clinical risk scores. It should be noted that the comparison with internal medicine specialists was performed using retrospectively generated electronic health records rather than a prospective evaluation, which might have underestimated their diagnostic performance, as they were not able to directly "see" the patient. Prospective evaluation, with respect to mortality but also in relation to clinical endpoints that confirm true clinical benefit, would facilitate implementation of ML-based risk-stratification tools in clinical practice.

Conclusion

In conclusion, the present proof-of-concept study demonstrates the potential of machine learning models to predict mortality in patients with sepsis presenting to the ED. Machine learning outperformed the clinical judgment of internal medicine physicians and established clinical risk scores. These data support the implementation of machine learning-based risk stratification tools for sepsis patients presenting to the ED.

Supporting information

S1 File. Extended description of clinical criteria and risk scores.

(DOCX)

S2 File. Background information on machine learning models reviewed in the current study.

(DOCX)

S1 Table. Overview of variables present in the datasets.

The laboratory dataset consisted exclusively of laboratory variables with age, sex and time of request. The laboratory and clinical dataset contained all variables from the laboratory dataset and additionally clinical and vital characteristics.

(DOCX)

S2 Table. Comparison of baseline statistical and machine learning models for predicting 31-day mortality risk.

We performed a baseline comparison of statistical and machine learning models (S1 File) for the 31-day mortality prediction task using the laboratory dataset. We used five-fold cross validation to assess model performance. Performance was assessed by area under the receiver operating characteristic curve (AUC) and accuracy. Confidence intervals were calculated using bootstrapping methods (n = 1,000).

(DOCX)

S3 Table. Hyperparameters of XGBoost models.

Hyperparameters were based on theoretical reasoning rather than hyperparameter tuning. This was done to prevent overfitting on hyperparameters given the small sample size. The "Base_score", "Missing", "Reg_alpha", "Reg_lambda" and "Subsample" parameters were the standard values provided by the XGBoost interface. "Max_depth", "max_delta_step" and "estimators" were values we use internally for this kind of machine learning model. During the study, hyperparameters were never adjusted to gain performance on our validation dataset.

(DOCX)

S4 Table. Extended analysis of correlation between important model features and clinical risk scores.

To study the correlation between the most important features contributing to model predictions and the clinical criteria (qSOFA and SIRS) and risk scores (abbMEDS and mREMS), we examined whether these features appear in both. The top-20 most important features (Fig 3 in the main article) are compared to all criteria in the clinical scores (S1 File). We observe that most of the features present in the clinical criteria and scores are also among the most important features in the laboratory + clinical machine learning model.

(DOCX)

S5 Table. Extended comparison of machine learning models with internal medicine physicians and clinical risk scores.

In addition to sensitivity and specificity, we evaluated the performance of each group by positive predictive value (PPV), negative predictive value (NPV), accuracy and area-under-the receiver operating characteristics curve (AUC). Our XGBoost model shows superior performance in each of these metrics, which is in line with the findings presented in the manuscript.

(DOCX)

S6 Table. Inter-rater agreement of internal medicine physicians.

Cohen’s kappa was used to measure the inter-rater agreement between the internal medicine physicians. The level of agreement was interpreted as nil if κ was 0 to 0.20; minimal, 0.21 to 0.39; weak, 0.40 to 0.59; moderate, 0.60 to 0.79; strong, 0.80 to 0.90; and almost perfect, 0.90 to 1.

(DOCX)

S7 Table. Machine learning comparison to alternating physician groups.

In each comparison between the machine learning model and the physician group, a single physician was removed from the physician group. In every comparison the machine learning model outperforms the physicians. This analysis shows that the higher performance of the machine learning model was not due to systematic underperformance of a single physician.

(DOCX)

S1 Fig. Flow diagram of study inclusion.

During the study period, 5,967 patients who presented to our emergency department were referred to an internal medicine physician. Of these patients, 1,420 fulfilled two or more SIRS and/or qSOFA criteria. After exclusion of 76 patients, the remaining 1,344 patients were divided into development and validation datasets.

(DOCX)

S2 Fig. Five-fold cross validation of diagnostic performance of XGBoost models.

During each cycle of cross-validation, we assessed predictive performance by area under the receiver operating characteristic curves (AUC). Performance was determined for models trained with laboratory data (A) and models trained with laboratory and clinical data (B) to predict 31-day mortality.

(DOCX)

S3 Fig. Five-fold cross validation of calibration of XGBoost models.

During each cycle of cross-validation, we assessed calibration by calibration curves and their respective Brier scores. Calibration was determined for models trained with laboratory data (A) and models trained with laboratory and clinical data (B).

(DOCX)

S4 Fig. Receiver operating characteristic analysis of machine learning model, risk scores and internal medicine physicians.

Receiver operating characteristics analysis of the lab + clinical machine learning model (AUC: 0.852 [0.783–0.922]), abbMEDS (0.631 [0.537–0.726]), mREMS (0.630 [0.535–0.724]) and internal medicine physicians (mean 0.735 [0.648–0.821]). Internal medicine physicians were depicted as bullets in the ROC analysis.

(DOCX)

S5 Fig. Individual performance of internal medicine physicians.

Predictive performance of all internal medicine physicians (n = 4; 2 experienced consultants in acute internal medicine and 2 experienced residents in acute internal medicine) was assessed by sensitivity (left) and specificity (right). Consultants (experienced specialists) are depicted in grey and residents in orange.

(DOCX)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

This study was funded by a Noyons stipendium from the Dutch Federation of Clinical Chemistry (NVKC). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.LaCalle E, Rabin E. Frequent users of emergency departments: the myths, the data, and the policy implications. Ann Emerg Med. 2010;56(1):42–8. Epub 2010/03/30. 10.1016/j.annemergmed.2010.01.032 . [DOI] [PubMed] [Google Scholar]
  • 2.Singer M, Deutschman CS, Seymour CW, Shankar-Hari M, Annane D, Bauer M, et al. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA. 2016;315(8):801–10. Epub 2016/02/24. 10.1001/jama.2016.0287 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Roest AA, Tegtmeier J, Heyligen JJ, Duijst J, Peeters A, Borggreve HF, et al. Risk stratification by abbMEDS and CURB-65 in relation to treatment and clinical disposition of the septic patient at the emergency department: a cohort study. BMC Emerg Med. 2015;15:29 Epub 2015/10/16. 10.1186/s12873-015-0056-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.McLymont N, Glover GW. Scoring systems for the characterization of sepsis and associated outcomes. Ann Transl Med. 2016;4(24):527 Epub 2017/02/06. 10.21037/atm.2016.12.53 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Seymour CW, Liu VX, Iwashyna TJ, Brunkhorst FM, Rea TD, Scherag A, et al. Assessment of Clinical Criteria for Sepsis: For the Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA. 2016;315(8):762–74. Epub 2016/02/24. 10.1001/jama.2016.0288 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Vincent JL, Moreno R, Takala J, Willatts S, De Mendonça A, Bruining H, et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive Care Medicine. 1996;22(7):707–10. 10.1007/BF01709751 [DOI] [PubMed] [Google Scholar]
  • 7.Olsson T, Terent A, Lind L. Rapid Emergency Medicine score: a new prognostic tool for in-hospital mortality in nonsurgical emergency department patients. J Intern Med. 2004;255(5):579–87. Epub 2004/04/14. 10.1111/j.1365-2796.2004.01321.x . [DOI] [PubMed] [Google Scholar]
  • 8.Vorwerk C, Loryman B, Coats TJ, Stephenson JA, Gray LD, Reddy G, et al. Prediction of mortality in adult emergency department patients with sepsis. Emerg Med J. 2009;26(4):254–8. Epub 2009/03/25. 10.1136/emj.2007.053298 . [DOI] [PubMed] [Google Scholar]
  • 9.Crowe CA, Kulstad EB, Mistry CD, Kulstad CE. Comparison of severity of illness scoring systems in the prediction of hospital mortality in severe sepsis and septic shock. J Emerg Trauma Shock. 2010;3(4):342–7. Epub 2010/11/11. 10.4103/0974-2700.70761 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Olsson T, Terent A, Lind L. Rapid Emergency Medicine Score can predict long-term mortality in nonsurgical emergency department patients. Acad Emerg Med. 2004;11(10):1008–13. Epub 2004/10/07. 10.1197/j.aem.2004.05.027 . [DOI] [PubMed] [Google Scholar]
  • 11.Minne L, Abu-Hanna A, de Jonge E. Evaluation of SOFA-based models for predicting mortality in the ICU: A systematic review. Crit Care. 2008;12(6):R161 Epub 2008/12/19. 10.1186/cc7160 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Rohacek M, Nickel CH, Dietrich M, Bingisser R. Clinical intuition ratings are associated with morbidity and hospitalisation. Int J Clin Pract. 2015;69(6):710–7. Epub 2015/02/18. 10.1111/ijcp.12606 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Zelis N, Mauritz AN, Kuijpers LIJ, Buijs J, de Leeuw PW, Stassen PM. Short-term mortality in older medical emergency patients can be predicted using clinical intuition: A prospective study. PLoS One. 2019;14(1):e0208741 Epub 2019/01/03. 10.1371/journal.pone.0208741 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56. Epub 2019/01/09. 10.1038/s41591-018-0300-7 . [DOI] [PubMed] [Google Scholar]
  • 15.Barnaby DP, Fernando SM, Herry CL, Scales NB, Gallagher EJ, Seely AJE. Heart Rate Variability, Clinical and Laboratory Measures to Predict Future Deterioration in Patients Presenting With Sepsis. Shock. 2019;51(4):416–22. Epub 2018/05/31. 10.1097/SHK.0000000000001192 . [DOI] [PubMed] [Google Scholar]
  • 16.Chiew CJ, Liu N, Tagami T, Wong TH, Koh ZX, Ong MEH. Heart rate variability based machine learning models for risk prediction of suspected sepsis patients in the emergency department. Medicine (Baltimore). 2019;98(6):e14197 Epub 2019/02/09. 10.1097/MD.0000000000014197 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of In-hospital Mortality in Emergency Department Patients With Sepsis: A Local Big Data-Driven, Machine Learning Approach. Acad Emerg Med. 2016;23(3):269–78. Epub 2015/12/19. 10.1111/acem.12876 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Perng JW, Kao IH, Kung CT, Hung SC, Lai YH, Su CM. Mortality Prediction of Septic Patients in the Emergency Department Based on Machine Learning. J Clin Med. 2019;8(11). Epub 2019/11/11. 10.3390/jcm8111906 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Fagerstrom J, Bang M, Wilhelms D, Chew MS. LiSep LSTM: A Machine Learning Algorithm for Early Detection of Septic Shock. Sci Rep. 2019;9(1):15132 Epub 2019/10/24. 10.1038/s41598-019-51219-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Mao Q, Jay M, Hoffman JL, Calvert J, Barton C, Shimabukuro D, et al. Multicentre validation of a sepsis prediction algorithm using only vital sign data in the emergency department, general ward and ICU. BMJ Open. 2018;8(1):e017833 Epub 2018/01/29. 10.1136/bmjopen-2017-017833 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Klug M, Barash Y, Bechler S, Resheff YS, Tron T, Ironi A, et al. A Gradient Boosting Machine Learning Model for Predicting Early Mortality in the Emergency Department Triage: Devising a Nine-Point Triage Score. J Gen Intern Med. 2020;35(1):220–7. Epub 2019/11/05. 10.1007/s11606-019-05512-7 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Sahni N, Simon G, Arora R. Development and Validation of Machine Learning Models for Prediction of 1-Year Mortality Utilizing Electronic Medical Record Data Available at the End of Hospitalization in Multicondition Patients: a Proof-of-Concept Study. J Gen Intern Med. 2018;33(6):921–8. Epub 2018/02/01. 10.1007/s11606-018-4316-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Horng S, Sontag DA, Halpern Y, Jernite Y, Shapiro NI, Nathanson LA. Creating an automated trigger for sepsis clinical decision support at emergency department triage using machine learning. PLoS One. 2017;12(4):e0174708 Epub 2017/04/07. 10.1371/journal.pone.0174708 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Ford DW, Goodwin AJ, Simpson AN, Johnson E, Nadig N, Simpson KN. A Severe Sepsis Mortality Prediction Model and Score for Use With Administrative Data. Crit Care Med. 2016;44(2):319–27. Epub 2015/10/27. 10.1097/CCM.0000000000001392 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Shukeri W, Ralib AM, Abdulah NZ, Mat-Nor MB. Sepsis mortality score for the prediction of mortality in septic patients. J Crit Care. 2018;43:163–8. Epub 2017/09/14. 10.1016/j.jcrc.2017.09.009 . [DOI] [PubMed] [Google Scholar]
  • 26.Bogle B, Balduino, Wolk D, Farag H, Kethireddy, Chatterjee, et al. Predicting Mortality of Sepsis Patients in a Multi-Site Healthcare System using Supervised Machine Learning 2019. [Google Scholar]
  • 27.Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. arXiv e-prints [Internet]. 2016 March 01, 2016. Available from: https://ui.adsabs.harvard.edu/abs/2016arXiv160302754C.
  • 28.Nanayakkara S, Fogarty S, Tremeer M, Ross K, Richards B, Bergmeir C, et al. Characterising risk of in-hospital mortality following cardiac arrest using machine learning: A retrospective international registry study. PLoS Med. 2018;15(11):e1002709 Epub 2018/12/01. 10.1371/journal.pmed.1002709 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Levy MM, Fink MP, Marshall JC, Abraham E, Angus D, Cook D, et al. 2001 SCCM/ESICM/ACCP/ATS/SIS International Sepsis Definitions Conference. Crit Care Med. 2003;31(4):1250–6. Epub 2003/04/12. 10.1097/01.CCM.0000050454.01978.3B . [DOI] [PubMed] [Google Scholar]
  • 30.World Medical A. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA. 2013;310(20):2191–4. Epub 2013/10/22. 10.1001/jama.2013.281053 . [DOI] [PubMed] [Google Scholar]
  • 31.Lundberg SM, Nair B, Vavilala MS, Horibe M, Eisses MJ, Adams T, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng. 2018;2(10):749–60. Epub 2019/04/20. 10.1038/s41551-018-0304-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. Explainable AI for Trees: From Local Explanations to Global Understanding. arXiv e-prints [Internet]. 2019 May 01, 2019. Available from: https://ui.adsabs.harvard.edu/abs/2019arXiv190504610L. [DOI] [PMC free article] [PubMed]
  • 33.Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence. 2020. 10.1038/s42256-019-0138-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Lipovetsky S, Conklin M. Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry. 2001;17(4):319–30. 10.1002/asmb.446 [DOI] [Google Scholar]
  • 35.Štrumbelj E, Kononenko I. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems. 2013;41:647–65. [Google Scholar]
  • 36.Chang SH, Hsieh CH, Weng YM, Hsieh MS, Goh ZNL, Chen HY, et al. Performance Assessment of the Mortality in Emergency Department Sepsis Score, Modified Early Warning Score, Rapid Emergency Medicine Score, and Rapid Acute Physiology Score in Predicting Survival Outcomes of Adult Renal Abscess Patients in the Emergency Department. Biomed Res Int. 2018;2018:6983568 Epub 2018/10/18. 10.1155/2018/6983568 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. Proceedings of the 22nd international conference on Machine learning; Bonn, Germany. 1102430: ACM; 2005. p. 625–32.
  • 38.BRIER GW. VERIFICATION OF FORECASTS EXPRESSED IN TERMS OF PROBABILITY. Monthly Weather Review. 1950;78(1):1–3. [DOI] [Google Scholar]
  • 39.DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–45. Epub 1988/09/01. . [PubMed] [Google Scholar]
  • 40.McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22(3):276–82. Epub 2012/10/25. [PMC free article] [PubMed] [Google Scholar]
  • 41.Shimabukuro DW, Barton CW, Feldman MD, Mataraso SJ, Das R. Effect of a machine learning-based severe sepsis prediction algorithm on patient survival and hospital length of stay: a randomised clinical trial. BMJ Open Respir Res. 2017;4(1):e000234 Epub 2018/02/13. 10.1136/bmjresp-2017-000234 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Ivan Olier

23 Jun 2020

PONE-D-20-09068

Machine Learning versus Physicians to Predict Mortality in Sepsis Patients Presenting to the Emergency Department

PLOS ONE

Dear Dr. Meex,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 07 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Ivan Olier, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your ethics statement in the manuscript and in the online submission form, please provide additional information about the patient records used in your retrospective study. Specifically, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information.

3. Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed)

4. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

5. Please ensure that you refer to Figure 4 in your text as, if accepted, production will need this reference to link the reader to the figure.

6. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript presents the development of a machine learned model predicting 31-day mortality.

The predictive performance of the model is compared to the predictive performance of 4 internists (2 consultants and 2 fellows). The machine learned model outperformed the internists by a wide margin.

The novelty in the manuscript is the comparison between the machine learning (ML) model and the internists.

Developing yet another ML model for sepsis mortality risk is not exciting; but a comparison of the model with human expert internists is exciting. As the key contribution of the paper, this comparison has to be solid. Unfortunately, the way the paper stands now, I do not feel that this comparison is rock solid. Here are some weaknesses in the comparison.

(a) Although the difference between the internists and the ML model is large, with only 4 internists the differences among the internists can significantly influence the results. It would be helpful to have inter-rater agreement among the internists.

Also, how was specificity/sensitivity for the ML model determined?

(b) It should also be noted that the machine learned model incorporates physician judgement. Whether a lab test is performed or not is based on clinical judgement and is made available to the model (as the lab absence/presence indicator). A person with fewer labs has fewer problems and is hence less likely to die. The way the authors handle missing values is typically completely reasonable and correct; however, in this case, it "leaks" physician knowledge to the ML model.

Beside the comparison, the model development itself raises some concerns. Specifically, the performance differences between the various machine learning methods are incredibly high (.63-ish for Ridge regression, .65 for neural networks, .72 for random forest, but .85 for xgboost). We also typically see xgboost outperform other methods, but not to this extreme extent. I am wondering whether the authors may have made some mistake evaluating xgboost. There are other indicators of a possible mistake:

- The performance of xgboost differs across different tables (.813 in Supp Tbl 2; .852 in Supp Tbl 5). With a stated confidence interval of .79-.83, .852 is quite a bit outside the confidence interval.

- The learning rate is stated as .075 in Supp Tbl 5 but .001 in the text; this could be a typo or an actual different hyper-parameterization. A different hyper-parameterization, rather than random chance, could explain observed AUC values outside the confidence interval.

- The "95%"-confidence intervals are not 95%. In Supp Fig. 2B, only fold 1 falls consistently within the stated 95% confidence interval, the remaining 4 folds fall outside the "95%"-confidence interval for large consecutive portions of the ROC curve. How was the bootstrap estimation performed? Specifically, which data set is resampled (the development or the leave-out)?

I agree with the authors that a full-fledged lattice search of the hyper-parameter space is unnecessary, but I would suggest:

- explaining how the hyper-parameters were determined (e.g. using the default values from the package)

- explaining which if any hyper-parameters were changed (based on the CV test set; NOT the 100 patient leave-out set)

- conduct a sensitivity analysis demonstrating that small changes in the hyper-parameters do not lead to major perturbations in the performance.

The SHAP score is poorly explained and Figure 3 is poorly described. Readers not familiar with SHAP and violin plots will not understand this figure. An example could be useful. E.g., high values of urea (red points) impact the risk of mortality positively (large positive impact); low values reduce it slightly (modest negative impact); urea observations in most patients have no impact on mortality (the urea violin plot is widest at 0 impact).

A concern with xgboost is the (somewhat) black-box nature of the model. The authors address this concern by computing the SHAP score for each feature. Given that the key differentiator of trees from the other methods is their ability to detect and use interactions, SHAP may not capture this well. A better approach is to take the (say) top 20 features and build a reduced model to see the performance loss. (I picked 20 because that is what the authors focus on in Supp Tbl 4, but other numbers could illustrate their point better.)

Clinical significance.

ML models in sepsis abound with very marginal contributions. Proposing yet another one without explaining how it can improve sepsis care is fairly meaningless. I appreciate the authors mentioning how their model could be used, and I think expanding on this is essential. The authors' suggestion of using the model to identify high-risk patients (>= 50% risk of mortality) is reasonable. However, their evaluation does not quantify their contribution from this perspective. The clinical significance of this work could be significantly improved by (i) showing how much better the ML model is at identifying patients at >=50% risk (or any other high-risk cut-off); (ii) clinically describing where the ML model was correct and the internists were not (again, >= 50% vs <50% can be used); (iii) understanding the limits of the model, clinically describing where the ML model "fails": predicts a low risk of mortality while the patient dies (regardless of whether the internists made the correct prediction). Note that a higher AUC does not guarantee an improved ability to identify high-risk patients; correctly ordering low-risk patients will increase the AUC yet has no clinical significance.

Limitations. Some important limitations are omitted.

1.The key limitation is portability. We can expect the risk scores and the internists to have similar performance if they made predictions for patients in a different health system. How about the xgboost model? Will it achieve similar performance?

2. Physicians, who actually see patients, may incorporate other features not captured by the EHR.

SUMMARY.

The key contribution of the paper is a comparison between an ML model and human internists. There are minor flaws in the comparison (ML has access to clinical judgement).

The model development process appears to have technical problems (confidence intervals appear incorrect; unknown rationale for the hyper-parameterization and dependence of the performance on these parameters).

The performance envelope of the model is unexplored: what clinical patient characteristics make it fail? make it perform better than the internists?

Important limitations are not mentioned.

I think this paper has the potential to be an influential piece of work but its contribution as it stands now is insufficient and some technical aspects may even be incorrect.

Reviewer #2: • Mortality prediction in patients with sepsis, utilising a small dataset of patients presenting to a single emergency department

• The title is a little sensational – “Machine Learning versus Physicians”, and could be modified to represent the science

o “A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis”

• Authors test the hypothesis that machine learning models would outperform physician evaluation and existing clinical risk scores

• Authors determined that machine learning models outperformed clinicians due to higher sensitivity and specificity; however, discriminatory information is not provided in the abstract

• Introduction

o The references noted in the introduction note high discriminatory scores for abbMEDS (up to 0.85), mREMS (up to 0.84) and clinician judgement (up to 0.81), suggesting similar performance to this cohort

o The comment is made that “machine learning can extract information from incomplete…data” – although complexity and non-linear relationships are areas where ML does succeed, incomplete data is still a major limitation

• Methods

o Patients with missing clinical data were excluded – how many variables needed to be missing? Were the data missing at random? A plot representing percentage of missing values would be valuable, and an analysis to confirm that the missing value distribution was equal in the training and testing set

o Similarly, the authors note that the machine learning model is capable of dealing with missing data – could further information be provided as to how GBMs deal with the issue? Further, this statement is incongruent with the previous noting that patients with missing clinical data were excluded (unless this does not refer to biomarkers, and instead to other data)

o As biochemical markers were taken within the first two hours, and a significant proportion of treatment is performed in that time frame, was the variable of time between presentation and time of biomarker withdrawal recorded and adjusted for?

o The train/test split ratio is unusual – 93:7 – can the authors explain why this was chosen? Internal validation on 100 patients is small and may affect the reproducibility of this result.

o Mortality was obtained from the electronic health record – was this linked to a national health index? How were deaths out of hospital recorded in the EHR?

o The septic shock definition mentions a MAP ≤90 – is an alternate threshold intended here?

o The use of SHAP to explain the findings is an important addition and the authors should be credited for its inclusion

o Why were internal medicine physicians chosen, as opposed to emergency or intensive care physicians? Is this typical for this hospital?

o Calibration measures such as the Brier score and calibration plots should be presented for the models

o There is marked class imbalance, with only 13% of the population experiencing the primary outcome of death – were oversampling methods considered?

• Results and Discussion

o There appears to be a significantly higher percentage of patients with cancer and diabetes in the training/development set

o Why is the discrimination/AUC not directly compared between the ML models and the physicians and clinical scores?

o AUCs should be compared using DeLong’s test to demonstrate statistical superiority in regard to discrimination

o Figure 2 does not include the clinical risk scores for comparison

o Creatinine does not share the same relationship with model output as does urea – can this be explained clinically? (from the SHAP figure)

o Several references to other machine learning models in the literature focused on sepsis and mortality prediction have been omitted

General points

• Funded study with no conflicts of interest

• Grammar could be slightly improved; however, it does not interfere with the message of the paper

o “…and categorize from low to high risk” should be “and is categorized from low to high risk”

o “follow-studies” should be “follow-up studies”

• Authors have made data available without restriction, however they note that it is available in the supplementary material – the code and raw data are not provided?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jan 19;16(1):e0245157. doi: 10.1371/journal.pone.0245157.r002

Author response to Decision Letter 0


10 Jul 2020

Dr. Ivan Olier, PhD

Academic Editor of PLOS ONE

Faculty of Engineering and Technology

Liverpool John Moores University, United Kingdom

July 10, 2020

Dear Dr. Olier,

We wish to thank you for the interest in our manuscript. We appreciate the constructive comments of both reviewers, which have been valuable for improving our manuscript. Please find enclosed our revision of the manuscript “A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis” (title has changed). We addressed the comments of both reviewers in a point-by-point rebuttal and revised our manuscript accordingly. Track changes were used to indicate the revised sections in our manuscript. Additionally, minor textual changes were performed to improve the readability of the manuscript.

On behalf of my co-authors, I am pleased to submit a revised version of our manuscript. I confirm that this work is original and has not been published elsewhere nor is it currently under consideration for publication elsewhere. All authors have read and agreed with the revision of the manuscript.

We wish to thank you for the opportunity to submit a revised version of our manuscript to PLOS ONE.

Yours sincerely,

Steven J.R. Meex, Ph.D.

Central Diagnostic Laboratory

Maastricht University Medical Center+

Post office box 5800

6202 AZ Maastricht

The Netherlands

Tel: +31 (0)43-387 4709

Fax: +31 (0)84-003 8525

E-mail: steven.meex@mumc.nl

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

We adjusted the manuscript and supplemental information to meet the style requirements of PLOS ONE.

2. In your ethics statement in the manuscript and in the online submission form, please provide additional information about the patient records used in your retrospective study. Specifically, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information.

The data was fully anonymized before analysis. The study was approved by the medical ethical committee (METC 2019-1044) and hospital board of the Maastricht University Medical Centre+. The ethics committee waived the requirement for informed consent. We adjusted our ethics statement and methods sections in the manuscript to include this necessary information (page 7, lines 122-123).

3. Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed)

The ethics committee waived the requirement for informed consent and therefore we did not collect this. We adjusted our ethics statement and methods sections in the manuscript to include this necessary information (page 7, lines 122-123).

4. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

We added the ORCID iD for the corresponding author in the editorial manager.

5. Please ensure that you refer to Figure 4 in your text as, if accepted, production will need this reference to link the reader to the figure.

We inserted two references to Figure 4 in the last section of our results section (page 20, line 328).

6. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

We inserted the captions for supporting information at the end of our manuscript file and also adjusted the citations accordingly.

Reviewer 1

We are pleased with the feedback given by the reviewer. We want to thank the reviewer for evaluating our manuscript and for giving important suggestions and comments to improve our manuscript. The comments have been addressed in a point-by-point fashion in this document. The changes are highlighted in track changes throughout the manuscript. The pages and lines mentioned for each adjustment refer to the manuscript and supplemental file with track changes.

1. The manuscript presents the development of a machine learned model predicting 31-day mortality. The predictive performance of the model is compared to the predictive performance of 4 internists (2 consultants and 2 fellows). The machine learned model outperformed the internists by a wide margin. The novelty in the manuscript is the comparison between the machine learning (ML) model and the internists. Developing yet another ML model for sepsis mortality risk is not exciting; but a comparison of the model with human expert internists is exciting. As the key contribution of the paper, this comparison has to be solid. Unfortunately, the way the paper stands now, I do not feel that this comparison is rock solid. Here are some weaknesses in the comparison. (a) Although the difference between the internists and the ML model is large, with only 4 internists the differences among the internists can significantly influence the results. It would be helpful to have inter-rater agreement among the internists. Also, how was specificity/sensitivity for the ML model determined?

We thank the reviewer for this in-depth analysis of our manuscript. We agree that the differences between the internists could influence the results; this is also highlighted by the individual differences in sensitivity and specificity (S4 Fig). Furthermore, we calculated the inter-rater agreement between internists using Cohen's Kappa statistics, which are described in the table below (in the attached rebuttal document). This information was included in our manuscript as S6 Table, and described in the methods (page 14, lines 254-258) and results section (page 20, lines 340-342).
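For illustration, pairwise inter-rater agreement of this kind can be computed directly with scikit-learn; the sketch below uses placeholder prediction arrays rather than our actual physician data:

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary 31-day mortality calls (1 = predicted death) from four
# physicians on the same 100 validation patients; placeholders, not our data.
rng = np.random.default_rng(0)
calls = {f"physician_{i}": rng.integers(0, 2, size=100) for i in range(1, 5)}

# Pairwise Cohen's kappa for every physician pair.
for (name_a, a), (name_b, b) in combinations(calls.items(), 2):
    print(f"{name_a} vs {name_b}: kappa = {cohen_kappa_score(a, b):.2f}")
```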

As an additional sensitivity analysis, we compared the model to alternating physician group compositions (table in the attached rebuttal document). In each comparison between the machine learning model and the physician group, a single physician was removed from the physician group. In all comparisons the machine learning model outperforms the physicians. This information was included in our manuscript as S7 Table, and described in the methods (page 14, lines 258-260) and results sections (pages 20-21, lines 342-345). This analysis, supported by the moderate to good Cohen's Kappa values, shows that the higher performance of the machine learning model was not due to systematic underperformance of a single physician.

The specificity and sensitivity of the machine learning model were calculated using a standard cut-off of 0.5. We added an additional sentence to the methods section of the manuscript to clarify this: “A machine learning prediction of higher than 0.50 was considered as a positive prediction.” (page 13, lines 228-229).
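For illustration, this dichotomization and the resulting sensitivity and specificity can be computed as follows (a minimal sketch with toy numbers, not the validation-set data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sensitivity_specificity(y_true, y_prob, threshold=0.50):
    """Dichotomize predicted mortality probabilities at the threshold and
    return (sensitivity, specificity)."""
    y_pred = (y_prob > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# Toy values; in the study this was done on the 100-patient validation set.
y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])
y_prob = np.array([0.10, 0.80, 0.30, 0.55, 0.40, 0.05, 0.60, 0.90])
sens, spec = sensitivity_specificity(y_true, y_prob)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```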

2. It should also be noted that the machine learned model incorporates physician judgement. Whether a lab test is performed or not is based on clinical judgement and is made available to the model (as the lab absence/presence indicator). A person with fewer labs has fewer problems and is hence less likely to die. The way the authors handle missing values is, in general, completely reasonable and correct; however, in this case, it "leaks" physician knowledge to the ML model.

We agree with the reviewer that the addition of an “absence/presence” indicator also adds a certain level of physician intuition to the laboratory model. We feel, however, that this is not necessarily problematic from an application perspective. Notably, the most important difference between the laboratory and the laboratory/clinical datasets is the explicit addition of clinical and vital characteristics such as blood pressure, heart rate and oxygen saturation. These factors are known to be important predictors of mortality, which is also observed in our SHAP analysis (Figure 3).

3. Beside the comparison, the model development itself raises some concerns. Specifically, the performance differences between the various machine learning methods are incredibly high (.63-ish for Ridge regression, .65 for neural networks, .72 for random forest, but .85 for xgboost). We also typically see xgboost outperform other methods, but not to this extreme extent. I am wondering whether the authors may have made some mistake evaluating xgboost. There are other indicators of a possible mistake:

- The performance of xgboost differs across different tables (.813 in Supp Tbl 2; .852 in Supp Tbl 5). With a stated confidence interval of .79-.83, .852 is quite a bit outside the confidence interval.

We share the reviewer's observation that XGBoost (and other gradient-boosting implementations) often outperforms other algorithms in settings with heterogeneous, tabular datasets. As mentioned by the reviewer, we describe the baseline performance of XGBoost in S2 Table with an area-under-the ROC curve of 0.813 (0.791 – 0.835). This was done as a baseline comparison on the development variant of the laboratory dataset (n=1,244) with five-fold cross-validation. This model is similar to the “lab” model presented in Figure 2, which has an AUC of 0.82 (0.80-0.84). The minor difference in AUC is most likely due to a different random initialization and different cross-validation folds being created (both models were developed in independent experiments). The XGBoost model we present in S5 Table is the best performing model from Figure 2, which is the laboratory + clinical model. This model had comparable performance with five-fold cross-validation in the development dataset (AUC: 0.84, Figure 2) and in the validation dataset (AUC: 0.852, Supplementary Table 5).

To prevent any misinterpretation, we added “on the laboratory dataset” (page 10, line 177) in the manuscript and “using the laboratory dataset” (page 7, line 113) in the supplemental file.

- The learning rate is stated as .075 in Supp Tbl 5 but .001 in the text; this could be a typo or an actual different hyper-parameterization. A different hyper-parameterization could explain the observed AUC values falling outside the confidence interval better than random chance could.

We apologize, this was a typo indeed. We corrected the learning rate of 0.001 to 0.075 in S1 supporting information in our supplemental files (page 5, line 97).

- The "95%"-confidence intervals are not 95%. In Supp Fig. 2B, only fold 1 falls consistently within the stated 95% confidence interval, the remaining 4 folds fall outside the "95%"-confidence interval for large consecutive portions of the ROC curve. How was the bootstrap estimation performed? Specifically, which data set is resampled (the development or the leave-out)?

The reviewer is correct: the 95% CI is based on the mean and standard deviation of 5 ROC calculations, each time using 4 of 5 equally sized folds of the development set. Five calculations are, however, insufficient to assume normality and calculate a reliable 95% CI. One slightly deviating ROC estimation can hence fall, just by statistical chance, almost completely outside the 95% CI. To prevent any confusion, we removed the 95% CI areas and calculations in S1 Fig and Fig 2 (main article) and only present the ROCs and the mean of the 5 individual folds.

4. I agree with the authors that a full-fledged lattice search of the hyper-parameter space is unnecessary, but I would suggest:

- explaining how the hyper-parameters were determined (e.g. using the default values from the package)

Hyperparameter tuning is often critical in the development of high-performance machine learning models but due to the following constraints we decided not to perform any explicit hyperparameter tuning:

* Our main aim is the comparison of machine learning models versus physicians and not the development of a high-performance machine learning model which would be unrealistic given the current sample size

* It would require a third dataset to independently evaluate the different hyperparameters (e.g. a ‘tuning’/’validation’ dataset); which in our case would limit our sample size even further. An alternative would be nested cross-validation (Parvandeh S, et al. Bioinformatics. 2020;36(10):3093-8. doi:10.1093/bioinformatics/btaa046) but given the clinical context of the study we believe this is overly complex.

In the table below (in the attached rebuttal document) we describe the rationale for each of the chosen hyperparameters in our study. This information was added as an additional paragraph in our supplementary information in S3 Table (page 7, line 120-124).

- explaining which if any hyper-parameters were changed (based on the CV test set; NOT the 100 patient leave-out set)

We would like to refer to the comment above; we did not adjust any hyperparameters.

- conduct a sensitivity analysis demonstrating that small changes in the hyper parameters do not lead to major perturbations in the performance.

To illustrate that small changes in hyperparameters do not lead to major perturbations in performance, we performed a grid search with “max_depth” and “n_estimators”, which are known to be amongst the most influential hyperparameters in tree-based models. We compared combinations of these hyperparameters and depicted their AUC in the table below (in the attached rebuttal document). Models employed in our manuscript are highlighted in bold.

We would like to stress that based upon these results we cannot decide which model is better as this would require an additional dataset. Instead, these results only provide a comparison between different hyperparameters, depicting that subtle differences in hyperparameters do not lead to major changes in performance. As the main objective of our study is to compare the machine learning models versus physicians, we would propose not to include this in the current manuscript.
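For illustration, such a sensitivity check can be set up as a small grid of cross-validated AUCs. The sketch below uses synthetic stand-in data and illustrative grid values; only the learning rate of 0.075 is taken from our settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Stand-in data; in the study this would be the development dataset.
X, y = make_classification(n_samples=1244, n_features=30, weights=[0.87], random_state=0)

# A small illustrative grid over max_depth and n_estimators.
for max_depth in (2, 3, 4):
    for n_estimators in (100, 200, 400):
        model = XGBClassifier(
            max_depth=max_depth,
            n_estimators=n_estimators,
            learning_rate=0.075,
            eval_metric="logloss",
        )
        aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(f"max_depth={max_depth}, n_estimators={n_estimators}: "
              f"AUC = {aucs.mean():.3f} ± {aucs.std():.3f}")
```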

5. The SHAP score is poorly explained and Figure 3 is poorly described. Readers not familiar with SHAP and violin plots will not understand this figure. An example could be useful, e.g.: high values of urea (red points) impact the risk of mortality positively (large positive impact); low values reduce it slightly (modest negative impact); urea observations in most patients have no impact on mortality (the urea violin plot is widest at 0 impact).

To facilitate the interpretation of the SHAP algorithm, we added an extended description in the methods section of our paper (pages 6-7, lines 206-216) explaining the interpretation of these values and hereby hopefully facilitating the reader with enough information to interpret these values. Moreover, as the reviewer suggested, we added an example of the interpretation of SHAP values to our results section (page 18, lines 306-309).

6. A concern with xgboost is the (somewhat) black-box nature of the model. The authors address this concern by computing the SHAP score for each feature. Given that the key differentiator of trees from the other methods is their ability to detect and use interactions, SHAP may not capture this well. A better approach is to take the (say) top 20 features and build a reduced model to see the performance loss. (I picked 20 because that is what the authors focus on in Supp Tbl 4, but other numbers could illustrate their point better.)

We agree with the reviewer's comment that it would be interesting to build an XGBoost model with the top-N features. We assessed the performance of XGBoost models with the top-20, 15, 10, 5 and 3 features in cross-validation by the area under the ROC curve. Results are depicted in the table below (in the attached rebuttal document).

The differences between the full and top-20 models are relatively small. Considering the scope of the current manuscript and the small differences, we would propose not to include this in the current manuscript.
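A minimal sketch of how such a reduced model can be built, ranking features by mean absolute SHAP value (synthetic stand-in data; not the exact code used for the analysis above):

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Stand-in for the development dataset.
X, y = make_classification(n_samples=1244, n_features=40, weights=[0.87], random_state=0)

# Fit the full model and rank features by mean absolute SHAP value.
full_model = XGBClassifier(eval_metric="logloss").fit(X, y)
sv = shap.TreeExplainer(full_model).shap_values(X)
sv = sv[1] if isinstance(sv, list) else sv            # handle list-per-class output
top20 = np.argsort(np.abs(sv).mean(axis=0))[::-1][:20]

# Compare cross-validated AUC of the full and reduced (top-20) models.
for name, features in (("full", slice(None)), ("top-20", top20)):
    aucs = cross_val_score(XGBClassifier(eval_metric="logloss"),
                           X[:, features], y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {aucs.mean():.3f}")
```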

7. Clinical significance. ML models in sepsis abound with very marginal contributions. Proposing yet another one without explaining how it can improve sepsis care is fairly meaningless. I appreciate the authors mentioning how their model could be used, and I think expanding on this is essential. The authors' suggestion of using the model to identify high-risk patients (>= 50% risk of mortality) is reasonable. However, their evaluation does not quantify their contribution from this perspective. The clinical significance of this work could be significantly improved by (i) showing how much better the ML model is at identifying patients at >=50% risk (or any other high-risk cut-off); (ii) clinically describing where the ML model was correct and the internists were not (again, >= 50% vs <50% can be used); (iii) understanding the limits of the model, clinically describing where the ML model "fails": predicts a low risk of mortality while the patient dies (regardless of whether the internists made the correct prediction). Note that a higher AUC does not guarantee an improved ability to identify high-risk patients; correctly ordering low-risk patients will increase the AUC yet has no clinical significance.

We agree with the reviewer that the development of machine learning models without any clinical application is meaningless. However, given our limited sample size, only moderate to good model performance, and a main focus on the comparison of physicians versus machine learning models, we believe that providing these clinical estimates is beyond the scope of the current study. That said, in an ongoing follow-up study we are building high-performance machine learning models (AUCs of 0.90 and higher) in four Dutch hospitals, including more than 260,000 patients. That study is more focused on the strategy towards clinical application, which can briefly be described as follows:

First, we defined the acceptable percentage of patients that are erroneously identified as “low-risk” by the algorithm (any number from 0-100%). This percentage, e.g. 1%, could be derived from an inventory of acceptable risk tolerance for adverse events by patients, health care workers, or both (Brown TB, et al. J Emerg Med. 2010;39(2):247-52). The corresponding negative predictive value (in this case 99%) is then used to derive the matching algorithm prediction threshold (e.g. 0.05) and the associated values for sensitivity, specificity, and the proportion of subjects identified as low risk. A similar approach can be applied to identify high-risk patients: define the positive predictive value that would provide an acceptable balance between true high-risk patient identification and false positives; e.g. a positive predictive value of 75% would categorize x% as high-risk individuals, with 1 in 4 “flaggings” by the clinical decision support tool being a false positive. A higher proportion of high-risk subject identification is feasible but comes at the expense of more false positive flaggings.
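As an illustration of the threshold-derivation step described above, a minimal sketch with hypothetical predictions and outcomes (the 99% NPV target is only an example):

```python
import numpy as np

def threshold_for_target_npv(y_true, y_prob, target_npv=0.99):
    """Return the largest probability cut-off such that patients with a
    predicted risk below it ('low risk') retain an NPV >= target_npv."""
    chosen = None
    for t in np.unique(y_prob):                # thresholds scanned in ascending order
        low_risk = y_prob < t
        if not low_risk.any():
            continue
        npv = (y_true[low_risk] == 0).mean()   # fraction of survivors among low-risk calls
        if npv >= target_npv:
            chosen = t
    return chosen

# Toy predictions; in practice these would come from a dedicated tuning set.
rng = np.random.default_rng(1)
y_prob = rng.uniform(0, 1, 1000)
y_true = rng.binomial(1, 0.3 * y_prob)         # mortality loosely tied to predicted risk
print("low-risk cut-off:", threshold_for_target_npv(y_true, y_prob))
```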

8. Limitations. Some important limitations are omitted. 1. The key limitation is portability. We can expect the risk scores and the internists to have similar performance if they made predictions for patients in a different health system. How about the xgboost model? Will it achieve similar performance?

We expanded on the portability of our model in our discussion section:

(i) “our sample size (especially in the area of machine learning) is relatively small, and therefore substantial further improvement of diagnostic accuracy is likely when increasing the sample size” (page 23-24, lines 406-408)

(ii) “We also limited ourselves to sepsis patients presenting to the ED, and thus it is unknown to what degree these models translate to a broader, general ED population” (page 24, lines 408-410)

(iii) “these results are based on a retrospective, single-center study which limits the external validity of our models” (page 24, lines 410-412)

9. Physicians, who actually see patients, may incorporate other features not captured by the EHR.

We agree with the reviewer’s comment that the performance of physicians might be underestimated due to the retrospective nature of the study. We briefly discussed this limitation in our discussion, and extended this with the sentence: “As such, the performance of internal medicine physicians might be underestimated as they were not able to directly see the patient.” (page 24, lines 417-419)

10. SUMMARY. The key contribution of the paper is a comparison between an ML model and human internists. There are minor flaws in the comparison (ML has access to clinical judgement). The model development process appears to have technical problems (confidence intervals appear incorrect; unknown rationale for the hyper-parameterization and dependence of the performance on these parameters). The performance envelope of the model is unexplored: what clinical patient characteristics make it fail? make it perform better than the internists? Important limitations are not mentioned. I think this paper has the potential to be an influential piece of work but its contribution as it stands now is insufficient and some technical aspects may even be incorrect.

We again would like to express our gratitude for the thorough review and suggestions of the reviewer. We think that the manuscript significantly improved and hope that it now meets the standard for publication in PLOS ONE.

Reviewer 2

We wish to thank the reviewer for evaluating our manuscript and for providing suggestions and comments to improve our manuscript. The comments have been addressed in a point-by-point fashion in this document. The changes are highlighted in track changes throughout the manuscript. The pages and lines mentioned for each adjustment refer to the manuscript and supplemental file with track changes.

1. Mortality prediction in patients with sepsis, utilising a small dataset of patients presenting to a single emergency department. The title is a little sensational – “Machine Learning versus Physicians” – and could be modified to represent the science: “A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis”

According to the reviewer’s suggestion we changed the title to: “A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis” (page 1, lines 3-4).

2. Authors test the hypothesis that machine learning models would outperform physician evaluation and existing clinical risk scores. Authors determined that machine learning models outperformed clinicians due to higher sensitivity and specificity; however, discriminatory information is not provided in the abstract.

The discriminatory ability of models versus physicians is described in S5 Table. To further emphasize this, we added this information in our abstract: “The model had higher diagnostic accuracy with an area-under-the-ROC curve of 0.852 (95%CI: 0.783-0.922) compared to abbMEDS (0.631,0.537-0.726), mREMS (0.630,0.535-0.724) and internal medicine physicians (0.735,0.648-0.821).” (page 4, lines 58-61).

3. The references noted in the introduction note high discriminatory scores for abbMEDS (up to 0.85), mREMS (up to 0.84) and clinician judgement (up to 0.81), suggesting similar performance to this cohort

Clinical risk scores and even clinical judgement can indeed show high discriminatory scores (up to 0.80), but these vary substantially between centers. We therefore feel it is only scientifically sound to make comparisons within studies. In our study, we found AUCs of 0.631 (0.537-0.726) and 0.630 (0.535-0.724) for abbMEDS and mREMS, respectively.

4. The comment is made that “machine learning can extract information from incomplete…data” – although complexity and non-linear relationships are areas where ML does succeed, incomplete data is still a major limitation

We agree with the comment of the reviewer that incomplete data is still an ongoing limitation of most machine learning models. As complex and non-linear relationships are areas where machine learning currently excels, we adopted the reviewer's comment and modified the sentences in our introduction section accordingly (page 6, line 93).

5. Patients with missing clinical data were excluded – how many variables needed to be missing? Were the data missing at random? A plot representing percentage of missing values would be valuable, and an analysis to confirm that the missing value distribution was equal in the training and testing set.

We included 1,420 patients in our current study, of whom 76 were excluded due to missing laboratory results (n=53) or clinical data (n=23). Missing clinical data (n=23) were the result of missing blood pressure (n=5), Glasgow coma score (n=7), oxygen saturation (n=6), heart rate (n=4) and age (n=1). We would like to stress that these patients were excluded before randomization of the dataset into training and test sets. Hence, similar to a randomized controlled trial (RCT), statistical analysis after randomization to confirm an equal distribution would be undesirable.

6. Similarly, the authors note that the machine learning model is capable of dealing with missing data – could further information be provided as to how GBMs deal with the issue? Further, this statement is incongruent with the previous noting that patients with missing clinical data were excluded (unless this does not refer to biomarkers, and instead to other data)

Gradient-boosting systems are models that build an ensemble of ‘weaker’ prediction models, mostly decision trees. Within such a decision tree, each node is a test on a feature (e.g. CRP < 5 U/L or age < 70) and each branch represents an outcome of this test. During the training phase of these algorithms, we determine the features (and their cut-offs within a node) that result in the best performance given the dataset. When a missing value is encountered during training, we evaluate what happens to the performance if the sample is sent down either of the two branches; the branch that results in the highest performance gain on average is defined as the “default” path to take in case of a missing value. An extended description of this native mechanism for handling missing data in the XGBoost algorithm specifically can be found in its original paper (Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. arXiv:1603.02754).

The reviewer is correct that the missing clinical data does not include the biomarkers. We adjusted the manuscript to further clarify this (page 7, line 117).
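As an illustration of this native handling, a minimal sketch with synthetic data (not the study dataset); XGBoost accepts NaN entries directly, without prior imputation:

```python
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Synthetic data with ~20% of values set to missing at random.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.uniform(size=X.shape) < 0.20] = np.nan

# XGBoost learns a default branch direction for missing values at each split,
# so NaNs can be passed in directly without imputation.
model = XGBClassifier(eval_metric="logloss").fit(X, y)
print(model.predict_proba(X[:5])[:, 1])
```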

7. As biochemical markers were taken within the first two hours, and a significant proportion of treatment is performed in that time frame, was the variable of time between presentation and time of biomarker withdrawal recorded and adjusted for?

The reviewer highlights an interesting point about the time of biomarker withdrawal and its relationship to treatment in the emergency department. Unfortunately, we do not have any data on these time windows, but considering that this is evenly distributed in the training and test dataset, it is automatically captured by the model, and will not limit the performance of the models.

8. The train/test split ratio is unusual – 93:7 – can the authors explain why this was chosen? Internal validation on 100 patients is small and may affect the reproducibility of this result.

We agree with the reviewer that the train:test split ratio differs from more conventional splits of e.g. 70%/30%. However, as the main aim of the study is the comparison versus physicians, we decided that 100 patients (which corresponds to 7%) would be a feasible number of patients to have evaluated by our internal medicine specialists.

9. Mortality was obtained from the electronic health record – was this linked to a national health index? How were deaths out of hospital recorded in the EHR?

Mortality information was acquired through the electronic health records which are linked to municipal population registries.

10. The septic shock definition mentions a MAP ≤90 – is an alternate threshold intended here?

We thank the reviewer for this comment; we indeed intended to describe the MAP (mean arterial pressure) here, but at a threshold of <65 instead of <90. This was adjusted in the manuscript accordingly (page 10, line 165).

11. The use of SHAP to explain the findings is an important addition and the authors should be credited for its inclusion.

Thank you for the comment. In response to the other reviewer, we extended the description of the SHAP algorithm in the methods (pages 6-7, lines 206-216) and results section (page 18, lines 306-309).

12. Why were internal medicine physicians chosen, as opposed to emergency or intensive care physicians? Is this typical for this hospital?

We selected internal medicine physicians (n=4) who were either residents (n=2) or consultants (n=2) specialized in acute internal medicine. At the time of the study, our hospital had no emergency medicine physicians, but rather internal medicine physicians who specialized in acute internal medicine. These physicians work at our emergency department on a day-to-day basis, and thus best suit the comparison versus a machine learning model. To clarify this, we modified the manuscript at several lines to explicitly include the word “acute” in our definition of internal medicine specialists (page 13, line 224 and 232; page 20, line 324).

13. Calibration measures such as the Brier score and calibration plots should be presented for the models

We added calibration plots and Brier scores in a five-fold cross-validation setting of our laboratory and laboratory+clinical models in S2 Fig. This was also added to our manuscript in the methods (page 14, lines 246-247) and results section (page 17, lines 286-287).
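For readers unfamiliar with these measures, a minimal sketch of how a Brier score and the data behind a calibration plot can be computed with scikit-learn (hypothetical out-of-fold predictions, not our study data):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Hypothetical out-of-fold predicted mortality probabilities and outcomes.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 1000)
y_true = rng.binomial(1, y_prob)

print("Brier score:", brier_score_loss(y_true, y_prob))

# Observed vs. predicted event rates in 10 probability bins (calibration plot data).
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(mean_pred, frac_pos):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```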

14. There is marked class imbalance, with only 13% of the population experiencing the primary outcome of death – were oversampling methods considered?

We thank the reviewer for the suggestion. In early stages we indeed considered oversampling but, given that the positive outcome (death) occurred in 13% of patients, which is reasonably high, we decided not to perform oversampling. Also, oversampling synthesizes and/or duplicates (nearly) identical positive samples from the training dataset, which can lead to potential overfitting, especially at smaller sample sizes.

• Results and Discussion

15. There appears to be a significantly higher percentage of patients with cancer and diabetes in the training/development set

Similar to other randomization procedures (e.g. in a RCT), statistical chance may lead to slight imbalances between randomized groups. For critical parameters one can choose to distribute them evenly during randomization, but that was considered unnecessary in our study.

16. Why is the discrimination/AUC not directly compared between the ML models and the physicians and clinical scores?

The analysis of discrimination for the physicians versus models is described in S5 Table. We deliberately chose not to use discrimination (the AUC under the ROC curve) as the main analysis in our manuscript, as it might not be fully appropriate for evaluating the performance of internal medicine physicians. Typically, a ROC analysis shows how sensitivity (true positive rate) changes with varying specificity (true negative rate, or 1 − false positive rate) for different thresholds. Internal medicine physicians, however, only provide a binary prediction (death or survival) and will therefore be represented by a single point in the ROC analysis. Although this still allows calculation of an area under the curve, it is an ongoing debate whether or not this is a valid metric (Muschelli J. ROC and AUC with a Binary Predictor: a Potentially Misleading Metric. Journal of Classification. 2019. doi:10.1007/s00357-019-09345-1).

Nonetheless, we performed a ROC analysis and depicted the predictions of internal medicine physicians as bullets in the figure (see figure below in the attached rebuttal document). This figure is also embedded as S3 Fig in our manuscript (page 20, line 337) and supplemental files (page 12, lines 172-177).

Furthermore, we recognize that the discriminatory ability of a diagnostic tool is a widely used metric, and thus we decided to explicitly add a sentence regarding this comparison in our results section: “Additionally, the model had higher overall diagnostic accuracy as depicted by an AUC of 0.852 (95% CI: 0.783-0.922) compared to abbMEDS (0.631, 0.537-0.726), mREMS (0.630, 0.535-0.724) and internal medicine physicians (0.735, 0.648-0.821).” (page 20, lines 335-338).
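Purely as an illustration of this point, a sketch of how a binary predictor collapses to a single operating point on a ROC plot (toy data, matplotlib assumed; not the code behind S3 Fig):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 100)                                   # toy outcomes
model_prob = np.clip(y_true * 0.4 + rng.uniform(0, 0.6, 100), 0, 1)
physician_pred = (model_prob + rng.normal(0, 0.2, 100)) > 0.5      # toy binary calls

# Continuous model predictions give a full ROC curve...
fpr, tpr, _ = roc_curve(y_true, model_prob)
plt.plot(fpr, tpr, label="ML model")

# ...whereas a binary predictor collapses to a single operating point.
tn, fp, fn, tp = confusion_matrix(y_true, physician_pred.astype(int)).ravel()
plt.scatter(fp / (fp + tn), tp / (tp + fn), marker="o", label="physician")
plt.xlabel("1 - specificity"); plt.ylabel("sensitivity"); plt.legend(); plt.show()
```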

17. AUCs should be compared using DeLong’s test to demonstrate statistical superiority in regard to discrimination

We thank the reviewer for the suggestion of adding DeLong’s test to test for statistical significance in our ROC analysis. However, we feel that calculating this for binary predictors could be misleading. DeLong’s test is mainly useful for testing differences between related predictors (e.g. an old risk score versus an updated risk score with additional variables), which is not the case in the current study.

18. Figure 2 does not include the clinical risk scores for comparison

Figure 2 shows the development of machine learning models using the development dataset. In our validation dataset of 100 patients, we provide a head-to-head comparison of the clinical risk scores with our machine learning models and internal medicine specialists (Figure 4 and S5 Table).

19. Creatinine does not share the same relationship with model output as does urea – can this be explained clinically? (from the SHAP figure)

From Figure 3 we can indeed observe that urea and creatinine have different relationships with the predictions. Urea contributes to a more positive prediction (death), especially at higher levels (indicated by the red dots), whereas creatinine mainly has a negative (protective) effect at lower biomarker levels (indicated by blue dots). Creatinine is a marker for kidney function (and, to a lesser extent, muscle mass), whereas urea is also a marker for kidney function but is moreover highly associated with hemodynamics. Hence, urea is an important marker for the overall disease state of a patient. This is also reflected in recent risk scores, such as the RISE UP score, which include urea but not creatinine in their prediction models (Zelis N, Buijs J, de Leeuw PW, van Kuijk SMJ, Stassen PM. Eur J Intern Med. 2020;77:36-43. doi:10.1016/j.ejim.2020.02.021).

20. Several references to other machine learning models in the literature focused on sepsis and mortality prediction have been omitted

Based on the comment of the reviewer we conducted a new literature search and found four additional references that were included in the revised version of our manuscript:

* Horng S, Sontag DA, Halpern Y, Jernite Y, Shapiro NI, Nathanson LA. Creating an automated trigger for sepsis clinical decision support at emergency department triage using machine learning. PLoS One. 2017;12(4):e0174708. doi:10.1371/journal.pone.0174708

* Ford DW, Goodwin AJ, Simpson AN, Johnson E, Nadig N, Simpson KN. A Severe Sepsis Mortality Prediction Model and Score for Use With Administrative Data. Crit Care Med. 2016;44(2):319-27. doi:10.1097/CCM.0000000000001392

* Shukeri W, Ralib AM, Abdulah NZ, Mat-Nor MB. Sepsis mortality score for the prediction of mortality in septic patients. J Crit Care. 2018;43:163-8. doi:10.1016/j.jcrc.2017.09.009

* Bogle B, Balduino, Wolk D, Farag H, Kethireddy, Chatterjee, et al. Predicting Mortality of Sepsis Patients in a Multi-Site Healthcare System using Supervised Machine Learning. 2019.

21. Funded study with no conflicts of interest

Our study was funded by the non-profit organization Dutch Federation of Clinical Chemistry (NVKC). As stated in our disclosure the funders had no role in the study design, data collection, analysis, decision to publish, and preparation of the manuscript.

22. Grammar could be slightly improved however does not interfere with message of the paper

o “…and categorize from low to high risk” should be “and is categorized from low to high risk”

o “follow-studies” should be “follow-up studies”

We carefully re-read the manuscript and performed several minor modifications to improve the readability and grammar of our manuscript.

23. Authors have made data available without restriction, however they note that it is available in the supplementary material – the code and raw data are not provided?

Our supplementary material presents additional information regarding the manuscript. The raw data and the source code used in the current study are available on reasonable request.

Attachment

Submitted filename: Response to reviewers.pdf

Decision Letter 1

Ivan Olier

26 Aug 2020

PONE-D-20-09068R1

A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis

PLOS ONE

Dear Dr. Meex,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

In particular, please consider Reviewer 2's concerns regarding the size of the validation subset. I understand the authors' point on attempting the fairest possible comparison between ML algorithms and clinicians, but also agree that a very small validation subset could be detrimental to the quality of the ML performance evaluation. The authors might want to consider the inclusion of further model evaluation using an out-of-the-bag resampling strategy on 70/30, as suggested, alongside the ones already presented. It would be very helpful if you could address this and the rest of the reviewer points.

Please submit your revised manuscript by Oct 10 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Ivan Olier, Ph.D.

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: Reviewer comments are attached below:

- The significantly lower discriminatory scores in this data compared with the literature may represent differences in the particular sample, or reflect the small sample size and thus be inaccurately portraying a lower discriminatory capacity of the physicians.

- The decision to only test on 100 patients is unclear; the model may falsely appear more accurate without robust validation. Did the authors attempt validation on a randomly selected 30%?

- Cancer and diabetes would traditionally be strongly associated with mortality, particularly in septic shock. The authors note that it was considered unnecessary to distribute these evenly; however, it is unclear why.

- A statistical opinion should be sought regarding the direct validity of comparing AUROCs with DeLong’s test; it is the reviewer's opinion that this is a useful comparator for discriminatory evaluations such as this.

- The reverse relationships of urea and creatinine are unclear. Often both are not included in the same score as they are in the same direction (and consequently one will knock out the other during development); it remains unclear why they would be in opposite directions.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jan 19;16(1):e0245157. doi: 10.1371/journal.pone.0245157.r004

Author response to Decision Letter 1


1 Sep 2020

Reviewer 2

We want to thank the reviewer for re-evaluating our manuscript. The comments have been addressed in a point-by-point fashion in this document. The changes are highlighted in track changes throughout the manuscript. The pages and lines mentioned for each adjustment refer to the manuscript and supplemental file with track changes.

1. The significantly lower discriminatory scores in this data compared with the literature may represent differences in the particular sample, or reflect the small sample size and thus be inaccurately portraying a lower discriminatory capacity of the physicians.

Clinical risk scores show varying discriminatory performance in the literature, with the area under the receiver operating characteristic curve (AUROC) ranging between 0.62-0.85 for abbMEDS and 0.62-0.84 for mREMS. The AUC in our study for both scores is within the range reported in the literature.

2. The decision to only test on 100 patients is unclear; the model may falsely appear more accurate without robust validation. Did the authors attempt validation on a randomly selected 30%?

We would like to emphasize that these 100 patients represent a random selection from the population. The validation number of 100 patients was chosen as an amount that was feasible to have carefully evaluated by 4 physicians. In regard to the reviewer's suggestion to attempt validation on a randomly selected 30%, we think there may be a misunderstanding, which we will try to clarify: in our analysis we performed model evaluation on a randomly selected 20%, and on top of that applied 5-fold cross-validation, each time with another random 20% selection. We feel that our approach with 5-fold cross-validation is in fact more robust than a single validation with 30%. Nevertheless, we carried out the validation suggested by the reviewer, where we used a randomly selected subset of 70% to train laboratory and laboratory + clinical models, and a randomly selected 30% to evaluate the models. The AUCs (depicted below in the Response to Reviewers document) of the resulting models are 0.84 and 0.86, which is comparable to the performance reported in our manuscript (0.82 [0.80-0.84] and 0.84 [0.81-0.87], respectively).

We feel that adding a 70/30% resampling strategy on top of the 80/20% 5-fold cross-validation to our current manuscript would be redundant and may even be confusing to the journal’s readership. We therefore propose not to include this additional analysis in the manuscript. Alternatively, we will be happy to use the possibility offered by the journal to make this review correspondence publicly available alongside the article, so interested readers are able to see the additional analysis and read all considerations made during the review process.
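For illustration only, the two evaluation strategies can be contrasted as follows (a sketch on synthetic data standing in for our dataset; not the exact code used for the analyses above):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the 1,344-patient cohort with ~13% mortality.
X, y = make_classification(n_samples=1344, n_features=30, weights=[0.87], random_state=0)
model = XGBClassifier(eval_metric="logloss")

# Strategy 1: five-fold cross-validation (each fold holds out ~20%).
cv_aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("5-fold CV AUC:", round(cv_aucs.mean(), 3))

# Strategy 2: a single stratified 70/30 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, stratify=y, random_state=0)
single_auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
print("70/30 split AUC:", round(single_auc, 3))
```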

3. Cancer and diabetes would traditionally be strongly associated with mortality, particularly in septic shock. The authors note that it was considered unnecessary to distribute these evenly; however, it is unclear why.

This choice is similar to decisions in a randomization procedure in a randomized controlled trial: statistical chance may lead to slight imbalances between randomized groups (which becomes less likely with increasing sample size). If there is a single key confounding factor (or a very limited number of them), one can choose to account for it and guarantee an equal distribution. In our study, however, we feel that an even distribution of cancer and diabetes is not substantially more critical than that of many other traits, e.g. age, sex, hemodynamics at presentation, and other comorbidities. In the absence of compelling a priori evidence to prioritize cancer and diabetes over all other potentially confounding parameters, we chose not to distribute them evenly.

4. A statistical opinion should be sought regarding the direct validity of comparing AUROCs with DeLong’s test; it is the reviewer’s opinion that this is a useful comparator for discriminatory evaluations such as this.

As suggested by the reviewer, we compared the discriminatory performance of the machine learning model versus the clinical risk scores and physicians with DeLong’s test (DeLong et al, 1988) in the validation subset. The results are provided in the table below.

Model | AUC (95% CI) | P-value

Machine learning model | 0.852 (0.783–0.922) | N/A

abbMEDS | 0.631 (0.537–0.726) | 0.021

mREMS | 0.630 (0.535–0.724) | 0.016

Internal medicine physicians | 0.735 (0.648–0.821) | 0.189; 0.072; 0.068; 0.032 (a)

(a) Individual P-values were calculated for each of the internal medicine physicians.

We updated the manuscript in the methods section on page 11, lines 230-231 and in the results section on page 15, lines 304-306. Additionally, we updated Supplementary Table 5 on page 8, lines 124-126 with these results.
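For interested readers, the block below is a minimal, assumption-laden sketch of DeLong's test for two correlated AUCs evaluated on the same patients; in practice a validated implementation (for example the roc.test function in the R package pROC) should be preferred over hand-rolled code.

```python
# Illustrative implementation of DeLong's test (DeLong et al., 1988) for
# comparing two AUCs computed from two risk scores on the same patients.
import numpy as np
from scipy.stats import norm

def _placements(pos, neg):
    # psi(x, y) = 1 if x > y, 0.5 if x == y, 0 otherwise (Mann-Whitney kernel)
    psi = (pos[:, None] > neg[None, :]).astype(float) + 0.5 * (pos[:, None] == neg[None, :])
    return psi.mean(axis=1), psi.mean(axis=0)   # V10 (per positive), V01 (per negative)

def delong_test(y_true, score_a, score_b):
    """Two-sided p-value for H0: AUC(score_a) == AUC(score_b) on the same patients."""
    y = np.asarray(y_true).astype(bool)
    v10, v01, aucs = [], [], []
    for s in (np.asarray(score_a, float), np.asarray(score_b, float)):
        a, b = _placements(s[y], s[~y])
        v10.append(a); v01.append(b); aucs.append(a.mean())
    m, n = int(y.sum()), int((~y).sum())
    s10, s01 = np.cov(v10), np.cov(v01)          # 2x2 covariance matrices of the placements
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m + \
          (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs[0], aucs[1], 2 * norm.sf(abs(z))
```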

5. The reverse relationships of urea and creatinine are unclear. Often both are not included in the same score as they are in the same direction (and consequently one will knock out the other during development); it remains unclear why they would be in opposite directions.

From a clinical perspective, creatinine is a marker of kidney function (and, to a lesser extent, of muscle mass), whereas urea also reflects hemodynamics. Urea is therefore an important marker of the overall disease state of a patient. Although creatinine and urea are concordant in many subjects, they can differ, and reverse relationships are possible and should not be considered surprising. Indicative of the fact that urea and creatinine are not always concordant is the clinical use of the urea-to-creatinine ratio, which can aid in the diagnosis of prerenal injury, GI bleeding, and hypercatabolic states, and in the assessment of elderly patients (Irwin & Rippe, 2008; Brisco et al, 2013; Sunjino et al, 2019).

Attachment

Submitted filename: 20200901_Rebuttal - response to reviewers.pdf

Decision Letter 2

Ivan Olier

21 Oct 2020

PONE-D-20-09068R2

A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis

PLOS ONE

Dear Dr. Meex,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 05 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Ivan Olier, Ph.D.

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: All comments have been addressed

Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: Yes

Reviewer #4: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: I Don't Know

Reviewer #4: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: I have no further comments for the authors. I have read the prior comments and believe that they have been addressed in a satisfactory manner.

Reviewer #4: Thank you for the opportunity to review this manuscript. Though interesting and seemingly sound in terms of methods, I have some concerns regarding its application (see comments below).

In order for this study to be valid, it needs to be applied to an undifferentiated population with infection who could have sepsis but do not yet have the diagnosis. As I read it, it would appear that all of the patients in this study were referred for admission for some reason, which is different from patients presenting to the ED with an infection, as many of those patients would be discharged home. Therefore, I don’t believe this score can be directly compared to a score made for an undifferentiated ED population. However, if this is not the case, the authors should state clearly that this study included all ED data for patients meeting their SIRS/qSOFA criteria.

I would be interested in knowing how this score would be applied, since the authors specifically chose to test and validate their models based on the first 2 hours of available clinical and laboratory information and from what I can tell these were all patients who were being admitted to the hospital. If this is the case it would reduce the applicability of this tool.

How was consent obtained in a retrospective study? How were patients able to refuse participation, since the “ethics committee waived the requirement for informed consent”?

This article should have a separate statistical/methods review, particularly regarding the use of 100 patients for the validation set, despite this being addressed in this revision with a 70/30 split. The data in Table 1 seem too sparse, with several of the features having Ns in the single digits. I am not sure of the standard for an ML paper. Also, I am not familiar with their method for cross-validation.

I am not familiar with the term “acute internal medicine physicians”. Do these physicians work in the ED, or do they work on the acute inpatient services admitting patients?

How was the subpopulation of 1,420 patients selected out of the 5,967 patients consulted to internal medicine? Was this cohort selected randomly? There should be a flow diagram showing the total population, those excluded with reasons why, and the final cohort.

The lab model vs the lab + clinical model include features that are quite different. How do the authors suggest we reconcile these differences and which model should we consider to be superior? Also please define “Blood group (present)”.

Some terms are not defined in the manuscript, such as GCS. Also thrombocytes should be replaced with “platelet count”.

The standard for mortality prediction in sepsis is the SOFA score. There should be a direct comparison with SOFA or at least modified version of the SOFA score (there are several versions) in order to conclude that this may have clinical utility.

There needs to be more of an explanation of the different models and how they should be interpreted.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: No

Reviewer #4: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]


PLoS One. 2021 Jan 19;16(1):e0245157. doi: 10.1371/journal.pone.0245157.r006

Author response to Decision Letter 2


5 Nov 2020

Reviewer #3

1. I have no further comments for the authors. I have read the prior comments and believe that they have been addressed in a satisfactory manner.

Response: We wish to thank the reviewer for evaluating our manuscript.

Reviewer #4

1. Thank you for the opportunity to review this manuscript. Though interesting and seemingly sound in terms of methods, I have some concerns regarding its application (see comments below).

Response: We want to thank the reviewer for evaluating our manuscript and for giving important suggestions and comments to improve our manuscript. The comments have been addressed in a point-by-point fashion in this document. The changes are highlighted in track changes throughout the manuscript. The pages and lines mentioned for each adjustment refer to the manuscript and supplemental file version with track changes.

2. In order for this study to be valid, it needs to be applied to an undifferentiated population with infection who could have sepsis but do not yet have the diagnosis. As I read it, it would appear that all of the patients in this study were referred for admission for some reason, which is different from patients presenting to the ED with an infection, as many of those patients would be discharged home. Therefore, I don’t believe this score can be directly compared to a score made for an undifferentiated ED population. However, if this is not the case, the authors should state clearly that this study included all ED data for patients meeting their SIRS/qSOFA criteria.

Response: This study focused on sepsis patients who visited the ED. All patients aged ≥18 years who were referred to the internal medicine physician because of sepsis (i.e. a suspected or proven infection with two or more SIRS and/or qSOFA criteria) (S1 supporting information) were included in this study. This is also described in our methods section at page 6, lines 99-109. Our machine learning model was compared to the abbreviated Mortality in Emergency Department Sepsis (abbMEDS) score and the modified Rapid Emergency Medicine Score (mREMS). Since the abbMEDS is specifically designed for sepsis patients, we believe this is a fair comparison. We agree with the reviewer that the mREMS is a score originally developed for an undifferentiated ED population, but since it has also been validated specifically in sepsis populations (Chen et al., 2013, BMJ Emer Med J; Howell et al., 2008, Acad Emer Med; Sankoff et al., 2008, Crit Care Med; Crowe et al., 2010, J Emerg Trauma Shock), we believe including it in our comparison is also insightful. For completeness, and in response to your request in question 10, we also included the SOFA score in our comparison with the machine learning model and internal medicine physicians. This is described in greater detail in our response to question 10.

3. I would be interested in knowing how this score would be applied, since the authors specifically chose to test and validate their models based on the first 2 hours of available clinical and laboratory information and from what I can tell these were all patients who were being admitted to the hospital. If this is the case it would reduce the applicability of this tool.

Response: We agree with the reviewer that the development of machine learning models without any clinical application is meaningless. However, given our limited sample size, the only moderate-to-good model performance, and the main focus on the comparison of physicians versus the machine learning model, we believe that providing these clinical estimates is beyond the scope of the current proof-of-concept study. However, in an ongoing follow-up study we built high-performance machine learning models (AUCs of 0.90 and higher) in four Dutch hospitals, including more than 260,000 patients. That study is more focused on the strategy towards clinical application, which can be described as follows:

First, we define the acceptable percentage of patients that are erroneously identified as “low-risk” by the algorithm (any number from 0–100%). This percentage, e.g. 1%, could be derived from an inventory of acceptable risk tolerance for adverse events by patients, health care workers, or both (Brown TB, et al. J Emerg Med. 2010;39(2):247-52). We then use the corresponding negative predictive value (in this case 99%) to derive the matching algorithm prediction threshold (e.g. 0.05) and the associated values for sensitivity, specificity, and the proportion of subjects identified as low risk. A similar approach can be applied to identify high-risk patients: define the positive predictive value that would provide an acceptable balance between true high-risk patient identification and false positives; e.g. a positive predictive value of 75% would categorize x% as high-risk individuals, with 1 in 4 “flaggings” by the clinical decision support tool being a false positive. A higher proportion of high-risk subject identification is feasible, but at the expense of more false-positive flaggings.
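For illustration, the sketch below shows one way the threshold-selection strategy described above could be operationalized, assuming an array of predicted mortality probabilities p and observed outcomes y; the function name, the 99% target and the search procedure are assumptions, not the follow-up study's actual method.

```python
# Illustrative sketch: derive a risk threshold from a target negative
# predictive value (NPV) and report the operating characteristics at it.
import numpy as np

def threshold_for_target_npv(y, p, target_npv=0.99):
    """Highest risk threshold whose 'low-risk' group (p < t) still meets the target NPV."""
    y, p = np.asarray(y), np.asarray(p)
    chosen = None
    for t in np.unique(p):
        low = p < t                                   # patients flagged as low risk
        if low.sum() == 0:
            continue
        npv = float((y[low] == 0).mean())             # survivors among the low-risk group
        if npv >= target_npv:
            chosen = {
                "threshold": float(t),
                "npv": npv,
                "proportion_low_risk": float(low.mean()),
                "sensitivity": float((p[y == 1] >= t).mean()),
                "specificity": float((p[y == 0] < t).mean()),
            }
    return chosen                                     # None if the target NPV is unreachable
```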

4. How was consent obtained in a retrospective study? How were patients able to refuse participation, since the “ethics committee waived the requirement for informed consent”?

Response: The study was approved by the medical ethical committee (METC 2019-1044) and the hospital board of the Maastricht University Medical Centre+. The ethics committee waived the requirement for informed consent. Patients who are treated in The Netherlands automatically consent to their data being used for anonymized scientific research, provided that medical ethical approval is obtained, unless they specifically refuse this.

5. This article should have a separate statistical/methods review, particularly regarding the use of 100 patients for the validation set, despite this being addressed in this revision with a 70/30 split. The data in Table 1 seem too sparse, with several of the features having Ns in the single digits. I am not sure of the standard for an ML paper. Also, I am not familiar with their method for cross-validation.

Response: The methods and statistical analysis of our manuscript have been extensively reviewed and, after revisions, were considered robust by previous reviewers (#2 and #3). Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. This was described in our methods section (page 10, lines 173-183) and is widely used in the literature on validating clinical prediction models (examples include Beker, et al., 2020, Nature Mach Intell; de Rooij et al., 2020, Adv in Meth and Pract in Psych Sci; Saeb et al, 2017, Gigascience; Steyerberg et al, 2014, Eur Heart J).

6. I am not familiar with the term “acute internal medicine physicians”. Do these physicians work in the ED, or do they work on the acute inpatient services admitting patients?

Response: We selected internal medicine physicians (n=4) who were either residents (n=2) or consultants (n=2) specialized in acute internal medicine. At the time of the study, our hospital had no emergency physicians; instead, internal medicine physicians work at our emergency department on a day-to-day basis and thus represent the most experienced emergency medicine physicians available for a comparison versus a machine learning model. They also treat patients at the acute admission unit and are familiar with treating sepsis, as infections are the main reason for the ED visits they handle.

7. How was the subpopulation of 1,420 patients selected out of the 5,967 patients consulted to internal medicine? Was this cohort selected randomly? There should be a flow diagram showing the total population, those excluded with reasons why, and the final cohort.

Response: During the study period, 5,967 patients who presented to our emergency department were referred to an internal medicine physician. Of these, 1,420 patients had a suspected or proven infection and fulfilled two or more SIRS and/or qSOFA criteria. To clarify the complete process from study inclusion to data processing, we depicted the flow chart below (in the Response to Reviewers document). This flow chart was also added as S1 Fig (page 10, lines 154-159) and inserted into our manuscript at page 13, line 250.

8. The lab model vs the lab + clinical model include features that are quite different. How do the authors suggest we reconcile these differences and which model should we consider to be superior? Also please define “Blood group (present)”.

Response: The lab and lab + clinical models include the features described in S1 Table. Briefly, the lab + clinical model consists of all features in the lab model plus additional clinical variables. Figure 3 presents an analysis of the top-20 most important features on a model level using the SHapley Additive exPlanations (SHAP) algorithm (Lundberg et al., 2020, Nat Mach Intell). All laboratory features ranked in the top-20 of the lab + clinical model are hence also important in the lab model. The main difference is that several clinical features (e.g. heart rate, oxygen saturation and systolic blood pressure) appear to be important and are therefore ranked amongst the top-20 features.
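For illustration, a minimal sketch of how such a SHAP-based top-20 ranking can be produced for a fitted XGBoost model; it assumes a fitted classifier `model` and a pandas DataFrame `X` of features and is not the authors' actual code.

```python
# Illustrative sketch: SHAP feature importance for a fitted tree-based model.
import numpy as np
import shap

explainer = shap.TreeExplainer(model)                 # efficient explainer for tree ensembles
shap_values = explainer.shap_values(X)                # shape: (n_patients, n_features)

# Rank features by mean absolute SHAP value and keep the 20 most important ones.
importance = np.abs(shap_values).mean(axis=0)
top20 = X.columns[np.argsort(importance)[::-1][:20]]
print(top20.tolist())

# A beeswarm summary plot of the same ranking:
shap.summary_plot(shap_values, X, max_display=20)
```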

In the current comparison, the lab + clinical model slightly outperformed the lab model (AUC 0.84 vs 0.82) and can therefore be considered superior in terms of performance. In a follow-up study (see answer #3), however, we decided to continue with the laboratory model, as automatic, standardized collection of clinical variables is more complex and subject to variability.

We apologize for the unclear definition of “Blood group (present)”. This variable represents whether or not the attending physician ordered a blood group assessment for the patient. As such, this laboratory parameter reflects the physician’s consideration of requesting a blood transfusion. We adjusted this in our manuscript in a revised version of Figure 3 and on page 7, lines 131-132 of the supplemental information.

9. Some terms are not defined in the manuscript, such as GCS. Also thrombocytes should be replaced with “platelet count”.

Response: We thank the reviewer for the suggestion. To clarify our manuscript, we defined several terms, including platelet count (page 15, line 279 in the manuscript; page 5, line 106 and page 7, line 107 in the supplementals; revised Figure 2), Glasgow coma scale (GCS; page 15, line 280) and C-reactive protein (CRP; page 15, lines 285-286).

10. The standard for mortality prediction in sepsis is the SOFA score. There should be a direct comparison with SOFA or at least modified version of the SOFA score (there are several versions) in order to conclude that this may have clinical utility.

Response: We thank the reviewer for the suggestion to include the SOFA score in the comparison we provide with the abbMEDS and mREMS clinical risk scores (Figure 4 in the manuscript). We decided to implement the SOFA score in its original version (Vincent et al., 1996, Intensive Care Medicine), scoring 0 to 4 points for each of the six organ systems. This was included in our introduction (page 4, line 75; page 5, line 97), methods (page 10, lines 213-214), results (page 13, line 254; page 13, line 258; page 15, line 299; page 16, lines 305-308, 308-309, 312-313, 328-331), discussion (page 18, line 346) and supplemental information (page 3, lines 65-69; S5 Table; S4 Fig). The comparison, now including the SOFA score, was updated in a revised version of Figure 4 (depicted in the Response to Reviewers document and the manuscript).

Although the SOFA score performs better than the other clinical risk scores, the machine learning models still outperform all three clinical risk scores. We therefore believe this does not impact the major findings of our current manuscript.

11. There needs to be more of an explanation of the different models and how they should be interpreted.

Response: We assume the reviewer is referring to the different machine learning models explored in the methods section of our manuscript (page 9, lines 160-164). The models and their implementation details were described in S2 supporting information (page 4, lines 70-101). The results of the comparison of the different models were described in S3 Table. To further clarify our models, we extended their description in S2 supporting information (page 4, lines 70-101).

Attachment

Submitted filename: 20201105_Rebuttal - response to reviewers.docx

Decision Letter 3

Ivan Olier

23 Dec 2020

A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis

PONE-D-20-09068R3

Dear Dr. Meex,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Ivan Olier, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #4: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #4: The manuscript is substantially improved and I thank the authors for their efforts. However, the main limitation is still present: namely, that the study was performed in a population that had already been consulted for admission to the hospital for sepsis. I am not sure how this score would be applied clinically in the ED setting.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #4: No

Acceptance letter

Ivan Olier

5 Jan 2021

PONE-D-20-09068R3

A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis

Dear Dr. Meex:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Ivan Olier

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Extended description of clinical criteria and risk scores.

    (DOCX)

    S2 File. Background information on machine learning models reviewed in the current study.

    (DOCX)

    S1 Table. Overview of variables present in the datasets.

    The laboratory dataset consisted exclusively of laboratory variables, together with age, sex and time of request. The laboratory and clinical dataset contained all variables from the laboratory dataset and, additionally, clinical and vital characteristics.

    (DOCX)

    S2 Table. Comparison of baseline statistical and machine learning models for predicting 31-day mortality risk.

    We performed a baseline comparison of statistical and machine learning models (S1 File) for the 31-day mortality prediction task using the laboratory dataset. We used five-fold cross validation to assess model performance. Performance was assessed by area under the receiver operating characteristic curve (AUC) and accuracy. Confidence intervals were calculated using bootstrapping methods (n = 1,000).

    (DOCX)
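    For illustration, a minimal sketch of a percentile bootstrap (n = 1,000 resamples) confidence interval for the AUC, as mentioned in the caption above; the variable names and procedure are assumptions and the authors' exact bootstrap implementation may differ.

```python
# Illustrative sketch: percentile bootstrap CI for the AUC of predicted risks p
# against observed 31-day mortality y.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y, p, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = np.random.default_rng(seed)
    y, p = np.asarray(y), np.asarray(p)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y), len(y))          # resample patients with replacement
        if np.unique(y[idx]).size < 2:                 # skip resamples containing one class only
            continue
        aucs.append(roc_auc_score(y[idx], p[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper
```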

    S3 Table. Hyperparameters of XGBoost models.

    Hyperparameters were based on theoretical reasoning rather than hyperparameter tuning. This was done to prevent overfitting on hyperparameters due to the small sample size. The “Base_score”, “Missing”, “Reg_alpha”, “Reg_lambda” and “Subsample” parameters were the standard values provided by the XGBoost interface. “Max_depth”, “max_delta_step” and “estimators” were values we use internally for these kinds of machine learning models. During the study, hyperparameters were never adjusted to gain performance in our validation dataset.

    (DOCX)
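    For orientation, the sketch below shows how the hyperparameter names listed in the caption map onto the XGBoost scikit-learn interface; the values shown are placeholders or XGBoost defaults, not the values used in S3 Table.

```python
# Illustrative only: mapping of the S3 Table hyperparameter names onto XGBClassifier.
import numpy as np
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=100,        # "estimators" (placeholder value)
    max_depth=3,             # "Max_depth" (placeholder value)
    max_delta_step=1,        # "max_delta_step" (placeholder value)
    base_score=0.5,          # "Base_score" (XGBoost default)
    reg_alpha=0.0,           # "Reg_alpha"  (XGBoost default)
    reg_lambda=1.0,          # "Reg_lambda" (XGBoost default)
    subsample=1.0,           # "Subsample"  (XGBoost default)
    missing=np.nan,          # "Missing": how missing laboratory values are encoded
)
```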

    S4 Table. Extended analysis of correlation between important model features and clinical risk scores.

    To study the correlation between the most important features contributing to model predictions and the clinical criteria (qSOFA and SIRS) and risk scores (abbMEDS and mREMS), we examined whether they appear in both. The top-20 most important features (Fig 3 in the main article) were compared to all criteria in the clinical scores (S1 File). We observe that most of the features present in the clinical criteria and scores are also among the most important features in the lab + clinical machine learning model.

    (DOCX)

    S5 Table. Extended comparison of machine learning models with internal medicine physicians and clinical risk scores.

    In addition to sensitivity and specificity, we evaluated the performance of each group by positive predictive value (PPV), negative predictive value (NPV), accuracy and area under the receiver operating characteristic curve (AUC). Our XGBoost model shows superior performance on each of these metrics, which is in line with the findings presented in the manuscript.

    (DOCX)

    S6 Table. Inter-rater agreement of internal medicine physicians.

    Cohen’s kappa was used to measure the inter-rater agreement between the internal medicine physicians. The level of agreement was interpreted as nil if κ was 0 to 0.20; minimal, 0.21 to 0.39; weak, 0.40 to 0.59; moderate, 0.60 to 0.79; strong, 0.80 to 0.90; and almost perfect, 0.90 to 1.00.

    (DOCX)
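    For illustration, a minimal sketch of how pairwise Cohen's kappa between physicians could be computed; the data structure and names are hypothetical, not the authors' code.

```python
# Illustrative sketch: pairwise Cohen's kappa for physicians' binary predictions.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(predictions_by_physician):
    """Cohen's kappa for every pair of physicians' binary mortality predictions."""
    return {
        (a, b): cohen_kappa_score(predictions_by_physician[a], predictions_by_physician[b])
        for a, b in combinations(predictions_by_physician, 2)
    }

# Hypothetical example: physician id -> 0/1 predictions on the same validation patients.
example = {"A": [1, 0, 0, 1], "B": [1, 0, 1, 1], "C": [0, 0, 0, 1]}
print(pairwise_kappa(example))
```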

    S7 Table. Machine learning comparison to alternating physician groups.

    In each comparison between the machine learning model and the physician group, a single physician was removed from the physician group. In every comparison the machine learning model outperforms the physicians. This analysis shows that the higher performance of the machine learning model was not due to systematic underperformance of a single physician.

    (DOCX)

    S1 Fig. Flow diagram of study inclusion.

    During the study period, 5,967 patients who presented to our emergency department were referred to an internal medicine physician. Of these, 1,420 patients fulfilled two or more SIRS and/or qSOFA criteria. After exclusion of 76 patients, the remaining 1,344 patients were separated into development and validation datasets.

    (DOCX)

    S2 Fig. Five-fold cross validation of diagnostic performance of XGBoost models.

    During each cycle of cross-validation, we assessed predictive performance by area under the receiver operating characteristic curves (AUC). Performance was determined for models trained with laboratory data (A) and models trained with laboratory and clinical data (B) to predict 31-day mortality.

    (DOCX)

    S3 Fig. Five-fold cross validation of calibration of XGBoost models.

    During each cycle of cross-validation, we assessed calibration by calibration curves and their respective Brier scores. Calibration was determined for models trained with laboratory data (A) and models trained with laboratory and clinical data (B).

    (DOCX)

    S4 Fig. Receiver operating characteristic analysis of machine learning model, risk scores and internal medicine physicians.

    Receiver operating characteristics analysis of the lab + clinical machine learning model (AUC: 0.852 [0.783–0.922]), abbMEDS (0.631 [0.537–0.726]), mREMS (0.630 [0.535–0.724]) and internal medicine physicians (mean 0.735 [0.648–0.821]). Internal medicine physicians were depicted as bullets in the ROC analysis.

    (DOCX)

    S5 Fig. Individual performance of internal medicine physicians.

    Predictive performance of all internal medicine specialists (n = 4; 2 experienced consultants in acute internal medicine and 2 experienced residents in acute internal medicine) was assessed by sensitivity (left) and specificity (right). Consultants are depicted in grey and residents in orange.

    (DOCX)

    Attachment

    Submitted filename: Response to reviewers.pdf

    Attachment

    Submitted filename: 20200901_Rebuttal - response to reviewers.pdf

    Attachment

    Submitted filename: 20201105_Rebuttal - response to reviewers.docx

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.

