PLOS ONE. 2020 Jul 13;15(7):e0235835. doi: 10.1371/journal.pone.0235835

Value of laboratory results in addition to vital signs in a machine learning algorithm to predict in-hospital cardiac arrest: A single-center retrospective cohort study

Ryo Ueno 1,2,3,*, Liyuan Xu 4, Wataru Uegami 5, Hiroki Matsui 6, Jun Okui 7, Hiroshi Hayashi 7, Toru Miyajima 7, Yoshiro Hayashi 1, David Pilcher 2, Daryl Jones 2,3
Editor: Wisit Cheungpasitporn
PMCID: PMC7357766  PMID: 32658901

Abstract

Background

Although machine learning-based prediction models for in-hospital cardiac arrest (IHCA) have been widely investigated, it is unknown whether a model based on vital signs alone (Vitals-Only model) can perform similarly to a model that considers both vital signs and laboratory results (Vitals+Labs model).

Methods

All adult patients hospitalized in a tertiary care hospital in Japan between October 2011 and October 2018 were included in this study. Random forest models with/without laboratory results (Vitals+Labs model and Vitals-Only model, respectively) were trained and tested using chronologically divided datasets. Both models use patient demographics and eight-hourly vital signs collected within the previous 48 hours. The primary and secondary outcomes were the occurrence of IHCA in the next 8 and 24 hours, respectively. The area under the receiver operating characteristic curve (AUC) was used as a comparative measure. Sensitivity analyses were performed under multiple statistical assumptions.

Results

Of 141,111 admitted patients (training data: 83,064, test data: 58,047), 338 had an IHCA (training data: 217, test data: 121) during the study period. The Vitals-Only model and Vitals+Labs model performed comparably when predicting IHCA within the next 8 hours (Vitals-Only model vs Vitals+Labs model, AUC = 0.862 [95% confidence interval (CI): 0.855–0.868] vs 0.872 [95% CI: 0.867–0.878]) and 24 hours (Vitals-Only model vs Vitals+Labs model, AUC = 0.830 [95% CI: 0.825–0.835] vs 0.837 [95% CI: 0.830–0.844]). Both models performed similarly well on medical, surgical, and ward patient data, but did not perform well for intensive care unit patients.

Conclusions

In this single-center study, the machine learning model predicted IHCAs with good discrimination. The addition of laboratory values to vital signs did not significantly improve its overall performance.

Introduction

In-hospital cardiac arrests (IHCAs), which are associated with high mortality and long-term morbidity, are a significant burden on patients, medical practitioners, and public health [1]. To achieve a favorable outcome, the prevention and early detection of IHCA have been shown to be essential [2,3]. Up to 80% of patients with IHCA have signs of deterioration in the eight hours before cardiac arrest [4,5], and various early warning scores based on vital signs have been developed [6–11]. The widespread implementation of electronic health records enables large datasets of laboratory results to be used in the development of early warning scores [12–15]. Recently, automated scores using machine learning models with and without laboratory results have been widely investigated, and both have achieved promising results [16–18].

However, it is unknown whether a model based on vital signs alone (Vitals-Only model) performs similarly to a model that incorporates both vital signs and laboratory results (Vitals+Labs model). Encouragingly, prior studies that used both vital signs and laboratory results report that, within each model, vital signs are more predictive of IHCA than laboratory results [16,19]. Physiologically, changes in vital signs may be more dynamic and occur earlier than changes in laboratory results [4,5]. Computationally, the amount of vital-sign data is likely to be much larger than that of laboratory results: vital signs are less invasive and easier to obtain, and there are many more opportunities to collect them than laboratory results. For these reasons, we hypothesized that a Vitals-Only model may perform similarly to a Vitals+Labs model in the prediction of IHCA.

We were motivated to perform this study because a Vitals-Only model that performs similarly to a Vitals+Labs model would be of clinical importance for the following reasons. First, a simpler model with fewer input variables can be adopted in a wider variety of settings. Vital signs can be obtained almost anywhere, potentially even in a patient's home given the development of telemetry and wearable devices, whereas laboratory results might not be available in all instances. A model that requires complex inputs, such as biochemistry, arterial blood gas, or imaging results, may not be usable in some hospitals, especially in low-resource settings. Second, a simple model can be externally validated and more easily calibrated to different healthcare systems. Third, the Vitals-Only model would be non-invasive and economically feasible, because it does not require laboratory tests, which may be physically stressful and financially burdensome for patients. Fourth, from a computational point of view, a minimal model with low data dimensionality is preferable to a more complicated model as long as its performance is similar.

We hypothesized that the occurrence of an IHCA within 8 h can be predicted from vital signs alone without the need for additional laboratory tests. To assess this hypothesis, we conducted a single-center retrospective research project.

Methods

Study design and setting

We conducted this single-center retrospective cohort study at Kameda Medical Center, a 917-bed tertiary teaching hospital in a rural area of Japan, which includes 16 intensive care unit (ICU) beds. The hospital has a cardiac arrest team, consisting of staff anesthetists, emergency physicians, and cardiologists, that attends all IHCAs. A record of the resuscitation is entered in the code blue registry immediately after an event. The hospital also has a rapid response team (RRT), comprising an ICU senior doctor and an ICU nurse, who review patients when requested by the ward nurse. The RRT has 90–100 activations per year.

Since 1995, Kameda Medical Center has used an electronic medical record system, which collects patient information such as patient demographics, vital signs, and laboratory results. The data for this study were taken from this system.

This study was reviewed and approved by the Institutional Review Board of Kameda Medical Center (approval number: 18-004-180620). The committee waived the requirement for informed consent because of the retrospective design of the study. In addition, this study follows the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline for prognostic studies [20].

Study population

We included all adult patients (age ≥ 18 years) admitted for more than 24 h between 20 October 2011 and 31 October 2018. Both ward and ICU patients were included, but emergency department patients were excluded unless they stayed in the hospital for more than 24 h. We collected the following data: demographic data on admission (i.e. age, sex, body mass index [BMI], elective or emergency admission, and department of admission), eight-hourly vital signs (systolic blood pressure, diastolic blood pressure, heart rate, respiratory rate, temperature, oxygen saturation, and urinary output), and daily laboratory results (see Fig 1 for details). All IHCA patients were identified from the code blue registry of Kameda Medical Center. Three doctors (JO, TM, and HH) retrospectively reviewed all the records and confirmed each IHCA event and the time of day it occurred.

Fig 1. Overview of the input variables for both models predicting IHCA.


Six sets of eight-hourly vital signs from the last 48 hours were used in both models. The two most recent sets of laboratory results from the last seven days were used in the Vitals+Labs model. Demographic data remain static, whereas vital signs, laboratory results, and the outcome are tracked at regular intervals. Abbreviations: Alb, albumin; APTT, activated partial thromboplastin time; AST, aspartate aminotransferase; ALT, alanine aminotransferase; ALP, alkaline phosphatase; BMI, body mass index; BNP, brain natriuretic peptide; CK, creatine kinase; CKMB, creatine kinase–muscle/brain; Cre, creatinine; CRP, C-reactive protein; eGFR, estimated glomerular filtration rate; GGT, gamma-glutamyl transferase; Glu, glucose; Hb, haemoglobin; Hct, haematocrit; IHCA, in-hospital cardiac arrest; LD, lactate dehydrogenase; MCH, mean corpuscular haemoglobin; MCHC, mean corpuscular haemoglobin concentration; MCV, mean corpuscular volume; Plt, platelets; PT, prothrombin time; PT-INR, prothrombin time international normalized ratio; RBC, red blood cells; TP, total protein; T-Bil, total bilirubin; WBC, white blood cells.
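As an illustration of this input layout, the following minimal sketch assembles one prediction-time feature vector from the six most recent eight-hourly vital-sign sets and, for the Vitals+Labs model, the two most recent laboratory panels. The column names and helper function are hypothetical and are not taken from the study code (which is available on GitHub); the sketch assumes one pandas DataFrame of timestamped vital signs and one of laboratory results per patient.

```python
import pandas as pd

# Hypothetical feature assembly for one prediction time, mirroring Fig 1:
# six 8-hourly vital-sign sets from the last 48 h plus (for the Vitals+Labs model)
# the two most recent lab panels from the last 7 days. Column names are illustrative.
VITALS = ["sbp", "dbp", "hr", "rr", "temp", "spo2", "urine_output"]
LABS = ["wbc", "hb", "plt", "cre", "crp"]  # abbreviated; the study used many more analytes

def build_features(vitals_df, labs_df, demo, t_pred, with_labs=True):
    """Return a flat feature dict for a single patient at prediction time t_pred."""
    feats = dict(demo)  # age, sex, BMI, elective/emergency, admitting department

    # Six 8-hourly vital-sign sets within the preceding 48 hours (newest first).
    window = vitals_df[(vitals_df["time"] > t_pred - pd.Timedelta(hours=48))
                       & (vitals_df["time"] <= t_pred)].sort_values("time").tail(6)
    for i, (_, row) in enumerate(window.iloc[::-1].iterrows()):
        for v in VITALS:
            feats[f"{v}_t-{i}"] = row.get(v)

    if with_labs:
        # Two most recent laboratory panels within the preceding 7 days.
        recent = labs_df[(labs_df["time"] > t_pred - pd.Timedelta(days=7))
                         & (labs_df["time"] <= t_pred)].sort_values("time").tail(2)
        for i, (_, row) in enumerate(recent.iloc[::-1].iterrows()):
            for lab in LABS:
                feats[f"{lab}_t-{i}"] = row.get(lab)
    return feats
```

Missing sets or values in such a vector would be handled by the imputation rules described under Statistical methods.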

Prediction outcome measures

The primary outcome predicted by our model was IHCA within the next 8 h. The secondary outcome predicted by our model was IHCA within the next 24 h. All the ‘expected’ cardiac arrests (such as cardiac arrests in palliative care patients) without code blue responses were excluded.

Algorithm selection

We used a random forest model, a nonparametric machine learning approach that has been shown to outperform other algorithms without requiring standardization or log-transformation of the input data [16,21,22]. We chose this model for the following reasons. First, a random forest allows us to account for non-linear relationships between input variables. Vital signs are known to differ among age groups [23], and this approach enables the model to learn such age-dependent relationships more precisely. Second, random forest models offer relative explainability through the provision of 'feature importance'. This is a scaled measure that indicates the weighted contribution of each variable to the overall prediction; it is scaled so that the most important variable is given a maximum value of 100, with less important variables given lower values. While this allows comparison of the relative importance of variables within a model, it does not allow comparison of the relative contribution of the same variables between different models.
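For readers unfamiliar with the approach, a minimal sketch of fitting a random forest classifier and rescaling its feature importances in the manner described above might look like the following. It uses scikit-learn purely for illustration; the hyperparameters, and the variables X_train, y_train, and feature_names (e.g. the output of a feature-assembly step such as the one sketched earlier), are hypothetical and do not reflect the study's actual implementation.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative only: fit a random forest on the assembled feature matrix and
# rescale the impurity-based importances so that the top variable scores 100.
clf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                             n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)          # y_train: 1 if IHCA occurred within the horizon, else 0

importance = clf.feature_importances_
scaled = 100 * importance / importance.max()
ranking = sorted(zip(feature_names, scaled), key=lambda kv: -kv[1])[:20]
for name, score in ranking:
    print(f"{name:30s} {score:6.1f}")
```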

Statistical methods

Patient data were divided into training data and test data according to admission date (training data dates: 20 October 2011 to 31 December 2015; test data dates: 1 January 2016 to 31 October 2018) [17]. As summarized in Fig 1, predictions were made every 8 h using patient demographics and eight-hourly vital signs in the last 48 h in the Vitals-Only model. The Vitals+Labs model was the same as the Vitals-Only model but with the addition of two sets of laboratory results obtained in the last seven days, as previously reported [17,24]. As shown in Fig 2, an ensemble model of trees was developed using training data after bagging, and then the model was validated using the test dataset. (See the S1 File for further details.)
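A minimal sketch of the chronological split and the eight-hourly labelling scheme described above is shown below. The DataFrame and column names, and the exact prediction schedule within an admission, are hypothetical assumptions; the published code should be consulted for the actual procedure.

```python
import pandas as pd

SPLIT_DATE = pd.Timestamp("2016-01-01")

# Chronological split: admissions before 2016 train the model, later admissions test it.
train_adm = admissions[admissions["admit_time"] < SPLIT_DATE]
test_adm = admissions[admissions["admit_time"] >= SPLIT_DATE]

def label_windows(adm_row, arrest_times, horizon_h=8):
    """One row per 8-hourly prediction time; label = IHCA within the horizon (assumed schedule)."""
    rows = []
    t = adm_row["admit_time"]
    while t < adm_row["discharge_time"]:
        label = any(t < a <= t + pd.Timedelta(hours=horizon_h) for a in arrest_times)
        rows.append({"patient_id": adm_row["patient_id"], "t_pred": t, "ihca": int(label)})
        t += pd.Timedelta(hours=8)
    return rows
```

Setting horizon_h to 24 would yield the labels for the secondary outcome.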

Fig 2. Architectural overview of data extraction and representation.


We measured the prediction performance of each model by computing the following values: (1) the C statistic (i.e. the area under the receiver operating characteristic [ROC] curve); (2) the prospective prediction results (i.e. sensitivity, specificity, positive predictive value, negative predictive value, positive likelihood ratio, and negative likelihood ratio) at Youden’s index; and (3) the calibration curve. To gain insight into the contribution of each predictor to our model, we calculated their feature importance with respect to the primary outcome.
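These performance measures can be computed with standard tools. The following sketch, using scikit-learn purely for illustration, shows the AUC, the operating point at Youden's index, and a simple observed-versus-predicted calibration curve; clf, X_test, and y_test are assumed to come from a fitted model and held-out test set as above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.calibration import calibration_curve

y_prob = clf.predict_proba(X_test)[:, 1]       # predicted risk of IHCA within the horizon

# (1) Discrimination: area under the ROC curve (C statistic).
auc = roc_auc_score(y_test, y_prob)

# (2) Operating point at Youden's index (sensitivity + specificity - 1 maximized).
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
best = np.argmax(tpr - fpr)
threshold = thresholds[best]
y_pred = (y_prob >= threshold).astype(int)     # basis for PPV, NPV, PLR, NLR

# (3) Calibration: observed event rate per bin of predicted probability.
obs, pred = calibration_curve(y_test, y_prob, n_bins=10, strategy="quantile")
```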

Sensitivity analyses were planned a priori under various statistical assumptions. To compare the Vitals-Only model with the Vitals+Labs model in various populations, we tested the performance of our models in patient subsets (i.e. medical admission or surgical admission; ward admission or ICU admission). To assess the impact of missing data, we repeated the primary analysis with different imputation methods (see the S1 File for further details).

Values that were clearly errors (e.g. systolic blood pressure >300 mmHg) were removed, as described in [17,25] (see the S1 File for further details). No other preprocessing (e.g. normalization or log-transformation) of the dataset was performed. Missing values were imputed with the patient's last measured value for that feature, or with the median value of the entire sample if a patient had no previous values, as described in [14,15,26]. If more than 50% of the data for a particular vital sign or laboratory result were missing in the entire dataset, the feature was converted to a binary value (1 denotes a measured value and 0 denotes a missing value) [27]. Several alternative imputation methods were applied as part of the sensitivity analyses (see the S1 File for further details). Categorical variables are expressed as percentages, whereas continuous variables are described as means (± standard deviation, SD) or medians (with interquartile ranges, IQRs). The analyses were performed using Python 3.7.0, and we have made the analysis code publicly available (https://github.com/liyuan9988/Automet).
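A minimal sketch of these imputation rules is shown below, assuming a pandas DataFrame with one row per prediction time per patient; the column names are hypothetical, and the exact handling (e.g. computing medians on the training set only) is documented in the published code.

```python
import pandas as pd

def impute(df, feature_cols, max_missing_frac=0.5):
    """Last-value-carried-forward per patient, then sample medians; features with
    more than 50% missing values are replaced by a measured/missing indicator."""
    df = df.sort_values(["patient_id", "t_pred"]).copy()
    for col in feature_cols:
        if df[col].isna().mean() > max_missing_frac:
            # Too sparse: keep only the fact that the value was (not) measured.
            df[col] = df[col].notna().astype(int)
        else:
            df[col] = df.groupby("patient_id")[col].ffill()   # carry last value forward
            df[col] = df[col].fillna(df[col].median())        # fall back to the sample median
    return df
```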

Results

Details of the patient cohort and IHCAs

A total of 143,190 admissions of adult patients were recorded during the study period. After 2,069 admissions of less than 24 h were excluded, the remaining 141,111 admissions were used in the analysis. Among these admissions, 338 IHCAs were recorded. Patient characteristics were similar in the training and test data (Table 1). The percentage of missing data for each variable is summarized in S1 Table in S1 File; 11/130 (8.5%) variables with >50% missing values were converted into binary values following the rule described in the Methods section. As summarized in S2 Table in S1 File, patients who suffered an IHCA had characteristics different from those who did not. Of note, IHCA patients were older, more frequently male, and more often admitted with non-surgical conditions. Of the IHCA patients, almost 40% were admitted to the Cardiology department. The next most common admission departments were Hematology and General Internal Medicine, each accounting for approximately 10% of the IHCA patients.

Table 1. Characteristics of study population.

Comparison of Training/Test data
  Training Test
Study Period Oct 2011–Dec 2015 Jan 2016–Oct 2018
Total Admissions, n 83,064 58,047
Age, y, median (IQR) 64 (54; 77) 65 (56; 77)
Male sex, n (%) 41,868 (50) 29,210 (50)
BMI, kg/m2, median (IQR) 22.9 (20.5; 25.4) 23.1 (20.7; 25.7)
Surgical Patients, n (%) 41,337 (50) 29,491 (51)
Emergency Admission, n (%) 17,853 (21) 14,673 (25)
Patient with IHCA, n (%) 217 (0.3) 121 (0.2)
Patient with In-Hospital Death, n (%) 2,577 (3.1) 1,729 (3.0)
Comparison of Patients with/without IHCA
Training Test
IHCA non-IHCA IHCA non-IHCA
Total, n 217 82847 121 57926
Male, n (%) 142 (65.4) 41726 (50.4) 77 (63.6) 29133 (50.3)
Age, median (IQR) 75 (66; 81) 67 (54; 77) 72 (63; 81) 68 (56; 77)
BMI, median (IQR) 24 (21; 26) 23 (21; 25) 24 (20; 28) 23 (21; 26)
Emergency admission, n (%) 63 (29.0) 17790 (21.5) 27 (22.3) 14646 (25.3)
Surgical admission, n (%) 50 (23.0) 41287 (49.8) 24 (19.8) 29467 (50.9)
History of previous IHCA in the same admission, n (%) 26 (12.0) 15 (0.0) 14 (11.6) 11 (0.0)
In-hospital death, n (%) 164 (75.6) 2413 (2.9) 88 (72.7) 1641 (2.8)
Primary Admission Department, n (%)        
Breast Surgery 0 (0.0) 3592 (4.3) 0 (0.0) 2710 (4.7)
Cardiology 96 (44.2) 6556 (7.9) 46 (38.0) 5299 (9.1)
Cardiovascular Surgery 14 (6.5) 1412 (1.7) 4 (3.3) 768 (1.3)
Dermatology 0 (0.0) 136 (0.2) 0 (0.0) 68 (0.1)
Emergency Medicine 2 (0.9) 1758 (2.1) 6 (5.0) 709 (1.2)
Endocrinology 0 (0.0) 385 (0.5) 0 (0.0) 307 (0.5)
ENT 3 (1.4) 1594 (1.9) 1 (0.8) 1287 (2.2)
Gastroenterology 11 (5.1) 11282 (13.6) 4 (3.3) 6082 (10.5)
General Internal Medicine 13 (6.0) 3758 (4.5) 15 (12.4) 3902 (6.7)
General Surgery 13 (6.0) 6161 (7.4) 4 (3.3) 4387 (7.6)
Hematology 14 (6.5) 1850 (2.2) 13 (10.7) 1594 (2.8)
Infectious Disease 0 (0.0) 47 (0.1) 0 (0.0) 91 (0.2)
Nephrology 14 (6.5) 2418 (2.9) 4 (3.3) 1449 (2.5)
Neurology 3 (1.4) 2747 (3.3) 5 (4.1) 1668 (2.9)
Neurosurgery 1 (0.5) 1812 (2.2) 2 (1.7) 1145 (2.0)
Obstetrics and Gynecology 3 (1.4) 9080 (11.0) 0 (0.0) 5107 (8.8)
Oncology 9 (4.1) 3387 (4.1) 4 (3.3) 1447 (2.5)
Ophthalmology 0 (0.0) 2206 (2.7) 0 (0.0) 1562 (2.7)
Oral and Maxillofacial Surgery 0 (0.0) 1110 (1.3) 2 (1.7) 648 (1.1)
Orthopedics 6 (2.8) 3070 (3.7) 3 (2.5) 2312 (4.0)
Palliative Medicine 0 (0.0) 3 (0.0) 0 (0.0) 1 (0.0)
Pediatrics 0 (0.0) 3 (0.0) 0 (0.0) 2 (0.0)
Plastic Surgery 2 (0.9) 1052 (1.3) 0 (0.0) 814 (1.4)
Psychiatry 0 (0.0) 732 (0.9) 0 (0.0) 390 (0.7)
Pulmonology 6 (2.8) 4868 (5.9) 4 (3.3) 3984 (6.9)
Rehabilitation 0 (0.0) 1306 (1.6) 0 (0.0) 844 (1.5)
Rheumatology 1 (0.5) 1936 (2.3) 0 (0.0) 1055 (1.8)
Spinal Surgery 1 (0.5) 2082 (2.5) 0 (0.0) 1354 (2.3)
Sports medicine 0 (0.0) 720 (0.9) 1 (0.8) 675 (1.2)
Thoracic Surgery 2 (0.9) 1135 (1.4) 2 (1.7) 1201 (2.1)
Urology 3 (1.4) 4649 (5.6) 1 (0.8) 5064 (8.7)

Data are n (%) or median (IQR)

Predictive ability of the Vitals-Only model and Vitals+Labs model

As summarized in Table 2, the Vitals-Only model (AUC = 0.862 [95% confidence interval (CI): 0.855–0.868]) and Vitals+Labs model (0.872 [95% CI: 0.867–0.878]) had similar performance in predicting the occurrence of an IHCA in the next 8 h. In addition, similar results were obtained for the prediction of IHCA in the next 24 h (Vitals-Only model vs Vitals+Labs model, AUC = 0.830 [95% CI: 0.825–0.835] vs 0.837 [95% CI: 0.830–0.844]). At Youden’s index, both models achieved similar prospective prediction results for both positive and negative prediction values (Vitals-Only model vs Vitals+Labs model, positive prediction value = 0.035 [95% CI: 0.029–0.041] vs 0.044 [95% CI: 0.035–0.053], negative prediction value = 0.998 [95% CI: 0.997–0.998] vs 0.997 [95% CI: 0.997–0.998]).

Table 2. Predictive performance of each model for the in-hospital cardiac arrest in the next 0–8 hours.

Model  AUC (95%CI) PPV (95%CI) NPV (95%CI) Sensitivity (95%CI) Specificity (95%CI) PLR (95%CI) NLR (95%CI)
Vitals-Only model 0.862 [0.855; 0.868] 0.035 [0.029; 0.041] 0.998 [0.997; 0.998] 0.817 [0.754; 0.880] 0.772 [0.716; 0.827] 3.58 [2.92; 4.25] 0.238 [0.172; 0.304]
Vitals+Labs model 0.872 [0.867; 0.878] 0.044 [0.035; 0.053] 0.997 [0.997; 0.998] 0.770 [0.719; 0.821] 0.830 [0.781; 0.879] 4.52 [3.56; 5.50] 0.277 [0.229; 0.325]

Abbreviations: 95%CI, 95% confidence interval; AUC, area under the receiver operating characteristic curve; PPV, positive predictive value; NPV, negative predictive value; PLR, positive likelihood ratio; NLR, negative likelihood ratio

Calibration plot

As shown in Fig 3, the calibration curves of both models were similarly far from the diagonal of the calibration plot (Vitals-Only model vs Vitals+Labs model, Hosmer–Lemeshow C statistic = 8592.40 vs 9514.76, respectively). For the highest risk group, the observed occurrence of IHCA was 20%–30% in both models.

Fig 3. Calibration plot.


The x-axis shows the predicted probability of having a cardiac arrest in the next 8 hours, whereas the y-axis shows the observed proportion of patients who had a cardiac arrest in the next 8 hours. The histogram below the calibration plot summarizes the distribution of predicted probabilities across all patients in our dataset. Abbreviations: IHCA, in-hospital cardiac arrest.

Sensitivity analysis with different models

We performed several sensitivity analyses under various statistical assumptions. Our results were unchanged when we applied the models to medical patients (Vitals-Only model vs Vitals+Labs model, AUC = 0.869 [95% CI: 0.862–0.877] vs 0.876 [95% CI: 0.870–0.882]) or surgical patients (Vitals-Only model vs Vitals+Labs model, AUC = 0.806 [95% CI: 0.797–0.816] vs 0.825 [95% CI: 0.802–0.848]). Likewise, the Vitals-Only model was not inferior to the Vitals+Labs model among ward patients (Vitals-Only model vs Vitals+Labs model, AUC = 0.879 [95% CI: 0.871–0.886] vs 0.866 [95% CI: 0.858–0.874]).

However, the discrimination of both model types was poor when applied to the ICU population. The Vitals+Labs model outperformed Vitals-Only model for ICU patients (Vitals-Only model vs Vitals+Labs model, AUC = 0.580 [95% CI: 0.571–0.590] vs 0.648 [95% CI: 0.635–0.661]; Table 3). Finally, our results were similar regardless of the type of imputation. The Vitals-Only model had a performance similar to that of the Vitals+Labs model with AUCs ranging between 0.80 and 0.90 for all four imputation methods (S3 Table in S1 File).

Table 3. Predictive performance of each model in sets of sensitivity analyses.

AUC (95%CI) PPV (95%CI) NPV (95%CI) Sensitivity (95%CI) Specificity (95%CI) PLR (95%CI) NLR (95%CI)
Medical
Vitals-Only 0.869 [0.862; 0.877] 0.047 [0.034; 0.060] 0.997 [0.996; 0.998] 0.803 [0.714; 0.892] 0.783 [0.701; 0.865] 3.70 [2.63; 4.79] 0.251 [0.160; 0.342]
Vitals+Labs 0.876 [0.870; 0.882] 0.052 [0.039; 0.066] 0.997 [0.996; 0.998] 0.796 [0.733; 0.859] 0.809 [0.746; 0.871] 4.16 [3.01; 5.34] 0.253 [0.193; 0.313]
Surgical
Vitals-Only 0.806 [0.797; 0.816] 0.020 [0.016; 0.025] 0.998 [0.998; 0.998] 0.708 [0.661; 0.754] 0.819 [0.768; 0.871] 3.91 [3.00; 4.82] 0.357 [0.316; 0.398]
Vitals+Labs 0.825 [0.802; 0.848] 0.018 [0.011; 0.026] 0.998 [0.998; 0.999] 0.707 [0.604; 0.811] 0.795 [0.711; 0.879] 3.45 [2.06; 4.87] 0.368 [0.278; 0.458]
ICU
Vitals-Only 0.580 [0.571; 0.590] 0.346 [0.338; 0.353] 0.905 [0.881; 0.930] 0.889 [0.848; 0.930] 0.384 [0.342; 0.426] 1.44 [1.39; 1.49] 0.288 [0.209; 0.372]
Vitals+Labs 0.648 [0.635; 0.661] 0.382 [0.331; 0.434] 0.859 [0.775; 0.944] 0.725 [0.472; 0.977] 0.560 [0.300; 0.821] 1.65 [1.32; 2.04] 0.492 [0.178; 0.874]
Ward
Vitals-Only 0.879 [0.871; 0.886] 0.022 [0.018; 0.026] 0.999 [0.998; 0.999] 0.81 [0.752; 0.868] 0.791 [0.739; 0.844] 3.88 [3.13; 4.63] 0.240 [0.180; 0.300]
Vitals+Labs 0.866 [0.858; 0.874] 0.024 [0.017; 0.032] 0.998 [0.998; 0.999] 0.782 [0.705; 0.859] 0.814 [0.738; 0.890] 4.21 [2.85; 5.59] 0.268 [0.196; 0.340]

Abbreviations: 95%CI, 95% confidence interval; AUC, area under the receiver operating characteristic curve; PPV, positive predictive value; NPV, negative predictive value; PLR, positive likelihood ratio; NLR, negative likelihood ratio

Discussion

Summary of key findings

In this retrospective study of 141,111 admissions, we compared two prediction models for IHCA: one using vital signs and patient background only, and one using the same information plus laboratory results. The Vitals-Only model yielded performance similar to that of the Vitals+Labs model in all analyses except for ICU patients, for whom discrimination was poor with both models.

Relationship with prior literature

Prior studies have extensively investigated the importance of early prediction of IHCA [2–5]. It is known that monitored or witnessed IHCAs have more favorable outcomes than unmonitored or unwitnessed events [2,3]. Supporting the need for preventive monitoring, prior studies found that clinical deterioration is common prior to cardiac arrest [4,5]. Based on these findings, various early warning scores have been developed, ranging from analog models based on vital signs to digital scoring systems using both vital signs and laboratory results [6–15]. Recently, a variety of early warning scores have utilized machine learning to account for the non-linear relationships among input variables [16–18].

Various studies on models that use both vital signs and laboratory results have reported that vital signs are more predictive of IHCA than laboratory results. In a study by Churpek and colleagues, the top five most predictive variables for a composite outcome including IHCA were all vital signs [16]. A similar result was obtained in [19]. Our findings are consistent with those studies in that heart rate, blood pressure, and age were given higher weights in our model. However, our model was unable to learn the importance of respiratory rate because of the high rate of missing respiratory rate data in our dataset (Fig 4). Similar to our study, a few studies have investigated the change in predictive performance using different sets of input features. Churpek and colleagues reported a prediction model for IHCA using both vital signs and laboratory results [15]. Their model performed better than their previously published model, which was based on vital signs only [9]. However, those models were developed and validated in different cohorts with different methodologies. In a short communication by Kellett and colleagues, a score based solely on vital signs was more predictive of in-hospital mortality than other scores using both vital signs and laboratory results [28]. To date, the best-performing model for IHCA (AUC = 0.85), by Kwon and colleagues, was built solely from vital signs. The authors argued that the addition of laboratory results should improve performance, but this has not yet been confirmed [17].

Fig 4. Feature importance in predicting subsequent occurrence of IHCA.


Importance of each predictor in the two random forest models (Vitals-Only model and Vitals+Labs model). The 20 most important variables are shown. Abbreviations: APTT, activated partial thromboplastin time; BUN, blood urea nitrogen; CRP, C-reactive protein; LD, lactate dehydrogenase; PT, prothrombin time.

Interpretation of the results

In our dataset, both the Vitals-Only and Vitals+Labs models were capable of predicting early IHCA (within the next 8 h) and late IHCA (within the next 24 h). There are three possible reasons for this result. First, physiologically, eight-hourly vital signs obtained in the last 48 h might reflect acute deterioration more sensitively than laboratory results obtained within the last seven days. Second, doctors and nurses might have intervened to treat patients with abnormal laboratory results and thereby prevented physiological deterioration; as a result, such deterioration might not have resulted in IHCA [29], which could be one reason why abnormal laboratory results did not directly lead to IHCA. However, our data lack this treatment information, so we are unable to verify this supposition conclusively. Third, our model might have failed to learn from the laboratory results. However, laboratory results, which have been clinically and theoretically shown to be associated with IHCA [13,15,30], have high feature importance in the model (Fig 4), indicating that they were successfully learned. In addition, the newer variables (i.e. those measured closer to the time of prediction) of both vital signs and laboratory results have heavier weights in the models than older variables, which also indicates successful feature learning.

Regardless of whether the admission was classified as medical or surgical, we did not observe a notable difference in the performances of the Vitals-Only model and Vitals+Labs model. This result is consistent with a prior study that compares the performance of the National Early Warning Score among medical and surgical patient populations [31]. However, our model did not predict IHCA well in ICU populations. We believe this is a result of the continuous vital-sign monitoring and more frequent interventions in ICU. Even with the same abnormal vital signs, ICU patients are more likely to receive clinical interventions to prevent IHCA than are ward patients. Given the heterogeneity of patient backgrounds, interventions, and the amount of available data, a tailored prediction model should be developed specifically for ICU patients in future trials.

Strengths

Our study has several strengths. First, our model aims to predict unexpected IHCAs rather than a combined outcome or surrogate outcome. Unlike other outcomes such as ICU transfer, unexpected IHCA is always objective and always necessitates clinical intervention.

Second, to obtain a straightforward interpretation of the prediction, we used a classification model rather than a time-to-event model. Classification models were shown to outperform time-to-event models in predicting hospital deterioration in [22].

Third, all imputation methods assessed in this study are prospectively implementable. A prior study showed that the timing of each laboratory test itself has predictive value [27]. Our model used this information for variables with >50% of the data missing, thereby maximizing the utilization of the available data.

Limitations and future work

There are several limitations to our study. First, this study was conducted in a single center. Our results might be biased by the patient backgrounds and clinical practice of a tertiary care center. If the frequency or methodology of vital-sign measurements or laboratory tests differs in other hospitals, their predictive values might also differ. Even though our model shares many of the predictive characteristics identified in prior studies of clinical deterioration [15,19], further calibration and validation of our results in different clinical settings are warranted.

Second, we were unable to obtain any data regarding treatment. Theoretically, the patient management (e.g. treatment practice and staffing levels) could have changed over the study period. Such information might have improved the model performance, but we were unable to obtain it.

Third, some of our input features were missing. For example, various studies have stressed the importance of respiratory rate, which was mostly missing in our dataset and hence was converted into binary values. However, data will inevitably be missing when this model is used in a real-world setting. In addition, our results were robust across the various imputation methods tested; we therefore believe our approach is appropriate for developing and validating a model for bedside use.

Fourth, a low positive predictive value is a universal issue in prediction models of rare events such as IHCA (prevalence 0.4%) [25,32]. Although false alarms may be more acceptable than missed IHCAs, each hospital needs to find its own balance between false alarms and overlooking IHCA patients. In this study, we focused on a comparison of the Vitals-Only model and Vitals+Labs model at a widely investigated threshold (i.e. Youden's index), rather than on providing a highly predictive model. A different threshold and more sophisticated feature engineering might aid in developing a model with a higher positive predictive value. In addition, it would be worth investigating the outcomes of 'false positive' patients in future studies: patients flagged as being at risk of cardiac arrest who do not arrest could still be genuinely 'at risk', and flagging them may facilitate early intervention such as a rapid response team review.

Fifth, we used eight-hourly discrete vital signs rather than a continuous dataset. This approach reflects the clinical practice in our hospital (vital signs are measured three times a day in most of our patients) and prior studies [15,33]. Sometimes, the timing and frequency of vital-sign measurement reflects the concern of medical professionals [34]. Hence, the addition of such information might improve the performance of the model.

Sixth, the aim of our model is discrimination rather than calibration. That is, the aim of this research was to compare the discrimination performance of two models rather than to provide a well-calibrated risk score. However, if we were to implement the Vitals-Only model in an actual clinical setting, a well-calibrated model might be highly valuable for clinical decision-making, because it would enable clinicians to interpret the predicted probability as a risk score. Although various clinical tests still focus solely on discrimination (e.g. pregnancy tests and fecal occult blood tests), they should ideally all be calibrated. As shown in Fig 3, both models were similarly far from the diagonal of the calibration curve. It is promising that the highest risk group had a 20%–30% occurrence of IHCA, but future studies should investigate more sophisticated calibration of the Vitals-Only model before its clinical application.

Seventh, the random forest model is only partially interpretable. It provides feature importances, which show the weight of each variable in the model and enable us to assess the clinical validity of the model's decision process. However, the random forest model does not provide the reason why a particular patient was flagged as likely to have an IHCA in the next 24 h. If a clinician were keen to obtain such information, our model would remain a 'black box'. In addition, different clinicians could interpret a flag differently if the model provides no reason for the alert. Despite these drawbacks, we used the random forest model for the performance reasons summarized in the Methods. Moreover, for an event such as an IHCA, a clinician's response to an alert may be quite simple: regardless of its cause, a flag will necessitate a clinical review, and clinicians will be able to synthesize all the available information for clinical decision-making. In such a process, the most important part of the alert is the act of alerting the clinician rather than providing a reason for the alert. Moreover, some cardinal investigations in medicine are 'black boxes' in nature. Not all clinicians understand why a certain saddle-shaped waveform in an electrocardiogram is highly associated with a fatal congenital arrhythmia, but all emergency physicians rely on this waveform in their daily practice. Ultimately, the tradeoff between a model's performance and its interpretability will depend on the medical professionals at the bedside. In future studies, such factors will be important to consider before the Vitals-Only model can be applied in clinical practice.

Eighth, we did not benchmark the Vitals-Only model against current clinical practice. This study focuses on the comparison of the Vitals-Only and Vitals+Labs models rather than on providing a single model for clinical use. However, it will be essential to provide appropriate benchmarks in order to assess the model's clinical utility.

Finally, as with all prediction models, we do not yet know whether our prediction model would actually improve the trajectory of patients at risk, and there is currently little evidence of the reliability, validity, and utility of such systems in clinical use. However, our results are persuasive enough to warrant a prospective validation of the Vitals-Only model.

Implications of this research

While it is important to achieve the best performance by using all the data available, it is also important to focus on developing a simple model for better generalizability [35,36]. We hope our results will stimulate further investigations into and implementations of such a model.

Conclusions

In this single-center retrospective cohort study, the addition of laboratory results to a patient’s vital signs did not increase the performance of a machine-learning-based model for predicting IHCA. The prediction of IHCAs for patients in the ICU was found to be unreliable. However, the simpler Vitals-Only model performed well enough on other patient types to merit further investigation.

Supporting information

S1 File

(DOCX)

Acknowledgments

We sincerely appreciate Mr Hiroshi Harada, Mr Masahiko Sato, and all the other members of the Data Science Team at Kameda Medical Center for their tremendous help in data collection.

Notation of prior abstract publication/presentation

The preliminary results of this study were presented at the International Society for Rapid Response Systems 2019 Annual General Meeting in Singapore and World Congress of Intensive Care in Melbourne.

Data Availability

Data cannot be shared publicly because of confidentiality restrictions. Data are available from the Kameda Medical Centre Ethics Committee for researchers who meet the criteria for access to confidential data. Please contact the corresponding author and/or the medical centre ethics committee (contact via clinical_research@kameda.jp).

Funding Statement

RU is supported by the Masason Foundation (MF) and has received a grant from MF. MF has not contributed to the study design, collection, management, analysis, and interpretation of data; the manuscript preparation; or the decision to submit the report for publication.

References

  • 1. Andersen LW, Holmberg MJ, Berg KM, Donnino MW, Granfeldt A. In-Hospital Cardiac Arrest. JAMA 2019;321:1200. doi: 10.1001/jama.2019.1696
  • 2. Brady WJ, Gurka KK, Mehring B, Peberdy MA, O’Connor RE, American Heart Association’s Get With the Guidelines Investigators. In-hospital cardiac arrest: impact of monitoring and witnessed event on patient survival and neurologic status at hospital discharge. Resuscitation 2011;82:845–52. doi: 10.1016/j.resuscitation.2011.02.028
  • 3. Larkin GL, Copes WS, Nathanson BH, Kaye W. Pre-resuscitation factors associated with mortality in 49,130 cases of in-hospital cardiac arrest: A report from the National Registry for Cardiopulmonary Resuscitation. Resuscitation 2010;81:302–11. doi: 10.1016/j.resuscitation.2009.11.021
  • 4. Andersen LW, Kim WY, Chase M, Berg KM, Mortensen SJ, Moskowitz A, et al. The prevalence and significance of abnormal vital signs prior to in-hospital cardiac arrest. Resuscitation 2016;98:112–7. doi: 10.1016/j.resuscitation.2015.08.016
  • 5. Galhotra S, DeVita MA, Simmons RL, Dew MA, Members of the Medical Emergency Response Improvement Team (MERIT) Committee. Mature rapid response system and potentially avoidable cardiopulmonary arrests in hospital. BMJ Qual Saf 2007;16:260–5. doi: 10.1136/qshc.2007.022210
  • 6. Subbe CP, Kruger M, Rutherford P, Gemmel L. Validation of a modified Early Warning Score in medical admissions. QJM 2001;94:521–6. doi: 10.1093/qjmed/94.10.521
  • 7. Smith GB, Prytherch DR, Meredith P, Schmidt PE, Featherstone PI. The ability of the National Early Warning Score (NEWS) to discriminate patients at risk of early cardiac arrest, unanticipated intensive care unit admission, and death. Resuscitation 2013;84:465–70. doi: 10.1016/j.resuscitation.2012.12.016
  • 8. Prytherch DR, Sirl JS, Schmidt P, Featherstone PI, Weaver PC, Smith GB. The use of routine laboratory data to predict in-hospital death in medical admissions. Resuscitation 2005;66:203–7. doi: 10.1016/j.resuscitation.2005.02.011
  • 9. Churpek MM, Yuen TC, Park SY, Meltzer DO, Hall JB, Edelson DP. Derivation of a cardiac arrest prediction model using ward vital signs. Crit Care Med 2012;40:2102–8. doi: 10.1097/CCM.0b013e318250aa5a
  • 10. Churpek MM, Yuen TC, Edelson DP. Risk stratification of hospitalized patients on the wards. Chest 2013;143:1758–65. doi: 10.1378/chest.12-1605
  • 11. Badriyah T, Briggs JS, Meredith P, Jarvis SW, Schmidt PE, Featherstone PI, et al. Decision-tree early warning score (DTEWS) validates the design of the National Early Warning Score (NEWS). Resuscitation 2014. doi: 10.1016/j.resuscitation.2013.12.011
  • 12. Loekito E, Bailey J, Bellomo R, Hart GK, Hegarty C, Davey P, et al. Common laboratory tests predict imminent death in ward patients. Resuscitation 2013;84:280–5. doi: 10.1016/j.resuscitation.2012.07.025
  • 13. Ng YH, Pilcher DV, Bailey M, Bain CA, MacManus C, Bucknall TK. Predicting medical emergency team calls, cardiac arrest calls and re-admission after intensive care discharge: creation of a tool to identify at-risk patients. Anaesth Intensive Care 2018;46:88–96. doi: 10.1177/0310057X1804600113
  • 14. Churpek MM, Yuen TC, Park SY, Gibbons R, Edelson DP. Using electronic health record data to develop and validate a prediction model for adverse outcomes in the wards. Crit Care Med 2014;42:841–8. doi: 10.1097/CCM.0000000000000038
  • 15. Churpek MM, Yuen TC, Winslow C, Robicsek AA, Meltzer DO, Gibbons RD, et al. Multicenter development and validation of a risk stratification tool for ward patients. Am J Respir Crit Care Med 2014;190:649–55. doi: 10.1164/rccm.201406-1022OC
  • 16. Churpek MM, Yuen TC, Winslow C, Meltzer DO, Kattan MW, Edelson DP. Multicenter Comparison of Machine Learning Methods and Conventional Regression for Predicting Clinical Deterioration on the Wards. Crit Care Med 2016. doi: 10.1097/CCM.0000000000001571
  • 17. Kwon J, Lee Y, Lee Y, Lee S, Park J. An Algorithm Based on Deep Learning for Predicting In-Hospital Cardiac Arrest. J Am Heart Assoc 2018;7:e008678. doi: 10.1161/JAHA.118.008678
  • 18. Pirracchio R, Petersen ML, Carone M, Rigon MR, Chevret S, van der Laan MJ. Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. Lancet Respir Med 2015;3:42–52. doi: 10.1016/S2213-2600(14)70239-5
  • 19. Sanchez-Pinto LN, Venable LR, Fahrenbach J, Churpek MM. Comparison of variable selection methods for clinical predictive modeling. Int J Med Inform 2018;116:10–7. doi: 10.1016/j.ijmedinf.2018.05.006
  • 20. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD Statement. Eur Urol 2015;67:1142–51. doi: 10.1016/j.eururo.2014.11.025
  • 21. Dziadzko MA, Novotny PJ, Sloan J, Gajic O, Herasevich V, Mirhaji P, et al. Multicenter derivation and validation of an early warning score for acute respiratory failure or death in the hospital. Crit Care 2018;22:286. doi: 10.1186/s13054-018-2194-7
  • 22. Jeffery AD, Dietrich MS, Fabbri D, Kennedy B, Novak LL, Coco J, et al. Advancing In-Hospital Clinical Deterioration Prediction Models. Am J Crit Care 2018;27:381–91. doi: 10.4037/ajcc2018957
  • 23. Churpek MM, Yuen TC, Winslow C, Hall J, Edelson DP. Differences in vital signs between elderly and nonelderly patients prior to ward cardiac arrest. Crit Care Med 2015;43:816–22. doi: 10.1097/CCM.0000000000000818
  • 24. Churpek MM, Adhikari R, Edelson DP. The value of vital sign trends for detecting clinical deterioration on the wards. Resuscitation 2016;102:1–5. doi: 10.1016/j.resuscitation.2016.02.005
  • 25. Goto T, Camargo CA, Faridi MK, Freishtat RJ, Hasegawa K. Machine Learning-Based Prediction of Clinical Outcomes for Children During Emergency Department Triage. JAMA Netw Open 2019;2:e186937. doi: 10.1001/jamanetworkopen.2018.6937
  • 26. Meyer A, Zverinski D, Pfahringer B, Kempfert J, Kuehne T, Sündermann SH, et al. Machine learning for real-time prediction of complications in critical care: a retrospective study. Lancet Respir Med 2018;6:905–14. doi: 10.1016/S2213-2600(18)30300-X
  • 27. Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ 2018;361:k1479. doi: 10.1136/bmj.k1479
  • 28. Kellett J, Murray A. Should predictive scores based on vital signs be used in the same way as those based on laboratory data? A hypothesis generating retrospective evaluation of in-hospital mortality by four different scoring systems. Resuscitation 2016;102:94–7. doi: 10.1016/j.resuscitation.2016.02.020
  • 29. Rozen TH, Mullane S, Kaufman M, Hsiao Y-FF, Warrillow S, Bellomo R, et al. Antecedents to cardiac arrests in a teaching hospital intensive care unit. Resuscitation 2014;85:411–7. doi: 10.1016/j.resuscitation.2013.11.018
  • 30. Jones D, Mitchell I, Hillman K, Story D. Defining clinical deterioration. Resuscitation 2013;84:1029–34. doi: 10.1016/j.resuscitation.2013.01.013
  • 31. Kovacs C, Jarvis SW, Prytherch DR, Meredith P, Schmidt PE, Briggs JS, et al. Comparison of the National Early Warning Score in non-elective medical and surgical patients. Br J Surg 2016;103:1385–93. doi: 10.1002/bjs.10267
  • 32. Bedoya AD, Clement ME, Phelan M, Steorts RC, O’Brien C, Goldstein BA. Minimal Impact of Implemented Early Warning Score and Best Practice Alert for Patient Deterioration. Crit Care Med 2019;47:49–55. doi: 10.1097/CCM.0000000000003439
  • 33. Churpek MM, Yuen TC, Edelson DP. Predicting clinical deterioration in the hospital: the impact of outcome selection. Resuscitation 2013;84:564–8. doi: 10.1016/j.resuscitation.2012.09.024
  • 34. Yoder JC, Yuen TC, Churpek MM, Arora VM, Edelson DP. A prospective study of nighttime vital sign monitoring frequency and risk of clinical deterioration. JAMA Intern Med 2013;173:1554–5. doi: 10.1001/jamainternmed.2013.7791
  • 35. Guo J, Li B. The Application of Medical Artificial Intelligence Technology in Rural Areas of Developing Countries. Health Equity 2018;2:174–81. doi: 10.1089/heq.2018.0037
  • 36. Wahl B, Cossy-Gantner A, Germann S, Schwalbe NR. Artificial intelligence (AI) and global health: how can AI contribute to health in resource-poor settings? BMJ Glob Health 2018;3:e000798. doi: 10.1136/bmjgh-2018-000798

Decision Letter 0

Wisit Cheungpasitporn

28 Feb 2020

PONE-D-20-03952

Additional Value of Laboratory Results over Vital Signs in a Machine Learning Algorithm to Predict In-Hospital Cardiac Arrest: A Single-Centre Retrospective Cohort Study

PLOS ONE

Dear Ueno Ryo,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR: The reviewers have raised a number of points which we believe major modifications are necessary to improve the manuscript, taking into account the reviewers' remarks. Our expert reviewers, especially statistician review has concerns on methods and statistical analysis, and the use of random forest does not lead to a generalizable model. Please consider and address each of the comments raised by the reviewers before resubmitting the manuscript. This letter should not be construed as implying acceptance, as a revised version will be subject to re-review.

==============================

We would appreciate receiving your revised manuscript by Apr 13 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Wisit Cheungpasitporn, MD, FACP

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

1. Thank you for including your competing interests statement; "No"

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

 This information should be included in your cover letter; we will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

2.  We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

3. Please include your tables as part of your main manuscript and remove the individual files. Please note that supplementary tables (should remain/ be uploaded) as separate "supporting information" files


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The topic of this article is important however substantial revisions are suggested to improve the quality and clarify of the work. Overall the language has to be improved. Additionally, authors have often used uncommon words such as “parsimonious” , “ plausible” , etc — it is suggested to use simpler words and sentences. Although the meaning remains the same , but a simple sentence/word is less likely to be misinterpreted and is also well received by international readers.

1. Introduction: it is too short and needs major changes. The research gap is not well supported. It is suggested to clarify the state of the art, the motivation, and the research question.

2. Method: the subsection “outcome” might confuse readers. Please assign a different subsection name. Also provide with a reasoning for the choice of your ML model. It is also suggested to list the limitations of the selected model and address them in the paper.

3. Discussion: it required major rework. Not only English but also the flow has to be improved. Many claims needs citations. I have highlighted some of them :

“Prior studies have extensively investigated the importance of early prediction of IHCA.” [cite]

“Clinical deterioration is common prior to the cardiac arrest [4,5] and monitored or witnessed patient have more favorable outcomes [2,3] “ — this sentence is very confusing. I suggest rephrasing it.

“ Historically, multiple early warning scores have been developed, ranging from an analogue model based on vital signs to a digital scoring system using both vital signs and laboratory resu...” [cite]

“ Similar results were obtained in other studies [28]. ” — cite more studies or rephrase the sentence (studies — study) .

In the discussion I also suggest to include subsections discussing the effects biases in such ML models and how this has been addressed in the study. What are the risks associated?. Is the outcome clinically meaningful? (Please read the following to build upon discussion section: https://doi.org/10.12968/bjhc.2019.0066 ; https://doi.org/10.1126%2Fscience.aaw0029)

Also, a section dedicated to future research directions is highly recommended.

Reviewer #2: General Comment: The authors present their findings from a diagnostic model development project, aiming to compare two machine learning algorithms for predicting in-hospital cardiac arrest: a standard approach using both vital sign and laboratory measures from an EHR, and a reduced model using only the vital sign components. They report that both models provide similar discrimination of in-hospital cardiac arrest occurrence, with similar results in several settings. The debilitating omission in this manuscript is that the authors provide only the crudest of calibration/validation measurements, and double down on their refusal to calibrate in their Discussion. The use of random forest does not lead to a generalizable model, at least as used here.

Specific Comments:

1. The Introduction and Methods section combined are shorter than the Discussion. As a result there is little motivation for the project, and the methods used by the authors are not reported in enough detail, particularly the analytic approach (more on both of these below).

2. Introduction, Second paragraph: The authors' main justification for why they would like to investigate omitting laboratory measurements is that the effect of doing so is unknown. While true, this is hardly a justification for why they should feel doing so will not affect the diagnostic tool. There has to be a conceptual reason as to why they think this would work, and that reason(s) needs to be reported.

3. Methods Section, Study Population: The authors report that the vital signs and laboratory results are shown in detail in Figure 1. However, this figure shows nothing of the sort.

4. Methods Section, Statistical Methods: The authors use the random forest model for the prognostic tool since -- as they state -- it usually outperforms other algorithms. The reason for that is that it leads to models of great complexity, and model-averages across numerous well-fitting models. This leads to an uninterpretable model, which limits the ability of anyone (let alone diagnostic clinicians or scientists) who might want to use it. Because of this method, the authors do not -- and could not even if they wanted to -- report any prognostic model. As such, their approach is not reproducible and appears of little more value than "use random forests, add these vital sign measurements, and let the computer go to work."

5. Methods Section, Statistical Methods, Second Paragraph: The authors use ROC-AUC to assess their model discrimination, which is good. However, their calibration measures (sensitivity, specificity, PPV, NPV) are not adequate for assessing model calibration. Van Calster (2016) actually refers to these measures as "mean calibration", which is a step below "weak calibration". The authors are strongly encouraged to add a calibration plot of observed vs. predicted risk, which more definitively shows the quality of model calibration. Otherwise, the authors need to explicitly state, in their abstract, methods, results and discussion, that their model is not calibrated, even in the weak sense.

6. Results, Table 2: It would be more clear if the authors referred to their models as Vitals-Only and Vitals+labs. Reporting Vitals vs. Labs makes it sound like one model contains only lab measures and the other only vital measures, which is not the case.

7. Results: Only model discrimination is discussed, as only AUC is elaborated upon. No model calibration measures (sensitivity, specificity, PPV, NPV) are mentioned in the results. This is troubling, as the PPV for this model is abysmally low; less than 5% in most cases. As such this model would lead to more false positives than true positives at a rate of at least 19:1. No false positives might be better than false negatives with respect to cardiac arrest, but the authors make no mention of this in the results.

8. Discussion, Limitations, Sixth Paragraph: It is here that the authors state that they focused on discrimination rather than calibration. They confusingly make a general comment stating that calibration was "unfavorable to our research design," since calibration is "highly unstable as opposed to model discrimination." This is simply unacceptable. The biomedical informatics field has been consistently clear that uncalibrated models are worthless, and both discrimination and calibration for diagnostic and prognostic models are required, especially in supervised learning scenarios such as this. Both the Journal of Biomedical Informatics and the Journal of the American Medical Informatics Association are filled with articles (see anything by Royston, Moons, or Van Calster) stating the importance of calibrated models; they also have numerous articles showing HOW to calibrate.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Jul 13;15(7):e0235835. doi: 10.1371/journal.pone.0235835.r002

Author response to Decision Letter 0


19 May 2020

14 May 2020

Dr Joerg Heber

Editor-in-Chief

PLOS ONE

Dear Dr Heber:

Thank you for your ongoing consideration of our manuscript for publication in PLOS ONE. We appreciate the time spent by you and the reviewers and believe the revised manuscript has been improved. Below, we have addressed the reviewers’ comments.

We look forward to your editorial decision.

Sincerely,

Ryo Ueno

On behalf of the authors.

Australian and New Zealand Intensive Care Research Centre, School of Public Health and Preventive Medicine, Monash University, Melbourne, Australia

Email: ryo.ueno@monash.edu

Reviewer #1:

General Comment:

The topic of this article is important; however, substantial revisions are suggested to improve the quality and clarity of the work. Overall, the language has to be improved. Additionally, the authors have often used uncommon words such as “parsimonious”, “plausible”, etc.; it is suggested to use simpler words and sentences. Although the meaning remains the same, a simple sentence or word is less likely to be misinterpreted and is also better received by international readers.

[Response]

We appreciate your comments. As suggested, we have modified the words and sentences in all sections. In addition, we have sought help from a professional language editing service for this purpose.

Specific Comments:

1. Introduction: it is too short and needs major changes. The research gap is not well supported. It is suggested to clarify the state of the art, the motivation, and the research question.

[Response]

As requested, we have changed the Introduction section (page 4, para 1-3) to address the following topics in separate paragraphs: the state of the art, research gap, motivation, and the research question.

Overall, our biggest motivation for this study was to compare the additional value of laboratory results using the same dataset and same algorithm. Prior studies use various datasets with various algorithms for the prediction of IHCA, and a comparison of such studies is unable to clarify whether laboratory results are really necessary in the prediction of IHCA.

2. Method: the subsection “outcome” might confuse readers. Please assign a different subsection name. Also provide reasoning for the choice of your ML model. It is also suggested to list the limitations of the selected model and address them in the paper.

[Response]

We thank the reviewer for this helpful suggestion. First, as requested, we have changed the subsection name from “outcome” to “prediction outcome measures” in the Methods section (page 7, para 2). Second, as suggested, we have summarized three reasons why we chose the random forest model. The limitations of this algorithm are briefly summarized in the Methods section (page 7, para 3), and further discussed in the Discussion section (page 16, para 3).

3. Discussion: it requires major rework. Not only the English but also the flow has to be improved. Many claims need citations. I have highlighted some of them:

“Prior studies have extensively investigated the importance of early prediction of IHCA.” [cite]

“Clinical deterioration is common prior to the cardiac arrest [4,5] and monitored or witnessed patients have more favorable outcomes [2,3] “ — this sentence is very confusing. I suggest rephrasing it.

“ Historically, multiple early warning scores have been developed, ranging from an analogue model based on vital signs to a digital scoring system using both vital signs and laboratory resu...” [cite]

“ Similar results were obtained in other studies [28]. ” — cite more studies or rephrase the sentence (studies — study) .

[Response]

As suggested, we have added a number of citations. In addition, we have restructured the discussion with input from all the authors and sought the help of a professional language editing service to refine the text.

4. In the discussion I also suggest including subsections discussing the effects of biases in such ML models and how this has been addressed in the study. What are the associated risks? Is the outcome clinically meaningful? (Please read the following to build upon the discussion section: https://doi.org/10.12968/bjhc.2019.0066 ; https://doi.org/10.1126%2Fscience.aaw0029)

[Response]

We thank the reviewer for these insightful comments on the bias inherent in ML models. We have read the citations with great interest and respond to each issue below.

1) Bias and risks inherent to this model

First, this model was developed in a single-center setting. External validation in other clinical settings with different demographics will be of great importance. For clinicians, we have prepared Table 1 to detail the background of our cohort.

Second, we acknowledge that we did not provide existing clinical practice as a benchmark comparison for the Vitals-Only model. However, this study focuses on the comparison of the Vitals-Only model and the Vitals+Labs model rather than on providing a single model for clinical use. It would be essential to provide appropriate benchmarks before assessing a model’s clinical utility, but that is outside the scope of this paper.

Third, we have yet to assess whether our models are interoperable. Theoretically, our model uses only commonly collected variables and should be both generalizable and interoperable. Future investigations should include actually implementing the random forest-based model in an electronic medical record.

These limitations are now summarized in the Limitations section (page 15, para1; page 17, para2-3).

2) Validity of the outcome

As stated above, we have read the suggested articles with great interest. We noticed that there are a few definitions of “outcome” in those articles. Here, we discuss three outcomes related to our research.

i) Outcome, as in the performance measure

In our study, we used not only the AUC but also other commonly used statistical values such as sensitivity, specificity, positive (negative) predictive value, and positive (negative) log-likelihood. In addition, we have created a calibration plot in the modified manuscript. In this plot, the x-axis summarizes the predicted probability whereas the y-axis shows the observed proportion of patients who had a cardiac arrest in the next 8 hours. In this plot, approximately 20%–30% of the patients had an IHCA if their predicted risk score was higher than 0.9. These results are summarized in the Results section (page 10, para 2; page 11, para 1).

ii) Outcome, as in the endpoint of the prediction

The aim of our models is to predict unexpected IHCA rather than a combined outcome or surrogate outcome. Unlike other outcomes such as ICU transfer, unexpected IHCA is always objective and always necessitates clinical intervention. This is now described in the Discussion section (page 14, para 2).

iii) Outcome, as in the patient-centered outcome as a result of implementing this model

We have yet to determine whether our prediction model might actually improve the trajectory of patients at risk. However, our results are persuasive enough to facilitate a prospective validation of the Vitals-Only model. We describe this in the Discussion section (page 17, para 3).

Reviewer #2:

General Comment:

The authors present their findings from a diagnostic model development project, aiming to compare two machine learning algorithms for predicting in-hospital cardiac arrest: a standard approach using both vital sign and laboratory measures from an EHR, and a reduced model using only the vital sign components. They report that both models provide similar discrimination of in-hospital cardiac arrest occurrence, with similar results in several settings. The debilitating omission in this manuscript is that the authors provide only the crudest of calibration/validation measurements, and double down on their refusal to calibrate in their Discussion. The use of random forest does not lead to a generalizable model, at least as used here.

[Response]

We thank the reviewer for these insightful suggestions. As suggested, we have added a calibration analysis. In addition, we have described both the reasons for choosing and the limitations of using a random forest algorithm in our model. Below, we elaborate on both topics in detail.

Specific Comments:

1. The Introduction and Methods section combined are shorter than the Discussion. As a result there is little motivation for the project and the methods used by the authors are not reported in enough detail, particularly the analytic approach (more on both of these below).

[Response]

As suggested, we have clarified the motivation for the project in the Introduction section (page 4, para 3). In addition, we have restructured the Introduction with paragraphs on the following topics: the state of the art, the research gap, motivation, and the research question. (page 4, para 1-3; page 5, para 1)

Overall, our biggest motivation for this study was to compare the additional value of laboratory results using the same dataset and same algorithm. Prior studies use various datasets with various algorithms for the prediction of IHCA, and comparison of such studies is unable to clarify whether laboratory results are necessary in the prediction of IHCA.

We also address the concerns you mention regarding our methods in the comments below.

2. Introduction, Second paragraph: The authors' main justification for why they would like to investigate omitting laboratory measurements is that the effect of doing so is unknown. While true, this is hardly a justification for why they should feel doing so will not affect the diagnostic tool. There has to be a conceptual reason as to why they think this would work, and that reason(s) needs to be reported.

[Response]

As the reviewer points out, our initial draft lacked a conceptual reason as to why the Vitals-Only model may perform similarly to the Vitals+Labs model. As summarized in the modified Introduction (page 4, para 2), we believe that, compared to laboratory results, vital signs are more immediately available for all patients regardless of the clinical setting. Also, vital signs may reflect acute physiological changes in the human body better than changes in electrolytes or proteins in the blood. In addition, because vital signs can be collected easily at any time, whereas laboratory results are usually obtained only twice or thrice weekly, the amount of vital-sign data will be much larger. Thus, we believe it is important to investigate whether a Vitals-Only model could perform as well as a Vitals+Labs model.

3. Methods Section, Study Population: The authors report that the vital signs and laboratory results are shown in detail in Figure 1. However, this figure shows nothing of the sort.

[Response]

We mistakenly added this information to the caption of Figure 1. We have now changed Figure 1 to provide this content.

4. Methods Section, Statistical Methods: The authors use the random forest model for the prognostic tool since -- as they state -- it usually outperforms other algorithms. The reason for that is that it leads to models of great complexity, and model-averages across numerous well-fitting models. This leads to an uninterpretable model, which limits the ability of anyone (let alone diagnostic clinicians or scientists) who might want to use it. Because of this method, the authors do not -- and could not even if they wanted to -- report any prognostic model. As such, their approach is not reproducible and appears of little more value than "use random forests, add these vital sign measurements, and let the computer go to work."

[Response]

We appreciate the reviewer’s comments and the opportunity to clarify this important point.

As the reviewer points out, the random forest model is uninterpretable. As a result, we are unable to determine why our model predicts an IHCA for a particular patient. Hence, the interpretation of the model’s results may differ among medical personnel and is thus not reproducible.

Such a drawback, however, may not be an insurmountable obstacle in the clinical application of this model for the following three reasons.

First, a clinician’s response to this alert is likely to be quite straightforward regardless of the reason for the alert. Clinicians will review the patient, collect both electronic and bedside information to assess the situation, and escalate care as appropriate. There are a variety of interventions that could be made, such as additional testing, fluid administration, or transfer to the ICU. However, what determines the intervention is not the alert per se but the clinician’s review of the situation. As long as the early warning system triggers such action by the clinician, the reason behind the alert might be of marginal importance. The clinician can then work out an appropriate response based on the patient’s features rather than the model’s features.

Second, various de facto standards in our usual medical practice are full of “black boxes.” Not all clinicians are aware of the mechanisms behind MRI. Not all clinicians understand why a certain saddle shape on an electrocardiogram is highly associated with a fatal congenital arrhythmia, but all emergency physicians utilize this knowledge in their daily practice. “Prepare the ECG machine, add electrodes, and let the machine prepare the waveform” is often what actually happens at the bedside. Likewise, the fact that this model is uninterpretable does not always mean it is not useful at the bedside.

Third, the random forest model is in fact partially interpretable. Of course, we are unable to know why a particular patient was predicted to have an IHCA (or otherwise), but we can at least know the weight of each variable in the feature matrix. This at least enables us to check whether the decision-making algorithm is consistent with our clinical intuition.
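For illustration, the following is a minimal sketch of how such variable weights could be inspected, assuming a scikit-learn random forest; the feature names and the data are synthetic placeholders, not the study's actual variables or schema.

```python
# Illustrative sketch: inspect per-variable importance of a fitted random forest.
# Assumes scikit-learn; all data and feature names below are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "heart_rate": rng.normal(80, 15, n),
    "systolic_bp": rng.normal(120, 20, n),
    "respiratory_rate": rng.normal(18, 4, n),
    "spo2": rng.normal(97, 2, n),
})
# Synthetic outcome loosely driven by tachycardia and hypotension.
logit = 0.05 * (X["heart_rate"] - 80) - 0.04 * (X["systolic_bp"] - 120) - 4
y = rng.random(n) < 1 / (1 + np.exp(-logit))

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Impurity-based weight of each variable, largest first: a rough check that the
# model's decision-making is consistent with clinical intuition.
importance = pd.Series(rf.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```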

We appreciate the insightful comments of the reviewer and have added the above discussion to the Discussion section (page 16, para 4; page 17, para 1).

5. Methods Section, Statistical Methods, Second Paragraph: The authors use ROC-AUC to assess their model discrimination, which is good. However, their calibration measures (sensitivity, specificity, PPV, NPV) are not adequate for assessing model calibration. Van Calster (2016) actually refers to these measures as "mean calibration", which is a step below "weak calibration". The authors are strongly encouraged to add a calibration plot of observed vs. predicted risk, which more definitively shows the quality of model calibration. Otherwise, the authors need to explicitly state, in their abstract, methods, results and discussion, that their model is not calibrated, even in the weak sense.

[Response]

We strongly agree with the reviewer. As suggested, we have added a calibration plot in Figure 3. We discuss the interpretation of the result in the Results (page 11, para 1) and Discussion (page 16, para 3) sections.

6. Results, Table 2: It would be more clear if the authors referred to their models as Vitals-Only and Vitals+labs. Reporting Vitals vs. Labs makes it sound like one model contains only lab measures and the other only vital measures, which is not the case.

[Response]

As suggested, we have changed the names of each model to Vitals-Only and Vitals+Labs throughout the manuscript.

7. Results: Only model discrimination is discussed, as only AUC is elaborated upon. No model calibration measures (sensitivity, specificity, PPV, NPV) are mentioned in the results. This is troubling, as the PPV for this model is abysmally low; less than 5% in most cases. As such this model would lead to more false positives than true positives at a rate of at least 19:1. No false positives might be better than false negatives with respect to cardiac arrest, but the authors make no mention of this in the results.

[Response]

As suggested, we added a calibration plot in Figure 3. We will discuss this plot in the next response. As the reviewer points out, the model has a low positive predictive value, and this is a global issue in all prediction models of rare events such as IHCA (0.4% prevalence) [1][2]. Although false positives may be more acceptable than false negatives with respect to cardiac arrest, each hospital needs to seek its own balance to optimize the tradeoff between false alarms and overlooked IHCA patients when implementing our model in clinical practice.

In this study, we focused on the comparison of the Vitals-Only model and the Vitals+Labs model at a widely investigated threshold (i.e., Youden’s index) rather than the provision of a highly predictive model. Therefore, we did not aim for an overly sensitive model or an overly specific model. A different threshold and more sophisticated feature engineering might aid in developing a model with a higher positive predictive value.
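As an illustration of the threshold choice described above, here is a minimal sketch of selecting a cut-off by Youden's index from an ROC curve and reporting sensitivity, specificity, and PPV at that cut-off, assuming scikit-learn; the labels and scores are synthetic placeholders, not the study's data.

```python
# Illustrative sketch: pick a threshold by Youden's index and report sensitivity,
# specificity, and PPV at that threshold. Labels and scores are synthetic.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve

rng = np.random.default_rng(0)
y_true = rng.random(20000) < 0.004                     # rare outcome, ~0.4% prevalence
y_score = np.clip(rng.normal(0.2, 0.15, 20000) + 0.4 * y_true, 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr                                          # Youden's J = sensitivity + specificity - 1
threshold = thresholds[np.argmax(j)]

tn, fp, fn, tp = confusion_matrix(y_true, y_score >= threshold).ravel()
print(f"threshold={threshold:.2f}  sensitivity={tp / (tp + fn):.2f}  "
      f"specificity={tn / (tn + fp):.2f}  PPV={tp / (tp + fp):.2f}")
```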

Of note, it is also worth investigating the outcomes of “false positive” patients in future studies. Patients flagged as being at risk of cardiac arrest, even when the prediction is a “false positive,” may still be identified as “at risk” patients, which would facilitate early interventions such as those of a rapid response team.

We have added this discussion to the Results section (page 10, para 2) and the Discussion section (page 15, para 4; page 16, para 1).

8. Discussion, Limitations, Sixth Paragraph: It is here that the authors state that they focused on discrimination rather than calibration. They confusingly make a general comment stating that calibration was "unfavorable to our research design," since calibration is "highly unstable as opposed to model discrimination." This is simply unacceptable. The biomedical informatics field has been consistently clear that uncalibrated models are worthless, and both discrimination and calibration for diagnostic and prognostic models are required, especially in supervised learning scenarios such as this. Both the Journal of Biomedical Informatics and the Journal of the American Medical Informatics Association are filled with articles (see anything by Royston, Moons, or Van Calster) stating the importance of calibrated models; they also have numerous articles showing HOW to calibrate.

[Response]

We thank the reviewer for these insightful comments. As suggested, we have created a calibration plot. As shown in Figure 3, the calibration performance of the two models does not differ much. The highest risk group has a 20%–30% occurrence of IHCA, which is far more frequent than the pre-test probability in the entire population. However, the calibration curve is far from the diagonal line, with poor calibration scores (Vitals-Only model vs Vitals+Labs model, Hosmer-Lemeshow C-statistic = 8592.40 vs 9514.76, respectively). Therefore, we should not interpret the predicted probability from our models as the observed proportion of patients having an IHCA in the next 8 hours.

The reason for this calibration performance may stem from the design of our study. The aim of this research is a comparison of the discrimination performance of the two models rather than the provision of a well-calibrated risk score. Therefore, we attempted to maximize discrimination at the expense of calibration performance. However, if we were to implement the Vitals-Only model in an actual clinical setting, a well-calibrated model could strongly support clinical decision making by enabling clinicians to interpret the predicted probability as a risk score. Although some clinical tests are still focused solely on discrimination (e.g., pregnancy tests, fecal occult blood tests), they should ideally all be calibrated. Future studies should investigate a more sophisticated calibration of the Vitals-Only model before its clinical application.
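As one hedged illustration of what such a recalibration step could look like, the sketch below maps a fitted forest's raw probabilities through an isotonic calibrator using scikit-learn's CalibratedClassifierCV; this is not the method used in the study, and all data are synthetic.

```python
# Illustrative sketch: recalibrate a classifier's probabilities with an isotonic mapping.
# Not the study's method; data are synthetic placeholders.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(6000, 8))
y = rng.random(6000) < 1 / (1 + np.exp(-(X[:, 0] - 3)))   # imbalanced synthetic outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Uncalibrated forest vs a forest whose outputs are recalibrated via cross-validation.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
recal = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="isotonic", cv=5).fit(X_train, y_train)

print(rf.predict_proba(X_test)[:5, 1].round(3))     # raw risk estimates
print(recal.predict_proba(X_test)[:5, 1].round(3))  # recalibrated risk estimates
```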

Regardless, we believe that the results of our study show that the simpler Vitals-Only model performs as well as the Vitals+Labs model, and this finding will be of value in the further development of IHCA prediction models.

We have added the above discussion in the Discussion section (page 16, para 3).

Figure 3: Calibration plot

The x-axis summarizes the predicted probability of having a cardiac arrest in the next 8 hours, whereas the y-axis shows the observed proportion of patients who had a cardiac arrest in the next 8 hours. The histogram below the calibration plot summarizes the distribution of predicted probabilities amongst all the patients in our dataset.

Abbreviations: IHCA, in-hospital cardiac arrest
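For readers who wish to reproduce a plot of this form, the following is a minimal sketch using scikit-learn's calibration_curve and matplotlib; the labels and predicted probabilities are synthetic stand-ins rather than the study's outputs.

```python
# Illustrative sketch: calibration plot of predicted probability vs observed IHCA
# proportion, with a histogram of predictions underneath. Data are synthetic.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_true = rng.random(20000) < 0.004
y_prob = np.clip(rng.beta(1, 60, 20000) + 0.3 * y_true, 0, 1)

frac_observed, mean_predicted = calibration_curve(
    y_true, y_prob, n_bins=10, strategy="quantile")

fig, (ax_cal, ax_hist) = plt.subplots(
    2, 1, sharex=True, figsize=(5, 6), gridspec_kw={"height_ratios": [3, 1]})
ax_cal.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Perfect calibration")
ax_cal.plot(mean_predicted, frac_observed, marker="o", label="Model")
ax_cal.set_ylabel("Observed proportion with IHCA in next 8 h")
ax_cal.legend()
ax_hist.hist(y_prob, bins=50)
ax_hist.set_xlabel("Predicted probability of IHCA in next 8 h")
ax_hist.set_ylabel("Count")
plt.tight_layout()
plt.show()
```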

REFERENCES

[1] Goto T, Camargo CA, Faridi MK, Freishtat RJ, Hasegawa K. Machine Learning–Based Prediction of Clinical Outcomes for Children During Emergency Department Triage. JAMA Netw Open 2019;2:e186937. doi:10.1001/jamanetworkopen.2018.6937.

[2] Haibo He, Garcia EA. Learning from Imbalanced Data. IEEE Trans Knowl Data Eng 2009;21:1263–84. doi:10.1109/TKDE.2008.239.

Decision Letter 1

Wisit Cheungpasitporn

2 Jun 2020

PONE-D-20-03952R1

Value of Laboratory Results in Addition to Vital Signs in a Machine Learning Algorithm to Predict In-Hospital Cardiac Arrest: A Single-Center Retrospective Cohort Study

PLOS ONE

Dear Dr. Ueno Ryo,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR: Our expert reviewer(s) have recommended additional revisions to your revised manuscript. There are many areas that need clarification and improvement. Therefore, I invite you to respond to the reviewer(s)' comments as listed and revise your manuscript.

==============================

Please submit your revised manuscript by Jul 17 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Wisit Cheungpasitporn, MD, FACP

Academic Editor

PLOS ONE

Additional Editor Comments:

Our expert reviewer(s) have recommended additional revisions to your revised manuscript. There are many areas that need clarification and improvement. Therefore, I invite you to respond to the reviewer(s)' comments as listed and revise your manuscript.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have made substantial revisions. However, more work is needed:

1. I understand the motivation of the work; however, simply stating that vitals are readily available is not sufficient. I suggest elaborating on the "data dimensionality" and "minimal optimal" problems, etc., emphasizing the benefits of a smaller or optimal dataset. Additionally, did the computational time reduce significantly after eliminating predictors? It would be better if the authors could report the computation time (for their computer configuration).

2. "A simpler model with fewer input ... " I do not agree. Especially in a healthcare setting, the adoption of any technology takes place based on its performance and economic feasibility.

3. "... it does not require any log-transformation .... computational complexity of the pre-processing is reduced ....". Log transformation and normalization are not computationally complex. I do not suggest stating this as a reason to justify RF.

4. k-fold cross-validation is always preferred to manually dividing the data into test and training sets.

5. "Implausible values were removed ...". This is not clear. What is "implausible values"? Outlier or anomalies? Please rewrite and explain, just giving citations is not sufficient

6. In the STRENGTH section of the paper, the authors stated that " A prior study showed that the existence of missing data itself has predictive value [27]" Here is the quote from the study [27] "However, missing and incorrect demographic information in both the Death Master File and EHR data can affect the accuracy of the matches and the resulting estimated survival rates. To circumvent these limitations, our outcome was literally whether the EHR indicates that the patient is alive three years after our cohort period ended. We were not modeling time until death or conducting a traditional survival analysis." The study [27] did not replace missing values to binary. Also, the standard practice is to delete the column when more than 50% is missing. Missing data can only be indicative of "non-essential information" (if it is deliberately not captured).

7. The study has three strengths and nine limitations.

8. The implication of the study is unclear. How does eliminating data make an ML model simpler? The model's complexity is independent of the data; it depends on the underlying algorithm.

9. Lastly, the language needs to be improved (tense). The sentence structures are not indicative of technical writing, e.g., "Second, clinically, doctors and nurses might intervene in patients ...".

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Avishek Choudhury

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Jul 13;15(7):e0235835. doi: 10.1371/journal.pone.0235835.r004

Author response to Decision Letter 1


19 Jun 2020

Reviewer #1:

The authors have made substantial revisions. However, more work is needed:

[Response]

We thank the reviewer for the encouraging comment and very constructive suggestions. We provide point-by-point responses below.

1. I understand the motivation of the work; however, simply stating that vitals are readily available is not sufficient. I suggest elaborating on the "data dimensionality" and "minimal optimal" problems, etc., emphasizing the benefits of a smaller or optimal dataset. Additionally, did the computational time reduce significantly after eliminating predictors? It would be better if the authors could report the computation time (for their computer configuration).

[Response]

Thank you for the constructive suggestions. As suggested, we have further elaborated on the motivation of this study in the Introduction section as follows:

Fourth, from a computational point of view, the minimal optimal model with a low data dimensionality is always better than a more complicated model as long as it has similar performance.

As the reviewer has pointed out, the computational time is an important aspect of a machine learning model. In our study, however, the dimensionality was reduced only from 117 to 49 variables, and we did not observe any significant reduction in computation time.

Despite the lack of computational benefit, we believe achieving a minimal optimal model is of clinical importance. The bedside application of an ML model always comes with the cost of data acquisition; every single laboratory test is a physical and financial burden for both hospital staff and patients. If the performance of the Vitals-Only model is not inferior to that of the Vitals+Labs model, it is the more economically feasible option.

2. "A simpler model with fewer input ... " I do not agree. Especially in a healthcare setting, the adoption of any technology takes place based on its performance and economic feasibility.

[Response]

We agree with the reviewer. We believe that the Vitals-Only model will be more economically feasible than the Vitals+Labs model because laboratory tests have physical and financial costs. As our results show, the performance is comparable despite its simplicity. This was our reasoning, which we agree we did not make explicit in this sentence. We have edited the Introduction section as follows:

Third, the Vitals-Only model would be non-invasive and economically feasible, because it does not require any laboratory tests, which are both physically and financially stressful for patients.

We also changed the sentence you mention in your comment as follows.

First, a simpler model with fewer input variables can be adopted in a wide variety of settings.

3. "... it does not require any log-transformation .... computational complexity of the pre-processing is reduced ....". Log transformation and normalization are not computationally complex. I do not suggest stating this as a reason to justify RF.

[Response]

We thank the reviewer for this comment. We have deleted the sentence as suggested.

4. k-fold cross-validation is always preferred to manually dividing the data into test and training sets.

[Response]

We appreciate the reviewer’s insightful suggestion. We chronologically defined the training period and test period a priori. We used this approach for the following reasons. First, most of our variables are time dependent; vital signs and laboratory results change over time. In addition, the information about a previous episode of in-hospital cardiac arrest is also time dependent. Second, when we implement the model in an actual clinical setting, we will need to develop it using data from a certain period and prospectively apply it to data from a distinct, more recent period. As such, our approach may serve as a proxy for that situation.
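A minimal sketch of such a chronological split is shown below, assuming a pandas DataFrame of admissions; the column names, dates, and cut-off are illustrative and not the study's actual schema.

```python
# Illustrative sketch: split admissions chronologically rather than by k-fold CV.
# DataFrame schema, dates, and the cut-off are hypothetical.
import pandas as pd

admissions = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "admission_date": pd.to_datetime(
        ["2012-03-01", "2014-07-19", "2015-11-23", "2017-02-14", "2018-06-30"]),
    "ihca_within_8h": [0, 0, 1, 0, 0],
})

cutoff = pd.Timestamp("2016-10-01")   # defined a priori, before model development
train = admissions[admissions["admission_date"] < cutoff]
test = admissions[admissions["admission_date"] >= cutoff]

print(f"{len(train)} training admissions, {len(test)} test admissions")
```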

5. "Implausible values were removed ...". This is not clear. What is "implausible values"? Outlier or anomalies? Please rewrite and explain, just giving citations is not sufficient

[Response]

We mean data that are not biologically possible, which indicates that an error in data entry has occurred. We have changed this term in the text to ‘values that are clearly errors’. We agree with the reviewer that more clarification is needed. All the biologically inconsistent data that were excluded are summarized in the Supplementary Information as follows (a brief illustrative code sketch follows the list):

List of excluded inconsistent data

1. body temperature < 30 ℃ or > 45 ℃

2. heart rate <20 beats/min or >300 beats/min

3. respiratory rate >80/min

4. systolic blood pressure < 20 mm Hg or > 300 mm Hg

5. diastolic blood pressure < 10 mm Hg or > 200 mm Hg

6. urine output > 10,000 ml per eight hours

7. oxygen saturation < 40% or > 101%
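As referenced above, the following is a minimal sketch of how these exclusion rules could be applied in pandas; the column names are hypothetical, and out-of-range values are set to missing here, which may differ from how removal was implemented in the study.

```python
# Illustrative sketch: flag biologically implausible vital-sign values using the limits
# listed above. Column names are hypothetical.
import numpy as np
import pandas as pd

vitals = pd.DataFrame({
    "temp_c": [36.8, 52.0, 37.5],
    "heart_rate": [88.0, 72.0, 310.0],
    "resp_rate": [18.0, 16.0, 22.0],
    "sbp": [120.0, 118.0, 95.0],
    "dbp": [70.0, 75.0, 60.0],
    "urine_8h_ml": [400.0, 350.0, 500.0],
    "spo2": [97.0, 98.0, 96.0],
})

plausible_range = {
    "temp_c": (30, 45),
    "heart_rate": (20, 300),
    "resp_rate": (-np.inf, 80),
    "sbp": (20, 300),
    "dbp": (10, 200),
    "urine_8h_ml": (-np.inf, 10_000),
    "spo2": (40, 101),
}

for col, (lo, hi) in plausible_range.items():
    out_of_range = (vitals[col] < lo) | (vitals[col] > hi)
    vitals.loc[out_of_range, col] = np.nan   # e.g. temp 52.0 and HR 310 become missing

print(vitals)
```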

6. In the STRENGTH section of the paper, the authors stated that " A prior study showed that the existence of missing data itself has predictive value [27]" Here is the quote from the study [27]

"However, missing and incorrect demographic information in both the Death Master File and EHR data can affect the accuracy of the matches and the resulting estimated survival rates. To circumvent these limitations, our outcome was literally whether the EHR indicates that the patient is alive three years after our cohort period ended. We were not modeling time until death or conducting a traditional survival analysis."

The study [27] did not replace missing values with binary values. Also, the standard practice is to delete the column when more than 50% is missing. Missing data can only be indicative of "non-essential information" (if it is deliberately not captured).

[Response]

Thank you for the meticulous comment and the opportunity for further clarification.

We agree with the reviewer; the cited study [27] assessed the predictive value of the time at which each laboratory test was ordered, compared with the predictive value of the laboratory result itself. As such, that study is not about missing data imputation. We do believe, however, that it demonstrated the importance of the timing and frequency of each test (including the decision NOT to measure) in predicting adverse outcomes. Hence, we aimed to incorporate the timing and frequency of each measurement by converting each variable to a binary value. We note that the limitation quoted by the reviewer concerns demographic information, not clinical tests.

To ensure this issue is addressed, we have repeated the primary analysis with four different methods of handling missing values. All the results were similar to those of the primary analysis, as summarized in Supplementary Table 3. We believe that this sensitivity analysis reinforces the robustness of the primary outcome. To avoid confusion, we have changed the text in the Discussion section.

Supplementary Table 3: Predictive Performance of Each Model for In-Hospital Cardiac Arrest with Various Imputation Methods

IHCA window | Model       | Imputation | Binary | Deletion | Categorical
0-8h        | Vitals-Only | 0.877      | 0.896  | 0.887    | 0.899
0-8h        | Vitals+Labs | 0.855      | 0.864  | 0.848    | 0.862
0-24h       | Vitals-Only | 0.851      | 0.871  | 0.850    | 0.873
0-24h       | Vitals+Labs | 0.828      | 0.849  | 0.828    | 0.846

Missing values were imputed with the most recent value or the average of the overall sample if <50% of the values were missing. Otherwise, the variable was converted according to one of the following rules (an illustrative code sketch of these four strategies follows the list):

1) Imputation (the variable was imputed with the most recent variable value or average of the overall sample);

2) Binary (if the variable was missing it was converted to 0, otherwise it was set to 1);

3) Categorical (the variable was converted to a categorical value, i.e. missing or quantile category);

4) Deletion (the variable was removed from the entire analysis).
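The sketch below illustrates the four strategies above on a single hypothetical laboratory variable, assuming pandas; it is an illustration of the stated rules, not the study's actual code.

```python
# Illustrative sketch: the four missing-data strategies of Supplementary Table 3 applied
# to one hypothetical laboratory variable.
import numpy as np
import pandas as pd

df = pd.DataFrame({"lactate": [1.1, np.nan, 2.4, np.nan, 3.0, np.nan]})

# 1) Imputation: carry the most recent value forward, then fall back to the sample mean.
df["lactate_imputed"] = df["lactate"].ffill().fillna(df["lactate"].mean())

# 2) Binary: 1 if the value was measured, 0 if it was missing.
df["lactate_measured"] = df["lactate"].notna().astype(int)

# 3) Categorical: quantile category, with an explicit "missing" level.
quantile_cat = pd.qcut(df["lactate"], q=2, labels=["low", "high"])
df["lactate_category"] = quantile_cat.cat.add_categories("missing").fillna("missing")

# 4) Deletion: drop the variable from the analysis entirely.
df_without_lactate = df.drop(columns=["lactate"])

print(df)
```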

7. The study has three strengths and nine limitations.

[Response]

We have noticed that the Limitations section includes both limitations and future research directions. We have changed the subheading to better reflect the content of the paper.

8. The implication of the study is unclear. How does eliminating data make an ML model simpler? The model's complexity is independent of the data; it depends on the underlying algorithm.

[Response]

We thank the reviewer for this insightful comment.

We agree that improving the underlying algorithm is essential to creating a faster and more accurate model. We also agree that the complexity of the dataset is unimportant in a retrospective setting, where a dataset is readily available. When we apply such a model at the bedside, however, we always need to collect data prospectively, and such data are full of missing values for the following reasons:

1. We cannot justify collecting laboratory results every day, just for the purposes of prediction. Collecting laboratory results at every opportunity is both a physical and financial burden on patients.

2. Even if we take a blood sample every day, we rarely order all types of blood tests for every single patient. For some patients, just checking the hemoglobin is enough for their management, and we cannot justify ordering all blood tests.

3. Even if we were able to order all the blood tests, a couple of hours would be needed before the results became available to clinicians.

Therefore, a model based on vital signs, which are readily available and do not place any additional burden on patients, is highly desirable in practice.

9. Lastly, the language needs to be improved (tense). The sentence structures are not indicative of technical writing, e.g., "Second, clinically, doctors and nurses might intervene in patients ...".

[Response]

We appreciate the reviewer’s suggestion. As suggested, we have edited the language with a specific focus on tense and structure. Furthermore, we used an additional language editing service.

Decision Letter 2

Wisit Cheungpasitporn

24 Jun 2020

Value of Laboratory Results in Addition to Vital Signs in a Machine Learning Algorithm to Predict In-Hospital Cardiac Arrest: A Single-Center Retrospective Cohort Study

PONE-D-20-03952R2

Dear Dr. Ryo,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Wisit Cheungpasitporn, MD

Academic Editor

PLOS ONE

Additional Editor Comments:

I reviewed the revised manuscript and the responses to the reviewers' comments. The revised manuscript is well written. All comments have been addressed, and the manuscript is thus accepted for publication.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #3: All concerns have been fully elucidated, and the missing sections and analyses have been completed. Finally, comprehension errors have been corrected. Good work!

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Avishek Choudhury — Stevens Institute of Technology

Reviewer #3: No

Acceptance letter

Wisit Cheungpasitporn

29 Jun 2020

PONE-D-20-03952R2

Value of Laboratory Results in Addition to Vital Signs in a Machine Learning Algorithm to Predict In-Hospital Cardiac Arrest: A Single-Center Retrospective Cohort Study

Dear Dr. Ueno:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Wisit Cheungpasitporn

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File

    (DOCX)

    Data Availability Statement

Data cannot be shared publicly because of confidentiality restrictions. Data are available from the Kameda Medical Centre/Ethics Committee for researchers who meet the criteria for access to confidential data. Please contact the corresponding author and/or the medical centre ethics committee (contact via clinical_research@kameda.jp).

