PLoS One. 2020 Nov 17;15(11):e0242166. doi: 10.1371/journal.pone.0242166

Using the National Trauma Data Bank (NTDB) and machine learning to predict trauma patient mortality at admission

Evan J Tsiklidis 1, Carrie Sims 2, Talid Sinno 1, Scott L Diamond 1,*
Editor: Zsolt J Balogh
PMCID: PMC7671512  PMID: 33201935

Abstract

A 400-estimator gradient boosting classifier was trained to predict survival probabilities of trauma patients. The National Trauma Data Bank (NTDB) provided 799,233 complete patient records (778,303 survivors and 20,930 deaths), each containing 32 features, a number further reduced to only 8 features via the permutation importance method. Importantly, the 8 features can all be readily determined at admission: systolic blood pressure, heart rate, respiratory rate, temperature, oxygen saturation, gender, age, and Glasgow coma score. Since death was rare, a rebalanced training set was used to train the model. The model predicts a survival probability for any trauma patient and accurately distinguishes between a deceased and a survived patient in 92.4% of all cases. Partial dependence curves (P_survival vs. feature value) obtained from the trained model revealed the global importance of Glasgow coma score, age, and systolic blood pressure, while pulse rate, respiratory rate, temperature, oxygen saturation, and gender had more subtle single-variable influences. Shapley values, which measure the relative contribution of each of the 8 features to individual patient risk, were computed for several patients and were able to quantify patient-specific warning signs. Using the NTDB to sample across numerous patient traumas and hospital protocols, the trained model and Shapley values rapidly provide quantitative insight into which combination of variables in an 8-dimensional space contributed most to each trauma patient’s predicted global risk of death upon emergency room admission.

1. Background

Trauma is the third leading cause of mortality in the United States and results in approximately 6 million deaths and a cost of over 500 billion dollars worldwide each year [1, 2]. A unique characteristic of trauma relative to other disease states is not only the large patient-to-patient variability, but also non-physiological considerations such as the distance from a trauma center, the resources available for resuscitation, and the number of other casualties. These additional complexities make patient risk analysis difficult, but necessary, to implement in real time. Given the intricacy of traumatic injury, patient-scale modeling of trauma from first principles is extremely challenging. Consequently, machine learning approaches have been the mainstay of modeling in this arena. In this regard, a large and diverse data set is valuable for training an accurate model that efficiently predicts patient risk, warning signs, and survival probabilities from easily measurable or estimable quantities.

Trauma centers prioritize patients as they arrive by dividing them into various tiers based upon patient vital signs (respiratory rate, systolic blood pressure, etc.), nature of the injury (i.e., penetrating), and Glasgow Coma score [3]. This prioritization is essential, as it is understood that the sooner a patient receives surgical or medical treatment the greater the likelihood for patient survival [4]. A well-trained model has the potential to help in this patient prioritization by providing a quantitative metric for patient risk. Moreover, the use of SHAP values [5], for example, to explain the prediction of the model may help alert the clinician of difficult-to-discern combinatorial risks in a high dimensional pathophysiological space.

To date, neural networks have been the primary model in the study of trauma. Edwards et al. showed that a neural network could accurately classify mortality in 81 intracerebral hemorrhage patients [6]. Marble et al. used a neural network to predict the onset of sepsis in blunt injury trauma patients with 100% sensitivity and 96.5% specificity [7]. Estahbanati and Bouduhi used neural networks to predict mortality in burn patients to a 90% training set accuracy [8]. DiRusso et al. compared the accuracy of logistic regression (linear) and neural networks (non-linear) in predicting outcomes in pediatric trauma patients [9]. Walczak used neural networks to predict the transfusion requirements of trauma patients, an important problem considering potential resource limitations and adverse responses [10]. Mitchell et al. used comorbidities, age, and injury information to predict survival rates and ICU admission [11]. Recently, Liu and Salinas published an extensive review on how machine learning has been used in the study of trauma [12]. In general, studies have focused on the capability to predict mortality, hemorrhage, and hospital length of stay. The datasets used in these studies generally came from local trauma centers and varied greatly in training and test set size, with most studies on the order of hundreds of patients and some on the order of thousands [7, 13]. Models based on ordinary and partial differential equations have also been used to study trauma but were not used in this report [14–18].

Here, we take a machine learning approach based on a gradient boosting classifier [19] for predicting survival probabilities. Furthermore, we make use of Shapley values to garner a physiological and quantitative understanding of why patients are either at high-risk or low-risk. With a reasonably small set of 8 easily measurable and commonly known features, we demonstrate accurate prediction of patient survival probabilities and the ability to indicate patient warning signs.

2. Methods

2.1 Patient dataset

All training and testing data were obtained from the National Trauma Data Bank (NTDB), the largest aggregation of trauma data ever assembled in the United States [20]. The 2016 NTDB dataset was used for all training and testing and consisted of 968,665 unique patients. Each patient was identified by a unique incident key, with comorbidities, vital signs, and injury information present in separate .csv files. The open-source library Pandas was used to import, clean, and merge each of the .csv incident files and generate a matrix of features and a vector of outcomes. Input features consisted of binary categorical features (e.g., gender, alcohol use disorder, etc.) and numerical features (e.g., age, systolic blood pressure, heart rate, etc.), while the outcome vector consisted of the binary states, survived or deceased.
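As an illustration of this assembly step, a minimal Pandas sketch is shown below; the file names and column labels (e.g., "INC_KEY", "SURVIVED") are illustrative placeholders rather than the actual NTDB field names.

```python
import pandas as pd

# Illustrative sketch of the data assembly step; file names and column
# labels are assumptions, not the actual NTDB field names.
vitals = pd.read_csv("vitals.csv")            # vital signs per incident
comorbid = pd.read_csv("comorbidities.csv")   # comorbidity flags per incident
outcomes = pd.read_csv("outcomes.csv")        # discharge outcome per incident

# Merge the relational tables on the unique incident key so that each row
# is one patient and each column is one feature.
data = (vitals
        .merge(comorbid, on="INC_KEY", how="inner")
        .merge(outcomes, on="INC_KEY", how="inner"))

X = data.drop(columns=["INC_KEY", "SURVIVED"])   # feature matrix
y = data["SURVIVED"]                             # 1 = survived, 0 = deceased
```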

2.2 Preprocessing

Of the 968,665 unique patients in the trauma database, 351,253 patients contained missing data. The death rate of the population of patients with missing data was 1.4 times greater than that of the patients without, suggesting that patients with missing data should not be ignored. Therefore, we used an iterative imputation method (see S1 File) to impute the missing values of all patients missing 2 or fewer features. This threshold was chosen based on the distribution of the number of missing features per patient, shown in S1 Fig. This captured 181,821 additional patients, and the death rate of all included patients was then approximately equal to that of the excluded patients.
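A minimal sketch of this imputation step with scikit-learn's IterativeImputer is shown below. The ridge estimator and 50 iteration rounds follow the description in S1 File; the remaining settings and the variable names (X, y, carried over from the assembly sketch above and assumed numeric) are assumptions.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Ridge

# Keep only patients missing 2 or fewer feature values, then impute the rest.
mask = X.isna().sum(axis=1) <= 2
X, y = X[mask], y[mask]

# Each missing feature is modeled as a ridge-regression function of the
# present features, iterated for 50 rounds (settings per S1 File; assumptions).
imputer = IterativeImputer(estimator=Ridge(), max_iter=50, random_state=0)
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
```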

Commonly presenting categorical features (hypertension, alcoholism, etc.) that were initially present in the comorbidities.csv file were encoded into their own binary columns, indicating whether a patient had the preexisting comorbidity or not. Continuous variables such as vital signs were also included. The feature matrix, X, is NxM dimensional where N is the total number of patients and M is the total number of features used to construct the model. All feature values in the feature matrix were rescaled to be between 0 and 1 by the minimum and maximum of each feature, i.e.,

X^{s}_{:,j} = \frac{X_{:,j} - \min(X_{:,j})}{\max(X_{:,j}) - \min(X_{:,j})},  (1)

where X^{s}_{:,j} is the rescaled jth column of the new feature matrix and X_{:,j} is the unscaled jth column of the original feature matrix. Feature rescaling is a standard preprocessing step performed so that all features are dimensionless and of the same order of magnitude.
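Eq (1) is standard min-max scaling, which scikit-learn implements column by column; a brief sketch (continuing the variable names used above) follows.

```python
from sklearn.preprocessing import MinMaxScaler

# Apply Eq (1) to every column of the imputed feature matrix.
X_scaled = MinMaxScaler().fit_transform(X_imputed)

# Equivalent explicit form of Eq (1) for a single column j:
# X_s[:, j] = (X[:, j] - X[:, j].min()) / (X[:, j].max() - X[:, j].min())
```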

2.3 Class imbalance

One of the challenges associated with classifying whether a patient survived or died is that the dataset had 778,303 survived patients and only 20,930 deceased patients, a very large class imbalance [21]. We chose to address this problem by undersampling from the survived class. A total of 85% of the deceased patients were randomly selected for the training set, and an equal number of survived patients were randomly selected for the training set. All other patient records were included in the test set, resulting in a training set of 35,580 records and a test set of 763,653 records, as shown in Fig 1B.
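A minimal NumPy sketch of this undersampling split is shown below, assuming the outcome vector is coded 1 for survived and 0 for deceased; the random seed and variable names are illustrative and carried over from the sketches above.

```python
import numpy as np

rng = np.random.default_rng(0)
deceased = np.where(y.values == 0)[0]
survived = np.where(y.values == 1)[0]

n_dec_train = int(0.85 * len(deceased))                       # 85% of deceased patients
train_idx = np.concatenate([
    rng.choice(deceased, size=n_dec_train, replace=False),    # deceased training records
    rng.choice(survived, size=n_dec_train, replace=False),    # equal number of survivors
])
test_idx = np.setdiff1d(np.arange(len(y)), train_idx)          # everything else is test data

X_train, y_train = X_scaled[train_idx], y.values[train_idx]
X_test, y_test = X_scaled[test_idx], y.values[test_idx]
```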

Fig 1. Process flow diagram of the process of building a predictive trauma model.

Fig 1

The dataset is acquired from the National Trauma Data Bank (NTDB), and any patient with more than 2 missing data fields was removed from the dataset (cleaning). The data consisted of relational tables with each patient identified by a unique incident key. By merging on the incident key, it was possible to generate a matrix of data where each row represented a unique patient and each column represented a unique feature. Features were included based on physiological information expected to contribute to the outcome of the model. This included age, gender, vital signs, coma and severity scores, and comorbidities. Facility and demographic information (other than age) was not included in the analysis. The dataset was then divided into a balanced training set (equal numbers of survived and deceased patients) and a test set, a model was trained on the training set with optimized hyperparameters (see S2 Fig), and the results were then reported and analyzed.

2.4 Feature selection

A critical step in determining the final accuracy and utility of the model in the trauma unit (i.e., bedside) was determining which features to include. In this study, features were chosen not only by which would be most predictive of outcome but also by which would be easiest to measure on admission. The permutation importance (PI) method [22] was used to determine which features were most likely to be predictive of outcome, see Fig 1C. The method consists of training a model, obtaining an accuracy for that model on the independent test set, then randomly permuting each feature column and measuring the change in accuracy. If the accuracy of the model decreases significantly, then the permuted feature was heavily contributing to the prediction of the model and should be kept in the final model.
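Scikit-learn implements this procedure as permutation_importance; the sketch below fits a 400-estimator gradient boosting model to the balanced training set and scores each shuffled feature by the drop in ROC-AUC. The feature_names list and the data variables are assumptions carried over from the earlier sketches.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Fit a model, then shuffle each feature column in the test set and record
# the drop in ROC-AUC; large drops mark features worth keeping.
model = GradientBoostingClassifier(n_estimators=400, random_state=0).fit(X_train, y_train)
pi = permutation_importance(model, X_test, y_test,
                            scoring="roc_auc", n_repeats=10, random_state=0)

for name, drop in sorted(zip(feature_names, pi.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: mean AUC drop = {drop:.4f}")
```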

In this case, the PI method guided the reduction of the feature set from 32 features to a final 8 features per patient record. The final 8 features used for model training were: systolic blood pressure (SBP), heart rate (HR), respiratory rate (RR), temperature (Temp), oxygen saturation (SaO2), gender, age, and total Glasgow coma score (GCSTOT). The full list of features before and after the reduction can be found in S1 Table. While the Glasgow Coma Score can be unreliable in intubated patients, only an exceedingly small percentage (<1%) of the entire NTDB trauma patient set were in that state [23].

Our description of the gradient boosting classifier, partial dependence curves, and SHAP values can be found in S1 File [5, 24–27].
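For completeness, a sketch of the 5-fold grid-search hyperparameter tuning referenced in S2 Fig is shown below; the grid values are illustrative assumptions, and the held-out test set is never used during the search.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative 5-fold grid search over a few plausible hyperparameters
# (the actual grid used for S2 Fig is not specified here).
param_grid = {"n_estimators": [100, 200, 400],
              "max_depth": [2, 3, 4],
              "learning_rate": [0.05, 0.1, 0.2]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)          # test set is withheld from the search
print(search.best_params_, search.best_score_)
```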

3. Results

The accuracy of the model is expressed as the area under the receiver operating characteristic curve (AUC). Given two patients, one who survived and one who died, the AUC represents the likelihood that the classifier assigns the higher survival probability to the patient who survived. An AUC of 0.50 is the worst case, as it implies that the classifier is no better than random guessing, while an AUC of 1.0 is the best case, as the classifier will always rank the two patients correctly. The AUC is also relatively insensitive to the class imbalance between the survived and deceased patients, making it a logical choice as the metric for accuracy. The gradient boosting model achieved an AUC of 0.924.
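Computing this metric is a one-line call once the model is trained; the sketch below (continuing the variables defined earlier) scores the predicted survival probabilities on the large, imbalanced test set.

```python
from sklearn.metrics import roc_auc_score

# Survival probability is the predicted probability of the "survived" class
# (class 1 under the labeling assumed above).
p_survive = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, p_survive)
print(f"ROC-AUC = {auc:.3f}")
```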

In other words, with a single 8-feature vector consisting of age, gender, five vital sign measurements, and the easily measurable Glasgow coma score, it was possible to predict the outcome of the patient with ~92.4% accuracy, making this a useful tool for quantifying patient risk. Using 32 features per patient for model training resulted in minimal improvement of the AUC (black line, Fig 2). Importantly, the high accuracy of the model implies that a single snapshot view (8 features) can give a quantitative prediction of the patient’s mortality risk on admission. For further validation, we tested the model on the 2017 Trauma Quality Programs participant use file (TQP PUF). This dataset consisted of an additional 648,192 complete patient records, and our model achieved an accuracy of 91.2%, further validating the robustness of the model and eliminating concerns of data leakage and biased evaluation. The high accuracy on a completely different cohort of patients is perhaps unsurprising given that the data points from the NTDB represent patients from numerous trauma centers. As we show below, the utility of the model is not only in its prediction of mortality risk, but also in its insight in quantifying key metrics that could be viewed as potential warning signs in a trauma setting.

Fig 2. The receiver operating characteristic curves (ROC) for 2 different cases.

Fig 2

The true positive rate (TPR) is plotted on the y-axis and the false positive rate (FPR) on the x-axis for classification thresholds between 0 and 1. The red curve includes only the 8 easily measurable vital signs or scores, while the black curve includes these plus the comorbidities. A full list of features in each case can be found in S1 Table. All results are reported using the 8-feature case because the required inputs can be measured rapidly, while knowledge of a patient’s comorbidities is less likely to be available. The heat map in the inset plots the 8 feature values of 100 randomly selected patients, illustrating the high dimensionality of the problem. While no obvious pattern is apparent to a human observer in the heat map, the algorithm is able to find and quantify one. Four zoomed-in examples are provided for clarity. Note that each column is normalized by its own feature value range.

The partial dependence curves are shown in Fig 3. Age, GCSTOT, and systolic blood pressure all display substantial influence on the probability of survival. The model predicts that by the age of ~60, a patient’s “youth protection” has substantially dissipated on average, and ages greater than this further reduce the probability of survival. Likewise, a GCSTOT below 12 increases the likelihood of death. More interesting, however, is the apparent threshold behavior in the heart rate and blood pressure profiles. The probability of survival begins to drop dramatically if the SBP < ~110 mmHg or the HR > ~100 beats per minute, consistent with the finding that hypotension is correlated with higher mortality rates [28–30]. The two variables are related to one another via the baroreflex, a negative-feedback loop that increases heart rate in response to a drop in blood pressure, which declines as blood volume is lost from the injury. Both vital signs should be viewed in tandem to assess patient status during resuscitation.
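Curves of this kind can be generated directly from the trained model with scikit-learn's PartialDependenceDisplay; the brief sketch below assumes the 8 feature columns are ordered as listed in Section 2.4 and reuses the variable names from the earlier sketches.

```python
from sklearn.inspection import PartialDependenceDisplay

# One partial dependence panel per admission feature (cf. Fig 3).
feature_names = ["SBP", "HR", "RR", "Temp", "SaO2", "Gender", "Age", "GCSTOT"]
PartialDependenceDisplay.from_estimator(model, X_train,
                                        features=list(range(8)),
                                        feature_names=feature_names,
                                        kind="average")
```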

Fig 3. Partial dependence curves showing how the prediction of the model is globally influenced by each of the features.

Fig 3

Pulse rate and systolic blood pressure display threshold behavior, where the probability of survival can decrease at HR > 100 beats/min and SBP < 110 mmHg [30].

The probability of survival model predictions for deceased and survived patients were plotted on histograms in Fig 4 for a visual representation of the effectiveness of the model. The distributions of survival probability had means of 0.21 (deceased) and 0.78 (survived) and were highly skewed (1.51 and -1.38, respectively) suggesting that the model was very confident in the predictions that it made.
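These summary statistics are straightforward to reproduce from the predicted probabilities; a brief sketch using scipy's skew is shown below, with variable names carried over from the earlier sketches.

```python
import matplotlib.pyplot as plt
from scipy.stats import skew

# Mean and skewness of the predicted survival probabilities per outcome group,
# plus histograms of the kind shown in Fig 4.
for label, name in [(0, "deceased"), (1, "survived")]:
    p = p_survive[y_test == label]
    print(f"{name}: mean = {p.mean():.2f}, skew = {skew(p):.2f}")
    plt.hist(p, bins=50, alpha=0.5, label=name)
plt.xlabel("Predicted probability of survival")
plt.legend()
plt.show()
```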

Fig 4. Histograms of the survival probabilities for survived and deceased patients.

Fig 4

If probabilities of death greater than 20% are marked as high risk, then ~96% of the deceased patients would have been labeled.

Next, SHAP values were used to examine individual patient records and quantify patient risk. As examples, 4 cases are shown in Fig 5. Note that the scales of Fig 5 are expressed as the log-odds ratio of the probability of survival to the probability of death (i.e., log(p_surv / (1 − p_surv))). A log-odds ratio of 0 (p_surv = 0.5) was used to binarize the patients into survived and deceased patients. With this metric, the model correctly predicted all 4 patient outcomes in Fig 5. The force plot of the SHAP values of each feature identifies the relative contribution of each variable, both positively (blue) and negatively (red). In panel A, while the patient was conscious (GCSTOT = 15) and had relatively normal heart and respiratory rates, his low blood pressure and age were quantifiably more significant and the reason the model predicted deceased (sum of SHAP values < 0). In case B, the patient’s youth and consciousness were enough to overcome his abnormal vital signs. While the patient did experience mild tachypnea and tachycardia, he was ultimately not a very high-risk patient. In the third case, the patient’s youth and consciousness could not compensate for the significant drop in blood pressure and elevated respiratory rate. The model identified 60 mmHg systolic blood pressure as a “red flag”, which should indicate to a trauma team that this patient is a priority. In panel D, the patient’s age and oxygen saturation were the key warning signs and the reason she was high-risk.
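Force plots of this kind can be produced with the shap library; for a binary gradient boosting model, TreeExplainer returns per-feature contributions on the log-odds scale, matching the output values quoted in the figures. The patient index and variable names below are illustrative.

```python
import shap

# Per-patient explanation of the trained gradient boosting model (cf. Figs 5-7).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)       # shape: (n_patients, 8), log-odds units

i = 0  # index of an illustrative patient
shap.force_plot(explainer.expected_value, shap_values[i], X_test[i],
                feature_names=feature_names, matplotlib=True)
```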

Fig 5. SHAP feature importance metrics for 4 patients that were correctly predicted as survived or deceased.

Fig 5

Output values (bold), expressed as the log-odds ratio of the probability of survival to the probability of death (i.e., log(p_surv / (1 − p_surv))), that are < 0 represent deceased patients (Cases A, C, D). Blue bars indicate that the feature value is increasing the probability of survival while red bars indicate that the feature is decreasing it.

We also computed the SHAP values for 8 cases where the model made incorrect predictions, as shown in Figs 6 and 7. Notably, in all 8 cases, there was a single feature that dominated the model prediction (GCSTOT, Age, and SaO2) instead of a combination of features as in Fig 5. Machine learning models are typically best at making predictions on unseen data that are as close to an interpolation of the training data as possible, and often fail when the unseen data is significantly different from the training data. One possible solution could be to explicitly model cross terms in the data to force the model to consider all feature-feature interactions. While the model is ideally learning the feature-feature interactions during the training process, sometimes explicit inclusion of these terms can improve model accuracy (although potentially at the expense of model interpretability).

Fig 6. SHAP feature importance metrics for 4 patients that were incorrectly predicted as deceased.

Fig 6

Output values (bold), expressed as the log-odds ratio of the probability of survival to the probability of death (i.e., log(p_surv / (1 − p_surv))), that are < 0 represent deceased patients. Blue bars indicate that the feature value is increasing the probability of survival while red bars indicate that the feature is decreasing it. In all 4 cases, there was one feature that dominated the model prediction.

Fig 7. SHAP feature importance metrics for 4 patients that were incorrectly predicted as survived.

Fig 7

Similar to incorrectly predicting deceased cases, there was one feature that dominated the model prediction.

4. Discussion

The NTDB-trained gradient boosting model was trained with tens of thousands of trauma patients from participating trauma centers around the country and was able to fit a robust decision boundary to the dataset. With only 8 features that can be measured upon a patient’s ER admission, our model provided an accurate metric for patient risk of death (AUC = 0.924, Fig 2) even though death was rare. This accuracy was further confirmed on a withheld dataset, the 2017 TQP PUF.

In a trauma setting, the usefulness of a model is limited not only by its accuracy but also by its interpretability: clinicians must understand how the model makes predictions in order to trust it. This has typically resulted in the use of linear models, but linear models are incapable of modeling complex decision boundaries, sacrificing accuracy in exchange for interpretability. SHAP scores circumvent this problem by providing a robust method for quantitatively explaining a model’s predictions. Using our model in conjunction with SHAP scores for the 8 features (Figs 5–7) provides a detailed and quantitative view of each vital sign’s contribution to risk. The method may serve several distinct uses. In a prioritization setting, where both time and nurse/surgeon availability are limited, the rapid generation of a hierarchy for patient treatment has value. The model can provide an objective ranking as to which patients should receive the most available resources and guide triage. Another use is to help explain those objective rankings with specific references to patient vital signs, potentially enhancing the prioritization. Furthermore, SHAP scores can be used to evaluate how actions taken to alter these variables may affect patient survival probability.

A limitation of this model is that it was trained on the vital signs of patients upon admission. While the model was accurate, predicting the time-series evolution of the patient will require dynamical training data. A natural extension would be to train a model that can predict patient risk in real time if time-series trauma patient data become available.

On the machine learning side, one limitation of our approach was the random undersampling procedure for balancing the number of survived patients with the number of deceased patients in the training set [21]. It is possible that more informative survived patients were left out of the training set, patients that could have led to an even more robust decision boundary [31]. Undersampling methods that select survived examples based upon their distances to deceased patients in the 8-dimensional space could improve model predictability, as they can more accurately model the decision boundary near “hard-to-classify” cases [31]. We also used the synthetic minority oversampling technique (SMOTE) to try to balance the training set, where artificial patient records are synthesized from existing records of the minority (deceased) class, but it was ineffective in this instance and the accuracy of our model decreased significantly [32].
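For reference, the SMOTE variant mentioned above is a one-line call in the imbalanced-learn library; the sketch below is illustrative, with the unbalanced feature matrix and outcome vector as assumed inputs.

```python
from imblearn.over_sampling import SMOTE

# Synthetic records for the under-represented class are interpolated from
# existing minority-class records; in this study the approach underperformed
# simple undersampling.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_unbalanced, y_unbalanced)
```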

Although not the focus of this paper, we also note that the gradient boosting method exhibited a ROC-AUC that exceeded that of various neural networks and other machine learning models. Neural networks and tree-based models are two of the most commonly used classification models in the machine learning community with neural networks frequently outperforming tree-based models as well [33]. While we extensively tuned the parameters to the neural network in hopes of attaining a higher accuracy, it was unable to eclipse our gradient boosting model.

The advantages of the present approach are: (1) only 8 features are needed, (2) all 8 features are readily available on admission, (3) the calculation is exceedingly fast, portable, and accurate, (4) the relative risk of each feature is determined and graphically presentable as in Figs 5 and 6, and (5) actual outcomes can be compared to the NTDB average performance on an individual basis. While features were chosen for inclusion based upon availability and importance, the accuracy of the model could be further improved by also including approximated inputs. For example, trauma surgeons generally also have an approximation of the injury severity score upon admission. When estimates of the severity of each injury were included in the model, the accuracy was found to increase by ~2–3%. In future work, a goal will be to determine whether the model predictions can be refined as a patient’s vital signs evolve in time.

Supporting information

S1 Fig. A distribution of the number of patients per number of missing features.

The distribution is bimodal, suggesting that including patients with a maximum of 2 missing features is the appropriate threshold for inclusion.

(TIF)

S2 Fig. The 5-fold grid-search cross validation method for selecting the approximately optimal hyperparameters.

Importantly, the test set was withheld during the grid-search cross validation process allowing it to remain a fair metric for evaluating performance on new data.

(TIF)

S3 Fig. A single weak learner randomly chosen from the trained gradient boosting ensemble of weak learners.

Variable thresholds, Friedman mean squared errors [19], percentage of training samples passed through each node, and log odds ratios are all present.

(TIF)

S1 Table. Table of all features used to make predictions.

(DOCX)

S1 File

(DOCX)

Data Availability

Data are available for purchase from the American College of Surgeons (ACS) via https://www.facs.org/quality-programs/trauma/tqp/center-programs/ntdb/datasets. The authors did not have any special access privileges to this data.

Funding Statement

This work was supported by NIH UO1-HL-131053 (S.L.D, T.S.) Funder website: https://report.nih.gov/ Funder Information: Andrei Kindzelski (kindzelskial@nhlbi.nih.gov) The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Rhee P, Joseph B, Pandit V, Aziz H, Vercruysse G, Kulvatunyou N, et al. Increasing trauma deaths in the United States. Ann Surg. 2014;260(1):13–21. 10.1097/SLA.0000000000000600 [DOI] [PubMed] [Google Scholar]
  • 2.Gruen RL, Brohi K, Schreiber M, Balogh ZJ, Pitt V, Narayan M, et al. Haemorrhage control in severely injured patients. Lancet [Internet]. 2012;380(9847):1099–108. Available from: 10.1016/S0140-6736(12)61224-0 [DOI] [PubMed] [Google Scholar]
  • 3.Gill MR, Reiley DG, Green SM. Interrater Reliability of Glasgow Coma Scale Scores in the Emergency Department. Ann Emerg Med. 2004;43(2):215–23. 10.1016/s0196-0644(03)00814-x [DOI] [PubMed] [Google Scholar]
  • 4.Cowley A, Ems T, States U. The golden hour in trauma: Dogma or medical folklore? Injury. 2015;46:525–7. 10.1016/j.injury.2014.08.043 [DOI] [PubMed] [Google Scholar]
  • 5.Lundberg SM, Erion GG, Lee S. Consistent Individualized Feature Attribution for Tree Ensembles. arXiv. 2018;(2). [Google Scholar]
  • 6.Edwards DF, Hollingsworth H, Zazulia AR, Diringer MN. Artificial Neural Networks improve the prediction of mortality in intracerebral hemorrhage. Neurology. 1999;53(2):351–7. 10.1212/wnl.53.2.351 [DOI] [PubMed] [Google Scholar]
  • 7.Marble RP, Healy JC. A neural network approach to the diagnosis of morbidity outcomes in trauma care. Artif Intell Med. 1999;15:299–307. 10.1016/s0933-3657(98)00059-1 [DOI] [PubMed] [Google Scholar]
  • 8.Karimi H, Bouduhi N. Role of artificial neural networks in prediction of survival of burn patients—a new approach. BURNS. 2002;28:579–86. 10.1016/s0305-4179(02)00045-1 [DOI] [PubMed] [Google Scholar]
  • 9.DiRusso SM, Sullivan T, Holly C, Cuff S, Savino J. An artificial neural network as a model for prediction of survival in trauma patients: validation for a regional trauma area. J Trauma Acute Care Surg. 2000;49(2):212–20. 10.1097/00005373-200008000-00006 [DOI] [PubMed] [Google Scholar]
  • 10.Walczak S. Artificial Neural Network Medical Decision Support Tool: Predicting Transfusion Requirements of ER Patients. IEEE Trans Biomed Eng. 2005;9(3):468–74. 10.1109/titb.2005.847510 [DOI] [PubMed] [Google Scholar]
  • 11.Mitchell RJ, Ting HP, Driscoll T, Braithwaite J. Identification and internal validation of models for predicting survival and ICU admission following a traumatic injury. Scand J Trauma Resusc Emerg Med. 2018;26(95):1–11. 10.1186/s13049-018-0563-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Liu NT, Salinas J. Machine Learning for Predicting Outcomes in Trauma. SHOCK. 2017;48(5):504–10. 10.1097/SHK.0000000000000898 [DOI] [PubMed] [Google Scholar]
  • 13.Ahmed N, Kuo Y, Sharma J, Kaul S. Elevated blood alcohol impacts hospital mortality following motorcycle injury: A National Trauma Data Bank analysis. Injury. 2020;51(1):91–6. 10.1016/j.injury.2019.10.005 [DOI] [PubMed] [Google Scholar]
  • 14.Tsiklidis EJ, Sinno T, Diamond SL. Coagulopathy implications using a multiscale model of traumatic bleeding matching macro- and microcirculation. Am J Physiol Heart Circ Physiol. 2019;l(33). 10.1152/ajpheart.00774.2018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Tsiklidis E, Sims C, Sinno T, Diamond SL. Multiscale systems biology of trauma-induced coagulopathy. Wiley Interdiscip Rev Syst Biol Med. 2018;10(4):1–10. 10.1002/wsbm.1418 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Neal ML, Bassingthwaighte JB. Subject-specific model estimation of cardiac output and blood volume during hemorrhage. Cardiovasc Eng. 2007;7(3):97–120. [Google Scholar]
  • 17.Reisner AT, Heldt T. A computational model of hemorrhage and dehydration suggests a pathophysiological mechanism: Starling-mediated protein trapping. Am J Physiol Heart Circ Physiol [Internet]. 2013;304(4):H620–31. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23203962 10.1152/ajpheart.00621.2012 [DOI] [PubMed] [Google Scholar]
  • 18.Ursino M. Interaction between carotid baroregulation and the pulsating heart: a mathematical model. Am J Physiol. 1998;275(5 Pt 2):H1733–47. 10.1152/ajpheart.1998.275.5.H1733 [DOI] [PubMed] [Google Scholar]
  • 19.Friedman JH. Greedy Function Approximation: A Gradient Boosting Machine. Ann Stat. 2001;29(5):1189–232. [Google Scholar]
  • 20.Committee on Trauma, American College of Surgeons. NTDB Version 2016. Chicago, IL: American College of Surgeons; 2017.
  • 21.Liu X, Wu J, Zhou Z. Exploratory Undersampling for Class-Imbalance Learning. IEEE Trans Syst Man, Cybern Part B. 2009;39(2):539–50. [DOI] [PubMed] [Google Scholar]
  • 22.Breiman L. Random Forests. Mach Learn. 2001;5–32. [Google Scholar]
  • 23.Meredith W, Rutledge R, Fakhry SM, Emery S, Kromhout-Schiro S. The conundrum of the Glasgow Coma Scale in intubated patients: a linear regression prediction of the Glasgow verbal score from the Glasgow eye and motor scores. J Trauma Acute Care Surg. 1998;44(5):839–845. 10.1097/00005373-199805000-00016 [DOI] [PubMed] [Google Scholar]
  • 24.Lundberg SM, Nair B, Vavilala MS, Horibe M, Eisses MJ, Adams T, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng [Internet]. 2018;2(October):749–60. Available from: 10.1038/s41551-018-0304-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Lundberg SM, Lee S. A Unified Approach to Interpreting Model Predictions. Conf Neural Inf Process Syst. 2017;(Section 2):1–10. [Google Scholar]
  • 26.Shapley L. A Value for n-Person Games. Contrib to Theory Games. 1953;307–17. [Google Scholar]
  • 27.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12(1):2825–30. [Google Scholar]
  • 28.Zenati MS, Billiar TR, Townsend RN, Peitzman AB, Harbrecht BG. A Brief Episode of Hypotension Increases Mortality in Critically Ill Trauma Patients. J Trauma Inj Infect Crit Care. 2002;(August):232–7. 10.1097/00005373-200208000-00007 [DOI] [PubMed] [Google Scholar]
  • 29.Bhandarkar P, Munivenkatappa A, Roy N, Al. E. On-admission blood pressure and pulse rate in trauma patients and their correlation with mortality: Cushing’s phenomenon revisited. Int J Crit Illn Inj Sci. 2017;7:14–7. 10.4103/2229-5151.201950 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Eastridge B, Salinas J, McManus J, Blackburn L, Bugler EM, Cooke WH, et al. Hypotension begins at 110 mm Hg: redefining “hypotension” with data. J Trauma Acute Care Surg. 2007;63(2):291–7. 10.1097/TA.0b013e31809ed924 [DOI] [PubMed] [Google Scholar]
  • 31.Zhang J, Mani I. kNN Approach to Unbalanced Data Distributions: A Case Study involving Information Extraction. In: Proceeding of International Conference on Machine Learning (ICML 2003), Workshop on Learning from Imbalanced Data Sets. 2003.
  • 32.Fernandez A, Garcia S, Herrera F, Chawla NV. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. J Artif Intell Res [Internet]. 2018;61:863–905. Available from: https://www.jair.org/index.php/jair/article/view/11192 [Google Scholar]
  • 33.Haldar M, Abdool M, Ramanathan P, Xu T, Yang S, Duan H, et al. Applying deep learning to airbnb search. arXiv e-prints. 2018;1927–35. [Google Scholar]

Decision Letter 0

Zsolt J Balogh

29 Jul 2020

PONE-D-20-16709

Using the National Trauma Data Bank (NTDB) and machine learning to predict trauma patient mortality at admission

PLOS ONE

Dear Dr. Diamond,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 12 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Zsolt J. Balogh, MD, PhD, FRACS

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: I Don't Know

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I congratulate the authors for using the NTDB and TQIP in their research efforts. I cannot comment on the methodology. I would ask the authors to expand their discussion on when the model predicted death, but the patients survived. I will ask the authors on how they think clinicians can use this methodology. Is in selecting the right research patients? can this method be used prehospital to guide triage to the appropriate location? will it be used to limit care? these issues are important, and as the authors have launched down this pathway of clinical decision support (which I personally support) they must marry the technology with how we are supposed to ethically utilize.

GCS is notoriously unreliable when patients are intubated. how did the authors handle this? Lastly, I agree that the traditional cut points for HR and SBP are wrong, as shown by Eastridge el al. over a decade ago.

Eastridge BJ, Salinas J, McManus JG, et al. Hypotension begins at 110 mm Hg: redefining "hypotension" with data. 2008 Aug;65(2):501. Concertino, Victor A. Trauma. 2007;63(2):291-299.

Reviewer #2: The authors have used an immense dataset (nearly 1 million patients) from the USA National Trauma Data Bank to generate a prediction model for risk of death from major trauma.

Broadly, the article provides quantitative support for the ability of an experienced observer to pick the ‘gestalt’ of a patient. Those experienced surgeons who can pick "big sick" just by seeing the patient in the trauma bay with the most basic of clinical observations.

Specific comments

The authors excluded incomplete data due to their concern regarding introduction of bias. 36% of the initial dataset was excluded due to incomplete data. The authors tantalizingly mention that the excluded data had over double the risk of death compared to the included data. As is often the case, these patients may be the most interesting, and is the real weakness with large database studies like this. There is no analysis beyond the relative mortality rate of the included vs excluded groups. The authors should explore the characteristics of the missing data more. It is very surprising that the authors have access to mortality data from the incomplete records, but not to admission HR/RR/GCSTOT/SBP/Age/Temp/Gender which could greatly expand the test dataset and reassure the reader that the model doesn’t lack external validity.

The authors allay concerns about this exclusion weakening the external validity of the model by further testing it against the TQP PUF dataset from 2017 which is a separate population. The model maintains predictive power of over 90% in this dataset which is reassuring.

The authors explore the characteristics of patients the model incorrectly overclassified as likely to die, but did not explore those under-classified. From a trauma systems point of view (where trauma activation criteria is deliberately over-inclusive, to avoid missing at-risk patients), it would be very useful to understand the characteristics of patients whom the model failed to identify as likely to die. This is where the real risk lies in implementing a model like this as a clinical decision aid. (Ambulance dispatch, remote and regional hospital transfer decisions etc)

The article’s discussion section could be expanded to discuss the usefulness of the model’s predictive ability. There should be a structured discussion regarding limitations of the study.

Line 308 is unhelpful and should be revised.

“Clinicians continually make decisions based on complex information, however they do not construct highly predictive, non-linear functions, in an 8-dimensional space based upon statistical learning over 10^4 to 10^5 cases.”

It can be argued that clinicians actually do exactly this (albeit over 10^3 – 10^4 patients instead) and that that’s what clinical experience and training is. One potential revision is to point out that models like this allow ‘experience’ to be generated programmatically, allowing it to be applied in personnel-limited settings and over many more patients than any one clinician can establish in their experience.

Reviewer #3: In this article the authors try to estimate mortality by utilizing a machine learning modality, using the National Trauma Data Bank. In a final setup 8 variables were identified that could further detail risk patterns for mortality.

Research question and embedding in current literature: the rationale and timeliness are adequate and very well delineated in the introduction.

Methods used: adequate methods were used, with separate modeling, training and evaluation sets. The sample set with almost 1 million unique datasets is a very good basis for this kind of studies. One can considerably argue about the removal of incomplete datasets. Even more if the death rate in this set is 2.5 times higher than that for complete sets. Not being a advocate for imputation, however leaving out a that kind of skewed population comes with hazards for the outcome, which should be better substantiated. Did the authors consider a test run with and with out imputation to at least have a faint idea what the consequences are of this way of treating the data?

Moreover the authors decided to go for a random sample for the training database and because of the skewness toward non-deceased under sampled the non-survivors. Is there any idea on what the consequences were? Any time differences? Adequacy of the randomization proces? How did this under sampling take technically place?

In the set the GCS is taken as an important factor. One of the major problems with the GCS is that it can be influenced by the pre-hospital treatment and is especially problematic in the severely injured patients that come in already intubated. How did the authors correct for this?

Why was chosen for the supervised learning method as this already points mostly into a certain direction of the results. Any comment on this?

The explanation of the machine learning proces (from line 151 on) is very detailed, however adequate, but it possibly would be better to get this part into the supplemental material as most of the PLOS readership will not have any connection with it. From paragraph 2.6 could be left in.

Results: the results are impressive with an accuracy of over 91 %.The authors advise that the vital signs (BP and HR) should be viewed as tandem, mainly because these are interrelated. Could they explain how they would see (advice) this in practice. This is in line with the very old Algöwer principal of the shock index.

In line 265 and further the authors should explain a little bit more what they mean with their statement, taken the average readership in mind.

Discussion: The authors constantly have the average clinician in mind and should be complimented with that, however no real advise how to use the very nice findings in clinical practice can be found in the discussion. How could the readership get advantage from the findings that are obviously better than most predictions at hand, but how should they going to use it in clinical practice what are the next steps to bring this to the doctor and the patient, who should benefit from this.

Moreover in the discussion is very limited and no real comparison with the literature has been made, which should be extended in my mind.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Nov 17;15(11):e0242166. doi: 10.1371/journal.pone.0242166.r002

Author response to Decision Letter 0


15 Aug 2020

Response to reviewers document included in the files. However, a copy and pasted version is available below.

Response to Reviewer #1 (Our comments in bold and italics)

Reviewer #1: I congratulate the authors for using the NTDB and TQIP in their research efforts. I cannot comment on the methodology. I would ask the authors to expand their discussion on when the model predicted death, but the patients survived.

We thank the reviewer for their supportive comments.

1. I will ask the authors on how they think clinicians can use this methodology. Is in selecting the right research patients? Can this method be used prehospital to guide triage to the appropriate location? will it be used to limit care? these issues are important, and as the authors have launched down this pathway of clinical decision support (which I personally support) they must marry the technology with how we are supposed to ethically utilize.

We have added:

The method may serve several distinct uses. In a prioritization setting, where both time and nurse/surgeon availability are limited, the rapid generation of a hierarchy for patient treatment has value. The model can provide an objective ranking as to which patients should receive the most available resources and guide triage. Another use is to help explain those objective rankings with specific references to patient vital signs, potentially enhancing the prioritization.

2. GCS is notoriously unreliable when patients are intubated. How did the authors handle this? Lastly, I agree that the traditional cut points for HR and SBP are wrong, as shown by Eastridge el al. over a decade ago.

Eastridge BJ, Salinas J, McManus JG, et al. Hypotension begins at 110 mm Hg: redefining "hypotension" with data. 2008 Aug;65(2):501. Concertino, Victor A. Trauma. 2007;63(2):291-299.

We agree that GCS can be unreliable when patients are intubated and have included some context regarding intubation. Of the 799,233 patient population, only 903 patients were labeled as “intubated or chemically sedated”, representing only ~0.1% of the population, a likely very small contribution. We added the statement:

“While the Glasgow Coma Score can be unreliable in intubated patients [Meredith], only an exceedingly small percentage (<1%) of the entire NTDB trauma patient set were in that state.”

Meredith, W., Rutledge, R., Fakhry, S. M., Emery, S., & Kromhout-Schiro, S. (1998). The conundrum of the Glasgow Coma Scale in intubated patients: a linear regression prediction of the Glasgow verbal score from the Glasgow eye and motor scores. The Journal of trauma, 44(5), 839–845. https://doi.org/10.1097/00005373-199805000-00016

We also included the earlier references identified by the reviewer in our discussion on the threshold behavior of the systolic blood pressure of trauma patients.

Response to Reviewer #2 (Our comments in bold and italics)

Reviewer #2: The authors have used an immense dataset (nearly 1 million patients) from the USA National Trauma Data Bank to generate a prediction model for risk of death from major trauma. Broadly, the article provides quantitative support for the ability of an experienced observer to pick the ‘gestalt’ of a patient. Those experienced surgeons who can pick "big sick" just by seeing the patient in the trauma bay with the most basic of clinical observations.

We thank the reviewer for their thoughts.

Specific comments

1. The authors excluded incomplete data due to their concern regarding introduction of bias. 36% of the initial dataset was excluded due to incomplete data. The authors tantalizingly mention that the excluded data had over double the risk of death compared to the included data. As is often the case, these patients may be the most interesting, and is the real weakness with large database studies like this. There is no analysis beyond the relative mortality rate of the included vs excluded groups. The authors should explore the characteristics of the missing data more. It is very surprising that the authors have access to mortality data from the incomplete records, but not to admission HR/RR/GCSTOT/SBP/Age/Temp/Gender which could greatly expand the test dataset and reassure the reader that the model doesn’t lack external validity. The authors allay concerns about this exclusion weakening the external validity of the model by further testing it against the TQP PUF dataset from 2017 which is a separate population. The model maintains predictive power of over 90% in this dataset which is reassuring.

The decision to remove the missing data records was difficult but necessary. Motivated by the reviewer comment, we have gone back and taken a closer look at the data that was removed. We observed that the death rate was 1.4x higher in the missing data group than in the complete data group, which was too significant to ignore. Therefore, imputation was performed on missing entries for patients who were missing a maximum of 2 features. This increased the size of the usable patient database to 799,233 patients and the accuracy increased to ~92.4%. We thank the reviewer for this suggestion. Most notably, the death rate of the remaining excluded data (with more than 2 missing features) was now found to be approximately equal to that of the included data – which should further allay concerns over excluded data. This is now discussed further in the manuscript.

In the methods section, we added:

Of the 968,665 unique patients in the trauma database, 351,253 patients contained missing data. The death rate of the population of patients with missing data was 1.4x greater than that of the patients without, suggesting that missing data could not be ignored. Therefore, we used an iterative imputation method (see supplemental section for details) to impute the missing values of all patients missing 2 or fewer features. This captured 181,821 additional patients, and the death rate of all included patients was now approximately equal to that of the excluded patients.

In the supplemental section, we added:

We used the IterativeImputer class from the scikit-learn library to impute missing data in patients missing 2 or fewer features [Pedregosa]. Missing features were modeled as functions of present features and a ridge linear regression model was trained to predict the missing value. At a single step in the iteration, a single feature was treated as the missing output while the remaining features were assigned as inputs. This was repeated for each feature in a single round, and then iterated for 50 rounds. In the rare instances that this imputation led to unphysical values (e.g., age < 0), we simply imputed the value with a nearby physical value. This is superior to simply imputing the missing value with the mean of the present features as it imputes the average over many approximations of possible values. Furthermore, simple imputation significantly increases model variance, leading to lower generalizability.

2. The authors explore the characteristics of patients the model incorrectly overclassified as likely to die, but did not explore those under-classified. From a trauma systems point of view (where trauma activation criteria is deliberately over-inclusive, to avoid missing at-risk patients), it would be very useful to understand the characteristics of patients whom the model failed to identify as likely to die. This is where the real risk lies in implementing a model like this as a clinical decision aid. (Ambulance dispatch, remote and regional hospital transfer decisions etc)

An additional figure (Fig. 7) and discussion for this under-classified case has been added to the paper. We have added/edited the following from the discussion section:

We also computed the SHAP values for 8 cases where the model made incorrect predictions, as shown in Fig. 6 and Fig. 7. Notably, in all 8 cases, there was a single feature that dominated the model prediction (GCSTOT, Age, and SaO2) instead of a combination of features as in Fig. 5. Machine learning models are typically best at making predictions on unseen data that are as close to an interpolation of the training data as possible, and often fail when the unseen data is significantly different from the training data. One possible solution could be to explicitly model cross terms in the data to force the model to consider all feature-feature interactions. While the model is ideally learning the feature-feature interactions during the training process, sometimes explicit inclusion of these terms can improve model accuracy (although potentially at the expense of model interpretability).

3. The article’s discussion section could be expanded to discuss the usefulness of the model’s predictive ability. There should be a structured discussion regarding limitations of the study.

Line 308 is unhelpful and should be revised.

“Clinicians continually make decisions based on complex information, however they do not construct highly predictive, non-linear functions, in an 8-dimensional space based upon statistical learning over 10^4 to 10^5 cases.” It can be argued that clinicians actually do exactly this (albeit over 10^3 – 10^4 patients instead) and that that’s what clinical experience and training is. One potential revision is to point out that models like this allow ‘experience’ to be generated programmatically, allowing it to be applied in personnel-limited settings and over many more patients than any one clinician can establish in their experience.

We agree. The discussion section has been expanded/revised to the following:

The NTDB-trained gradient boosting model was trained with thousands of trauma patients from participating trauma centers around the country and was able to fit a robust decision boundary to the dataset. With only 8 features that can be measured upon a patient’s ER admission, our model was able to provide an accurate metric for patient risk of death (AUC = 0.924, Fig. 2) even when death was rare. This high accuracy was further tested on a withheld dataset to further validate our claims of its high accuracy.

In a trauma setting, the usefulness of a model is limited not only by its accuracy but also by its interpretability: clinicians must understand how the model makes predictions in order to trust it. This has typically resulted in the use of linear models, but linear models are incapable of modeling complex decision boundaries, so accuracy is lost in exchange for interpretability. SHAP scores circumvent this problem by providing a robust method for quantitatively explaining a model’s predictions. Using our model in conjunction with SHAP scores for the 8 features (Fig. 5-7) provides a detailed and quantitative view of each vital sign’s contribution to risk. The method may serve several distinct uses. In a prioritization setting, where both time and nurse/surgeon availability are limited, the rapid generation of a hierarchy for patient treatment has value: the model can provide an objective ranking as to which patients should receive the most available resources and guide triage. Another use is to explain those objective rankings with specific reference to patient vital signs, potentially enhancing the prioritization. Furthermore, the SHAP scores can be used to evaluate how actions taken to alter these variables may affect a patient’s survival probability.
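Continuing the sketch above (reusing the fitted clf, rng, and numpy import), the prioritization use case reduces to ranking incoming patients by predicted death probability; the variable names are illustrative only.

# Rank a batch of newly admitted patients by predicted probability of death,
# so the highest-risk patients surface first for triage.
incoming = rng.normal(size=(10, 8))        # stand-in for 10 new admissions
risk = clf.predict_proba(incoming)[:, 1]   # P(death) per patient
triage_order = np.argsort(risk)[::-1]      # indices sorted from highest to lowest risk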

A limitation of this model is that it was trained on vital signs recorded at admission. While the model was accurate, predicting the time-series evolution of a patient will require dynamic training data. A natural extension would be to train a model that predicts patient risk in real time when time-series trauma patient data are available.

On the machine learning side, one limitation of our approach was the random undersampling procedure used to balance the number of survived patients with the number of deceased patients in the training set [Liu]. It is possible that more informative survived patients, which could have supported an even more robust decision boundary, were left out of the training set [Zhang]. Undersampling methods that select survived examples based upon their distances to deceased patients in the 8-dimensional feature space could improve model predictability [Zhang], as they more accurately model the decision boundary near “hard-to-classify” cases. We also tried the synthetic minority oversampling technique (SMOTE) to balance the training set, where artificial patient records from the deceased (minority) class are generated from existing deceased patient records, but it was ineffective in this instance and the accuracy of our model decreased significantly [Fernandez].
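For reference, both balancing strategies discussed here can be expressed with the imbalanced-learn package; the sketch below is a generic illustration on synthetic data, not the exact procedure used in the paper.

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced data: ~2.6% of labels belong to the minority (deceased) class.
rng = np.random.default_rng(2)
X = rng.normal(size=(10000, 8))
y = (rng.random(10000) < 0.026).astype(int)

# Random undersampling: drop majority-class (survived) records until the classes balance.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# SMOTE: synthesize new minority-class (deceased) records by interpolating between neighbors.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)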

Although not the focus of this paper, we also note that the gradient boosting method achieved a ROC-AUC exceeding that of various neural networks and other machine learning models we tested. Neural networks and tree-based models are two of the most commonly used classification models in the machine learning community, with neural networks frequently outperforming tree-based models [Haldar]. Even after extensive hyperparameter tuning, the neural network was unable to eclipse our gradient boosting model.

The advantages of the present approach are: (1) only 8 features are needed, (2) all 8 features are readily available on admission, (3) the calculation is exceedingly fast, portable, and accurate, (4) the relative risk contribution of each feature is determined and graphically presentable as in Fig. 5 and 6, and (5) actual outcomes can be compared to the NTDB average performance on an individual basis. While features were chosen for inclusion based upon availability and importance, the accuracy of the model could be further improved by also including approximated inputs. For example, trauma surgeons generally have an approximation of the injury severity score upon admission. When estimates of the severity of each injury were included in the model, its accuracy was found to increase by ~2-3%. In future work, a goal will be to determine whether the model predictions can be refined as a patient’s vital signs evolve in time.

Fernandez A, Garcia S, Herrera F, Chawla NV. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. J Artif Intell Res. 2018;61:863–905.

Zhang J, Mani I. kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction. In: Proceedings of the International Conference on Machine Learning (ICML 2003), Workshop on Learning from Imbalanced Data Sets. 2003.

Liu X, Wu J, Zhou Z. Exploratory Undersampling for Class-Imbalance Learning. IEEE Trans Syst Man Cybern Part B. 2009;39(2):539–50.

Haldar M, Abdool M, Ramanathan P, Xu T, Yang S, Duan H, et al. Applying deep learning to Airbnb search. arXiv e-prints. 2018;1927-35.

Response to Reviewer #3 (Our comments in bold and italics)

Reviewer #3: In this article the authors try to estimate mortality by utilizing a machine learning modality, using the National Trauma Data Bank. In a final setup 8 variables were identified that could further detail risk patterns for mortality.

Research question and embedding in current literature: the rationale and timeliness are adequate and very well delineated in the introduction.

We thank the reviewer for their thoughts.

1. Methods used: adequate methods were used, with separate modeling, training and evaluation sets. The sample set with almost 1 million unique datasets is a very good basis for this kind of study. One can considerably argue about the removal of incomplete datasets, even more so if the death rate in this set is 2.5 times higher than that for complete sets. Not being an advocate for imputation, however, leaving out that kind of skewed population comes with hazards for the outcome, which should be better substantiated. Did the authors consider a test run with and without imputation to at least have a faint idea what the consequences are of this way of treating the data?

We have taken a closer look at the missing data and noticed some near-duplicate entries in which patient vital signs were sometimes counted twice. When these were removed from the database, the true death rate of the excluded data population was found to be ~1.4-fold higher (not 2.5-fold higher). This has been corrected. However, the referee’s point is still valid and must be addressed carefully.

Handling missing data requires imputation of the missing values, but it is impossible to determine the extent to which this is helping or hurting the model’s predictions (Jakobsen, Sterne). Poor imputation can add noise to the model and decrease model accuracy, while excluding missing data from the analysis can introduce bias. Ultimately, the onus is on the data analyst to consider the situation, assess the missingness, and propose a solution for it.

Based on the referee’s comments, we spent considerable time on this question and arrived at a solution. Of the patients with missing data, ~52% were missing fewer than 3 features. We chose to add this subset to the dataset by imputing the missing values with the iterative imputation method described in the supplemental section. This captured 181,821 additional patients, 7,011 of whom were deceased. We believe this is an appropriate compromise between including additional data and adding noise to the model. Furthermore, the distribution of the number of missing features per patient was bimodal, making a maximum of 2 missing features a natural threshold. Conveniently, this raises the death rate in the usable patient dataset to 2.62% and lowers the death rate in the excluded patient dataset to 2.59%, which should allay concerns about bias. Moreover, the accuracy of the model increased to 92.4% with the additional data. We thank the reviewer for this suggestion.
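A small sketch of this inclusion rule, with illustrative column names and toy values, is shown below: count the missing admission features per record and keep only records missing 2 or fewer, which then go on to imputation.

import numpy as np
import pandas as pd

# Toy records; column names are illustrative stand-ins for the 8 admission features.
records = pd.DataFrame({
    "SBP":    [120.0, np.nan, np.nan, 95.0],
    "HR":     [80.0, np.nan, 110.0, np.nan],
    "GCSTOT": [15.0, np.nan, 7.0, np.nan],
    "AGE":    [34.0, 51.0, np.nan, 68.0],
})

n_missing = records.isna().sum(axis=1)   # number of missing features per patient
usable = records[n_missing <= 2]         # eligible for iterative imputation
excluded = records[n_missing > 2]        # dropped from the analysis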

The text has been updated in all relevant places to account for this change.

In the methods section, we added:

Of the 968,665 unique patients in the trauma database, 351,253 patients had missing data. The death rate of the population of patients with missing data was 1.4 times greater than that of patients without, suggesting that the missing data could not be ignored. Therefore, we used an iterative imputation method (see supplemental section for details) to impute the missing values of all patients missing 2 or fewer features. This captured 181,821 additional patients, and the death rate of the included patients was approximately equal to that of the excluded patients.

In the supplemental section, we added:

We used the IterativeImputer class from the scikit-learn library to impute missing data in patients missing 2 or fewer features [Pedregosa]. Missing features are modeled as functions of the present features, and a ridge linear regression model is trained to predict each missing value. At a single step in the iteration, one feature is treated as the missing output while the remaining features serve as inputs. This is repeated for each feature in a single round, and then iterated for 50 rounds. In the rare instances where this imputation led to unphysical values (e.g., age < 0), we simply replaced the value with a nearby physical value. This approach is superior to simply imputing a missing value with the feature mean because it averages over many approximations of the possible values. Furthermore, simple mean imputation significantly increases model variance, leading to lower generalizability.

Jakobsen JC, Gluud C, Wetterslev J, Winkel P. When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts. BMC Med Res Methodol. 2017;17(1):1–10.

Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: Potential and pitfalls. BMJ. 2009;339(7713):157-60.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–30.

2. Moreover the authors decided to go for a random sample for the training database and because of the skewness toward non-deceased under sampled the non-survivors. Is there any idea on what the consequences were? Any time differences? Adequacy of the randomization process? How did this under sampling technically take place?

Undersampling from the majority class to create a balanced training set is a well-established, common solution to the class-imbalance problem (Liu). However, it does come with potential consequences that should be acknowledged. The first is that we exclude many instances of the survived population from the training set, some of which could be important for fitting a robust decision boundary; information-rich survived patients may not be preserved, which could be detrimental to model fit and reduce accuracy. An alternative approach is oversampling, whereby instances from the minority class are duplicated to balance the relative numbers of survived and deceased patients, but this actually led to a significant reduction in accuracy.

The randomization was carried out by randomly selecting 85% of the deceased patients from the database for the training set, and then selecting an equal number of randomly chosen survived patients to balance the training set. The randomization was performed with the Python library NumPy, with a specified seed for reproducibility from run to run.
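A minimal sketch of this randomization, using NumPy with a fixed seed as described, is shown below; the group sizes are arbitrary placeholders rather than the actual NTDB counts.

import numpy as np

rng = np.random.default_rng(42)     # fixed seed for run-to-run reproducibility

# Placeholder index arrays for the two outcome groups (sizes are illustrative).
deceased_idx = np.arange(1000)
survived_idx = np.arange(1000, 40000)

# 85% of deceased patients go into the training set, matched by an equal number
# of randomly chosen survived patients so the training classes are balanced.
n_train = int(0.85 * deceased_idx.size)
train_deceased = rng.choice(deceased_idx, size=n_train, replace=False)
train_survived = rng.choice(survived_idx, size=n_train, replace=False)
train_idx = np.concatenate([train_deceased, train_survived])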

We added:

On the machine learning side, one limitation of our approach was the random undersampling procedure used to balance the number of survived patients with the number of deceased patients in the training set [Liu]. It is possible that more informative survived patients, which could have supported an even more robust decision boundary, were left out of the training set [Zhang]. Undersampling methods that select survived examples based upon their distances to deceased patients in the 8-dimensional feature space could improve model predictability [Zhang], as they more accurately model the decision boundary near “hard-to-classify” cases. We also tried the synthetic minority oversampling technique (SMOTE) to balance the training set, where artificial patient records from the deceased (minority) class are generated from existing deceased patient records, but it was ineffective in this instance and the accuracy of our model decreased significantly [Fernandez].

Fernandez A, Garcia S, Herrera F, Chawla NV. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. J Artif Intell Res. 2018;61:863–905.

Zhang J, Mani I. kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction. In: Proceedings of the International Conference on Machine Learning (ICML 2003), Workshop on Learning from Imbalanced Data Sets. 2003.

Liu X, Wu J, Zhou Z. Exploratory Undersampling for Class-Imbalance Learning. IEEE Trans Syst Man Cybern Part B. 2009;39(2):539–50.

3. In the set the GCS is taken as an important factor. One of the major problems with the GCS is that it can be influenced by the pre-hospital treatment and is especially problematic in the severely injured patients that come in already intubated. How did the authors correct for this?

This comment was also made in point 2 by reviewer #1, please see our specific response there.

4. Why was the supervised learning method chosen, as this already points mostly in a certain direction for the results? Any comment on this?

Gradient boosting was chosen because it was the most accurate method for classifying the outcomes. Unsupervised learning was also performed on the dataset, but no meaningful inferences could be drawn from that analysis, so it was not included in the paper.

5. The explanation of the machine learning process (from line 151 on) is very detailed, however adequate, but it would possibly be better to move this part into the supplemental material as most of the PLOS readership will not have any connection with it. From paragraph 2.6 on, it could be left in.

We have moved these sections to the supplemental section.

6. Results: the results are impressive with an accuracy of over 91%. The authors advise that the vital signs (BP and HR) should be viewed in tandem, mainly because these are interrelated. Could they explain how they would see (advise) this in practice? This is in line with the very old Allgöwer principle of the shock index.

In line 265 and further, the authors should explain a little bit more what they mean with their statement, keeping the average readership in mind. Discussion: The authors constantly have the average clinician in mind and should be complimented for that; however, no real advice on how to use the very nice findings in clinical practice can be found in the discussion. How could the readership take advantage of findings that are obviously better than most predictions at hand? How should they use them in clinical practice, and what are the next steps to bring this to the doctor and the patient, who should benefit from this?

Lines 265 and beyond primarily discuss our insights from the SHAP value analysis. We added:

Using our model in conjunction with SHAP scores for the 8 features (Fig. 5 and 6) provides a detailed and quantitative view of each vital sign’s contribution to risk. The method may serve several distinct uses. In a prioritization setting, where both time and nurse/surgeon availability are limited, the rapid generation of a hierarchy for patient treatment has value: the model can provide an objective ranking as to which patients should receive the most available resources and guide triage. Another use is to explain those objective rankings with specific reference to patient vital signs, potentially enhancing the prioritization. Furthermore, the SHAP scores can be used to evaluate how actions taken to alter these variables may affect a patient’s survival probability.

8. Moreover, the discussion is very limited and no real comparison with the literature has been made, which in my mind should be extended.

We agree and addressed it in our response to the 3rd point made by reviewer #2. Please see our response there.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Zsolt J Balogh

28 Oct 2020

Using the National Trauma Data Bank (NTDB) and machine learning to predict trauma patient mortality at admission

PONE-D-20-16709R1

Dear Dr. Diamond,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Zsolt J. Balogh, MD, PhD, FRACS

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Thank you.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have adequately addressed all concerns. No additional comments for the authors.

Reviewer #2: (No Response)

Reviewer #3: The authors have largely followed the comments made and have changed and added considerable parts of the manuscript accordingly.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Luke PH Leenen

Acceptance letter

Zsolt J Balogh

30 Oct 2020

PONE-D-20-16709R1

Using the National Trauma Data Bank (NTDB) and machine learning to predict trauma patient mortality at admission

Dear Dr. Diamond:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Zsolt J. Balogh

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. A distribution of the number of patients per number of missing features.

    The distribution is bimodal, suggesting that including patients with a maximum of 2 missing features is the appropriate threshold for inclusion.

    (TIF)

    S2 Fig. The 5-fold grid-search cross validation method for selecting the approximately optimal hyperparameters.

    Importantly, the test set was withheld during the grid-search cross validation process allowing it to remain a fair metric for evaluating performance on new data.

    (TIF)

    S3 Fig. A single weak learner randomly chosen from the trained gradient boosting ensemble of weak learners.

    Variable thresholds, Friedman mean squared errors [19], percentage of training samples passed through each node, and log odds ratios are all present.

    (TIF)

    S1 Table. Table of all features used to make predictions.

    (DOCX)

    S1 File

    (DOCX)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    Data are available for purchase from the American College of Surgeons (ACS) via https://www.facs.org/quality-programs/trauma/tqp/center-programs/ntdb/datasets. The authors did not have any special access privileges to this data.

