A Clinically Practical and Interpretable Deep Model for ICU Mortality Prediction with External Validation

Yanni Kang; Xiaoyu Jia; Kaifei Wang; Yiying Hu; Jianying Guo; Lin Cong; Xiang Li; Guotong Xie

. 2021 Jan 25;2020:629–637.

PMCID: PMC8075474 PMID: 33936437

Abstract

Deep learning models are increasingly studied in the field of critical care. However, due to the lack of external validation and interpretability, it is difficult to generalize deep learning models in critical care scenarios. Few works have validated the performance of deep learning models with external datasets. To address this, we propose a clinically practical and interpretable deep model for intensive care unit (ICU) mortality prediction with external validation. We use the newly published Philips eICU dataset to train a recurrent neural network model with a two-level attention mechanism, and use the MIMIC-III dataset as the external validation set to verify the model performance. This model achieves high accuracy (AUC = 0.855 on the external validation set) and has good interpretability. Based on this model, we develop a system to support clinical decision-making in ICUs.

Introduction

In intensive care, the early prediction of deterioration and mortality is of primary concern. An estimated 11% of in-hospital deaths follow a failure to promptly recognize and treat deteriorating patients¹. Accurate early prediction of deterioration could improve treatment outcomes significantly. For instance, an early alert can help doctors intervene in advance and make well-informed treatment plans based on sufficient contextual information.

A clinical prediction model can help physicians estimate the risk of a specific outcome in advance. In practice, simple risk scores are more commonly adopted by physicians because they are easy to apply and explain. For example, the Simplified Acute Physiology Score (SAPS)² and the Acute Physiologic Assessment and Chronic Health Evaluation (APACHE)³ are widely used because of their simplicity. However, this simplicity may lead to the neglect of some critical features, and a simple risk score cannot handle complex combinations of features. In contrast, machine learning and deep learning can achieve higher prediction accuracy because they can handle more features, such as sequential features, and model complex combinations of them. However, it is difficult to generalize deep models in critical care applications due to the lack of external validation and interpretability.

External validation is essential for the generalization of deep models in critical care practice. Many researchers have adopted deep learning models for the prediction of mortality in ICUs. However, most of these works either validate the deep models on private datasets that are not available for other critical care scenarios, or perform validation on a public dataset but depend on a private dataset to construct the model, which makes the work impossible for others to reproduce. To our knowledge, few works use the two open-source databases to construct and externally validate the deep models, respectively, and achieve good performance⁴⁻¹⁰. This is an obstacle to the generalization and replication of deep models in intensive care.

In this work, we propose a clinical prediction model constructed with data derived from the more recent Philips eICU Collaborative Research Dataset¹³, and externally validated on the publicly available Medical Information Mart for Intensive Care (MIMIC-III¹²). We design an RNN (LSTM)¹¹ model to achieve precise prediction, and improve interpretability with a two-level attention mechanism, which can detect influential timesteps and significant clinical variables after ICU admission. In the study conducted by Edward Choi¹⁴, attention is used to evaluate the effect of visits and features on heart failure. With reference to the model designed in that paper, we implement an LSTM model based on two levels of attention: one for timestep-level attention and the other for variable-level attention. This model not only achieves high accuracy but also has clinical interpretability. The end-to-end behavior of the model is interpreted below by randomly choosing a patient from the test set and calculating the contribution of the significant risk factors associated with mortality prediction. Based on this model, we develop a system to support clinical decision-making in ICUs.

Methods

Figure 1 shows the pipeline of the proposed approach, which consists of the following steps: cohort construction and feature extraction, data cleaning and transformation, construction of statistical features and time series features, and missing value imputation. The training set is used to build the proposed model with cross-validation. The model performance is then evaluated on the testing data and the external validation data.

Study Dataset Description

The clinical data used in this study come from the MIMIC-III database and the more recent Philips eICU dataset. MIMIC-III is a large, freely available database comprising deidentified electronic medical data for over 40,000 patients and over 58,000 admissions to the critical care units of the Beth Israel Deaconess Medical Center from 2001 to 2012. The eICU database is a multi-center ICU dataset with high-granularity data for over 200,000 admissions to ICUs monitored by eICU programs across the United States. It comprises 200,859 patient unit encounters for 139,367 unique patients admitted between 2014 and 2015 to hospitals located throughout the US¹³. Both MIMIC-III and eICU are open-source and representative EHR datasets for intensive care. The eICU dataset, which includes 198,167 patients, is used as training data, and the MIMIC-III dataset, which includes 21,139 patients, is used as testing data.

In the current study, we investigate the incidence of in-hospital mortality and the risk factors associated with its development in an ICU population. A curated set of clinically relevant features was chosen according to the existing literature on in-hospital mortality prediction. The risk factors include information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, medications, caregiver notes, prescriptions, fluid balance, procedure codes, diagnostic codes, and so on. For this project, we also include two important additional calculated variables: the Sequential Organ Failure Assessment (SOFA) score and the estimated glomerular filtration rate (eGFR). We extracted the following 67 variables from both the MIMIC-III and eICU datasets:

1) Demographic information (static variables): Age, Gender, Ethnicity, Height, Weight
2) Vital signs (time-series variables): Mean arterial blood pressure, Heart rate, Respiration rate, Temperature, Diastolic blood pressure, Oxygen saturation, Systolic blood pressure
3) Lab measurements (time-series variables): Amylase, Albumin, Alkaline phosphatase, Aspartate aminotransferase, Bicarbonate, Direct bilirubin, Total bilirubin, Blood urea nitrogen (BUN), Calcium, Creatinine, Fraction of inspired oxygen, Glucose, Hematocrit, Hemoglobin, Partial pressure of oxygen (PaO2), Partial pressure of carbon dioxide, Partial thromboplastin time, Platelets, Potassium, Prothrombin time (PT), Sodium, White blood cell count (WBC), pH, Lipase, Lymphocytes, Erythrocyte sedimentation rate (ESR)
4) Medications (static variables): Amiodarone, Ativan, Calcium Gluconate, Cisatracurium, Diltiazem, Dopamine, Epinephrine, Fentanyl, Heparin, Insulin, Midazolam, Milrinone, Nicardipine, Norepinephrine, Precedex, Propofol, Total Parenteral Nutrition (TPN), Vasopressin, Hydromorphone or Dilaudid, Morphine
5) Fluids (time-series variables): Urine Output and GU Irrigant Volume In
6) Glasgow coma scale (static variables): Glasgow coma scale eye opening, Glasgow coma scale motor response, Glasgow coma scale verbal response
7) Interventions (static variables): Usage of Mechanical Ventilation, Dialysis
8) Calculated variables (time-series variables): SOFA, eGFR

Pre-processing

The cohort is restricted to patients aged 18 years or older. We only take the first ICU record of each admission to avoid data redundancy. In this research, we use the data stream during the first 48 hours after admission to the ICU to predict the in-hospital mortality risk. Therefore, patients who died in the first 48 hours are excluded. Finally, a total of 21,139 patients in the MIMIC-III dataset and 198,167 patients in the eICU dataset are used as the study cohort. The incidence of mortality in MIMIC-III and eICU is 13.2% and 8.8%, respectively.

The basic features from the MIMIC-III and eICU datasets include demographics, vital sign measurements, laboratory tests, medications, admission information, fluids, interventions, co-morbidities and other variables that are directly associated with an increased risk of mortality. The first step is to perform feature extraction and data cleaning for numerical variables, in particular the vital signs. Some values are clear outliers beyond the normal threshold. Therefore, after capping the extreme values at the 1st and 99th percentiles, all numerical variables were normalized to the [0,1] range.
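The capping and scaling step can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names `percentile` and `cap_and_scale`, the linear-interpolation percentile rule, and the toy heart-rate values are all assumptions made for the example.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("empty input")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def cap_and_scale(values, low_q=1.0, high_q=99.0):
    """Cap values at the given percentiles, then min-max scale to [0, 1]."""
    ordered = sorted(values)
    low, high = percentile(ordered, low_q), percentile(ordered, high_q)
    capped = [min(max(v, low), high) for v in values]
    span = high - low
    if span == 0:
        # A constant variable carries no information; map it to 0.
        return [0.0 for _ in capped]
    return [(v - low) / span for v in capped]


# Hypothetical heart-rate readings; 300 stands in for a recording error.
heart_rate = [60, 72, 75, 80, 88, 90, 95, 110, 300]
scaled = cap_and_scale(heart_rate)
```

With only nine readings the 99th percentile sits close to the maximum, so the cap is mild here; on a cohort-sized sample the same rule trims the extreme tail much more visibly.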

Then statistical hypothesis tests (e.g., t-test and chi-square test) are performed to select significant variables with p-values smaller than 0.05. Finally, 67 features are kept for modeling in this work.
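As an illustration of the screening step for a binary variable, the sketch below runs a Pearson chi-square test on a 2x2 table of variable level against outcome. It assumes no continuity correction and non-zero row and column totals; the function name and the counts are hypothetical and are not taken from the study data.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    table[i][j] = number of patients with variable level i and outcome j.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row[i] * col[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of chi-square with 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: rows = ventilated / not ventilated, columns = died / survived
stat, p_value = chi_square_2x2([[30, 10], [10, 30]])
keep_variable = p_value < 0.05
```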

Many lab features exhibit a skewed distribution. The Box-Cox transformation is used to make the transformed variables approximately normally distributed. The normality of each variable is checked with visual methods: quantile-quantile plots and frequency histograms. After the Box-Cox transformation, the residuals better satisfy assumptions such as normality and independence, which reduces the probability of spurious regression.
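A minimal sketch of the Box-Cox step is given below. The paper does not say how the transformation parameter was chosen, so the coarse grid search over the profile log-likelihood is an assumption, as are the function names and the synthetic lab values.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def log_likelihood(values, lam):
    """Profile log-likelihood of lambda under a normal model (up to a constant)."""
    y = box_cox(values, lam)
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)


def best_lambda(values, grid=None):
    """Pick the lambda on a coarse grid that maximises the log-likelihood."""
    grid = grid if grid is not None else [k / 10.0 for k in range(-20, 21)]
    return max(grid, key=lambda lam: log_likelihood(values, lam))


# Synthetic right-skewed values whose logarithms are evenly spaced from -2 to 2
lab = [math.exp(z / 2.0) for z in range(-4, 5)]
lam = best_lambda(lab)
transformed = box_cox(lab, lam)
```

Because the logarithms of these synthetic values are symmetric around zero, the grid search selects lambda = 0, which is the plain log transform.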

Patient measurements are made irregularly. Apart from vital signs, which are measured about once an hour, other variables such as laboratory values are measured irregularly and infrequently. Taking creatinine as an example, there are on average four creatinine measurements during the 48-hour observation window. Therefore, we resample the time series into regularly spaced intervals, with each sample spanning a 1-hour window. We impute missing values using the previous value if it exists and a pre-specified normal value otherwise. Categorical variables such as gender and ethnicity are encoded using a one-hot vector at each timestep. After data cleaning and, where needed, the Box-Cox transformation, the inputs are normalized by subtracting the mean and dividing by the standard deviation. This approach makes in-hospital mortality prediction possible, and allows our models to use patterns of missingness in the data during the training process.
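The resampling and imputation rule can be sketched as below. The paper does not state which value is kept when several measurements fall in the same hour, so keeping the last one is an assumption; the function name, the observed-bin mask, and the example creatinine values are likewise hypothetical.

```python
def resample_hourly(measurements, hours=48, normal_value=None):
    """Resample irregular (time_in_hours, value) pairs into hourly bins.

    The last measurement inside each 1-hour bin is kept. Empty bins are
    filled with the previous bin's value, or with `normal_value` when no
    earlier measurement exists. A parallel mask marks observed bins.
    """
    bins = [None] * hours
    for t, value in sorted(measurements):
        idx = int(t)
        if 0 <= idx < hours:
            bins[idx] = value  # later measurements in the same bin overwrite

    series, observed = [], []
    last = normal_value
    for value in bins:
        observed.append(value is not None)
        if value is not None:
            last = value
        series.append(last)
    return series, observed


# Four hypothetical creatinine measurements (hours since admission, mg/dL)
creatinine = [(3.5, 1.4), (10.2, 1.6), (30.0, 1.5), (47.9, 1.3)]
series, observed = resample_hourly(creatinine, hours=48, normal_value=1.0)
```

The `observed` mask is one simple way to keep the missingness pattern available to a model after imputation.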

Model Framework

First, we consider the standard attention mechanism, which is similar to the attention mechanism for language translation¹⁵ (Figure 2(a)). This simple attention mechanism is implemented with Keras¹⁶ and applied on the LSTM’s output layer. The LSTM parameter ‘return_sequences’¹⁷ is set to True, so the LSTM output keeps the same timesteps as the input. We apply a ‘Dense-SoftMax’ layer with the same number of output units as the feature dimension of the ‘Input’ layer. The time series is 67-dimensional over 48 timesteps. Dimension reduction is realized by taking the mean over the feature dimension, so the attention vector has shape ‘(48,)’ and is shared across the input dimensions. Finally, we merge the ‘Inputs’ layer with the attention layer by element-wise multiplication. The attention vector can also be extracted and plotted.

This standard attention mechanism can only identify the timesteps of high importance to the target outcome; it cannot show the importance of the features within each timestep. With reference to the model design in the study conducted by Edward Choi¹⁴, we implement the LSTM model based on two levels of attention: one for timestep-level attention and the other for variable-level attention. The scalars α_ts are the timestep-level attention weights that govern the influence of each time block. The vectors β_st are the variable-level attention weights that focus on each coordinate of the LSTM’s output unit. We use two RNNs to separately generate α_ts and β_st as follows.

\begin{array}{l}
\text{Step 1: } α_{ts} = \frac{\exp(\mathrm{score}(v_{t}, h_{s}))}{\sum_{s=1}^{N} \exp(\mathrm{score}(v_{t}, h_{s}))} \\
\text{Step 2: } \mathrm{score}(v_{t}, h_{s}) = v_{t}^{T} W_{a} h_{s} \\
\text{Step 3: } β_{st} = \tanh(\mathrm{score}(\bar{h_{s}}, v_{t})) \\
\text{Step 4: } \mathrm{score}(\bar{h_{s}}, v_{t}) = v_{t}^{T} W_{β} \bar{h_{s}} \\
\text{Step 5: } c_{t} = α_{ts} β_{st} ⊙ v_{t} \\
\text{Step 6: } \hat{y_{t}} = \mathrm{sigmoid}(W c_{t})
\end{array}

The two-level attention architecture is shown in Figure 2(b). Given an input sequence, we predict the label $y_{t}$ as follows. The attention is applied on the LSTM's output layer: the LSTM’s output is taken as the input of the attention module and denoted $v_{t}$. The hidden state $h_{s}$ is obtained by applying a ‘Dense-SoftMax’ layer with the same number of output units as the ‘Input’ layer. Step 1 generates the $α_{ts}$ values. In Steps 2 and 4, $\mathrm{score}(v_{t}, h_{s})$ and $\mathrm{score}(\bar{h_{s}}, v_{t})$ are fully connected layers. Step 3 generates the $β_{st}$ values. Step 5 generates the context vector $c_{t}$ from the attention weights and the representation vector, where ⊙ denotes element-wise multiplication. Step 6 makes the prediction $\hat{y_{t}}$ of the true label $y_{t}$ from the context vector; since this is a binary prediction task, we use the sigmoid function.
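The attention weights in Steps 1 to 4 can be illustrated with plain Python lists. This is only a sketch of the formulas under simplifying assumptions: the weight matrices are fixed identity matrices rather than learned parameters, the vectors are 2-dimensional toy values, and the function names are hypothetical.

```python
import math


def bilinear_score(v_t, W, h_s):
    """score(v_t, h_s) = v_t^T W h_s for plain Python lists."""
    Wh = [sum(w_ij * h_j for w_ij, h_j in zip(row, h_s)) for row in W]
    return sum(v_i * wh_i for v_i, wh_i in zip(v_t, Wh))


def timestep_attention(v_t, W_a, hidden_states):
    """Softmax over s of score(v_t, h_s): the alpha_ts weights (Steps 1-2)."""
    scores = [bilinear_score(v_t, W_a, h_s) for h_s in hidden_states]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def variable_attention(v_t, W_b, h_bar_s):
    """beta_st = tanh(v_t^T W_b h_bar_s) (Steps 3-4), between -1 and 1."""
    return math.tanh(bilinear_score(v_t, W_b, h_bar_s))


identity = [[1.0, 0.0], [0.0, 1.0]]           # stand-in for a learned matrix
v_t = [1.0, 0.0]                              # toy LSTM output at timestep t
hidden = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]] # toy hidden states h_1..h_3

alpha = timestep_attention(v_t, identity, hidden)
beta = variable_attention(v_t, identity, [-1.0, 0.0])
```

The softmax makes the timestep weights non-negative and sum to 1, while tanh lets the variable-level weight take either sign, which is what allows a variable to be read as a risk factor or a protective factor.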

Overall, our attention mechanism can be viewed as a variant of the standard attention mechanism for language translation, in which the words are encoded by an RNN and the attention weights are generated by an MLP. In contrast, our method uses RNNs to generate two sets of attention weights, recovering the sequential information as well as mimicking the behavior of physicians.

Experiments

In this study, we use the Philips eICU dataset to train the model, and use the MIMIC-III dataset as the external test set to verify the performance of the model. Model accuracy is measured by the AUC, which compares the predicted $\hat{y_{t}}$ with the true label $y_{t}$. AUC is robust to imbalanced labels, making it appropriate for evaluating classification accuracy in the in-hospital mortality prediction task¹⁴. For comparison, we implement the following models.
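For reference, AUC can be computed directly from its pairwise definition. The sketch below is a minimal quadratic-time version for small examples and is not the evaluation code used in the study; the function name and toy scores are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# A model that gives every patient the same score has AUC 0.5,
# even though "always predict survival" is highly accurate on a rare outcome.
constant_auc = roc_auc([0] * 9 + [1], [0.2] * 10)
```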

Logistic regression (LR): Stationary models often aggregate past information and remove the temporality from the input data. For our linear baseline, we therefore use hand-engineered features described in the literature. Well-conceived functions of features can account for the inherent properties and distributions of the different features, and capture important information in the data better than the raw values of the variables would.

For each variable, the observation window includes only the first 48 hours after ICU admission. Three sample statistics, namely the minimum, maximum and mean, are calculated on each variable's time series and used as features. Missing feature values are filled with the mean of the recorded values of that feature. All vectors are normalized to zero mean and unit variance. We use the resulting vector to train the LR model.
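The feature construction for the LR baseline can be sketched as follows. The data layout (a dictionary of hourly series with `None` for unrecorded hours) and the helper names are assumptions made for illustration.

```python
def summary_features(series_by_variable):
    """Build min, max and mean per variable from 48-hour series.

    `series_by_variable` maps a variable name to a list of hourly values,
    with None marking hours that have no recorded measurement.
    Variables with no recorded value at all yield None placeholders, to be
    filled afterwards with the mean of that feature over the cohort.
    """
    features = {}
    for name, series in series_by_variable.items():
        recorded = [v for v in series if v is not None]
        if recorded:
            features[name + "_min"] = min(recorded)
            features[name + "_max"] = max(recorded)
            features[name + "_mean"] = sum(recorded) / len(recorded)
        else:
            for stat in ("min", "max", "mean"):
                features[name + "_" + stat] = None
    return features


def standardize(column):
    """Scale a feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std if std > 0 else 0.0 for v in column]


patient = {"heart_rate": [80, None, 100, 90], "lipase": [None, None, None, None]}
features = summary_features(patient)
z = standardize([1.0, 2.0, 3.0])
```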

RNN (LSTM) (67 features): The proposed RNN system is a recurrent neural network that operates sequentially over individual electronic health records, processing the data one step at a time. The model is implemented as a three-layer network. The first hidden layer is an LSTM-based RNN layer with 48 timesteps. The second layer is a fully connected layer, preceded by a ‘Batch Normalization’¹⁸ layer. The output layer uses a sigmoid activation function because this is a binary classification task. To reduce overfitting, both the dropout¹⁹ and ‘recurrent dropout’ parameters are used. The optimizer is Adam, with a learning rate of 0.001. Because the labels are imbalanced, the parameter class weight = 'balanced' is set, and a callback function is used to save the training results.
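The effect of balanced class weighting can be illustrated with a small calculation. This sketch assumes that 'balanced' refers to the inverse-frequency heuristic n_samples / (n_classes * class_count), as used by scikit-learn; the function name and the toy cohort are hypothetical.

```python
def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n_samples, n_classes = len(labels), len(counts)
    return {y: n_samples / (n_classes * c) for y, c in counts.items()}


# Toy cohort with 8.8% positives, mirroring the eICU mortality rate
labels = [1] * 88 + [0] * 912
weights = balanced_class_weights(labels)
```

Under this rule each class contributes the same total weight to the loss, so the rare mortality class is not swamped by the survivors.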

RNN (65 features): The only difference between this model and the RNN (LSTM) (67 features) model above is that it uses only 65 features: the calculated features SOFA and eGFR are excluded. SOFA²⁰ is the Sepsis-related Organ Failure Assessment score (also referred to as the Sequential Organ Failure Assessment score), which is used to describe organ dysfunction or failure of a patient in the ICU. eGFR is an important index for evaluating renal function.

RNN+ $α_{t s}$ : A one-layer unidirectional RNN along time encodes the input sequence, and an MLP with a single hidden layer generates the timestep-level attention weights $α_{t s}$ . The output of the LSTM layer, with ‘return_sequences’ set to True, is used as the input to the attention mechanism.

Implementation details: We implement the model with the neural network library Keras. For training the model, we use Adam²¹ to optimize the cross-entropy²² loss function with a mini-batch of 32 patients. The training was done on a Linux server equipped with two Nvidia Tesla P100 GPUs and CUDA 10.0.

In-hospital Mortality Prediction

Objective & Cohort Construction: Given a visit sequence $x_{1, \ldots, T}$ , we predict the patient’s in-hospital mortality. This is a many-to-one RNN model, which receives a sequence of data and produces the outcome at the end of the sequence. From the training dataset, about 17,483 cases are selected, with approximately 10 controls for each case. From the external dataset, 2,797 cases are selected, with approximately 7 controls for each case.

Training details: The Philips eICU dataset is used as the training dataset, and the MIMIC-III dataset is used as the external test dataset. The eICU patient cohort is divided into training, validation and test sets in a 0.64:0.16:0.2 ratio. The training data are used to train the proposed models. The validation set is used to iteratively improve the models by selecting the best model architectures and hyperparameters. The external test dataset is used to further verify the stability and generalization of model performance.
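A 0.64:0.16:0.2 split is what results from holding out 20% of patients for testing and then 20% of the remainder for validation (0.8 × 0.8 = 0.64 and 0.8 × 0.2 = 0.16). The sketch below shows one way to produce such a split at the patient level; the function name, the fixed seed, and the direct three-way cut are assumptions, not a description of the authors' procedure.

```python
import random


def split_indices(n, ratios=(0.64, 0.16, 0.20), seed=0):
    """Shuffle patient indices and split them into train/validation/test."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(1000)
```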

Results

The in-hospital mortality prediction performance of the two-level attention-based model is shown in Table 1. First, LR is taken as a baseline model. Feature engineering helps enhance model performance; however, the LR model generalizes poorly to the external dataset, which suggests over-fitting (internal test AUC = 0.865, external test AUC = 0.741), and it underperforms the temporal learning algorithms.

Table 1.

In-hospital Mortality prediction performance of Two-level attention-based LSTM model and baselines

Model	LR	RNN (65 features)	RNN	RNN+ $α_{t s}$	RNN+ $α_{t s}$ + $β_{s t}$
Training Set AUC	0.872	0.883	0.895	0.911	0.912
Test Set AUC	0.865	0.877	0.889	0.897	0.899
External Test Set AUC	0.741	0.791	0.839	0.852	0.855


The comparative analysis shows that the LSTM models outperform the traditional LR model. The performance in external validation is lower than the performance measured on the internal test set. The only difference between RNN (65 features) and RNN is that the former does not include the significant variables SOFA and eGFR, which we derived through feature engineering. As a result, the external test AUC of the RNN (65 features) model is lower by about 0.05 (0.791 versus 0.839).
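The size of these gaps can be read directly from Table 1. The short script below only restates the table values and subtracts them; the dictionary keys are shorthand labels chosen for this example.

```python
# AUC values copied from Table 1: (training, internal test, external test)
table1 = {
    "LR": (0.872, 0.865, 0.741),
    "RNN (65 features)": (0.883, 0.877, 0.791),
    "RNN": (0.895, 0.889, 0.839),
    "RNN + alpha": (0.911, 0.897, 0.852),
    "RNN + alpha + beta": (0.912, 0.899, 0.855),
}

# Drop from internal test AUC to external test AUC, per model
external_gap = {name: round(test - ext, 3) for name, (_, test, ext) in table1.items()}

# Effect of adding the SOFA and eGFR features on the external test AUC
derived_feature_gain = round(table1["RNN"][2] - table1["RNN (65 features)"][2], 3)
```

The drop from internal to external test AUC shrinks from 0.124 for LR to 0.044 for the two-level attention model, and adding SOFA and eGFR raises the external test AUC of the plain RNN by 0.048.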

Note that the RNN+ $α_{t s}$ model is a standard attention model with timestep-level attention only, and it remains competitive, as shown in Table 1. This confirms the effectiveness of generating attention weights using the RNN. However, the RNN+ $α_{t s}$ model provides only scalar timestep-level attention, which is not sufficient for healthcare applications. There are often several measurement values within a one-hour interval, and it is important to distinguish their relative importance to the prediction target.

The two-level neural attention model is comparable to the other RNN variants in terms of prediction performance. The proposed model not only achieves similar accuracy, but also has high clinical interpretability. In particular, this model is based on two-level neural attention, which can detect influential timesteps and significant clinical variables after ICU admission.

Meanwhile, we compared our model to the existing benchmarking result. That study uses multitask learning with clinical time series data; we focus only on the in-hospital mortality prediction task. For the benchmarking result using the MIMIC-III dataset⁸, we replicate the LSTM model with a binary mask input for each variable indicating the timesteps that contain a true (vs. imputed) measurement, and we validate the model performance on the eICU dataset. The AUC in Table 2 shows that this model generalizes poorly to the external dataset (training AUC = 0.866, external test AUC = 0.693). The MIMIC-III benchmark (67 features) uses the same model, except that it takes the 67 features described in the feature extraction above instead of the original 17 features. However, the external validation result does not improve significantly (external test AUC = 0.716). As shown in Table 2, external performance is poor when the MIMIC-III dataset is used to train the LSTM model. This result is consistent with the database descriptions: the eICU cohort is approximately 10 times the size of the MIMIC-III cohort and, most importantly, eICU is a multi-center database with high-granularity data, whereas MIMIC-III contains information from only one center.

Table 2.

Comparison of in-hospital mortality prediction performance with the benchmarking result¹²

Model	MIMIC III benchmark¹² (17 features)	MIMIC III benchmark (67 features)
Training Set AUC	0.866	0.887
Test Set AUC	0.857	0.868
External Test Set AUC (eICU dataset)	0.693	0.716


We evaluated the interpretability of the two-level attention RNN in the in-hospital mortality prediction task by choosing a case from the test set and calculating the contribution of the variables to the mortality prediction. The prediction result for a randomly picked patient, out of the 2798 cases, is used to demonstrate the influence of two-level attention. The two-level attention weights are plotted in Figure 3(a) and (b). In Figure 3(a), the x-axis shows the 48 timesteps after ICU admission and the y-axis shows the weight of each timestep; the weights over the 48 timesteps sum to 1. The figure shows that the last few timesteps receive the most attention; in particular, the last 2 timesteps account for 50% of the weight. Figure 3(b) shows the two-level attention: timestep-level attention combined with variable-level attention. Four important features strongly associated with mortality are selected to illustrate their effect on the outcome. The weight of each variable changes over the timesteps and, as expected, the weights from earlier timesteps contribute less to the result. The absolute value of a weight represents the magnitude of the variable's effect on the outcome, and its sign represents the direction. Oxygen saturation and bicarbonate have a negative effect on the predicted result and can be seen as protective factors, whereas direct bilirubin and white blood cell count have a positive effect on the predicted result. This is consistent with the results of the hypothesis tests and with clinical practice.

Figure 3: (a) The RNN+ $α_{t s}$ model, a standard attention model with timestep-level attention. (b) The RNN with two-level attention: timestep-level attention and variable-level attention.

Discussion

Our approach provides interpretability through an attention mechanism. A physician can interpret the results as follows: oxygen saturation and bicarbonate are negatively correlated with in-hospital mortality and act as protective factors, while direct bilirubin and white blood cell count are positively associated with in-hospital mortality, which is consistent with physiological mechanisms. At a late stage of illness, hypoxemia and decreased oxygen saturation can cause organ deterioration, accelerating the process of death²³. During the end stage, circulatory function declines, leading to a decrease in bicarbonate level, which is a sign of poor prognosis²⁴. Bilirubin is a key indicator of liver function deterioration and a biomarker of malignancy; it increases at the end stage of disease, when the liver is hypoxic²⁵. A higher white blood cell count indicates a more serious infection, which is one of the critical causes of clinical death.

In-hospital mortality prediction is crucial for assessing severity of illness and providing real-time information to support clinical decision-making. Our study focuses on the prediction of mortality risk, one of the most common clinical tasks. We have already developed a disease prediction system for physicians. Figure 4 shows the main interface of the system.

Our model has been compared with baseline models such as LR and other RNN variants, with a focus on practicality. Unfortunately, we have not compared it with more novel deep learning frameworks, since most of that work is hard to replicate without shared data. Another limitation is that we did not propose a new deep learning framework. In the future, we will consider changing the original RNN structure and integrating the importance evaluation of variables into it, in order to achieve better performance.

Conclusion

In this paper, we present a practical clinical model for in-hospital mortality prediction in the ICU, which is designed with a two-level neural attention mechanism to improve the RNN’s predictive ability while allowing a higher degree of interpretability. We also verified the model on a completely different external public dataset. Moreover, this model has been developed into a practical assessment system that can support clinical decision-making.

Acknowledgments

Funding: This work was funded by the China National Key R&D Program (Grant No. 2018YFC0910700). The authors declare no conflict of interest. We thank Limiao Sheng from Department of Intelligent Public Health of Ping An Health Technology for providing the development and deployment of the disease prediction system. We thank Dr. Xiao Xu and Xian Xu for their advice in the paper writing and model verification.

Figures & Table

Figure 4: The main interface of the ICU decision support system. The red matrix and row on the left highlight the in-hospital mortality prediction task, and the two pictures on the right show the risk level and the risk factors, respectively. Below is the electronic medical record. The most important Chinese information has been translated, and personal sensitive information has been masked.

References

1. Zimmerman JE, Kramer AA, Knaus WA. Changes in hospital mortality for United States intensive care unit admissions from 1988 to 2012. Critical Care. 2013;17:R81. doi: 10.1186/cc12695.
2. Le Gall JR, Lemeshow S, Saulnier F. A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study. JAMA. 1993;270:2957–2963. doi: 10.1001/jama.270.24.2957.
3. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med. 1985;13:818–829.
4. Awad A., Bader-El-Den M., McNicholas J., Briggs J. Early hospital mortality prediction of intensive care unit patients using an ensemble learning approach. International Journal of Medical Informatics; 2017.
5. Wojtusiak J., Elashkar E., Nia R. M. C-Lace: Computational Model to Predict 30-Day Post-Hospitalization Mortality. In: HEALTHINF. 2017. pp. 169–177.
6. Beaulieu-Jones B. K., Orzechowski P., Moore J. H. Mapping patient trajectories using longitudinal extraction and deep learning in the MIMIC-III critical care database. bioRxiv. 2017. p. 177428.
7. Purushotham S., Meng C., Che Z., Liu Y. Benchmarking deep learning models on large healthcare datasets. Journal of Biomedical Informatics. 2018;83(April):112–134. doi: 10.1016/j.jbi.2018.04.007.
8. Harutyunyan H., Khachatrian H., Kale D. C., Ver Steeg G., Galstyan A. Multitask learning and benchmarking with clinical time series data. Scientific data. 2019;6(1):96. doi: 10.1038/s41597-019-0103-9.
9. Seyedmostafa Sheikhalishahi, Vevake Balaraman. Benchmarking machine learning models on eICU critical care dataset. 2019;arXiv:1910.00964v1.
10. Parikh R B, Manz C, Chivers C, et al. Machine Learning Approaches to Predict 6-Month Mortality Among Patients With Cancer. JAMA network open. 2019;2(10):e1915997–e1915997. doi: 10.1001/jamanetworkopen.2019.15997.
11. Hochreiter S., Schmidhuber J. Long short-term memory. Neural computation. 1997;9(8):1735–1780. doi: 10.1162/neco.1997.9.8.1735.
12. Johnson A. E., Pollard T. J., Shen L., Li-Wei H. L., Feng M., Ghassemi M., Moody B., Szolovits P., Celi L. A., Mark R. G. MIMIC-III, a freely accessible critical care database. Scientific data. 2016;3:160035. doi: 10.1038/sdata.2016.35.
13. Pollard T. J., Johnson A. E., Raffa J. D., Celi L. A., Mark R. G., Badawi O. The eICU collaborative research database, a freely available multi-center database for critical care research. Scientific data. 2018:5. doi: 10.1038/sdata.2018.178.
14. Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, et al. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In NIPS; 2016. pp. 3504–3512.
15. Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate. In ICLR; 2015.
16. François Chollet et al. Keras. 2015. https://github.com/fchollet/keras
17. Cho Kyunghyun, Merriënboer Bart Van, Gulcehre Caglar, Bahdanau Dzmitry, Bougares Fethi, Schwenk Holger, Bengio Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014. arXiv preprint arXiv:1406.1078.
18. Sergey Ioffe, Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015. arXiv:1502.03167.
19. Baldi Pierre, Sadowski Peter J. Understanding dropout. In Advances in Neural Information Processing Systems. 2013. pp. 2814–2822.
20. Vincent J.-L., Moreno R., Takala J., Willatts S., Mendonça A. De, Bruining H., Reinhart C., Suter P., Thijs L. The SOFA (sepsis-related organ failure assessment) score to describe organ dysfunction/failure. Intensive care medicine. 1996;22:707–710. doi: 10.1007/BF01709751.
21. Kingma D., Ba J. Adam: A method for stochastic optimization. 2014. p. 6980. Preprint at, https://arxiv.org/abs/1412 .
22. De Boer P T, Kroese D P, Mannor S, et al. A tutorial on the cross-entropy method. Annals of operations research. 2005;134(1):19–67.
23. Lamba TS, Sharara RS, Singh AC, Balaan M. Pathophysiology and Classification of Respiratory Failure. Crit Care Nurs Q. 2016;39(2):85–93. doi: 10.1097/CNQ.0000000000000102.
24. Metabolic acidosis and the role of unmeasured anions in critical illness and injury. J Surg Res. 2018;224:5–17. doi: 10.1016/j.jss.2017.11.013.
25. Diagnosis and evaluation of hyperbilirubinemia. Curr Opin Gastroenterol. 2017;33(3):164–170. doi: 10.1097/MOG.0000000000000354.

