Journal of the American Medical Informatics Association: JAMIA. 2019 Dec 19;27(3):355–365. doi: 10.1093/jamia/ocz205

Temporal convolutional networks allow early prediction of events in critical care

Finneas J R Catling, Anthony H Wolff
PMCID: PMC7647248; PMID: 31858114

Abstract

Objective

Clinical interventions and death in the intensive care unit (ICU) depend on complex patterns in patients’ longitudinal data. We aim to anticipate these events earlier and more consistently so that staff can consider preemptive action.

Materials and Methods

We use a temporal convolutional network to encode longitudinal data and a feedforward neural network to encode demographic data from 4713 ICU admissions in 2014–2018. For each hour of each admission, we predict events in the subsequent 1–6 hours. We compare performance with other models, including a recurrent neural network.

Results

Our model performed similarly to the recurrent neural network for some events and outperformed it for others. This performance increase was more evident in a sensitivity analysis in which the prediction timeframe was varied. Average positive predictive value (95% CI) was 0.738 (0.732–0.743) and 0.786 (0.781–0.790) for up- and down-titrating FiO2, 0.574 (0.519–0.625) for extubation, 0.139 (0.117–0.162) for intubation, 0.533 (0.492–0.572) for starting noradrenaline, 0.441 (0.433–0.448) for fluid challenge, and 0.315 (0.282–0.352) for death.

Discussion

Events were better predicted where their important determinants were captured in structured electronic health data, and where they occurred in homogeneous circumstances. We produce partial dependence plots that show our model learns clinically plausible associations between its inputs and predictions.

Conclusion

Temporal convolutional networks improve prediction of clinical events when used to represent longitudinal ICU data.

Keywords: critical care, machine learning, artificial intelligence, oxygen inhalation therapy, airway extubation

OBJECTIVE

This study presents a model for predicting clinical interventions and death in the intensive care unit (ICU). We aim to anticipate these events earlier and more consistently, so that staff can consider intervention in a more timely manner. For example, anticipating tracheal intubation in a deteriorating patient with a difficult airway facilitates an early, expert intubation in accordance with recent guidelines.1 Timely tracheal extubation minimizes the complications of prolonged intubation.2 Anticipating the need to titrate FiO2 and administer intravenous fluids minimizes the effects of hypoxia, hyperoxia, and hypovolaemia.3

INTRODUCTION

To determine the probability of a clinical intervention in a given patient at a given time, prediction models require a mathematical representation of that patient’s current state. We hypothesize that for this representation to be rich enough to facilitate good predictions, it must reflect the patient’s preadmission health status, short-term trends in their data over a few hours, and longer-term trends over several days. It must also reflect that the likelihood of clinical intervention varies nonlinearly with some patient parameters and depends on interactions between several of those parameters.

Most existing models ignore temporal trends and predict events using only the current values of patients’ longitudinal variables.4,5 Other models only predict events within prespecified timeframes,6,7 and thus do not derive a general representation of patients’ clinical status that extends to event prediction within arbitrary timeframes. Some studies derive separate variables for values of the same parameter measured at different times,8 but the additional variables this necessitates limit the size of the temporal context window and prevent representation of longer-term trends. Other studies represent longitudinal patient ICU data using summary statistics for parameters computed over different time windows,9–11 but this approach ignores interactions and loses much trend information.

Neural networks can learn arbitrary functions of their input variables.12 Hence, they have been increasingly used to learn representations of challenging source data such as images and free text,13,14 producing excellent performance in downstream tasks. These successes have motivated healthcare researchers to use recurrent neural networks (RNNs) to represent longitudinal patient data.15–22

Long short-term memory (LSTM) is an RNN designed to capture long-term dependencies in sequences.23 LSTMs perform competitively in comparison to other RNN variants.24,25 Despite RNNs’ popularity, a recent study found that they were outperformed across several nonmedical tasks by temporal convolutional networks (TCNs). In particular, TCNs were able to capture longer-range dependencies in sequences than LSTMs of equivalent size.26

Longitudinal variables are measured at different frequencies in the ICU. Vital signs are monitored continuously, whereas other investigations (blood gases, etc.) are intermittent. High-temporal-resolution ICU data thus contain many missing values, and the recorded values are subject to human and equipment error. Patients’ disease processes and responses to treatment are heterogeneous, and both are incompletely observed by the tests available to clinicians.

TCNs process longitudinal data by applying filters along the temporal dimension. The initial filters only cover a small number of adjacent time steps at once, and thus recognize short-term patterns. Long-term patterns are represented by using the output of earlier filters as input to later, more “dilated” filters.27 It is unclear from previous studies whether the initial filters will be able to recognize useful patterns in ICU data featuring noise, incompleteness, heterogeneity, and missingness.
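
As a concrete illustration of how dilation grows the temporal context, the following minimal sketch (in Python, assuming a filter size of 2 and doubling dilation rates, as in the WaveNet-style architectures cited above) computes the receptive field of stacked dilated causal convolutions; it illustrates the arithmetic rather than our implementation:

```python
# Sketch: receptive field of stacked dilated causal convolutions.
# Assumes kernel size 2 and dilation rates 1, 2, 4, ... (doubling
# per layer), as in WaveNet-style TCNs; illustrative values only.

def receptive_field(n_layers: int, kernel_size: int = 2) -> int:
    """Hours of past context visible to the final output time step."""
    field = 1
    for layer in range(n_layers):
        field += (kernel_size - 1) * 2 ** layer  # each layer widens the view
    return field

for n in range(1, 7):
    print(f"{n} layers -> {receptive_field(n)} hours of context")
# With kernel_size=2: 1 layer sees 2 hours; 6 layers already see 64 hours.
```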

The flexibility of neural networks renders them prone to overfitting (ie, to learning noise patterns in data). We hypothesize that representing clinically relevant patterns in patient data (eg, a patient’s current respiratory status) rather than noise patterns (eg, an artifact associated with a particular event) is more likely to improve net predictive ability across multiple interventions. We enforce this inductive bias by learning a single representation of each patient-hour which is used to predict all the different clinical interventions. This multitask learning approach has previously been demonstrated to reduce overfitting and improve performance on difficult prediction tasks.28
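
A minimal numeric sketch of this multitask objective follows (Python/NumPy; the shapes, the 7 events, and the toy data are illustrative, not our model's): a single shared representation of each patient-hour feeds one logistic regression per event, and training minimizes the sum of the per-event binary cross-entropies.

```python
# Sketch of the multitask objective over a shared representation.
# Shapes and data are toy values, not the study's.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 64))          # shared patient-hour codes
y = rng.integers(0, 2, size=(256, 7))   # 1 binary label per event
W = 0.01 * rng.normal(size=(64, 7))     # per-event logistic weights
b = np.zeros(7)

p = 1 / (1 + np.exp(-(z @ W + b)))      # per-event predicted risks
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
loss = bce.sum(axis=1).mean()           # summed across events: gradients
print(loss)                             # push z toward features that
                                        # help predict every event
```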

MATERIALS AND METHODS

We developed risk-prediction models using 4748 patient admissions to a UK-based ICU from 2014 to 2018. Admissions ≤ 3 hours in length and patients < 18 years old were excluded, leaving 4713 admissions. Our data comprised patient demographic information and 37 115 days of longitudinal measurements including vital signs, blood gases, laboratory results, and nursing observations. All data were collected in the course of routine clinical care, and the project was approved by the divisional governance team.

For each hour of each admission, our models predict the risk of events in the following w hours, where w is prespecified differently for each event to reflect clinical workflow. The predicted events were intubation, extubation, and death (w = 6); fluid challenge and commencing noradrenaline (w = 3); and up- and down-titrating FiO2 (w = 1). Extubations in the context of treatment withdrawal were excluded. The airway status variable in our electronic health record was found to contain missing and erroneous values, so additional variables (eg, cuff pressure, Glasgow coma scale voice component, ventilation mode, airway status on blood gases, end-tidal CO2) were used in combination with airway status to form a set of rules for more reliable identification of intubation and extubation events.
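
The label construction this implies can be sketched as follows (Python/NumPy; the boolean per-hour event series is a simplification of our event-identification rules):

```python
# Sketch of label construction: the label at hour t is positive if
# the event occurs in the following w hours (w = 6 for intubation,
# extubation, and death; 3 for fluid challenge and noradrenaline;
# 1 for FiO2 titration).
import numpy as np

def make_labels(event_hours: np.ndarray, w: int) -> np.ndarray:
    """event_hours: shape (T,) booleans, True where the event occurred."""
    T = len(event_hours)
    labels = np.zeros(T, dtype=bool)
    for t in range(T):
        labels[t] = event_hours[t + 1 : t + 1 + w].any()
    return labels

events = np.zeros(12, dtype=bool)
events[8] = True                    # event at hour 8
print(make_labels(events, w=3))     # hours 5-7 are labeled positive
```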

Most events occurred rarely and in a minority of admissions. The number of events (% of admissions with ≥ 1 event) in the dataset was 1033 (15.7%) for intubation, 1764 (26.6%) for extubation, 823 (17.0%) for death, 51 252 (71.3%) for fluid challenge, 1522 (30.9%) for commencing noradrenaline, 45 405 (83.9%) for up-titrating FiO2, and 58 923 (89.9%) for down-titrating FiO2. The median (interquartile range, IQR) length of stay was 4.8 (2.3–9.0) days.

A subset of the available demographic and longitudinal variables was manually selected for input to the models on the basis of clinical plausibility. Table 1 details the demographic variables. Table 2 details the longitudinal variables. The data were divided into training (3299 admissions; 70%), validation (471 admissions; 10%), and testing (943 admissions; 20%) folds on a temporal basis, with the training and test folds containing the earliest and latest admissions, respectively. Although this approach potentially allows patients readmitted to ICU to appear in multiple folds, it mirrors the anticipated real-world usage of our model and therefore does not limit its validity.

Table 1.

Demographic variables selected for input to the models.

Variable (units) Median (IQR) or admissions, n (%)
Age (years) 70 (54–81)
Height (cm) 165 (160–175)
Weight (kg) 70 (60–84)
Pre-admission length of stay (days) 1.0 (0.7–2.7)
Previous ICU admissions 0 (0–0)
Previous hospital admissions 0 (0–2)
Male sex 2429 (51.5%)
Married 2088 (44.3%)
Single 881 (18.7%)
Widowed 449 (9.5%)
Divorced 192 (4.1%)
Separated 39 (0.8%)
Unknown relationship status 1064 (22.6%)
Admitted from ward 1657 (35.2%)
Admitted from emergency department 1560 (33.1%)
Admitted from theaters 1345 (28.5%)
Admitted from elsewhere 151 (3.2%)
Respiratory disease 68 (1.4%)
Metastatic disease 187 (4.0%)
Previous chemotherapy 165 (3.5%)
Dementia 162 (3.4%)
Emphysema 129 (2.7%)

Table 2.

Longitudinal variables selected for input to the models.

Variable (units) Median (IQR) % missing values Categories
Hour of day (hours) 12 (7–18) 0
Heart rate (BPM) 88 (75–101) 27.9
Mean arterial pressure (mmHg) 80 (70–92) 28.7
Temperature (Celsius) 36.9 (36.5–37.4) 36.9
Respiratory rate (RPM) 20 (16–26) 29.3
SpO2 (%) 95 (92–97) 27.8
FiO2 (%) 25 (21–30) 26
Tidal volume (ml) 466 (378–569) 76.9
PEEP (cmH2O) 5 (5–8) 76.8
Plateau pressure (cmH2O) 19 (16–25) 76.8
Urine output (ml/hour) 75 (40–140) 27.6
GCS, verbal 5 (4–5) 68.7
GCS, motor 6 (5–6) 49.8
GCS, eyes 4 (3–4) 49.8
pH 7.42 (7.36–7.46) 80.9
PaO2 (kPa) 10.0 (8.8–11.7) 83.9
PaCO2 (kPa) 5.5 (4.7–6.5) 80.9
Base excess (mEq/L) 2.1 (−1.1 to 5.6) 81.1
Lactate (mmol/L) 1.1 (0.8–1.6) 81.3
Hemoglobin, lab (g/L) 93 (81–110) 96.6
Neutrophils, lab (10⁹/L) 9.2 (6.2–13.5) 96.6
Platelets, lab (10⁹/L) 244 (162–351) 96.6
Sodium, lab (mmol/L) 138 (135–141) 96.5
Potassium, lab (mmol/L) 4.6 (4.2–5.0) 96.6
Creatinine, lab (μmol/L) 77 (58–140) 96.5
C-reactive protein, lab (mg/L) 80 (33–167) 96.7
Respiratory mode Room air, nasal, face mask, tracheostomy mask, high-flow nasal, noninvasive ventilation, invasive ventilation
Intubated Yes, no
Respiratory secretions Sparse, moderate, copious
ECG rhythm Normal sinus, sinus bradycardia, sinus tachycardia, atrial fibrillation, supraventricular tachycardia, ventricular tachycardia, paced
Receiving noradrenaline Yes, no
Treatment escalation Full, limited
Sat out Yes, no
Turns themselves Yes, no

While some previous studies impute missing values using a last-observation-carried-forward approach, this has been demonstrated to produce bias.8,29,30 Many longitudinal variables in the ICU are measured less frequently when they are assumed to be more stable or less abnormal, making it challenging to impute their missing values using standard methods. Instead, we create a missingness indicator for each longitudinal variable.15 Details of the data preparation and missingness indicators are contained in Supplementary Appendix A.
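
A minimal sketch of the indicator scheme follows (pandas; the toy columns and the fill value are illustrative, and our full preparation is described in Supplementary Appendix A):

```python
# Sketch of missingness indicators: each longitudinal variable gains
# a binary companion column flagging hours with no recorded value.
import numpy as np
import pandas as pd

hours = pd.DataFrame({
    "heart_rate": [88.0, np.nan, 92.0, np.nan],
    "lactate":    [np.nan, np.nan, 1.4, np.nan],
})
for col in list(hours.columns):
    hours[col + "_missing"] = hours[col].isna().astype(int)
    # Placeholder fill value; the indicator column lets the model
    # learn that this value means "unmeasured" rather than "zero".
    hours[col] = hours[col].fillna(0.0)
print(hours)
```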

We specified baseline risk-prediction models—a logistic regression (LR) model and a feedforward neural network (FFNN)—which make predictions using the demographic data and longitudinal data from the current hour only. We compared these with other models—LSTM-FFNN and TCN-FFNN—which consider the longitudinal data from the most recent a hours (or fewer where the patient has been admitted for < a hours), where a is a hyperparameter learnt during model tuning. All models were fitted using the training fold data.

Model specification

TCN-FFNN applies a TCN to its longitudinal data input. Its demographic data input is passed through a feedforward neural network with dropout applied beforehand. Binary logistic regression models (1 per predicted event) are applied to the concatenated outputs of the TCN and the feedforward neural network. Figure 1 summarizes TCN-FFNN.

Figure 1. Overview of TCN-FFNN.
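
The wiring in Figure 1 can be sketched as follows (tf.keras standing in for the Keras 2.2.2/keras-tcn 2.2.3 used in the study; window length, variable counts, filter count, and dropout rate are placeholders rather than our tuned hyperparameters, and the TCN branch is abbreviated to plain dilated causal convolutions, deferring residual and skip connections to the sketch after Figure 2):

```python
# Hedged sketch of TCN-FFNN's overall wiring; all sizes are
# placeholders, not the tuned values from the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model

A_HOURS, N_LONG, N_DEMO, N_FILTERS = 48, 40, 20, 32  # placeholders
EVENTS = ["intubation", "extubation", "death", "fluid_challenge",
          "noradrenaline_start", "fio2_up", "fio2_down"]

longitudinal = tf.keras.Input(shape=(A_HOURS, N_LONG))
demographic = tf.keras.Input(shape=(N_DEMO,))

# TCN branch (simplified): 1x1 convolution to set dimensionality,
# then dilated causal convolutions; keep only the final time step.
x = layers.Conv1D(N_FILTERS, 1)(longitudinal)
for dilation in (1, 2, 4, 8, 16, 32):
    x = layers.Conv1D(N_FILTERS, 2, dilation_rate=dilation,
                      padding="causal", activation="relu")(x)
tcn_out = layers.Lambda(lambda t: t[:, -1, :])(x)

# Demographic branch: dropout applied beforehand, then a tanh layer
# whose width matches its input, as described in the text.
ffnn_out = layers.Dense(N_DEMO, activation="tanh")(
    layers.Dropout(0.2)(demographic))

# One logistic regression (single sigmoid unit) per predicted event
# on the concatenated representations: the shared-representation
# multitask design from the Introduction.
merged = layers.Concatenate()([tcn_out, ffnn_out])
outputs = [layers.Dense(1, activation="sigmoid", name=e)(merged)
           for e in EVENTS]
model = Model([longitudinal, demographic], outputs)
model.compile(optimizer="adam",
              loss=["binary_crossentropy"] * len(EVENTS))
```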

LSTM-FFNN applies a stack of LSTM cells to its longitudinal data input but otherwise uses the same architecture as TCN-FFNN. The output sequence from the first LSTM cell in the stack is used as input to the second and so on. Dropout is applied to the input of each cell using a consistent dropout mask for all time steps.31 We use the popular LSTM cell design described by Graves,32 modified so that the “forget gate” bias is initialized as a vector of ones.24
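
A hedged sketch of the stacked-LSTM branch follows (unit counts and dropout rate are placeholders; Keras's `dropout` argument reuses a single mask across time steps, and `unit_forget_bias=True` initializes the forget-gate bias to ones, matching the two modifications described above):

```python
# Sketch of the stacked-LSTM branch of LSTM-FFNN; sizes are
# placeholders. The rest of the architecture matches TCN-FFNN.
import tensorflow as tf
from tensorflow.keras import layers

lstm_branch = tf.keras.Sequential([
    tf.keras.Input(shape=(48, 40)),          # (hours, longitudinal vars)
    layers.LSTM(64, return_sequences=True,   # output sequence feeds the
                dropout=0.2,                 # next cell in the stack
                unit_forget_bias=True),
    layers.LSTM(64, return_sequences=False,  # final hidden state only
                dropout=0.2,
                unit_forget_bias=True),
])
```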

LR applies binary logistic regression models (1 per predicted event) directly to its input with an L2 regularization penalty on the model weights. FFNN applies a feedforward neural network to its input with dropout prior to each fully connected layer to reduce overfitting.33 Binary logistic regression models (1 per predicted event) are applied to the output of the final layer.

TCN-FFNN, LSTM-FFNN, and FFNN all transform their longitudinal and demographic data inputs using model components which are shared across all event prediction tasks. Their only task-specific components are the final logistic regressions, which are deliberately chosen for their low variance. The feedforward neural network components of TCN-FFNN, LSTM-FFNN, and FFNN use hyperbolic tangent activation, matching their number of hidden units to the dimensionality of their input. Models were specified and fitted using Keras 2.2.2 and keras-tcn 2.2.3 for Python 3.6.6. Model training is described in Supplementary Appendix B.

Figure 2 provides details of the TCN component of TCN-FFNN, which combines elements of previous successful TCN architectures.26,27 The TCN first performs 1×1 convolution on its input in order to alter its dimensionality and thus control the number of TCN parameters.34 The 1×1-convolved input is then passed through a number of identically structured layers which perform 1-dimensional dilated causal convolution. The dilation rate increases exponentially in each subsequent layer: 2⁰ in the first layer, 2¹ in the second, 2² in the third, and so on. The TCN uses the minimum number of layers needed for the dilation rate of the final layer multiplied by the filter size to exceed a, so that the receptive field of the TCN is large enough to “see” every hour of the longitudinal data input. Each TCN layer duplicates its output to form 2 copies. The first copy is added to the layer’s input and then input to the next TCN layer. This creates residual connections between the inputs of earlier and later layers, which allow TCNs to perform better when many layers are used.35 For similar reasons, the second output copies from each layer are added together to form the basis for the overall TCN output.27 We use only the final row of this output, as the causal convolutions used in each layer mean that only this row contains information from every hour of the longitudinal data input. Each TCN layer applies spatial dropout to ameliorate overfitting.36 Weights and biases are not shared across layers. The TCN uses rectified linear unit (ReLU) activation throughout to aid learning of nonlinear relationships.

Figure 2. Details of TCN-FFNN’s TCN component. Each TCN layer is drawn as a gray box. For simplicity, only 3 of the layers are drawn.
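
One residual TCN layer and the layer-count rule can be sketched as follows (again with placeholder sizes; the exact keras-tcn internals may differ):

```python
# Hedged sketch of one TCN layer with residual and skip outputs,
# plus the rule for choosing the number of layers; placeholder sizes.
import tensorflow as tf
from tensorflow.keras import layers

def tcn_layer(x, dilation, n_filters=32, kernel_size=2, rate=0.1):
    """Return (residual output for the next layer, skip output)."""
    h = layers.Conv1D(n_filters, kernel_size, dilation_rate=dilation,
                      padding="causal", activation="relu")(x)
    h = layers.SpatialDropout1D(rate)(h)  # drops whole feature maps
    skip = h                              # second copy, summed into output
    res = layers.Add()([x, h])            # first copy, added to layer input
    return res, skip

def n_layers_needed(a, kernel_size=2):
    """Smallest depth where final dilation x filter size exceeds a."""
    n = 1
    while 2 ** (n - 1) * kernel_size <= a:
        n += 1
    return n

inp = tf.keras.Input(shape=(48, 32))      # after the initial 1x1 conv
x, skips = inp, []
for layer_idx in range(n_layers_needed(48)):
    x, skip = tcn_layer(x, dilation=2 ** layer_idx)
    skips.append(skip)
tcn_output = layers.Add()(skips)          # basis for the overall output
final_row = layers.Lambda(lambda t: t[:, -1, :])(tcn_output)
```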

Evaluation

We evaluated the final tuned models using the test fold data. We generate risk predictions for each event at each hour of each admission, excluding the first hour of admission as no longitudinal data are available at this stage. For each event, we use the true binary labels y and the corresponding predicted risks ŷ from each model to calculate average positive predictive value (PPV) and area under the receiver operating characteristic curve (AUROC). Average PPV ignores true-negative predictions and thus compensates for the rarity of predicted events. We report AUROCs as they consider true-negative predictions, but we acknowledge that they are insensitive to large numbers of false positives when the predicted event is rare. We obtain point estimates for each score using y and ŷ. We also calculate average PPV and AUROC for 400 paired bootstrap samples from y and ŷ and use these to obtain 95% confidence intervals for the point estimates. Model calibration is discussed in Supplementary Appendix C.
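
A sketch of this evaluation for one event follows (scikit-learn's average_precision_score standing in for average PPV, consistent with the description above; the labels and risks are simulated stand-ins for y and ŷ):

```python
# Hedged sketch of the paired-bootstrap evaluation for one event.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def point_and_ci(y, y_hat, metric, n_boot=400, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))  # paired resample
        scores.append(metric(y[idx], y_hat[idx]))
    lo, hi = np.quantile(scores, [0.025, 0.975])    # 95% CI
    return metric(y, y_hat), (lo, hi)               # point estimate + CI

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=5000)                   # simulated labels
y_hat = np.clip(0.5 * y + rng.random(5000), 0, 1)   # simulated risks

print(point_and_ci(y, y_hat, average_precision_score))  # average PPV
print(point_and_ci(y, y_hat, roc_auc_score))            # AUROC
```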

TCN-FFNN learns a complex nonlinear transformation of its input data, and it is thus challenging to understand the determinants of its predictions. We address this by producing a partial dependence plot for each predicted event, with each plot using a different prespecified subset of the input variables.37 For each variable in turn, using the test fold data, we perturb the variable’s nonmissing values by different amounts, input the perturbed data to TCN-FFNN, and calculate its median predicted risk. Nine separate perturbations are used for each variable, linearly spaced across 50% of its IQR. These small perturbation ranges minimize the chance of clinically implausible combinations of variables being input to the model. Some variables are perturbed only negatively or only positively to minimize the risk of producing impossible values. Where perturbation of an individual value produces an impossible result, it is set to the nearest possible value; for example, an FiO2 of 98% with a perturbation of +4% is set to 100%. A different strategy is used to explore the association between hour of day and risk of extubation: the current hour of day for every intubated patient is set to the same value for each hour of day in turn, and the median extubation risk at each value is calculated.
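
The perturbation procedure for a single continuous variable can be sketched as follows (predict_risk is a hypothetical stand-in for TCN-FFNN's forward pass, and the clipping bounds and toy data are illustrative):

```python
# Hedged sketch of one variable's partial dependence curve;
# predict_risk is a hypothetical stand-in for the fitted model.
import numpy as np

def partial_dependence(X, col, missing, predict_risk, lo=0.0, hi=100.0):
    """Median predicted risk under 9 shifts spanning 50% of the IQR."""
    observed = ~missing[:, col]
    q25, q75 = np.percentile(X[observed, col], [25, 75])
    shifts = np.linspace(-0.25, 0.25, 9) * (q75 - q25)
    medians = []
    for s in shifts:
        Xp = X.copy()
        Xp[observed, col] += s                    # perturb nonmissing values
        Xp[:, col] = np.clip(Xp[:, col], lo, hi)  # eg, FiO2 98% + 4% -> 100%
        medians.append(np.median(predict_risk(Xp)))
    return shifts, np.array(medians)

rng = np.random.default_rng(0)
X = rng.normal(25, 5, size=(500, 1))              # toy FiO2-like column
missing = rng.random((500, 1)) < 0.3              # True where missing
risk = lambda M: 1 / (1 + np.exp(-(M[:, 0] - 25) / 5))  # toy model
print(partial_dependence(X, 0, missing, risk))
```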

In some circumstances, it is clinically useful to estimate the risk of events within timeframes other than those we prespecified (ie, to consider other values of w). We explore the sensitivity of model performance to prediction timeframe by retraining each of the final models for w∈{1, 2, 3, 4, 5, 6}, using the same value of w for each event.

RESULTS

Figure 3 compares the performance of each model on the test fold data. Table 3 presents the same data in numerical form. Figure 4 shows the partial dependence plots.

Figure 3. Model performance on the test fold data. Point estimates and their 95% confidence intervals are plotted as dots and lines, respectively.

Table 3.

Model performance on the test fold data.

Event Model Average PPV (95% CI) AUROC (95% CI) Calibration MAE (95% CI)
Fluid challenge LR 0.314 (0.307–0.321) 0.772 (0.768–0.775) 0.120–0.155
Fluid challenge FFNN 0.343 (0.336–0.352) 0.793 (0.789–0.796) 0.060–0.092
Fluid challenge LSTM-FFNN 0.440 (0.432–0.447) 0.840 (0.836–0.842) 0.043–0.059
Fluid challenge TCN-FFNN 0.441 (0.433–0.448) 0.839 (0.836–0.842) 0.026–0.049
Intubation LR 0.089 (0.077–0.107) 0.847 (0.833–0.861) 0.132–0.413
Intubation FFNN 0.101 (0.085–0.119) 0.858 (0.844–0.875) 0.119–0.422
Intubation LSTM-FFNN 0.153 (0.130–0.179) 0.882 (0.869–0.895) 0.146–0.299
Intubation TCN-FFNN 0.139 (0.117–0.162) 0.896 (0.885–0.908) 0.055–0.346
Extubation LR 0.050 (0.035–0.071) 0.797 (0.771–0.825) 0.369–0.492
Extubation FFNN 0.050 (0.037–0.068) 0.859 (0.840–0.879) 0.485–0.487
Extubation LSTM-FFNN 0.592 (0.539–0.636) 0.893 (0.871–0.913) 0.093–0.202
Extubation TCN-FFNN 0.574 (0.519–0.625) 0.903 (0.881–0.921) 0.091–0.173
Death LR 0.187 (0.161–0.219) 0.903 (0.894–0.914) 0.199–0.274
Death FFNN 0.188 (0.162–0.218) 0.915 (0.905–0.925) 0.179–0.288
Death LSTM-FFNN 0.266 (0.237–0.297) 0.955 (0.948–0.962) 0.175–0.274
Death TCN-FFNN 0.315 (0.282–0.352) 0.958 (0.951–0.964) 0.155–0.243
Noradrenaline, starting LR 0.206 (0.176–0.241) 0.965 (0.959–0.970) 0.203–0.326
Noradrenaline, starting FFNN 0.317 (0.284–0.360) 0.970 (0.965–0.975) 0.204–0.242
Noradrenaline, starting LSTM-FFNN 0.477 (0.433–0.521) 0.987 (0.983–0.991) 0.117–0.241
Noradrenaline, starting TCN-FFNN 0.533 (0.492–0.572) 0.988 (0.984–0.991) 0.133–0.217
FiO2, down-titrating LR 0.275 (0.270–0.280) 0.724 (0.721–0.727) 0.064–0.091
FiO2, down-titrating FFNN 0.289 (0.285–0.295) 0.740 (0.737–0.743) 0.047–0.077
FiO2, down-titrating LSTM-FFNN 0.776 (0.771–0.780) 0.931 (0.930–0.933) 0.052–0.073
FiO2, down-titrating TCN-FFNN 0.786 (0.781–0.790) 0.932 (0.930–0.933) 0.051–0.078
FiO2, up-titrating LR 0.215 (0.210–0.219) 0.735 (0.732–0.739) 0.262–0.324
FiO2, up-titrating FFNN 0.236 (0.230–0.241) 0.746 (0.743–0.749) 0.191–0.262
FiO2, up-titrating LSTM-FFNN 0.726 (0.721–0.732) 0.909 (0.906–0.911) 0.083–0.103
FiO2, up-titrating TCN-FFNN 0.738 (0.732–0.743) 0.906 (0.904–0.909) 0.053–0.084

Abbreviations: AUROC, area under the receiver operating characteristic curve; CI, confidence interval; FFNN, feedforward neural network; LR, logistic regression; LSTM, long short-term memory; MAE, mean absolute error; PPV, positive predictive value; TCN, temporal convolutional network.

Figure 4. Changes in predicted risks with perturbation of input variables. The lines show median predicted risk rather than the associations between inputs and predictions in any specific patient. Plot b excludes patients who are currently intubated. Plot e excludes patients currently receiving noradrenaline. Plots c, f, g, and h exclude patients not currently intubated. The risks of extubation in plot h apply to the 6-hour periods following the hours of day indicated on the x axis.

TCN-FFNN performed similarly to LSTM-FFNN when predicting most events but achieved a modest improvement in average PPV when predicting FiO2 titration. TCN-FFNN and LSTM-FFNN both consistently outperformed FFNN and LR.

TCN-FFNN’s highest average PPVs were obtained when predicting FiO2 titration, with a 3-fold improvement over FFNN and LR. Its largest relative improvement over FFNN and LR was obtained when predicting extubation, with an 11-fold improvement in average PPV and good absolute performance. All models obtained low average PPVs when predicting intubation. TCN-FFNN and LSTM-FFNN obtained higher AUROCs than FFNN and LR for all predicted events. FFNN generally made modest improvements over LR; it obtained a higher average PPV than LR when predicting fluid challenge, starting noradrenaline, and titrating FiO2.

Sensitivity analysis

Figure 5 compares the performance of each model across varying prediction timeframes (ie, varying w). For most values of w, TCN-FFNN made large improvements over LSTM-FFNN when predicting extubation, starting noradrenaline, and FiO2 titration.

Figure 5. Changes in model performance on the test fold data with varying prediction timeframes (ie, varying values of w). Point estimates and their 95% confidence intervals are plotted as dots and lines, respectively.

The performance of LR and FFNN when predicting FiO2 titration increased linearly with larger values of w. As w increased, TCN-FFNN’s performance increased when predicting fluid challenge and decreased when predicting extubation.

DISCUSSION

Our results demonstrate that TCNs can successfully represent longitudinal patient data. Despite LSTMs’ popularity in recent studies, we show that TCNs produce equivalent or better performance in downstream prediction of a variety of interventions. TCN-FFNN’s superior performance in the sensitivity analysis suggests that TCNs are less task-specific than LSTMs after hyperparameter tuning and are thus more easily extended to other clinical applications or to training on additional data.

Unlike LR and FFNN, TCN-FFNN and LSTM-FFNN make predictions using patterns which occur in data over time. For events where TCN-FFNN and LSTM-FFNN outperform FFNN by a large margin, we infer that temporal context is particularly important for good prediction. For example, a predicted extubation is more likely to be correct when made in the context of stable or improving respiratory parameters rather than a snapshot of these parameters at a single point in time. TCN-FFNN and LSTM-FFNN also have access to previous values of variables which are missing at the hour when predictions are made.

Simple protocols for oxygen titration are often used by junior ICU staff, wherein FiO2 is adjusted to achieve SpO2 or PaO2 within a target range. FFNN is sufficiently expressive to encode this logic. In practice, experienced clinicians may subvert protocols due to individual patient circumstances, or use more sophisticated strategies which account for recent trends in SpO2 and PaO2, previous FiO2 titrations and measurement error. These nuances are challenging to capture in guideline form. TCN-FFNN’s increased performance versus FFNN and the other models, particularly for low values of w, implies that it has learnt such nuances and underlines its potential as a decision support tool.

Unlike LR, FFNN is able to learn nonlinear transformations of, and interactions between, the current values of longitudinal variables. For events where FFNN outperforms LR (eg, starting noradrenaline, fluid challenge), we infer that nonlinearity and interactions are particularly important in making good predictions. This reflects common approaches to clinical decision-making: most clinicians’ propensity to start noradrenaline does not vary linearly with MAP but instead increases exponentially as MAP falls below the level required for adequate end organ perfusion. Similarly, clinicians’ propensity to administer a fluid challenge does not depend on MAP, heart rate, and urine output considered only in isolation from one another but also on the interaction between these and other parameters.

TCN-FFNN performed best when predicting clinical decisions which are largely determined by variables captured in structured electronic health data. TCN-FFNN also performed well when predicting events which occur in similar circumstances (eg, extubation). These factors are likely to explain TCN-FFNN’s poor performance when predicting intubation: intubation scenarios are diverse, ranging from emergency intubation in the setting of airway obstruction or neurological deterioration to semi-elective intubation for the purposes of diagnostic imaging. Intubation decisions also often depend strongly on some variables (eg, physical examination, clinician Gestalt) which are invisible to our models.

Our results must be interpreted in the context of our models’ intended use case (ie, prospective risk prediction on the ICU). In consultation with patients and their relatives, ICU staff decide when to make clinical interventions. Given a predicted risk, the average PPV therefore does not tell the healthcare professional the probability of the corresponding intervention but rather helps answer their question “What is the probability that my peers would make that intervention in these circumstances?”

Given the ubiquity with which AUROCs are reported in evaluation of clinical prediction models, it is instructive to compare the utility of our average PPV and AUROC results. Unlike average PPV, AUROC does not inform the decision-making of a healthcare professional who is unsure whether to make an intervention in a given scenario. Our results support previous work which suggests that AUROC is insensitive for the purposes of model selection.38 For instance, TCN-FFNN obtains similar AUROCs for prediction of intubation and extubation, but the average PPV for prediction of extubation is over 3 times larger than that for intubation. AUROC’s insensitivity to false positives for rare events should be noted when comparing the AUROCs for the more common events (eg, fluid challenge and FiO2 titration) with the others.

Model interrogation

Our partial dependence plots demonstrate that TCN-FFNN learns clinically plausible associations between input variables and predicted events. For example, the risk of fluid challenge rises with heart rate and as urine output and MAP decrease. Similarly, the risk of starting noradrenaline rises as MAP and urine output decrease, and the risk of death rises as lactate increases and as MAP decreases. Risk of extubation rises dramatically during daylight hours, reflecting the common clinical practice of avoiding overnight extubation.39 The median predicted risks obtained are roughly proportional to the rarity of the event.

Some input variables act as proxies for unobserved disease processes. For example, risk of starting noradrenaline rises with C-reactive protein (CRP), presumably due to CRP’s association with septic shock.40 This relationship is weak, as a CRP rise is also associated with other inflammatory conditions which are less likely to require noradrenaline. FiO2 is directly controlled by ICU staff, and TCN-FFNN appears to use it as a surrogate marker of patient deterioration as perceived by those staff. For example, the risk of intubation rises only slightly as SpO2 decreases (signaling hypoxia) but rises dramatically as FiO2 increases (signaling hypoxia plus concern from ICU staff). The complex associations between increasing FiO2 and risk of intervention shown in Figure 4 likely reflect confounding, with perceived patient deterioration acting as a common cause of both.

TCN-FFNN’s lowest average PPVs were obtained when predicting intubation and death. For each of these events, we identified the 10 most confident false-positive and 10 most confident false-negative predictions TCN-FFNN made on the test fold data and reviewed the corresponding clinical notes. Four false-positive intubation predictions were in deteriorating patients who were intubated later the same day or on the following day. Seven false-positive death predictions were in patients who died shortly afterwards, 4 within 6 hours of the prediction window. Six of the unpredicted deaths were sudden cardiac arrests which had not been anticipated by the ICU team.

Several of the false predictions occurred where relevant variables were invisible to TCN-FFNN. For example, 4 of the unpredicted deaths occurred after a decision to palliate. Similarly, 4 of the false-positive intubation predictions occurred in patients whose treatment plans precluded intubation. One unpredicted intubation occurred semi-electively in a patient who required surgery, and 3 false-positive death predictions were in deteriorating patients requiring various urgent interventions for potentially reversible pathology. TCN-FFNN lacks much of the relevant data (eg, imaging, availability of theater resources) to distinguish these patient cohorts. Nine of the false-negative and 1 of the false-positive intubation predictions were due to incorrect labels.

Future work

A prospective validation would be necessary before TCN-FFNN could be used in clinical practice. Validation at an external center would be required to demonstrate that TCN-FFNN generalizes beyond our ICU’s patterns of care. Analogously, we might investigate whether TCN-FFNN’s learnt representations of each patient-hour generalize (without further training) to predicting other clinical interventions. In many of TCN-FFNN’s erroneous predictions, relevant clinical information is lacking from its input. Much of this information is contained in text notes, and previous studies have demonstrated techniques for combining text data with physiological variables, resulting in improved predictive performance.41,42 Incorporating text data would significantly enlarge the feature space from which TCN-FFNN learns, which risks overfitting given the small number of positive labels in our dataset. It may be possible to counteract this by incorporating information from larger ICU datasets.43

In this study, we use all clinical interventions as positive labels (ie, we treat normal clinical practice as the gold standard we aim for TCN-FFNN to emulate). Our mini-batch training biases against learning from interventions made in very atypical circumstances; these occur rarely by definition, so their expected contribution to mini-batch loss is low. However, it may increase the clinical utility of TCN-FFNN to explicitly limit positive labels to interventions deemed successful or appropriate. Distinguishing successful interventions is most straightforward in the case of extubation, where any extubation which is rapidly followed by reintubation would be deemed unsuccessful. However, many unsuccessful extubations (so defined) might still be deemed appropriate where they fail due to previously unobserved patient factors which inform later management. Identifying other appropriate interventions (eg, intubation) is even more challenging due to the lack of relevant counterfactual information.

Our error analysis suggests that TCN-FFNN’s death predictions identify dying patients even when they fail to identify the 6-hour window in which death occurs. Given that predicting time of death is very challenging for clinicians,44,45 it might be possible for TCN-FFNN to maintain clinical utility even using a larger mortality prediction window. Despite our efforts to mitigate errors in the airway status variable, many erroneous intubation labels remain. In order to refine our labeling rules, we need to evaluate their specificity, but doing this manually is labor-intensive given the rarity of intubation events. We could accelerate this process by using TCN-FFNN’s most-confident false-positive intubation predictions as likely candidates for unlabeled intubations.

CONCLUSION

We show that patterns in ICU patients’ longitudinal data can be represented by TCNs, thereby producing improvements in downstream prediction of clinical events. Our model (TCN-FFNN) achieves this while maintaining clinically plausible associations between its inputs and predictions. TCN-FFNN shows potential as a decision support tool, particularly where the important determinants of decisions are captured in structured electronic health data, such as in the cases of FiO2 titration, extubation, and starting noradrenaline. Some important variables are currently invisible to TCN-FFNN, necessitating clinical oversight of its use. Several promising approaches are available for improving its predictions further. A prospective validation and regulatory approval are required before TCN-FFNN could be deployed in practice.

AUTHOR CONTRIBUTIONS

Both authors contributed substantially to conception and design of the project and to data analysis. FJRC drafted the manuscript and AHW revised it critically for important intellectual content. Both authors gave final approval of the version to be published and agree to be accountable for all aspects of the work.

Supplementary Material

ocz205_Supplementary_Data

ACKNOWLEDGMENTS

We thank the staff of Barnet Hospital ICU for their diligent recording of patient data.

Conflict of Interest statement

None declared.

REFERENCES

1. Higgs A, McGrath BA, Goddard C, et al. Guidelines for the management of tracheal intubation in critically ill adults. Br J Anaesth 2018; 120 (2): 323–52.
2. Bonten MJM, Kollef MH, Hall JB. Risk factors for ventilator-associated pneumonia: from epidemiology to patient management. Clin Infect Dis 2004; 38 (8): 1141–9.
3. Chu DK, Kim L-Y, Young PJ, et al. Mortality and morbidity in acutely ill adults treated with liberal versus conservative oxygen therapy (IOTA): a systematic review and meta-analysis. Lancet 2018; 391 (10131): 1693–705.
4. Royal College of Physicians. National Early Warning Score (NEWS) 2: Standardising the Assessment of Acute-Illness Severity in the NHS. London: RCP; 2017.
5. Tarassenko L, Hann A, Young D. Integrated monitoring and analysis for early warning of patient deterioration. Br J Anaesth 2006; 97 (1): 64–8.
6. Celi LA, Hinske LC, Alterovitz G, et al. An artificial intelligence tool to predict fluid requirement in the intensive care unit: a proof-of-concept study. Crit Care 2008; 12 (6): R151.
7. Zhang Z, Ho KM, Hong Y. Machine learning for the prediction of volume responsiveness in patients with oliguric acute kidney injury in critical care. Crit Care 2019; 23 (1): 112.
8. Desautels T, Das R, Calvert J, et al. Prediction of early unplanned intensive care unit readmission in a UK tertiary care hospital: a cross-sectional machine learning approach. BMJ Open 2017; 7 (9): e017199.
9. Tsur E, Last M, Garcia VF, et al. Hypotensive episode prediction in ICUs via observation window splitting. In: Brefeld U, Curry E, Daly E, et al, eds. Machine Learning and Knowledge Discovery in Databases. Cham, Switzerland: Springer International Publishing; 2019: 472–87.
10. Lundberg SM, Nair B, Vavilala MS, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng 2018; 2 (10): 749–60.
11. Nemati S, Holder A, Razmi F, et al. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit Care Med 2018; 46 (4): 547–53.
12. Hornik K. Approximation capabilities of multilayer feedforward networks. Neural Netw 1991; 4 (2): 251–7.
13. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, et al, eds. Advances in Neural Information Processing Systems 25. Lake Tahoe, Nevada, USA: Curran Associates; 2012: 1097–105.
14. Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Ghahramani Z, Welling M, Cortes C, et al, eds. Advances in Neural Information Processing Systems 27. Montreal, Canada: Curran Associates; 2014: 3104–12.
15. Lipton ZC, Kale D, Wetzel R. Directly modeling missing data in sequences with RNNs: improved classification of clinical time series. In: Proceedings of the 1st Machine Learning for Healthcare Conference; 19–20 August 2016; Los Angeles, California, USA.
16. Zheng H, Shi D. Using a LSTM-RNN based deep learning framework for ICU mortality prediction. In: Meng X, Li R, Wang K, Niu B, Wang X, Zhao G, eds. Web Information Systems and Applications. WISA 2018. Lecture Notes in Computer Science, vol 11242. Cham: Springer; 2018.
17. Rajkomar A, Oren E, Chen K, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med 2018; 1: 18.
18. Suresh H, Hunt N, Johnson A, et al. Clinical intervention prediction and understanding with deep neural networks. In: Proceedings of the 2nd Machine Learning for Healthcare Conference; 18–19 August 2017; Boston, Massachusetts, USA.
19. Lipton ZC, Kale D, Elkan C, et al. Learning to diagnose with LSTM recurrent neural networks. In: 4th International Conference on Learning Representations; 2–4 May 2016; San Juan, Puerto Rico.
20. Aczon M, Ledbetter D, Ho L, et al. Dynamic mortality risk predictions in pediatric critical care using recurrent neural networks. arXiv [stat.ML]. 2017. http://arxiv.org/abs/1701.06675. Accessed March 1, 2017.
21. Choi E, Bahadori MT, Schuetz A, et al. Doctor AI: predicting clinical events via recurrent neural networks. In: Proceedings of the Machine Learning for Healthcare Conference; 19–20 August 2016; Los Angeles, California, USA.
22. Harutyunyan H, Khachatrian H, Kale DC, et al. Multitask learning and benchmarking with clinical time series data. arXiv [stat.ML]. 2017. http://arxiv.org/abs/1703.07771. Accessed February 2, 2018.
23. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997; 9 (8): 1735–80.
24. Jozefowicz R, Zaremba W, Sutskever I. An empirical exploration of recurrent network architectures. In: Proceedings of the 32nd International Conference on Machine Learning, Volume 37; 6–11 July 2015; Lille, France.
25. Greff K, Srivastava RK, Koutnik J, et al. LSTM: a search space odyssey. IEEE Trans Neural Netw Learn Syst 2017; 28: 2222–32.
26. Bai S, Kolter JZ, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. 2018. http://arxiv.org/abs/1803.01271. Accessed May 5, 2018.
27. van den Oord A, Dieleman S, Zen H, et al. WaveNet: a generative model for raw audio. arXiv [cs.SD]. 2016. http://arxiv.org/abs/1609.03499. Accessed July 7, 2017.
28. Caruana R. Multitask learning. Mach Learn 1997; 28 (1): 41–75.
29. Komorowski M, Celi LA, Badawi O, et al. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med 2018; 24 (11): 1716–20.
30. Cook RJ, Zeng L, Yi GY. Marginal analysis of incomplete longitudinal binary data: a cautionary note on LOCF imputation. Biometrics 2004; 60 (3): 820–8.
31. Gal Y, Ghahramani Z. A theoretically grounded application of dropout in recurrent neural networks. In: 30th Conference on Neural Information Processing Systems; 5–10 December 2016; Barcelona, Spain.
32. Graves A. Generating sequences with recurrent neural networks. arXiv [cs.NE]. 2013. http://arxiv.org/abs/1308.0850. Accessed November 22, 2016.
33. Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014; 15: 1929–58.
34. Lin M, Chen Q, Yan S. Network in network. In: 2nd International Conference on Learning Representations; 14–16 April 2014; Banff, Alberta, Canada.
35. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 26 June–1 July 2016; Las Vegas, Nevada, USA.
36. Tompson J, Goroshin R, Jain A, et al. Efficient object localization using convolutional networks. arXiv [cs.CV]. 2014. http://arxiv.org/abs/1411.4280. Accessed May 5, 2018.
37. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Statist 2001; 29 (5): 1189–232.
38. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 2007; 115 (7): 928–35.
39. Gershengorn HB, Scales DC, Kramer A, et al. Association between overnight extubations and outcomes in the intensive care unit. JAMA Intern Med 2016; 176 (11): 1651–60.
40. Póvoa P, Coelho L, Almeida E, et al. C-reactive protein as a marker of infection in critically ill patients. Clin Microbiol Infect 2005; 11 (2): 101–8.
41. Ghassemi M, Naumann T, Doshi-Velez F, et al. Unfolding physiological state: mortality modelling in intensive care units. KDD 2014; 2014: 75–84.
42. Caballero Barajas KL, Akella R. Dynamically modeling patient’s health state from electronic medical records: a time series approach. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 10–15 August 2015; Sydney, Australia.
43. Desautels T, Calvert J, Hoffman J, et al. Using transfer learning for improved mortality prediction in a data-scarce hospital setting. Biomed Inform Insights 2017; 9: 1178222617712994.
44. Detsky ME, Harhay MO, Bayard DF, et al. Discriminative accuracy of physician and nurse predictions for survival and functional outcomes 6 months after an ICU admission. JAMA 2017; 317 (21): 2187–95.
45. White N, Reid F, Vickerstaff V, et al. Imminent death: clinician certainty and accuracy of prognostic predictions. BMJ Support Palliat Care 2019; doi: 10.1136/bmjspcare-2018-001761.
