PLOS Digital Health. 2023 Jun 9;2(6):e0000116. doi: 10.1371/journal.pdig.0000116

Development of a dynamic prediction model for unplanned ICU admission and mortality in hospitalized patients

Davide Placido 1, Hans-Christian Thorsen-Meyer 1,2, Benjamin Skov Kaas-Hansen 2,3, Roc Reguant 1,4, Søren Brunak 1,5,*
Editor: Jessica Keim-Malpass
PMCID: PMC10256150  PMID: 37294826

Abstract

Frequent assessment of the severity of illness for hospitalized patients is essential in clinical settings to prevent outcomes such as in-hospital mortality and unplanned admission to the intensive care unit (ICU). Classical severity scores have typically been developed using relatively few patient features. Recently, deep learning-based models have demonstrated better individualized risk assessments than classic risk scores, thanks to the use of aggregated and more heterogeneous data sources for dynamic risk prediction. We investigated to what extent deep learning methods can capture patterns of longitudinal change in health status using time-stamped data from electronic health records. We developed a deep learning model based on embedded text from multiple data sources and recurrent neural networks to predict the risk of the composite outcome of unplanned ICU transfer and in-hospital death. The risk was assessed at regular intervals during the admission for different prediction windows. Input data included medical history, biochemical measurements, and clinical notes from a total of 852,620 patients admitted to non-intensive care units in 12 hospitals in Denmark’s Capital Region and Region Zealand during 2011–2016 (with a total of 2,241,849 admissions). We subsequently explained the model using the Shapley algorithm, which provides the contribution of each feature to the model outcome. The best model used all data modalities, with an assessment rate of 6 hours, a prediction window of 14 days and an area under the receiver operating characteristic curve of 0.898. The discrimination and calibration obtained with this model make it a viable clinical support tool for detecting patients at higher risk of clinical deterioration, providing clinicians with insights into both actionable and non-actionable patient features.

Author summary

Early warning scores are used in hospitals to assess the severity of patients’ conditions, with higher score values indicating patients at higher risk of clinical deterioration. These scores are traditionally based on a relatively small set of variables, and they require medical personnel to collect the measurements included in the score calculation. In this study we investigated whether machine learning models using data routinely registered in the electronic health record systems of hospitals can predict clinical deterioration. The outcome was defined as in-hospital mortality and unplanned transfer to the intensive care unit. Data included disease history, laboratory data, and medical notes from 12 hospitals of the Capital Region of Denmark and Region Zealand during the period 2011–2016. The model had good discrimination and calibration, with better performance when assessing the risk every 6 hours during an admission and using a prediction window of 14 days. The model was able to identify abnormal lab values and, in general, variables associated with critical conditions. Our study shows that machine learning models like the one developed here can serve as clinical decision support tools for this purpose and are ready to be prospectively validated in the clinical setting.

Introduction

Evidence-based medicine is at the foundation of the clinical decision process: doctors must continuously make decisions using their knowledge to provide the best care to their patients. For this reason, numerous scores are used in healthcare to support clinicians with decision-making. Such scores vary depending on the target population and the intended use, and each medical specialty therefore has its own examples (e.g. in the ICU, the APACHE [1] and SAPS [2] scores are commonly used to assess severity of illness).

In general departments, Early Warning Scores (EWSs) are used to assess the health status of hospitalized patients. The first EWS was based solely on five physiological parameters and has since been adapted and improved [3,4]. One such adaptation was the VitalPAC early warning score (ViEWS), for which modifications were introduced [5]. These modifications were based on clinicians’ knowledge about the relationship between physiological data and adverse clinical outcomes. Further modifications were implemented in the National Early Warning Score (NEWS), currently the most widely used in the Capital Region of Denmark [6].

These scores are used extensively because their simplicity makes them applicable and easily comparable across different departments and countries. Specifically, they provide estimates of the risk of adverse outcomes [7] such as cardiac arrest, ICU transfer, and in-hospital mortality. However, they suffer from some limitations. The relatively few data types used limit their predictive power. Relevant features such as biochemical measurements, the temporal order of the data, and non-linear feature interactions are ignored. These risk calculations are made without exploiting all the information that current EHR systems provide [8]. Moreover, the scores mostly rely on data that are manually collected by the staff, which often entails a higher number of missing and incorrect data items compared to a fully automated approach. Recently it was shown, in population-wide data from inpatients of the Capital Region of Denmark, that around 10% of the NEWS records were incomplete and 0.2% had implausible values [9].

To improve prediction accuracy and reduce alarm fatigue, new applications using more data and more sophisticated methodologies have been developed [10]. Given the heterogeneous nature of EHR data and the rarity of events such as clinical deterioration, machine learning methods seem well-suited for this task. Deep learning, in particular, has performed well for many similar clinical tasks. For example, Shamout et al. [11] developed a deep learning model that uses vital signs to predict the composite outcome of ICU transfer, cardiac arrest and mortality. Cho et al. [12] developed a deep learning model to predict the composite outcome of cardiac arrest and ICU transfer, while da Silva et al. [13] developed a long short-term memory (LSTM) neural network to predict the worsening of vital signs. A recurrent deep neural network has also been used by Tóth et al. [14] to predict the stability of vital signs and thereby avoid unnecessary measurements at nighttime. Similarly, dynamic deep learning models using multiple data sources have been developed for different tasks: Thorsen-Meyer et al. [15] constructed one to predict 90-day mortality after ICU admission; Rajkomar et al. [16] to predict in-hospital death, 30-day unplanned readmission, prolonged length of stay and patients’ final discharge diagnoses; Tomašev et al. [17] to predict acute kidney injury; and Lauritsen et al. to predict the risk of sepsis and acute critical illness [18,19].

Although deviating vital signs often precede clinical deterioration, their recognition in the general ward requires continuous engagement by health care personnel, which can delay or even preclude their recording. Often, vital signs are not included systematically in patient records in structured form; this was also the case in the data foundation for this study. The hypothesis of this study is therefore that a deep learning model using heterogeneous clinical data collected less frequently can predict, at regular intervals during an admission, the risk of the composite outcome of in-hospital death and unplanned ICU transfer (Fig 1). At the same time, an important aspect is to inform clinicians of which features are actionable.

Fig 1. General structure of the prediction framework.


Given a specific assessment rate (time between two consecutive risk assessments) and prediction window (time window within which the outcome is observed), the risk of clinical deterioration was assessed continuously during each general admission. All the data up to the time of assessment was used to train the model, which was then evaluated and interpreted on the holdout test data.

To this end, we combined natural language processing algorithms and recurrent neural networks (Fig 2) to leverage latent patterns in the data. We subsequently assessed the impact of the single data sources and the so-called tokens extracted from each of them to make the model explainable across features. We successfully validated the model internally on a patient subset selected randomly from the pool of patients available.

Fig 2. Structure of the deep learning model, exemplified by two admissions.


For each sample (i.e. admission at a specific time point) a 2D tensor (matrix) comprises the sequence of embedded tokens until the time of assessment. Each tensor was given as input to a recurrent block (GRU or LSTM, part of the hyperparameters search) and the time dimension was pooled through an attention pooling layer. The flattened tensors were then concatenated and fed to a linear layer with a standard logistic activation function.

Methods

This paper adheres to relevant items in the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis statement (TRIPOD) (S4 Table) [20].

Patients and outcome

The data comprise all inpatient admissions, from 2011 to 2016, to 12 public hospitals in the Capital Region of Denmark and Region Zealand. Admissions were pieced together by concatenating consecutive inpatient visits less than 24 hours apart, so that department transfers were not considered separate events.
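As an illustration of this preprocessing step, the sketch below shows one way consecutive visits could be merged with pandas; it is a minimal sketch under assumed column names (patient_id, admit_time, discharge_time), not the study’s actual code.

```python
# Minimal sketch of merging consecutive inpatient visits into single admissions.
# Column names and the helper itself are illustrative assumptions.
import pandas as pd

def merge_visits(visits: pd.DataFrame, gap_hours: int = 24) -> pd.DataFrame:
    """Concatenate visits of the same patient that are less than gap_hours apart."""
    v = visits.sort_values(["patient_id", "admit_time"]).copy()
    prev_discharge = v.groupby("patient_id")["discharge_time"].shift()
    new_admission = (v["admit_time"] - prev_discharge) >= pd.Timedelta(hours=gap_hours)
    new_admission |= prev_discharge.isna()        # the first visit always starts an admission
    v["admission_id"] = new_admission.groupby(v["patient_id"]).cumsum()
    return (v.groupby(["patient_id", "admission_id"])
              .agg(admit_time=("admit_time", "min"),
                   discharge_time=("discharge_time", "max"))
              .reset_index())
```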

ICU admissions not preceded by any other admission within the last 24 hours, outpatients and acute admissions were excluded, as were individuals with disconnected medical record history (either because these patients moved to another country or lacked a stable residence) and minors (age <16 years) (Fig 3).

Fig 3. Attrition diagram: Study profile.


The outcome, clinical deterioration, was defined as unplanned ICU transfer or in-hospital mortality within the so-called prediction window. Unplanned ICU transfer was defined as any acute admission to an ICU within 24 hours after discharge from a non-ICU ward.
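To make the sampling scheme concrete, the following sketch shows how assessment time points and their labels could be generated for a single admission under the definitions above; the function and variable names are hypothetical and not taken from the published code.

```python
# Illustrative sketch: label each assessment point positive if the outcome
# (unplanned ICU transfer or in-hospital death) occurs within the prediction window.
from datetime import datetime, timedelta
from typing import Optional

def make_samples(admit: datetime, end_of_follow_up: datetime,
                 outcome_time: Optional[datetime],
                 assessment_rate_h: int = 6, prediction_window_d: int = 14):
    samples = []
    t = admit + timedelta(hours=assessment_rate_h)
    while t < end_of_follow_up:                  # stop at discharge, ICU transfer or death
        window_end = t + timedelta(days=prediction_window_d)
        label = int(outcome_time is not None and t < outcome_time <= window_end)
        samples.append((t, label))
        t += timedelta(hours=assessment_rate_h)
    return samples
```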

Model development

The model architecture (Fig 2) was designed as a scalable network that can be adapted to different data modalities by using one sub-model per data domain; each sub-model consists of an embedding layer (which transforms the categorical variables into vectors), a recurrent neural network (which learns from the sequence of embedding vectors) and a pooling layer (which reduces the vectors’ dimensionality).

The network uses tokens, i.e. sequences of characters grouped together based on their semantics, as input. We chose the entity embedding approach to exploit the heterogeneity and sparseness of the input data, thus mapping the tokens constructed from the categorical features to embeddings in a Euclidean space (i.e. the embedding space) [21]. For each sub-model, the vocabulary size V (i.e. the number of unique tokens for the specific data source) was directly used to calculate the size of the embedding representation according to EmbeddingSize = 6 · α · V^(1/4), where the embedding coefficient α was estimated by hyperparameter search [16].
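As a small worked example, the heuristic can be written out as follows (a sketch of the formula above; the vocabulary size and α value shown are only illustrative).

```python
# EmbeddingSize = 6 * alpha * V**(1/4), with alpha tuned during the hyperparameter search.
def embedding_size(vocab_size: int, alpha: float = 1.0) -> int:
    return int(round(6 * alpha * vocab_size ** 0.25))

# e.g. for roughly 2,500 diagnosis tokens and alpha = 1 this gives about 42 dimensions
print(embedding_size(2500))  # -> 42
```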

We explored two types of recurrent neural networks, Gated Recurrent Unit (GRU) [22] and Long Short-Term Memory (LSTM) networks [23]. Rather than max or average pooling, which retain neither positional nor intensity information, we used attention-based pooling [24], which employs a weighting mechanism to retain the most relevant parts of a sequence. The attention-based pooling layer was used to aggregate the temporal dimension of the recurrent layer output, which was then concatenated across the different data sources. The final layer was a linear (i.e. dense) layer with a standard logistic activation function, hence mapping the output into the interval [0, 1] to yield valid predicted probabilities. Features without temporal components (age, number of previous hospital admissions and sex) were fed into the model by concatenating them to the output of the pooling layer.
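The following PyTorch sketch condenses the architecture described above into code: one sub-model (embedding, recurrent layer, attention pooling) per data source, concatenation with the static features, and a sigmoid output. Layer sizes, names and implementation details are assumptions for illustration, not the authors’ implementation.

```python
# Condensed sketch of the ensemble architecture: one sub-model per data source,
# attention pooling over time, concatenation with static features, sigmoid output.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, x):                          # x: (batch, time, hidden)
        w = torch.softmax(self.score(x), dim=1)    # attention weights over the time axis
        return (w * x).sum(dim=1)                  # (batch, hidden)

class SubModel(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int, hidden: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)  # index 0 = 'PAD'
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)         # or nn.LSTM
        self.pool = AttentionPooling(hidden)

    def forward(self, tokens):                     # tokens: (batch, time) of token indices
        out, _ = self.rnn(self.emb(tokens))
        return self.pool(out)

class DeteriorationModel(nn.Module):
    def __init__(self, vocab_sizes: dict, emb_dim: int = 32, hidden: int = 64, n_static: int = 3):
        super().__init__()
        self.subs = nn.ModuleDict({name: SubModel(v, emb_dim, hidden)
                                   for name, v in vocab_sizes.items()})
        self.head = nn.Linear(hidden * len(vocab_sizes) + n_static, 1)

    def forward(self, inputs: dict, static):       # static: age, sex, n_previous_admissions
        pooled = [self.subs[name](x) for name, x in inputs.items()]
        z = torch.cat(pooled + [static], dim=1)
        return torch.sigmoid(self.head(z))
```

In this sketch, the choice between GRU and LSTM, like the layer sizes, would be part of the hyperparameter search described below.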

Data sources and processing

The input data comprised medical history, biochemical measurements, clinical notes and demographics (age, sex and number of previous admissions). A schematic of the time spans for the different data modalities is depicted in S1 Fig.

The medical history was extracted from the Danish National Patient Registry (DNPR) for the period between 1977 and 2018 [25]. The DNPR is a nationwide registry that covers essentially all hospital encounters in Denmark. The disease codes follow the Danish adaptations (SKS) of the International Classification of Diseases (ICD): version 8 up to 1993 and ICD-10 from 1994 onwards. Only diagnosis codes recorded prior to the admission date were included in the predictive schemes. No manual curation was carried out on the disease history data, as quality checks are handled internally by the DNPR (positive predictive values for registered ICD codes vary according to time and clinical specialty [25,26]). Summary statistics on comorbidities (Table 1) were calculated from all chapters excluding XIX and XXI (injuries and administrative codes), while all codes were included in model training.

Table 1. Summary statistics for the development and test sets.

Statistics are calculated at the admission level. Counts are given, as well as medians when numerical distributions are provided, in the format Median [25th–75th percentile].

Development Set Test Set
Number of Patients 682041 170579
Number of Admissions 1791705 450144
Age at prediction
 Discharge Median 60 [40–73] Median 60 [40–73]
 ICU Transfer Median 67 [56–75] Median 67 [56–75]
 In-hospital Death Median 78 [68–86] Median 78 [69–86]
Sex:
 Male 798185 (45%) 199732 (44%)
 Female 993520 (55%) 250412 (56%)
Outcome
 Discharge 1752329 (97.8%) 440221 (97.8%)
 ICU Transfer 13129 (0.7%) 3305 (0.7%)
 In-hospital Death 26247 (1.5%) 6618 (1.5%)
Number of previous admissions*
 Discharge Median 1 [0–4] Median 1 [0–4]
 ICU Transfer Median 2 [0–6] Median 2 [0–6]
 In-hospital Death Median 3 [1–7] Median 3 [1–6]
Length of stay before outcome (hours)*:
 Discharge Median 31 [9–98] Median 31 [9–97]
 ICU Transfer Median 35 [8–121] Median 35 [8–132]
 In-hospital Death Median 126 [43–285] Median 122 [43–279]
Type ICU Transfer
 Surgical 2358 (18%) 639 (19.3%)
 Medical 10771 (82%) 2666 (80.7%)
Number of previous diagnoses Median 23 [14–36] Median 23 [14–36]
Number of tokens from laboratory available:
 @ 24 hours Median 19 [0–36] Median 19 [0–35]
 @ 48 hours Median 28 [8–66] Median 29 [8–66]
 @ 72 hours Median 38 [16–66] Median 38 [16–66]
Number of tokens from clinical notes available:
 @ 24 hours Median 215 [113–344] Median 215 [112–345]
 @ 48 hours Median 316 [190–480] Median 316 [189–481]
 @ 72 hours Median 396 [246–593] Median 396 [245–594]
Most common comorbidities (ICD-10 code):
 Abdominal and pelvic pain (R10.0) 563378 (31.44%) 143010 (31.77%)
 Essential hypertension (I10) 470525 (26.26%) 117239 (26.04%)
 Type 2 diabetes (E11) 378979 (21.15%) 92135 (20.47%)
 Mental and behavioral disorders due to use of alcohol (F10) 377978 (21.1%) 93106 (20.68%)
 Chronic ischemic heart disease (I25) 365072 (20.38%) 91999 (20.44%)
Admitting department
 Internal medicine 253743 (14.2%) 64173 (14.3%)
 Emergency medicine 244510 (13.6%) 61219 (13.6%)
 Gynecology and obstetrics 190047 (10.6%) 47597 (10.6%)
 Orthopedic surgery 139427 (7.8%) 34761 (7.7%)
 Cardiology 138989 (7.8%) 34593 (7.7%)
 Surgical gastroenterology 129420 (7.2%) 32487 (7.2%)
 Urology 88387 (4.9%) 21769 (4.8%)
 Miscellaneous 607182 (33.9%) 153545 (34.1%)

* for all admissions having one of the outcomes listed below.

Biochemical values were extracted from the Clinical Laboratory Information System (LABKA) and the Clinical Chemistry Laboratory System (BCC) databases [27] for the period overlapping the hospital admissions (2011–2016). These databases collect all the biochemical tests performed by the Danish hospital laboratories in the Capital Region and Region Zealand, respectively. Before constructing the tokens, the data sets were manually curated as described in Muse et al. [28]: lookup dictionaries were used to standardize components, specimen names and units; dictionaries were also used to identify incomplete, unreadable or failed tests. Moreover, the data were manually curated to identify tests with quantitative and qualitative results and to correct typos and erroneous numerical notation. The biochemistry tokens were constructed using the name of the biochemical component, the specimen (blood, plasma, etc.), the unit and the quantile bin of the measured value (to yield a vocabulary of reasonable size); for example, HEMOGLOBIN_B_mmol/L@8-8.5. The quantile resolution was included as a hyperparameter.
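The token construction can be illustrated with a small sketch that bins a measured value into training-set quantiles and assembles the token string; the bin edges and values below are purely illustrative and do not come from the study data.

```python
# Illustrative sketch of building a biochemistry token from component, specimen,
# unit and the quantile bin of the value. Bin edges would be learned from training data.
import numpy as np

def lab_token(component: str, specimen: str, unit: str,
              value: float, bin_edges: np.ndarray) -> str:
    i = int(np.digitize(value, bin_edges))           # which bin the value falls in
    lo = bin_edges[i - 1] if i > 0 else -np.inf
    hi = bin_edges[i] if i < len(bin_edges) else np.inf
    return f"{component}_{specimen}_{unit}@{lo}-{hi}"

# e.g. illustrative decile edges for blood haemoglobin (mmol/L):
edges = np.array([6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5])
print(lab_token("HEMOGLOBIN", "B", "mmol/L", 8.2, edges))  # -> HEMOGLOBIN_B_mmol/L@8.0-8.5
```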

The clinical notes written by physicians were extracted from the EHR data for the admissions from 2011 to 2016. The free text required some preprocessing before tokenization, such as removing punctuation, names, stop words [29], negations and signatures of the clinicians. Due to the large number of terms coming from the medical notes, the embedding of this specific data source was trained separately on the full corpus using fastText [30], to reduce the number of parameters to update during training.
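A minimal sketch of this step with the fastText Python package is shown below; the file path, model type and hyperparameters are assumptions, not the settings used in the study.

```python
# Sketch of training note embeddings separately with fastText on the full corpus.
import fasttext

# notes_corpus.txt (hypothetical): one preprocessed clinical note per line, with
# punctuation, names, stop words, negations and signatures already removed.
model = fasttext.train_unsupervised("notes_corpus.txt", model="skipgram",
                                    dim=100, minCount=5)
vector = model.get_word_vector("respiratorisk")   # embedding vector for one token
```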

Because we did not force our data into a tabular format, no imputation was needed; only observed values were used. As is typical in NLP approaches, sequences were padded to a specific length (the optimal length for each data source was learned during the hyperparameter search). Sequences longer than the defined padding size were truncated; shorter sequences were extended with the padding token ‘PAD’. To control for very rare tokens (which may be outliers or very rare biochemical tests), a cut-off (“min_freq”) was learned during the hyperparameter search; tokens with a frequency below min_freq were replaced with an “UNK” (unknown) token.
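A compact sketch of this sequence preparation (rare-token replacement plus truncation and padding) could look as follows; the helper and its truncation choice (keeping the most recent tokens) are assumptions for illustration.

```python
# Illustrative sketch (not the study code): tokens below min_freq become 'UNK',
# then the sequence is truncated or padded with 'PAD' to a fixed length.
from collections import Counter

def prepare_sequence(tokens, token_counts: Counter, min_freq: int, pad_size: int):
    seq = [t if token_counts[t] >= min_freq else "UNK" for t in tokens]
    seq = seq[-pad_size:]                           # keep the most recent tokens (one plausible choice)
    return seq + ["PAD"] * (pad_size - len(seq))
```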

Training and evaluation

We randomly split the dataset into a development set (80%) for model creation and an independent holdout test set (20%) for internal validation of the model. The split was done at the patient level, assigning admissions of the same patient to the same set to avoid leaking information between the sets [31]. The development set was further split into a training set (80%, 64% of the total) and a validation set (20%, 16% of the total) to counter overfitting and to calibrate the model before testing it on the holdout set (S8 Fig).
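A patient-level split of this kind can be sketched with scikit-learn’s GroupShuffleSplit, which keeps all admissions of a patient in the same partition; the data frame and column names below are hypothetical.

```python
# Sketch of the 80/20 patient-level split, then 80/20 again within the development set.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(df: pd.DataFrame, test_size: float, seed: int = 42):
    """Split a one-row-per-admission table so all admissions of a patient stay together."""
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    idx_a, idx_b = next(gss.split(df, groups=df["patient_id"]))
    return df.iloc[idx_a], df.iloc[idx_b]

# Tiny illustrative table (hypothetical data):
admissions = pd.DataFrame({"patient_id": [1, 1, 2, 3, 4, 5], "admission_id": range(6)})
development, test = split_by_patient(admissions, test_size=0.20)    # 80 / 20 by patient
train, validation = split_by_patient(development, test_size=0.20)   # 64 / 16 of the total
```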

Three submodels were trained separately to find the best submodel-specific hyperparameters; we used Optuna’s multivariate TPESampler, based on the Tree-structured Parzen Estimator (TPE) algorithm, to search the hyperparameter spaces [32,33]. While searching the hyperparameter space, we fixed the prediction window at 24 hours and the assessment rate at 12 hours to facilitate the comparison across the different data types. S1 Table shows the hyperparameters explored for each submodel; S2, S3 and S4 Figs illustrate the hyperparameter searches. For each experiment, the loss on the validation set was used as the training metric to select the model at the best epoch. The performance metric used to select the best model for each search was the area under the precision-recall curve (AUPRC). Although the area under the receiver operating characteristic curve (AUROC) is a more common metric for classification tasks, in this case it would not suffice to appreciate the real ability of the model to discriminate between the two classes due to their considerable imbalance [34]; the AUPRC, on the other hand, is much more robust to class imbalance. We used 200 bootstrap samples [35] to construct the 95% confidence intervals for the metrics shown in figures and tables.
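The sketch below illustrates such a search with Optuna’s multivariate TPE sampler; the search space shown is a small, illustrative subset of S1 Table, and train_and_validate() is a hypothetical helper that trains a submodel and returns its validation AUPRC.

```python
# Condensed sketch of the hyperparameter search with Optuna's multivariate TPE sampler.
import optuna

def objective(trial: optuna.Trial) -> float:
    params = {
        "rnn_type": trial.suggest_categorical("rnn_type", ["GRU", "LSTM"]),
        "embedding_alpha": trial.suggest_float("embedding_alpha", 0.5, 4.0),
        "padding_size": trial.suggest_int("padding_size", 8, 512, log=True),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
    }
    return train_and_validate(params)   # hypothetical helper: validation AUPRC to maximize

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(multivariate=True))
study.optimize(objective, n_trials=100)
print(study.best_params)
```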

Using the best submodel architectures, we trained and evaluated an ensemble model for different prediction windows (1, 2, 7 and 14 days) and assessment rates (every 6, 12 and 24 hours). Finally, we applied post-hoc isotonic calibration, using the data in the validation set, to align the final predicted risk with the actual outcome incidence [36]. The re-calibration was achieved by fitting an isotonic regression with the output of the uncalibrated model as regressor and the actual sample labels as response variable. We used the validation set to fit the isotonic regression and kept the test set untouched. The fitted isotonic model was then used to adjust the output of the uncalibrated model to obtain calibrated predictions.
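The re-calibration step can be sketched with scikit-learn’s IsotonicRegression; the arrays below are illustrative stand-ins for the validation and test predictions.

```python
# Sketch of post-hoc isotonic re-calibration fitted on the validation set only.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Illustrative stand-ins for uncalibrated predictions and labels:
val_probs  = np.array([0.05, 0.10, 0.20, 0.40, 0.70, 0.90])
val_labels = np.array([0,    0,    0,    1,    0,    1])
test_probs = np.array([0.08, 0.35, 0.80])

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(val_probs, val_labels)                    # fit on validation predictions only
calibrated_test_probs = iso.predict(test_probs)   # adjust the holdout predictions
```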

All the results reported were generated using the re-calibrated model on the test set. To control for biases driven by age or sex, performance was also evaluated at each time of assessment for the different subgroups.
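For reference, a simple bootstrap procedure along these lines is sketched below (200 resamples, 95% interval); the metric functions are taken from scikit-learn and the arrays are illustrative stand-ins for the holdout labels and calibrated predictions.

```python
# Sketch of bootstrapped confidence intervals for AUROC and AUPRC.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def bootstrap_ci(y_true, y_prob, metric, n_boot=200, seed=0):
    """95% bootstrap confidence interval for a ranking metric."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if y_true[idx].min() == y_true[idx].max():        # skip resamples with one class only
            continue
        scores.append(metric(y_true[idx], y_prob[idx]))
    return np.percentile(scores, [2.5, 97.5])

# Illustrative stand-ins for the holdout labels and calibrated predictions:
y_test = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
test_scores = np.array([0.1, 0.2, 0.05, 0.8, 0.3, 0.6, 0.15, 0.25, 0.9, 0.4])
print(bootstrap_ci(y_test, test_scores, roc_auc_score))
print(bootstrap_ci(y_test, test_scores, average_precision_score))
```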

Interpretation

The impact of the different tokens on the model outcome was calculated using the GradientShap algorithm [37] from the Captum library [38]. Given an input feature of a single risk assessment, its Shap value reflects how much (and in which direction) that feature pulls the individual-level prediction away from the population-level mean risk. Importantly, Shap values do not represent the effect of a single feature on the model outcome but rather the effect of that feature in the context of a coalition of features. Shap values were calculated for the best model after isotonic re-calibration.
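A minimal Captum sketch is shown below; the stand-in model and inputs are illustrative, and for token-based models like the one in this study attributions are typically computed at the embedding layer (e.g. with Captum’s LayerGradientShap) rather than on raw token indices.

```python
# Sketch of GradientShap attributions with Captum on a small stand-in model.
import torch
import torch.nn as nn
from captum.attr import GradientShap

model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
inputs = torch.randn(4, 10)                 # a batch of 4 samples with 10 dense features
baselines = torch.zeros(20, 10)             # reference distribution (all-zero baselines)

gradient_shap = GradientShap(model)
attributions = gradient_shap.attribute(inputs, baselines=baselines,
                                       n_samples=50, target=0)
print(attributions.shape)                   # (4, 10): one attribution per input feature
```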

Results

The model was trained on 682,041 unique patients and 1,791,705 admissions and evaluated on 170,579 patients and 450,144 admissions as described in Methods (S3 Table). In both sets, 1.5% of the admissions resulted in in-hospital mortality and 0.7% in ICU transfer. 2,583 tokens from the ICD code data (medical history), 2,421 tokens from the biochemical measurements and 403,869 tokens from the clinical notes were used as input. We investigated the inclusion of each data type separately as well as jointly in the same model.

When we explored how the prediction window and assessment rate affect performance, the most performant model based on the AUPRC was the one trained on all data sources with a prediction window of 14 days and an assessment rate of 6 hours, with an AUROC of 0.904 [0.903–0.904] and an AUPRC of 0.285 [0.283–0.286] (Fig 4, Table 2). After isotonic recalibration, this model was well-calibrated (Fig 4C, S5 Fig), with a calibration slope of 0.964 [0.951–0.98], an intercept of 0.002 [0.001–0.003] and an upper bound risk of 79%. A full list of precision, recall and specificity values for the different models is given in the supplementary material (S2 Table).

Fig 4. Performance of the model for prediction of unplanned ICU transfer or in-hospital death.


Panel A: precision-recall curves show the values of precision and recall at different thresholds for the six models. Panel B: receiver operating characteristic (ROC) curves show the values of the true positive rate (= recall = sensitivity) and false positive rate (= 1 − specificity) at different thresholds for the six models. Model A: age, sex, number of admissions. Model B: model A + medical history data. Model C: model A + biochemical data. Model D: model A + clinical notes. Models E and F: model A + medical history data, biochemical data and clinical notes. Models A–E use a 24-hour prediction window and a 12-hour assessment rate. Model F uses a 14-day prediction window and a 6-hour assessment rate. Panel C: calibration curve (observed fraction of positive samples at different predicted probabilities) for Model F before and after isotonic calibration.

Table 2. Performance on the test set for models using all data types.

Assessment rate  Prediction window  AUROC  AUPRC  Calibration slope  Calibration intercept
6h  1d  0.925 [0.923–0.926]  0.152 [0.149–0.155]  1.042 [1.02–1.064]  -0.001 [-0.001–0.0]
6h  2d  0.922 [0.921–0.923]  0.180 [0.177–0.184]  0.988 [0.966–1.007]  0.002 [0.001–0.003]
6h  7d  0.913 [0.912–0.914]  0.269 [0.267–0.271]  0.996 [0.985–1.008]  0.002 [0.001–0.003]
6h  14d  0.904 [0.903–0.904]  0.285 [0.283–0.286]  0.964 [0.951–0.98]  0.004 [0.003–0.005]
12h  1d  0.918 [0.916–0.920]  0.129 [0.125–0.134]  0.998 [0.96–1.045]  0.001 [0.0–0.002]
12h  2d  0.916 [0.914–0.917]  0.187 [0.183–0.191]  0.996 [0.977–1.021]  0.002 [0.0–0.003]
12h  7d  0.898 [0.897–0.899]  0.233 [0.231–0.236]  1.045 [1.026–1.067]  -0.001 [-0.002–0.0]
12h  14d  0.901 [0.900–0.902]  0.278 [0.275–0.282]  0.952 [0.912–0.997]  0.005 [0.001–0.007]
24h  1d  0.905 [0.901–0.908]  0.135 [0.130–0.140]  1.038 [0.992–1.078]  0.0 [-0.001–0.001]
24h  2d  0.901 [0.899–0.903]  0.163 [0.157–0.169]  1.017 [0.969–1.059]  0.0 [-0.001–0.002]
24h  7d  0.896 [0.894–0.897]  0.232 [0.227–0.236]  1.014 [0.969–1.05]  0.001 [-0.001–0.003]
24h  14d  0.893 [0.892–0.894]  0.250 [0.246–0.254]  0.974 [0.951–1.006]  0.003 [0.001–0.005]

The performances were similar across sexes, with AUPRCs of 0.288 [0.285–0.290] and 0.280 [0.277–0.283] for males and females, respectively. The performances for the different age groups varied more, with a trend of AUPRC increasing with age (age 16–37: 0.168 [0.157–0.180], age 37–58: 0.252 [0.246–0.257], age 58–79: 0.277 [0.274–0.280], age 79–100: 0.310 [0.307–0.313]).

We observed an increased AUPRC for risk estimates further into the hospital stay with a peak at 7 days into the admission (S6 and S7 Figs).

Among the models trained separately using a 24-hour prediction window and a 12-hour assessment rate, the most performant was the one trained on the clinical notes, with an AUPRC of 0.125 [0.120–0.130]; this improved to an AUPRC of 0.129 [0.125–0.134] when the model was trained on all the data using the same prediction window and assessment rate.

The best encoding of disease diagnoses was rolling diagnoses up to the third ICD level (e.g. C341M to C34) and a padding size of 9 (S2 Fig). The optimal quantile resolution for the biochemical data was deciles (i.e. 10 bins) and a padding size of 28 was the optimal number of lab values to include before time of prediction (S3 Fig). The optimal padding size for medical notes was 299, reflecting the larger amount of information usually held by the clinical notes (S4 Fig). The complete list of the optimal parameters for each search can be found in the supplementary S1 Table.

Feature importance

The importance of the tokens from the different vocabularies can be sorted according to their attributed Shap values (Fig 5). Higher age and male sex were associated with elevated risk of deterioration. The opposite was the case for the number of previous admissions: a higher number of previous admissions was associated with lower risk of deterioration (Fig 5A–5C).

Fig 5. Contribution of the different tokens to the outcome of clinical deterioration.


Shap values were estimated using the best model (the model using age, sex, number of admissions, medical history data, biochemical data and clinical notes with a 14-day prediction window and a 6-hour assessment rate). Panels A–C: the distribution of Shap values for each feature without temporal components, with the color scale representing the feature value. Panels D–F: the distribution of Shap values for the top-15 tokens (in the up- and downward directions, respectively). Overlain box-and-whisker plots show medians, quartiles and 1.5 × the interquartile range.

The diagnosis tokens associated with clinical deterioration were often those of acute illnesses such as respiratory failure, neoplasm and pneumonia (Fig 5D). In contrast, diagnosis codes related to pregnancy, chronic conditions and orthopedics were associated with lower risk.

Low levels of albumin, lymphocytes and sodium were associated with elevated risk of clinical deterioration, as were high levels of C-reactive protein, leukocytes, sodium, lactate dehydrogenase, potassium and carbamide (Fig 5E). In contrast, normal levels of leukocytes, hemoglobin, C-reactive protein, sodium and potassium were all associated with low risk of deterioration.

The tokens from clinical notes most strongly associated with clinical deterioration were severely, intensive, respiratory and chronic. Pain, leukocytes, control, normal, home, discharge(d) on the other hand were all associated with low risk (Fig 5F).

Discussion

The main aim of this study was to explore whether data types registered routinely in general departments are predictive of clinical deterioration, ultimately to assess whether these suffice for this task. A solution based on routinely collected data might circumvent a common weakness of EWSs, namely that they depend on data requiring clinical engagement (e.g., vital signs), and thus offer a viable alternative for risk stratification with minimal additional manual data collection effort.

Exploiting the combined power of entity embedding of tokens from electronic health records and the ability of recurrent neural networks to learn temporal patterns from such data, we built a performant (AUROC and AUPRC up to 0.90 and 0.29, respectively) and well-calibrated deep learning model for predicting the risk of clinical deterioration. Specifically, leveraging medical history data (up to 40 years) from a national register along with in-hospital biochemical data and clinical notes, we trained the model dynamically, meaning that the same model can handle different time points for the same admission.

Direct comparison of our model to the performance of NEWS or MEWS is not possible, since the vital signs used to calculate such scores are not collected in our EHR dataset. Comparison of the performance of our model to other work is also non-trivial: AUROC is the most common metric for risk stratification and usually the one used to compare performance across studies. Nevertheless, it is unsuited for imbalanced prediction problems (such as the one defined here) because it disregards the prevalence of the outcome of interest [34]. Although AUPRC does account for prevalence, it is not always reported in the classic studies on EWS scores. Moreover, direct comparison of AUPRCs requires equal (or at least similar) prevalence of the outcome across studies; this is problematic because the prevalence tends to vary between cohorts, and, more importantly, so do the criteria used to define the outcome. For example, Watkinson et al. [39] defined the outcome as the composite of ICU transfer, in-hospital death and cardiac arrest with a prediction window of 24 hours (AUROC of 0.868 [0.864–0.872]). Dziadzko et al. [40] defined the outcome as in-hospital death or respiratory failure, with a prediction window of 48 hours (AUROC 0.87 [0.85–0.88] and 0.90 [0.84–0.95] in 2013 and 2017, respectively). Malycha et al. [41] defined the outcome as in-hospital death and ICU transfer within 24 hours of assessment for patients admitted for longer than 24 hours (AUROC 0.823 [0.819–0.824]). Cho et al. [12] instead defined the outcome as cardiac arrest and unexpected ICU admission occurring within 0.5–25 hours of assessment (AUROC 0.865 [CI not reported]).

The performance differences between models A–E (Fig 4) are driven by the information learnt from the different data types. Performance does not increase linearly with the addition of new data types, suggesting substantial overlap in the latent information of diagnosis codes, biochemical data and clinical notes. This corresponds well to what one would expect: a clinical note may very well cite lab values (likely those of greatest clinical interest) and summarize anamnestic information such as comorbidities also registered in the diagnosis code data.

On the other hand, the change in performance from model E to model F (AUPRC from 0.129 [0.125–0.134] to 0.285 [0.283–0.287]) is likely driven by the different incidence of the outcome when using longer prediction windows, which essentially makes the prediction task easier because there are more positive examples to learn from. Also, it is much more difficult to discriminate between patients having a severe outcome within the subsequent 24 or 48 hours, because drivers of short-term mortality and ICU transfer depend not only on the physiological status of the patient but also on many factors beyond what is captured in clinical data, e.g. the coordination of resources within and between hospital departments. The assessment rates employed are consistent with how often new observations and information are recorded in patient files: new clinical notes and biochemical measurements are usually recorded a couple of times every day and normally at least once. The change in AUPRC and AUROC during the admissions reflects the increasing availability of new records as patients proceed further into the hospitalization (S6 and S7 Figs), and the fact that patients tend to deteriorate either early or late in the admission and less in between.

We decided to keep the outcome definition broad rather than predicting specific severe conditions such as sepsis or organ failure, which are notoriously difficult to operationalize for prediction tasks [42]. In contrast, using a general definition of clinical deterioration allowed us to keep the feature space more generic and less dependent on specific illnesses or patient subgroups. Indeed, our model seeks not to advise on interventions but to flag patients at risk of (more or less imminent) clinical deterioration so health care staff can intervene in a manner appropriate for the patient in question, hopefully translating into improved prognosis for that patient.

An ensemble structure of the network was preferred over a structure in which the different data types contributed to the creation of the same embedding space. Separate submodels (embedding + recurrent linear layer + pooling) allowed us to tune their architectures; optimal padding sizes and embedding coefficients, for example, differ for diagnosis codes and biochemical data. Its ensemble nature also renders the model scalable for incorporating new data types/domains.

To obtain a reasonably well-defined training cohort applicable to as many patients as possible, we included all in-patient admissions to any hospital department, excluding only out-patients and acute admissions to the emergency department. The former group was excluded because they are less likely to experience the outcome, the latter because the rapid course of events and acute physiology recorded during acute admissions would necessitate a different experimental setup.

Interpretation

Although ranking the features that drive the predicted risks up or down is useful as a sanity check of the signals picked up by the model, these estimates are not causal. Indeed, the attribution of each token is determined by its context; therefore, the same token can be associated with both elevated and diminished risk depending on the other tokens with which it co-occurs. For example, a lab value outside the physiological range (e.g., low creatinine), which would normally drive the risk up, may not affect the predicted risk when co-occurring diagnosis codes counter its contribution (e.g., pregnancy codes).

This is more evident for the free text, where the semantics of a word always depends on its context. Overall, the feature attribution does not contain any counterintuitive explanations, and the few seemingly questionable interpretations have plausible explanations. For example, we find low eosinophil counts among the lab values associated with elevated risk of clinical deterioration. Clinically, this seems counterintuitive since low eosinophil values are the norm and high eosinophil counts are indicative of infections (especially parasitic) and allergic disorders. Eosinophils, however, are part of standard panels for blood differential counts (a count of the different types of white blood cells) and, as such, the very presence of (any) eosinophil count probably reflects a clinical suspicion of infection, which may lead to additional analyses to identify the infectious agent. Interpretation of tokens from the clinical notes also provides some examples that at first seem odd but probably do have some contextual bearing. For example, the tokens wife (hustru) and relatives (pårørende) are strongly associated with clinical deterioration. This is likely because doctors document in the patient file when relatives have been informed or consulted, which may well concern poor prognoses or even no-resuscitation orders, in which case clinical deterioration is almost certain to ensue.

Numerical features like the number of previous admissions should also be contextualized. While higher age is indeed associated with clinical deterioration, having many admissions prior to the current one is not. This seemingly paradoxical result can probably be explained by the model’s ability to integrate data from different EHR domains to recognize patients with chronic conditions, who have more frequent hospital visits but are perhaps less likely to suddenly fall critically ill.

Strengths

This study has some important strengths. First, it is one of the largest of its kind, with a total of 852,620 patients and 2,241,849 admissions taking place over 6 years. Second, we provide a dynamic risk assessment tool that can be used as clinical decision support, showing how the patient’s status changes over the course of an admission for different prediction windows while also providing feature interpretation. This supersedes early warning scores, which are limited to a smaller number of features and do not take sequential information into account but merely use a snapshot of the patient’s current status. The clinical impact of this model goes beyond its application as a decision support tool, as it can potentially be used for prognostic enrichment in clinical trials (enrolling patients depending on their risk of experiencing clinical deterioration) and for resource management, to estimate the number of people leaving a hospital department (either because they are transferred to the ICU or die). The real-time implementation of the model in an EHR system is of course dependent on regulatory approvals and on the nature of the EHR. We assume that clinical notes and laboratory values are registered consistently throughout the admission and that all data considered valid at the time of prediction can therefore be used. Third, thanks to the model’s architecture, adding new data sources is relatively easy. Features can be added to and removed from the model, adapting the tool to the available resources. Healthcare staff have a very busy daily schedule, and it is virtually impossible for them to synthesize hundreds or even thousands of data features regularly. We have shown that the model performs best when as much data as possible are included; other data modalities, for example raw image data or socioeconomic data, could likely improve it further. Another advantage of the architecture is that, while a classic EWS depends on complete data to be calculated, in this setup missing data are handled implicitly by the entity embedding design, thus allowing assessments also at the doorstep. Finally, unlike existing models, ours does not focus on a specific patient group or medical specialty, but can instead be applied broadly to all patients.

Limitations

Like any study, this one has limitations. First, although we tried to define the outcome robustly, there are some pitfalls to consider for ICU transfer and in-hospital mortality. Patients in very severe condition may still be discharged if the hospital is unable to provide any further support to them. These patients will probably experience clinical deterioration within the prediction window but not within their hospital stay; hence they are labelled as negative but potentially still flagged by the model as high risk, which could inflate the number of false positives detected by the model. A related limitation is the lack of data on do-not-resuscitate orders; such patients are likely very ill but may not be moved to the ICU, which introduces noise in the data that complicates training and perhaps hampers performance to some extent. Second, unplanned ICU transfer was captured only for patients admitted to one of the ICUs in our catchment area. It cannot be excluded that some patients admitted to a general department of one of the hospitals in our catchment area were transferred to ICUs in other Danish regions than those included here, even if this is uncommon. Third, model performance may improve with additional data such as genetic data, vital signs and other biomarkers. Thanks to the ensemble nature of the model, adding such data in settings where they are available is simple. Fourth, the model was trained on data from the Danish healthcare system and (due to the entity embeddings) would likely need to be retrained if deployed in other geographical regions or healthcare systems. Finally, vital signs were not used as structured input to the model. Although enabling their full use would have been easier had they been available in tabular form, the model did have access to some information on vital signs insofar as they are recorded in clinical notes, which is especially likely when they deviate.

Conclusion

Combining entity embeddings and recurrent neural networks, we built a highly performant model that, every 6 hours during an admission, flags patients at higher risk of developing clinical deterioration in the 14 days after the assessment. The model was developed and evaluated using training and validation data, in addition to holdout EHR test data not used during development. A proper prospective evaluation would be needed to establish whether its deployment produces real-world benefits to patients on hard endpoints. Once clinical utility has been established by trials, the model could both help intervene earlier in patients likely to deteriorate and enrich other clinical trials seeking to identify such early interventions.

Supporting information

S1 Fig. Time coverage of the different datasets.

(TIF)

S2 Fig. Hyperparameter search results for the model trained only on the DNPR data.

(TIF)

S3 Fig. Hyperparameter search results for the model trained only on the biochemical data.

(TIF)

S4 Fig. Hyperparameter search results for the model trained only on the clinical notes data.

(TIF)

S5 Fig. Calibration curves of the best models trained on all the data for all prediction windows (1, 2, 7, 14 days) and assessment rates (6, 12, 24 hours).

(TIF)

S6 Fig. Stratified performances of the best model on AUROC and AUPRC for different age groups.

(TIF)

S7 Fig. Stratified performances of the best model on AUROC and AUPRC for different sex.

(TIF)

S8 Fig. Data split for model development and evaluation.

Patients are first assigned randomly to either development set or holdout test set (80 and 20%). The development set is used for hyperparameter search (80% assigned for training and 20% for internal validation); the holdout test is used for the assessment of models’ performances.

(TIF)

S1 Table. Hyperparameter selection.

We tested different combinations of the following parameters: bidirectional layer, units, layer, padding size for each data source, dropout, regularization, ICD level characters cut-off, biochemical percentile inclusion, top number of biochemical included, biochemical value.

(PDF)

S2 Table. Additional metrics for the 12 models at different prediction intervals [1, 2, 7 and 14 days] and different frequencies of assessment [6, 12 and 24 hours].

The metrics are selected for different risk thresholds [1%, 5%, 10%, 20% and 50%]. Statistics are provided as median and 95% confidence interval: (median, 2.5% percentile, 97.5% percentile).

(PDF)

S3 Table. Number of patients, admissions, and positive cases at different times of prediction.

Given the same prediction window, the statistics for the models at lower frequency of assessment (12 and 24 hours) would be the same as the model presented in this table (e.g., the model with a frequency of assessment of 12 hours and a prediction window of 14 days has the same statistics of the model having the same prediction window and a frequency of assessment of 6 hours).

(PDF)

S4 Table. TRIPOD statement.

(PDF)

Data Availability

The software is available online at https://github.com/daplaci/ClinicalDeteriorationNet. The authors do not have permission to share the data directly; following ethical approval, data can be made available for use in secure, dedicated environments via application to the Danish Regions and the Danish Health Data Authority. Researchers wanting access to the data and to use them for research will be required to meet research credentialing requirements. This study was approved by the Danish Patient Safety Authority (3-3013-1731 and 3-3013-1723), the Danish Data Protection Agency (DT SUND 2016-48, 2016-50, 2017-57 and UCPH 514-0255/18-3000) and the Danish Health Data Authority (FSEID 00003092, FSEID 00003724, FSEID 00004758 and FSEID 00005191).

Funding Statement

We would like to acknowledge the Novo Nordisk Foundation (grants NNF17OC0027594 and NNF14CC0001) and the Danish Innovation Fund (5153-00002B) which supported this study. These foundations contributed to the financing of salaries for SB, DP, RR. BSK-H acknowledges funding from “Grosserer Jakob Ehrenreich og Hustru Grete Ehrenreichs Fond”. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Zimmerman JE, Kramer AA, McNair DS, Malila FM. Acute Physiology and Chronic Health Evaluation (APACHE) IV: Hospital mortality assessment for today’s critically ill patients. Crit Care Med. 2006;34: 1297–1310. doi: 10.1097/01.CCM.0000215112.84523.F0 [DOI] [PubMed] [Google Scholar]
  • 2.Metnitz PGH, Moreno RP, Almeida E, Jordan B, Bauer P, Campos RA, et al. SAPS 3-From evaluation of the patient to evaluation of the intensive care unit. Part 1: Objectives, methods and cohort description. Intensive Care Med. 2005;31: 1336–1344. doi: 10.1007/s00134-005-2762-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Smith MEB, Chiovaro JC, O’Neil M, Kansagara D, Quiñones AR, Freeman M, et al. Early warning system scores for clinical deterioration in hospitalized patients: a systematic review. Ann Am Thorac Soc. 2014;11: 1454–1465. doi: 10.1513/AnnalsATS.201403-102OC [DOI] [PubMed] [Google Scholar]
  • 4.Subbe CP, Kruger M, Rutherford P, Gemmel L. Validation of a modified Early Warning Score in medical admissions. QJM Mon J Assoc Physicians. 2001;94: 521–526. doi: 10.1093/qjmed/94.10.521 [DOI] [PubMed] [Google Scholar]
  • 5.Prytherch DR, Smith GB, Schmidt PE, Featherstone PI. ViEWS—Towards a national early warning score for detecting adult inpatient deterioration. Resuscitation. 2010;81: 932–937. doi: 10.1016/j.resuscitation.2010.04.014 [DOI] [PubMed] [Google Scholar]
  • 6.Jones M. NEWSDIG: The National Early Warning Score Development and Implementation Group. Clin Med. 2012;12: 501. doi: 10.7861/clinmedicine.12-6-501 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Churpek MM, Yuen TC, Edelson DP. Predicting clinical deterioration in the hospital: The impact of outcome selection. Resuscitation. 2013;84: 564–568. doi: 10.1016/j.resuscitation.2012.09.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Downey CL, Tahir W, Randell R, Brown JM, Jayne DG. Strengths and limitations of early warning scores: A systematic review and narrative synthesis. Int J Nurs Stud. 2017;76: 106–119. doi: 10.1016/j.ijnurstu.2017.09.003 [DOI] [PubMed] [Google Scholar]
  • 9.Pedersen NE, Rasmussen LS, Petersen JA, Gerds TA, Østergaard D, Lippert A. A critical assessment of early warning score records in 168,000 patients. J Clin Monit Comput. 2018;32: 109–116. doi: 10.1007/s10877-017-0003-5 [DOI] [PubMed] [Google Scholar]
  • 10.Al-Shwaheen TI, Moghbel M, Hau YW, Ooi CY. Use of learning approaches to predict clinical deterioration in patients based on various variables: a review of the literature. Artif Intell Rev. 2022;55: 1055–1084. doi: 10.1007/s10462-021-09982-2 [DOI] [Google Scholar]
  • 11.Shamout FE, Zhu T, Sharma P, Watkinson PJ, Clifton DA. Deep Interpretable Early Warning System for the Detection of Clinical Deterioration. IEEE J Biomed Health Inform. 2020;24: 437–446. doi: 10.1109/JBHI.2019.2937803 [DOI] [PubMed] [Google Scholar]
  • 12.Cho K-J, Kwon O, Kwon J-M, Lee Y, Park H, Jeon K-H, et al. Detecting Patient Deterioration Using Artificial Intelligence in a Rapid Response System. Crit Care Med. 2020;48: e285–e289. doi: 10.1097/CCM.0000000000004236 [DOI] [PubMed] [Google Scholar]
  • 13.da Silva DB, Schmidt D, da Costa CA, da Rosa Righi R, Eskofier B. DeepSigns: A predictive model based on Deep Learning for the early detection of patient health deterioration. Expert Syst Appl. 2021;165: 113905. doi: 10.1016/j.eswa.2020.113905 [DOI] [Google Scholar]
  • 14.Tóth V, Meytlis M, Barnaby DP, Bock KR, Oppenheim MI, Al-Abed Y, et al. Let Sleeping Patients Lie, avoiding unnecessary overnight vitals monitoring using a clinically based deep-learning model. Npj Digit Med. 2020;3: 1–9. doi: 10.1038/s41746-020-00355-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Thorsen-Meyer HC, Nielsen AB, Nielsen AP, Kaas-Hansen BS, Toft P, Schierbeck J, et al. Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records. Lancet Digit Health. 2020;2: e179–e191. doi: 10.1016/S2589-7500(20)30018-2 [DOI] [PubMed] [Google Scholar]
  • 16.Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. Npj Digit Med. 2018;1. doi: 10.1038/s41746-018-0029-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Tomašev N, Glorot X, Rae JW, Zielinski M, Askham H, Saraiva A, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature. 2019;572: 116–119. doi: 10.1038/s41586-019-1390-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Lauritsen SM, Kristensen M, Olsen MV, Larsen MS, Lauritsen KM, Jørgensen MJ, et al. Explainable artificial intelligence model to predict acute critical illness from electronic health records. 2019 [cited 31 Dec 2019]. http://arxiv.org/abs/1912.01266. [DOI] [PMC free article] [PubMed]
  • 19.Lauritsen SM, Kalør ME, Kongsgaard EL, Lauritsen KM, Jørgensen MJ, Lange J, et al. Early detection of sepsis utilizing deep learning on electronic health record event sequences. Artif Intell Med. 2020;104: 101820. doi: 10.1016/j.artmed.2020.101820 [DOI] [PubMed] [Google Scholar]
  • 20.Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMC Med. 2015;13:1. doi: 10.1186/s12916-014-0241-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Guo C, Berkhahn F. Entity Embeddings of Categorical Variables. 2016.
  • 22.Cho K, van Merrienboer B, Bahdanau D, Bengio Y. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv; 2014. http://arxiv.org/abs/1409.1259.
  • 23.Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997;9: 1735–1780. doi: 10.1162/neco.1997.9.8.1735 [DOI] [PubMed] [Google Scholar]
  • 24.Lin Z, Feng M, dos Santos CN, Yu M, Xiang B, Zhou B, et al. A Structured Self-attentive Sentence Embedding. arXiv; 2017.
  • 25.Schmidt M, Schmidt SAJ, Sandegaard JL, Ehrenstein V, Pedersen L, Sørensen HT. The Danish National Patient Registry: a review of content, data quality, and research potential. Clin Epidemiol. 2015;7: 449–490. doi: 10.2147/CLEP.S91125 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Thygesen SK, Christiansen CF, Christensen S, Lash TL, Sørensen HT. The predictive value of ICD-10 diagnostic coding used to assess Charlson comorbidity index conditions in the population-based Danish National Registry of Patients. BMC Med Res Methodol. 2011;11: 83. doi: 10.1186/1471-2288-11-83 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Grann AF, Erichsen R, Nielsen AG, Frøslev T, Thomsen RW. Existing data sources for clinical epidemiology: The clinical laboratory information system (LABKA) research database at Aarhus University, Denmark. Clin Epidemiol. 2011;3: 133–138. doi: 10.2147/CLEP.S17901 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Muse VP, Aguayo-Orozco A, Balaganeshan SB, Brunak S. Population-wide analysis of laboratory tests to assess seasonal variation and the relevance of temporal reference interval modification. medRxiv; 2022. p. 2022.11.17.22282394. doi: 10.1101/2022.11.17.22282394 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Perkins J. Python Text Processing with NLTK 2.0 Cookbook. 2010.
  • 30.Joulin A, Grave E, Bojanowski P, Mikolov T. Bag of Tricks for Efficient Text Classification. 2016 [cited 29 Sep 2021]. https://github.com/facebookresearch/fastText.
  • 31.Chollet F. Deep Learning with Python. https://www.manning.com/books/deep-learning-with-python
  • 32.Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Anchorage AK USA: ACM; 2019. pp. 2623–2631.
  • 33.Falkner S, Klein A, Hutter F. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. arXiv; 2018.
  • 34.Pinker E. Reporting accuracy of rare event classifiers. Npj Digit Med. 2018;1. doi: 10.1038/s41746-018-0062-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Tomašev N, et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat Protoc. 2021;16. doi: 10.1038/s41596-021-00513-5. Available from: https://www.nature.com/articles/s41596-021-00513-5. [DOI] [PubMed] [Google Scholar]
  • 36.Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. Proceedings of the 22nd international conference on Machine learning—ICML ‘05. Bonn, Germany: ACM Press; 2005. pp. 625–632.
  • 37.Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html.
  • 38.Kokhlikyan N, Miglani V, Martin M, Wang E, Alsallakh B, Reynolds J, et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv; 2020.
  • 39.Watkinson PJ, Pimentel MAF, Clifton DA, Tarassenko L. Manual centile-based early warning scores derived from statistical distributions of observational vital-sign data. Resuscitation. 2018;129: 55–60. doi: 10.1016/j.resuscitation.2018.06.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Dziadzko MA, Novotny PJ, Sloan J, Gajic O, Herasevich V, Mirhaji P, et al. Multicenter derivation and validation of an early warning score for acute respiratory failure or death in the hospital. Crit Care Lond Engl. 2018;22: 286. doi: 10.1186/s13054-018-2194-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Malycha J, Farajidavar N, Pimentel MAF, Redfern O, Clifton DA, Tarassenko L, et al. The effect of fractional inspired oxygen concentration on early warning score performance: A database analysis. Resuscitation. 2019;139: 192–199. doi: 10.1016/j.resuscitation.2019.04.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Joynes E. More challenges around sepsis: definitions and diagnosis. J Thorac Dis. 2016;8: E1467–E1469. doi: 10.21037/jtd.2016.11.10 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLOS Digit Health. doi: 10.1371/journal.pdig.0000116.r001

Decision Letter 0

Martin G Frasch

2 Dec 2022

PDIG-D-22-00253

Development and validation of a dynamic prediction model for unplanned ICU admission and mortality in hospitalized patients

PLOS Digital Health

Dear Dr. Brunak,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.


Please submit your revised manuscript by Jan 31 2023, 11:59 PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Martin G Frasch

Section Editor

PLOS Digital Health

Journal Requirements:

1. Please send a completed 'Competing Interests' statement, including any COIs declared by your co-authors. If you have no competing interests to declare, please state "The authors have declared that no competing interests exist". Otherwise please declare all competing interests beginning with the statement "I have read the journal's policy and the authors of this manuscript have the following competing interests:"

2. Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

a. Please clarify all sources of funding (financial or material support) for your study. List the grants (with grant number) or organizations (with url) that supported your study, including funding received from your institution.

b. State the initials, alongside each funding source, of each author to receive each grant.

c. State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

d. If any authors received a salary from any of your funders, please state which authors and which funders.

3. We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex.

4. We have noticed that you have uploaded Supporting Information files, but you have not included a list of legends. Please add a full list of legends for your Supporting Information files after the references list.

5. Since your data is not available for proprietary reasons, please explain via email why the data is not available. Please also include the contact information for the third party organization that should be contacted should other researchers want to request access to this data and please include the full citation of where the data can be found. We also request that you verify with us via email that any researcher will be able to obtain the data set in the same manner that the you have obtained it. If you feel you are unwilling or unable to adhere to this policy, please explain your reasons by return email and your exemption request will be escalated to the editor for approval. Your exemption request will be handled independently and will not hold up the peer review process, but will need to be resolved should your manuscript be accepted for publication. One of the Editorial team will be in touch if they require more information.

Additional Editor Comments (if provided):


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Overall summary: This retrospective cohort study from Placido and colleagues aimed to develop a deep learning early warning system for clinical deterioration on the hospital wards across multiple hospitals in Denmark. The authors demonstrated high discrimination for a composite outcome of ICU admission or death.

Major Strengths

1. The authors used an extremely large cohort across multiple hospitals, which will help with generalizability and replication.

2. Isotonic recalibration is a nice approach to ensuring that the model does well across multiple performance measures (i.e., beyond just AUROC).
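
As context for the recalibration step mentioned above, the following is a minimal, hypothetical sketch of isotonic recalibration (simulated data, scikit-learn; not the authors' implementation):

    # Hypothetical sketch (simulated data): isotonic recalibration learns a
    # monotone mapping from raw model scores to calibrated probabilities on a
    # held-out calibration split.
    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(0)
    y_cal = rng.integers(0, 2, size=1_000)                          # simulated calibration labels
    raw = np.clip(0.6 * y_cal + rng.normal(0.3, 0.2, 1_000), 0, 1)  # simulated raw model scores

    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(raw, y_cal)              # fit the monotone score-to-probability mapping
    calibrated = iso.predict(raw)    # calibrated risk estimates in [0, 1]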

Major Limitations

1. It’s not entirely clear to me where the novelty lies in this study. The Rajkomar study (ref # 14) has in some ways set the bar for hospital outcome prediction using deep learning from the EHR, and while this study purports to be more of an early warning system, designing predictions for a 14 day window is not much of an early warning. I think the study rationale/objectives should be more clearly articulated and more strongly argued.

2. The authors’ methods are nice from the modeling perspective, but it’s less clear how clinically operationalizable they will be. In particular, tokenizing clinical notes is fraught with uncertainty in terms of implementation and use. At what point in a real-time system would a note be “scanned” or “re-scanned”? What happens to notes written by house officers or APPs which are awaiting attending physician cosignature? What happens to notes which are revised, amended, or deleted?

3. How were missing data handled in this approach?

4. Lines 88-97 don’t make sense to me. The authors argue that vital signs might be unreliable as EWS inputs due to limited recording, but then state that they used EHR data and clinical notes – which are either (a) exactly those same vital signs just discounted or (b) even more sparse than ward vital signs (e.g., once-daily notes). Can the authors reconcile?

5. I note that TRIPOD is mentioned in the Methods, but I do not see a relevant checklist attached in the Supplement.

Minor comments:

1. Figure 1 is not referenced in the manuscript

2. Line 322 seems incomplete: “performances of h is not…” (what is “h”?)

Reviewer #2: Development and validation of a dynamic prediction model for unplanned ICU admission and mortality in hospitalized patients

Placido et al. presented a manuscript in which they developed a dynamic prediction model with a large dataset including almost all hospitalized patients from 12 hospitals in Denmark. The authors used deep learning techniques to create the best model for the prediction of clinical deterioration. In general, the manuscript is well written and the study seems to be well performed. However, there are several concerns that need to be addressed:

Our main concern is the clinical impact of this study for the patient and/or health care provider. While reading this manuscript, the following questions arose: What is the advantage for patients/clinical care when using this model? How should the caregiver use this model in practice? What does this model add to the existing deep learning and conventional ‘epidemiological’ prediction models and literature? The manuscript mainly focuses on explaining the methods/the deep learning approach. We recommend adding the clinical perspective, importance, and integration of the model to the manuscript.

Our second concern relates to the data handling which is not reported. Did someone perform data cleaning on this big dataset? How were outliers handled? The authors mention in their introduction that caregivers make mistakes in manually collecting data. However, part of your data were likely to be collected or reported manually by caregivers. How much data were missing? And how did the authors handle missing data? Please add this to the manuscript.

The other comments are listed below in a consecutive order:

Title:

- The authors only performed internal validation using their own dataset. Therefore, the word ‘validation’ is too broad and not appropriate in this setting. We suggest changing it to ‘internal validation’ or removing it completely. Also, the authors could provide more details on the target population. Please consider adding the country of the study population.

Abstract:

- The abstract has been disproportionately arranged. Information about the statistical/deep learning methods used is lacking, while other parts of the abstract (introduction) are relatively long. Please rearrange the abstract and add more details on the methods and the time points used for dynamic prediction.

- The final sentence of the manuscript (line 44-46) does not match the main scope and overall conclusion of the manuscript. Please rewrite this.

- Please spell out the abbreviations used in the abstract when reported for the first time.

Introduction:

- The introduction should start with a broader introduction of the topic. What is the importance of prediction? What are the limitations of currently used risk scores and existing deep learning models? What could the model add to clinical decision-making, i.e. what are the advantages for the patient and health care professionals? Please explain the rationale for developing the dynamic prediction model using deep learning and add more information on what this model could add.

- The only prediction model mentioned is the EWS. However, there are many more prediction models which are often developed in a department or for a specific patient group. The authors could first write something about prediction in general, and later introduce the EWS score, as an example.

- The modified early warning score (MEWS) is more often used than the EWS nowadays. Please explain why you compare to the EWS.

- Please state the research question and objectives more specifically at the end of the introduction.

- Please mention that this is a model development study with internal validation. Moreover, a hypothesis is missing. Please add this.

- Figure 1: this figure needs clarification. The 2 different examples are now difficult to distinguish. Separation of the 2 examples could be clarifying. Furthermore, the numbers/hours/days of the time points, the prediction window (particularly for ICU transfer), and outcomes used should be clarified in the figure.

Methods:

Major concerns

- In general, the deep-learning analyses seem to be performed appropriately. It would be valuable to incorporate more metrics in the supplemental material, such as the precision and recall loss. Please comment on this suggestion.

- Please define the data used for developing the model more extensively. More information on which medical diseases, biochemical measures, and clinical notes are used in the dynamic model development is essential.

- There is no missing data section in the methods. Please elaborate on the number of missing data, how missing values were handled, and which choices or assumptions have been made. For example:

o Not all biochemical data were available every day and at the specific time points, how did the authors deal with this?

o Which clinical notes were used? Daily ones?

o How much data were missing at the specific time points? And how many patients were included at different time points?

o What did the authors do with patients with a non-ICU policy? These patients are too ill, too old, or have too many comorbidities to be admitted to the ICU and could be more at risk for in-hospital mortality and, theoretically, ICU admission, whereas they are not allowed to be admitted to the ICU. Did the authors take this into account? Please consider a sensitivity analysis.

- Table 1, there are a few comments on this table:

o Please move this table to the results section, as it describes the characteristics of the patients.

o Please provide more information about patient characteristics, particularly clinical data. What are the comorbidities/chronic illnesses of the patients? Which department are they from? Why are they admitted? This is a lot of information that the authors can also decide to add to the supplement.

o Please clarify what the numbers in between the brackets mean.

o What does the 'Number of previous admissions' mean? How is it possible that 3 [1-7] people died in the hospital? Please clarify this.

o Has any data cleaning been performed? Did the authors check for outliers?

- What did the authors do with patients who died after the prediction window, or between two assessment time points? Are they not included? Please specify the numbers of patients that were included in each model (each frequency of assessment and prediction window) in the results.

- Did the data include patients from ALL wards, except the ICU and Emergency Department? Can the authors please specify or add a list with the wards of which patients are included? For example, did the authors also include patients from the ophthalmology and ear-nose-throat departments?

- Please add information on who assessed or reported the outcome measures? Were those researchers blinded?

Minor concerns:

- Line 113: please specify the definition of ‘direct ICU admissions’. Are those patients admitted within hours? Or directly?

- Line 120: this is unclear. It seems that those patients have already been discharged and are then admitted again. Please reformulate or clarify this.

- Line 121: please adjust the title of this paragraph. A suggestion could be ‘model development’.

- Figure 2: please enlarge this figure, the figure is difficult to read.

- Line 189-194: maybe the authors can add a figure to the supplements to visualize this process.

Results:

- Please add data on how many patients were included per prediction window and per time point. This is interesting because a lot of patients are in the hospital for less than 14 days. What happened if a patient died or was discharged outside the prediction window? This is also stated in the TRIPOD guidelines, item 14a: ‘Specify the number of participants and outcome events in each analysis.’

- Calibration is a very important component of the predictive performance of a prediction model. Figure S5, the calibration plot of the best performing model, should be moved to the main paper. In addition, please visualize all other calibration plots in the supplement. Furthermore, please add more textual information about the calibration of the models and their interpretation, as this is now underreported.

- Please explain more why the 6-hour/14-day model is the best model. This is not clear at present, as it does not have the highest AUROC or the best calibration; only the AUPRC, as a measure of discrimination, is highest (an illustrative AUROC-versus-AUPRC sketch follows this list).

- Figure 4DEF: these figures are very interesting and important, and should be highlighted more. Please adjust these, since they are unreadable.
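
As an aside to the point above on discrimination measures, the following is a minimal, hypothetical Python sketch (simulated data, unrelated to the study) showing how AUROC and AUPRC can diverge when the outcome is rare, which is the usual argument for weighing AUPRC alongside AUROC when selecting among models:

    # Hypothetical illustration (simulated data, not study results): with a rare
    # outcome, a model can show a high AUROC while its AUPRC remains modest,
    # because AUPRC is sensitive to the low prevalence of positives.
    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(42)
    n = 100_000
    y = (rng.random(n) < 0.02).astype(int)        # ~2% positives: a rare event
    scores = 0.7 * y + rng.normal(0.0, 0.5, n)    # simulated risk scores

    print("AUROC:", round(roc_auc_score(y, scores), 3))
    print("AUPRC:", round(average_precision_score(y, scores), 3))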

Discussion:

Major concerns

- In the discussion, the authors should focus more on the clinical interpretation. Why is this model better than already existing/developed models and how can we apply the model in daily patient care? How do the authors envision this?

- The paragraph 'interpretation' should be added to the limitations paragraph, as the content of this paragraph contains limitations instead of interpretations.

- Limitations:

o The authors describe that there are many errors in the manual collection of data, but there are also errors in data that have been pulled directly from the system. How did the authors deal with this?

o Vital signs have been left out, but they are very important for mapping a patient's status, assessing clinical deterioration/improvement, and estimating ICU requirement. In addition, they are likely to be measured at all departments. Regardless of whether vital signs are included in the model, it is important to still measure vital signs as a clinical parameter. Please add this to the limitations.

o Add a section regarding missing data.

- Strengths: this model differs from other models, because it does not select a specific patient group with a certain disease or at a specific department. We assume that all patients admitted to all departments are included. Existing models often aim to predict for a certain group of patients or a certain department. This is a strength of the model that could be emphasized.

- Please discuss the potential clinical use of the model and implications for future research in a new paragraph. Consider external validation.

- The conclusion should apply more directly to the study, presenting the best model, results, and implication of this study. Please incorporate this.

Minor concerns:

- Line 316: number 0.28 is incorrectly rounded, please change it to 0.29

- Line 322: there is one word or there are a few words missing in this sentence. What does h mean? Please rewrite.

--------------------

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Iwan van der Horst

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. doi: 10.1371/journal.pdig.0000116.r003

Decision Letter 1

Martin G Frasch, Jessica Keim-Malpass

24 Apr 2023

Development of a dynamic prediction model for unplanned ICU admission and mortality in hospitalized patients

PDIG-D-22-00253R1

Dear Professor Brunak,

We are pleased to inform you that your manuscript 'Development of a dynamic prediction model for unplanned ICU admission and mortality in hospitalized patients' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Jessica Keim-Malpass

Academic Editor

PLOS Digital Health

***********************************************************

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for the thorough responses to my comments and those of the other reviewers. The manuscript is markedly stronger as a result. I think this work represents a nice proof of concept and opens some important conversations regarding how we might operationalize some of these approaches in the real world.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Patrick G Lyons

**********

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Time coverage of the different datasets.

    (TIF)

    S2 Fig. Hyperparameter search results for the model trained only on the DNPR data.

    (TIF)

    S3 Fig. Hyperparameter search results for the model trained only on the biochemical data.

    (TIF)

    S4 Fig. Hyperparameter search results for the model trained only on the clinical notes data.

    (TIF)

    S5 Fig. Calibration curves of the best models trained on all the data for all prediction windows (1, 2, 7, 14 days) and assessment rates (6, 12, 24 hours).

    (TIF)

    S6 Fig. Stratified performances of the best model on AUROC and AUPRC for different age groups.

    (TIF)

    S7 Fig. Stratified performances of the best model on AUROC and AUPRC by sex.

    (TIF)

    S8 Fig. Data split for model development and evaluation.

    Patients are first assigned randomly to either development set or holdout test set (80 and 20%). The development set is used for hyperparameter search (80% assigned for training and 20% for internal validation); the holdout test is used for the assessment of models’ performances.

    (TIF)
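
    As a reading aid for S8 Fig, a minimal, hypothetical sketch of the described patient-level split (placeholder identifiers; not the authors' code):

        # Hypothetical sketch of the split in S8 Fig: patients go 80/20 into
        # development and holdout test sets; the development set is then split
        # 80/20 into training and internal-validation sets for hyperparameter search.
        from sklearn.model_selection import train_test_split

        patient_ids = list(range(852_620))  # placeholder for the cohort's patient identifiers

        dev_ids, test_ids = train_test_split(patient_ids, test_size=0.20, random_state=0)
        train_ids, val_ids = train_test_split(dev_ids, test_size=0.20, random_state=0)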

    S1 Table. Hyperparameter selection.

    We tested different combinations of the following parameters: bidirectional layer, units, layers, padding size for each data source, dropout, regularization, ICD-level character cut-off, biochemical percentile inclusion, number of top biochemical measurements included, and biochemical value.

    (PDF)

    S2 Table. Additional metrics for the 12 models at different prediction intervals [1, 2, 7 and 14 days] and different frequencies of assessment [6, 12 and 24 hours].

    The metrics are selected for different risk thresholds [1%, 5%, 10%, 20% and 50%]. Statistics are provided as median and 95% confidence interval: (median, 2.5% percentile, 97.5% percentile).

    (PDF)
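
    As a reading aid for the reporting format in S2 Table, a minimal, hypothetical sketch of a bootstrap (median, 2.5th percentile, 97.5th percentile) estimate for one metric at one risk threshold (simulated data; not the authors' code):

        # Hypothetical sketch (simulated data): report a metric at a fixed risk
        # threshold as (median, 2.5th percentile, 97.5th percentile) over
        # bootstrap resamples, mirroring the format used in S2 Table.
        import numpy as np
        from sklearn.metrics import precision_score

        rng = np.random.default_rng(1)
        y_true = rng.integers(0, 2, size=5_000)                        # simulated labels
        risk = np.clip(0.4 * y_true + 0.6 * rng.random(5_000), 0, 1)   # simulated predicted risks

        threshold = 0.10                                               # e.g. the 10% risk threshold
        boot = []
        for _ in range(200):                                           # bootstrap replicates
            idx = rng.integers(0, len(y_true), size=len(y_true))
            boot.append(precision_score(y_true[idx], (risk[idx] >= threshold).astype(int)))

        print(np.percentile(boot, [50, 2.5, 97.5]))                    # median and 95% CI bounds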

    S3 Table. Number of patients, admissions, and positive cases at different times of prediction.

    Given the same prediction window, the statistics for the models at lower frequencies of assessment (12 and 24 hours) are the same as for the model presented in this table (e.g., the model with a frequency of assessment of 12 hours and a prediction window of 14 days has the same statistics as the model with the same prediction window and a frequency of assessment of 6 hours).

    (PDF)

    S4 Table. TRIPOD statement.

    (PDF)

    Attachment

    Submitted filename: Response to reviewers.pdf

    Data Availability Statement

    The software is available online at https://github.com/daplaci/ClinicalDeteriorationNet. The authors do not have permission to share the data directly; following ethical approval data can be made available for use in secure, dedicated environments via application to the Danish Regions and the Danish Health Data Authority. Researchers wanting access to the data and to use them for research will be required to meet research credentialing requirements. This study was approved by the Danish Patient Safety Authority (3-3013-1731 and 3–3013–1723), the Danish Data Protection Agency (DT SUND 2016–48, 2016–50, 2017–57 and UCPH 514-0255/18-3000) and the Danish Health Data Authority (FSEID 00003092, FSEID 00003724, FSEID 00004758 and FSEID 00005191).

