Abstract
Machine learning models in healthcare aim to predict critical outcomes but often overlook the impact of existing Early Warning Systems. Using data from King’s College Hospital, we demonstrate how current evaluation methods can lead to paradoxical results. We discuss challenges in developing ML models from retrospective data and propose a novel approach focused on identifying when patients enter a ‘risk state’ through latent health representations, potentially transforming clinical decision-making.
Subject terms: Outcomes research, Scientific data
The advent of machine learning (ML) models in healthcare has ushered in new means to predict critical patient outcomes, such as in-hospital ICU transfer, sepsis or death1,2. We highlight an often-overlooked pitfall in the design and evaluation of these models and the role of existing Hospital Early Warning Systems (EWS) in averting the very outcomes they seek to predict.
Traditional EWS are designed to detect patient deterioration in hospitals as a result of underlying physiological dysfunction affecting routinely measured vital signs3. While valuable, these systems have limitations: they give no indication of the reversibility of a patient’s state, can generate false alarms, and fail to identify some cases of deterioration4. A recent Cochrane Review found low-to-very low certainty evidence on the effectiveness of EWS5.
ML architectures aim to address these limitations by leveraging vast amounts of electronic health record (EHR) data. Existing implementations of ML EWS are predominantly supervised ML models that analyse historical data from patients who experienced adverse outcomes. These models identify patterns and indicators in the EHR data that preceded these events. The learned patterns can then be used to detect signs of deterioration before vital signs are sufficiently perturbed to be detected by an EWS.
However, this approach faces two significant impediments. First, there is a risk of learning bias: if certain patient groups are systematically excluded from specific interventions (e.g. never taken to the ICU), the algorithm may learn and perpetuate this bias6. Second, and perhaps more insidiously, a paradox emerges when EWS successfully stabilise at-risk patients.
Consider cases where prompt traditional EWS activation prevents adverse outcomes like sepsis, death, or ICU admission. In these instances, these successfully treated patients become ‘controls’ in a supervised ML model’s training data. We run into the fundamental problem of causal inference: we can only observe one outcome for each patient, leaving us blind to potential alternate scenarios without EWS-triggered intervention. Consequently, ML models might inadvertently focus on identifying patients who will deteriorate regardless of intervention, rather than those who could benefit from timely care. This runs counter to our intended goals.
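This limitation can be stated in standard potential-outcomes (Neyman–Rubin) notation; the formulation below is generic, not specific to our data:

```latex
% A_i = 1 if patient i receives the EWS-triggered intervention, 0 otherwise.
% Y_i(1), Y_i(0) are the outcomes under each scenario; only one is ever observed.
\begin{align}
  \tau_i &= Y_i(1) - Y_i(0)
    && \text{(individual treatment effect, never directly observable)} \\
  Y_i^{\mathrm{obs}} &= A_i\, Y_i(1) + (1 - A_i)\, Y_i(0)
    && \text{(what the EHR actually records)}
\end{align}
```

A supervised model trained on retrospective EHR data sees only \(Y_i^{\mathrm{obs}}\); when EWS-triggered intervention is effective, the recorded outcomes for flagged patients resemble those of patients who were never at risk.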
Despite this, within the literature, ML EWS are frequently reported as having superior predictive ability compared with traditional EWS. This alleged superiority is demonstrated using retrospective data, but due to the paradox described above, this approach is flawed. To illustrate this paradox, we conducted an experiment using data from King’s College Hospital, London, where we deliberately compared two single features in isolation: haemoglobin and age. We evaluated their individual abilities to predict 24 h mortality. (Full methods, figures, and results, including ROC plots, are available in the supplementary material; Supplementary Fig. 1).
Low haemoglobin, known as anaemia, can indicate time-critical conditions such as haemorrhage for which effective treatments exist. In this intentionally simplified experiment, haemoglobin showed predictive performance for mortality no better than chance. In contrast, age, non-modifiable as far as we know, proved to be a far stronger predictor. Should a clinician at King’s, when called about a patient with anaemia, instead focus on reviewing the oldest patient on the ward? Should a hospital replace their existing EWS with an ML EWS that shows superior predictive ability on retrospective data?
The apparent poor performance of haemoglobin as a predictor demonstrates a crucial point: features and systems linked to effective interventions which succeed in preventing adverse outcomes can appear to have low predictive accuracy in retrospective data. This is because successful interventions alter the outcomes the model aims to predict, weakening the association between predictors and outcomes.
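This retrospective masking can be reproduced in a toy simulation. All numbers below are invented for illustration, not derived from the King’s data: we posit a feature (haemoglobin) that causally drives mortality but reliably triggers an effective intervention, and an unactionable feature (age) that does not. The causally important feature ends up barely predictive in the observed data, while age dominates; removing the intervention restores haemoglobin’s predictive signal.

```python
import random

random.seed(42)
N = 20_000

def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; assumes continuous scores with no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = {i: r + 1 for r, i in enumerate(order)}
    pos = [i for i, y in enumerate(labels) if y]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    rank_sum = sum(rank[i] for i in pos)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def simulate(transfusion_works):
    """Return haemoglobin, age, and observed 24 h death for N synthetic patients."""
    hbs, ages, deaths = [], [], []
    for _ in range(N):
        hb = random.gauss(13.0, 2.0)      # haemoglobin, g/dL (invented)
        age = random.uniform(20.0, 95.0)  # years (invented)
        # Anaemia pathway: low Hb carries a real 30% mortality risk, but the
        # EWS triggers a transfusion that averts 90% of it when it works.
        p_hb = 0.30 if hb < 10.0 else 0.0
        if transfusion_works and hb < 10.0:
            p_hb *= 0.10
        # Age pathway: high risk with no effective intervention to remove it.
        p_age = 0.25 if age > 80.0 else 0.0
        died = random.random() < 0.01 + p_hb + p_age  # plus small baseline risk
        hbs.append(hb); ages.append(age); deaths.append(died)
    return hbs, ages, deaths

hbs, ages, deaths = simulate(transfusion_works=True)
auc_hb = auc([-h for h in hbs], deaths)   # low Hb scored as higher risk
auc_age = auc(ages, deaths)
print(f"with transfusion: AUC(low Hb)={auc_hb:.2f}  AUC(age)={auc_age:.2f}")

hbs, ages, deaths = simulate(transfusion_works=False)
auc_hb_no_rx = auc([-h for h in hbs], deaths)
print(f"no transfusion:   AUC(low Hb)={auc_hb_no_rx:.2f}")
```

In the intervened world the causally decisive feature scores near chance while age scores well, exactly the paradox observed retrospectively; the counterfactual run without transfusion shows the signal the intervention erased.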
This phenomenon, which Lenert et al. referred to as “prediction decay”, is also observed in well-implemented ML models7. Once deployed, they may appear to lose predictive accuracy because the interventions they trigger effectively prevent adverse events. In a utopian scenario, an EWS would show a negative retrospective association with mortality—it would identify only reversible cases of deterioration early enough to avert every preventable death.
To address this challenge and improve model performance and evaluation, one might consider incorporating intervention data. However, this approach, while intuitive, introduces significant complexities. Even in our simple haemoglobin model, including blood transfusion data would lead to confounding by treatment indication and time-dependent confounding. This issue is amplified for an EWS, where various vital signs can be deranged simultaneously and patients often receive several interventions at once. How do we determine which patients were truly at risk? Of the treatments administered, which were necessary, effective and timely?8
Again, we are presented with the fundamental problem of causal inference: it is challenging to determine the true effect of interventions from observational data. This challenge is exemplified by two prospective randomised controlled trials (RCTs) of pop-up alerts for patients with acute kidney injury (AKI)9,10. In both studies, the alerts measurably changed clinical behaviour: increasing usage of the AKI intervention bundle in the first study and boosting rates of discontinuation of potentially nephrotoxic medications in the second. However, neither study demonstrated a benefit in patient outcomes. More concerningly, the first study revealed significantly higher mortality rates in the intervention group at non-teaching hospitals (15.6% vs 8.6%, p = 0.003). These studies highlight that even when models prompt intervention, they may not improve patient outcomes and can potentially lead to harm. Given these challenges and the potential for unintended consequences, the utility of ML models intended for clinical implementation should be rigorously evaluated through RCTs before widespread adoption. This approach is not only feasible, as demonstrated by these AKI studies, but essential to ensure patient safety and clinical efficacy.
How else might we overcome the fundamental impediments in developing EWS using retrospective data and bridge the gap between statistical prediction and causal inference11? The outcomes we currently observe are the sequelae for a subset of patients who entered a ‘risk state’ due to physiological insults such as infections, adverse drug reactions, or haemorrhage. If we identify this ‘risk state’ early, we could facilitate timely intervention to prevent the adverse outcome.
To identify this ‘risk state’, we propose developing a proxy for measuring health: a latent representation inferred from data in EHRs. For instance, Renc et al. create a latent representation of a patient within a transformer model, and can access this hidden representation by asking the model to generate a Sequential Organ Failure Assessment (SOFA) score at any given time for a patient12. By monitoring these latent health states, we can identify periods of stable equilibrium and recognise deviations.
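To make the idea of monitoring a latent state concrete, here is a deliberately minimal sketch, not Renc et al.’s transformer: a two-state hidden Markov model in which the unobserved state (‘stable’ vs ‘at-risk’) is filtered from a sequence of discretised observations. All probabilities are invented for illustration.

```python
STATES = ("stable", "at_risk")
INIT = {"stable": 0.95, "at_risk": 0.05}
TRANS = {  # P(next latent state | current latent state) -- illustrative values
    "stable":  {"stable": 0.95, "at_risk": 0.05},
    "at_risk": {"stable": 0.10, "at_risk": 0.90},
}
EMIT = {   # P(observation | latent state); obs is 'normal' or 'deranged' vitals
    "stable":  {"normal": 0.90, "deranged": 0.10},
    "at_risk": {"normal": 0.40, "deranged": 0.60},
}

def filter_risk(observations):
    """Forward algorithm: P(at-risk | observations so far) after each step."""
    belief = dict(INIT)
    trajectory = []
    for obs in observations:
        # Predict: propagate the belief through the transition model.
        predicted = {
            s: sum(belief[p] * TRANS[p][s] for p in STATES) for s in STATES
        }
        # Update: weight by the emission likelihood and renormalise.
        unnorm = {s: predicted[s] * EMIT[s][obs] for s in STATES}
        z = sum(unnorm.values())
        belief = {s: v / z for s, v in unnorm.items()}
        trajectory.append(belief["at_risk"])
    return trajectory

obs = ["normal", "normal", "deranged", "deranged", "deranged"]
risk = filter_risk(obs)
print([round(r, 2) for r in risk])
```

The filtered probability of the ‘at-risk’ state stays low while observations are normal and climbs as derangement persists, which is the behaviour we want from a latent-state monitor: a continuous, updatable risk estimate rather than a one-off outcome prediction.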
This shift in focus from outcome prediction to ‘risk state’ identification addresses the limitations of outcome-based models. Rather than relying solely on supervised ML, we can use our latent representation model to prompt clinician assessment. Incorporating this approach within an RCT would enhance our understanding of clinical actions and help delineate the subset of patients who can benefit from early intervention.
This method allows us to move beyond mere outcome prediction to understanding the mechanisms behind these outcomes. Crucially, it may help distinguish between reversible and irreversible deterioration, enabling more informed decisions about when aggressive treatment is needed and when it might be futile or harmful.
Successful implementation of this novel approach could transform healthcare by enabling earlier, more precise clinical decisions, thereby improving patient outcomes and resource allocation. By deepening our understanding of how and why outcomes occur, we open new avenues for advancing clinical decision-making and patient care.
Acknowledgements
This work was supported by the DRIVE-Health, King’s College London funded Centre for Doctoral Training (CDT) in Data-Driven Health, and the Dalhousie University Department of Medicine Research Fellowship Award, both awarded to H.L.E. Z.I. is supported by the NIHR Biomedical Research Centres at South London and Maudsley NHS Foundation Trust (SLaM) and University College London Hospitals, King's College London, and Innovate UK (grant 10104845).
Author contributions
H.L.E. conceptualised the article, wrote the initial draft, and led the writing process. E.P., J.T., K.R., M.W., and Z.I. contributed to refining the ideas, reviewing, and editing the manuscript. All authors read and approved the final manuscript.
Competing interests
H.L.E.'s PhD research focuses on developing data-driven latent measures of health status, which could potentially compete with or complement existing or future EWS. K.R. reports grants from Nova Scotia Health Research Fund, personal fees from Ardea Outcomes, the Chinese Medical Association, Wake Forest University Medical School Centre, the University of Nebraska - Omaha, the Australia New Zealand Society of Geriatric Medicine, the Atria Institute, Fraser Health Authority, McMaster University, and EpiPharma Inc, outside the submitted work. In addition, K.R. has licensed the Clinical Frailty Scale (CFS) (a latent measure of health) to Enanta Pharmaceuticals, Inc, Synairgen Research Ltd, Faraday Pharmaceuticals, Inc., KCR S.A., Icosavax, Inc, BioAge Labs Inc, Biotest AG, Qu Biologics Inc, AstraZeneca UK Ltd, Cellcolabs AB, Pfizer Inc, W.L. Gore Associates Inc, pending to Cook Research Incorporated and Rebibus Therapeutics Inc; has licensed the Pictorial Fit-Frail Scale (PFFS) to Congenica; and, as part of Ardea Outcomes Inc, has a pending patent for Electronic Goal Attainment Scaling. Use of both the CFS and PFFS is free for education, research and non-profit health care with completion of a permission agreement stipulating users will not change, charge for or commercialize the scales. For-profit entities (including pharma) pay a licensing fee, 15% of which is retained by the Dalhousie University Office of Commercialization and Innovation Engagement. The remainder of the license fees is donated to the Dalhousie Medical Research Foundation and the QEII Health Sciences Centre Research Foundation. In addition to academic and hospital appointments, K.R. is co-founder of Ardea Outcomes (DGI Clinical until 2021), which in the past 3 years has had contracts with pharma and device manufacturers (INmune, Novartis, Takeda) on individualized outcome measurement. J.T.T. has received research grant funding from the National Institute for Health Research (NIHR), Health Data Research UK (HDR UK), Innovate UK, the Office of Life Sciences, the Epilepsy Research Institute, the British Heart Foundation, the Responsible AI Adoption Unit, OneLondon Secure Data Environment, King's Health Partners and the Engineering & Physical Sciences Research Council (EPSRC). J.T.T. has also received research equipment support from Nvidia, Elastic and Scan Computing. J.T.T. is director and shareholder of CogStack Ltd. None of the funders had any say on the content of this work.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41746-024-01408-x.
References
- 1. Malycha, J., Bacchi, S. & Redfern, O. Artificial intelligence and clinical deterioration. Curr. Opin. Crit. Care 28, 315–321 (2022).
- 2. Boussina, A. et al. Impact of a deep learning sepsis prediction model on quality of care and survival. npj Digit. Med. 7, 1–9 (2024).
- 3. Williams, B. The National Early Warning Score: from concept to NHS implementation. Clin. Med. 22, 499–505 (2022).
- 4. Pimentel, M. A. F. et al. Detecting Deteriorating Patients in the Hospital: Development and Validation of a Novel Scoring System. Am. J. Respir. Crit. Care Med. 204, 44–52 (2021).
- 5. McGaughey, J., Fergusson, D. A., Bogaert, P. V. & Rose, L. Early warning systems and rapid response systems for the prevention of patient deterioration on acute adult hospital wards. Cochrane Database Syst. Rev. 10.1002/14651858.CD005529.pub3 (2021).
- 6. Harris, S. I Don’t Want My Algorithm to Die in a Paper: Detecting Deteriorating Patients Early. Am. J. Respir. Crit. Care Med. 204, 4–5 (2021).
- 7. Lenert, M. C., Matheny, M. E. & Walsh, C. G. Prognostic models will be victims of their own success, unless. J. Am. Med. Inform. Assoc. 26, 1645–1650 (2019).
- 8. Doust, J. & Mar, C. D. Why do doctors use treatments that do not work? BMJ 328, 474–475 (2004).
- 9. Wilson, F. P. et al. Electronic health record alerts for acute kidney injury: multicenter, randomized clinical trial. BMJ 372, m4786 (2021).
- 10. Wilson, F. P. et al. A randomized clinical trial assessing the effect of automated medication-targeted alerts on acute kidney injury outcomes. Nat. Commun. 14, 2826 (2023).
- 11. Prosperi, M. et al. Causal inference and counterfactual prediction in machine learning for actionable healthcare. Nat. Mach. Intell. 2, 369–375 (2020).
- 12. Renc, P. et al. Zero shot health trajectory prediction using transformer. npj Digit. Med. 7, 1–10 (2024).