Abstract
Background
Clinical deterioration is often preceded by subtle physiological changes that, if unheeded, can lead to adverse patient outcomes. The precision of traditional scoring systems in detecting these precursors has limitations, prompting the exploration of AI-based predictive models as a means to enhance predictive accuracy and, consequently, patient outcomes.
Methods
A systematic review and meta-analysis were conducted in accordance with PRISMA guidelines. The PubMed and Web of Science databases were searched for relevant studies up to April 8, 2024. Studies were selected against predefined criteria, specifically targeting AI-based models designed to predict in-hospital clinical deterioration.
Results
A total of five studies met the inclusion criteria, all of which underwent prospective clinical validation. These studies demonstrated that AI-based models significantly reduced in-hospital and 30-day mortality rates. Although a downward trend in ICU transfers was observed, the results were not statistically significant. Additionally, the use of AI models shortened overall hospital stays but resulted in a significant increase in ICU length of stay.
Conclusion
The findings suggest that AI-based early warning models positively impact patient outcomes in real-world clinical settings. Despite the potential benefits, the effectiveness and real-world applicability of these models require further research. Challenges such as clinician adherence to AI warnings remain to be addressed.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12911-025-03048-x.
Keywords: Clinical deterioration, Mortality, Artificial intelligence, Patient outcomes
Introduction
Physiological indicators often change before clinical deterioration occurs [1], and failure to identify these changes in time can result in poor prognosis [2]. Artificial intelligence (AI) is attracting increasing attention as a means of predicting clinical deterioration among hospitalized patients. However, the effectiveness of AI-based predictive models and their ability to improve patient prognosis remain undetermined.
To date, efforts to detect clinical deterioration early in hospitalized patients have relied on manually calculated scores, which are slow, labor-intensive, and costly. Conventional scoring systems such as the Early Warning Score (EWS), the National Early Warning Score (NEWS), APACHE II, and APACHE III commonly employ parameters such as the Glasgow Coma Scale (GCS), age, and vital signs, evaluated at admission, on changes in medical condition, and at other specific time points. Empirical evidence shows that these routine scoring systems can identify deterioration of a patient's condition to some extent [3, 4]. However, because of their limited data dimensions and their overgeneralization of clinical scenarios, conventional scoring systems may overlook individual differences and the complexity of clinical situations, and therefore have inherent limitations.
There is now growing interest in using AI to aid the early identification of clinically deteriorating patients. Machine learning algorithms such as logistic regression, neural networks, and support vector machines have proven effective in several retrospective, database-driven studies [5–7], demonstrating high accuracy, specificity, and sensitivity in predicting hospitalization duration and 7-day/28-day mortality. In principle, AI-based models are more adaptable than traditional scoring systems and can provide more accurate predictions. They enable real-time continuous monitoring, allowing timely detection of changes in a patient's condition. As a result, AI-based models can identify clinical deterioration more accurately, enabling timely interventions aimed at improving patient outcomes, while their higher sensitivity and specificity can reduce false alarms and support a more rational allocation of limited medical resources. Accurately assessing the performance of AI-based models, and how it differs from that of traditional scoring systems, is therefore a crucial step in confirming their practicality.
A recent meta-analysis evaluated the performance of AI-based clinical models in predicting critical life events in non-ICU adult patients and found that AI-based predictive models were overall more effective in predicting patient deterioration [8]. However, previous research has primarily focused on retrospective validation of developed models rather than prospective clinical validation. In recent years, as machine learning and AI research has matured, more prospective clinical studies have been published, yet whether AI can play a positive role in managing the clinical deterioration of hospitalized patients has not been established. The objective of this systematic review is to evaluate the effectiveness of clinically validated AI models in predicting in-hospital clinical deterioration and to assess their clinical significance in improving patient outcomes.
Methods
Literature search
This systematic review complied with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. A comprehensive systematic literature search was performed in the PubMed and Web of Science databases on April 8, 2024, using the terms artificial intelligence, machine learning, and deterioration. The complete search strategy can be found in Supplementary File 1. The reference lists of eligible studies were reviewed manually to identify any potentially omitted studies. This study was registered in the International Prospective Register of Systematic Reviews (registration number CRD42024556102). Title/abstract and full-text screening were performed independently by two senior authors (S.-X. Yuan and C.-D. Wu), and any disagreements were resolved by a third researcher (S.-Q. Liu). During full-text screening, the reason for exclusion of each article was recorded. Relevant data were extracted by the authors (S.-X. Yuan, C.-D. Wu, and S.-Q. Liu).
Eligibility criteria and study selection
Clinical deterioration was defined as any type of mortality, unplanned ICU transfer, prolonged length of stay (LOS) in hospital or ICU, rapid response team (RRT) activation, or respiratory or cardiac arrest. AI-based models were defined as models created using machine learning, deep learning, or other self-learning technologies. We included studies focusing on non-ICU hospitalized adult patients (aged ≥ 18 years) across various medical conditions. Studies involving obstetric patients (pregnant or postpartum women) were excluded, as this population may have unique clinical trajectories and outcomes that differ from those of general adult patients. The intervention of interest was the use of AI-based models for predicting clinical outcomes in hospitalized patients. The comparator was the absence of AI-based model use in clinical decision-making, such as a traditional early warning score or clinician judgment. The primary outcomes of interest were: (a) mortality: all-cause in-hospital mortality; (b) ICU transfer: rate of transfer to the intensive care unit (ICU) during hospitalization; (c) length of stay (LOS): total hospital stay duration, including time spent in the ICU (if applicable); and (d) rapid response team (RRT) response: activation of RRT interventions to prevent clinical deterioration or adverse outcomes.
Exclusion criteria included studies conducted on healthy volunteers or non-hospitalized individuals. Studies focusing on COVID-19 were excluded to preserve the generalizability of the analysis, since COVID-19 is a single known disease that can itself lead to clinical deterioration.
Quality of evidence and risk of bias
The risk of bias in nonrandomized intervention studies was assessed using the ROBINS-I tool, which evaluates the risk of bias in results from nonrandomized studies comparing the health effects of two or more interventions [9].
Data synthesis and analysis
All statistical analyses were performed using RevMan 5.3 software. Effect sizes were calculated for each study, utilizing odds ratios for categorical outcomes and mean differences for continuous outcomes, along with their respective 95% confidence intervals. For continuous variables, median and interquartile range (IQR) values were transformed to mean and standard deviation (SD), which were then pooled for analysis [10, 11]. The index of inconsistency (I²) was employed as a measure to assess the degree of heterogeneity among the studies. The I² values of 25%, 50%, and 75% correspond to mild, moderate, and severe levels of heterogeneity, respectively [12]. Given that the I² value exceeded 50%, indicating significant heterogeneity, a random-effects model was employed for the analysis.
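The two computations described above can be sketched in code. The following is an illustrative Python implementation (the function names are ours, not from any included study) of the large-sample approximations of Wan et al. [11] for median/IQR data (mean ≈ (q1 + median + q3)/3, SD ≈ IQR/1.35) and of Higgins' I² derived from Cochran's Q with inverse-variance weights [12]:

```python
def median_iqr_to_mean_sd(q1, median, q3):
    """Approximate mean and SD from a median and IQR using the
    large-sample formulas of Wan et al. (2014), as applied when
    pooling continuous outcomes reported as median (IQR)."""
    mean = (q1 + median + q3) / 3.0
    sd = (q3 - q1) / 1.35
    return mean, sd


def i_squared(effects, variances):
    """Higgins I^2 (%) from per-study effect sizes and variances:
    compute Cochran's Q under inverse-variance weighting, then
    I^2 = max(0, (Q - df) / Q) * 100."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0


# Example: a hospital LOS reported as median 3.74 d (IQR 1.84-7.26)
mean, sd = median_iqr_to_mean_sd(1.84, 3.74, 7.26)
```

An I² above 50% under this scheme would, as stated above, trigger the switch to a random-effects model.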
Results
Search results
After reference checking for additional studies, abstracts of 3787 articles were screened for eligibility. The detailed literature screening process is shown in Fig. 1. A total of 6238 records were excluded during the screening phase, primarily because the majority of studies on machine learning and prognostic prediction focused on model development and validation using datasets, rather than real-world clinical applications. These studies did not meet our inclusion criteria, which specifically required clinically validated models tested in real-world inpatient settings.
Fig. 1.
Preferred reporting items for systematic reviews and meta-analyses (PRISMA) flow chart
To ensure comprehensive coverage of the literature, we adopted a broad search strategy and relied on thorough manual screening to identify eligible studies. This approach allowed us to capture all potentially relevant articles while minimizing the risk of excluding studies that met our criterion of real-world clinical validation.
Finally, a total of 5 studies were included in our systematic review [6, 13–16]. We screened literature from the past ten years (2013 onward); of the included studies, 4 (80%) were published in 2020 or later.
Study characteristics
Table 1 presents an overview of the included studies. Three studies were single-center [6, 13, 16], while two were conducted in 4 and 19 centers, respectively [14, 15]. Levin et al. implemented random forest (RF) in their model [13], while the others used logistic regression (LR) as the model development algorithm. More detailed information is presented in Supplementary File 2.
Table 1.
Characteristics of included studies
| First author [ref] | Year | Model name | Algorithm | Study design | Recruitment dates | Study setting | Country | Centers | Participants | N (No-MLM) | N (MLM) | Primary outcome | Outcome | MLM group | No-MLM group | Difference (95% CI) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Levin et al. [13] | 2024 | Alert | Random forest | Nonrandomized clustered pragmatic clinical trial | July 2019 – March 2020 | Medical-surgical unit | United States | 1 | 2740 | 1252 | 1488 | Rate of escalation | In-hospital mortality | 81 (5.4%) | 83 (6.6%) | -1.2% (-3.1 to 0.7%) |
| | | | | | | | | | | | | | ICU transfer | 115 (7.7%) | 100 (8.0%) | -0.3% (-2.4 to 1.8%) |
| | | | | | | | | | | | | | Hospital LOS, d | 5.78 (3.26–10.7) | 6.04 (3.49–12.0) | -0.31 (-1.00 to 0.00) |
| Bassin et al. [6] | 2023 | DI plus the existing EWS (Between the Flags, BTF) | Logistic regression | Single-hospital pre-post study | January 2019 – April 2021 | All general wards | Australia | 1 | 28,639 | 14,099 | 14,540 | Major adverse event (MAE) | In-hospital mortality | 133/14,540 | 151/14,099 | 0.854 (0.673–1.084) |
| | | | | | | | | | | | | | ICU transfer | 105/14,738 | 132/14,352 | 0.775 (0.595–1.008) |
| | | | | | | | | | | | | | Hospital LOS, d | 3.74 (1.84–7.26; n = 14,540) | 3.86 (1.86–7.86; n = 14,099) | P = 0.002 |
| Winslow et al. [14] | 2022 | eCART | Logistic regression | Multicenter clinical intervention trial | July 1, 2016 – April 30, 2018 | Medical-surgical ward | United States | 4 | 6681 | 3490 | 3191 | All-cause in-hospital mortality | In-hospital mortality | 281 (8.8%) | 485 (13.9%) | p < 0.01 |
| | | | | | | | | | | | | | ICU transfer | 676 (21.2%) | 429 (12.3%) | p < 0.001 |
| | | | | | | | | | | | | | Hospital LOS, d | NI | NI | NI |
| Escobar et al. [15] | 2020 | AAM | Logistic regression | Large multicenter cohort | August 1, 2016 – February 28, 2019 | General medical-surgical ward | United States | 19 | 37,071 | 23,797 | 13,274 | Mortality within 30 days after an AAM alert | In-hospital mortality | 1301 (9.8%) | 3420 (14.4%) | NI |
| | | | | | | | | | | | | | ICU transfer | 2349 (17.7%) | 4974 (20.9%) | NI |
| | | | | | | | | | | | | | Hospital LOS, d | 6.5 (8.0) | 7.2 (10.1) | NI |
| Bailey et al. [16] | 2013 | NI | Logistic regression | Randomized controlled crossover study | July 2007 – December 2011 | General medical ward | United States | 1 | 19,761 | 10,120 | 9911 | ICU transfer | In-hospital mortality | 223 | 227 | NI |
| | | | | | | | | | | | | | ICU transfer | 444 | 426 | NI |
| | | | | | | | | | | | | | Hospital LOS, d | 3.90 (3.55) | 3.82 (3.58) | NI |

MLM, machine learning model; LOS, length of stay; NI, no information reported
Risk of bias assessment
Using the ROBINS-I tool, 2 studies were appraised as having a moderate level of bias because of the selection of participants, deviations from intended interventions, and missing data [6, 16]. Overall, the quality of the included literature is relatively high, with a low risk of bias (Fig. 2).
Fig. 2.
Risk of bias summary: review authors' evaluation of risk of bias for each included study. Note: a yellow circle indicates moderate risk of bias; a green circle indicates low risk of bias
Outcome measure analysis
Mortality
The forest plot in Fig. 3 presents in-hospital mortality for the five studies. It indicates that with the ML-based clinical deterioration warning model in use, the in-hospital mortality rate decreased significantly (odds ratio [OR], 0.69; 95% confidence interval [CI], 0.60–0.79). Notably, Escobar et al. and Levin et al. also reported decreased 30-day mortality (Figure S1), suggesting that such ML-based models may also improve the longer-term prognosis of inpatients [13, 15].
Fig. 3.
Forest plot of mortality
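For categorical outcomes such as mortality, each study's odds ratio and 95% CI are derived from 2×2 event counts via the standard log-OR normal approximation before pooling. The sketch below is illustrative (the helper name is ours, not from the analysis code) and uses the Winslow et al. mortality counts from Table 1 as a worked example:

```python
import math


def odds_ratio_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Odds ratio and 95% CI from 2x2 counts, using the normal
    approximation on the log-OR scale: SE = sqrt(1/a + 1/b + 1/c + 1/d)."""
    a, b = events_t, n_t - events_t  # intervention (MLM) group
    c, d = events_c, n_c - events_c  # control (No-MLM) group
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi


# Winslow et al. in-hospital mortality (Table 1):
# 281/3191 deaths in the MLM group vs 485/3490 in the No-MLM group
or_, lo, hi = odds_ratio_ci(281, 3191, 485, 3490)
```

With these counts the odds ratio comes out just under 0.6 with an upper confidence bound below 1, consistent with the significant mortality reduction that study reported.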
ICU transfer
Four articles reported ICU transfer results. Compared with the group without the AI-based model, patients in the AI-based group experienced fewer ICU transfers (odds ratio [OR], 0.90; 95% confidence interval [CI], 0.76–1.07), a downward trend in ICU transfer rates that did not reach statistical significance (Fig. 4) [6, 13, 15, 16].
Fig. 4.
Forest plot of ICU transfer
LOS
Figure 5 illustrates the shorter hospital length of stay after applying the ML-based deterioration warning model [6, 13, 15, 16]. The overall mean difference is -0.35 days with a 95% confidence interval of [-0.68, -0.01], indicating that the MLM group had a shorter LOS than the No-MLM group. The test for the overall effect is statistically significant (Z = 2.01, P = 0.04), indicating that the reduction in LOS in the MLM group is unlikely to be due to chance. In addition, 2 studies also reported ICU length of stay, which showed the opposite effect, a significant increase [6, 14] (Figure S2).
Fig. 5.
Forest plot of length of stay in hospital
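The random-effects pooling of per-study mean differences can be sketched with the DerSimonian-Laird estimator, the random-effects method implemented in RevMan; the function name and the synthetic inputs below are illustrative only, not the actual study data:

```python
import math


def dl_random_effects(effects, variances, z=1.96):
    """DerSimonian-Laird random-effects pooling: estimate the
    between-study variance tau^2 from Cochran's Q, re-weight each
    study by 1/(v_i + tau^2), and return the pooled effect, its
    95% CI bounds, and the Z statistic."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sw
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, pooled - z * se, pooled + z * se, abs(pooled) / se


# Synthetic example: two studies each reporting a modest LOS reduction
pooled, lo, hi, z_stat = dl_random_effects([-0.3, -0.4], [0.04, 0.04])
```

When Q does not exceed its degrees of freedom, tau² is truncated to zero and the estimate collapses to the fixed-effect (inverse-variance) result.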
RRT response
Two studies reported the frequency of RRT activation (Figure S3). The overall mean difference is -0.35 with a 95% confidence interval of [-0.68, -0.01], suggesting that the MLM group experienced fewer RRT activations than the No-MLM group [6, 14]. The overall effect test is statistically significant (Z = 2.01, P = 0.04), implying that the observed reduction in RRT activations in the MLM group is unlikely to be due to chance.
Discussion
Our review demonstrates that under real-world clinical validation conditions, the use of AI has a positive impact on improving patient prognosis. After deploying the AI model, in-hospital mortality and 30-day mortality were significantly reduced. The number of ICU transfers showed a decreasing trend, though the result was not statistically significant. While the overall hospital stay decreased, the length of ICU stay increased significantly. These findings highlight the multifaceted impact of AI in clinical settings, revealing both its benefits and areas requiring further scrutiny.
The reduction in mortality rates aligns with previous studies suggesting that AI-based models can enhance clinical decision-making and facilitate early intervention [6, 13–15]. However, the observed increase in ICU length of stay raises important questions about the broader implications of AI adoption. One possible explanation is that AI systems may identify high-risk patients more accurately, leading to earlier ICU admissions and prolonged care for individuals who might have been missed by traditional methods. Alternatively, this trend could reflect over-reliance on AI-generated alerts, potentially resulting in unnecessary ICU admissions. This underscores the need for a balanced approach to AI implementation, ensuring that it complements rather than replaces clinical judgment.
The clinical impact of AI must be considered within the broader context of healthcare systems and patient populations. For healthcare staff, the adoption of AI presents both opportunities and challenges. While AI can reduce workload by automating risk assessments, frequent alerts may lead to alarm fatigue and reduced trust in the system. This could undermine the effectiveness of AI tools and necessitate additional training to ensure clinicians understand and appropriately respond to AI-generated warnings. For patients, the potential benefits of improved prognosis must be weighed against the risks of overdiagnosis or unnecessary interventions. At the community level, the widespread adoption of AI could contribute to more efficient resource allocation, but it also raises ethical and logistical questions about equitable access to AI-enhanced care. Moreover, the issue of AI interpretability remains a critical challenge. While the observed clinical benefits, such as reduced mortality rates, are promising, the inability to fully understand the reasoning behind AI-generated predictions poses significant barriers to its adoption. Clinicians may hesitate to rely on AI recommendations if they cannot interpret the underlying logic, potentially limiting the technology’s utility. Addressing this issue requires the development of more interpretable AI models, as well as standardized frameworks for explaining AI-generated outputs to clinicians. This will not only enhance trust in AI systems but also facilitate their seamless integration into clinical workflows.
Our findings also highlight several gaps in the existing literature. While past research has shown that AI-based prediction models perform well in internal and external validation settings [8], the studies included in the systematic review by Veldhuis et al. were retrospective, and real-world clinical validation remains limited. For instance, Bailey et al.'s 2013 study on clinical validation of an AI-based model found no improvement in clinical deterioration [16], while Evans et al.'s 2016 study demonstrated reduced mortality and length of stay but was limited by a small sample size of only 175 patients [17]. Most studies to date have been single-center or non-randomized trials, which limits the generalizability of their findings [15]. Additionally, there is a lack of research exploring the long-term impact of AI on clinical workflows and patient outcomes. Future studies should prioritize multi-center randomized controlled trials (RCTs) to provide more robust evidence, as well as qualitative research to understand the human factors influencing AI adoption.
Based on these findings, we recommend several strategies for optimizing the clinical use of AI. First, healthcare systems should invest in clinician education and training to enhance understanding of AI tools and mitigate alarm fatigue. Second, AI models should be designed with user feedback in mind, ensuring that alerts are actionable and contextually relevant. Finally, further research is needed to evaluate the cost-effectiveness of AI implementation and its impact on healthcare disparities. These steps will help ensure that AI is integrated into healthcare systems in a way that maximizes its benefits while addressing its potential drawbacks.
Strengths and limitations
The studies included in our review all underwent prospective clinical validation, evaluating the effectiveness and performance of AI-based models in the real world. To our knowledge, this review is the first systematic review and meta-analysis to comprehensively analyze the clinical significance of clinically validated inpatient prediction models. We found that AI-based models can effectively improve patient outcomes.
Meanwhile, our review also has some limitations. First, owing to the scarcity of relevant publications in the current literature, few studies met the inclusion criteria. Moreover, most of the included articles used logistic regression, which is considered a traditional AI model; only one article used a more modern algorithm, random forest, for modeling.
Another notable limitation is the variability in patient populations across the included studies. The types and severity of illnesses differed significantly among individual studies, which could independently influence mortality outcomes unrelated to AI interventions. For instance, some studies may have included patients with more advanced or complex conditions, while others focused on less severe cases. This heterogeneity in patient characteristics introduces potential confounding factors that may impact the generalizability of our findings.
However, we mitigated this issue to some extent by ensuring that all included studies were conducted in general medical and surgical wards without selective inclusion of specific diseases or conditions. This approach helps to align the clinical settings and care practices across studies, providing a more comparable baseline for evaluating AI-driven interventions. Nevertheless, the inherent variability in disease severity and patient demographics across studies underscores the need for cautious interpretation of the results. Future research should aim to standardize patient characteristics or stratify analyses by disease severity to reduce this heterogeneity and enhance the robustness of findings.
In conclusion, our study provides evidence that AI-powered early warning systems have the potential to significantly improve patient outcomes in real-world clinical settings. The integration of AI with existing monitoring systems has demonstrated promising results, highlighting the value of these advanced tools in the timely detection of clinical deterioration. However, while these findings are encouraging, it is important to recognize that the effectiveness and real-world applicability of AI-based models are still areas that require ongoing research and evaluation.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Acknowledgements
The authors gratefully acknowledge the Department of Critical Care Medicine at Zhongda Hospital, Affiliated to Southeast University, for their invaluable technical assistance and for providing access to their research facilities. Their support was instrumental in the successful completion of this study.
Author contributions
Shixin Yuan: Designed the study, developed the search strategy, screened and selected studies, conducted the data analysis, and drafted the manuscript. Junjie Li: Assisted in study design, participated in data extraction and quality assessment, contributed to the data analysis, and helped draft the manuscript. Zihuan Yang: Participated in the selection of studies, assisted with data extraction, and provided critical input on the manuscript draft. Changde Wu: Contributed to the data extraction process, helped with the initial manuscript draft, and reviewed the final manuscript. Songqiao Liu: Provided oversight for the study design, assisted in the interpretation of results, and contributed to the revision of the manuscript.
Funding
Not applicable.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
All authors have agreed to the publication of this manuscript.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Changde Wu, Email: njwu163@163.com.
Songqiao Liu, Email: liusongqiao@ymail.com.
References
- 1. Andersen LW, Kim WY, Chase M, et al. The prevalence and significance of abnormal vital signs prior to in-hospital cardiac arrest. Resuscitation. 2016;98:112–7.
- 2. Barwise A, Thongprayoon C, Gajic O, et al. Delayed rapid response team activation is associated with increased hospital mortality, morbidity, and length of stay in a tertiary care institution. Crit Care Med. 2016;44(1):54–63.
- 3. Devita MA, Bellomo R, Hillman K, et al. Findings of the first consensus conference on medical emergency teams. Crit Care Med. 2006;34(9):2463–78.
- 4. Jones D, Rubulotta F, Welch J. Rapid response teams improve outcomes: yes. Intensive Care Med. 2016;42(4):593–5.
- 5. Kim SY, Kim S, Cho J, et al. A deep learning model for real-time mortality prediction in critically ill children. Crit Care. 2019;23(1):279.
- 6. Bassin L, Raubenheimer J, Bell D. The implementation of a real time early warning system using machine learning in an Australian hospital to improve patient outcomes. Resuscitation. 2023;188:109821.
- 7. Pimentel MAF, Redfern OC, Malycha J, et al. Detecting deteriorating patients in the hospital: development and validation of a novel scoring system. Am J Respir Crit Care Med. 2021;204(1):44–52.
- 8. Veldhuis LI, Woittiez NJC, Nanayakkara PWB, et al. Artificial intelligence for the prediction of in-hospital clinical deterioration: a systematic review. Crit Care Explor. 2022;4(9):e0744.
- 9. Sterne JA, Hernan MA, Reeves BC, et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ. 2016;355:i4919.
- 10. Luo D, Wan X, Liu J, et al. Optimally estimating the sample mean from the sample size, median, mid-range, and/or mid-quartile range. Stat Methods Med Res. 2018;27(6):1785–805.
- 11. Wan X, Wang W, Liu J, et al. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Med Res Methodol. 2014;14:135.
- 12. Higgins JP, Thompson SG, Deeks JJ, et al. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557–60.
- 13. Levin MA, Kia A, Timsina P, et al. Real-time machine learning alerts to prevent escalation of care: a nonrandomized clustered pragmatic clinical trial. Crit Care Med. 2024;52(7):1007–20.
- 14. Winslow CJ, Edelson DP, Churpek MM, et al. The impact of a machine learning early warning score on hospital mortality: a multicenter clinical intervention trial. Crit Care Med. 2022;50(9):1339–47.
- 15. Escobar GJ, Liu VX, Schuler A, et al. Automated identification of adults at risk for in-hospital clinical deterioration. N Engl J Med. 2020;383(20):1951–60.
- 16. Bailey TC, Chen Y, Mao Y, et al. A trial of a real-time alert for clinical deterioration in patients hospitalized on general medical wards. J Hosp Med. 2013;8(5):236–42.
- 17. Evans RS, Benuzillo J, Horne BD, et al. Automated identification and predictive tools to help identify high-risk heart failure patients: pilot evaluation. J Am Med Inf Assoc. 2016;23(5):872–8.