Abstract
Objective
To identify factors influencing implementation of machine learning algorithms (MLAs) that predict clinical deterioration in hospitalized adult patients and relate these to a validated implementation framework.
Materials and methods
A systematic review of studies of implemented or trialed real-time clinical deterioration prediction MLAs was undertaken, which identified: how MLA implementation was measured; impact of MLAs on clinical processes and patient outcomes; and barriers, enablers and uncertainties within the implementation process. Review findings were then mapped to the SALIENT end-to-end implementation framework to identify the implementation stages at which these factors applied.
Results
Thirty-seven articles relating to 14 groups of MLAs were identified, each trialing or implementing a bespoke algorithm. One hundred and seven distinct implementation evaluation metrics were identified. Four groups reported decreased hospital mortality, 1 significantly. We identified 24 barriers, 40 enablers, and 14 uncertainties and mapped these to the 5 stages of the SALIENT implementation framework.
Discussion
Algorithm performance decreased between the in silico and trial stages. Inclusion of both silent and pilot trials was associated with decreased mortality, as was the use of logistic regression algorithms with fewer than 39 variables. Mitigation of alert fatigue via alert suppression and threshold configuration was commonly employed across groups.
Conclusions
There is evidence that real-world implementation of clinical deterioration prediction MLAs may improve clinical outcomes. Various factors identified as influencing the success or failure of implementation can be mapped to different stages of implementation, thereby providing useful and practical guidance for implementers.
Keywords: clinical deterioration prediction, systematic review, AI implementation, machine learning, artificial intelligence, health informatics, digital health
Introduction
Clinical deterioration in hospitals is variously defined,1 most recently as, “An acute worsening of a patient’s clinical status that poses a substantial increase to an individual’s short-term risk of death or serious harm.”2(p4) Clinical deterioration prediction algorithms based on machine or deep learning methods (herein called Machine Learning Algorithms—MLAs)3,4 present an opportunity to identify deteriorating patients earlier than existing rule-based methods5–7 such as the National Early Warning Score (NEWS),8 Modified Early Warning Score (MEWS),9 and Queensland Adult-Deterioration-Detection-System (Q-ADDS).10 Although MLA investigations are mostly retrospective in silico studies, many healthcare organizations are looking to implement MLAs into routine care to reduce mortality and morbidity. Retrospective analyses are of limited practical value when it comes to real-world implementation within health services, which, according to theoretical frameworks, is a multistaged process with many factors influencing success.11–15 Health service decision-makers need to understand the enablers, barriers, and uncertainties that exist within end-to-end MLA implementation, from MLA selection based on retrospective validation studies through to prospective silent mode studies, live mode clinical trials, and eventually routine use and postdeployment evaluation. To acquire this understanding, a synthesis of published studies of clinical deterioration MLA implementation covering all stages of the process is needed, one that highlights differences in MLA implementation and their impacts on performance and clinical outcomes. Such a synthesis is presently lacking. Blythe et al16 reviewed studies on the clinical impact of implemented early warning systems that utilized real-time automated alerts, of which only 3 comprised MLAs. Three other reviews that focused on MLAs for clinical deterioration prediction included predominantly retrospective studies.4,17,18 For example, Muralitharan et al4 reported just 1 implemented system from 25 studies and Christodoulou et al18 reported none among 71 studies.
Mapping the various modes of implementation to a validated end-to-end Artificial Intelligence (AI) implementation framework15 further helps in identifying where and when enablers, barriers, and uncertainties apply to each stage of implementation. The SALIENT framework is stage-based and derived from authoritative clinical AI evaluation reporting guidelines11,19–21 in conjunction with Stead et al’s22 multistage approach to translating medical informatics interventions from the lab to the field. Compared to prior frameworks,11–14 SALIENT makes fully visible all components of the end-to-end solution, how and when they integrate, and the underlying implementation tasks. It has also been validated on real-world sepsis prediction MLAs, a use case similar to this work.23
In this study, we aimed to systematically review studies reporting the implementation or trialing of MLAs predicting clinical deterioration in adult hospitalized patients and map their findings to the SALIENT implementation framework.15
Objectives
The first objective was to undertake a systematic review which identified and analyzed studies that implemented or trialed real-time clinical deterioration prediction MLAs. Analyses included (1) how MLA implementation was measured; (2) the impact of MLAs on clinical processes and patient outcomes; and (3) where and when barriers, enablers, and uncertainties apply within the implementation process. The second objective was to map the systematic review findings to the stages and elements of the SALIENT implementation framework.
Materials and methods
Search strategy
The systematic review was performed according to PRISMA guidelines.24 Five databases (PubMed/MEDLINE, EMBASE, Scopus, Web of Science, CINAHL) were searched between January 1, 2010 and April 1, 2023 for titles and abstracts published in English using keywords and synonyms for: (1) predict; AND (2) clinical deterioration; AND (3) machine learning; AND (4) trial; and NOT (5) child (see Appendix SA for complete search queries).
A forwards and backwards citation search (snowballing strategy) was then applied to identify additional articles reporting new MLAs or providing further information about MLAs described in previously included studies. The latter were labeled “linked” studies: they described the same MLAs at different stages of implementation but were not considered primary articles.
Study selection
Studies of any design were included if they: applied MLAs to adult patients in hospital settings in whom clinical deterioration was identified; used live or near-live data; and reported at least one algorithm performance metric (full details in Appendix SB). Excluded studies were those not related to implementation or providing insufficient information for analysis. Covidence software25 supported a 2-stage screening process, with screening of articles by 4 independent reviewers (A.H.V., V.R.K., P.J.L., and J.M.) and conflicts resolved by 3-way consensus (A.H.V., V.R.K., and P.J.L.), followed by full-text review by 3 independent reviewers (A.H.V., J.S., and T.F.), with selection agreed by 3-way consensus (A.H.V., J.S., and T.F.). Snowballing was then applied to all included studies, and any new or linked studies were identified by A.H.V. and verified by J.A.D., N.E., and C.-H.L.
Data extraction
Data were extracted independently by 4 authors (A.H.V., J.A.D., N.E., and C.-H.L.) using Excel templates, with disagreements resolved by consensus. Extracted data included study metadata, implementation stage, care setting, MLA details including training and validation datasets, performance metrics, outcome definitions and events (including mortality, cardiac arrest and unplanned transfer to intensive care units [ICUs]), and implementation barriers, enablers, and uncertainties (see Appendix SC for more details). Barriers were defined as pitfalls or problems hindering implementation success; enablers as tips or activities aiding implementation success. Uncertainties were identified when 2 or more studies chose different approaches for the same implementation decision. Consensus between authors (A.H.V., J.A.D., N.E., and C.-H.L.) determined which individual barriers, enablers, and uncertainties to include and which to consolidate under a common title to minimize overlap.
Mapping to AI implementation framework
The systematic review findings for each barrier, enabler, and uncertainty were mapped to at least 1 stage and 1 solution component, or organization and policy factor, within the SALIENT implementation framework (Figure 1). SALIENT element descriptions are provided in Table 1. The mapping was then reviewed by A.H.V. and V.C., and adjustments were made where discrepancies were found.
Figure 1.
Abridged version of the SALIENT clinical Artificial Intelligence (AI) implementation framework.23 The stages of implementation (I-V) are listed across the top in black and white, with a short description of each stage beneath. The AI solution components are provided in 4 bars, labeled on the left-hand side, that stretch across each stage. Key implementation tasks are identified in white boxes within each component underneath the stage in which they are likely to occur, including: preparation, design, development and testing (dev/test), and update. The integrated solution is represented by the bar underneath the solution components. Tasks within the integrated solution are the problem definition in stage I, integration and evaluation of the solution in stages III and IV, and then routine use in stage V. Other cross-stage organizational and policy factors are provided as 5 bars at the bottom of the diagram.
Table 1.
Reference code (column 1), name (column 2), and description (column 3) of each SALIENT stage, Artificial Intelligence (AI) solution component, and organization and policy factor that barriers, enablers, and uncertainties were mapped to.
Reference code | Name | Element description
---|---|---
Stages of implementation | ||
I | Definition | When the clinical problem is defined, along with the rationale for change, background, context, and intended use of the potential Artificial Intelligence (AI) solution. |
II | Retrospective study | When a retrospective, in silico evaluation is performed on the AI algorithm solution component. |
III | Silent trial | When a prospective, live-data evaluation is performed on the AI algorithm and data pipeline solution components. Also called a silent or shadow trial. |
IV | Pilot trial | When a small trial is conducted within clinical practice to evaluate the whole AI solution and to identify issues and problems before moving to a larger trial or roll-out of the solution. |
V | Large trial/roll-out | When the solution is run in its operational environment within clinical practice and evaluated as a larger trial, such as a randomized controlled trial, or as a general roll-out across hospital wards. |
Implemented solution components | ||
DP | Data pipeline | The technology and infrastructure extending from where real-time clinical data is captured, stored, extracted, transferred, and transformed to where it is made available for use by the AI model and human-computer interface. |
AI | AI model | The MLA development, training, and deployment, including the algorithm employed, the variables used as input, and any configuration and tuning. |
HCI | Human-computer interface | The user interface (eg, dashboard) or mechanism employed (eg, mobile alert) to transfer the outputs of the AI model to the clinician. Includes content, layout, format, and interactivity. |
CW | Clinical workflow | The changes required to the existing clinical workflow that are designed to accommodate the AI model outputs and human-computer interface. |
Integrated solution | ||
The complete solution that integrates the system components (data pipeline, AI model, and human-computer interface) with the new clinical workflow. After integration the solution is evaluated before moving to routine use. | |
Organization and policy factors | ||
GOV | Governance | The governance of all aspects of implementation including the scope of the solution, model selection process, and the extent of oversight required. |
ICA | Implementation, change management, and adoption | The management of the implementation project including identification of stakeholders, leadership, implementation roles and responsibilities, change process, and solution adoption approach. |
RL | Regulatory and legal | The legal and regulatory approval and compliance process for deploying AI solutions and other legal factors, such as legal responsibility and accountability. |
ET | Ethics | The ethical aspects of implementing an AI solution including patient data privacy, cyber-security, transparency of the use of AI and interpretability of its outputs, auditability, and equity of AI use including bias and fairness considerations. |
QS | Quality and safety | The solution quality and safety considerations including patient risk, incident reporting, and monitoring and maintenance of quality and safety indicators. |
Quality assessment
Studies reporting hospital mortality underwent a risk of bias (RoB) assessment as mortality was the most frequently reported patient outcome measure and considered the most important. RoB assessment was performed independently by 2 authors (A.F. and N.M.), using the ROBINS-I tool26 for nonrandomized studies and the Cochrane RoB 2 tool27 for randomized studies.
Results
From 1337 retrieved abstracts, 497 duplicates were removed, leaving 840 for screening, from which 9 full-text studies were included for analysis (Figure 2).28–36 Most excluded studies were unavailable as full text or did not report implemented MLAs. An additional 5 articles found by snowballing were selected,37–41 yielding 14 primary articles; further snowballing yielded 23 linked studies,6,7,42–62 giving a total of 37 articles for analysis.
Figure 2.
PRISMA-ScR flowchart for study selection.
Study characteristics
The 37 studies were published between 2011 and 2023, with 14 algorithm groups (A to N) identified according to the common or named MLA that was the focus of study (Table 2); 10 were US-based (A, C, E-I, K, L, N) with one group each from Australia (B), Korea (D), Canada (J), and Singapore (M). Six groups (A, F, G, H, K, M) implemented live mode MLAs with a quantitative evaluation (before-after study,55,60 randomized controlled trial,28,44 controlled trial,40 difference-in-difference study,38 cohort study36). Seven groups (A, B, C, F, I, L, N) conducted silent trials with quantitative evaluations (prospective evaluation,29,31,33,35,37,41 simulation42). Two groups (H, J) conducted qualitative case studies during58,61,62 or after32,47,51,54 live mode implementation and one group (M) prior to implementation.56 Three groups (D, E, L) reported postimplementation retrospective studies.30,39,49 All except 3 groups reported retrospective in silico studies validating the MLA prior to implementation (L, M, N).
Table 2.
Study characteristics.
Group (MLA name) | Algorithm | Reference | SALIENT stage | Study design (no. of sites) | Outcome | Outcome count (prevalence) (%) | AUC | Sensitivity (%) | PPV (%)
---|---|---|---|---|---|---|---|---|---
[Most of the table body did not survive extraction. Recoverable fragments include: AUC values 0.92, 0.88, 0.73, and 0.88 with sensitivities 61, 49, 41, and 41-54 and PPVs 38, 31, 30, and 10-15; outcomes “icu, D, met, uot for all” with AUCs 0.92 and 0.89 and sensitivities 41 and 27-39; AUC 0.88 with sensitivity 79 and PPV 12; sensitivity 5 with PPV 10; an outcome count of 1186 (9.6); Romero-Brufau et al (2021)38,53 with sensitivity 73 and PPV 12; and an outcome “icu, various” with count 17 (50), AUC 0.93, sensitivity 94, and PPV 89.]
 | | Ye et al (2019)41 | III | P(2) | D | 255 (2.2) | 0.88 | 23 | 31
Abbreviations: P, prospective; R, retrospective; S, simulation; O, observational; CT, clinical trial; RCT, randomized controlled trial. Outcomes: D, death; icu, unplanned transfer to ICU; ca, cardiac arrest; met, medical emergency team call; uot, unplanned return to operating theatre; mv, mechanical ventilation; NR, not reported.
Subcohort of COVID-19 patients only.
Groups F and H included the only multicenter trials,55,60 and group H included the only trial reporting more than 10 000 outcome events.60 Median silent trial length was 3.5 months (IQR 3.5 months) and median live clinical trial length was 9 months (IQR 10.5 months).
The prevalence of deterioration outcomes varied from as low as 2.1%63 to as high as 22.7% in non-ICU settings,64 and from 11.3%65 to 32.8%66 in ICU settings.
Eighteen studies reported MLA evaluations at SALIENT implementation stage II (retrospective validation),6,7,28–30,33,34,37,42,43,45,46,48,50,52,53,57,59 6 at stage III (silent trial),29,31,33,35,37,41 6 at stage IV (pilot trial),28,34,36,38,40,44 2 at stage V (large trial/roll-out),55,60 and 3 reported postdeployment evaluations.30,39,49
Quality assessment
Six publications from 5 groups (A, F, G, H, K) were assessed for RoB (see Appendix SD). Overall, RoB was serious for 2 groups (G, K) and moderate for 3 (A, F, H). Major sources of bias, which no trial controlled for,38,55,60 were potential confounders from additional alert system cointerventions, such as trial staff involvement in the design and setup of the MLA within clinical workflows and special training and different patient care protocols associated with new clinical workflows, all potentially impacting hospital mortality and clinical process indicators at the trial sites.
Implementation and clinical impact evaluation
One hundred and seven distinct metrics were identified across 33 (89%) studies, grouped into 4 evaluation categories: (1) algorithm performance; (2) alert performance; (3) clinical process effects; and (4) patient outcome effects. All metrics reported are listed in Appendix SE.
Algorithm and alert performance
Of 33 algorithm performance metrics, sensitivity and area under the receiver operating curve (AUROC) were reported across all groups, with positive predictive value (PPV) (12 groups) and specificity (11 groups) the next most common. Most algorithm metrics (70%) were common to at most 2 groups. Alert metrics (n = 20) were reported by 79% (n = 11) of groups, with 8 reporting median or average alert hours before the deterioration event. All other metrics were common to at most 2 groups; however, 12 of the remaining 18 alert metrics were variants of the mean alarm count per day (MACD).
All 6 studies that evaluated stage III (silent) trials reported algorithm performance metrics,29,31,33,35,37,42 and 4 of these also reported stage II (retrospective) results29,33,37,42; algorithm performance declined in all 4 between stages II and III for at least one of AUROC, sensitivity, or PPV. Only one study reported algorithm performance at both stage III (silent) and stage IV/V (trial/roll-out), also reporting a decrease in AUROC.34 Dziadzko et al is the only study reporting comparable stage II and postimplementation algorithm performance, in which AUROC improved (0.87-0.90) for a very small sample (35 patient outcomes). However, PPV fell by 24%, affirming that AUROC stability across settings is only one marker of MLA quality.
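For reference, the sketch below shows how the three metrics most often reported across groups (AUROC, sensitivity, and PPV) are computed for an MLA at a fixed alert threshold, which is how stage II and stage III results can be compared on a like-for-like basis. The data, the 0.8 threshold, and the function name are illustrative assumptions only, not taken from any reviewed study.

```python
# Minimal sketch: computing AUROC, sensitivity, and PPV for an MLA at a
# fixed alert threshold. All data below are synthetic and illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

def performance_summary(y_true: np.ndarray, y_score: np.ndarray,
                        threshold: float) -> dict:
    """Summarize MLA performance as AUROC, sensitivity, and PPV."""
    y_alert = (y_score >= threshold).astype(int)
    tp = int(((y_alert == 1) & (y_true == 1)).sum())  # true alerts
    fp = int(((y_alert == 1) & (y_true == 0)).sum())  # false alarms
    fn = int(((y_alert == 0) & (y_true == 1)).sum())  # missed deteriorations
    return {
        "auroc": roc_auc_score(y_true, y_score),      # threshold-free
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
    }

# Synthetic cohort: 500 patient episodes with binary deterioration labels
# and a loosely correlated risk score, standing in for one evaluation stage.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
scores = np.clip(0.5 * y + rng.normal(0.4, 0.2, 500), 0, 1)
print(performance_summary(y, scores, threshold=0.8))
```

Running the same summary on a retrospective (stage II) cohort and a silent-trial (stage III) cohort makes any stagewise decline in AUROC, sensitivity, or PPV directly visible.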
Clinical impact
A total of 37 clinical process metrics, defined as measures of impact on clinical practice, were reported within 33 studies, 17 of which were solely reported by Kollef et al,44 who evaluated a range of diagnostic and therapeutic interventions administered within 24 h of the alert, including antibiotics, vasopressors, and oximetry. Of the 20 remaining metrics, those reported by more than one group were ICU transfer rates (5 groups, 6 studies) and median hours between alert and clinical escalation (2 groups, 2 studies).
Seventeen patient outcomes were reported, the most common being hospital mortality (5 groups, 6 studies), hospital length of stay (LOS) (3 groups, 4 studies), and ICU LOS and 30-day mortality (both 2 groups, 2 studies).
Table 3 reports algorithm performance by SALIENT stage of implementation and clinical impact for the 5 groups reporting in-hospital mortality. Four of the 6 studies reporting hospital mortality showed numerical improvement,40,44,55,60 of which Winslow et al55 (group F) reported the only statistically significant reduction. Groups G and H also reported statistically significant reductions in mortality, but for combined in-hospital and 30-day mortality (G) and death within 30 days of first alert (H). All 4 also reported improved clinical process metrics, 3 being statistically significant (groups A, F, G).40,44,55 Although group G did not report a statistically significant reduction in hospital mortality alone, it did report a statistically significant 2.5% reduction in the combined metric of in-hospital and 30-day mortality. Groups A and F were the only groups to report stage III (silent trial) algorithm performance, with AUROC of 0.73 and 0.80, respectively. Two studies reported no change in, or a statistically nonsignificant change in, hospital mortality.28,53 The largest study (group H, 36 233 outcomes60) reported nonsignificant improvements in clinical process metrics and mortality.
Table 3.
Evaluation results for each group that reported in-hospital mortality before and after the implementation of the MLA.
Group | Study | Outcome count (prevalence %) | Stage II AUC | Stage II sensitivity (%) | Stage II PPV (%) | Stage III AUC | Stage III sensitivity (%) | Stage III PPV (%) | Stage IV/V sensitivity (%) | Stage IV/V PPV (%) | Processes improved | Processes improved* | Processes declined | Processes declined* | In-hospital mortality | Other mortality | Risk of bias
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
A | Mao et al (2011)48 | 1173 (4.1) | 0.92 | 61 | 38 | | | | | | | | | | | |
A | Hackmann et al (2011)42 | 1173 (4.1) | 0.88 | 49 | 31 | | | | | | | | | | | |
A | Hackmann et al (2011)42 | | | | | 0.73 | 41 | 30 | | | | | | | | |
A | Bailey et al (2013)28 | 1320 (3.4) | | | | | | | 41-54 | 10-15 | | | | | 0 | |
A | Kollef et al (2014)44 | 146 (26) | | | | | | | | | 1 | 2 | 1 | | ↓ | | M
F | Churpek et al (2014)57 | 109 (0.2) | 0.88 | 60-77 | | | | | | | | | | | | |
F | Churpek et al (2014)6 | 160 (0.2) | 0.83 | 54 | | | | | | | | | | | | |
F | Kang et al (2016)31 | 393 (7.1) | 0.80 | 53-100 | | 0.80 | 53-100 | | | | | | | | | |
F | Bartkowiak et al (2019)52 | 1243 (3.8) | 0.79 | 75 | | | | | | | | | | | | |
F | Winslow et al (2022)55 | 6930 (12) | | | | | | | | | | 6 | | | ↓* | | M
G | Kia et al (2020)7 | 1997 (3.4) | 0.88 | 79 | 12 | | | | | | | | | | | |
G | Levin et al (2022)40 | 292 (10.5) | | | | | | | 93 | 7 | 7 | 1 | | | ↓ | ↓*,a | S
H | Escobar et al (2012)59 | 4036 (9.2) | 0.78 | | 8 | | | | | | | | | | | |
H | Kipnis et al (2016)43 | 19 153 (2.9) | 0.82 | 49 | 16 | | | | | | | | | | | |
H | Escobar et al (2020)60 | 36 233 (6.6) | | | | | | | | | 3 | | | | ↓ | ↓*,b | M
K | Romero-Brufau et al (2021)38,53 | 1547 (4.1) | 0.91 | | | | | | | | | | | | | |
K | Romero-Brufau et al (2021)38,53 | 6909 (12) | 0.94 | 73 | 12 | | | | | | | | | | | |
K | Romero-Brufau et al (2021)38,53 | 933 (∼8.0) | | | | | | | | | 2 | | 1 | | ↑ | | S

Results for each group study include: (1) the outcome count and percent prevalence; (2) the SALIENT stage II, III, and combined IV/V MLA evaluation results, reported as area under the receiver operating curve (AUC), sensitivity, and positive predictive value (PPV); (3) the clinical process effects, measured as the number of processes that improved and the number that declined, with * denoting a statistically significant change; (4) the mortality change (in-hospital and other), where ↓, ↑, and 0 indicate decrease, increase, and no change, respectively, and a trailing * indicates a statistically significant result; and (5) the risk of bias assessment (M, moderate; S, serious) for the study reporting mortality outcomes.

a Combined in-hospital and 30-day mortality.

b Death within 30 days of first alert.
Implementation factors and mapping to SALIENT framework
Barriers and enablers
We identified 24 barriers and 40 enablers from a total of 225 mentions across all studies. Tables 4 and 5 list the barriers and enablers identified by at least 2 groups. The most common barriers (ie, those identified by at least 4 groups) were limitations in Electronic Health Record (EHR) data (B1), ICU transfer as a poor outcome for MLA training and evaluation (B2), alert fatigue (B3), EHR data entry delays (B4), and site-by-site prevalence differences in deterioration outcomes requiring MLA retraining (B5). Nine barriers (38%) were each found in just one group. The median number of barrier mentions per group was 3, with group H accounting for 25 (35%) and groups I and M contributing none.
Table 4.
Implementation barriers reported by at least 2 groups (see Table SF1 for full listing).
Group count (%) | Study count (%) | ID | Barriers (SALIENT stage) | SALIENT component or element |
---|---|---|---|---|
7 (58) | 8 (27) | B01 | Inherent limitations of EHR data, which can be plagued by missingness, inaccuracies, and changes in practice patterns over time; Manually collected vital sign readings resulting in very irregular time series and multiscale gaps. (II+) | DP; AI |
5 (42) | 8 (27) | B02 | Use of ICU transfer is not a good outcome for Artificial Intelligence (AI) development as admission to ICU criteria may differ between hospitals. (II/III) | AI; EV |
4 (33) | 5 (17) | B03 | Alert fatigue (IV/V) | AI; CW; ICA |
4 (33) | 5 (17) | B04 | Data entry delays, leading to delayed predictions (III+) | DP |
4 (33) | 4 (13) | B05 | Differences in outcome prevalence at different sites might require models to be retrained or at least new alert thresholds selected for those sites to maintain target PPV. (V) | AI |
3 (25) | 4 (13) | B06 | Lack of clinician trust. (IV+) | ICA |
3 (25) | 4 (13) | B07 | Lack of a specific and/or effective action for the clinician to take when alerted; Differential nurse/doctor role or perceptions of role and value; Variations in hospital governance for defining standardized response processes. (IV+) | CW; ICA; GOV |
3 (25) | 3 (10) | B08 | Lack of infrastructure to produce live EHR data pipelines. (III+) | DP |
3 (25) | 3 (10) | B09 | Major differences between retrospective data elements and prospective/trial data elements. (III) | DP; AI |
3 (25) | 5 (17) | B10 | Hard/impossible to assess in an implementation study whether a lack of clinical outcome is due to the algorithm performance or the downstream Rapid Response Team (RRT) system; Conducting RCT are often not possible. (IV+) | EV |
3 (25) | 3 (10) | B11 | Insufficient event samples to build a model that incorporates patient subgroups; biases in algorithm for different patient groups. (II) | AI; ethics; QS |
2 (17) | 2 (7) | B12 | Substantial cost involved for infrastructure, implementation personnel time, and ongoing maintenance. (II+) | ICA; GOV |
2 (17) | 2 (7) | B13 | Lack of individual proficiency of health professionals in the use of hardware and software. (IV+) | CW; ICA |
2 (17) | 2 (7) | B14 | No measurement of whether the alerted physician followed through with an action. (IV+) | EV |
2 (17) | 3 (10) | B15 | Differences in software versions between research and production environments. (IV+) | DP; AI |
Includes the number and percentage (n = 12) of groups (column 1) and the number and percentage (n = 30) of studies (column 2) reporting each barrier. The last column contains the mapping to SALIENT components and elements. SALIENT components are: HCI, human-computer interface; AI, artificial intelligence model; CW, clinical workflow; DP, data pipeline. SALIENT elements are: ICA, implementation, change management, and adoption; EV, evaluation; RL, regulatory and legal; QS, quality and safety; Ethics, privacy, transparency, and equity; GOV, governance.
Table 5.
Implementation enablers reported by at least 2 groups (see Table SF2 for full listing).
Group count (%) | Study count (%) | ID | Enablers (SALIENT stage) | SALIENT component or element |
---|---|---|---|---|
8 (61) | 16 (45) | E01 | Clinician involvement essential at all stages of model/HCI development and integration into clinical workflow. (II+) | AI; CW; HCI; GOV |
6 (46) | 13 (37) | E02 | Methods identified to reduce false alarms and alert fatigue. (II+) | AI; CW; HCI |
5 (38) | 10 (28) | E03 | Linking the EWS alert to specific clinician actions. Clarifying clinical decision points, who is responsible and the actions to take. (III/IV) | CW; RL |
5 (38) | 7 (20) | E04 | Using more EHR variables than just vital signs can improve accuracy. (II) | AI |
4 (30) | 6 (17) | E05 | Establishing a transdisciplinary team of data scientists, statisticians, hospitalists, intensivists, ED clinicians, RRT nurses, and information technology leaders and developing capabilities across domains. (I+) | ICA; GOV |
4 (30) | 5 (14) | E06 | Conducting external validations using datasets different in both time and geographical location may support models that require less updates and retraining. (II) | AI |
4 (30) | 5 (14) | E07 | Providing additional data with the alert for clinicians to help contextualize the information. (III+) | HCI |
3 (23) | 7 (20) | E08 | Frequent communications to increase awareness during and after trial, for example, weekly meetings, emails, educational sessions giving progress reports and setting next goals and highlighting urgent need. (IV+) | ICA |
3 (23) | 5 (14) | E09 | Iterative approach to design of clinical workflow, human-computer interface (HCI), and MLA model. (II+) | AI; CW; HCI |
3 (23) | 5 (14) | E10 | Performing postimplementation interview (study) and real-time feedback to identify improvements. (IV+) | CW; HCI; QS; ICA |
3 (23) | 7 (20) | E11 | Establishing a multidisciplinary governance committee to promote usage, track compliance, provide training and plan for post-trial sustainability; and an external data safety board to oversee safety and AI efficacy. (I+) | GOV; QS |
3 (23) | 6 (17) | E12 | Staggered deployment across sites. (V+) | ICA |
3 (23) | 3 (8) | E13 | A “Model Facts” sheet designed to convey relevant information about the model to clinical end users. (II) | ethics; AI; ICA; CW |
3 (23) | 4 (11) | E14 | Improving model training for imbalanced datasets. (II) | AI |
3 (23) | 5 (14) | E15 | A silent prospective trial conducted while existing RRT system is in place allows independent assessment of MLA performance vs existing approach. (III) | EV; ICA; DP |
2 (15) | 4 (11) | E16 | Conducting improvement initiatives (PDSA) cycles during implementation to quickly garner and act on clinical feedback. (III+) | CW; QS; ICA |
2 (15) | 4 (11) | E17 | Appointing clinical champions to advocate for the tool. (II+) | ICA |
2 (15) | 2 (5) | E18 | Implementing alternative workflows during peak hours and around staff times. (IV+) | CW; ICA |
2 (15) | 3 (8) | E19 | Teaching clinicians how to interpret risk scores. (IV+) | CW; ET; ICA |
2 (15) | 4 (11) | E20 | Strong support from senior leadership. (I+) | ICA; GOV |
2 (15) | 4 (11) | E21 | Increasing trust in the model as clinicians experienced the algorithm making correct predictions and detecting cases that clinicians miss. (IV+) | ICA |
2 (15) | 2 (5) | E22 | Creating a data dictionary to harmonize data for the model across different sites/EHR systems. (V) | DP |
2 (15) | 4 (11) | E23 | Integrating advance care planning into MLA-linked actions (palliative care built in). (III/IV) | ET; CW |
2 (15) | 3 (8) | E24 | Incorporating the patient into care decisions; for example, developing a clinician script to explain to patients why the clinician is suddenly evaluating them. (III/IV) | ET; CW |
2 (15) | 4 (11) | E25 | Quality tracking postimplementation and sustainability of solution. (IV/V) | EV; QS |
2 (15) | 2 (5) | E26 | Utilizing commonly collected EHR data for the model so that the model is transferable. (II) | DP; AI |
Includes the number and percentage (n = 13) of groups (column 1) and the number and percentage (n = 35) of studies (column 2) reporting each enabler. The last column contains the mapping to SALIENT components and elements. SALIENT components are: HCI, human-computer interface; AI, artificial intelligence model; CW, clinical workflow; DP, data pipeline. SALIENT elements are: ICA, implementation, change management, and adoption; EV, evaluation; RL, regulatory and legal; QS, quality and safety; Ethics, privacy, transparency, and equity; GOV, governance.
The most commonly reported enablers (ie, those identified by at least 5 groups) were clinician involvement throughout implementation (E01), methods identified to reduce false alarms (E02), linking the alert with clinician action (E03), and using more variables in the MLA than just vital signs (E04). Fourteen enablers (35%) were each identified in just one group, with a median of 5.5 mentions per group and groups H and J accounting for 40% and 24% of mentions, respectively.
Overall, 89% of all barriers and enablers were AI task agnostic, with 2 barriers (B02, B10) and 5 enablers (E04, E15, E23, E30, E32) specific to clinical deterioration prediction. All barriers and enablers were mapped to the SALIENT AI implementation framework (see Figure 3), with most (N = 30) applicable to stages IV (pilot trial) and V (roll-out) and fewest (N = 13) applicable to stage II (retrospective study). Most barriers related to the AI (N = 7) and data pipeline (N = 5) components and to the implementation, change, and adoption element (N = 5), with no barriers for the HCI component. Most enablers related to the implementation, change, and adoption element (N = 12) and the AI component (N = 8). Neither barriers nor enablers were found for the regulatory and legal policy element.
Figure 3.
Mapping of enablers (green circles), barriers (red circles), and uncertainties (blue circles) to the SALIENT end-to-end clinical Artificial Intelligence (AI) implementation framework. The number in each circle indicates how many factors of that type apply at that point in the framework.
Uncertainties
Table 6 identifies the 14 most commonly reported process uncertainties (ie, reported by at least 10 groups) during implementation, grouped according to differences between studies within SALIENT components, that is, outcome definition, types of MLA used, data pipelines, clinical workflows, HCIs, and implementation evaluation methods.
Table 6.
Implementation uncertainties reported by at least 10 groups.
Group count (%) | Study count (%) | ID | Uncertainties (SALIENT stage) | SALIENT component or element |
---|---|---|---|---|
14 (100) | 31 (86) | U01 | What outcome basis for train/evaluate? | Definition |
14 (100) | 36 (100) | U02 | Which Artificial Intelligence (AI) model: machine vs deep learning (II) | AI |
14 (100) | 30 (83) | U03 | How many and which variables? (II) | AI |
13 (92) | 29 (80) | U04 | How early to target alerts? (too early—no symptoms/signs, too late, no clinical utility). (II) | AI |
14 (100) | 33 (91) | U05 | What data access approach to use: direct to the EHR or via a separate data warehouse (II/III) | DP
13 (92) | 17 (47) | U06 | What level of pipeline sophistication can be supported: model performance vs engineering effort including inter-admission variables. (II/III) | DP |
14 (100) | 23 (63) | U07 | Whether dedicated vs distributed model of alert handling (III) | CW |
13 (92) | 31 (86) | U08 | What determines the setpoint decision. Who/how is it set? (III) | CW |
13 (92) | 16 (44) | U09 | MLA output configuration: Binary vs continuous vs multitier (III) | CW |
13 (92) | 31 (86) | U10 | Whether integrated within EHR or not and if not, sent via tablets/phones/pager (III) | HCI |
12 (85) | 19 (52) | U11 | Whether individual notification (hard alert) or aggregated dashboard (soft alert) and what information is provided with the alert. (III) | HCI |
10 (71) | 16 (44) | U12 | Alert management: Which alert timing: suppression of alerts after first alert; one time or repeat {including frequency of alert generation}. (III) | HCI |
0 (0) | 0 (0) | U13 | Which metrics to use. (III) | EV |
14 (100) | 35 (97) | U14 | What process to follow: Silent trial or not and which trial method. Pilot or no pilot. (II) | EV |
Includes the number and percentage (n = 14) of groups (column 1) and the number and percentage (n = 36) of studies (column 2) reporting each uncertainty. The uncertainties are grouped beneath each SALIENT component and stage. SALIENT components are: HCI, human-computer interface; AI, artificial intelligence model; CW, clinical workflow; DP, data pipeline. SALIENT elements are: EV, evaluation.
Definition uncertainties (U01)
It remains unclear whether, and if so how, chosen outcomes affect MLA effectiveness or clinical impact. Twenty-one different composite definitions of clinical deterioration were used, with 11 individual outcomes identified (see Appendix SG, Tables SG1 and SG2). The most popular outcome measures were transfer to ICU (75%, N = 28), in-hospital death (61%), and cardiac arrest (36%), with each of the remaining 8 outcomes used in 3 or fewer studies. Specific outcome challenges included data limitations,7,33 inconsistencies with using ICU transfer,30,31,43,44,49,52,57,59 and differences in how palliative cases were managed.31,34,43,50
AI model uncertainties (U02-U04)
The rationale for selecting a specific MLA (U02) included comparing different MLAs,7,40,41 discounting complex MLAs for lack of transparency,43 and limiting MLAs to those supported by the group’s EHR.33 Half the groups (A, B, F, H, I, J, L) employed logistic regression models, of which 3 (A, F, H) were used by groups reporting decreased mortality after implementation; 1 (G) of 3 groups (E, G, N) using random forest showed similar results. Only group D used a deep learning model. The number of AI input variables (U03) ranged from 4 (D) to 526 (J), with a median of 43. All studies reporting decreased in-hospital mortality used fewer than 39 variables.40,44,55,60 Justifications for variable selection included: variables commonly collected within the EHR29,33,44,45,48; variables not prone to missing or poor quality data57; selection based on prior reviews and clinician input7,41; variables purpose-built for the MLA, for example, a nurse worry factor38; and variable reduction using statistical methods such as recursive feature elimination.6,7,29,41,43 Targeting how early to predict deterioration (U04) involved reconciling: (1) the sensitivity and PPV of the MLA; (2) the maximum time window (in hours) in which positive cases counted as positive alerts prior to the deterioration outcome, variously set to 12 h (H, I), 24 h (D, F, J, K, L, N), and 48 h (E); and (3) the clinical utility of the alert: early enough to give clinicians additional time to act in a directed way before they would themselves suspect deterioration, but not so early that no signs of deterioration were apparent and clinicians would not know how to respond.33,51,54
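To make the U02/U03 choices concrete, here is a minimal, hypothetical sketch of the approach most groups described: a logistic regression model restricted to a small set of EHR variables, pruned with recursive feature elimination. The pipeline structure and the 20-variable target are assumptions for illustration, not a reconstruction of any group's MLA.

```python
# Hypothetical sketch: logistic regression deterioration model with the
# input variables pruned by recursive feature elimination (RFE).
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_deterioration_model(X: pd.DataFrame, y: pd.Series,
                              n_features: int = 20) -> Pipeline:
    """Fit a logistic regression MLA on at most n_features EHR variables."""
    model = Pipeline([
        ("scale", StandardScaler()),                       # normalize inputs
        ("select", RFE(LogisticRegression(max_iter=1000),  # prune variables
                       n_features_to_select=n_features)),
        ("clf", LogisticRegression(max_iter=1000)),        # final model
    ])
    return model.fit(X, y)
```

The fitted pipeline's `predict_proba` output is the continuous risk score to which an alert threshold is later applied.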
Data pipeline uncertainties (U05-U06)
Thinking was split (U05) over whether to use EHR data directly (A, B, D, I, K, L, N), or employ an external data warehouse (E, F, G, H, J, M) on the basis that, “existing inpatient EMRs were not designed with complex calculations in mind”59(p394) and do not universally support real-time data streaming.29 Many groups reflected on trade-offs associated with data pipeline sophistication (U06). More sophisticated pipelines involving complex calculations needed to be moved out of the EHR61 but could also allow higher prediction refresh rates, ranging from immediate updates (A, D, I) to quarter-hourly (K), hourly (F, H, J), 2-hourly (G), and 4-hourly (E). Incorporating inter-admission data, such as comorbidities, into MLAs could also improve performance, but render the data pipeline more complex.33,37,43
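A minimal sketch, with assumed function names, of the data-access trade-off (U05/U06) follows: the same scoring loop can poll the EHR directly or a downstream warehouse, with the refresh interval (immediate through 4-hourly in the reviewed groups) as the key configuration point.

```python
# Illustrative sketch of a polling data pipeline; `fetch_cohort` and
# `score_and_alert` are assumed callables supplied by the implementer.
import time
from typing import Callable, Dict, List

def run_pipeline(fetch_cohort: Callable[[], List[Dict]],
                 score_and_alert: Callable[[List[Dict]], None],
                 refresh_minutes: float = 60.0) -> None:
    """Poll a data source and rescore the inpatient cohort on a fixed cycle."""
    while True:
        records = fetch_cohort()       # EHR query or warehouse extract
        score_and_alert(records)       # apply the MLA and route any alerts
        time.sleep(refresh_minutes * 60)
```

Lower `refresh_minutes` values imply fresher predictions but a heavier engineering load on the source system, which is the trade-off the groups describe.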
Clinical workflow uncertainties (U07-U09)
Group H alone centralized alerting processes (U07) by using dedicated off-site clinical personnel to monitor alerts, minimize alert fatigue and the clinical burden on Rapid Response Team (RRT) staff, and enhance standardization and clinician acceptance.32,60 All other groups employed decentralized alerting of ward nursing staff. Group A switched from decentralized to a more centralized approach after establishing that alerting the charge nurse had no impact on clinical outcomes,28 instead redirecting alerts to the RRT nurse.44 The MLA alert threshold or setpoint determines the number of alerts and represents a trade-off between sensitivity and PPV (U08), with nearly all groups deciding this based solely on ensuring a clinically manageable workload to minimize false alarms,32,34,37,42,44,47,53,58 yielding 3-10 alerts/day/100 patients (see Appendix SH). These thresholds resulted in widely ranging sensitivities (25%-63%), PPVs (10%-40%), and specificities (78%-98%). MLA outputs could be configured within clinical workflows (U09) as continuous index readouts (E, L), binary alerts (A, C, H, M), or multitier cut-offs (B, F, G, I, J, K) such as red-amber-green33,55 or high-medium-low risk levels.38 Justification for the configuration choice was usually absent, although it was influenced by clinicians in 2 studies.29,33 Groups reporting reduced mortality used both multitier (F, G) and binary thresholds (A, H)44; however, group H integrated their binary threshold within a multitier rapid response system51 and group A was moving from a binary to a multitier threshold.28,44
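The setpoint logic described above, in which nearly all groups chose the threshold to keep alert volume clinically manageable, can be sketched as below; the scan granularity and the alert budget of 10 alerts/day/100 patients (the upper end of the range in Appendix SH) are illustrative assumptions.

```python
# Minimal sketch: pick the lowest alert threshold whose expected alert
# volume stays within a clinically manageable budget.
import numpy as np

def choose_setpoint(scores_per_patient_day: np.ndarray,
                    max_alerts_per_100_patient_days: float = 10.0) -> float:
    """Return the lowest threshold keeping alert volume within budget."""
    for threshold in np.linspace(0.0, 1.0, 101):
        alerts_per_100 = 100.0 * (scores_per_patient_day >= threshold).mean()
        if alerts_per_100 <= max_alerts_per_100_patient_days:
            return float(threshold)
    return 1.0  # no threshold meets the budget; alert only on maximal scores
```

Sensitivity and PPV at the chosen setpoint should then be reported, since the review shows that workload-driven thresholds yielded widely varying values of both.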
Human-computer interface uncertainties (U10-U12)
The location of MLA output (U10) was split between groups integrating the outputs into the EHR (H, I, L), displaying or sending the outputs externally (A, E, G, J, K, M), or both (B, D, F). According to Nestor et al,50 EHR integration enabled nurses to allocate staff more efficiently and clinicians to monitor patients, but potentially required expensive EHR changes. The delivery interface (U11) for MLA outputs also varied between groups: hard alerts via pagers or phones (A, E, G, K); soft alerts within a dashboard or screen (H, I, L, M); or both (B, D, F, J). Soft alerts were used by group H, where a dedicated nurse could constantly monitor for changes,32,47,51 and by other groups to provide enriched information using risk-based color coding, cross-patient views, and graphical displays.32,33,55,56 To prevent alert fatigue, alerts were commonly suppressed across groups (D, G, H, J, K): (1) for 4,39 8,40 21,53 or 4834 h after the first alert; (2) within 2 h of,39 or soon after,50 admission; (3) if later scores varied by less than 10%40; (4) for patients moving from the ICU34; (5) where the risk level did not increase54; and (6) for other strategic reasons based on clinician feedback.7,32,51
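For illustration only, the suppression rules above could be combined into a single alerting policy such as the following; the window lengths and the 10% change rule echo individual studies, but no group reported this exact composite.

```python
# Hypothetical composite of reported alert suppression rules: suppress
# repeats within a fixed window after the first alert, suppress alerts
# immediately after admission, and require a material score increase.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlertState:
    hours_since_admission: float
    hours_since_last_alert: Optional[float]  # None if no prior alert
    last_alerted_score: Optional[float]

def should_alert(score: float, threshold: float, state: AlertState,
                 suppress_hours: float = 48.0,
                 admission_grace_hours: float = 2.0,
                 min_relative_increase: float = 0.10) -> bool:
    if score < threshold:
        return False
    if state.hours_since_admission < admission_grace_hours:
        return False  # suppress soon after admission
    if state.hours_since_last_alert is not None:
        if state.hours_since_last_alert < suppress_hours:
            return False  # suppress repeats within the window
        if (state.last_alerted_score is not None and
                score < state.last_alerted_score * (1 + min_relative_increase)):
            return False  # suppress unless risk has materially increased
    return True
```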
Evaluation uncertainties (U13, U14)
Evaluation (U13) proved challenging, with a wide range of metrics being used within and across groups and no standardization. Only 2 groups (E, J) reported pre- and postimplementation evaluations of MLA performance using the same metrics.30,34 Also, not all groups conducted evaluations at all stages of implementation (U14; refer to Appendix SI): 71% reported silent or prospective evaluations and half reported small-scale clinical trials, with 71% of the latter also conducting silent evaluations. Silent evaluations ranged from 0.5 to 10 months (average 4.4 months) and trials ranged from 1 to 24 months (average 10.1 months). All groups reporting reduced hospital mortality conducted small-scale trials (∼10 months) and 75% conducted silent evaluations (∼2.7 months).
All uncertainties were mapped to the SALIENT framework (see Figure 3) and all were AI task agnostic. All uncertainties but one (U01) were relevant to SALIENT stages II (retrospective evaluation) and III (prospective evaluation), but they were otherwise fairly evenly spread across the SALIENT components (AI, clinical workflow, data pipeline, and HCI) and the evaluation element.
Discussion
Our review identified 14 groups, predominantly US-based, who trialed or implemented clinical deterioration prediction MLAs within their hospital(s). Of the 5 groups reporting hospital mortality, 4 saw a reduction after MLA implementation, although this was only statistically significant in the study by Winslow et al55 (group F), which also reported the most (n = 6) clinical process indicators with statistically significant improvements, including median hours between alert and escalation, repeat vital signs taken, and lactate orders made within 2 h. Winslow et al conducted a before-after study with a 10-month control period in which the MLA operated silently without efferent arm engagement, a 2-month implementation period, and then a 10-month intervention period. A target cohort was defined, based on high and medium risk MLA alert thresholds, which were the same for both control and intervention periods. While mortality for this target group declined significantly in the intervention period, the same was seen for the nontarget, nonalerted patient cohort, indicating possible confounding factors, such as clinician training, altered clinical workflows, and Hawthorne effects from project focus on clinical deterioration.
Other groups reporting reduced mortality were seriously confounded for the same reasons, which are difficult to control or adjust for. This problem reflects the dual nature of implementing MLAs or any kind of early warning system: MLAs provide the afferent arm, but achieving improvement in clinical outcomes relies on an effective medical response to the alert (the efferent arm). In this sense, the fidelity with which an efferent arm functions will influence or moderate the effects of the MLA. Although our longitudinal analysis attempted to identify causal steps between MLA evaluation results at each implementation stage and changes in clinical processes and, ultimately, in-hospital mortality, ascertaining the contribution of the efferent arm to changes in outcome was not possible because of insufficient samples, poor reporting of MLA performance after the retrospective stage, and differing efferent arms.
Stage II (retrospective) MLA performance for groups reporting improved in-hospital mortality varied widely: 0.78-0.92 for AUROC, 0.49-0.93 for sensitivity, and 0.07-0.38 for PPV. Performance was rarely reported after this stage and, when it was, it degraded between stages II and III (prospective)29,33,37,42 and between stages III and IV (trial),34 further challenging a convincing link between MLA performance and clinical outcomes. The highest retrospective MLA performance was reported by group K (AUC = 0.94), who reported increased in-hospital mortality, confirming that retrospective MLA performance alone is insufficient to effect positive clinical outcomes.
Older MLA technologies, such as logistic regression with fewer than 39 variables, appeared sufficient for alerting purposes, being used by three-quarters of the groups reporting reduced in-hospital mortality. However, as effector arms also influence outcomes, this may not constitute definitive evidence of the impact of the type of MLA or number of variables on clinical outcomes. Only one group (D) used a deep learning model, whose clinical impact was not reported.39
Two strategies were commonly used to combat alert fatigue: (1) nearly all groups configured their MLA alert threshold to a level of precision sufficient to limit the number of alerts per patient per day, but at the expense of lower sensitivity; for example, Brajer et al37 had to reduce the sensitivity of their MLA by ∼20% to reduce alerts per day per 100 patients from 11.9 to 6; and (2) 5 groups used alert suppression after the first alert, although the suppression period varied markedly, from 4 to 48 h, and the potential impact on clinical follow-through and care outcomes was not investigated.
Definitions of clinical deterioration outcomes are diverse, thereby preventing meaningful MLA performance comparison between groups. Eleven outcomes were identified, with 21 variants being used across groups to train and evaluate their MLAs. Transfer to ICU, the most frequently reported outcome, was particularly problematic as it is subject to different hospital admission protocols, clinician preferences and biases, and patient-level factors.30,31,44,52,57,59
Pilot trials (SALIENT stage IV) were employed by half the groups and 71% of the groups performed silent trials (SALIENT stage III). Silent trials were used for MLA threshold setting,40,53 final safety testing,54,59 identifying patient types reaching the threshold,58 finalizing response arm protocols,61 identifying unanticipated issues with models and data pipelines,34,54 and collecting feedback from users and building system trust.34
Strengths and limitations
To our knowledge, this study is the first to undertake a systematic review of clinical deterioration prediction algorithms deployed or trialed in clinical settings, identify barriers, enablers, and uncertainties relevant to implementation, and map these to a single end-to-end implementation framework. Unlike similar reviews,4,16,67–70 we conducted a novel 2-stage literature review in which, in the second stage, we identified related studies published before or after the principal deployment study, thereby providing evidence across the whole MLA implementation process. The findings of each study could also be mapped to one or more stages within the SALIENT implementation framework, making explicit when and where these factors arise within the multistage implementation process. This novel approach helps close the gaps in current implementation guidance and offers a pragmatic overview for use by clinicians, informatics personnel, and managers engaged in AI implementation planning.
Limitations relate to the small number of empirical studies of deployed algorithms, heterogeneity of performance reporting, underreporting of postimplementation performance metrics, and potential publication bias. Although RoB for mortality-reporting studies was moderate to serious, 4 of 5 groups reported reductions in mortality, one of which was statistically significant, underscoring the need to further evaluate this relationship in future work. Our study is also limited by the scope of SALIENT: it does not include the full AI lifecycle, for example, AI decommissioning and maintenance, and may be missing other pragmatic elements, such as might be found in stakeholder-based models.71
Conclusions
Implementing MLAs within adult hospital care settings to predict clinical deterioration can potentially change clinical practice and improve mortality. However, an insufficient number of cases, moderate to serious levels of bias, and a lack of uniform MLA performance reporting across implementation stages prevent establishment of a causal link. Enablers of and barriers to successful MLA implementation have been identified, in particular strategies for combatting alert fatigue and the value of conducting both silent and live pilot trials. Noteworthy too was the finding that older and simpler logistic regression MLAs appeared sufficient to achieve acceptable levels of performance and enable clinical impact.
However, multiple implementation uncertainties throughout the multistage process require further research to quantify effect, with likely more yet to be identified as MLAs and their implementation evolve. Use of the SALIENT end-to-end implementation framework helps identify exactly where in the implementation pipeline these barriers, enablers, and uncertainties are located, providing a practical roadmap for stakeholders wishing to implement clinical deterioration prediction algorithms.
Supplementary Material
Acknowledgments
None.
Contributor Information
Anton H van der Vegt, Centre for Health Services Research, The University of Queensland, Brisbane, QLD 4102, Australia.
Victoria Campbell, Intensive Care Unit, Sunshine Coast Hospital and Health Service, Birtynia, QLD 4575, Australia; School of Medicine and Dentistry, Griffith University, Gold Coast, QLD 4222, Australia.
Imogen Mitchell, Office of Research and Education, Canberra Health Services, Canberra, ACT 2601, Australia.
James Malycha, Department of Critical Care Medicine, The Queen Elizabeth Hospital, Woodville, SA 5011, Australia.
Joanna Simpson, Eastern Health Intensive Care Services, Eastern Health, Box Hill, VIC 3128, Australia.
Tracy Flenady, School of Nursing, Midwifery & Social Sciences, Central Queensland University, Rockhampton, QLD 4701, Australia.
Arthas Flabouris, Intensive Care Department, Royal Adelaide Hospital, Adelaide, SA 5000, Australia; Adelaide Medical School, University of Adelaide, Adelaide, SA 5005, Australia.
Paul J Lane, Safety Quality & Innovation, The Prince Charles Hospital, Chermside, QLD 4032, Australia.
Naitik Mehta, Patient Safety and Quality, Clinical Excellence Queensland, Brisbane, QLD 4001, Australia.
Vikrant R Kalke, Patient Safety and Quality, Clinical Excellence Queensland, Brisbane, QLD 4001, Australia.
Jovie A Decoyna, School of Medicine and Dentistry, Griffith University, Gold Coast, QLD 4222, Australia.
Nicholas Es’haghi, School of Medicine and Dentistry, Griffith University, Gold Coast, QLD 4222, Australia.
Chun-Huei Liu, School of Medicine and Dentistry, Griffith University, Gold Coast, QLD 4222, Australia.
Ian A Scott, Centre for Health Services Research, The University of Queensland, Brisbane, QLD 4102, Australia; Department of Internal Medicine and Clinical Epidemiology, Princess Alexandra Hospital, Brisbane, QLD 4102, Australia.
Author contributions
A.H.V. conceptualized the review. A.H.V., V.R.K., P.J.L., J.M., J.S., and T.F. conducted the title/abstract screening and full-text review. A.F. and N.M. performed the risk-of-bias assessments. A.H.V., J.A.D., N.E., and C.-H.L. performed all data extraction and tabular data collation. A.H.V. drafted the manuscript with revisions and feedback from V.C., I.M., I.A.S., A.F., and J.M.
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
A.H.V. was funded through a Queensland Government, Advanced Queensland Industry Research Fellowship grant. The Queensland Government had no role within this research.
Conflicts of interest
None declared.
Data availability
There are no new data associated with this article.
References
- 1. Jones D, Mitchell I, Hillman K, Story D. Defining clinical deterioration. Resuscitation. 2013;84(8):1029-1034. 10.1016/j.resuscitation.2013.01.013
- 2. Mitchell OJL, Dewan M, Wolfe HA, et al. Defining physiological decompensation: an expert consensus and retrospective outcome validation. Crit Care Explor. 2022;4(4):e0677. 10.1097/cce.0000000000000677
- 3. Al-Shwaheen TI, Moghbel M, Hau YW, Ooi CY. Use of learning approaches to predict clinical deterioration in patients based on various variables: a review of the literature. Artif Intell Rev. 2022;55(2):1055-1084. 10.1007/s10462-021-09982-2
- 4. Muralitharan S, Nelson W, Di S, et al. Machine learning-based early warning systems for clinical deterioration: systematic scoping review. J Med Internet Res. 2021;23(2):e25187. 10.2196/25187
- 5. Pimentel MAF, Redfern OC, Malycha J, et al. Detecting deteriorating patients in the hospital: development and validation of a novel scoring system. Am J Respir Crit Care Med. 2021;204(1):44-52. 10.1164/rccm.202007-2700OC
- 6. Churpek MM, Yuen TC, Winslow C, et al. Multicenter development and validation of a risk stratification tool for ward patients. Am J Respir Crit Care Med. 2014;190(6):649-655. 10.1164/rccm.201406-1022OC
- 7. Kia A, Timsina P, Joshi HN, et al. MEWS++: enhancing the prediction of clinical deterioration in admitted patients through a machine learning model. J Clin Med. 2020;9(2):343. 10.3390/jcm9020343
- 8. Royal College of Physicians. National Early Warning Score (NEWS): Standardising the Assessment of Acute-illness Severity in the NHS—Report of a Working Party. Royal College of Physicians; 2012.
- 9. Subbe CP, Kruger M, Rutherford P, Gemmel L. Validation of a modified early warning score in medical admissions. QJM. 2001;94(10):521-526. 10.1093/qjmed/94.10.521
- 10. Campbell V, Conway R, Carey K, et al. Predicting clinical deterioration with Q-ADDS compared to NEWS, Between the Flags, and eCART track and trigger tools. Resuscitation. 2020;153:28-34. 10.1016/j.resuscitation.2020.05.027
- 11. Vasey B, Nagendran M, Campbell B, et al.; DECIDE-AI Expert Group. Reporting guideline for the early stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. BMJ. 2022;377:e070904. 10.1136/bmj-2022-070904
- 12. Van De Sande D, Van Genderen ME, Smit JM, et al. Developing, implementing and governing artificial intelligence in medicine: a step-by-step approach to prevent an artificial intelligence winter. BMJ Health Care Inform. 2022;29(1):1-8. 10.1136/bmjhci-2021-100495
- 13. Gama F, Tyskbo D, Nygren J, Barlow J, Reed J, Svedberg P. Implementation frameworks for artificial intelligence translation into health care practice: scoping review. J Med Internet Res. 2022;24(1):e32215. 10.2196/32215
- 14. Crossnohere NL, Elsaid M, Paskett J, Bose-Brill S, Bridges JFP. Guidelines for artificial intelligence in medicine: literature review and content analysis of frameworks. J Med Internet Res. 2022;24(8):e36823. 10.2196/36823
- 15. van der Vegt A, Scott I, Dermawan K, Schnetler R, Kalke V, Lane P. Implementation frameworks for end-to-end clinical AI: derivation of the SALIENT framework. J Am Med Inform Assoc. 2023;30(9):1503-1515.
- 16. Blythe R, Parsons R, White NM, Cook D, McPhail S. A scoping review of real-time automated clinical deterioration alerts and evidence of impacts on hospitalised patient outcomes. BMJ Qual Saf. 2022;31(10):725-734. 10.1136/bmjqs-2021-014527
- 17. Lee TC, Shah NU, Haack A, Baxter SL. Clinical implementation of predictive models embedded within electronic health record systems: a systematic review. Informatics. 2020;7(3):25. 10.3390/informatics7030025
- 18. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12-22. 10.1016/j.jclinepi.2019.02.004
- 19. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Eur J Clin Invest. 2015;45(2):204-214. 10.1111/eci.12376
- 20. Moons KGM, Altman DG, Reitsma JB, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162(1):W1-W73. 10.7326/M14-0698
- 21. Liu X, Rivera SC, Moher D, Calvert MJ, Denniston AK; SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. BMJ. 2020;370:m3164. 10.1136/bmj.m3164
- 22. Stead WW, Haynes RB, Fuller S, et al. Designing medical informatics resource projects to increase what is learned. J Am Med Inform Assoc. 1994;1(1):28-33.
- 23. van der Vegt A, Scott I, Dermawan K, Schnetler R, Kalke V, Lane P. Deployment of machine learning algorithms to predict sepsis: systematic review and application of the SALIENT clinical AI implementation framework. J Am Med Inform Assoc. 2023;30(7):1349-1361.
- 24. Moher D, Shamseer L, Clarke M, et al.; PRISMA-P Group. Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement. Syst Rev. 2015;4(1):1. 10.1186/2046-4053-4-1
- 25. Covidence systematic review software. Veritas Health Innovation, Melbourne, Australia. www.covidence.org
- 26. Sterne JA, Hernán MA, Reeves BC, et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ. 2016;355:i4919. 10.1136/bmj.i4919
- 27. Sterne JAC, Savović J, Page MJ, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. 2019;366:l4898. 10.1136/bmj.l4898
- 28. Bailey TC, Chen Y, Mao Y, et al. A trial of a real-time alert for clinical deterioration in patients hospitalized on general medical wards. J Hosp Med. 2013;8(5):236-242. 10.1002/jhm.2009
- 29. Bell D, Baker J, Williams C, Bassin L. A trend-based early warning score can be implemented in a hospital electronic medical record to effectively predict inpatient deterioration. Crit Care Med. 2021;49(10):e961-e967. 10.1097/CCM.0000000000005064
- 30. Dziadzko MA, Novotny PJ, Sloan J, et al. Multicenter derivation and validation of an early warning score for acute respiratory failure or death in the hospital. Crit Care. 2018;22(1):286. 10.1186/s13054-018-2194-7
- 31. Kang MA, Churpek MM, Zadravecz FJ, Adhikari R, Twu NM, Edelson DP. Real-time risk prediction on the wards: a feasibility study. Crit Care Med. 2016;44(8):1468-1473. 10.1097/CCM.0000000000001716
- 32. Martinez VA, Betts RK, Scruth EA, et al. The Kaiser Permanente Northern California Advance Alert Monitor Program: an automated early warning system for adults at risk for in-hospital clinical deterioration. Jt Comm J Qual Patient Saf. 2022;48(8):370-375. 10.1016/j.jcjq.2022.05.005
- 33. O’Brien C, Goldstein BA, Shen Y, et al. Development, implementation, and evaluation of an in-hospital optimized early warning score for patient deterioration. MDM Policy Pract. 2020;5(1):2381468319899663. 10.1177/2381468319899663
- 34. Pou-Prom C, Murray J, Kuzulugil S, Mamdani M, Verma AA. From compute to care: lessons learned from deploying an early warning system into clinical practice. Front Digit Health. 2022;4:932123. 10.3389/fdgth.2022.932123
- 35. Singh K, Valley TS, Tang S, et al. Evaluating a widely implemented proprietary deterioration index model among hospitalized patients with COVID-19. Ann Am Thorac Soc. 2021;18(7):1129-1137. 10.1513/AnnalsATS.202006-698OC
- 36. Un KC, Wong CK, Lau YM, et al. Observational study on wearable biosensors and machine learning-based remote monitoring of COVID-19 patients. Sci Rep. 2021;11(1):4388. 10.1038/s41598-021-82771-7
- 37. Brajer N, Cozzi B, Gao M, et al. Prospective and external evaluation of a machine learning model to predict in-hospital mortality of adults at time of admission. JAMA Netw Open. 2020;3(2):e1920733. 10.1001/jamanetworkopen.2019.20733
- 38. Romero-Brufau S, Rosenthal J, Kautz J, et al. Clinical implementation of a machine learning system to detect deteriorating patients reduces time to response and intervention. medRxiv. 2021:1-13. 10.1101/2021.10.10.21264823. Preprint: not peer reviewed.
- 39. Cho KJ, Kwon O, Kwon JM, et al. Detecting patient deterioration using artificial intelligence in a rapid response system. Crit Care Med. 2020;48(4):e285-e289. 10.1097/CCM.0000000000004236
- 40. Levin MA, Kia A, Timsina P, et al. Real-time machine learning alerts to prevent escalation of care: a pragmatic clinical trial. medRxiv. 2022. Preprint: not peer reviewed. https://www.medrxiv.org/content/10.1101/2022.12.21.22283778v1.full.pdf
- 41. Ye C, Wang O, Liu M, et al. A real-time early warning system for monitoring inpatient mortality risk: prospective study using electronic medical record data. J Med Internet Res. 2019;21(7):e13719. 10.2196/13719
- 42. Hackmann G, Chen M, Chipara O, et al. Toward a two-tier clinical warning system for hospitalized patients. AMIA Annu Symp Proc. 2011;2011:511-519.
- 43. Kipnis P, Turk BJ, Wulf DA, et al. Development and validation of an electronic medical record-based alert score for detection of inpatient deterioration outside the ICU. J Biomed Inform. 2016;64:10-19. 10.1016/j.jbi.2016.09.013
- 44. Kollef MH, Chen Y, Heard K, et al. A randomized trial of real-time automated clinical deterioration alerts sent to a rapid response team. J Hosp Med. 2014;9(7):424-429. 10.1002/jhm.2193
- 45. Kwon J, Lee Y, Lee Y, Lee S, Park J. An algorithm based on deep learning for predicting in-hospital cardiac arrest. J Am Heart Assoc. 2018;7(13):e008678. 10.1161/JAHA.118.008678
- 46. Lee YJ, Cho K-J, Kwon O, et al. A multicentre validation study of the deep learning-based early warning score for predicting in-hospital cardiac arrest in patients admitted to general wards. Resuscitation. 2021;163:78-85. 10.1016/j.resuscitation.2021.04.013
- 47. Lisk LE, Buckley JD, Wilson K, et al. Developing a virtual nursing team to support predictive analytics and gaps in patient care. Clin Nurse Spec. 2020;34(1):17-22. 10.1097/NUR.0000000000000496
- 48. Mao Y, Chen Y, Hackmann G, et al. Medical data mining for early deterioration warning in general hospital wards. In: Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW); 2011:1042-1049. 10.1109/ICDMW.2011.117
- 49. Mou Z, Godat LN, El-Kareh R, Berndtson AE, Doucet JJ, Costantini TW. Electronic health record machine learning model predicts trauma inpatient mortality in real time: a validation study. J Trauma Acute Care Surg. 2022;92(1):74-80. 10.1097/TA.0000000000003431
- 50. Nestor B, McCoy LG, Verma A, et al. Preparing a clinical support model for silent mode in general internal medicine. Proc Mach Learn Res. 2020;126:950-972. http://proceedings.mlr.press/v126/nestor20a.html
- 51. Paulson SS, Dummett BA, Green J, Scruth E, Reyes V, Escobar GJ. What do we do after the pilot is done? Implementation of a hospital early warning system at scale. Jt Comm J Qual Patient Saf. 2020;46(4):207-216. 10.1016/j.jcjq.2020.01.003
- 52. Bartkowiak B, Snyder AM, Benjamin A, et al. Validating the Electronic Cardiac Arrest Risk Triage (eCART) score for risk stratification of surgical inpatients in the postoperative setting: retrospective cohort study. Ann Surg. 2019;269(6):1059-1063. 10.1097/SLA.0000000000002665
- 53. Romero-Brufau S, Whitford D, Johnson MG, et al. Using machine learning to improve the accuracy of patient deterioration predictions: Mayo Clinic Early Warning Score (MC-EWS). J Am Med Inform Assoc. 2021;28(6):1207-1215. 10.1093/jamia/ocaa347
- 54. Verma AA, Murray J, Greiner R, et al. Implementing machine learning in medicine. CMAJ. 2021;193(34):E1351-E1357. 10.1503/cmaj.202434
- 55. Winslow CJ, Edelson DP, Churpek MM, et al. The impact of a machine learning early warning score on hospital mortality: a multicenter clinical intervention trial. Crit Care Med. 2022;50(9):1339-1347. 10.1097/CCM.0000000000005492
- 56. Chen J, Yan M, Howe RLC, et al. Biovitals™: a personalized multivariate physiology analytics using continuous mobile biosensors. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 2019:3243-3248.
- 57. Churpek MM, Yuen TC, Park SY, Gibbons R, Edelson DP. Using electronic health record data to develop and validate a prediction model for adverse outcomes in the wards. Crit Care Med. 2014;42(4):841-848. 10.1097/CCM.0000000000000038
- 58. Dummett BA, Adams C, Scruth E, Liu V, Guo M, Escobar GJ. Incorporating an early detection system into routine clinical practice in two community hospitals. J Hosp Med. 2016;11(Suppl 1):S25-S31. 10.1002/jhm.2661
- 59. Escobar GJ, Laguardia JC, Turk BJ, Ragins A, Kipnis P, Draper D. Early detection of impending physiologic deterioration among patients who are not in intensive care: development of predictive models using data from an automated electronic medical record. J Hosp Med. 2012;7(5):388-395. 10.1002/jhm.1929
- 60. Escobar GJ, Liu VX, Schuler A, Lawson B, Greene JD, Kipnis P. Automated identification of adults at risk for in-hospital clinical deterioration. N Engl J Med. 2020;383(20):1951-1960. 10.1056/NEJMsa2001090
- 61. Escobar GJ, Turk BJ, Ragins A, et al. Piloting electronic medical record-based early detection of inpatient deterioration in community hospitals. J Hosp Med. 2016;11(Suppl 1):S18-S24. 10.1002/jhm.2652
- 62. Granich R, Sutton Z, Kim YS, et al. Early detection of critical illness outside the intensive care unit: clarifying treatment plans and honoring goals of care using a supportive care team. J Hosp Med. 2016;11(Suppl 1):S40-S47. 10.1002/jhm.2660
- 63. Henry KE, Kornfield R, Sridharan A, et al. Human-machine teaming is key to AI adoption: clinicians’ experiences with a deployed machine learning system. NPJ Digit Med. 2022;5(1):97. 10.1038/s41746-022-00597-7
- 64. Burdick H, Pino E, Gabel-Comeau D, et al. Effect of a sepsis prediction algorithm on patient mortality, length of stay and readmission: a prospective multicentre clinical outcomes evaluation of real-world patient data from US hospitals. BMJ Health Care Inform. 2020;27(1):1-8. 10.1136/bmjhci-2019-100109
- 65. Desautels T, Calvert J, Hoffman J, et al. Prediction of sepsis in the intensive care unit with minimal electronic health record data: a machine learning approach. JMIR Med Inform. 2016;4(3):e28. 10.2196/medinform.5909
- 66. Shimabukuro DW, Barton CW, Feldman MD, Mataraso SJ, Das R. Effect of a machine learning-based severe sepsis prediction algorithm on patient survival and hospital length of stay: a randomised clinical trial. BMJ Open Respir Res. 2017;4(1):e000234. 10.1136/bmjresp-2017-000234
- 67. Herasevich S, Lipatov K, Pinevich Y, et al. The impact of health information technology for early detection of patient deterioration on mortality and length of stay in the hospital acute care setting: systematic review and meta-analysis. Crit Care Med. 2022;50(8):1198-1209. 10.1097/CCM.0000000000005554
- 68. Veldhuis LI, Woittiez NJC, Nanayakkara PWB, Ludikhuize J. Artificial intelligence for the prediction of in-hospital clinical deterioration: a systematic review. Crit Care Explor. 2022;4(9):e0744. 10.1097/CCE.0000000000000744
- 69. Mann KD, Good NM, Fatehi F, et al. Predicting patient deterioration: a review of tools in the digital hospital setting. J Med Internet Res. 2021;23(9):e28209. 10.2196/28209
- 70. Gerry S, Bonnici T, Birks J, et al. Early warning scores for detecting deterioration in adult hospital patients: systematic review and critical appraisal of methodology. BMJ. 2020;369:m1501. 10.1136/bmj.m1501
- 71. Kim JY, Boag W, Gulamali F, et al. Organizational governance of emerging technologies: AI adoption in healthcare. In: ACM International Conference Proceeding Series; 2023:1396-1417. 10.1145/3593013.3594089