. 2026 Mar 31;185(4):229. doi: 10.1007/s00431-026-06895-6

Machine learning to predict hospital admission at triage in paediatric emergency care: A meta-analysis

Blanca Paola Pérez 1, Octavio Galindo Osorio 2, Mónica Arias-Colinas 3,5, José Manuel Moreno 2, Nerea Martín-Calvo 1,3,4
PMCID: PMC13035534  PMID: 41912745

Abstract

Machine learning (ML) models have shown promise in improving outcome prediction and early risk stratification in paediatric emergency department (ED) triage. This review aims to evaluate the diagnostic performance of ML in predicting hospital admissions from the data collected at triage in paediatric EDs. Searches were conducted in PubMed, Ovid, Scopus, and Web of Science. Two reviewers screened 264 abstracts after duplicate removal, excluding 239 that did not meet the inclusion criteria. Of the 25 full texts assessed, 15 were excluded for outcome mismatch, leaving 10 for data extraction. Extracted data included population characteristics, ML methods, and diagnostic metrics: area under the curve (AUC), sensitivity, and specificity. Most studies used retrospective cohorts from electronic records or national databases. Sample sizes ranged from 9,069 to over 2.9 million. AUCs ranged from 0.78 to 0.97, with top-performing models (AUC ≥ 0.94) using random forest algorithms and variables such as age, heart rate, and triage level. Meta-analysis of six studies showed pooled sensitivity of 0.78 and specificity of 0.76 (AUC = 0.84), though heterogeneity was high (I2 = 100%). Conclusion: ML models have potential for paediatric ED triage. Standardized methods, explainable AI, and prospective validation are essential for clinical use.

What is Known:

• Traditional triage in paediatric emergency departments may have limitations in accurately predicting hospital admissions.

• Machine learning models are increasingly applied to improve risk stratification in clinical settings.

What is New:

• This review shows ML models can predict paediatric ED admissions with high AUCs (up to 0.97).

• Random forest algorithms using vital signs and triage data performed best.

Supplementary Information

The online version contains supplementary material available at 10.1007/s00431-026-06895-6.

Keywords: Machine learning, Deep learning, Paediatric emergency department, Triage, Hospitalization prediction

Introduction

Triage enables an initial structured patient assessment to determine clinical urgency and guide treatment prioritization. Beyond serving as the first point of contact, effective triage facilitates early identification of high-risk patients and efficient allocation of limited resources, which is essential for timely care and improved clinical outcomes [1]. Over time, paediatric emergency department (ED) visits have increased in both volume and acuity [2]. Of approximately 137 million annual ED visits in the United States, 30 million involve children [2, 3]. ED triage requires rapid clinical decisions under limited information and time. Common tools used for systematic patient assessment include the Emergency Severity Index (ESI) and Manchester Triage System (MTS) [1]. Within overloaded systems, it is essential to guide diagnosis and treatment accurately from the first point of contact in order to improve efficiency and resource allocation and to reduce negative outcomes.

Given these challenges, the integration of machine learning (ML) models into triage systems has emerged as a promising strategy to enhance early risk stratification and outcome prediction in paediatric emergency departments. Unlike conventional triage approaches based on predefined scoring rules, ML models can capture complex, nonlinear relationships among patient variables [1]. Deep learning (DL), a subset of ML, further extends these capabilities by enabling automated feature extraction and direct processing of raw, unstructured data through deep neural networks (DNNs) [4–6].

As ML increasingly informs clinical decision-making, it is essential to evaluate whether these models can be ethically and effectively integrated into ED triage workflows while preserving clinician oversight and autonomy. When applied cautiously, ML tools may complement clinical judgment by revealing previously unrecognized patterns and supporting decision-making in routine emergency care [4, 6].

Despite the expanding literature on ML applications in ED triage, no comprehensive systematic review and meta-analysis has evaluated the diagnostic performance of ML-based models applied at the point of triage in paediatric populations, with existing studies remaining fragmented due to heterogeneity in model design, data preprocessing, and evaluation metrics.

This systematic review and meta-analysis evaluates the performance of ML and DL models in predicting hospital admission among children presenting to paediatric EDs using information available at the triage stage. By focusing on models developed exclusively from triage-level data, including low-dimensional approaches, we assess whether early probabilistic predictions of hospitalization and critical outcomes can provide actionable support for emergency clinicians and hospital administrators. This study does not aim to replace clinical judgment but rather to examine whether data-driven tools applied at the point of triage can complement clinical decision-making and support anticipatory resource planning in resource-constrained and overcrowded emergency care settings.

Methods

Literature search and selection

The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidance was followed, and a review protocol was designed and registered in the international prospective register of systematic reviews (PROSPERO ID CRD42023445471).

Inclusion and exclusion criteria

This systematic review and meta-analysis included cross-sectional and retrospective cohort studies evaluating the accuracy of at least one ML algorithm in predicting hospital admission among children presenting to paediatric EDs at the point of triage, using physician admission or discharge decisions as the reference standard. Studies focusing on alternative outcomes (e.g., ICU admission, mortality, or diagnostic testing), those in which hospital admission could not be evaluated as a discrete outcome (e.g., composite endpoints or unavailable admission-specific performance metrics), and studies including individuals over 21 years of age were excluded. Studies restricted to specific conditions were included only when hospital admission was the primary outcome.

Only studies based exclusively on triage-level data were included, as a deliberate methodological choice to ensure clinical relevance, comparability, and interpretability. Triage represents a standardized and well-defined decision-making point, enabling direct comparison with established triage systems such as the Emergency Severity Index (ESI) and the Manchester Triage System (MTS). Focusing on this early stage minimizes methodological heterogeneity and limits variability introduced by institution-specific practices, clinician behaviour, and downstream diagnostic testing. Moreover, models incorporating data obtained later in the ED course, such as laboratory or imaging results, do not reflect real-world conditions in which rapid anticipatory resource allocation is required. By restricting the analysis to this clinically meaningful time point, we aimed to reduce confounding and enhance the validity and generalizability of the synthesized evidence.

Studies were searched in the medical bibliography databases PubMed, Ovid, Scopus, and Web of Science. Searches covered the period from database inception to 25 May 2024, and no temporal restrictions were applied to the search strategy.

The search strategies are available in Supplementary Table 1.

Two independent reviewers (BPP, NMC) conducted the search and article selection, resolving disagreements through discussion. A total of 492 records were identified, of which 228 duplicates were removed, leaving 264 records for title and abstract screening. After excluding 239 records, 25 full-text articles were assessed for eligibility. Fifteen articles were subsequently excluded, resulting in 10 studies included in the systematic review and meta-analysis. The study selection process is illustrated in Fig. 1.

Fig. 1. Flow diagram of the literature search and study selection process

In addition to the systematic database search, backward snowballing of reference lists was performed, identifying four additional records. Following full-text assessment, all were excluded for not meeting the predefined inclusion criteria, due to the absence of a paediatric population, non-cohort study design, or reliance on variables collected beyond the triage stage, thereby limiting relevance to triage-based decision-making.
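The record counts reported for the selection process can be cross-checked with simple arithmetic. The short sketch below uses only the numbers stated in the text; the variable names are illustrative and not taken from the review protocol.

```python
# Record counts as reported in the study selection paragraph
identified = 492             # records retrieved from the four databases
duplicates = 228             # duplicates removed
excluded_at_screening = 239  # excluded on title and abstract
excluded_at_full_text = 15   # excluded after full-text assessment

screened = identified - duplicates
full_text = screened - excluded_at_screening
included = full_text - excluded_at_full_text

print(f"screened={screened}, full_text={full_text}, included={included}")
```

The four records found by backward snowballing are not part of this chain, since all of them were excluded after full-text assessment.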

Data extraction and synthesis

Two independent reviewers (BPP, OG) extracted the relevant data from the selected articles following a standardized procedure. Extracted data included author, year of publication, country where the study was conducted, study population (sample size, age range, and sex distribution), sociodemographic factors, aims of the study, ML algorithm used, area under the curve (AUC), and diagnostic yield parameters (sensitivity, specificity, PPV, and NPV).

Hospitalization was defined as the final disposition recorded during an emergency department visit, regardless of the time elapsed from triage to that decision point.

Quality assessment

Three independent reviewers (BPP, OG, MAC) assessed methodological quality and risk of bias using the Prediction model Risk Of Bias Assessment Tool (PROBAST) framework [7], with disagreements resolved by a fourth reviewer (NMC). Domains evaluated included patient selection, index test, reference standard, and flow and timing, along with applicability concerns related to patient selection, index test, and reference standard.

Data analysis

Studies reporting true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values—or from which these could be derived—were included in the meta-analysis. Summary sensitivity, specificity, likelihood ratios, and odds ratios were estimated using meta-analytical methods for diagnostic accuracy studies, with corresponding heterogeneity statistics. To assess robustness, sensitivity analyses were conducted excluding studies rated as high risk of bias according to the PROBAST tool.

Analyses were performed and figures were produced using Stata version 15.0 (Stata Corp., College Station, TX, USA).
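As an illustration of how the per-model measures named above follow from a 2×2 table, the sketch below computes sensitivity, specificity, likelihood ratios, and the diagnostic odds ratio from TP, FP, TN, and FN counts. The counts are hypothetical, chosen only so that the resulting sensitivity and specificity equal the summary point estimates stated in the Results text (0.75 and 0.79); they do not come from any included study. The pooled estimates in this review were obtained with meta-analytical methods in Stata, not with this per-table arithmetic, and the function does not handle zero cells.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # admitted, predicted admitted
    fp: int  # discharged, predicted admitted
    tn: int  # discharged, predicted discharged
    fn: int  # admitted, predicted discharged


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Standard diagnostic accuracy measures for one 2x2 table."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_positive,
        "lr_negative": lr_negative,
        "diagnostic_odds_ratio": lr_positive / lr_negative,
    }


# Hypothetical counts (assumption): 100 admitted and 100 discharged children
example = ConfusionCounts(tp=75, fp=21, tn=79, fn=25)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```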

Results

Following the full-text review, ten studies met all eligibility criteria and were included in the final assessment of the utility of ML in triage. Table 1 provides an overview of the geographical origin, study design, sample size, and demographic characteristics of the included studies. Table 2 summarizes the methodology employed in each study.

Table 1.

Descriptive characteristics of the selected studies

| Study | Country | Design | Information source | Uni vs. multicenter | Analyzed time period | Sample size | % HA | % Male | Age range (years) | Median age (IQR) |
|---|---|---|---|---|---|---|---|---|---|---|
| Hatachi et al. (2018) | Japan | Retrospective cohort | SAKAI’s children emergency care center | Unicenter | August 2016 – October 2019 | 88,283 | 2 | 52.4 | < 16 | 3.9 |
| Goto et al. (2019) | USA | Retrospective cohort | CDC National health statistics | Multicenter | January 2007 – December 2015 | 52,037 | 5 | 52.0 | < 18 | 6 (2–14) |
| Wolff et al. (2019) | Chile | Retrospective cohort | Exequiel González Cortés Hospital database | Unicenter | August 2014 – October 2017 | 189,718 | 5 | 52.9 | < 18 | 2.9 |
| Roquette et al. (2020) | Brazil | Retrospective cohort | Sabara Children’s hospital database | Unicenter | January 2015 – August 2018 | 499,853 | 6 | 52.9 | < 18 | 2.9 (1.4–5.7) |
| Heyming et al. (2021) | USA | Retrospective cohort | EHRR Children’s Hospital Orange County California | Unicenter | March 2013 – December 2020 | 585,142 | 9 | 53.55 | < 21 | 4 |
| Kwon et al. (2021) | South Korea | Retrospective cohort | Korean NEDIS database | Multicenter | January 2014 – June 2016 | 2,937,078 | 13 | 57.4 | < 15 | |
| Hwang et al. (2022) | South Korea | Cross sectional observational study | Korean NEDIS database | Multicenter | January 2016 – December 2017 | 262,171 | 12 | 57.2 | < 15 | 3 (1.0–7.0) |
| Leonard et al. (2022) | Ireland | Retrospective cohort | EHRR Data | Unicenter | January 2017 – December 2018 | 72,229 | 15 | 55.1 | < 18 | |
| Patel et al. (2018) | USA | Retrospective cohort | EHRR Data | Unicenter | January 2012 – December 2015 | 29,392 | 17 | 58 | 2–18 | |
| Sills et al. (2021) | USA | Retrospective cohort | EHRR Colorado EDs | Multicenter | January 2009 – December 2013 | 9069 | 22 | 62.7 | 2–21 | 6 (4–10) |

Centers for Disease Control (CDC), Electronic Health Record Registry (EHRR), Emergency Department (ED), Hospital Admission (HA), Interquartile Range (IQR), National Emergency Department Information System (NEDIS), United States of America (USA), and Sakai’s Children Emergency Care Center (SAKAI)

Table 2.

Information on ML models used in the selected studies

| Author and Date | Data pre-processing | Random assortment of data in training and testing | Validation of training sets | Models developed | Variables assessed | High discrimination variables | Diagnostic yield parameters |
|---|---|---|---|---|---|---|---|
| Hatachi et al. (2018) | Exclusion criteria applied; Handling missing values; Feature engineering; EHR re-labeling; Encoding categorical values | No. Preperiod cohort training set from August 2016 to Oct 2018; Postperiod validation cohort Oct 2018 to Oct 2019 | Yes. Tenfold cross validation with stratification | LR; SVM; RF; EGBx | Demographics; Vital signs; Triage level; Chief complaints; Revisit data | Age; Triage level; O2 sat; HR; Revisit within 24 h | AUC-ROC |
| Goto et al. (2019) | Exclusion criteria; Handling missing values; Feature engineering; Data cleaning; Encoding categorical values | Yes. Random sample training cohort; training set was 70% of the data set | Yes. Tenfold cross validation applied in training | LR with lasso regularization; RF; GBM; DT; DNN | Demographics; Arrival mode; Vital signs; Chief complaints; Comorbidities; ED revisit | Age; RR; O2 sat; Arrival by ambulance; Pulse rate | AUC-ROC |
| Wolff et al. (2019) | Data cleaning; Class imbalance correction; Case re-labeling; Encoding categorical values | Random stratified cross validation | Yes. Fivefold cross-validation; Hold-out validation | RF; DL; NB; SVM | Demographics; Trauma; HR, RR, Oxygen saturation; BP; Temp; ESI; Prior ED visits; Pain VAS; LOC | Age; Sex; Trauma; HR; ESI | AUC |
| Roquette et al. (2020) | Exclusion criteria; Feature engineering; Transformation of values; Normalization of numerical values | No. Time split; Training set Jan 2015 – Apr 2018; Test set May 2018 – August 2018 | Yes. Train-validation split for DLM; tree based models used 3 split time series cross validation | SVM; Elastic Net; DNN; GBM; Hybrid model (deep learning and gradient boosting) | Demographics; Triage variables; Vital signs; Past visit data; Triage text data | Triage text embeddings; MTS; Time interval between current and last visit; O2 sat; Age | AUROC |
| Heyming et al. (2021) | Exclusion criteria; Encoding categorical values | Yes. Training set 50%; Validation set 10%; Test set 40% | Yes. 10% of the data was used in the validation process | XGBoost | Triage vital signs: Temp, BP, HR, O2 sat and RR; Nursing assessment: capillary refill, appearance, LOC, skin assessment; ESI; Demographics; Historical data | History of IV medication use; ESI level; CRT | AUROC; Sensitivity; PPV; NPV |
| Kwon et al. (2021) | Exclusion criteria; Data normalization; Class imbalance correction; Encoding categorical values | No. Derivation data Jan 2014 – June 2016; Test July 2016 – December 2016 | No. Performance was assessed only on the test set | DL; RF; LR | Demographics; Chief complaint; Symptom onset to arrival time; Arrival mode; Vital signs: HR, RR, Body temp, Mental status | HR; Arrival mode; Chief complaint; RR; Temp | AUROC |
| Hwang et al. (2022) | Exclusion criteria; Data normalization; Missing value handling | Yes. Data under sampling | Yes. Cross validation with AUROC | RF | Demographics; Vital signs; Mental status; Arrival mode; Trauma status; Time variable | Age; Body temp; Time from symptom onset to ED visit; HR; RR | AUROC |
| Leonard et al. (2022) | Exclusion criteria; Feature engineering | Yes. Training set 70%; Testing set 30% | Yes. Internal validation was conducted on the test set | NB; LR; GBM | Presenting complaint; ICTS; Referral source; Registration month; ED location; Distance traveled; Admission history; Weekday | Presenting complaint; ICTS; Referral source; Registration month; ED location | AUC |
| Patel et al. (2018) | Exclusion criteria applied; Data normalization: categorical and continuous variable normalization | Yes. Training set 80%; Test set 20% | Yes. Three-fold cross validation | DT; RF; Lasso LR; GBM | Demographics; Socioeconomic status (SES); Vital signs; Triage score; Weight for age; Weather data; Community viral load | O2 sat; RR; Triage acuity score (ESI level); HR; Weight for age Z score | AUROC |
| Sills et al. (2021) | Exclusion criteria applied; Missing data handling; Feature engineering | Yes | Yes. AutoML framework | AutoML; RF; LR | Demographics; Triage variables; Clinical variables; Treatment variables | Prior hospitalization; ESI; TT; Time to first medication; Age | AUROC |

Area Under the Curve (AUC), Auto Machine Learning (AutoML), Capillary Refill Time (CRT), Centers for Disease Control (CDC), Critical Care Outcomes (CCO), Decision Tree (DT), Deep Learning (DL), Deep Neural Network (DNN), Electronic Health Record Registry (EHRR), Elastic Net Regularization (ENR), Emergency Department (ED), Emergency Severity Index (ESI), Gradient Boosting Machines (GB or GBM), Heart Rate (HR), Intensive Care Unit (ICU), Irish Children’s triage system (ICTS), Logistic Regression (LR), Manchester Triage System (MTS), National Health Center Statistics (NHAMCS), Negative Predictive Value (NPV), Oxygen saturation (O2 sat), Positive Predictive Value (PPV), Random Forest (RF), Respiratory Rate (RR), Retrospective Cohort (RC), Support Vector Machine (SVM), Temperature (Temp), Time to Triage (TT), True Negative (TN), and True Positive (TP)

Quality assessment of the selected studies

The models generally aligned well with real-world clinical settings and populations, indicating high applicability (Supplementary Fig. 1). However, selection and analysis biases require cautious interpretation before clinical implementation. PROBAST assessment showed that 6 studies [8, 10–13, 16] had a high overall risk of bias, suggesting possible issues in study design, data handling, or model evaluation affecting reliability.

Despite these methodological issues, all reviewed artificial intelligence (AI) models demonstrated high clinical applicability, reinforcing their potential utility in ED triage systems. In this context, low-dimensional models are of particular interest, as they achieved AUCs comparable to those of high-dimensional models while offering superior applicability, transparency, and practicality for real-world system implementation [13].

Lastly, several models relied on clinician judgment as the gold standard for hospitalization decisions, which is problematic due to institutional variability and non-clinical influences. Only Patel et al. and Wolff et al. used standardized, objective criteria, improving reproducibility [9, 12]. Leonard et al., Sills et al. and Roquette et al. followed the ICTS, ESI and MTS, respectively—all internationally recognized—but many studies did not explain their hospitalization criteria, making comparisons between models difficult [1, 10, 13, 14].

Most discriminatory variables

Table 3 summarizes the performance of predictive models for early risk stratification. AUCs ranged from 0.78 to 0.97, reflecting strong performance across various ML architectures. Heyming et al.’s RF model achieved the highest AUC of 0.97 (95% CI: 0.93–0.98), identifying ESI level, capillary refill, and prior intravascular medication use as key predictors [11]. Hwang et al. reported an AUC of 0.94 with RF, highlighting age, temperature, symptom onset-to-ED time, and heart rate [16]. Roquette et al.’s hybrid model (DNN + XGBoost) reached 0.89 (95% CI: 0.885–0.90), relying on MTS, age, past visits, and oxygen saturation [10]. Sills et al. used AutoML (AUC 0.91) and Hatachi et al. XGBoost (AUC 0.86), both emphasizing age, medical history, and time-sensitive ED variables like time to triage or medication [8, 14]. DL-based models by Goto and Kwon reported AUCs of 0.80 and 0.78, relying mainly on vitals (respiratory rate, pulse, oxygen saturation) and arrival modality [15, 17]. Leonard et al. reported an AUC of 0.853 for their GBM model, in which the highest-yield variables for predicting hospitalization risk were presenting complaint, ICTS, referral source, registration month, and ED location [13]. Common high-discrimination variables across studies included age (7 of 10 studies), SaO₂, heart rate, triage level/system, and respiratory rate, all showing strong prognostic value.

Table 3.

Summary of the variables with higher discrimination ability and the AUC (95% Confidence Interval (CI)) of the most discriminatory model in the selected studies

| Study | Best model | AUC (95% CI) | High discrimination variables |
|---|---|---|---|
| Hatachi et al. (2017) | XGBoost | 0.86 (0.84–0.88) | Age, triage level, oxygen saturation, heart rate, prior ED visits |
| Goto et al. (2019) | DNN | 0.80 (0.78–0.81) | Age, respiratory rate, oxygen saturation, ambulance arrival, pulse rate |
| Wolff et al. (2019) | DL | 0.79 | Age, sex, trauma status, heart rate, ESI |
| Roquette et al. (2020) | Hybrid model (DNN + XGBoost) | 0.89 (0.89–0.90) | Manchester triage system, age, past visits, oxygen saturation |
| Heyming et al. (2021) | RF | 0.968 (0.967–0.969) | ESI level, capillary refill time, historical intravascular medication use |
| Kwon et al. (2021) | DL | 0.782 (0.780–0.783) | Heart rate, arrival mode, respiratory rate, temperature |
| Hwang et al. (2022) | RF | 0.94 | Age, body temperature, symptom onset to ED time, heart rate |
| Leonard et al. (2022) | GBM | 0.853 (0.846–0.859) | Presenting complaint, ICTS, Referral source, Registration month, ED location |
| Patel et al. (2018) | GBM | 0.84 (0.83–0.85) | Oxygen saturation, respiratory rate, triage ESI, weight-for-age z-score |
| Sills et al. (2021) | AutoML | 0.91 | Time to triage, time to first medication, age, prior hospitalization |
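For readers less familiar with the AUC values reported above, the measure can be read as the probability that a randomly chosen admitted child receives a higher predicted risk than a randomly chosen discharged child. The sketch below computes it by direct pairwise comparison; the function name and the risk scores are hypothetical and serve only to illustrate the definition, not any included study's method.

```python
def auc_from_scores(scores_admitted, scores_discharged):
    """AUC as the share of (admitted, discharged) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for a in scores_admitted:
        for d in scores_discharged:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(scores_admitted) * len(scores_discharged))


# Hypothetical predicted admission risks (assumption, for illustration only)
admitted = [0.9, 0.8, 0.4]
discharged = [0.7, 0.3, 0.2, 0.1]

auc = auc_from_scores(admitted, discharged)
print(f"AUC = {auc:.3f}")
```

In this toy example 11 of the 12 possible pairs are ranked correctly; the only error is the admitted child scored 0.4 against the discharged child scored 0.7.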

Meta-analytical integration of diagnostic accuracy studies

The diagnostic performance of ML models for predicting paediatric ED hospitalizations was evaluated using a Summary Receiver Operating Characteristic (SROC) curve, as shown in Fig. 2. Studies with sufficient data to be analyzed were Wolff et al. (4 models), Patel et al. (1 model), Hatachi et al. (4 models), Goto et al. (4 models), Heyming et al. (1 model), Leonard et al. (3 models) and Roquette et al. (2 models) [8–13, 16]. Data on TP, TN, FP, and FN can be found in Supplementary Table 2. The summary operating point, represented by a red diamond, corresponds to the pooled sensitivity and specificity estimates across all included models. The summary sensitivity was 0.75 (95% CI: 0.68–0.80), indicating that the models correctly identified 75% of true hospitalization cases. The summary specificity was 0.79 (95% CI: 0.74–0.84), reflecting the proportion of non-hospitalized patients that the models classified correctly. Overall, these results indicate that ML models exhibit high diagnostic accuracy (AUC = 0.84; 95% CI 0.80–0.87) in predicting ED hospitalizations.

Fig. 2. SROC with Prediction & Confidence contours of the included studies

The test for heterogeneity showed significant heterogeneity (p < 0.001) among studies. Furthermore, an I2 value of 100% suggests considerable inconsistency across studies.
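To give a sense of why I2 can approach 100% when the included cohorts are very large, the sketch below computes Cochran's Q and the Higgins-Thompson I2 for a set of effect estimates. This is the conventional univariate definition and is an assumption here: the Stata routine used for the bivariate analysis may compute heterogeneity differently. The estimates and variances are hypothetical, not taken from the included studies.

```python
def cochran_q(estimates, variances):
    """Cochran's Q: inverse-variance weighted squared deviations from the pooled estimate."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))


def i_squared(q, n_studies):
    """Higgins-Thompson I2 in percent, truncated at zero."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical logit-scale estimates from three models (assumption)
estimates = [0.9, 1.1, 1.3]

# Same spread of estimates, two levels of precision
precise = [0.0001] * 3   # very large samples, tiny within-study variance
imprecise = [0.04] * 3   # small samples, large within-study variance

i2_precise = i_squared(cochran_q(estimates, precise), len(estimates))
i2_imprecise = i_squared(cochran_q(estimates, imprecise), len(estimates))
print(f"I2 with precise studies:   {i2_precise:.2f}%")
print(f"I2 with imprecise studies: {i2_imprecise:.2f}%")
```

With identical point estimates, the statistic moves from near zero to near 100% purely because the within-study variances shrink, which is one reason an extreme I2 is common in syntheses of studies with hundreds of thousands of visits.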

The sensitivity analysis in Wolff et al. (4 models) and Goto et al. (4 models) showed similar diagnostic accuracy (AUC = 0.78; 95% CI 0.74–0.81) and heterogeneity among studies (I2 = 100%) (Supplementary Fig. 2) [9, 16].

Discussion

ML, DL and classical triage systems

This systematic review and meta-analysis synthesized findings from studies evaluating the diagnostic performance of ML models in predicting hospitalization outcomes from data collected at triage. The results indicate that ML-based triage models demonstrate high discriminative ability, with an overall AUC of 0.84 (95% CI: 0.80–0.87) (Fig. 2). This high discriminative performance has important implications for resource allocation and for alleviating overcrowding in emergency departments, particularly in underserved settings with high patient volumes. Models capable of providing an early, probabilistic estimate of admission risk before diagnostic workup is conducted may facilitate more efficient patient streaming and bed planning, thereby optimizing patient flow.

Conventional triage systems have long served as the primary tools for patient stratification and resource allocation in emergency departments. However, multiple studies have shown that machine learning–based models can achieve superior discriminative performance compared with established triage approaches such as the Korean Triage and Acuity Scale (KTAS), the ESI, and the MTS, supporting their potential role as an adjunct or alternative to traditional triage systems [11, 16, 17].

Interestingly, some included studies compared ML models with traditional triage systems. Hwang et al. found their RF model significantly outperformed the paediatric Korean Triage and Acuity Scale in predicting ICU admissions and hospitalizations, highlighting limitations of conventional systems [16]. Heyming et al. also supported ML-based triage, though their results may be unreliable due to unaddressed class imbalance [11].

Other studies compared different ML models. Patel et al. reported that their gradient-boosting machines (GBMs) had the highest predictive performance for hospitalization, reinforcing the value of ensemble-based methods in risk stratification [12]. Similarly, Hatachi et al. found GBMs performed best in their dataset but noted a tendency to over-triage, generating more false positives than expected [8]. Prioritizing sensitivity in emergency settings helps avoid missing critically ill patients, but excessive false positives can lead to unnecessary admissions, resource strain, and increased costs.

Sills et al. found that an AutoML approach improved hospitalization prediction compared to RF and LR, demonstrating the potential for automated model selection to optimize algorithmic performance without extensive manual tuning [14].

In several of the reviewed studies [9, 10, 15, 17] DL algorithms outperformed traditional triage and other ML models, suggesting they may be the most effective AI tools for evaluating complex relationships in large datasets. To provide context, ML and DL are subfields of AI. ML typically requires manual feature engineering to extract predictors, as it cannot capture complex relationships from raw inputs, while DL automates feature extraction through multi-layered DNNs and independently identifies complex patterns in unstructured data [4].

Roquette et al. was the only study to develop a hybrid ML model (DNN + XGBoost), which achieved the highest AUC (0.89, 95% CI: 0.88–0.90) among DL-based approaches. While potentially advantageous, it requires physicians to better understand how to generate and apply data for such complex models [10]. The study uniquely integrated unstructured clinical text via language normalization using DNNs, replacing the previously manual encoding needed for ML models. Surprisingly, this feature had one of the highest discriminatory capacities among all variables [10].

Wolff et al. were an outlier among the studies using DL algorithms: they found that RF outperformed DL and demonstrated characteristically high NPVs. High NPVs were a common trend among the models and could be clinically significant, as they suggest that an ML model could reliably rule out critical conditions in clinical practice, much as a negative D-dimer result can exclude a diagnosis [9].
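When interpreting high NPVs, it may help to see how predictive values depend on the admission rate as well as on the model. The sketch below applies Bayes' rule to the summary sensitivity and specificity stated in the Results text (0.75 and 0.79) at the lowest and highest admission rates listed in Table 1 (2% and 22%). Combining pooled estimates with individual study prevalences is an illustrative assumption, not an analysis performed in this review.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given sensitivity, specificity and outcome prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


SENSITIVITY = 0.75  # summary estimate quoted in the Results text
SPECIFICITY = 0.79  # summary estimate quoted in the Results text

# Lowest and highest % HA in Table 1
ppv_low, npv_low = predictive_values(SENSITIVITY, SPECIFICITY, 0.02)
ppv_high, npv_high = predictive_values(SENSITIVITY, SPECIFICITY, 0.22)

print(f"Admission rate  2%: PPV={ppv_low:.3f}, NPV={npv_low:.3f}")
print(f"Admission rate 22%: PPV={ppv_high:.3f}, NPV={npv_high:.3f}")
```

Under these assumptions the NPV is higher, and the PPV lower, at the 2% admission rate than at 22%, so part of a high NPV reflects how uncommon admission is in the studied population rather than model quality alone.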

Despite promising results, these models cannot be fully relied on without external validation, as their performance on unseen data remains unknown—leaving the key ML goal of generalizability unmet. DL also raises concerns due to its low interpretability—the ‘black box effect’—making it inappropriate to base clinical judgment on algorithms that lack transparent reasoning [4, 20]. For ethical AI integration, clinicians must understand the logic behind any AI recommendations, as they remain ultimately responsible for clinical decisions [5].

There has been notable progress in the quality and transparency of ML development in paediatric ED triage, particularly through greater emphasis on data preprocessing, which is crucial for model reliability. Most studies in this review applied systematic preprocessing steps such as exclusion criteria, data normalization, categorical encoding, and handling missing data (discussed later). This marks a clear improvement over earlier research, where such steps were often omitted, limiting reproducibility and comparability of ML findings [20].

Clinical predictors of hospital admission from triage

In asthma-related studies, respiratory rate and oxygen saturation emerged as the strongest predictors of hospitalization. In contrast, no single variable consistently predicted hospital admission across studies addressing general paediatric pathologies, likely reflecting their multifactorial nature. Instead, combinations of variables—such as clinical history, mode of transport, and vital signs including capillary refill time—were repeatedly identified as relevant predictors.

Age has been widely discussed as a fundamental parameter for predicting hospitalization across most studies. However, Leonard et al. did not include age as an input variable in their model, despite its incorporation into the Irish Children’s Triage System (ICTS) [13].

Utility of low dimensional models

Leonard et al. demonstrate that high-performing ML models do not necessarily require high-dimensional inputs or complex architectures, as low-dimensional models achieved discrimination comparable to high-dimensional approaches (AUC 0.835 vs. 0.853) [13]. Notably, a GBM using only eight routinely collected triage variables showed strong predictive performance while preserving interpretability and ease of implementation. This is particularly relevant for settings with limited electronic infrastructure and supports the feasibility of simpler, transparent ML models over complex DL systems that contribute to the “black box” effect. Nevertheless, these findings should be interpreted with caution, as the study did not adequately address overfitting and was rated as high risk of bias under the PROBAST framework, limiting confidence in the reported results.

The advantages and disadvantages of the different ML models are described in Table 4.

Table 4.

General advantages and disadvantages of the ML and DL models

| Model | Advantages | Disadvantages |
|---|---|---|
| DL | Excels at detecting complex patterns in imaging and non-structured data (EHRs and unstructured clinical narratives, e.g., anamnesis); Heterogeneous data handling; Improves with large data sets; Outperforms classical models in natural language processing, computer vision and EHR analysis; Minimal need for manual feature engineering | Black box nature; Requires large and labelled datasets; Computationally expensive +++ |
| LR | Simple and interpretable; Well-suited for structured data with clear linear relationships; Suitable for clinical risk prediction and binary classification tasks; Fast to train and deploy in clinical settings | Sensitive to missing values; Struggles with complex non-linear relationships (e.g., in imaging or genomic data) |
| RF | Robust to missing data; Favorable with structured tabular data; Improved interpretability; Avoids overfitting | Computationally expensive ++; Limited performance on unstructured data such as free-text EHRs or imaging; Limited predictive power |
| GBM | Strong predictive power with higher accuracy; Favorable with structured tabular data; Handles imbalanced datasets | Prone to overfitting without careful regularization; Poor performance on unstructured data |
| SVM | Favorable in small to medium datasets; Robust against overfitting in small settings | Poor performance with unstructured data; Requires extensive feature engineering and parameter tuning; Computationally expensive; Limited interpretability for clinicians |
| NB | Computationally efficient for small datasets; Performs well with small sample sizes and high-dimensional data | Requires feature independence; Performs poorly with highly correlated features |

Strengths and limitations

Our study has several strengths and limitations. A key strength is its rigorous methodology: it adheres to PRISMA guidelines for transparency and reproducibility and uses the PROBAST framework to assess the quality and risk of bias of the included studies [7]. Furthermore, to our knowledge, this is the first systematic review and meta-analysis to evaluate the diagnostic accuracy of ML models in predicting hospital admission in the paediatric ED from data obtained at triage, providing a comprehensive synthesis of the available evidence. A sensitivity analysis was also conducted to assess the robustness of the findings.

Despite the promising findings, limitations must be acknowledged. The small number of included studies and the high degree of heterogeneity limit generalizability. This heterogeneity stems from variations in data preprocessing, variable selection, and outcome definitions, and probably also from hospitalization timing, which was unspecified across studies. Future meta-analyses should address these issues through study stratification or meta-regression. In addition, the absence of a standardized definition of hospital admission, which is often influenced by subjective clinical judgment, affects model reproducibility and warrants further investigation using more objective criteria. Most models were developed using retrospective data with limited external validation, highlighting the need for prospective studies to confirm clinical utility. Incomplete reporting in some studies also hampered quality assessment. Finally, we restricted inclusion to studies using data available at triage. This restriction facilitated comparability and avoided disparities in training data that could lead to biased or overly optimistic estimates of diagnostic performance, but it excluded models incorporating additional diagnostic data, potentially limiting the identification of relevant predictors. Future research should incorporate a broader range of studies while accounting for differences in model development and training to support the advancement of ML implementation in paediatric emergency department triage.
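The degree of heterogeneity discussed above is commonly summarized with the I² statistic, derived from Cochran's Q under inverse-variance weighting. The sketch below shows that standard definition in plain Python. It is not the procedure used in this review, whose software and settings are not described in this section; the function name `i_squared` and the study-level values (logit-transformed sensitivities with their variances) are hypothetical.

```python
def i_squared(estimates, variances):
    """Higgins I² (%) from study-level estimates and their variances.

    Q  = sum of w_i * (theta_i - pooled)^2, with w_i = 1 / variance_i
    I² = max(0, (Q - df) / Q) * 100, with df = number of studies - 1
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0.0:
        return 0.0  # no observed dispersion between studies
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical logit-sensitivities and variances (illustrative only).
logit_sens = [0.4, 1.3, 2.1, 0.9]
variances = [0.002, 0.001, 0.004, 0.003]

print(round(i_squared(logit_sens, variances), 1))
```

With very large samples the within-study variances become tiny, so even modest differences between studies produce a large Q and an I² close to 100%. This is one reason stratification or meta-regression is needed to interpret heterogeneity in reviews of large registry-based cohorts.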

Conclusion

This systematic review and meta-analysis showed that ML and DL models achieve high diagnostic performance and strong discriminative ability when predicting hospital admission from triage data in paediatric ED settings. However, substantial heterogeneity limits definitive conclusions, and the limited interpretability of DL models highlights the need for explainable approaches to support clinical decision-making. Despite these limitations, ML-based triage has considerable potential to improve ED efficiency, optimize resource use, and enhance paediatric patient outcomes.

Supplementary Information

Below is the link to the electronic supplementary material.

ESM1 (DOCX, 130.3 KB)

ESM2 (DOCX, 552.2 KB)

ESM3 (DOCX, 13.1 KB)

ESM4 (DOCX, 13.5 KB)

Abbreviations

DNN: Deep neural network

ED: Emergency department

EHRs: Electronic health records

EHRR: Electronic health record registry

ENR: Elastic net regularization

ESI: Emergency severity index

FN: False negative

FP: False positive

GBM: Gradient boosting machines

ICU: Intensive care unit

LR: Logistic regression

ML: Machine learning

MTS: Manchester triage system

NB: Naive Bayes

NEDIS: National emergency department information system

NHAMCS: National Hospital Ambulatory Medical Care Survey

RF: Random forest

SROC curve: Summary receiver operating characteristic curve

SVM: Support vector machine

TN: True negative

TP: True positive

XGBoost: Extreme gradient boosting

Authors' contributions

Dr Blanca Paola Pérez conceptualized and designed the study, conducted the literature search, extracted the data, drafted the initial manuscript and critically reviewed and revised the manuscript. Dr Octavio Galindo Osorio conceptualized and designed the study, conducted the literature search, extracted the data and critically reviewed and revised the manuscript. Dr Nerea Martín-Calvo conceptualized and designed the study, coordinated and supervised data collection, analysed and synthesized the study data and critically reviewed and revised the manuscript. Dr Mónica Arias-Colinas extracted the data, drafted the initial manuscript, critically reviewed and revised the manuscript and formatted it for publication. Dr José Manuel Moreno coordinated and supervised data collection, critically reviewed and revised the manuscript.

Funding

Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Data availability

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

All authors approved the final manuscript as submitted and agreed to be accountable for all aspects of the work.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Baumann MR, Strout TD (2005) Evaluation of the emergency severity index (version 3) triage algorithm in pediatric patients. Acad Emerg Med 12(3):219–224
  • 2. Tang N, Stein J, Hsia RY, Maselli JH, Gonzales R (2010) Trends and characteristics of US emergency department visits, 1997–2007. JAMA 304(6):664–670
  • 3. Greenwood-Ericksen MB, Kocher K (2019) Trends in emergency department use by rural and urban populations in the United States. JAMA Netw Open 2(4):e191919
  • 4. Quinn C (2024) Artificial intelligence in medicine: A practical guide for clinicians. Campion Quinn
  • 5. Xing L (2021) Artificial intelligence in medicine: Technical basis and clinical applications. Elsevier, Amsterdam, pp i–iii
  • 6. Rajkomar A, Dean J, Kohane I (2019) Machine learning in medicine. N Engl J Med 380(14):1347–1358
  • 7. Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS et al (2019) PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med 170(1):51–58. 10.7326/M18-1376
  • 8. Hatachi T, Hashizume T, Taniguchi M, Inata Y, Aoki Y, Kawamura A et al (2022) Machine learning-based prediction of hospital admission among children in an emergency care center [Internet]. Available from: https://pubmed.ncbi.nlm.nih.gov/36719388/
  • 9. Wolff P, Ríos SA, Graña M (2019) Setting up standards: A methodological proposal for pediatric triage machine learning model construction based on clinical outcomes. Expert Syst Appl 138:112788
  • 10. Roquette BP, Nagano H, Marujo EC, Maiorano AC (2020) Prediction of admission in pediatric emergency department with deep neural networks and triage textual data. Neural Netw 126:170–177
  • 11. Heyming TW, Knudsen-Robbins C, Feaster W, Ehwerhemuepha L (2021) Criticality index conducted in pediatric emergency department triage. Am J Emerg Med 48:209–217
  • 12. Patel SJ, Chamberlain DB, Chamberlain JM (2018) A machine learning approach to predicting need for hospitalization for pediatric asthma exacerbation at the time of emergency department triage. Acad Emerg Med 25(12):1463–1470
  • 13. Leonard F, Gilligan J, Barrett MJ (2022) Development of a low-dimensional model to predict admissions from triage at a pediatric emergency department. JACEP Open 3(4):e12779. 10.1002/emp2.12779
  • 14. Sills MR, Ozkaynak M, Jang H (2021) Predicting hospitalization of pediatric asthma patients in emergency departments using machine learning. Int J Med Inform 151:104468. 10.1016/j.ijmedinf.2021.104468
  • 15. Kwon JM, Jeon KH, Lee M, Kim KH, Park J, Oh BH (2021) Deep learning algorithm to predict need for critical care in pediatric emergency departments. Pediatr Emerg Care 37(12):e988–e994. 10.1097/PEC.0000000000001858
  • 16. Hwang S, Lee B (2022) Machine learning-based prediction of critical illness in children visiting the emergency department. PLoS One 17(2):e0264184. 10.1371/journal.pone.0264184
  • 17. Goto T, Camargo CA, Faridi MK, Freishtat RJ, Hasegawa K (2019) Machine learning-based prediction of clinical outcomes for children during emergency department triage. JAMA Netw Open 2(1):e187754
  • 18. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M et al (2018) Scalable and accurate deep learning with electronic health records. NPJ Digit Med 1:18. 10.1038/s41746-018-0029-1
  • 19. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K et al (2019) A guide to deep learning in healthcare. Nat Med 25:24–29
  • 20. Naemi A, Schmidt T, Mansourvar M, Naghavi-Behzad M, Ebrahimi A, Wiil UK (2021) Machine learning techniques for mortality prediction in emergency departments: a systematic review. BMJ Open 11(11):e051920


Articles from European Journal of Pediatrics are provided here courtesy of Springer
