Author manuscript; available in PMC: 2025 Nov 1.
Published in final edited form as: Pediatr Crit Care Med. 2024 Jun 21;25(11):1047–1050. doi: 10.1097/PCC.0000000000003556

The 2024 Pediatric Sepsis Challenge: Predicting in-hospital mortality in children with suspected sepsis in Uganda

Charly Huxford 1, Alireza Rafiei 2, Vuong Nguyen 1, Matthew O Wiens 1,3,4, J Mark Ansermino 1,3,4, Niranjan Kissoon 1,4,5, Elias Kumbakumba 12, Stephen Businge 11, Clare Komugisha 6, Mellon Tayebwa 6, Jerome Kabakyenga 9,10, Nathan Kenya Mugisha 6, Rishikesan Kamaleswaran 7,8; Pediatric Sepsis CoLab Members
PMCID: PMC11534513  NIHMSID: NIHMS1994352  PMID: 38904442

Abstract

The aim of this Technical Note is to inform the pediatric critical care data research community about the 2024 Pediatric Sepsis Data Challenge. This competition aims to facilitate the development of open-source algorithms to predict in-hospital mortality in Ugandan children with sepsis. The challenge is first to develop an algorithm using a synthetic training dataset; the algorithm will then be scored according to standard diagnostic testing criteria and evaluated against a non-synthetic test dataset. The datasets originate from admissions to 6 hospitals in Uganda (2017 to 2020) and include 3,837 children, aged 6 to 60 months, who were confirmed or suspected to have a diagnosis of sepsis. The synthetic dataset was created from a random subset of the original data. The test validation dataset closely resembles the synthetic dataset. The challenge should generate an optimal model for predicting in-hospital mortality. Following external validation, this model could be used to improve the outcomes for children with proven or suspected sepsis in low- and middle-income settings.

Keywords: Sepsis, algorithms, In-Hospital Mortality, competition, early detection and treatment, evaluation metrics, generalizability, open-source algorithms


Sepsis is the leading cause of death in children in low- and middle-income countries (LMICs). In 2017, there were an estimated 48·9 million cases of sepsis and 11 million sepsis-related deaths worldwide, with 85% of cases and deaths occurring in LMICs [1]. Children accounted for about half of these cases, and nearly 3 million deaths occurred in children under 5 years of age. From 2017 to 2020, a prospective epidemiological study was carried out in six Ugandan hospitals, examining children aged under 5 years who had been admitted with a suspected or confirmed diagnosis of sepsis [2]. Of note, the study found that of 3,837 children aged 6 to 60 months, 164 (4.3%) died in-hospital.

We believe that early risk-stratification of children with suspected or confirmed sepsis at the time of admission may improve outcomes. Such risk-stratification may also serve as a surrogate for late presentation that could inform future policies and community education initiatives for early recognition of sepsis by patients and caregivers. The 2024 Pediatric Sepsis Data Challenge was therefore developed to provide an opportunity to address this gap. We propose this challenge to the pediatric critical care research community at large, and it should be of interest to data science researchers involved with risk-prediction modeling based in North American [3–5] and LMIC [6] settings.

CHALLENGE DATA SOURCE

The data for the 2024 Pediatric Sepsis Data Challenge comes from a deidentified, curated research dataset of a study called “Smart discharges to improve post-discharge health outcomes in children: A prospective before-after study with staggered implementation”. The original study was approved by the Mbarara University of Science and Technology (MUST) Research Ethics Committee (REC 15/10-16, approved November 28, 2017), the Uganda National Institute of Science and Technology (HS 2207, approved April 12, 2017) and the University of British Columbia/Children & Women’s Health Centre of British Columbia (UBC/C&W) Research Ethics Board (REB, H16-02679, approved May 9, 2017). Caregivers/participants provided informed consent for depositing the curated dataset in an open data repository and all research procedures were followed in accordance with the ethical standards of the MUST REC and UBC/C&W REB, and the Helsinki Declaration of 1975.

The 2024 Pediatric Sepsis Data Challenge dataset is a subgroup of the 2017 to 2020 Ugandan epidemiological data published in 2023 [2]. These prospective, multisite, observational cohort data were focused on children aged up to 60 months who were admitted to any of 6 Ugandan hospitals with suspected or confirmed sepsis. These hospitals have a catchment population of 1·4 million children younger than 60 months of age. The authors of the 2023 report have previously shown that “approximately 90% of children admitted to hospital with a proven or suspected infection in Uganda meet the International Pediatric Sepsis Consensus Conference definition for sepsis” [7]; that is, “the presence of systemic inflammatory response syndrome combined with a confirmed or suspected infection” [8].

CHALLENGE DATA GENERATION

Of the 6,545 children younger than 60 months in the 2017 to 2020 dataset, we selected the 3,837 children aged 6 to 60 months, among whom there were 164 (4.3%) in-hospital deaths [2].

Synthetic training dataset

We created a synthetically generated training dataset to reduce the risk of re-identification. Even with deidentified data, there is a persistent risk of re-identification, especially in datasets with clinical variables. This risk is particularly relevant to the proposed data challenge, since participants may be able to re-identify patients, or even the hospitals, which are located primarily in low-resource environments with distinct characteristics. Therefore, we have taken the precaution of adding synthetic data elements so that our data-contributing sites are minimally exposed to re-identification.

The synthetic training set was generated from a random 70% subset of the original data (2,686 of the 3,837 records). This dataset was created in R Statistical Software using the synthpop package [9]. We used the non-parametric classification and regression tree (CART) method for synthesising all variables. Variables were synthesised sequentially, with the first variable (the outcome variable, in-hospital mortality) synthesised via sampling with replacement, and subsequent variables synthesised conditionally on all previous variables. The number of synthetic samples generated was equal to the size of the training set of the original data. The rationale behind this decision was to provide data challenge participants with an environment closely resembling our actual conditions, albeit without granting direct access to the real data. This approach aims to ensure that participants build their models under conditions that closely mimic the challenges posed by our real-world data. Missing data (31% of all data cells) were also synthesised as part of this process, and rules were specified to account for missing data due to branching logic. All direct identifiers were removed to reduce the risk of re-identification, and data collected during discharge or post-discharge from the facility were not included as they cannot be used to predict in-hospital mortality. The full training dataset contains 148 variables, including clinical, social, and laboratory values [10].
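The sequential scheme described above can be illustrated with a minimal Python sketch. It is not the synthpop implementation: for brevity it replaces the CART step with exact matching on all earlier variables, which only makes sense for a few categorical variables and would simply copy real records if applied to a wide dataset. The function name `synthesise`, the toy records, and the column names are hypothetical.

```python
import random
from collections import defaultdict


def synthesise(records, order, n_samples, seed=0):
    """Synthesise rows one variable at a time, in the given column order."""
    rng = random.Random(seed)
    synthetic = [{} for _ in range(n_samples)]
    for position, column in enumerate(order):
        earlier = order[:position]
        # Pool the observed values of `column` by the values of all earlier
        # columns. For the first column the key is empty, so every record
        # falls in one pool and drawing from it is sampling with replacement.
        # (Assumption: exact matching stands in for the leaves of a CART.)
        donors = defaultdict(list)
        for row in records:
            donors[tuple(row[name] for name in earlier)].append(row[column])
        for synth_row in synthetic:
            key = tuple(synth_row[name] for name in earlier)
            synth_row[column] = rng.choice(donors[key])
    return synthetic


# Hypothetical toy records; the real training dataset has 148 variables.
observed = [
    {"died": 0, "fever": "yes", "referred": "no"},
    {"died": 0, "fever": "no", "referred": "no"},
    {"died": 0, "fever": "yes", "referred": "yes"},
    {"died": 1, "fever": "yes", "referred": "yes"},
    {"died": 1, "fever": "no", "referred": "yes"},
]
columns = ["died", "fever", "referred"]  # outcome first, as in the text

# Generate as many synthetic rows as there are observed rows.
synthetic = synthesise(observed, columns, n_samples=len(observed))
```

In a CART-based synthesiser, each later variable is drawn from the observed values in the tree leaf that a synthetic row falls into, so similar (not only identical) records act as donors.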

Model validation dataset

The remaining 30% (i.e., 1,151 records) withheld from the original dataset will be used as a model validation dataset. The univariable distributions of the synthetic training dataset and the test validation dataset are similar [10]. The bivariate distributions of all predictor variables against the outcome, in-hospital mortality, were also similar, with some exceptions where a categorical variable was poorly represented.
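As a quick arithmetic check of the figures quoted so far, the following Python sketch recomputes the split sizes and the crude mortality rate. Rounding the 70% subset to the nearest whole record is an assumption; the source does not state how the subset size was fixed.

```python
# Figures quoted in the text.
TOTAL_RECORDS = 3837
IN_HOSPITAL_DEATHS = 164
TRAIN_FRACTION = 0.70

# 70% of the cohort, rounded to the nearest whole record (assumed rounding).
train_records = round(TOTAL_RECORDS * TRAIN_FRACTION)
# Everything else is withheld for validation.
validation_records = TOTAL_RECORDS - train_records
# Crude in-hospital mortality across the full cohort, as a percentage.
mortality_pct = 100 * IN_HOSPITAL_DEATHS / TOTAL_RECORDS

print(train_records, validation_records, round(mortality_pct, 1))
# prints: 2686 1151 4.3
```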

We have evaluated two measures of distribution divergence. Taken together, these divergence statistics suggest that the synthetic training dataset and the test validation dataset are similar. The maximum mean discrepancy (MMD) for continuous variables, in which smaller values indicate more similar datasets, between the synthetic training dataset and the test validation dataset was 0.030. The Kullback–Leibler (KL) divergence (expressed as a normalized value between 0 and 1, in which higher values indicate more similar datasets) between the synthetic training dataset and the test validation dataset was 0.915 for continuous variables and 0.987 for categorical variables [10].
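For readers unfamiliar with these two statistics, the sketch below shows one way each could be computed for a single variable. Several choices here are assumptions, since the source does not specify them: a Gaussian kernel with bandwidth 1.0 and the biased (squared) MMD estimator, and the mapping 1 / (1 + KL) to obtain a similarity score between 0 and 1. The function names are hypothetical.

```python
import math
from collections import Counter


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel between two scalar values."""
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two 1-D samples."""
    k_xx = sum(rbf_kernel(a, b, bandwidth) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf_kernel(a, b, bandwidth) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf_kernel(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2 * k_xy


def kl_similarity(p_values, q_values, eps=1e-9):
    """Map KL(P || Q) for a categorical variable to (0, 1]; 1 means identical.

    The 1 / (1 + KL) normalisation is an assumption made for illustration.
    """
    categories = set(p_values) | set(q_values)
    p_counts, q_counts = Counter(p_values), Counter(q_values)
    kl = 0.0
    for category in categories:
        # Additive smoothing keeps unseen categories from producing log(0).
        p = (p_counts[category] + eps) / (len(p_values) + eps * len(categories))
        q = (q_counts[category] + eps) / (len(q_values) + eps * len(categories))
        kl += p * math.log(p / q)
    return 1.0 / (1.0 + max(kl, 0.0))


# Hypothetical toy samples of one continuous and one categorical variable.
weights_a = [7.1, 8.4, 9.0, 10.2, 11.5]
weights_b = [7.3, 8.1, 9.4, 10.0, 11.9]
sex_a = ["F", "M", "F", "M", "M"]
sex_b = ["F", "F", "F", "M", "M"]

mmd_value = mmd_squared(weights_a, weights_b)
similarity = kl_similarity(sex_a, sex_b)
```

Identical samples give a squared MMD of 0 and a KL-based similarity of 1, which matches the direction of interpretation given in the text.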

THE 2024 PEDIATRIC SEPSIS DATA CHALLENGE

The 2024 Pediatric Sepsis Data Challenge aims to support global participants in building skills in model development for clinical risk prediction. It is anticipated to launch November 4th, 2024, and welcomes participants from all disciplines and expertise levels, from beginners in data science to veterans of the field. Participants are asked to design a working, open-source algorithm to predict in-hospital mortality using only the provided synthetically generated dataset [10]. Ideally, the model should be capable of running on a mobile device, considering environments with unreliable electrical supply and internet connectivity. The final model may eventually be used to increase the level of care for the most vulnerable children.

The challenge consists of two phases: an “unofficial” phase and an “official” phase. The first phase of the challenge serves as a testing ground for the data, scores, and submission system before the official phase begins. Participants will use the synthetically generated training dataset to train their models. This unofficial phase allows teams to start developing preliminary algorithms and the subsequent submission should include both the training code and the corresponding trained model. During this unofficial phase of the challenge, teams can submit up to 5 algorithms. In the official phase of the challenge, trained models will undergo evaluation on the test validation dataset, which will be kept confidential and not shared with participants at any stage. During the official phase of the challenge, teams can submit up to 10 algorithms.

Model evaluation

The challenge organizers will execute the submitted code within a contained environment on an Amazon Web Services (AWS) platform. Each team will be provided with a baseline model implemented in Python. This baseline model is a random forest classifier that inputs all available features. Categorical variables are transformed into a set of binary (0/1) variables, representing the distinct values within each categorical variable.
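The transformation of categorical variables into binary indicators can be sketched in plain Python as follows. This is an illustration of the encoding idea only, not the organizers' baseline code; the function name `one_hot_encode` and the example column names are hypothetical, and missing values are not handled here.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per distinct value."""
    # Distinct values per column, sorted so the output column order is stable.
    levels = {
        column: sorted({row[column] for row in rows})
        for column in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep non-categorical columns unchanged.
        out = {name: value for name, value in row.items() if name not in levels}
        for column, values in levels.items():
            for value in values:
                out[f"{column}={value}"] = 1 if row[column] == value else 0
        encoded.append(out)
    return encoded


# Hypothetical example records.
rows = [
    {"age_months": 14, "sex": "F"},
    {"age_months": 30, "sex": "M"},
]
encoded = one_hot_encode(rows, ["sex"])
print(encoded[0])
# prints: {'age_months': 14, 'sex=F': 1, 'sex=M': 0}
```

The resulting all-numeric rows are the kind of input a tree-based classifier such as a random forest can consume directly.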

Model predictions will be evaluated using a metric defined specifically for this challenge. The models make a mortality prediction for each patient record, and the challenge score is the true positive rate (TPR) for predicting mortality at a false positive rate (FPR) of less than or equal to 0.20. We define the numbers of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) as shown in Table 1. For the scoring metrics, θ denotes the decision threshold at which the FPR constraint is applied, and the score is the TPR obtained when the FPR is held at or below 0.20 (Table 2). A perfect score is 1. The model with the highest challenge score on the full test validation dataset will be declared the winner of the 2024 Pediatric Sepsis Data Challenge.

Table 1.

Definitions of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).

                        Observed outcome
Predicted outcome       Death       Survivor
Death                   TP          FP
Survivor                FN          TN

Table 2.

Scoring metrics.

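$\mathrm{FPR}_{\theta} = \dfrac{\mathrm{FP}_{\theta}}{\mathrm{FP}_{\theta} + \mathrm{TN}_{\theta}}$ [the fraction of incorrect predictions of survivors at the decision threshold of θ]
$\mathrm{TPR}_{\mathrm{FPR}} = \dfrac{\mathrm{TP}_{\mathrm{FPR}}}{\mathrm{TP}_{\mathrm{FPR}} + \mathrm{FN}_{\mathrm{FPR}}}$ [the fraction of correct predictions of in-hospital deaths at a fixed FPR]
$\mathrm{Score}_{1} = \mathrm{TPR}_{\mathrm{FPR} \le 0.20}$ [the TPR at an FPR of less than or equal to 0.20]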
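The scoring metric can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the official scoring code: it assumes that a record is predicted as a death when its predicted score is at or above the threshold, that every distinct predicted score is tried as a candidate threshold, and that the data contain at least one death and one survivor. The function name `challenge_score` and the toy data are hypothetical.

```python
def challenge_score(labels, scores, max_fpr=0.20):
    """Highest TPR over all thresholds whose FPR is at most `max_fpr`.

    labels: 1 for in-hospital death, 0 for survivor.
    scores: predicted risk of death, higher meaning more likely to die.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_tpr = 0.0
    for threshold in sorted(set(scores), reverse=True):
        # Predict death whenever the score reaches the threshold.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        fpr = fp / negatives
        tpr = tp / positives
        # Only thresholds that respect the FPR ceiling can set the score.
        if fpr <= max_fpr:
            best_tpr = max(best_tpr, tpr)
    return best_tpr


# Hypothetical toy example: 4 deaths and 5 survivors.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
print(challenge_score(labels, scores))
# prints: 0.75
```

In this toy example, a threshold of 0.6 flags three of the four deaths while misclassifying one of the five survivors (FPR = 0.20), so the score is 0.75; lowering the threshold further would exceed the FPR ceiling.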

For those interested in participating, please visit the challenge website for more details, including registration (http://www.bcchildrens.ca/globalhealth/projects-priorities/project-highlights/2024-pediatric-sepsis-data-challenge).

Copyright Form Disclosure:

Ms. Huxford’s institution received funding from Grand Challenges Canada, Thrasher Research Fund, BC Children’s Hospital Foundation, and Mining4Life. Dr. Businge received funding from the Pediatric Sepsis Data CoLaboratory, the World Federation of Pediatric Intensive and Critical Care Societies, the University of British Columbia and BC Children’s Hospital Foundation. Dr. Komugisha received support for article research from Grand Challenges Canada. Dr. Tayebwa received funding from Mbarara University of Science and Technology. Dr. Kamaleswaran received support for article research from the National Institutes of Health. The remaining authors have disclosed that they do not have any potential conflicts of interest.

Funding:

Grand Challenges Canada, Thrasher Research Fund, BC Children’s Hospital Foundation, and Mining4Life.

Footnotes

Disclosures: All authors declare that they have no conflicts of interest or financial disclosures.

REFERENCES

1. Rudd KE, Johnson SC, Agesa KM, et al.: Global, regional, and national sepsis incidence and mortality, 1990-2017: analysis for the Global Burden of Disease Study. Lancet 2020; 395:200–211.
2. Wiens MO, Bone JN, Kumbakumba E, et al.: Mortality after hospital discharge among children younger than 5 years admitted with suspected sepsis in Uganda: a prospective, multisite, observational cohort study. Lancet Child Adolesc Health 2023; 7:555–566.
3. Trujillo Rivera EA, Chamberlin JM, Patel AK, et al.: Dynamic mortality risk predictions for children in ICUs: development and validation of machine learning models. Pediatr Crit Care Med 2022; 23:344–352.
4. Steif J, Brant R, Sreepada RS, et al.: Prediction model performance with different imputation strategies: a simulation study using a North American ICU registry. Pediatr Crit Care Med 2022; 23:e29–e44.
5. Heneghan JA, Walker SB, Fawcett A, et al.: The Pediatric Data Science and Analytics Subgroup of the Pediatric Acute Lung Injury and Sepsis Investigators Network: use of supervised machine learning applications in pediatric critical care medicine research. Pediatr Crit Care Med 2023 Dec 7 [online ahead of print, doi: 10.1097/PCC.0000000000003425].
6. Chandna A, Keang S, Vorlark M, et al.: A prognostic model for critically ill children in locations with emerging critical care capacity. Pediatr Crit Care Med 2024; 25:189–200.
7. Wiens MO, Larson CP, Kumbakumba E, et al.: Application of sepsis definitions to pediatric patients admitted with suspected infections in Uganda. Pediatr Crit Care Med 2016; 17:400–405.
8. Goldstein B, Giroir B, Randolph A: International pediatric sepsis consensus conference: definitions for sepsis and organ dysfunction in pediatrics. Pediatr Crit Care Med 2005; 6:2–8.
9. Nowok B, Raab GM, Dibben C: synthpop: bespoke creation of synthetic data in R. Journal of Statistical Software 2016; 74:1–26.
10. Nguyen V, Huxford C, Rafiei A, et al.: Data Challenges: 2023 Pediatric Sepsis Challenge. Borealis, V1. 2023. Available from: 10.5683/SP3/TFAV36 (accessed April 2024).
