Machine learning/AI for early neonatal complication detection in rural Ethiopia: A retrospective cohort study in the Sidama region

Amanuel Yoseph; Yohannes Seifu Berego; Mehretu Belayneh; Francisco Guillen-Grima

doi:10.1177/20552076261431481

2026 Mar 7;12:20552076261431481. doi: 10.1177/20552076261431481


PMCID: PMC12967347 PMID: 41804354

Abstract

Introduction

Neonatal complications remain a leading cause of illness and death in low- and middle-income countries, particularly in rural areas. Early identification of high-risk neonates is crucial for timely interventions. This study assessed the incidence and determinants of neonatal complications and evaluated the predictive performance of machine learning algorithms using a unified risk framework encompassing both adverse birth outcomes and early postnatal complications.

Methods

We conducted a retrospective cohort study using routinely collected maternal and neonatal records. Five supervised machine learning models were developed in R: logistic regression (LR), support vector machine (SVM), random forest (RF), artificial neural network (ANN), and extreme gradient boosting (XGBoost). Model performance was assessed with area under the curve (AUC), sensitivity, specificity, F1 score, and calibration. SHapley Additive Explanations (SHAP) identified key predictors. Sensitivity analyses evaluated the robustness of results by examining birth outcomes and postnatal complications separately.

Results

Of the neonates studied, 15.2% (95% CI: 14.0–16.5) experienced complications, with higher rates in rural (17.1%) than urban areas (11.2%, p < 0.01). Preterm birth occurred in 12.7% and low birth weight in 9.4%, while 4.1% developed postnatal complications. XGBoost achieved the highest predictive performance [AUC = 0.85; sensitivity = 78%; specificity = 80%; F1 = 0.76], followed by RF and ANN. LR and SVM showed moderate accuracy. SHAP analysis highlighted maternal age <20, previous neonatal complications, low education, unplanned pregnancy, <4 antenatal visits, anemia, and rural residence as significant predictors. Sensitivity analyses confirmed stable performance across separate outcomes.

Conclusion

Neonatal complications remain prevalent, with pronounced rural–urban disparities. XGBoost offers accurate and interpretable early risk prediction using routine maternal and antenatal data. Targeted interventions including expanded prenatal care, anemia management, and strengthened rural health services could reduce neonatal morbidity and mortality.

Keywords: Neonatal complications, maternal risk factors, antenatal care, machine learning, XGBoost, SHAP analysis, retrospective cohort, inequities in rural health, Ethiopia

Introduction

Neonatal complications continue to be a major global public health concern, accounting for over 47% of all deaths among children under the age of five.¹ The burden is highest in low- and middle-income countries (LMICs), particularly in Sub-Saharan Africa, where inadequate access to prompt interventions, skilled birth attendants, and neonatal intensive care contributes to poor outcomes.^2,3 Ethiopia is no exception: despite dramatic reductions in under-five mortality over the last 20 years, neonatal morbidity and mortality remain high, particularly in rural regions with limited health-care infrastructure.⁴ Neonatal complications remain a pressing public health concern in the Sidama region, which has a predominantly rural population and logistical obstacles to hospital access.⁵

Early detection is associated with improved survival rates for high-risk newborns. Conventional risk classification systems rely heavily on clinician experience and paper-based monitoring, which can be resource-intensive, inconsistent, or delayed.⁶ There is growing evidence that early warning systems powered by artificial intelligence (AI) and machine learning (ML) can improve the prediction of adverse neonatal outcomes.^7,8 In high-income countries, these methods have produced promising results, including accurate prediction of preterm birth (PTB), infant infections, and respiratory problems.^9,10 Nonetheless, AI-driven prediction models remain underutilized in rural, low-resource areas, particularly in Sub-Saharan Africa, where challenges such as missing data, limited digital infrastructure, and contextual variables may hinder model performance.¹¹

Despite technological advances, research gaps remain in Ethiopia and other LMICs. First, rural newborn populations are underrepresented in research because most studies focus on hospitalized patients in urban areas.¹² Second, current AI models are of limited use to health-care providers in resource-constrained settings because they often focus on specific outcomes or lack interpretability.¹³ Third, prediction algorithms rarely include contextual factors known to influence neonatal outcomes, such as maternal education, geographic accessibility, and antenatal care (ANC) utilization.¹⁴ These shortcomings highlight the critical need for interpretable, contextually appropriate AI systems that can reliably identify high-risk neonates in rural Ethiopia.

Closing these gaps has significant clinical and public health implications. AI-driven early warning systems that prioritize high-risk cases may help guide interventions to reduce neonatal morbidity and mortality.¹⁵ Furthermore, by providing practical insights for regional and national health-care planning for women and newborns, these models can support evidence-based policies.¹⁶ The use of these technologies in remote areas such as the Sidama region has the potential to transform newborn care delivery, improve health equity, and help meet the Sustainable Development Goals for neonatal survival.¹⁷

To address these unmet needs, this study aims to develop and evaluate AI-driven early warning systems for neonatal complications in the Sidama region of rural Ethiopia. Specifically, it assesses model performance and interpretability, identifies key maternal, neonatal, and contextual factors, and makes recommendations for real-world implementation in resource-constrained settings. By closing these research gaps, the study seeks to expand the use of AI in neonatal health care and to inform programs that reduce neonatal morbidity and mortality in rural Ethiopia.

Methods

Study design and setting

A retrospective cohort study was conducted in the Northern Zone of the Sidama region, Ethiopia, from January 2021 to January 2025. The study included four rural districts—Boricha, Hawela, Shebedino, and Bilate Zuriya—that face significant impediments to modern neonatal care.¹⁸ Multiple government hospitals and health facilities contributed health data to ensure comprehensive coverage of deliveries in the region.

Study population

Eligible participants comprised all women who gave birth to live babies during the study period and had complete obstetric, maternal, and neonatal records. Exclusion criteria were multiple gestations other than twins, congenital abnormalities incompatible with life, and inadequate data. The final cohort included 3954 mother-infant pairs, providing sufficient power for predictive modeling.

Sample size and power calculation

The study cohort included 3954 mother–infant dyads, which provided sufficient statistical power for the planned analyses. Power calculations were based on the primary composite neonatal outcome (PTB, low birth weight (LBW), or postnatal complication), with an observed prevalence of 15.2%. Assuming a binary outcome, a two-sided significance level of 0.05, and an anticipated adjusted risk ratio of ≥1.5 for key maternal predictors, this sample size yields over 90% power to detect meaningful associations in multivariable regression and ML models. The large cohort also supports model training, validation, and sensitivity analyses while ensuring stable estimation of predictor effects.
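The power statement above can be illustrated with a rough normal-approximation calculation. This is a minimal sketch written in Python for illustration (the study itself used R), and it is not the authors' calculation. The paper does not report the exposure prevalence or baseline risk behind its power figure, so the sketch assumes an unadjusted comparison with 21.4% of mothers exposed (the anemia prevalence in Table 1) and derives the unexposed risk from the overall 15.2% prevalence and a risk ratio of 1.5. The function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_proportions(n, prevalence, exposed_fraction, risk_ratio, z_alpha=1.96):
    """Approximate power of a two-sided test (alpha = 0.05 by default)
    comparing outcome risk between exposed and unexposed groups."""
    # Choose the unexposed risk p0 so the overall prevalence is preserved:
    # prevalence = p0 * (1 - e) + risk_ratio * p0 * e
    p0 = prevalence / (1.0 + (risk_ratio - 1.0) * exposed_fraction)
    p1 = risk_ratio * p0
    n1 = n * exposed_fraction          # exposed group size
    n0 = n - n1                        # unexposed group size
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    # Power from the dominant tail of the normal approximation
    return normal_cdf(abs(p1 - p0) / se - z_alpha)

# Assumed inputs: n and prevalence from the text; exposed_fraction is an assumption
power = power_two_proportions(n=3954, prevalence=0.152,
                              exposed_fraction=0.214, risk_ratio=1.5)
print(f"approximate power: {power:.3f}")
```

Under these assumptions the approximate power exceeds 90%; rarer exposures or adjustment for covariates would lower it.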

Outcome definition

In this study, we defined overall neonatal risk as a composite outcome encompassing both birth outcomes and postnatal complications occurring within the first 28 days of life. The composite outcome includes:

Birth outcomes: PTB (birth before 37 completed weeks of gestation) and LBW (<2500 g).
Postnatal neonatal complications: neonatal sepsis, jaundice requiring phototherapy, hypoglycemia, and respiratory distress. Each complication was coded as a binary variable (0 = absent, 1 = present).

This composite measure captures the full spectrum of early neonatal complications, enabling timely risk stratification and intervention planning in low-resource settings. To ensure conceptual clarity, we also conducted sensitivity analyses using separate models for birth outcomes and postnatal complications, confirming that predictors identified in the primary analysis remained consistent across distinct outcome categories.
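The outcome definition above can be expressed as a small derivation rule. The following Python sketch is illustrative only; the record field names are hypothetical and do not come from the study's abstraction form. It derives the composite outcome and the two component outcomes used in the sensitivity analyses.

```python
# Hypothetical field names for the four postnatal complication flags (0/1)
POSTNATAL_FLAGS = ("sepsis", "jaundice_phototherapy", "hypoglycemia", "respiratory_distress")

def neonatal_outcomes(record):
    """Derive birth-outcome, postnatal, and composite indicators for one neonate."""
    preterm = 1 if record["gestational_age_weeks"] < 37 else 0   # PTB: <37 completed weeks
    low_birth_weight = 1 if record["birth_weight_g"] < 2500 else 0  # LBW: <2500 g
    birth_outcome = max(preterm, low_birth_weight)
    postnatal = max(record[flag] for flag in POSTNATAL_FLAGS)
    return {
        "birth_outcome": birth_outcome,
        "postnatal": postnatal,
        "composite": max(birth_outcome, postnatal),  # any event within 28 days
    }

# Three made-up records for demonstration
preterm_only = {"gestational_age_weeks": 35, "birth_weight_g": 2600, "sepsis": 0,
                "jaundice_phototherapy": 0, "hypoglycemia": 0, "respiratory_distress": 0}
sepsis_only = {"gestational_age_weeks": 39, "birth_weight_g": 3100, "sepsis": 1,
               "jaundice_phototherapy": 0, "hypoglycemia": 0, "respiratory_distress": 0}
healthy = {"gestational_age_weeks": 40, "birth_weight_g": 3200, "sepsis": 0,
           "jaundice_phototherapy": 0, "hypoglycemia": 0, "respiratory_distress": 0}

for rec in (preterm_only, sepsis_only, healthy):
    print(neonatal_outcomes(rec))
```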

Predictor variable selection

Predictor variables were chosen based on clinical relevance, prior evidence, and data availability within the study setting. Maternal and antenatal factors such as maternal age, education, history of previous neonatal complications, anemia, ANC visits, pregnancy planning status, and place of residence were prioritized due to their established associations with neonatal outcomes in Ethiopia and similar low-resource contexts. Contextual variables, including distance to health facilities, were also considered to capture environmental influences. We focused on interpretable, clinically meaningful predictors and did not employ unsupervised learning to identify additional variables, ensuring that the model could be realistically implemented using routinely collected data.

Data collection, quality checks, and preprocessing

Data were extracted from both electronic medical records and paper charts using a standardized abstraction form. Approximately 60% of the records came from electronic sources and 40% from paper charts. Data quality was assessed by cross-checking entries, identifying inconsistencies, and referring back to primary records as necessary. Missing data were handled using multiple imputation by chained equations.¹⁹ Continuous variables were normalized, and categorical variables were one-hot encoded. The data were randomly partitioned into training (70%) and validation (30%) sets. To address class imbalance in the primary composite outcome, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training set only, using maternal and antenatal predictors exclusively; the held-out validation set remained untouched to ensure unbiased model evaluation. Although the class imbalance was moderate, SMOTE was used to improve the model's learning of the minority class. We also tested alternative approaches, including class weighting, which produced comparable results.
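The key point in this workflow is the order of operations: the data are split first, and synthetic minority samples are generated from the training portion only. The sketch below shows that order with a minimal SMOTE-style interpolation in plain Python. It is an illustration under stated assumptions, not the authors' R code: the data are randomly generated, the features are treated as numeric, and the function names, the neighbour count k = 5, and the seeds are all hypothetical choices.

```python
import random

def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and validation parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def smote(minority, n_synthetic, k=5, seed=42):
    """Create synthetic minority points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Made-up data: 210 rows with two numeric features, about 14% positives
rng = random.Random(0)
rows = [((rng.random(), rng.random()), 1 if i % 7 == 0 else 0) for i in range(210)]

train, validation = train_validation_split(rows)        # split first
minority = [x for x, y in train if y == 1]
majority = [x for x, y in train if y == 0]
synthetic = smote(minority, n_synthetic=len(majority) - len(minority))
balanced_train = train + [(x, 1) for x in synthetic]     # validation is never touched

print(len(train), len(validation), len(balanced_train))
```

Plain interpolation of this kind suits continuous predictors; one-hot encoded variables would need a variant that handles categorical features, which this sketch does not attempt.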

Machine learning modeling

We implemented five supervised ML algorithms to predict the composite neonatal risk outcome: logistic regression (LR), random forest (RF), support vector machine (SVM), artificial neural network (ANN), and XGBoost. Sensitivity analyses modeled birth outcomes and postnatal complications separately. Hyperparameters were tuned using a 10-fold cross-validation grid search; the final parameters are listed in Supplemental Table 1. The ANN architecture consisted of 1–2 hidden layers with 10–30 neurons per layer, using ReLU or sigmoid activation, trained with the Adam optimizer for 100–200 epochs. Reporting these settings supports transparency and reproducibility and enables replication of model training and evaluation in similar low-resource settings.
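The tuning procedure, a grid search scored by 10-fold cross-validation, can be sketched generically. The Python example below is illustrative and is not the authors' pipeline: a one-feature k-nearest-neighbour classifier stands in for the five models, accuracy stands in for the scoring metric, and the data, grid values, and function names are all hypothetical.

```python
import random

def k_fold_indices(n, k=10, seed=1):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (one feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def cv_accuracy(data, k_neighbours, folds):
    """Mean held-out accuracy across folds for one hyperparameter value."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [row for i, row in enumerate(data) if i not in held_out]
        correct = sum(knn_predict(train, data[i][0], k_neighbours) == data[i][1]
                      for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)

def grid_search(data, grid, n_folds=10):
    """Evaluate every grid value on the same folds and return the best one."""
    folds = k_fold_indices(len(data), n_folds)
    results = {value: cv_accuracy(data, value, folds) for value in grid}
    best = max(results, key=results.get)
    return best, results

# Made-up data: label depends on a noisy threshold of a single feature
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.random()
    data.append((x, 1 if x + rng.gauss(0, 0.15) > 0.5 else 0))

best_k, scores = grid_search(data, grid=[1, 5, 15])
print(best_k, scores)
```

Using the same folds for every grid value keeps the comparison between hyperparameter settings fair.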

Model interpretability

To enhance interpretability, Shapley Additive Explanations (SHAP) were applied to each model to quantify the contribution of individual predictors and explore potential interactions.²⁰ This approach allows clinicians and policymakers to identify high-risk newborns using only pre-delivery maternal information while maintaining transparency and reproducibility.

Data preprocessing and missing data

Variables with more than 20% missing values were excluded, including maternal weight, mid-upper arm circumference, selected laboratory measures (hemoglobin, malaria, parasites, syphilis), and some socioeconomic indicators (income and occupation), which were inconsistently recorded. The 20% cutoff balanced the retention of important predictors with the need to minimize bias; a stricter threshold (e.g. 10%) would have excluded key variables, reducing model interpretability and generalizability in this low-resource setting. For the retained variables, the median proportion of missing values was 4% (range: 1–16%), which were handled using multiple imputation with chained equations to preserve statistical power and reliability.
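The 20% exclusion rule amounts to a simple per-variable filter applied before imputation. The Python sketch below is illustrative only; the column names and the ten toy records are hypothetical, and missing values are represented as None.

```python
def missing_fraction(rows, column):
    """Share of records in which the column is absent or None."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def drop_sparse_columns(rows, threshold=0.20):
    """Keep columns whose missing fraction does not exceed the threshold."""
    columns = sorted({c for r in rows for c in r})
    fractions = {c: missing_fraction(rows, c) for c in columns}
    kept = [c for c in columns if fractions[c] <= threshold]
    return kept, fractions

# Ten made-up records: maternal weight missing in 3 (30%), ANC visits in 1 (10%)
rows = [
    {"maternal_age": 24, "anc_visits": 3, "maternal_weight_kg": 55.0},
    {"maternal_age": 19, "anc_visits": 2, "maternal_weight_kg": None},
    {"maternal_age": 31, "anc_visits": 4, "maternal_weight_kg": 61.5},
    {"maternal_age": 27, "anc_visits": None, "maternal_weight_kg": 58.0},
    {"maternal_age": 35, "anc_visits": 5, "maternal_weight_kg": None},
    {"maternal_age": 22, "anc_visits": 1, "maternal_weight_kg": 49.0},
    {"maternal_age": 29, "anc_visits": 4, "maternal_weight_kg": 63.0},
    {"maternal_age": 18, "anc_visits": 2, "maternal_weight_kg": None},
    {"maternal_age": 26, "anc_visits": 3, "maternal_weight_kg": 57.0},
    {"maternal_age": 33, "anc_visits": 6, "maternal_weight_kg": 66.0},
]

kept, fractions = drop_sparse_columns(rows)
print(kept)       # columns retained for imputation
print(fractions)  # missing share per column
```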

Statistical analysis

Descriptive statistics summarized maternal, neonatal, and environmental characteristics. Categorical data were reported as frequencies and percentages, whereas continuous variables were reported as means ± standard deviation (SD). Bivariate relationships were examined using chi-square tests and t-tests. All analyses were conducted in R, with statistical significance set at p < 0.05. The composite outcome approach preserves statistical power and reflects the overall early neonatal risk burden, a clinically meaningful measure in low-resource rural settings. The separate sensitivity analyses serve to validate the conceptual integrity of the composite outcome, ensuring that findings are biologically plausible and consistent with individual outcome models.

Results

Study population characteristics

The final cohort included 3954 mother-infant dyads. The mean maternal age was 27.4 ± 5.8 years, with 18.6% of mothers under 20. Most participants (67.3%) lived in rural areas, and 35.2% had no formal education. Approximately 28.9% of pregnancies were unplanned, and 41.5% of women received fewer than four ANC visits. Maternal anemia was diagnosed in 21.4% of the mothers. Table 1 presents the baseline maternal, neonatal, and environmental factors.

Table 1.

Baseline characteristics of the study population (n = 3954).

Characteristic	n (%) or mean ± SD
Maternal age (years)	27.4 ± 5.8
<20	735 (18.6)
20–34	2689 (68.0)
≥35	530 (13.4)
Maternal education
No formal education	1390 (35.2)
Primary	1420 (35.9)
Secondary and above	1144 (28.9)
Residence
Rural	2657 (67.3)
Urban	1297 (32.7)
Antenatal care visits <4	1641 (41.5)
Unplanned pregnancy	1141 (28.9)
Maternal anemia	846 (21.4)
Neonatal birth weight <2500 g	371 (9.4)
Preterm birth (<37 weeks)	503 (12.7)
Neonatal sepsis	162 (4.1)
Distance to nearest health facility >5 km	1870 (47.3)


Incidence of neonatal complications

In the final cohort of 3954 mother-infant dyads, 15.2% (95% CI: 14.0–16.5) of neonates experienced at least one event included in the composite neonatal risk outcome within the first 28 days of life. PTB (12.7%) and LBW (9.4%) were the most frequent birth outcomes, while postnatal complications, including neonatal sepsis, jaundice, hypoglycemia, and respiratory distress, occurred in 4.1% of neonates. The incidence of composite neonatal risk was significantly higher in rural districts (17.1%) than in urban areas (11.2%; p < 0.01), highlighting persistent rural–urban inequities. Figure 1 displays the incidence by district.
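The rural–urban comparison can be checked approximately with a two-proportion z-test, which is equivalent to a chi-square test on a 2×2 table. The Python sketch below is illustrative and is not the authors' analysis: event counts are not reported, so they are reconstructed by rounding the reported percentages against the residence counts in Table 1, which is an assumption.

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Pooled two-proportion z-test; returns z and the two-sided p-value."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Denominators from Table 1; event counts are reconstructed (assumption)
rural_n, urban_n = 2657, 1297
rural_events = round(0.171 * rural_n)
urban_events = round(0.112 * urban_n)

z, p_value = two_proportion_z_test(rural_events, rural_n, urban_events, urban_n)
overall = (rural_events + urban_events) / (rural_n + urban_n)
print(f"z = {z:.2f}, p = {p_value:.2g}, overall incidence = {overall:.3f}")
```

With these reconstructed counts the difference is significant at p < 0.01, and the pooled incidence is close to the reported 15.2%.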

Figure 1. — Incidence of neonatal complications by district.

Machine learning model performance

Five supervised ML algorithms were applied to predict the composite neonatal risk outcome. Table 2 summarizes their comparative performance, and Figure 2 shows their ROC curves. XGBoost had the highest predictive performance, with an area under the curve (AUC) of 0.85 (95% CI: 0.82–0.88), sensitivity of 78% (95% CI: 73–82), specificity of 80% (95% CI: 76–84), and an F1-score of 0.76 (95% CI: 0.72–0.80). The RF and ANN models also performed well, with AUC values of 0.83 (95% CI: 0.80–0.86) and 0.81 (95% CI: 0.78–0.84), respectively. In contrast, the LR and SVM models showed moderate predictive ability, with AUCs of 0.75 (95% CI: 0.72–0.78) and 0.77 (95% CI: 0.74–0.80), respectively. Calibration analysis demonstrated good agreement between predicted probabilities and observed outcomes for all models, with XGBoost showing the closest alignment (calibration slope = 1.02). Hyperparameters for each model, including ANN architecture, learning rates, and tree parameters, are detailed in Supplemental Table 2, ensuring transparency and enabling replication of the modeling process in similar low-resource settings.

Given the moderate incidence of neonatal complications (15.2%), we also evaluated model performance using precision–recall curves. The area under the precision–recall curve for XGBoost was 0.68 (95% CI: 0.64–0.72), indicating strong positive predictive performance, and was higher than that of the other models (random forest: 0.64; ANN: 0.62; SVM: 0.57; logistic regression: 0.55). These results, presented in Supplemental Table 3, highlight the robustness of XGBoost in identifying high-risk neonates even when the positive outcome is relatively rare.
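For readers less familiar with the reported metrics, the Python sketch below shows how AUC, sensitivity, specificity, and F1 are computed from predicted probabilities. It is illustrative only: the eight labels and probabilities are made up, the 0.5 classification threshold is an assumption (the paper does not state the threshold used), and the function names are hypothetical.

```python
def auc(y_true, y_prob):
    """Probability that a random positive scores above a random negative
    (ties count one half); equivalent to the area under the ROC curve."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, and F1 at a fixed probability threshold.
    Assumes at least one true positive so that no denominator is zero."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn)   # recall among true positives
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}

# Made-up labels and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

model_auc = auc(y_true, y_prob)
metrics = classification_metrics(y_true, y_prob)
print(round(model_auc, 3), metrics)
```

Unlike AUC, the other three metrics depend on the chosen threshold, so they should always be reported together with it.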

Table 2.

Performance metrics of machine learning models for predicting composite neonatal complications.

Model	AUC (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	F1-score (95% CI)	Calibration slope
Logistic Regression	0.75 (0.72–0.78)	0.70 (0.65–0.75)	0.72 (0.68–0.76)	0.66 (0.62–0.70)	0.93
Random Forest	0.83 (0.80–0.86)	0.76 (0.71–0.81)	0.78 (0.74–0.82)	0.73 (0.69–0.77)	0.98
XGBoost	0.85 (0.82–0.88)	0.78 (0.73–0.82)	0.80 (0.76–0.84)	0.76 (0.72–0.80)	1.02
SVM	0.77 (0.74–0.80)	0.72 (0.67–0.77)	0.74 (0.70–0.78)	0.68 (0.64–0.72)	0.95
ANN	0.81 (0.78–0.84)	0.74 (0.69–0.79)	0.76 (0.72–0.80)	0.72 (0.68–0.76)	0.99


AUC: area under the curve; ANN: artificial neural network; CI: confidence interval; SVM: support vector machine; XGBoost: extreme gradient boosting.

Figure 2. — ROC curves of the five machine learning models.

F1 score and calibration interpretation

F1 scores and calibration slopes provide complementary insights into model performance. The F1 score reflects the balance between precision and recall, whereas the calibration slope measures agreement between predicted and observed outcomes, with a value of 1 indicating ideal calibration. For instance, XGBoost's F1 score of 0.76 and calibration slope of 1.02 indicate both a good precision–recall balance and well-calibrated predictions. Logistic regression, with a lower F1 score (0.66) and a calibration slope of 0.93, was slightly less well calibrated, consistent with its lower predictive performance.
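A calibration slope is commonly estimated by regressing the observed outcome on the logit of the predicted probability with a logistic model; the fitted coefficient is the slope. The Python sketch below implements that definition with a small Newton-Raphson fit. It is an illustration on simulated data, not the authors' R code, and the function names, sample size, and probability range are assumptions.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def calibration_slope(y_true, y_prob, iterations=25):
    """Fit y ~ intercept + slope * logit(p) by Newton-Raphson and
    return (intercept, slope)."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = sigmoid(a + b * xi)
            w = mu * (1 - mu)
            g0 += yi - mu              # gradient w.r.t. intercept
            g1 += (yi - mu) * xi       # gradient w.r.t. slope
            h00 += w                   # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Simulated, perfectly calibrated predictions: outcome drawn with probability p
rng = random.Random(0)
probs = [rng.uniform(0.02, 0.6) for _ in range(5000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]
_, slope = calibration_slope(outcomes, probs)

# Overconfident version: logits doubled, so predictions are too extreme
overconfident = [sigmoid(2 * logit(p)) for p in probs]
_, slope_over = calibration_slope(outcomes, overconfident)

print(round(slope, 2), round(slope_over, 2))
```

A slope near 1 indicates that predicted risks are neither too extreme nor too compressed; a slope below 1, as in the overconfident case here, indicates predictions that are too extreme.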

Key predictors of neonatal complications

Figure 3 shows the SHAP analysis of the XGBoost model, which identified the most influential predictors of neonatal complications. Individual-level predictors included maternal age under 20 years, a history of previous neonatal complications, and low maternal education. Unplanned pregnancy, fewer than four ANC visits, maternal anemia, and contextual factors such as rural residence and a long distance to the nearest health facility were also identified as risk factors.

The highest compounded risk was observed among neonates whose mothers were both anemic and attended fewer than four ANC visits. This combination markedly increased predicted neonatal risk compared with either factor alone. Other potential interactions, such as young maternal age combined with low education, unplanned pregnancy combined with limited ANC, or rural residence combined with long travel distance, were carefully evaluated for effect modification, but none reached statistical significance. These findings highlight that, while several maternal and contextual factors contribute individually to neonatal risk, the primary compounding effect is driven by anemia combined with inadequate ANC (Figure 4).
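To make the idea of attributing risk to individual predictors and to their interaction concrete, the Python sketch below computes exact Shapley values by enumerating feature coalitions for a toy risk function. It is illustrative only: the risk function and its coefficients are invented and do not represent the fitted XGBoost model, and SHAP software uses more efficient algorithms and a background dataset rather than the single all-zero baseline assumed here.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values of one instance relative to a single baseline."""
    features = list(instance)
    n = len(features)
    values = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # Features in the coalition take the instance value, the rest the baseline
                with_f = {g: instance[g] if (g in subset or g == f) else baseline[g]
                          for g in features}
                without_f = {g: instance[g] if g in subset else baseline[g]
                             for g in features}
                total += weight * (predict(with_f) - predict(without_f))
        values[f] = total
    return values

def toy_risk(x):
    """Hypothetical risk: additive effects plus an extra term when anemia
    and fewer than four ANC visits occur together."""
    return (0.08 + 0.05 * x["anemia"] + 0.04 * x["anc_lt4"] + 0.03 * x["age_lt20"]
            + 0.10 * x["anemia"] * x["anc_lt4"])

instance = {"anemia": 1, "anc_lt4": 1, "age_lt20": 1}
baseline = {"anemia": 0, "anc_lt4": 0, "age_lt20": 0}

phi = shapley_values(toy_risk, instance, baseline)
print({k: round(v, 3) for k, v in phi.items()})
```

In this toy case the interaction term is shared equally between anemia and ANC attendance, and the attributions sum to the difference between the instance's predicted risk and the baseline risk.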

Figure 4. — SHAP interaction plot illustrating the combined effect of maternal anemia and <4 ANC visits on predicted neonatal complication risk.

Sensitivity analyses results

Birth outcomes (PTB and LBW)

Sensitivity analyses were conducted using separate models to predict PTB and LBW, which occurred in 12.7% and 9.4% of neonates, respectively. XGBoost again demonstrated the highest predictive performance for both outcomes (Table 3). For PTB, it achieved AUC = 0.84, sensitivity = 76%, specificity = 79%, and F1-score = 0.74; for LBW, it achieved AUC = 0.82, sensitivity = 74%, specificity = 77%, and F1-score = 0.71.

Table 3.

Performance metrics for predicting preterm birth and low birth weight.

Model	PTB AUC (95% CI)	PTB sensitivity (95% CI)	PTB specificity (95% CI)	PTB F1-score (95% CI)	LBW AUC (95% CI)	LBW sensitivity (95% CI)	LBW specificity (95% CI)	LBW F1-score (95% CI)
XGBoost	0.84 (0.81–0.87)	0.76 (0.71–0.81)	0.79 (0.75–0.83)	0.74 (0.70–0.78)	0.82 (0.79–0.85)	0.74 (0.69–0.79)	0.77 (0.73–0.81)	0.71 (0.67–0.75)
Random Forest	0.82 (0.79–0.85)	0.74 (0.69–0.79)	0.77 (0.73–0.81)	0.71 (0.67–0.75)	0.80 (0.77–0.83)	0.72 (0.68–0.76)	0.75 (0.71–0.79)	0.69 (0.65–0.73)
ANN	0.80 (0.77–0.83)	0.73 (0.68–0.78)	0.76 (0.72–0.80)	0.70 (0.66–0.74)	0.78 (0.75–0.81)	0.71 (0.66–0.76)	0.74 (0.70–0.78)	0.67 (0.63–0.71)
SVM	0.76 (0.73–0.79)	0.70 (0.65–0.75)	0.73 (0.69–0.77)	0.66 (0.62–0.70)	0.74 (0.71–0.77)	0.69 (0.64–0.74)	0.72 (0.68–0.76)	0.65 (0.61–0.69)
Logistic Regression	0.74 (0.71–0.77)	0.68 (0.63–0.73)	0.71 (0.67–0.75)	0.64 (0.60–0.68)	0.72 (0.69–0.75)	0.67 (0.62–0.72)	0.70 (0.66–0.74)	0.63 (0.59–0.67)


AUC: area under the curve; ANN: artificial neural network; CI: confidence interval; SVM: support vector machine; XGBoost: extreme gradient boosting; PTB: preterm birth; LBW: low birth weight.

SHAP analysis identified the most important maternal and antenatal predictors for PTB and LBW as maternal age <20 years, low maternal education, history of previous neonatal complications, unplanned pregnancy, and fewer than four ANC visits. Rural residence and maternal anemia also contributed to higher predicted risks, consistent with findings from the primary composite outcome model (Figures 5 and 6).

Figure 5. — SHAP summary plot showing the top 7 predictors of preterm birth complications (XGBoost model).

Figure 6. — SHAP summary plot showing the top 7 predictors of low-birth-weight complications (XGBoost model).

Postnatal complications

Separate models were also developed for postnatal neonatal complications, including neonatal sepsis, jaundice, hypoglycemia, and respiratory distress, which collectively occurred in 4.1% of neonates. XGBoost again achieved the best performance: AUC = 0.83, sensitivity = 77%, specificity = 78%, F1-score = 0.72 (Table 4).

Table 4.

Performance metrics for postnatal complications.

Model	AUC (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	F1-score (95% CI)
XGBoost	0.83 (0.80–0.86)	0.77 (0.72–0.82)	0.78 (0.74–0.82)	0.72 (0.68–0.76)
Random Forest	0.81 (0.78–0.84)	0.75 (0.70–0.80)	0.77 (0.73–0.81)	0.70 (0.66–0.74)
ANN	0.79 (0.76–0.82)	0.73 (0.68–0.78)	0.75 (0.71–0.79)	0.68 (0.64–0.72)
SVM	0.76 (0.73–0.79)	0.70 (0.65–0.75)	0.73 (0.69–0.77)	0.65 (0.61–0.69)
Logistic Regression	0.74 (0.71–0.77)	0.68 (0.63–0.73)	0.71 (0.67–0.75)	0.63 (0.59–0.67)


AUC: area under the curve; ANN: artificial neural network; CI: confidence interval; SVM: support vector machine; XGBoost: extreme gradient boosting.

Key predictors identified were largely consistent with the birth outcome models, with maternal anemia, limited ANC attendance, rural residence, and low maternal education showing the largest contributions. Notably, interaction effects were observed, highlighting that neonates exposed to multiple maternal risk factors were at disproportionately higher risk of postnatal complications (Figure 7).

Figure 7. — SHAP summary plot showing the top 6 predictors of postnatal complications (XGBoost model).

Comparison with primary composite outcome

Overall, the sensitivity analyses confirmed that the predictors of individual outcomes aligned closely with those identified in the primary composite outcome model, supporting the robustness of our primary analysis. The results demonstrate that maternal and antenatal features alone can effectively identify neonates at high risk for both birth outcomes and postnatal complications, providing a biologically plausible and clinically interpretable risk stratification framework. Sensitivity analyses using class weighting instead of SMOTE yielded similar model performance, supporting the robustness of our findings.