Comparative performance of machine learning models in predicting childhood diarrhea: implications for public health surveillance

Joseph O Ashaolu; Taiwo S Akanji; Victoria I Ayansola; Agbolade J Sunday; Omoyajowo A Esther; Sylvain YM Some

doi:10.1186/s42506-026-00212-7

2026 Mar 10;101:6. doi: 10.1186/s42506-026-00212-7

PMCID: PMC12976237 PMID: 41806076

Abstract

Background

Diarrhea remains one of the leading causes of under-5 mortality in Nigeria. Traditional logistic regression has been used to assess risk factors; however, machine learning (ML) models can complement it by capturing complex, non-linear interactions among predictors. This study analyzes childhood diarrhea predictors in Nigeria using both approaches to inform strategic interventions.

Methods

Data from the 2018 Nigeria Demographic and Health Survey (NDHS) on 33,924 under-5 children were analyzed. Predictor variables included socioeconomic, environmental, behavioral, and household factors. Adjusted odds ratios (aORs) were estimated using logistic regression, while ML models (CatBoost, LightGBM, XGBoost) were used to evaluate non-linear relationships. The models were compared on predictive performance and on feature importance rankings.

Results

Notably, there was a striking regional variation, with the highest prevalence of diarrhea recorded in the North-East (aOR = 2.04, p < 0.001). Region and child’s age were the most important predictors in both regression and ML models, with ML importance of 100 and 76.1, respectively. Higher socioeconomic status had a strongly protective association, as evidenced by a clear wealth gradient (Very rich: aOR = 0.57, p < 0.001). Similarly, safe disposal of children’s feces (aOR = 0.92, p = 0.031) was associated with reduced odds of diarrhea, as was higher maternal education (aOR = 0.76, p = 0.010). Among the ML models, CatBoost provided the best balance of performance (AUC = 0.706, Balanced Accuracy = 0.660, sensitivity = 71.3%), suggesting its utility in ruling out the condition, although all models, including LR (AUC = 0.704), showed a good discriminative ability.

Conclusion

This study emphasizes the complementarity of logistic regression and ML in identifying diarrhea predictors. The principal findings point to a strategic combination of geographically focused WASH (Water, Sanitation, and Hygiene) interventions, age-stratified prevention, and poverty alleviation strategies. These findings can inform Nigeria’s interventions to reduce diarrhea in children and achieve SDG 3.2.

Supplementary Information

The online version contains supplementary material available at 10.1186/s42506-026-00212-7.

Keywords: Diarrhea, Under-5 children, Machine learning, Logistic regression, Nigeria, WASH

Text box 1. Contributions to the literature

• Illustrates the empirical use of machine learning (CatBoost) alongside traditional logistic regression to improve the predictive performance of childhood diarrhea risk models in low-resource settings.

• Identifies actionable, context-specific intervention strategies, such as region-specific WASH interventions and age-stratified radio messaging, to promote health and make the best use of program resources.

• Suggests a high-sensitivity screening tool (a CatBoost-based risk score) that could be scaled to primary health centers for early identification of high-risk children, enabling targeted prevention and improved equity in child health.

Introduction

Diarrhea remains a major challenge for under-5 children, especially in low- and middle-income countries. Globally, it is a leading cause of mortality in this age group, causing about 446,000 deaths each year, most of them in Sub-Saharan Africa and South Asia [1]. In Nigeria, a high mortality rate among young children has been reported (104 deaths per 1,000 births), with about 10–12% of these deaths associated with diarrheal diseases, indicating the significance of the condition [2]. Despite global efforts to prevent diarrhea-related mortality through interventions such as oral rehydration therapy (ORT), vaccination (e.g., rotavirus), and improved water, sanitation, and hygiene (WASH) behaviors, diarrhea continues to present a public health threat, especially in resource-limited settings [3].

Knowledge of the predictors of diarrheal disease in under-5 children is important for developing interventions that fit their needs. Previous research has reported various predictors, such as maternal education, household income, availability of safe water, sanitation facilities, and geographic differences [4]. For instance, children living in households with unimproved sanitation or unsafe feces disposal face an increased risk of developing diarrhea [5]. Similarly, the mother’s education level has been found to have a protective association because educated mothers adopt preventive health practices and seek timely medical care [6]. However, the relative contribution of these factors and their interplay is still poorly documented in the Nigerian context, where cultural, geographic, and economic diversity may influence disease patterns differently.

Whereas the individual risk factors for diarrhea are well known, the relative contribution of these factors and their complex, non-linear interplay remains less explored within the Nigerian context using methodological approaches that can explicitly capture such interactions. Previous studies have mostly relied on logistic regression models, which, although interpretable, cannot fully capture intricate interactions [7]. The application and comparison of advanced ML algorithms such as CatBoost, LightGBM, and XGBoost to NDHS data for this purpose is still limited, despite the potential for yielding novel insights into targeted interventions [8, 9]. Moreover, the use of ML in epidemiology may improve the predictive power of risk assessment models by detecting complex patterns in data, allowing early identification of high-risk groups through sensitive and specific predictive tools and enabling proactive rather than reactive public health measures [10].

This study therefore contrasts traditional, interpretable regression models with advanced, non-linear machine learning (ML) models. It contributes to the literature not only by comparing several high-performance gradient-boosted decision tree (GBDT) models with traditional regression, but also by explicitly translating ML results into actionable, context-specific policy strategies for Nigeria. It also illustrates how ML-derived insights, such as non-linear interactions and high-sensitivity screening potential, can be integrated into public health planning that extends beyond pure prediction to actionable intervention design.

In addition, the findings are intended to inform Nigeria’s public health policy and programming by identifying the most influential predictors of diarrhea and by guiding the prioritization of interventions such as improved WASH, maternal education, and geographically targeted health campaigns. With Nigeria working towards the Sustainable Development Goals (SDGs), in particular SDG 3 (good health and well-being) and SDG 6 (clean water and sanitation), evidence-informed practice based on thorough data analysis has a key role in reducing the burden of diarrheal disease in children [11].

This research, therefore, leverages data from the Nigeria Demographic and Health Survey (NDHS) to analyze predictors of diarrheal disease among under-5 children by combining conventional and machine learning methods. By comparing the performance of logistic regression with that of high-performing ML models, this study aims to address the following research gaps: (1) the relative importance of socioeconomic, environmental, and behavioral predictors of diarrhea risk in Nigeria; (2) the consistency of important predictors across modelling paradigms; and (3) the potential of ML to offer complementary insights, including high sensitivity, for targeted public health strategies.

Methods

Study design and data source

The research used a cross-sectional study design to examine data from the NDHS, a nationally representative survey conducted by the National Population Commission (NPC) in partnership with ICF International [2]. The NDHS uses a stratified two-stage cluster sampling technique to ensure that it is representative of all 36 states in Nigeria and the Federal Capital Territory. The survey collects detailed data about maternal and child health, household characteristics, and environmental conditions, which makes it an ideal dataset for analyzing the determinants of diarrheal disease in under-5 children. The analytic sample comprised 33,924 children under five years old, and the primary dependent variable was diarrhea status (yes/no binary outcome).

Missing data were handled using Multiple Imputation by Chained Equations (MICE) with 10 imputations. We estimated the coefficients for the multivariable logistic regression on the training set, pooling the results across the 10 MICE imputations. The performance of all models, namely logistic regression, CatBoost, LightGBM, and XGBoost, was then evaluated and compared on predictions for the same test set. In addition, a multilevel logistic regression model was employed to account for the hierarchical structure of the NDHS data (children nested within primary sampling units, or clusters), with a random intercept for clusters to model within-cluster correlation. This approach yields more robust and valid inferences for the identified risk factors. Adjusted odds ratios with 95% confidence intervals were used to measure the magnitude of associations.
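The pooling step can be illustrated with a minimal sketch. It assumes that pooling followed Rubin's rules, the usual convention for multiply imputed data; the text does not name the pooling rule, and the coefficient values below are hypothetical.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)            # pooled point estimate
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # between-imputation variance (n - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical log-odds and standard errors from 10 imputed datasets
betas = [0.70, 0.72, 0.71, 0.73, 0.69, 0.71, 0.72, 0.70, 0.71, 0.72]
ses = [0.055] * 10

beta, se = pool_rubin(betas, [s ** 2 for s in ses])
print(f"pooled aOR = {math.exp(beta):.2f}, pooled SE = {se:.4f}")
```

The pooled standard error is always at least as large as the average within-imputation standard error, because the between-imputation term reflects the extra uncertainty due to missing data.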

Variables and operational definitions

Independent variables were categorized into four broad domains: (a) Socioeconomic Factors (maternal education; household wealth index: very poor, poor, average, rich, and very rich; and access to electricity); (b) Environmental and Sanitation Factors (source of drinking water: improved/unimproved; type of toilet facility: improved/unimproved; shared toilet: yes/no; child feces disposal method: safe/unsafe; and water source distance); (c) Media and Technology Exposure (frequency of listening to the radio, watching television, and internet use); and (d) Child and Household Characteristics (child’s age in months; child’s gender; residential region: North-West, North-East, South-South, etc.; and the number of under-5 children living in the household). Diarrhea status was ascertained by maternal report of the child passing loose or watery stools during the two weeks before the survey, according to World Health Organization (WHO) criteria [1, 12].

Conceptual framework

The conceptual framework guiding this analysis posits that childhood diarrhea is influenced by factors in four broad, interconnected domains: (1) Socioeconomic Factors, (2) Environmental and Sanitation Factors, (3) Media and Technology Exposure, and (4) Child and Household Characteristics. These factors have the potential to directly affect diarrhea risk and also interact with one another. The ML models are particularly suited to test the complex pathways and interactions implied by this framework.

Statistical and machine learning approaches

Data preprocessing and model training

Data cleaning, recoding, descriptive statistics, and logistic regression analysis were performed using IBM SPSS Statistics, Version 28. All machine learning modeling, including data splitting, hyperparameter tuning, model training, evaluation, and feature importance calculation, was conducted in R (Version 4.2.2), using CatBoost (1.2), LightGBM (3.3.5), xgboost (1.7.5.1), MICE (3.15.0) for multiple imputation, and caret (6.0–94) for general modeling infrastructure. The analysis proceeded as follows: (1) data cleaning and variable recoding in SPSS; (2) export of the final dataset to R; (3) a 70/30 split into training and test sets, stratified on the outcome variable to preserve the class distribution; (4) multiple imputation with MICE fitted on the training set only, with the fitted imputation model then applied to the held-out test set to prevent leakage of information from the test set into the training process; (5) pooling of results from the imputed datasets for logistic regression; (6) model training and hyperparameter optimization via grid search with 5-fold cross-validation on the training set; and (7) final model evaluation on the held-out test set.
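Steps (3) and (4) can be sketched as follows. This is an illustration only: the study used R with MICE, whereas the sketch uses plain Python and swaps in a simple mode imputer as a stand-in for MICE, so that the leakage-avoidance pattern (fit on training data, apply to test data) is visible. The toy records and column names are hypothetical.

```python
import random
from collections import Counter

def stratified_split(rows, label_key, test_frac=0.30, seed=42):
    """Split rows into train/test sets while preserving the outcome distribution."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]              # copy so the caller's list is untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def fit_mode_imputer(train_rows, columns):
    """Learn the most frequent observed value per column from training rows only."""
    return {
        col: Counter(r[col] for r in train_rows if r[col] is not None).most_common(1)[0][0]
        for col in columns
    }

def apply_imputer(rows, fill_values):
    """Fill missing values using parameters learned on the training set."""
    return [
        {k: (fill_values[k] if v is None and k in fill_values else v) for k, v in r.items()}
        for r in rows
    ]

# Hypothetical toy data: 12% positive outcome, some missing feces-disposal values
rng = random.Random(0)
rows = [
    {"diarrhea": int(i % 25 < 3), "feces_disposal": rng.choice(["safe", "unsafe", None])}
    for i in range(1000)
]

train, test = stratified_split(rows, "diarrhea")
fill = fit_mode_imputer(train, ["feces_disposal"])   # fitted on training data only
train_imputed = apply_imputer(train, fill)
test_imputed = apply_imputer(test, fill)             # test set reuses training parameters
```

Because the fill values come from the training rows alone, nothing about the test set influences the imputation model, which is the property step (4) is meant to guarantee.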

Categorical variables, such as region, education level, and wealth index, were one-hot encoded for both the logistic regression and the tree-based models so that all models received the same inputs. All numerical features, such as the child’s age in months, were standardized (z-score normalization) before being passed to the ML algorithms so that features were on comparable scales. Using classical logistic regression alongside current gradient boosting methods allowed the stability of the key diarrhea risk factors to be checked across approaches while maximizing predictive accuracy. Detailed variable definitions and coding schemes are available in Supplementary Table S1.
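A minimal sketch of the two transformations is given below, in plain Python rather than the R tooling used in the study. The region names come from Table 1; the ages are hypothetical, and the population standard deviation is an arbitrary choice made for this illustration.

```python
from statistics import mean, pstdev

REGIONS = ["North-Central", "North-East", "North-West",
           "South-East", "South-South", "South-West"]

def one_hot(value, categories):
    """Return a 0/1 indicator vector with a single 1 at the matching category."""
    return [1 if value == c else 0 for c in categories]

def fit_zscore(train_values):
    """Learn mean and standard deviation on training values; return a scaling function."""
    mu, sigma = mean(train_values), pstdev(train_values)
    return lambda x: (x - mu) / sigma

# Hypothetical child ages in months from a training set
train_ages = [2, 8, 14, 20, 30, 41, 47, 55]
scale_age = fit_zscore(train_ages)

encoded_region = one_hot("North-East", REGIONS)
scaled_ages = [scale_age(a) for a in train_ages]
print(encoded_region, [round(z, 2) for z in scaled_ages])
```

Tree-based models are largely insensitive to monotone rescaling of inputs, so standardization mainly matters for keeping the preprocessing identical across the models being compared.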

Logistic regression analysis

A multivariable logistic regression model was used to investigate the relationship between predictor variables and diarrhea status. Adjusted odds ratios (aORs) with 95% confidence intervals (CIs) were calculated to measure the magnitude of associations. The model was adjusted for confounders preselected from the literature, such as region, type of residence (urban/rural), and household size, and p-values < 0.05 were considered statistically significant. We also fitted a multilevel logistic regression with a random intercept for region to account for the hierarchical structure of the survey. Null, individual/household, environmental, and full models were compared using likelihood ratio tests, AIC, and BIC. To check for multicollinearity among the predictors, the Variance Inflation Factor (VIF) was calculated. The mean VIF was below 2.0, and all individual VIFs were below 5, indicating no multicollinearity severe enough to bias the coefficient estimates.
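The relationship between a logistic regression coefficient and its aOR with a 95% CI can be sketched as below. The example assumes a Wald-type interval on the log-odds scale, which the text does not state explicitly, and back-calculates the standard error from the North-East interval reported in Table 2.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its SE into an odds ratio with a Wald CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def se_from_ci(lower, upper, z=1.96):
    """Recover the SE of the log-odds from a reported CI (assumes a Wald interval)."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# North-East vs North-Central (Table 2): aOR = 2.04, 95% CI 1.83-2.27
se_ne = se_from_ci(1.83, 2.27)
aor, ci_low, ci_high = odds_ratio_ci(math.log(2.04), se_ne)
print(f"SE(log-odds) = {se_ne:.3f}; aOR = {aor:.2f} ({ci_low:.2f}-{ci_high:.2f})")
```

The interval is symmetric on the log scale, not on the odds ratio scale, which is why the reported limits are not equidistant from 2.04.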

Machine learning modeling

With the prevalence of diarrhea being approximately 12%, we explored several strategies for dealing with class imbalance: class weighting, optimal threshold selection, and the Synthetic Minority Over-sampling Technique (SMOTE). These are summarized in Supplementary Table S2. The highest single-point sensitivity (74.0%) was obtained through optimal threshold selection on the baseline model, but the SMOTE-based approach provided a more robust and operationally relevant balance of sensitivity and specificity over a range of thresholds. Formal comparison using DeLong’s test showed no statistically significant difference in overall AUC between the SMOTE and class weighting strategies (p = 0.124). Visual inspection of the ROC curves suggested that the SMOTE model consistently performed better in the high-sensitivity region (sensitivity ≥ 0.70), which is critical for our goal of a screening tool. We therefore selected SMOTE as the primary strategy for all subsequent ML model training.
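The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified illustration on hypothetical numeric points, not the implementation used in the study; the function name and parameters are chosen for this example.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating towards nearby minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()                      # position along the segment base -> target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic

# Hypothetical standardized features for four minority-class children
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority_points, n_new=10, k=2)
print(new_points[:3])
```

Oversampling of this kind is normally applied to training data only, so that the held-out test set keeps the original class prevalence.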

To improve classification performance and assess non-linear relationships, three advanced GBDT models were applied and compared with a conventional multivariable logistic regression model, which served as the baseline. The GBDT models were chosen for their ability to capture complicated, non-linear relationships and interactions in tabular data, which logistic regression may miss. The models compared were: (a) CatBoost, a gradient boosting algorithm that handles categorical variables effectively and limits overfitting through ordered boosting [9]; (b) LightGBM, a fast gradient boosting framework based on histogram-based learning [13]; and (c) XGBoost, a scalable ensemble method with regularization to limit overfitting [8]. All models were trained on 70% of the data, reserving 30% for final testing. For hyperparameter optimization and internal validation, a grid search with 5-fold cross-validation was performed on the training set. The configuration with the best cross-validation performance on the training set was then used to make the final predictions on the test set.

Model evaluation and feature importance

Discriminative ability was assessed using the AUC. Accuracy, sensitivity, and specificity were estimated to measure classification performance, and ROC curves were used to compare the models visually. Feature importance was extracted from all ML models to determine the top predictors. CatBoost ranks features by the change in prediction values, LightGBM uses gain-based importance, which reflects each feature’s contribution to model accuracy, and XGBoost quantifies importance through gain, cover, and the frequency (weight) of feature splits. Predictions on the same held-out test set were made using the final tuned versions of all models, including the logistic regression model. The performance metrics reported are from the test set.
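How the classification metrics relate to one another can be shown from a single confusion matrix. The counts below are hypothetical, chosen per 10,000 children to be roughly consistent with the CatBoost row of Table 3; they are not the study's actual confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, used for Cohen's kappa
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "f2": 5 * precision * sensitivity / (4 * precision + sensitivity),
        "kappa": (accuracy - expected) / (1 - expected),
    }

# Hypothetical counts (illustrative only)
metrics = classification_metrics(tp=923, fp=3430, tn=5276, fn=371)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The F2-score weights sensitivity four times as heavily as precision, which suits a screening setting where missed cases cost more than false alarms.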

Ethical considerations

The NDHS protocol was reviewed by the National Health Research Ethics Committee of Nigeria (NHREC) and the Institutional Review Board of the ICF. Since secondary, de-identified data were used in this study, further ethical clearance was not necessary. All analyses also adhered to the principles of confidentiality such that no individual or household would be identifiable.

Results

Descriptive Characteristics and Logistic Regression Analysis

The sample varied widely in demographic and socioeconomic characteristics (Table 1). The overall prevalence of diarrhea was 11.99% (4,066 cases), and the outcome was missing for 9.47% of children.

Table 1.

Descriptive statistics of diarrhea dataset

Variable	Count	Percentage (%)
Mother’s Age
15–19	1434	4.23%
20–24	6626	19.53%
25–29	9470	27.92%
30–34	7647	22.54%
≥ 35	8747	25.78%
Mother’s Education Level
Not Educated	15,391	45.37%
Primary (Incomplete)	1619	4.77%
Primary (Complete)	3655	10.77%
Secondary (Incomplete)	3927	11.58%
Secondary (Complete)	6696	19.74%
Higher Education	2636	7.77%
Region
North-Central	5875	17.32%
North-East	7211	21.26%
North-West	10,305	30.38%
South-East	3798	11.2%
South-South	3202	9.44%
South-West	3533	10.41%
Residence
Rural	22,225	65.51%
Urban	11,699	34.49%
Wealth Index
Very Poor	8066	23.78%
Poor	7743	22.82%
Average	7171	21.14%
Rich	6166	18.18%
Very rich	4778	14.08%
Child’s Age (months)
≤ 5	3409	10.05%
6–11	3350	9.88%
12–23	6543	19.29%
24–35	6540	19.28%
36–47	7011	20.67%
48–59	7071	20.84%
Child’s Sex
Female	16,667	49.13%
Male	17,257	50.87%
Access to Electricity
No	17,235	50.8%
Yes	16,689	49.2%
Ever used the Internet
No	30,766	90.69%
Yes	3158	9.31%
Internet use Last month
Not in the past month	31,307	92.29%
< At least once a week	938	2.77%
< Once a week	450	1.33%
Almost everyday	1229	3.62%
Watches TV
No	19,733	58.17%
Once a week	5813	17.14%
> Once a week	8378	24.7%
Listens to the radio
No	16,552	48.79%
Once a week	8072	23.79%
> Once a week	9300	27.41%
Total Household Members
≤ 6 Persons	15,159	44.69%
≥ 7 Persons	18,765	55.31%
Total Under5 in Households
≤ 2	23,437	69.09%
≥ 3	10,487	30.91%
Drinking Water Source
Unimproved	14,418	42.5%
Improved	19,506	57.5%
Time to water
On Premises	10,206	30.08%
< 30 min	18,809	55.44%
> 30 min	4909	14.47%
Type of Toilet facility
Unimproved	17,710	52.2%
Improved	16,214	47.8%
Shared Toilet Facility
No	16,619	48.99%
Yes	8346	24.6%
Missing	8959	26.41%
Disposal of faeces
Safe	11,743	34.62%
Unsafe	9587	28.26%
Missing	12,594	37.12%
Diarrhea Status
No	26,647	78.55%
Yes	4066	11.99%
Missing	3211	9.47%

Multivariable logistic regression revealed several significant predictors of diarrhea (Table 2). Relative to the North-Central region, the odds of diarrhea were highest among children living in the North-East (aOR = 2.04, 95% CI: 1.83–2.27, p < 0.001) and lowest among children living in the South-South (aOR = 0.39, 95% CI: 0.33–0.47, p < 0.001). Maternal education demonstrated a protective association, with children of mothers with higher education having 24% lower odds of diarrhea (aOR = 0.76, 95% CI: 0.62–0.93, p = 0.010). Child’s age was strongly associated with diarrhea: relative to infants aged ≤ 5 months, odds were highest at 6–23 months (aOR = 2.31) and 30% lower among children aged 48–59 months (aOR = 0.70, 95% CI: 0.60–0.81, p < 0.001). Media exposure showed mixed associations: ever using the internet was associated with 48% higher odds (aOR = 1.48, 95% CI: 1.09–1.99, p = 0.010), and listening to the radio more than once a week with 32% higher odds (aOR = 1.32, 95% CI: 1.21–1.45, p < 0.001), most likely because of socioeconomic confounding.

Table 3.

Comparative performance of machine learning models for childhood diarrhea prediction

Model	AUC	Accuracy	Sensitivity	Specificity	F1-Score	F2-Score	Balanced Accuracy	Precision	Kappa
CatBoost	0.706	0.620	0.713	0.606	0.327	0.484	0.660	0.212	0.159
LightGBM	0.705	0.642	0.672	0.638	0.328	0.473	0.655	0.217	0.164
Logistic Regression	0.704	0.629	0.691	0.619	0.325	0.476	0.655	0.213	0.159
XGBoost	0.703	0.637	0.680	0.631	0.327	0.474	0.655	0.215	0.162

All models were trained with 5-fold cross-validation and SMOTE for class imbalance handling

Table 2.

Logistic regression Analysis - Unadjusted and adjusted odds ratios

Variable	Unadjusted OR	95% CI (Unadjusted)	P-value (Unadjusted)	Adjusted OR	95% CI (Adjusted)	P-value (Adjusted)
Mother’s Age:
15–19	ref	ref		ref	ref
20–24	0.81	(0.70–0.94)	0.006	0.98	(0.84–1.15)	0.793
25–29	0.67	(0.58–0.77)	< 0.001	0.92	(0.79–1.08)	0.308
30–34	0.55	(0.47–0.64)	< 0.001	0.80	(0.68–0.94)	0.008
≥ 35	0.57	(0.50–0.67)	< 0.001	0.81	(0.69–0.96)	0.013
Mother’s Education Level
Not Educated	ref	ref		ref	ref
Primary (Incomplete)	0.93	(0.81–1.07)	0.339	1.09	(0.94–1.26)	0.268
Primary (Complete)	0.73	(0.66–0.82)	< 0.001	1.05	(0.93–1.19)	0.397
Secondary (Incomplete)	0.70	(0.63–0.78)	< 0.001	1.05	(0.92–1.18)	0.475
Secondary (Complete)	0.50	(0.45–0.54)	< 0.001	0.84	(0.74–0.95)	0.006
Higher Education	0.40	(0.34–0.47)	< 0.001	0.76	(0.62–0.93)	0.010
Region
North-Central	ref	ref		ref	ref
North-East	2.14	(1.95–2.35)	< 0.001	2.04	(1.83–2.27)	< 0.001
North-West	0.93	(0.84–1.02)	0.125	0.86	(0.77–0.97)	0.010
South-East	0.45	(0.39–0.52)	< 0.001	0.44	(0.37–0.52)	< 0.001
South-South	0.38	(0.32–0.45)	< 0.001	0.39	(0.33–0.47)	< 0.001
South-West	0.41	(0.35–0.48)	< 0.001	0.44	(0.37–0.52)	< 0.001
Residence
Rural	ref	ref		ref	ref
Urban	0.70	(0.65–0.75)	< 0.001	1.13	(1.04–1.24)	0.006
Wealth Index
Very Poor	ref	ref		ref	ref
Poor	0.81	(0.74–0.88)	< 0.001	0.89	(0.81–0.98)	0.013
Average	0.65	(0.59–0.71)	< 0.001	0.78	(0.69–0.88)	< 0.001
Rich	0.51	(0.46–0.57)	< 0.001	0.68	(0.58–0.80)	< 0.001
Very rich	0.34	(0.30–0.38)	< 0.001	0.57	(0.46–0.70)	< 0.001
Child’s Age
≤ 5	ref	ref		ref	ref
6–11	2.18	(1.89–2.51)	< 0.001	2.31	(2.00–2.67)	< 0.001
12–23	2.19	(1.93–2.49)	< 0.001	2.31	(2.03–2.63)	< 0.001
24–35	1.39	(1.22–1.58)	< 0.001	1.43	(1.25–1.64)	< 0.001
36–47	0.96	(0.84–1.10)	0.590	0.99	(0.86–1.14)	0.874
48–59	0.70	(0.61–0.81)	< 0.001	0.70	(0.60–0.81)	< 0.001
Child’s Sex
Female	ref	ref		ref	ref
Male	1.01	(0.95–1.07)	0.827	1.00	(0.94–1.07)	0.932
Electricity Access
No	ref	ref		ref	ref
Yes	0.68	(0.63–0.72)	< 0.001	1.04	(0.95–1.14)	0.409
Internet Use Ever
No	ref	ref		ref	ref
Yes	0.48	(0.42–0.55)	< 0.001	1.48	(1.09–1.99)	0.010
Internet Use Last Month
Not in the past month	ref	ref		ref	ref
< At least once a week	0.44	(0.30–0.63)	< 0.001	0.55	(0.34–0.88)	0.014
< Once a week	0.44	(0.33–0.56)	< 0.001	0.59	(0.40–0.87)	0.008
Almost everyday	0.43	(0.34–0.54)	< 0.001	0.58	(0.40–0.85)	0.005
Watches TV
No	ref	ref		ref	ref
Once a week	0.71	(0.64–0.77)	< 0.001	1.10	(0.98–1.22)	0.235
> Once a week	0.59	(0.55–0.64)	< 0.001	1.08	(0.95–1.21)	0.101
Listens to Radio
No	ref	ref		ref	ref
Once a week	0.75	(0.69–0.81)	< 0.001	1.11	(1.01–1.21)	0.034
> Once a week	0.83	(0.77–0.89)	< 0.001	1.32	(1.21–1.45)	< 0.001
Total Household Members
≤ 6 Persons	ref	ref		ref	ref
≥ 7 Persons	0.78	(0.74–0.84)	< 0.001	0.88	(0.81–0.95)	0.002
Total Under5 in Households
≤ 2	ref	ref		ref	ref
≥ 3	1.13	(1.06–1.21)	< 0.001	0.91	(0.84–0.99)	0.025
Drinking Water Source
Unimproved	ref	ref		ref	ref
Improved	0.80	(0.75–0.86)	< 0.001	1.00	(0.94–1.08)	0.891
Time to water
On Premises	ref	ref		ref	ref
< 30 min	1.17	(1.09–1.26)	< 0.001	0.98	(0.91–1.07)	0.694
> 30 min	1.19	(1.08–1.32)	< 0.001	1.02	(0.91–1.14)	0.736
Type of Toilet facility
Unimproved	ref	ref		ref	ref
Improved	0.86	(0.81–0.92)	< 0.001	1.03	(0.95–1.12)	0.473
Shared Toilet Facility
No	ref	ref		ref	ref
Yes	0.73	(0.68–0.79)	< 0.001	0.96	(0.89–1.03)	0.269
Disposal of faeces
Unsafe	ref	ref		ref	ref
Safe	1.19	(1.12–1.27)	< 0.001	0.92	(0.85–0.99)	0.031

The comparison of multilevel models showed that adding individual/household factors provided the most significant improvement in model fit (χ² = 820.8, p < 0.001), whereas the addition of WASH variables did not provide a significant improvement (χ² = 5.8, p = 0.443) (Supplementary Table S3). Random effects showed significant regional variations; the North-East region had the highest intercept, while the South-South had the lowest.

Machine Learning Model Performance

All models showed good performance, with AUC scores between 0.703 and 0.706 (Table 3, Fig. 1). The best AUC (0.706) and Balanced Accuracy (0.660) were achieved by CatBoost, closely followed by LightGBM. Logistic regression performed similarly to the ML models, with an AUC of 0.704. Importantly, the dataset was seriously imbalanced, with only about 12% of children having diarrhea, which can make overall metrics misleading. For example, CatBoost’s accuracy was 62.0%, with a sensitivity of 71.3% and a specificity of 60.6%. We therefore focus on the AUC and a full suite of classification metrics as depicted in Table 3. The modest accuracy values of 62–64% (Fig. 2) are indicative of the class imbalance issue, since a naive classifier that predicted ‘No Diarrhea’ in all cases would achieve 78.55% accuracy with zero sensitivity.
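The naive-classifier baseline can be recomputed from the counts in Table 1. The sketch below makes no assumption about how records with a missing outcome were handled in the final analytic sample; it simply shows that the 78.55% figure corresponds to keeping those records in the denominator, and what the baseline would be among children with a known outcome.

```python
# Diarrhea status counts from Table 1
no_diarrhea, diarrhea, missing = 26_647, 4_066, 3_211
total = no_diarrhea + diarrhea + missing

# Share of 'No' among all records (missing outcomes kept in the denominator)
baseline_all = no_diarrhea / total

# Share of 'No' among records with an observed outcome only
baseline_known = no_diarrhea / (no_diarrhea + diarrhea)

print(f"all records:   {baseline_all:.2%}")
print(f"known outcome: {baseline_known:.2%}")
```

Under either denominator, an always-negative classifier beats the reported 62–64% accuracy while detecting no cases, which is why accuracy alone is a poor guide here.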

Fig. 2. **Comparative performance of four machine learning models across five classification metrics.** This figure compares key performance metrics, including Balanced Accuracy, F1-Score, and Sensitivity, for the models (CatBoost, LightGBM, Logistic Regression, and XGBoost) developed to predict childhood diarrhea. CatBoost achieved the highest balanced accuracy (0.660) and sensitivity (0.713), while the other three models shared a balanced accuracy of 0.655 and F1-scores were nearly identical across models (0.325–0.328). In all models, sensitivity (0.672–0.713) exceeded accuracy (0.620–0.642), and F1-scores were low because precision was low (0.212–0.217), highlighting the trade-off between correctly identifying positive cases and precision

Feature Importance Across Models

The feature importance for all ML models was calculated in order to determine the top predictors. Relative importance scores for the top predictors from each model are shown in Table 4.

Table 4.

Combined analysis: feature importance for top 15 predicting diarrhea status

Feature	Tree-based Models				Logistic Regression
Feature	XGBoost (%)	CatBoost (%)	LightGBM (%)	Average ML (%)	McFadden R² (%)	Share of Total R² (%)
Region	100.0	100.0	100.0	100.0	100.0	33.6
Child’s Age	70.5	92.1	65.5	76.1	58.4	19.6
Wealth Index	29.6	21.5	22.1	24.2	32.9	11.0
Mother’s Level of Education	28.3	19.7	23.3	23.6	28.4	9.5
Mother’s Age	26.8	14.7	18.2	19.6	8.0	2.7
Listens to Radio	21.6	21.4	16.1	19.5	4.7	1.6
Watches TV	17.3	16.2	12.4	15.2	14.6	4.9
Drinking Water Source	11.9	8.1	8.8	9.5	3.5	1.2
Residence	13.2	6.7	8.0	9.1	8.6	2.9
Access to Electricity	10.0	7.7	7.9	8.5	12.0	4.0
Total under-5 in households	8.3	6.2	5.9	6.8	1.3	0.4
ChildSex	10.6	1.7	6.9	6.3	0.0	0.0
Total Household members	9.0	4.4	5.2	6.1	4.9	1.6
Internet use Last month	6.9	6.1	5.2	6.0	10.7	3.6
Ever used the Internet	4.8	1.1	2.2	2.6	10.1	3.4

The ‘Share of Total R²’ was calculated as the proportional decrease in McFadden’s pseudo-R² when each predictor was removed from the full logistic regression model, providing an estimate of its individual contribution to the model’s explanatory power. The table integrates the statistical explanation power (McFadden R²) with normalized importance scores from three individual machine learning algorithms and their average

For comparative purposes, the scores from XGBoost (Gain), CatBoost (Prediction Value Change), and LightGBM (Gain) were normalized to a 0-100% scale. The models had a strong consensus on the most influential predictors, and the feature importance rankings were consistent across methods. Region was the strongest predictor, with 100% normalized ML importance and a 33.6% share of the total R², representing its proportional contribution to the model’s explanatory power in the logistic regression framework, followed by Child’s Age at 76.1% ML importance, and Wealth Index at 24.2% ML importance. The stability of these results across cross-validation folds and the distribution of prediction probabilities are visualized in Figs. 3 and 4, respectively.
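The normalization and averaging can be sketched as follows, assuming each model's raw scores were divided by that model's maximum so the top feature scores 100. The raw values are hypothetical, and the sketch is not intended to reproduce the figures in Table 4.

```python
from statistics import mean

def normalize_to_max(raw_scores):
    """Rescale importance scores so that the largest value becomes 100."""
    top = max(raw_scores.values())
    return {feature: 100 * (score / top) for feature, score in raw_scores.items()}

def average_importance(per_model):
    """Average normalized importance across models, feature by feature."""
    features = next(iter(per_model.values())).keys()
    return {f: mean(scores[f] for scores in per_model.values()) for f in features}

# Hypothetical raw importance scores on each model's native scale
raw = {
    "XGBoost (Gain)": {"Region": 0.42, "Child's Age": 0.30, "Wealth Index": 0.12},
    "CatBoost (Prediction Value Change)": {"Region": 38.0, "Child's Age": 35.0, "Wealth Index": 8.0},
    "LightGBM (Gain)": {"Region": 5200.0, "Child's Age": 3400.0, "Wealth Index": 1150.0},
}

normalized = {model: normalize_to_max(scores) for model, scores in raw.items()}
average = average_importance(normalized)
for feature, value in average.items():
    print(f"{feature}: {value:.1f}")
```

Normalizing within each model before averaging puts the three importance measures on a common scale, although they still measure different things and the average is best read as a ranking aid.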

Fig. 3. **Comparison of ROC AUC scores across four machine learning models.** This figure compares model performance by displaying the distribution of Receiver Operating Characteristic Area Under the Curve (ROC AUC) scores obtained from cross-validation, where each point represents the AUC value from a single cross-validation fold, allowing an assessment of model stability and variance. The models evaluated are CatBoost, LightGBM, Logistic Regression, and XGBoost, with respective mean AUC scores of approximately 0.706, 0.705, 0.704, and 0.703, providing a quantitative comparison of their discriminatory power for the classification task

Fig. 4. **Prediction probability distributions for four machine learning models, stratified by actual clinical outcome.** Density plots for CatBoost, Logistic Regression, XGBoost, and LightGBM classifiers display the model-generated probabilities for a diarrheal event, with blue distributions representing cases where “No Diarrhea” was the true outcome and red distributions representing “Diarrhea” cases. Ideal model performance is characterized by a clear separation between the two distributions, where the blue density (true negative) is concentrated near a prediction probability of 0.0 and the red density (true positive) is concentrated near 1.0. The degree of overlap between the red and blue curves for each model visually indicates its discriminative ability, with less overlap suggesting a superior ability to distinguish between the two clinical states and more confident predictions

In summary, regional inequity dominates diarrhea risk, with the North-East being most vulnerable. Child’s age and hygiene are consistently important across analytic models, while CatBoost demonstrated the best balance of accuracy and sensitivity among the ML models. Media use plays a multidimensional role, suggesting that health messaging can be conveyed through deliberate and targeted radio shows, while internet use needs further socioeconomic stratification (Table 4).

Sensitivity Analysis

Sensitivity analysis, comparing MICE with CCA, confirmed that several core predictors were robust: region, child’s age, and wealth index (Table 5).

Table 5.

Sensitivity analysis of factors associated with diarrhea: comparison of multiple imputation (MICE) and complete case analysis (CCA) models

Variable	Category (MICE / CCA label)	Unadjusted OR (95% CI), MICE / CCA	Adjusted OR (95% CI), MICE / CCA	Change in Significance	Direction Consistent?	Magnitude Change
Region	North-East / Region 2	2.14 (1.95–2.35) / 3.07 (2.58–3.68)	2.04 (1.83–2.27) / 3.04 (2.49–3.73)	No change (p < 0.001)	Yes	Minimal
	North-West / Region 3	0.93 (0.84–1.02) / 1.56 (1.31–1.87)	0.86 (0.77–0.97) / 1.50 (1.22–1.84)	Became significant (Adj)	No (direction flipped)	Moderate
	South-East / Region 4	0.45 (0.39–0.52) / 0.60 (0.46–0.77)	0.44 (0.37–0.52) / 0.55 (0.42–0.71)	No change (p < 0.001)	Yes	Minimal
Mother’s Age	Older / Category 4	0.55 (0.47–0.64) / 0.56 (0.46–0.69)	0.80 (0.68–0.94) / 0.81 (0.65–1.01)	Became insignificant (Adj)	Yes	Moderate
Education	Higher Education	0.40 (0.34–0.47) / 0.40 (0.33–0.49)	0.76 (0.62–0.93) / 0.81 (0.61–1.07)	Weakened significance	Yes	Large
Wealth Index	Very Rich / Category 5	0.34 (0.30–0.38) / 0.32 (0.27–0.38)	0.57 (0.46–0.70) / 0.50 (0.38–0.67)	No change (p < 0.001)	Yes	Moderate
Child’s Age	Young / Category 3	2.19 (1.93–2.49) / 2.55 (2.19–2.98)	2.31 (2.03–2.63) / 2.86 (2.44–3.36)	No change (p < 0.001)	Yes	Minimal
Internet Use Ever	Yes	0.48 (0.42–0.55) / 0.49 (0.41–0.59)	1.48 (1.09–1.99) / 1.59 (1.06–2.30)	Reversed direction	No	Large
TV Watching	> Weekly	0.59 (0.55–0.64) / 0.65 (0.58–0.72)	1.08 (0.95–1.21) / 1.38 (1.16–1.63)	Reversed direction	No	Large
Radio Listening	> Weekly	0.83 (0.77–0.89) / 0.92 (0.82–1.02)	1.32 (1.21–1.45) / 1.36 (1.19–1.55)	Reversed direction	No	Large
Residence	Urban	0.70 (0.65–0.75) / 0.70 (0.64–0.77)	1.13 (1.04–1.24) / 1.08 (0.95–1.23)	Reversed direction	No	Large
Improved Water Source	Yes	0.80 (0.75–0.86) / 0.80 (0.73–0.87)	1.00 (0.94–1.08) / 1.00 (0.90–1.11)	Became insignificant	No	Moderate
Safe Waste Disposal	Yes	1.19 (1.12–1.27) / 1.32 (1.20–1.47)	0.92 (0.85–0.99) / 0.84 (0.75–0.95)	Reversed direction	No	Large


Bold text indicates statistical significance (p < 0.05). Reference categories are as defined in the original tables

aOR adjusted Odds Ratio, CI Confidence Interval, MICE Multiple Imputation by Chained Equations, CCA Complete Case Analysis

*The protective effect for specific frequencies of internet use (e.g., < Once a week, almost every day) was consistent and significant in both models, with aORs approximately ranging from 0.45 to 0.60, creating a paradox with the “Ever Used” variable

However, the sensitivity analysis also revealed important nuance. Perhaps most notable is the reversed relationship between urban residence and diarrhea risk after control for confounders. Whereas unadjusted analyses suggested a protective association, both adjusted models pointed toward increased risk, reaching significance in the MICE model (aOR = 1.13, p = 0.006) but not in the CCA model (aOR = 1.08, 95% CI 0.95–1.23). Media/technology exposure variables (‘Ever used the Internet’, ‘Listens to Radio’) were highly sensitive to adjustment and switched direction, suggesting that their relationships are likely to be confounded and should be interpreted carefully.
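The robustness checks summarized in Table 5 can be illustrated with a minimal Python sketch. It assumes only the adjusted odds ratios and 95% confidence intervals printed in the table; the names `Estimate`, `is_significant`, `direction`, and `compare` are hypothetical helpers introduced here, not part of the authors' analysis. The sketch compares MICE with CCA estimates only, whereas some columns of Table 5 also appear to describe changes between unadjusted and adjusted models.

```python
from typing import NamedTuple


class Estimate(NamedTuple):
    """An odds ratio with its 95% confidence interval."""
    odds_ratio: float
    lower: float
    upper: float


def is_significant(est: Estimate) -> bool:
    # A 95% CI that excludes 1.0 corresponds to p < 0.05.
    return est.lower > 1.0 or est.upper < 1.0


def direction(est: Estimate) -> str:
    # Classify the point estimate relative to the null value of 1.0.
    if est.odds_ratio > 1.0:
        return "risk"
    if est.odds_ratio < 1.0:
        return "protective"
    return "null"


def compare(mice: Estimate, cca: Estimate) -> dict:
    """Summarize agreement between the MICE and CCA estimates."""
    return {
        "direction_consistent": direction(mice) == direction(cca),
        "significant_in_mice": is_significant(mice),
        "significant_in_cca": is_significant(cca),
    }


# Adjusted ORs (MICE first, CCA second) copied from Table 5.
urban = compare(Estimate(1.13, 1.04, 1.24), Estimate(1.08, 0.95, 1.23))
north_west = compare(Estimate(0.86, 0.77, 0.97), Estimate(1.50, 1.22, 1.84))

print("Urban residence:", urban)
print("North-West:", north_west)
```

Under these assumptions, urban residence keeps the same direction in both models but is significant only under MICE, while the North-West estimate is significant in both models with opposite directions.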

Discussion

This study presents a comprehensive characterization of the determinants of childhood diarrhea in Nigeria, combining conventional logistic regression with machine learning (ML) methods to uncover risk factors and improve prediction modeling. Our results indicate appreciable regional variation, age- and sanitation-related risk, and an unexpected influence of media exposure, each with substantial implications for public health policy and intervention planning.

Key Findings in Context

The strong regional disparities observed in our analysis, with the North-East emerging as a high-risk region with significantly increased odds of diarrhea, corroborate existing literature on Nigeria’s North-South health divide [14]. The increased diarrhea risk among North-East children may reflect cumulative factors, including insecurity that limits access to health care, poorer WASH facilities, and lower maternal education levels [2]. Conversely, the protective associations observed in southern states likely reflect better urban sanitation coverage and health literacy. These results indicate that geographically targeted interventions are needed, prioritizing WASH programs in northern states and strengthening preventive care in densely populated urban areas.

Child’s age was the second most robust predictor in all models. The strongly reduced risk at older ages, evident as a protective gradient in both regression and ML models, parallels the marked post-infancy decline in risk attributed to immunological and behavioral trends in sub-Saharan Africa [1]: older children benefit from completed vaccine series (e.g., rotavirus), better hygiene practices, and less exposure to contaminated weaning foods. This suggests that interventions aimed at caregivers of infants (< 12 months), such as nutrition promotion and provision of safe water, can yield disproportionate gains. Moreover, the significant protective association of maternal education observed in this study was stronger than that of wealth in the regression analysis, underscoring the need for investment in female education.

Sanitation and Socioeconomic Determinants

Notably, the multilevel model comparison indicated that WASH variables did not significantly improve model fit over and above the core socioeconomic and demographic factors. This suggests that, in the contemporary context, socioeconomic status and region are the more fundamental determinants, while WASH acts as a mediating or correlated factor. Both logistic regression and ML models nonetheless converged on sanitation as a modifiable risk factor, albeit with subtle differences. Safe faeces disposal showed a protective association (8% reduction in odds) under regression, and CatBoost ranked it among the top 10 predictors (importance = 3.73). This is consistent with global evidence linking poor sanitation to the presence of diarrheal pathogens [5]. Our ML analysis also identified complex interactions that are difficult to pre-specify in a regression framework. For instance, the models identified a synergistic interaction between long water-fetching time (> 30 min) and poor sanitation, where the joint effect was much greater than the sum of the individual effects (Supplementary Figure S1). Whereas logistic regression relies on explicit specification of such interactions, machine learning models such as CatBoost can learn these complex, non-linear relationships directly from the data. The capacity to discover such nuanced risk profiles, for example, households burdened by both poor access to water and inadequate sanitation, allows for better targeting of the new WASH policy in Nigeria [15] and supports whole-household sanitation upgrades over standalone water-source interventions.
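One conventional way to express a joint effect that exceeds the sum of the individual effects is the relative excess risk due to interaction (RERI), a standard epidemiological measure of additive interaction. The text does not state how the interaction was quantified, so RERI is swapped in here purely for illustration, and the odds ratios in this Python sketch are hypothetical values, not estimates from this study.

```python
def reri(or_both: float, or_a_only: float, or_b_only: float) -> float:
    """Relative excess risk due to interaction on the additive scale.

    All odds ratios are relative to the group exposed to neither factor.
    A value above 0 indicates a joint effect larger than the sum of the
    two individual excess effects (super-additive interaction).
    """
    return or_both - or_a_only - or_b_only + 1.0


# HYPOTHETICAL odds ratios, for illustration only (not study results).
or_water_only = 1.3        # long water-fetching time, adequate sanitation
or_sanitation_only = 1.4   # poor sanitation, short water-fetching time
or_both_exposures = 2.5    # both exposures present

value = reri(or_both_exposures, or_water_only, or_sanitation_only)
print(f"RERI = {value:.2f}")
if value > 0:
    print("Joint effect exceeds the sum of the individual effects.")
```

With odds ratios standing in for relative risks, this is an approximation that is usually considered reasonable only when the outcome is not too common.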

Similarly, a strong socioeconomic gradient was observed: children from the richest households had 43% lower odds of diarrhea than those from the poorest, in line with trends reported in Nigeria’s Demographic and Health Survey [2]. ML feature importance analyses, such as that from LightGBM, indicated that the effect of poverty is non-linear and regionally mediated. This challenges the assumption that uniform cash transfers would be sufficient; poverty reduction programs may need to be supplemented by context-based health education, particularly since the protective association of maternal schooling (higher education) exceeded that of wealth in the regression model.
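The percentage statements in this section follow directly from the adjusted odds ratios. As a minimal sketch, the helper below (a hypothetical name, `percent_change_in_odds`) converts an aOR into the percent change in odds relative to the reference category, using aORs reported in this article.

```python
def percent_change_in_odds(odds_ratio: float) -> float:
    """Percent change in odds relative to the reference category."""
    return (odds_ratio - 1.0) * 100.0


# Adjusted odds ratios reported in the article (MICE models).
reported = [
    ("Safe faeces disposal", 0.92),
    ("Very rich vs poorest", 0.57),
    ("North-East region", 2.04),
]

for label, aor in reported:
    change = percent_change_in_odds(aor)
    print(f"{label}: aOR = {aor:.2f}, change in odds = {change:+.0f}%")
```

An aOR of 0.92 therefore corresponds to 8% lower odds and an aOR of 0.57 to 43% lower odds. These are changes in odds, not in probability, so they should not be read as risk reductions of the same size.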

Media Exposure: Paradoxes and Opportunities

The apparently paradoxical finding that ‘Ever used the Internet’ was associated with increased diarrhea risk, whereas specific frequencies of use were protective, reflects a complex relationship that is likely driven by socioeconomic confounding. As demonstrated in the sensitivity analysis (Table 5), this finding was consistent across both MICE and CCA models, indicating that it is not an artifact of imputation. Although wealth and education were controlled for in our models, Internet access in Nigeria remains stratified by wealth, and this variable may act as a proxy for unmeasured factors related to urban, wealthier lifestyles that paradoxically increase exposure risks, such as different childcare practices, dietary changes, under-reporting of minor diarrhea episodes, or reduced health-seeking behavior for diarrhea [16, 17]. Therefore, this association should not be considered causal; it more plausibly reflects residual confounding arising from the complex interplay between modern technology, wealth, and health outcomes in a rapidly changing society.

Similarly, the association of frequent radio listening with increased diarrhea risk underscores the complex relationship between media exposure and health outcomes. Radio listening may be a proxy for other unmeasured regional, cultural, or behavioral characteristics of frequent listeners rather than a causal factor. This suggests that radio, as a channel for health messaging, should be targeted to specific groups of listeners rather than used as a blanket strategy. Alternatively, digital media use may displace time otherwise spent on preventive care. However, the protective association of specific usage frequencies calls for qualitative exploration of the type of content consumed. Future research should explore qualitative dimensions such as content type (for example, entertainment apps vs. health information) and caregiver device-sharing behavior.

Methodological Contributions: ML and the Classical Model

Our comparison yields a nuanced view of the value of ML versus traditional models. The nearly identical AUC values (all > 0.70) demonstrate that logistic regression and the ML models provide comparable discriminative ability, and that logistic regression remains an invaluable, interpretable tool. Though the absolute AUC differences were small, the consistent superiority of CatBoost across several metrics (AUC, balanced accuracy, and sensitivity) suggests its practical utility for screening applications where identifying true positives is prioritized. The consistency of leading predictors (region, child’s age, sanitation) between models supports the robustness of these inferences, while inconsistencies (for example, the changing importance of type of toilet facility) suggest context-dependent variable encoding. In this study, the principal value of ML was not in out-predicting a well-specified regression model but in its capacity to: (1) independently validate the dominance of key predictors, such as region and age; (2) identify complex non-linear interactions, for example, between water-fetching time and sanitation, that are hard to pre-specify; and (3) deliver a model (CatBoost) with a high-sensitivity profile directly applicable to public health screening [9]. The two approaches are therefore best used in tandem. For the overall task of ranking children by their risk of diarrhea, a well-specified traditional model performs almost as well as more complex algorithms, which supports the continued utility of logistic regression for prediction in this domain.
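The screening-oriented metrics discussed above are linked by simple arithmetic: balanced accuracy is the mean of sensitivity and specificity. The Python sketch below uses the CatBoost values reported in the abstract (balanced accuracy = 0.660, sensitivity = 71.3%) to back out the specificity they imply. The function names are hypothetical, and the derived specificity is an inference from rounded figures, not a value reported by the authors.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of actual diarrhea cases that the model flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of children without diarrhea that the model clears."""
    return true_neg / (true_neg + false_pos)


def balanced_accuracy(sens: float, spec: float) -> float:
    """Mean of sensitivity and specificity."""
    return (sens + spec) / 2.0


def implied_specificity(bal_acc: float, sens: float) -> float:
    """Specificity implied by a given balanced accuracy and sensitivity."""
    return 2.0 * bal_acc - sens


# CatBoost values as reported in the abstract.
catboost_spec = implied_specificity(bal_acc=0.660, sens=0.713)
print(f"Implied specificity: {catboost_spec:.3f}")
```

The implied specificity is roughly 0.61, subject to rounding in the reported inputs. Under that assumption, close to four in ten children without diarrhea would still be flagged, which is relevant when weighing the tool for screening use.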

Policy Implications

Four policy implications can be derived from this study. (a) Focused WASH investments in the North-East and North-West, including pit latrines and point-of-use water treatment subsidies, are recommended, given that time to water source (> 30 min) substantially increases risk. (b) Interventions promoting the scale-up of maternal education across age groups must carefully consider channel selection. Radio is a pervasive medium, but our results suggest its use is associated with an increased risk of diarrhea, likely reflecting confounding factors or audience segmentation. Hence, the content and context of radio-based health messaging must be carefully designed and evaluated, or other channels should be considered. (c) Poverty-reduction synergies can be achieved by integrating health education into Nigeria’s social protection programs, using the National Social Register, for instance, to address the non-linear wealth-risk gradients uncovered by the ML models. (d) Surveillance can be strengthened using CatBoost-based risk score instruments at primary health care centers. This tool, selected for its high sensitivity (71.3%), which was further optimized with SMOTE, can help identify high-risk children for preventive treatment.

While the CatBoost-based risk score is promising for screening at primary health centers, its operational feasibility needs consideration. The major challenges are the requirement for digital infrastructure, staff training, and integration with the Health Management Information System (HMIS). A cost-effectiveness analysis comparing this targeted approach with universal interventions is essential before wide-scale deployment. At the initial stage, such a tool should be tested in high-risk zones such as the North-East and North-West to better understand its real-life applications and operational requirements. Future studies could develop a CatBoost-based risk score for piloting in these regions, exploiting its high sensitivity to allocate resources efficiently, though this would require considerable validation and integration effort.

Limitations and future directions

This work improves diarrhea risk modeling but has some limitations. First, the cross-sectional nature of the NDHS data prohibits causal inference; the associations identified are correlational. Second, despite adjusting for many covariates, residual confounding from unmeasured variables, including rotavirus vaccination status, household food security, and more detailed cultural or behavioral factors, may remain. Third, the fact that WASH variables did not add significant predictive power in our model comparison should not be interpreted as evidence against their causal importance but may reflect measurement limitations or the dominant influence of socio-economic confounders. Last, while we undertook multiple imputation and conducted a sensitivity analysis, Missing Not At Random (MNAR) data could still introduce bias. Although we empirically validated our choice of SMOTE over class weighting, other resampling strategies or ensemble methods could result in marginally better performance in other contexts. Future studies should prospectively validate these models, integrate environmental and clinical data, including vaccination records, and explore explainable AI techniques to better understand the complex mechanisms identified by the ML models.

Conclusion

Our hybrid approach serves as a template for how public health researchers can harness the complementary strengths of interpretable regression and powerful ML algorithms. The specific policy levers identified, such as combining WASH investments with poverty alleviation programs and deploying a high-sensitivity screening tool, offer a concrete path for Nigeria to optimize its resource allocation for diarrhea control, contributing to the precision public health agenda for achieving SDG target 3.2.

Supplementary Information

Supplementary Material 1 (1.3 MB, docx)

Acknowledgements

The first author is grateful to Duy Tan University for providing a conducive environment for this study.

Abbreviations

aOR: Adjusted Odds Ratio
AUC: Area Under the Curve
CCA: Complete Case Analysis
CI: Confidence Interval
GBDT: Gradient Boosting Decision Tree
HMIS: Health Management Information System
LR: Logistic Regression
MICE: Multiple Imputation by Chained Equations
ML: Machine Learning
MNAR: Missing Not at Random
NDHS: Nigeria Demographic and Health Survey
NPC: National Population Commission (Nigeria)
NHREC: National Health Research Ethics Committee (Nigeria)
ORT: Oral Rehydration Therapy
ROC: Receiver Operating Characteristic
SDG: Sustainable Development Goal
SMOTE: Synthetic Minority Over-sampling Technique
T2DM: Type 2 Diabetes Mellitus
U5MR: Under-5 Mortality Rate
VIF: Variance Inflation Factor
WASH: Water, Sanitation, and Hygiene
WHO: World Health Organization
XGBoost: Extreme Gradient Boosting

Authors’ contributions

Conceptualization: JOA, AJS, and SYMS; Data collation and analysis: JOA, TSA, OAE, and SYMS; Writing: JOA, VIA, and TSA; Visualization: AJS, OAE, VIA, and JOA; Review, editing, and final draft: JOA. All authors read and approved the final manuscript.

Funding

Funding for this study was solely provided by the authors.

Data availability

Data extracted and used in this study are available on request.

Declarations

Ethics approval and consent to participate

This study was a secondary analysis of an existing dataset. The survey personnel obtained ethical approval from the National Ethics Committee of the Federal Ministry of Health, Abuja, Nigeria, and ICF International, Rockville, MD, USA. Informed consent was obtained from study participants prior to participation in the survey. Permission to use and analyze the dataset was obtained by registering the study on the Demographic and Health Survey (DHS) website.

Consent for publication

NA.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. World Health Organization. Diarrhoeal disease. Geneva: WHO; 2024 Mar 4. https://www.who.int/news-room/fact-sheets/detail/diarrhoeal-disease. Accessed 10 Oct 2024.
2. National Population Commission (NPC) [Nigeria] and ICF. Nigeria demographic and health survey 2018. Abuja, Nigeria, and Rockville, Maryland, USA: NPC and ICF; 2019.
3. Troeger C, Blacker BF, Khalil IA, Rao PC, Cao S, Zimsen SR, et al. Estimates of the global, regional, and national morbidity, mortality, and aetiologies of diarrhoea in 195 countries: a systematic analysis for the global burden of disease study 2016. Lancet Infect Dis. 2018;18(11):1211–28. doi:10.1016/S1473-3099(18)30362-1.
4. Fink G, Günther I, Hill K. The effect of water and sanitation on child health: evidence from the demographic and health surveys 1986–2007. Int J Epidemiol. 2011;40(5):1196–204. doi:10.1093/ije/dyr102.
5. Wolf J, Hunter PR, Freeman MC, Cumming O, Clasen T, Bartram J, et al. Impact of drinking water, sanitation and handwashing with soap on childhood diarrhoeal disease: updated meta-analysis and meta-regression. Trop Med Int Health. 2018;23(5):508–25. doi:10.1111/tmi.13051.
6. Gakidou E, Cowling K, Lozano R, Murray CJL. Increased educational attainment and its effect on child mortality in 175 countries between 1970 and 2009: a systematic analysis. Lancet. 2010;376(9745):959–74. doi:10.1016/S0140-6736(10)61257-3.
7. Bzdok D, Altman N, Krzywinski M. Statistics versus machine learning. Nat Methods. 2018;15(4):233–4. https://www.nature.com/articles/nmeth.4642.
8. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:785–94. doi:10.1145/2939672.2939785.
9. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. Adv Neural Inf Process Syst. 2018;31:6639–49. doi:10.48550/arXiv.1706.09516.
10. Wiens J, Shenoy ES. Machine learning for healthcare: on the verge of a major shift in healthcare epidemiology. Clin Infect Dis. 2018;66(1):149–53. doi:10.1093/cid/cix731.
11. United Nations (UN). Transforming our world: the 2030 Agenda for Sustainable Development. New York: United Nations; 2015. https://sdgs.un.org/2030agenda. Accessed 5 Feb 2025.
12. World Health Organization (WHO). Diarrhoeal disease. Geneva: WHO; 2017. https://www.who.int/news-room/fact-sheets/detail/diarrhoeal-disease. Accessed 10 Oct 2024.
13. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54. doi:10.48550/arXiv.1603.02754.
14. Bolarinwa OA, Ahinkorah BO, Seidu AA, Ameyaw EK, Okyere J, Hagan JE, et al. Spatial disparities and socio-demographic predictors of childhood diarrhoea in Nigeria: insights from the 2018 demographic and health survey. Int J Environ Res Public Health. 2021;18(11):5876. doi:10.3390/ijerph18115876.
15. Federal Ministry of Water Resources. National Water Policy. Abuja, Nigeria: Federal Government of Nigeria; 2021. https://waterresources.gov.ng/wp-content/uploads/2021/10/National-Water-Policy-2021.pdf. Accessed 16 Oct 2024.
16. Laverack G, Manoncourt E. Key experiences of community engagement and social mobilization in the Ebola response. Glob Health Promot. 2016;23(1):79–82. doi:10.1177/1757975915606674.
17. Pew Research Center. Internet connectivity seen as having positive impact on life in sub-Saharan Africa. Washington, DC: Pew Research Center; 2018. Available from: https://docs.edtechhub.org/lib/H282TBDA. Accessed 15 Feb 2025.
