Abstract
Background
Undernutrition among children under the age of five is a major public health concern, especially in developing countries. This study aimed to use machine learning (ML) algorithms to predict undernutrition and identify its associated factors.
Methods
Secondary data analysis of the 2017 Multiple Indicator Cluster Survey (MICS) was performed using R and Python. The main outcomes of interest were undernutrition (stunting: height-for-age z-score (HAZ) < -2 SD; wasting: weight-for-height z-score (WHZ) < -2 SD; and underweight: weight-for-age z-score (WAZ) < -2 SD). Seven ML algorithms were trained and tested: linear discriminant analysis (LDA), logistic model, support vector machine (SVM), random forest (RF), least absolute shrinkage and selection operator (LASSO), ridge regression, and extreme gradient boosting (XGBoost). The ML models were evaluated using accuracy, the confusion matrix, and the area under the receiver operating characteristic curve (AUC-ROC).
Results
In total, 8564 children were included in the final analysis. The average age of the children was 926 days, and slightly more than half were female. The weighted prevalence rates of stunting, wasting, and underweight were 17%, 7%, and 12%, respectively. The accuracies of the ML models for wasting were (LDA: 84%; Logistic: 95%; SVM: 92%; RF: 94%; LASSO: 96%; Ridge: 84%; XGBoost: 98%), for stunting (LDA: 86%; Logistic: 86%; SVM: 98%; RF: 88%; LASSO: 86%; Ridge: 86%; XGBoost: 98%), and for underweight (LDA: 90%; Logistic: 92%; SVM: 98%; RF: 89%; LASSO: 92%; Ridge: 88%; XGBoost: 98%). The AUC values for wasting were (LDA: 99%; Logistic: 100%; SVM: 72%; RF: 94%; LASSO: 99%; Ridge: 59%; XGBoost: 100%), for stunting (LDA: 89%; Logistic: 90%; SVM: 100%; RF: 92%; LASSO: 90%; Ridge: 89%; XGBoost: 100%), and for underweight (LDA: 95%; Logistic: 96%; SVM: 100%; RF: 94%; LASSO: 96%; Ridge: 82%; XGBoost: 82%). Age, weight, length/height, sex, region of residence, and ethnicity were important predictors of wasting, stunting, and underweight.
Conclusion
The XGBoost model was the best model for predicting wasting, stunting, and underweight. The findings showed that different ML algorithms could be useful for predicting undernutrition and identifying important predictors for targeted interventions among children under five years in Ghana.
Introduction
Undernutrition (wasting, stunting, and underweight) among children under five years of age is a public health issue with serious implications [1]. Undernutrition contributes to nearly half of the mortality in children under five years of age, with the highest burden in low- and middle-income countries [2, 3]. The burden of undernutrition is not limited to the clinical or socioeconomic outcomes [3]. Global statistics indicate a decreasing trend in undernutrition; however, it remains high in developing countries [4], including Ghana, especially for stunting.
The global prevalence of stunting among children under five is 22%, and that of wasting is 7%, according to the United Nations Children’s Fund (UNICEF) [2]. Ghana has made strides in reducing undernutrition among children under five years old. The 2014 Ghana Demographic and Health Survey (GDHS) reported stunting at 19%, wasting at 5%, and underweight at 11% [5], with the highest burden in the northern regions, similar to a study conducted in Ethiopia [6]. The rates of stunting, wasting, and underweight were 28%, 9%, and 14%, respectively, in 2008 [7]. Stunting, wasting, and underweight were reported at 18%, 7%, and 13%, respectively, in a recent Multiple Indicator Cluster Survey [8]. These trends show a decrease in the rate of stunting but not of wasting or underweight.
The use of machine learning (ML) algorithms is key to identifying the factors associated with undernutrition and driving decisions to reduce it. Previous studies have relied on logistic regression to determine the factors associated with undernutrition [9, 10], which may not always be sufficient to identify patterns in data [11]. Machine learning approaches have shown promising results in identifying factors associated with undernutrition, including previously undiscovered factors [1]. Several studies conducted in Bangladesh [10, 12, 13], Ethiopia [1, 6], and India [14, 15] have shown the usefulness of machine learning algorithms. The factors found to be significantly associated with undernutrition vary by location and include time to water source, history of anaemia, child age > 30 months, low birth weight, and maternal underweight [1], as well as urban-rural settlement, parental literacy, and place of residence [6, 12].
To the best of our knowledge, evidence on machine learning algorithms and undernutrition among children under five years of age in Ghana is limited. Thus, this proof-of-concept study aimed to provide evidence for the use of machine learning algorithms to predict undernutrition among children under five years of age in Ghana and to identify associated factors.
Methods
Data source
Data from the 2017 Multiple Indicator Cluster Survey (MICS) were used in this study. The survey was conducted from October 2017 to January 2018. The original dataset for children under five years of age contained 8906 records. Data collection was performed in the ten administrative regions of Ghana. In each administrative region, the main sampling units were rural and urban areas. Thereafter, two-stage sampling was used to select households for the interviews.
Data preparation
Data were downloaded from the MICS website. Data wrangling was completed in R version 4.3.0 using tidyverse packages [16].
Study variables and measurements
Outcome
The main outcomes of interest were three nutritional indicators in children under five years of age: stunting, wasting, and underweight. The z-scores of the anthropometric measures weight-for-height (wasting), height-for-age (stunting), and weight-for-age (underweight) were used to assess nutritional status. Based on World Health Organization (WHO) criteria, a child was classified as wasted if the weight-for-height z-score was < -2 SD, stunted if the height-for-age z-score was < -2 SD, and underweight if the weight-for-age z-score was < -2 SD. Each of the three nutritional outcomes was coded as 1 if the indicator was present and 0 if it was absent. Therefore, children who were wasted, stunted, or underweight were coded as 1 for the respective indicator; otherwise, they were coded as 0 (normal).
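As an illustration, the minimal sketch below shows how the three binary indicators could be derived from WHO z-scores in Python; the DataFrame and column names (whz, haz, waz) are hypothetical placeholders rather than the actual MICS variable names.

```python
# Sketch only: hypothetical column names, not the MICS variable names.
import pandas as pd

df = pd.DataFrame({
    "whz": [-2.5, -1.0, 0.3],   # weight-for-height z-score
    "haz": [-2.1, -2.4, 1.0],   # height-for-age z-score
    "waz": [-2.8, -0.5, 0.2],   # weight-for-age z-score
})

# 1 = indicator present, 0 = normal, per the WHO < -2 SD cut-off
df["wasted"] = (df["whz"] < -2).astype(int)
df["stunted"] = (df["haz"] < -2).astype(int)
df["underweight"] = (df["waz"] < -2).astype(int)
print(df)
```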
Covariates
We considered a set of covariates to be predictors of malnutrition in Ghana. We included a set of variables based on the literature and availability in the dataset, excluding those with missing cases of > 50%. The following covariates were considered: age, sex, region, area, length/height, weight, child ill with cough for two weeks, child ill with fever for two weeks, child had diarrhoea for two weeks, health insurance, mother’s educational level, ethnicity, and combined wealth score.
Analytic strategy
R and Python programming languages were used for the analysis. The tidyverse packages [16] in R were used for data wrangling, and the scikit-learn package [17] in Python was used for machine learning. The survey package [18] in R was used to create a survey object that accounted for the primary sampling unit, stratification, and sample weights in the univariate and bivariate analyses. Seven ML algorithms (linear discriminant analysis (LDA), logistic model, support vector machine (SVM), random forest (RF), least absolute shrinkage and selection operator (LASSO) regression, ridge regression, and extreme gradient boosting (XGBoost)) were trained for each nutritional indicator. A summary of each algorithm is provided below:
Linear discriminant analysis (LDA)
Linear discriminant analysis (LDA) is a dimensionality reduction technique that is also used for classification. It is a supervised machine learning technique that finds a linear combination of features for the optimal classification of known groups. The goal of LDA is to find the axis in the feature space that maximizes the distance between the class means while minimizing the variability within the classes, ensuring optimal separation of the classes [19].
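A minimal sketch of this idea using scikit-learn, with synthetic two-class data standing in for the study variables, is shown below; the single transformed component corresponds to the discriminant axis described above.

```python
# Sketch only: synthetic two-class data, not the MICS variables.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)),   # class 0
               rng.normal(1.5, 1.0, (50, 4))])  # class 1, shifted mean
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
scores = lda.transform(X)          # projection onto the single discriminant axis
print(scores.shape)                # (100, 1): n_classes - 1 = 1 component
print(round(lda.score(X, y), 2))   # training accuracy
```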
Logistic regression
Logistic regression is a supervised machine learning technique used to classify binary outcomes or classes based on predicted probabilities. Logistic regression is similar to a linear regression model, but it transforms the linear predictor with the ‘sigmoid’ or ‘logistic’ function rather than using it directly as in linear regression [20]. The logistic function maps predicted values to probabilities between zero and one [20]. A decision boundary (threshold) is then used to create an optimal classification based on the probability score: observations with predicted probabilities above the threshold are assigned to one class, and those with predicted probabilities below the threshold are assigned to the other class [20].
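The sketch below illustrates the sigmoid transformation and the decision threshold on synthetic data; the 0.5 threshold and the data are illustrative assumptions, not the study's settings.

```python
# Sketch only: synthetic data; 0.5 is the conventional default threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Logistic function: maps the linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = sigmoid(X @ clf.coef_.ravel() + clf.intercept_[0])  # sigmoid of the linear predictor
pred = (proba >= 0.5).astype(int)                           # apply the decision threshold

print(np.allclose(proba, clf.predict_proba(X)[:, 1]))  # matches scikit-learn's probabilities
print(np.array_equal(pred, clf.predict(X)))             # matches the default classification
```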
Support vector machine (SVM)
Support vector machine (SVM) is a supervised ML technique used for classification and regression. The goal of SVM is to identify a hyperplane that ensures the optimal separation of classes. The support vectors are the data points nearest to the hyperplane on either side; removing them would change the position of the dividing hyperplane [21]. The margin is the distance between the hyperplane and the nearest data point on each side. SVM selects the hyperplane that maximizes this margin over the training data to ensure good separation. For complex classification problems, SVM can map the data into a higher-dimensional space to enhance separation through the process of kernelling [21].
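As a sketch of the kernelling idea, the example below fits an SVM with a radial basis function (RBF) kernel (the kernel referred to later in the Discussion) to synthetic data with a non-linear class boundary; the data and hyperparameters are illustrative assumptions.

```python
# Sketch only: synthetic data with a circular (non-linear) class boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # class depends non-linearly on the features

svm = SVC(kernel="rbf", C=1.0)   # RBF kernel implicitly maps the data to a higher-dimensional space
svm.fit(X, y)

print(svm.support_vectors_.shape)   # the points closest to the separating surface
print(round(svm.score(X, y), 2))    # training accuracy
```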
Random forest
Random forest (RF) is a supervised ML technique that leverages insights from several decision trees for classification. Unlike a decision tree, which uses only one tree to make predictions, a random forest fits several uncorrelated classification trees and combines their individual predictions, yielding better accuracy than any individual tree. Random forest uses bootstrap aggregation, or bagging, to select a random sample (with replacement) of the training dataset for each decision tree in the forest. Moreover, the splitting of nodes in the RF model is based on a random subset of features for each tree, so the features used by each tree may differ. The bagging process and the variation in feature selection across trees contribute to the robustness of RF prediction accuracy [22].
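The sketch below makes the bagging and random-feature-subset settings explicit in scikit-learn; the hyperparameter values are illustrative, not the tuned values used in this study.

```python
# Sketch only: synthetic data; hyperparameters are illustrative, not tuned.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=3)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees, each fit on a bootstrap sample
    bootstrap=True,        # bagging: rows sampled with replacement per tree
    max_features="sqrt",   # random subset of features considered at each split
    random_state=3,
)
rf.fit(X, y)
print(round(rf.score(X, y), 2))
```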
Least absolute shrinkage and selection operator (LASSO) regression
LASSO regression is a supervised ML technique that employs L1 regularization to control overfitting of data during regression. The L1 regularization method applies a penalty to the magnitude of the coefficients associated with each independent variable in the model. The penalty shrinks the less important coefficients towards zero, essentially eliminating them from the model. The tuning parameter (λ) is used to control the strength of the penalty in LASSO [23].
Ridge regression
Ridge regression is a supervised ML technique that uses L2 regularisation to overcome overfitting of the training data during regression. L2 regularisation in ridge regression penalises the loss function by adding the squared magnitudes of the coefficients as a penalty term [23].
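The sketch below contrasts the L1 (LASSO) and L2 (ridge) penalties described in the two sections above, using penalised logistic regression because the outcomes here are binary; in scikit-learn the parameter C is the inverse of the penalty strength λ, and the values shown are illustrative assumptions.

```python
# Sketch only: synthetic data; C = 1/lambda, so C = 0.1 implies a fairly strong penalty.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=4)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)  # L1 penalty
ridge = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)  # L2 penalty

# L1 drives unimportant coefficients exactly to zero; L2 only shrinks them towards zero.
print("L1 coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))
print("L2 coefficients set to zero:", int(np.sum(ridge.coef_ == 0)))
```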
Extreme gradient boosting (XGBoost)
XGBoost is a scalable tree-boosting system designed to improve the performance of machine learning models. It combines the strengths of gradient boosting and regularisation techniques. XGBoost uses a regularisation term that penalises complex models, allowing better control over model complexity, and it employs a parallel and distributed computing framework to handle large-scale datasets efficiently [24].
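A minimal sketch of an XGBoost classifier with its regularisation terms made explicit is shown below; the hyperparameter values are illustrative defaults rather than the tuned values from this study.

```python
# Sketch only: synthetic data; hyperparameters are illustrative, not the study's tuned values.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=5)

xgb = XGBClassifier(
    n_estimators=200,      # number of boosted trees fit sequentially
    max_depth=4,           # limits the complexity of each tree
    learning_rate=0.1,     # shrinks each tree's contribution
    reg_lambda=1.0,        # L2 regularisation on leaf weights
    reg_alpha=0.0,         # L1 regularisation on leaf weights
    eval_metric="logloss",
    random_state=5,
)
xgb.fit(X, y)
print(round(xgb.score(X, y), 2))
```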
ML approach
We trained the algorithms to identify features that predict nutritional indicators in children under five years in Ghana. First, the data were divided into training and test datasets, with 70% used for training and 30% for testing. The training dataset for wasting was oversampled to address class imbalance. We trained all seven ML algorithms (LDA, SVM, logistic, RF, LASSO, ridge, and XGBoost) on the training set separately for wasting, stunting, and underweight. We used 5-fold cross-validation to tune the hyperparameters of the models. Moreover, the same random seed was applied so that the same training and validation sets were used when training the different ML algorithms.
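The sketch below outlines this workflow for one outcome: a 70/30 split, simple random oversampling of the minority class in the training data only, and 5-fold cross-validated hyperparameter tuning with a fixed seed. The specific oversampling method and tuning grids used in the study are not reported, so those details are assumptions for illustration.

```python
# Sketch only: synthetic imbalanced data; the oversampling method and tuning grid are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.utils import resample

SEED = 42  # one fixed seed so every algorithm sees the same splits

X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.93, 0.07], random_state=SEED)

# 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=SEED, stratify=y)

# Randomly oversample the minority class in the training set only
X_min, y_min = X_train[y_train == 1], y_train[y_train == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=int((y_train == 0).sum()), random_state=SEED)
X_bal = np.vstack([X_train[y_train == 0], X_up])
y_bal = np.concatenate([y_train[y_train == 0], y_up])

# 5-fold cross-validation for hyperparameter tuning (one algorithm shown)
grid = GridSearchCV(RandomForestClassifier(random_state=SEED),
                    param_grid={"n_estimators": [100, 300],
                                "max_depth": [4, 8, None]},
                    cv=5, scoring="roc_auc")
grid.fit(X_bal, y_bal)
print(grid.best_params_, round(grid.score(X_test, y_test), 2))
```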
Algorithm evaluation
We evaluated the models on the test dataset using a confusion matrix and area under the receiver operating characteristic curve (AUC-ROC) plots. The accuracy, sensitivity, and specificity of the models were derived from the confusion matrix. For a standard binary classifier, the four possible outcomes of a confusion matrix are listed in Table 1.
Table 1. A sample confusion matrix of a binary classifier.
 | | Predicted | |
---|---|---|---|
 | | Positive | Negative |
Observed | Positive | True Positive (TP) | False Negative (FN) |
 | Negative | False Positive (FP) | True Negative (TN) |
Accuracy
The accuracy is the ratio of the total number of correct predictions (TP + TN) to the total number of predictions (TP + TN + FP + FN). Therefore, the accuracy is an estimate of the overall ability of the classifier to make correct predictions. Mathematically,
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity
The sensitivity of a classifier is defined as the ratio of the number of positive cases correctly classified by the model to the total number of positive cases. Sensitivity refers to the ability of a classifier to designate an individual with disease as positive. A highly sensitive classifier has a small proportion of false negatives, resulting in only a few missed cases. Mathematically,
Sensitivity = TP / (TP + FN)
Specificity
The specificity of the classifier is the ratio of the number of negative cases correctly classified by the model to the total number of negative cases. Specificity refers to the ability of a classifier to designate an individual without disease as negative. A highly specific classifier has a small proportion of false positives, resulting in only a few noncases being incorrectly diagnosed. Mathematically,
Specificity = TN / (TN + FP)
Area under the receiver operating characteristic curve (AUC-ROC)
The receiver operating characteristic (ROC) curve is a unit-square plot that shows the diagnostic ability of a classifier. It is produced by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity). The area under the curve (AUC) summarizes the area under the ROC curve, aggregating the performance of a classifier across different thresholds. AUC values range from 0 to 1. An AUC value of 0.5 implies that the classifier has no ability to discriminate and is no better than chance, whereas an AUC of 1 implies perfect discrimination. Therefore, the closer the AUC value of a classifier is to 1, the better the classifier, which provides a basis for evaluating and comparing the discrimination abilities of competing classifiers.
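The sketch below computes these metrics (accuracy, sensitivity, specificity, and AUC) from a confusion matrix on a held-out test set; the model and data are synthetic placeholders rather than the study's fitted models.

```python
# Sketch only: a placeholder model and synthetic data stand in for the study's fitted models.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=7)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)              # overall correct predictions
sensitivity = tp / (tp + fn)                            # true positive rate
specificity = tn / (tn + fp)                            # true negative rate
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} AUC={auc:.2f}")
```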
Variable importance
We estimated variable importance for the algorithms using Python. Variables with high importance values contributed most to the overall model accuracy.
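A sketch of how the top contributing variables could be extracted from a fitted XGBoost model is shown below; the feature names and data are placeholders, and the ranking mirrors the importance plots in Figs 4–6.

```python
# Sketch only: placeholder feature names and synthetic data.
import pandas as pd
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=25, random_state=8)
names = [f"feature_{i}" for i in range(X.shape[1])]

model = XGBClassifier(n_estimators=200, eval_metric="logloss",
                      random_state=8).fit(X, y)

# Rank features by their contribution to the fitted model, as in Figs 4-6
importance = pd.Series(model.feature_importances_, index=names)
print(importance.sort_values(ascending=False).head(20))
```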
Ethical considerations
Ethical approval was not required for this secondary analysis. Verbal consent was obtained from each adult participant and children aged between 15 and 17 years during the primary data collection. For the younger children, consent was obtained from their parents or caregivers. Participants were informed of their right to voluntary participation, confidentiality, and anonymity of the information obtained. Personal identifiable information was removed from the data set.
Results
Sample characteristics
A total of 8564 children under five years of age were included in the analysis. The mean age of the children was 926 days; 51% were female, and 57% resided in rural areas. Thirty-seven percent of the mothers had attained junior secondary education (Table 2).
Table 2. Sociodemographic and anthropometric characteristics of children.
Characteristic | N = 8,564¹ |
---|---|
Age (days) | 926 (520) |
Sex | |
Female | 4,331 (51%) |
Male | 4,233 (49%) |
Region | |
Ashanti | 2,055 (24%) |
Brong Ahafo | 803 (9.4%) |
Central | 907 (11%) |
Eastern | 890 (10%) |
Greater Accra | 826 (9.7%) |
Northern | 996 (12%) |
Upper East | 279 (3.3%) |
Upper West | 209 (2.4%) |
Volta | 686 (8.0%) |
Western | 911 (11%) |
Area | |
Rural | 4,866 (57%) |
Urban | 3,698 (43%) |
Length/Height (cm) | 86 (14) |
Weight (kg) | 11.6 (3.5) |
Child ill with cough | |
No | 5,966 (70%) |
Yes | 2,598 (30%) |
Child ill with fever | |
No | 6,358 (74%) |
Yes | 2,206 (26%) |
Child had diarrhoea | |
No | 7,095 (83%) |
Yes | 1,469 (17%) |
Health insurance | |
With insurance | 5,036 (59%) |
Without insurance | 3,527 (41%) |
Mother’s educational level | |
Pre-primary or none | 2,303 (27%) |
Primary | 1,729 (20%) |
JSS/JHS/Middle | 3,176 (37%) |
SSS/SHS/Secondary | 924 (11%) |
Higher | 432 (5.0%) |
Ethnicity | |
Akan | 3,959 (46%) |
Ewe | 877 (10%) |
Ga/Damgme | 625 (7.3%) |
Gruma | 375 (4.4%) |
Grusi | 188 (2.2%) |
Guan | 382 (4.5%) |
Mande | 36 (0.4%) |
Mole Dagbani | 1,455 (17%) |
Other | 666 (7.8%) |
Combined wealth score | -0.02 (0.91) |
Wasting | |
Normal | 7,985 (93%) |
Wasted | 578 (6.8%) |
Stunting | |
Normal | 7,080 (83%) |
Stunted | 1,483 (17%) |
Underweight | |
Normal | 7,514 (88%) |
Underweight | 1,050 (12%) |
¹ n (%); Mean (SD) |
Weighted prevalence of nutritional indicators and associated factors
The weighted prevalence rates of wasting, stunting, and underweight were 7%, 17%, and 12%, respectively (Table 2). Children who were wasted were younger than non-wasted children (p < 0.001). Most of the children who were wasted were males (57%) rather than females (43%) (p = 0.029). Children who were wasted had a significantly lower current weight (8.1 ± 2.6 kg) than children without wasting (11.8 ± 3.4 kg) (p < 0.001). No differences were observed in the burden of wasting by region of residence, area of residence, whether a child had diarrhoea, mother’s educational level, or the combined wealth score of the household.
A greater proportion of the children who were stunted were males (55%) than females (45%) (p = 0.007). A higher proportion of the children who were stunted were from rural areas (67%) (p < 0.001). Children who were stunted were older than those who were not stunted (p < 0.001). There were significant associations between stunting and the following variables: region of residence, current weight and height, whether the child had fever or diarrhoea in the preceding two weeks, the mother’s educational level, ethnicity, and combined wealth score (p < 0.05).
There were significant associations between underweight and the following variables: age of the child, sex of the child, region of residence, current weight and height/length of the child, whether the child had fever or diarrhoea in the preceding two weeks, health insurance coverage, ethnicity, and combined wealth score (p < 0.05) (Table 3).
Table 3. Weighted prevalence of nutritional outcomes by sociodemographic factors.
 | Wasting | | | Stunting | | | Underweight | | |
---|---|---|---|---|---|---|---|---|---|
Characteristic | Normal | Wasted | p-value² | Normal | Stunted | p-value² | Normal | Underweight | p-value² |
 | N = 7,985¹ | N = 578¹ | | N = 7,080¹ | N = 1,483¹ | | N = 7,514¹ | N = 1,050¹ | |
Age (days) | 948 (516) | 624 (482) | < 0.001 | 910 (533) | 1004 (450) | < 0.001 | 935 (523) | 864 (497) | 0.002 |
Sex | 0.029 | 0.007 | 0.030 | ||||||
Female | 4080 (51%) | 251 (43%) | 3665 (52%) | 665 (45%) | 3861 (51%) | 469 (45%) | |||
Male | 3905 (49%) | 328 (57%) | 3415 (48%) | 818 (55%) | 3652 (49%) | 581 (55%) | |||
Region | 0.3 | < 0.001 | < 0.001 | ||||||
Ashanti | 1924 (24%) | 131 (23%) | 1738 (25%) | 318 (21%) | 1808 (24%) | 247 (24%) | |||
Brong Ahafo | 745 (9.3%) | 57 (9.9%) | 695 (9.8%) | 108 (7.3%) | 735 (9.8%) | 67 (6.4%) | |||
Central | 840 (11%) | 67 (12%) | 743 (10%) | 164 (11%) | 805 (11%) | 102 (9.7%) | |||
Eastern | 850 (11%) | 40 (7%) | 753 (11%) | 137 (9.3%) | 810 (11%) | 81 (7.7%) | |||
Greater Accra | 780 (9.8%) | 46 (8%) | 726 (10%) | 100 (6.7%) | 754 (10%) | 72 (6.9%) | |||
Northern | 910 (11%) | 86 (15%) | 711 (10%) | 284 (19%) | 814 (11%) | 182 (17%) | |||
Upper East | 259 (3.2%) | 20 (3.4%) | 230 (3.2%) | 49 (3.3%) | 237 (3.2%) | 42 (4%) | |||
Upper West | 198 (2.5%) | 12 (2%) | 179 (2.5%) | 30 (2%) | 189 (2.5%) | 21 (2%) | |||
Volta | 631 (7.9%) | 55 (9.5%) | 543 (7.7%) | 144 (9.7%) | 579 (7.7%) | 108 (10%) | |||
Western | 846 (11%) | 65 (11%) | 762 (11%) | 149 (10%) | 782 (10%) | 129 (12%) | |||
Area | 0.7 | < 0.001 | 0.13 | ||||||
Rural | 4542 (57%) | 323 (56%) | 3879 (55%) | 987 (67%) | 4231 (56%) | 634 (60%) | |||
Urban | 3443 (43%) | 255 (44%) | 3202 (45%) | 496 (33%) | 3282 (44%) | 416 (40%) | |||
Length/Height (cm) | 86 (14) | 78 (13) | < 0.001 | 87 (14) | 82 (10) | <0.001 | 87 (14) | 80 (11) | < 0.001 |
Weight (kg) | 11.8 (3.4) | 8.1 (2.6) | < 0.001 | 11.8 (3.6) | 10.6 (2.5) | <0.001 | 11.9 (3.5) | 9.1 (2.4) | < 0.001 |
Child ill with cough | 0.12 | 0.10 | 0.2 | ||||||
No | 5583 (70%) | 382 (66%) | 4969 (70%) | 996 (69%) | 5264 (70%) | 702 (67%) | |||
Yes | 2402 (30%) | 196 (34%) | 2111 (30%) | 487 (31%) | 2250 (30%) | 348 (33%) | |||
Child ill with fever | 0.005 | <0.001 | < 0.001 | ||||||
No | 5977 (75%) | 380 (66%) | 5337 (75%) | 1020 (69%) | 5654 (75%) | 703 (67%) | |||
Yes | 2008 (25%) | 198 (34%) | 1743 (25%) | 463 (31%) | 1860 (25%) | 347 (33%) | |||
Child with diarrhoea | 0.4 | 0.002 | 0.001 | ||||||
No | 6628 (83%) | 466 (81%) | 5931 (84%) | 1164 (78%) | 6281 (84%) | 814 (78%) | |||
Yes | 1357 (17%) | 112 (19%) | 1149 (16%) | 319 (22%) | 1233 (16%) | 236 (22%) | |||
Health insurance | < 0.001 | 0.045 | 0.018 | ||||||
With insurance | 4769 (60%) | 268 (46%) | 4219 (60%) | 818 (55%) | 4478 (60%) | 558 (53%) | |||
Without insurance | 3216 (40%) | 311 (54%) | 2861 (40%) | 666 (45%) | 3035 (40%) | 492 (47%) | |||
Mother’s educational level | 0.092 | < 0.001 | < 0.001 | ||||||
Pre-primary or none | 2127 (27%) | 176 (30%) | 1756 (25%) | 547 (37%) | 1954 (26%) | 349 (33%) | |||
Primary | 1596 (20%) | 134 (23%) | 1431 (20%) | 298 (20%) | 1480 (20%) | 249 (24%) | |||
JSS/JHS/Middle | 2994 (37%) | 182 (31%) | 2696 (38%) | 480 (32%) | 2861 (38%) | 314 (30%) | |||
SSS/SHS/Secondary | 854 (11%) | 69 (12%) | 786 (11%) | 137 (9.3%) | 805 (11%) | 119 (11%) | |||
Higher | 414 (5.2%) | 17 (3%) | 411 (5.8%) | 21 (1.4%) | 413 (5.5%) | 18 (1.8%) | |||
Ethnicity | 0.031 | 0.001 | 0.048 | ||||||
Akan | 3714 (47%) | 245 (42%) | 3305 (47%) | 654 (44%) | 3492 (46%) | 467 (45%) | |||
Ewe | 830 (10%) | 47 (8.1%) | 756 (11%) | 121 (8.1%) | 783 (10%) | 94 (8.9%) | |||
Ga/Damgme | 576 (7.2%) | 48 (8.3%) | 530 (7.5%) | 95 (6.4%) | 566 (7.5%) | 58 (5.5%) | |||
Gruma | 358 (4.5%) | 17 (2.9%) | 290 (4.1%) | 85 (5.7%) | 332 (4.4%) | 42 (4%) | |||
Grusi | 173 (2.2%) | 15 (2.7%) | 162 (2.3%) | 27 (1.8%) | 171 (2.3%) | 17 (1.6%) | |||
Guan | 338 (4.2%) | 44 (7.6%) | 272 (3.8%) | 110 (7.4%) | 306 (4.1%) | 76 (7.2%) | |||
Mande | 35 (0.4%) | 1 (0.2%) | 36 (0.5%) | 1 (<0.1%) | 33 (0.4%) | 3 (0.3%) | |||
Mole Dagbani | 1334 (17%) | 122 (21%) | 1166 (16%) | 289 (19%) | 1234 (16%) | 222 (21%) | |||
Others | 627 (7.8%) | 39 (6.8%) | 564 (8%) | 102 (6.9%) | 596 (7.9%) | 70 (6.7%) | |||
Combined wealth score | -0.02 (0.91) | -0.11 (0.88) | 0.2 | 0.04 (0.93) | -0.31 (0.78) | <0.001 | 0.00 (0.92) | -0.22 (0.79) | < 0.001 |
¹ n (%); Mean (SD)
² chi-squared test with Rao & Scott’s second-order correction; Wilcoxon rank-sum test for complex survey samples
Predictive algorithms for child undernutrition indicators and associated receiver operator characteristics on the test data
Wasting
The under-five wasting prediction accuracies were high for all algorithms on the test data (84–98%). The XGBoost model had the highest accuracy (98%) for predicting wasting, followed by LASSO, the logistic model, and RF (Table 4). The sensitivities of LDA, the logistic model, LASSO, and ridge were higher than those of the other ML models. RF, however, had the highest specificity, followed by XGBoost and SVM. Based on AUC values, XGBoost and the logistic model had the highest performance (Fig 1).
Table 4. Accuracy of Predictive algorithms for child undernutrition indicators on the test data.
 | LDA | Logistic Regression | SVM | RF | LASSO | Ridge | XGBoost |
---|---|---|---|---|---|---|---|
Wasting | |||||||
Accuracy | 84% | 95% | 92% | 94% | 96% | 84% | 98% |
Sensitivity | 100% | 100% | 7% | 12% | 100% | 100% | 90% |
Specificity | 83% | 94% | 99% | 100% | 96% | 83% | 99% |
Stunting | |||||||
Accuracy | 86% | 86% | 98% | 88% | 86% | 86% | 98% |
Sensitivity | 54% | 43% | 93% | 39% | 45% | 27% | 90% |
Specificity | 92% | 96% | 99% | 99% | 95% | 98% | 99% |
Underweight | |||||||
Accuracy | 90% | 92% | 98% | 89% | 92% | 88% | 98% |
Sensitivity | 31% | 51% | 90% | 20% | 55% | 5% | 88% |
Specificity | 99% | 98% | 99% | 100% | 98% | 100% | 100% |
Fig 1. Receiver operator characteristics on the ML models for wasting test data.
Stunting
The accuracy of stunting prediction for all algorithms ranged between 86% and 98% on the test data. XGBoost and SVM had the highest accuracy, with an optimal balance of sensitivity and specificity relative to the other models (Table 4). Based on ROC-AUC values, the best models for predicting stunting in children under five years were XGBoost and SVM, each with an AUC of 100%, indicating the greatest discrimination among the models (Fig 2).
Fig 2. Receiver operator characteristics on the ML models for stunting on test data.
Underweight
Under-five underweight prediction accuracies for all algorithms ranged between 88% and 98% on the test data. The XGBoost and SVM models had the highest accuracy, with an optimal balance of sensitivity and specificity (Table 4). Based on AUC values, XGBoost and SVM were the best predictive models, followed by LASSO and the logistic model (Fig 3).
Fig 3. Receiver operator characteristics on the ML models for underweight on the test data.
The important determinants of childhood undernutrition indicators
As previously described, the XGBoost model was the best model for the wasting, stunting, and underweight nutritional indicators in terms of accuracy and AUC-ROC characteristics. The top 20 variables that contributed most to the model’s accuracy are presented in Figs 4–6. The top five important variables for wasting were the weight of the child, length/height of the child, sex of the child (female), region of residence (Greater Accra), and ethnicity (Gruma). For stunting, the top five variables were the age of the child, length/height of the child, sex of the child (female), region of residence (Volta), and the weight of the child. The top five variables for underweight were the age of the child, weight of the child, sex of the child (female), ethnicity (Grusi), and region of residence (Brong Ahafo) (Figs 4–6).
Fig 4. Top 20 most important variables from the XGBoost model for wasting.
Fig 5. Top 20 most important variables from the XGBoost model for stunting.
Fig 6. Top 20 most important variables from the XGBoost model for underweight.
Discussion
Undernutrition in children can have grave ramifications for their physical and cognitive development. The authors sought to identify the best predictive model and the factors associated with undernutrition in children aged five years and younger using an ML approach. In total, 8564 children were included in the final analysis. The average age of the children was 926 ± 520 days, and most resided in rural areas. Slightly over half of the children were female, with an average current weight of 11.6 ± 3.5 kg. Approximately 37% of the mothers had attained junior high school education, with only 16% having attained at least senior high school education. The greatest proportion of the children were recruited from the Ashanti Region of Ghana. Approximately 30%, 26%, and 17% of the children had been ill with cough, ill with fever, and had diarrhoea, respectively. A majority of the children (59%) had health insurance coverage.
The weighted prevalence rates of stunting, wasting, and underweight were 17%, 6.8%, and 12%, respectively. Most wasted and underweight children were younger on average than normal children, whereas stunted children were, on average, older than normal children [8]. The current weight of the children was significantly lower in all malnourished groups than that in the normal group. The proportion of children with fever was significantly higher in all malnourished groups than for children in their respective normal groups.
Seven ML algorithms were used to predict undernutrition and to identify factors associated with undernutrition: RF, logistic model, LDA, SVM, ridge regression, XGBoost and LASSO.
The accuracy of all seven ML algorithms with respect to wasting was between 84% and 98%, with specificity ranging between 83% and 99% and sensitivity between 7% and 100%. The XGBoost model was the most accurate in predicting wasting and exhibited an optimal balance of sensitivity and specificity. In addition, the accuracy with respect to stunting was between 86% and 98%, with the highest accuracy obtained using the SVM and XGBoost models. The specificity ranged between 92% and 99%, whereas the sensitivity ranged from 27% to 93%, with the SVM and XGBoost models being the most sensitive. Both the SVM and XGBoost models also recorded the highest accuracy and showed an optimal balance of sensitivity and specificity for underweight prediction. Previous studies have reported accuracies of ML algorithms for predicting undernutrition ranging from 35.6% to 99.95% [1, 6, 12–14].
The six factors shown to be important for all indicators of undernutrition were age, weight, length/height, sex, region of residence, and ethnicity, similar to a previous study that employed a logistic regression model to identify these factors [9]. The top five factors associated with wasting, stunting, and underweight identified using the XGBoost model were the weight of the child, age of the child, sex of the child, region of residence, and ethnicity. The similarity of the important features was not surprising, given that they were all generated using the XGBoost model.
The XGBoost technique has been found to be superior to other machine learning models in terms of accuracy [24–26]. A study in Ethiopia showed that the XGBoost algorithm was the most accurate [1]. Another study from Bangladesh reported the highest accuracy when using an artificial neural network [10]. We may attribute the observed superiority of XGBoost to its ability to leverage the outputs of weak sequential decision trees, where each new tree builds on the weaknesses of the previous trees to make accurate predictions, and to its ability to effectively handle complex, high-dimensional data for classification [27]. The SVM models followed the XGBoost models with regard to discrimination under the ROC curves for all three undernutrition indicators. The strength of the SVM models may be attributable to the radial basis function (RBF) kernel, which enables the SVM to capture relationships between features without explicitly mapping the data into a high-dimensional space. A major challenge in comparing the various studies is the difference in how the ML algorithms were evaluated. Regardless, the XGBoost model has been shown across studies to accurately predict undernutrition among children under five years of age.
Important features associated with wasting in this study were the weight of the child, length/height of the child, sex of the child, region of residence and ethnicity. A similar study in Ethiopia using the XGBoost algorithm identified children aged > 30 months, wealth index (poorest), time to water intake, ethnicity (Somalia), and small child size as the five main features associated with wasting [1]. Another study in India identified mother’s BMI, toilet facility, state, and child’s age and religion as important features of wasting [14]. The age of the child is an important factor to consider with respect to wasting among children.
The important features associated with stunting in this study were the age of the child, length/height of the child, sex of the child, region of residence, and weight of the child. A previous study from Ethiopia reported time to water, child age greater than 30 months, the number of children under five in a household, household possession of a television, and small child size as important features [1]. Another study from India reported that children’s age, toilet facilities, wealth index, mother’s education, and breastfeeding duration were important features associated with stunting [14]. Common features identified across studies were the weight of the child, age of the child, sex of the child, region of residence, and ethnicity.
The important features associated with being underweight in this study were similar to those for stunting and wasting. The important features reported in a previous study were time to water, lack of maternal education, small child size, child age greater than 30 months, and maternal underweight [1]. Common features identified included the age and birth weight of the child. The different features identified across malnutrition indicators imply that the factors associated with undernutrition vary between countries; as such, solutions should be driven by data to ensure the appropriate use of resources.
Limitations
Factors identified in this study to be associated with undernutrition do not imply causality. Future studies should explore additional factors to help predict malnutrition among children under five years in Ghana. Also, future studies should consider testing the feasibility of machine learning algorithms as potential screening tools for children under five years in Ghana.
Conclusion
This study highlighted the usefulness of the ML approach in predicting and identifying factors associated with undernutrition in children under five years in Ghana. The weight of the child, age of the child, sex of the child, region of residence, and ethnicity were important features associated with undernutrition. Policies targeting a decrease in undernutrition in children should consider these factors. Other factors specific to each nutritional indicator have also been reported and can help direct public health action. The XGBoost models, followed by the SVM models, were the best for predicting wasting, stunting, and underweight among children under five years in Ghana. The findings from this study also indicate that different ML models may be useful for predicting undernutrition.
Data Availability
This study was based on a publicly available dataset with no personal identifiers, and is freely available upon request from the Ghana Statistical Service website (https://www.statsghana.gov.gh/gssdatadownloadspage.php) and the MICS website (https://mics.unicef.org/surveys) after an online registration. The code used for the analysis can be found in this repository (https://github.com/KomlaRD/machine_learning_undernutrition).
Funding Statement
The author(s) received no specific funding for this work.
References
- 1. Bitew FH, Sparks CS, Nyarko SH. Machine learning algorithms for predicting undernutrition among under-five children in Ethiopia. Public Health Nutr. 2022;25(2):269–80. doi: 10.1017/S1368980021004262
- 2. United Nations Children’s Fund (UNICEF). Malnutrition [Internet]. 2022 [cited 2022 Dec 24]. Available from: https://data.unicef.org/topic/nutrition/malnutrition/
- 3. WHO. Malnutrition [Internet]. 2021 [cited 2023 Jan 31]. Available from: https://www.who.int/news-room/fact-sheets/detail/malnutrition
- 4. Mkhize M, Sibanda M. A Review of Selected Studies on the Factors Associated with the Nutrition Status of Children Under the Age of Five Years in South Africa. Int J Environ Res Public Health. 2020 Oct 30;17(21):7973. doi: 10.3390/ijerph17217973
- 5. GSS, GHS, ICF International. Ghana Demographic and Health Survey 2014 [Internet]. 2015. Available from: https://dhsprogram.com/pubs/pdf/FR307/FR307.pdf
- 6. Fenta HM, Zewotir T, Muluneh EK. A machine learning classifier approach for identifying the determinants of under-five child undernutrition in Ethiopian administrative zones. BMC Med Inform Decis Mak. 2021;21(1):1–12. doi: 10.1186/s12911-021-01652-1
- 7. Ghana Statistical Service, Ghana Health Service, Ghana AIDS Commission. Ghana Demographic and Health Survey 2008 [Internet]. 2008. 1–512 p. Available from: http://www.dhsprogram.com/pubs/pdf/FR221/FR221[13Aug2012].pdf
- 8. Ghana Statistical Service. Snapshots on key findings: Ghana Multiple Indicator Cluster Survey 2017/18. 2018;1–74. Available from: https://www2.statsghana.gov.gh/docfiles/publications/MICS/Ghana%20MICS%202017-18.%20Summary%20report%20-%20consolidated%20Snapshots%2023.11.%202018%20(1).pdf
- 9. Boah M, Azupogo F, Amporfro DA, Abada LA. The epidemiology of undernutrition and its determinants in children under five years in Ghana. PLoS One. 2019;14(7):1–23. doi: 10.1371/journal.pone.0219665
- 10. Shahriar M, Iqubal MS, Mitra S, Das AK. A deep learning approach to predict malnutrition status of 0–59 month’s older children in Bangladesh. Proc 2019 IEEE Int Conf Ind 4.0, Artif Intell Commun Technol (IAICT). 2019;145–9. doi: 10.1109/ICIAICT.2019.8784823
- 11. Kirk D, Kok E, Tufano M, Tekinerdogan B, Feskens EJM, Camps G. Machine Learning in Nutrition Research. Adv Nutr. 2022 Nov 1;13(6):2573–89. doi: 10.1093/advances/nmac103
- 12. Rahman SMJ, Ahmed NAMF, Abedin MM, Ahammed B, Ali M, Rahman MJ, et al. Investigate the risk factors of stunting, wasting, and underweight among under-five Bangladeshi children and its prediction based on machine learning approach. PLoS One. 2021;16(6):1–11. doi: 10.1371/journal.pone.0253172
- 13. Talukder A, Ahammed B. Machine learning algorithms for predicting malnutrition among under-five children in Bangladesh. Nutrition. 2020;78. doi: 10.1016/j.nut.2020.110861
- 14. Jain S, Khanam T, Abedi AJ, Khan AA. Efficient machine learning for malnutrition prediction among under-five children in India. In: 2022 IEEE Delhi Section Conference (DELCON). 2022. doi: 10.1109/delcon54057.2022.9753080
- 15. Khare S, Kavyashree S, Gupta D, Jyotishi A. Investigation of Nutritional Status of Children based on Machine Learning Techniques using Indian Demographic and Health Survey Data. Procedia Comput Sci. 2017;115:338–49. doi: 10.1016/j.procs.2017.09.087
- 16. Wickham H, Averick M, Bryan J, Chang W, McGowan L, François R, et al. Welcome to the Tidyverse. J Open Source Softw. 2019;4(43):1686. doi: 10.21105/joss.01686
- 17. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12(9):2825–2830.
- 18. Lumley T. Analysis of Complex Survey Samples. J Stat Softw. 2004;9(1):1–19. doi: 10.18637/jss.v009.i08
- 19. Xanthopoulos P, Pardalos PM, Trafalis TB. Linear discriminant analysis. In: Robust Data Mining. Springer; 2013. p. 27–33. doi: 10.1007/978-1-4419-9878-1_4
- 20. Bisong E. Logistic regression. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners. Springer; 2019. p. 243–50.
- 21. Jung A. Machine Learning: The Basics. Springer; 2022. 289 p.
- 22. Parmar A, Katariya R, Patel V. A Review on Random Forest: An Ensemble Classifier. Lect Notes Data Eng Commun Technol. 2019;26:758–63. doi: 10.1007/978-3-030-03146-6_86
- 23. Schmidt M. Least Squares Optimization with L1-Norm Regularization. CS542B Project Report. 2005;195–221. Available from: http://people.cs.ubc.ca/~schmidtm/Software/lasso.pdf
- 24. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94. doi: 10.1145/2939672.2939785
- 25. Wang M, Li X, Lei M, Duan L, Chen H. Human health risk identification of petrochemical sites based on extreme gradient boosting. Ecotoxicol Environ Saf. 2022;233:113332. doi: 10.1016/j.ecoenv.2022.113332
- 26. Ramón A, Torres AM, Milara J, Cascón J, Blasco P, Mateo J. eXtreme Gradient Boosting-based method to classify patients with COVID-19. J Investig Med. 2022;70(7):1472–80. doi: 10.1136/jim-2021-002278
- 27. Antipov EA, Pokryshevskaya EB. Interpretable machine learning for demand modeling with high-dimensional data using Gradient Boosting Machines and Shapley values. J Revenue Pricing Manag. 2020;19(5):355–64. doi: 10.1057/s41272-020-00236-4