Abstract
Objectives
Diabetes has become a leading cause of mortality in both developed and developing countries, impacting a growing number of individuals worldwide. As the prevalence of the disease continues to rise, researchers have diligently worked towards developing accurate diabetes prediction models. The primary aim of this study is to utilize a diverse set of machine learning algorithms to detect the presence of diabetes, particularly in females, at an early stage. By leveraging these methods, this research seeks to provide physicians with valuable tools to identify the disease early, enabling timely interventions and improving patient outcomes.
Methods
In this study, several state-of-the-art machine learning techniques were employed, including a random forest classifier tuned with GridSearchCV and the XGBoost, NGBoost, Bagging, LightGBM, and AdaBoost classifiers. These models were chosen as the base layer of our proposed stacked ensemble model because of their high accuracy. Before the data were fed into the models, the dataset was preprocessed to improve performance.
Results
The proposed model achieved an accuracy of 92.91%, which is competitive with existing approaches. Moreover, Shapley additive explanations (SHAP) were used to interpret the machine learning models.
Conclusion
We anticipate that these findings will be beneficial to healthcare providers, stakeholders, students, and researchers involved in diabetes prediction research and development.
Keywords: Diabetes prediction, Machine learning, Stacked Ensemble, PIMA
Introduction
Diabetes is a medical condition characterized by abnormally high levels of blood glucose, commonly known as blood sugar. Patients with diabetes have difficulty converting carbohydrates into glucose, which is essential for providing energy for daily activities. Consequently, there is a gradual increase in blood sugar levels, as the glucose remains in the bloodstream instead of reaching the body’s cells [1]. Diabetes has become a leading cause of death in both developing and developed nations, prompting significant investments in research to find a cure for this critical disease. According to the National Diabetes Statistics Report 2022, it affects approximately 37.3 million people, accounting for 11.3% of the USA’s population, with 8.5 million individuals (23%) unaware of their condition [2].
Various factors contribute to the risk of developing diabetes, including being overweight or obese, insulin resistance, hormonal disorders, genetic predisposition, and family history [3]. Additionally, certain behavioral risk factors, such as diet quality, sleep duration, and physical activity, increase the likelihood of developing diabetes. However, these factors are not accurate predictors of prediabetes and insulin resistance. An indicative factor for these conditions is a larger waist circumference [4, 5]. Medical practitioners carefully consider a patient’s medical history, physical examination results, and any concerning symptoms before resorting to expensive procedures to diagnose diabetes [6]. However, the reliance on human judgment in these diagnostic techniques can lead to misdiagnoses and delays in obtaining accurate results, making prediction challenging due to the involvement of multiple parameters.
To predict and detect diseases, various predictive, quantitative, and statistical models, including machine learning algorithms, have been employed [7]. The advancements in machine learning have significantly improved the ability to recognize and categorize images, predict diseases, and enhance decision-making through data analysis. Machine learning applications aim to train computer systems to surpass human performance. These models are trained using supervised learning algorithms and evaluated using test data [8]. To comprehend the diabetic illness detection process and address the aforementioned issues, multiple perspectives, models, and papers from various researchers are taken into consideration [9].
In this study, we utilized the Pima Diabetes datasets [10] and various machine learning techniques, such as voting classifier, random forest classifier with gridsearchCV, XGBoost classifier, NGBoost classifier, bagging classifier, LightGBM classifier, and AdaBoost classifier, to provide valuable insights. Prior to training the models, feature engineering techniques were applied to enhance performance. Additionally, the Shapley additive explanation (SHAP) method was employed to interpret the machine learning models and identify hidden patterns in healthcare data, which can aid in making accurate diagnoses to provide better healthcare services.
The contributions of the research are as follows:
Introducing a highly effective feature engineering technique to identify the most essential and pertinent features that accurately represent the target variable’s pattern.
Conducting a comprehensive evaluation of various machine learning classifiers, including random forest with GridSearchCV, XGBoost, NGBoost, Bagging, LightGBM, and AdaBoost, aiding in the selection of the most suitable classifiers for the proposed ensemble model.
Presenting a stacked ensemble learning framework for diabetes prediction, with a comparison against baseline and existing diabetes prediction models.
Discussing the model using Shapley additive explanation (SHAP) values to provide insights into feature importance and model interpretability.
The study is organized as follows: “Related works” provides a concise overview of previous research efforts. In “Materials and methods”, the techniques of the proposed model are discussed. The experiment’s results and model interpretation are presented in “Results”. “Discussion” examines the overall performance of the proposed model. Lastly, “Conclusions and future work” presents the conclusion and outlines potential future work.
Related works
This section of this research paper explores the existing literature and studies relevant to the utilization of machine learning methods for forecasting diabetes. It delves into prior efforts and approaches proposed by researchers to address diabetes prediction, highlighting their strengths and limitations. By examining the advancements and gaps in the field, this section sets the foundation for introducing the novel stacked ensemble machine learning approach proposed in this paper. Through a comprehensive review of related works, the reader gains insights into the current state of research in diabetes prediction and understands the significance of the proposed approach in advancing the field.
The authors of [11] present a comprehensive framework for remote health monitoring that automates the prediction and management of diabetes risk. This approach involves the implementation of a support vector machine on the Pima Indian Diabetes Database, preceded by processes like feature scaling, imputation, selection, and augmentation. The resulting predictive model achieved accuracy, sensitivity, and specificity levels of 83.20%, 87.20%, and 79%, respectively, under 10-fold stratified cross-validation, which is comparable with established methodologies.
Mujumdar, Aishwarya, and V. Vaidehi proposed a machine learning technique [12] for diabetic data analysis, utilizing the pipeline concept to predict diabetes classification. The model incorporates external diabetes factors and demonstrates good accuracy with the AdaBoost classifier.
Swapna G., Vinayakumar R., and Soman K.P. presented a deep learning architecture-based methodology [13] for classifying diabetic and normal HRV signals. Employing LSTM, CNN, and their pairings, they capture intricate temporal dynamic attributes from the input HRV data. The model achieves an impressive accuracy of 95.7%.
Another study compares the effectiveness of multiple machine learning classifiers [14], namely LR, DT, SVM, XGBoost, RF, and AdaBoost, achieving an 83% accuracy with the AdaBoost technique. A further machine learning-based diabetes diagnosis system [15], trained using shallow ML techniques including Decision Tree, SVM, ANN, Logistic Regression (LR), and Naive Bayes, exhibits promising results. The accuracy of the LR technique reaches 77%.
The authors propose a novel framework for diabetes mellitus prediction [16], incorporating machine learning with hyper-parameter tuning techniques to increase accuracy. The model attains 83% accuracy using the widely recognized PIMA dataset.
A model utilizing various classifiers, including linear regression, linear SVM, polynomial kernel SVM, Random Forest, and Voting classifiers, predicts the potential early onset of diabetes, particularly in females, with an accuracy of 82% [17].
A data fusion process is used to develop a unified dataset from multiple streams, optimized for machine learning algorithms [18]. SVM and ANN are combined to identify and forecast key events in diabetic patients, achieving a remarkable accuracy rate of 94.67% using the SVM-ANN model.
Several researchers proposed different models in their studies [19–26], often utilizing shallow machine classifier methods. While these models exhibit sound accuracy in diabetes disease identification, there remains a need for balanced and improved diagnosis. Therefore, this research aims to explore advanced machine learning methods to find the most effective classifier for diabetes prediction.
Materials and methods
The purpose of this study is to provide a computerized diabetes prediction that will be useful for both doctors and patients in the healthcare field. This section gives a brief overview of the research methodologies and materials used to achieve this goal. Figure 1 shows the flow diagram of the proposed work. The entire implementation was done in Python using Jupyter Notebook. Various packages, including NumPy, Pandas, scikit-learn, and Matplotlib, were employed to analyze the data.
Fig. 1.
Stages of the research
Dataset
The “Pima Indian Diabetes” dataset [27] is employed for training and testing various machine learning techniques. It comprises 768 records with 8 predictor features, namely pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age, plus the outcome. The outcome is a binary response attribute, taking the value ’1’ for diabetic patients and ’0’ for non-diabetic patients. Table 1 describes the medical features of this dataset.
Table 1.
Description of PIMA Indian Dataset [27]
| Attributes | Range | Description |
|---|---|---|
| Pregnancies | 0-17 | Number of pregnancies |
| Glucose | 0-199 | Plasma glucose concentration at two hours in an oral glucose tolerance test |
| Blood Pressure | 0-122 | Diastolic blood pressure (mm Hg) |
| BMI | 0-67.1 | Body mass index = weight (kg) / [height (m)]² |
| Skin Thickness | 0-99 | Triceps skin fold thickness (mm) |
| Diabetes Pedigree Function | 0.078-2.42 | A technique for calculating diabetes risk based on family history |
| Age | 21-81 | Age in years |
| Insulin | 0-846 | 2-hour serum insulin (mu U/ml) |
| Outcome | 0-1 | Diagnosis classification: healthy = 0, diabetes affected = 1 |
Data cleaning and preprocessing
Data from the real world frequently includes noise, gaps in values, and might not adhere to an ideal format, rendering it unsuitable for direct application in training machine learning models. Data preprocessing is essential to improve the accuracy and effectiveness of machine learning models by cleaning and organizing the data appropriately. In this study, we will preprocess the dataset to prepare it for various machine learning techniques. By employing the Exploratory Data Analysis (EDA) technique [28], we can effectively handle missing data, eliminate outliers, and select relevant features, leading to a significantly improved model accuracy.
Data Cleaning
After a thorough examination of the dataset, we confirmed that it contains no explicitly missing (null) entries. The dataset is therefore ready for Exploratory Data Analysis (EDA).
Exploratory data analysis
The dataset contains a total of 768 records with 8 features and one target variable. Figure 2 shows the summary of our data. The dataset is moderately imbalanced, with 268 diabetic patients and 500 non-diabetic patients.
Fig. 2.
Dataset Summary
Correlation values were calculated to assess the degree of influence a feature has on the target attribute (outcome) and to explore its impact on other attributes, aiming for a comprehensive understanding of the data. The Pearson (product-moment) correlation coefficient (Eq. 1) was employed to compute the correlation values.

$$ r_{fk} = \frac{\sum_{i=1}^{n} (f_i - \bar{f})(k_i - \bar{k})}{\sqrt{\sum_{i=1}^{n} (f_i - \bar{f})^2}\,\sqrt{\sum_{i=1}^{n} (k_i - \bar{k})^2}} \tag{1} $$

where $r_{fk}$ is the correlation coefficient, $f_i$ are the values of a feature $f$ in a sample, $\bar{f}$ is the mean of the values of $f$, $k_i$ are the values of a feature $k$ in a sample, and $\bar{k}$ is the mean of the values of $k$. The coefficient measures the linear relationship between the two features by computing the ratio of the covariance of the two features to the product of their standard deviations. Figure 3 displays a heat map of the estimated correlation values between features.
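As a minimal sketch of the Pearson coefficient described above, the following pure-Python function computes the ratio of the summed deviation products to the product of the root sums of squared deviations. The function name `pearson` and the toy values are illustrative assumptions, not the authors' code or data.

```python
import math

def pearson(f, k):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(f)
    mean_f = sum(f) / n
    mean_k = sum(k) / n
    # Numerator: sum of products of deviations (proportional to the covariance)
    cov = sum((fi - mean_f) * (ki - mean_k) for fi, ki in zip(f, k))
    # Denominator: root of the product of the summed squared deviations
    ss_f = sum((fi - mean_f) ** 2 for fi in f)
    ss_k = sum((ki - mean_k) ** 2 for ki in k)
    return cov / math.sqrt(ss_f * ss_k)

# Hypothetical toy values, not taken from the dataset
glucose = [85.0, 120.0, 150.0, 99.0, 180.0]
outcome = [0, 0, 1, 0, 1]
r = pearson(glucose, outcome)
```

Applying such a function to every pair of columns yields the matrix that a heat map visualizes.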
Fig. 3.

Heat map of the estimated correlation values
Glucose shows the strongest correlation with the outcome variable, followed by the diabetes pedigree function.
Data engineering
The Pima Indian Dataset contains implausible zero values in certain instances, which are treated as missing values; for example, a blood pressure of zero is illogical. Because the dataset consists of only 768 instances in total, these missing values were filled using the mean so that no valuable data were lost. Additionally, the dataset required normalization due to varying scales among its features. Normalization ensures that each feature contributes almost equally to the final decision, preventing any single feature from dominating the target. To achieve this, a standard scaler (Eq. 2) was employed for normalizing the dataset. A standardized value can be defined as:

$$ z = \frac{v - \mu}{\sigma} \tag{2} $$

where $v$ is the feature value, $\mu$ is the mean value of $v$, and $\sigma$ is the standard deviation.
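The two steps described above, mean imputation and standardization, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: zeros are treated as missing, the mean is taken over the non-zero entries (the text does not say which mean was used), and the population standard deviation is used, as scikit-learn's `StandardScaler` does. The function names and sample values are hypothetical.

```python
import math

def impute_zeros_with_mean(values):
    """Replace zeros (treated as missing) by the mean of the non-zero entries."""
    observed = [v for v in values if v != 0]
    mean = sum(observed) / len(observed)
    return [mean if v == 0 else v for v in values]

def standardize(values):
    """Return z-scores: (v - mean) / population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Hypothetical blood-pressure column; 0 marks a missing measurement
blood_pressure = [72.0, 0.0, 64.0, 80.0, 0.0, 68.0]
filled = impute_zeros_with_mean(blood_pressure)
scaled = standardize(filled)
```

After this transformation the column has zero mean and unit variance, so features measured on different scales become comparable.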
Feature engineering
The processed data then undergo feature engineering to generate the features that the ML models require. Derived features are developed for the models during training and prediction by applying the one-hot encoding method to the columns of the provided dataset. These features are prepared from categorical ranges and arithmetic operations, as shown in Algorithm 1. We then applied label encoders and the appropriate transformations to each column. Analyzing the potential information in the raw data in this way allows a new and more beneficial collection of features to be extracted.
Algorithm 1.

Feature Engineering (Dataset).
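The general idea of deriving features from categorical ranges and arithmetic operations can be illustrated with a short sketch. It does not reproduce Algorithm 1: the BMI ranges below follow the common WHO categories, and the product feature is purely hypothetical, chosen only to show the mechanics of binning, one-hot encoding, and arithmetic combination.

```python
# WHO-style BMI categories (an assumption; Algorithm 1 may use other ranges)
BMI_BINS = [
    ("underweight", 0.0, 18.5),
    ("normal", 18.5, 25.0),
    ("overweight", 25.0, 30.0),
    ("obese", 30.0, float("inf")),
]

def bmi_category(bmi):
    """Map a BMI value to the label of the half-open range [low, high) containing it."""
    for label, low, high in BMI_BINS:
        if low <= bmi < high:
            return label
    raise ValueError(f"BMI out of range: {bmi}")

def one_hot(label, labels):
    """Encode a label as a list of 0/1 indicators, one per known label."""
    return [1 if label == candidate else 0 for candidate in labels]

labels = [name for name, _, _ in BMI_BINS]
# Hypothetical records with only the columns used here
records = [{"BMI": 33.6, "Glucose": 148}, {"BMI": 23.3, "Glucose": 89}]
for rec in records:
    rec["bmi_onehot"] = one_hot(bmi_category(rec["BMI"]), labels)
    # Example of an arithmetic derived feature (hypothetical)
    rec["bmi_x_glucose"] = rec["BMI"] * rec["Glucose"]
```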
Model building
After data preprocessing is complete, ML classifiers from the scikit-learn Python toolkit are utilized. Initially, a train-test split function is employed to divide the dataset into training and testing sets. Subsequently, different ML classifiers are employed for diabetes diagnosis. Baseline models are created, and a 10-fold cross-validation is performed to identify the best-performing baseline algorithms in terms of accuracy, which will be used as the base layer for the stacked ensemble method. The 10-fold cross-validation accuracy of each model can be defined using Eq. 3.

$$ A_{cv} = \frac{1}{k} \sum_{i=1}^{k} A_i \tag{3} $$

where $k$ is the number of folds, $A_i$ is the accuracy obtained on the $i$-th fold, and $A_{cv}$ is the accuracy of a model using the k-fold cross-validation approach. The six highest-performing methods in terms of accuracy from the previously mentioned approaches are selected to form the base layer of the stacked ensemble technique. The model selection process can be defined as:

$$ M_s = D\left(A_{cv}^{(1)}, A_{cv}^{(2)}, \ldots, A_{cv}^{(n)}\right)_{1:6} \tag{4} $$

where $M_s$ denotes the selected models, $D$ is the descending-order function of the cross-validation accuracies of the $n$ different models, and the subscript $1{:}6$ keeps its first six entries. Let us briefly introduce our selected models for building our proposed model.
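The selection procedure described above amounts to averaging the fold accuracies of each model and keeping the six best. The sketch below illustrates this with hypothetical fold accuracies and, as stand-ins for cross-validation scores, the accuracies reported in Table 3; the helper names `cv_accuracy` and `select_top` are assumptions made for this example.

```python
def cv_accuracy(fold_accuracies):
    """Mean accuracy over the k folds of a cross-validation run."""
    return sum(fold_accuracies) / len(fold_accuracies)

def select_top(cv_scores, top_n=6):
    """Sort models by cross-validation accuracy (descending) and keep the first top_n."""
    ranked = sorted(cv_scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]

# Hypothetical fold accuracies for one model (k = 10)
folds = [0.90, 0.92, 0.88, 0.91, 0.93, 0.89, 0.90, 0.92, 0.91, 0.94]
mean_acc = cv_accuracy(folds)

# Accuracies (%) reported in Table 3, used here as stand-ins for CV scores
scores = {
    "Voting": 82, "RF with GridSearchCV": 88, "XGBoost": 91, "NGBoost": 89,
    "AdaBoost": 87, "Bagging": 91.8, "GaussianNB": 73, "LGBM": 91,
}
base_layer = select_top(scores)
```

With these inputs the voting and GaussianNB classifiers fall outside the top six, which matches the base models named in the text.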
RF with gridsearchCV
Random forest combines numerous separate decision trees that function as a single model. GridSearchCV [29] is a cross-validation tool that searches over a defined range of values for selected parameters to improve model performance. It generates every combination of the candidate values in a “grid” and evaluates each combination by cross-validation.
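A minimal sketch of such a grid search with scikit-learn follows. The data are synthetic and the parameter grid is hypothetical, since the values searched in this study are not listed; only the `GridSearchCV` and `RandomForestClassifier` calls themselves are standard scikit-learn API.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 8 features (not the PIMA dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid; the values searched in the study are not given
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every grid point
    scoring="accuracy",
)
search.fit(X, y)
best_params = search.best_params_   # combination with the best mean CV accuracy
best_score = search.best_score_
```

The grid has 2 x 2 = 4 combinations, each scored on 5 folds, after which the best combination is refitted on all of the data.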
LGBM classifier
The LGBM (Light Gradient Boosting Machine) algorithm combines two key ideas, GOSS (Gradient-based One-Side Sampling) and GBDT (Gradient Boosting Decision Tree), and is commonly used in prediction research [30]. GBDT employs boosting, an ensemble process that creates a robust classifier by combining weak classifiers: it builds a first model from the data and then creates a second model to address the errors of the first. Efficiency, accuracy, and interpretability are crucial components of this technique. GOSS samples the dataset so that data points with larger gradients carry more weight when computing the gain. It retains the data points with large gradients while randomly dropping some of those with small gradients, which maintains accuracy while using less data.
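The sampling step of GOSS can be sketched as follows, assuming the usual formulation: keep the fraction `top_rate` of instances with the largest absolute gradients, draw a fraction `other_rate` at random from the rest, and up-weight the randomly drawn instances by `(1 - top_rate) / other_rate`. The function name, rates, and gradient values are illustrative and do not reflect LightGBM's internal implementation.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by Gradient-based One-Side Sampling."""
    n = len(gradients)
    top_n = int(top_rate * n)
    rand_n = int(other_rate * n)
    # Rank instances by absolute gradient, largest first
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx = order[:top_n]                        # always kept
    rng = random.Random(seed)
    rand_idx = rng.sample(order[top_n:], rand_n)   # random subset of the rest
    # Up-weight the sampled small-gradient instances to compensate for
    # the ones that were dropped
    amplify = (1.0 - top_rate) / other_rate
    weights = [1.0] * top_n + [amplify] * rand_n
    return top_idx + rand_idx, weights

# Hypothetical per-instance gradients
grads = [0.9, -0.1, 0.05, -0.8, 0.2, 0.01, -0.3, 0.6, -0.02, 0.4]
idx, w = goss_sample(grads)
```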
XGBoost classifier
As an ensemble learning algorithm, XGBoost makes predictions by combining the outputs of various models, sometimes referred to as base learners [31]. It uses Decision Trees as its basic learners, just like Random Forests. It significantly outperforms the conventional Gradient Boosting Decision Tree (GBDT) technique in terms of computation speed, generalization effectiveness, and scalability. It can perform the three primary gradient boosting techniques: Gradient Boosting, Regularized Boosting, and Stochastic Boosting. Unlike other libraries, XGBoost also provides the capability to incorporate and finely adjust regularization parameters [32].
NGBoost classifier
NGBoost (Natural Gradient Boosting) is an adaptable boosting approach for probabilistic predictions [33]. This algorithm consists of three abstract modular components: the scoring rule, the parametric probability distribution, and the base learner. The base (weak) learners, here decision trees, take the inputs and produce outputs that are combined to form the parameters of a conditional probability distribution.
AdaBoost classifier
An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the initial dataset [34]. Subsequently, additional copies of the classifier are fitted on the same dataset, but with the weights of misclassified instances increased. This iterative process assigns more weight to examples that were incorrectly classified in the previous round, focusing subsequent classifiers on challenging cases.
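The reweighting idea can be made concrete with one round of the classic binary AdaBoost update, sketched below. This is the textbook formulation with labels in {-1, +1}, not necessarily the exact variant used by scikit-learn's `AdaBoostClassifier`; the function name and the toy labels are assumptions.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One round of classic binary AdaBoost; labels are -1 or +1."""
    # Weighted error of the current weak classifier (must lie strictly in (0, 1))
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Vote strength of this classifier in the final ensemble
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Increase the weights of misclassified examples, decrease the others
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]    # the second example is misclassified
w0 = [0.2] * 5                 # uniform starting weights
alpha, w1 = adaboost_round(w0, y_true, y_pred)
```

After the update the misclassified example carries half of the total weight, so the next weak classifier is pushed to get it right.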
Bagging classifier
A bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and subsequently combines or averages their predictions to generate a final prediction [35]. Bagging comprises two components: bootstrapping and aggregation. In bootstrapping, samples are drawn from the dataset with replacement, and these samples are used for the learning process. The final model aggregates the results from each individual model into a single prediction, selecting the category with the highest frequency for classification.
Let us assume we have a training set $x$ and $B$ samples $x_1, \ldots, x_B$ created using bootstrapping. We now train a separate classifier $C_b$ for each bootstrap sample $x_b$. The outputs from each of these distinct classifiers are combined in the final classifier, which in the classification context corresponds to voting on an input $z$:

$$ \hat{C}(z) = \arg\max_{y} \sum_{b=1}^{B} \mathbb{1}\left[C_b(z) = y\right] \tag{5} $$
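Bootstrapping followed by majority voting can be sketched in plain Python. The base learner here is a deliberately simple one-feature threshold rule, chosen only to keep the example short; it and the toy data are assumptions and not the decision trees a bagging classifier would normally use.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_threshold(sample):
    """Toy base learner: threshold at the midpoint between the two class means."""
    pos = [x for x, label in sample if label == 1]
    neg = [x for x, label in sample if label == 0]
    if not pos or not neg:               # degenerate bootstrap sample
        only = 1 if pos else 0
        return lambda x: only
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0

def bagging_predict(models, x):
    """Majority vote over the predictions of all base models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical one-feature data: (glucose, outcome)
data = [(85, 0), (90, 0), (99, 0), (110, 0), (140, 1), (150, 1), (165, 1), (180, 1)]
rng = random.Random(0)
models = [fit_threshold(bootstrap_sample(data, rng)) for _ in range(25)]
```

Calling `bagging_predict(models, value)` returns the class chosen by most of the 25 base learners.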
Proposed classifier
Our proposed model is a stacked ensemble, intended to improve performance by combining the predictions of multiple models. By utilizing stacking, we can train various models to address similar challenges and subsequently combine their outputs to generate a new, more potent model. In contrast to bagging and boosting techniques, stacking focuses on assembling diverse and robust sets of learners to attain superior outcomes.
We selected the top 6 models for the base level (level 0) of the stacked ensemble based on their cross-validation accuracy, aiming to improve the ensemble’s performance beyond that of the individual machine learning models. In this study, we utilize the Bagging Classifier as our super learner, or generalizer. Figure 4 illustrates the flow of the 10-fold stacking model. The training set is divided into 10 parts; in each fold, 9 parts are used as training data for the different classifiers and the remaining part serves as test data, so that each part is used as test data in turn. As base classifiers, we chose the following: LGBM classifier, Random Forest classifier with GridSearchCV, XGBoost, NGBoost, Bagging, and AdaBoost classifiers. Each base classifier is trained and then evaluated using the test data from the chosen part. After 10 folds of training, each classifier provides 10 sets of test results, and their union covers the entire training set. All the test results from the base classifiers are then fed into the Bagging Classifier, which combines them into the final ensemble prediction.

A stacking ensemble scheme can be defined formally as follows. Given a pair $(x, y)$ from the k-fold split, with $x$ representing the $r$ recorded values and $y$ the $p$ values to predict, consider a set of $N$ possible learning algorithms $L_j$, $j = 1, \ldots, N$. Let $M_j$ be the model created by the learning algorithm $L_j$ on $x$ to forecast $y$, and let $G$ be the generalizer function in charge of merging the models to predict such a value. $G$ can be a model created by a learning algorithm or a general function like the average. Then, the predicted value $\hat{y}$ can be defined by Eq. 6:

$$ \hat{y} = G\left(M_1(x), M_2(x), \ldots, M_N(x)\right) \tag{6} $$
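A stacked ensemble of this shape can be sketched with scikit-learn's `StackingClassifier`. The sketch makes several assumptions that differ from the study: the data are synthetic, `GradientBoostingClassifier` stands in for XGBoost, NGBoost, and LightGBM (which live in separate packages), and all hyper-parameters are illustrative. What it does mirror is the structure: level-0 learners, 10-fold out-of-fold predictions, and a bagging classifier as the generalizer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 8 features (not the PIMA dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Level-0 learners; GradientBoostingClassifier is a stand-in for the
# XGBoost, NGBoost, and LightGBM models named in the text
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    ("bag", BaggingClassifier(n_estimators=20, random_state=0)),
]

# Level-1 generalizer: a bagging classifier trained on the out-of-fold
# predictions that the 10-fold split produces for each base learner
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=BaggingClassifier(n_estimators=20, random_state=0),
    cv=10,
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
```

Because the generalizer only ever sees predictions made on held-out folds, it learns how to weigh the base learners without being trained on their overfitted training-set outputs.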
The output of the proposed model can be explained using SHAP (Shapley Additive Explanations) values. Their purpose is to quantify how each individual feature value contributes to the prediction for a given instance. In an analogy with a cooperative game where the prediction is the reward, individual feature values are regarded as players [36]. Let us consider a dataset with $p$ features. We denote the set of features as $F$, where $F = \{1, 2, \ldots, p\}$. A collective contribution, denoted as $C$, is defined as a subset of features, $C \subseteq F$. We also consider the empty set, $\emptyset$, as a collective contribution with no features. The characteristic function is denoted as $v$. For each collective contribution $C$, $v(C)$ represents a real number known as the value of the collective contribution. It is worth noting that we need to compute contributions for all permutations of $F$, and the contribution of a feature, $i$, is calculated accordingly (Eq. 7).

$$ \phi_i(v) = \sum_{C \subseteq F \setminus \{i\}} \frac{|C|!\,(p - |C| - 1)!}{p!} \left[ v(C \cup \{i\}) - v(C) \right] \tag{7} $$

Here the factorial weight is the fraction of the $p!$ permutations of $F$ in which exactly the features in $C$ precede $i$.
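For a handful of features the Shapley contribution can be computed exactly by enumerating every subset that excludes the feature of interest. The sketch below does this for a hypothetical three-feature game; the characteristic function `v` and its numbers are invented for illustration, and practical SHAP implementations use faster algorithms such as Tree SHAP rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values for a characteristic function v over frozensets of players."""
    p = len(players)
    phi = {}
    for i in players:
        others = [j for j in players if j != i]
        total = 0.0
        # Enumerate every coalition C that does not contain player i
        for size in range(p):
            for subset in combinations(others, size):
                C = frozenset(subset)
                weight = factorial(size) * factorial(p - size - 1) / factorial(p)
                # Marginal contribution of i when joining coalition C
                total += weight * (v(C | {i}) - v(C))
        phi[i] = total
    return phi

# Hypothetical toy game: additive effects plus one interaction term
def v(coalition):
    value = 0.0
    if "Glucose" in coalition:
        value += 0.30
    if "BMI" in coalition:
        value += 0.10
    if "Glucose" in coalition and "Age" in coalition:
        value += 0.06
    return value

features = ["Glucose", "BMI", "Age"]
phi = shapley_values(features, v)
```

The contributions sum to the value of the full feature set, and the interaction term is shared equally between the two features that create it.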
Fig. 4.
Architecture of the proposed stacked ensemble model
We utilized the global mean (Tree SHAP) approach in the diabetes disease prediction model. The horizontal axis represents the average change in the model’s output magnitude when a feature is omitted or “hidden” from the model. “Hidden” implies excluding the variable from the model. The average influence on the model’s output can be expressed as given in Eq. 8:
| 8 |
Where the global mean Eq. 9 is
| 9 |
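Performance metrics
To determine whether a patient has diabetes or not, several machine learning algorithms are applied. The performance of these models is evaluated using Accuracy, Miss Rate, Precision, Recall, and F1-score, as defined in Eqs. 10, 11, 12, 13 and 14, respectively [37]. In these equations, TP stands for true positive, FP for false positive, TN for true negative, and FN for false negative. Additionally, the ROC curve is utilized, which is a probability curve plotting the True Positive Rate (TPR), or Recall, against the False Positive Rate (FPR), defined in Eqs. 13 and 15. This curve illustrates the performance of a classification model across all classification thresholds.

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10} $$

$$ \text{Miss Rate} = \frac{FP + FN}{TP + TN + FP + FN} \tag{11} $$

$$ \text{Precision} = \frac{TP}{TP + FP} \tag{12} $$

$$ \text{Recall} = \text{TPR} = \frac{TP}{TP + FN} \tag{13} $$

$$ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{14} $$

$$ \text{FPR} = \frac{FP}{FP + TN} \tag{15} $$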
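These metrics follow directly from the four confusion-matrix counts, as the short sketch below shows. One assumption is worth stating: miss rate is treated here as the overall misclassification rate (one minus accuracy), because the miss rate and accuracy columns of the results tables sum to roughly 100%. The counts passed in are hypothetical and are not the values behind the reported results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # also the true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "miss_rate": (fp + fn) / total,  # share of misclassified samples
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),           # false positive rate
    }

# Hypothetical counts, not the values behind the reported results
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```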
Results
In this section, we present a comprehensive analysis of the results obtained from our model. The performance of the model is evaluated using various metrics, including accuracy, recall, precision, and F1 score, along with the construction of a confusion matrix to assess the predictive capabilities. Additionally, the model’s interpretability is explored through SHAP (SHapley Additive exPlanations) values, providing valuable insights into the underlying factors influencing diabetes predictions. The combination of quantitative metrics and interpretability measures offers a holistic view of the model’s effectiveness and enhances our understanding of its decision-making process. Through this rigorous evaluation, we aim to demonstrate the model’s potential for early diabetes detection and its application in real-world healthcare scenarios.
The objective of this project is to create a model employing diverse machine learning approaches, aiming to assist healthcare professionals in early diabetes detection and improving patient well-being. The performance of multiple machine learning techniques was assessed in this study. The experimentation was carried out on a computing system equipped with an 8th generation Intel Core i7 processor, clocked at up to 3.1 GHz, and 16 GB of RAM. The entire dataset was tested on the selected ML algorithms in this experiment.
Table 2 displays the performance of various machine learning classifiers in terms of different metrics such as accuracy, precision, and ROC on the raw data. Among these techniques, the NGBoost Classifier exhibited the highest accuracy, achieving 78%. On the other hand, the XGBoost Classifier showed the lowest accuracy, with a score of 73% across different measures. Additionally, the voting classifier, GaussianNB, LGBM, and bagging classifier also delivered favorable results, attaining accuracies of 75.02%, 77%, 77%, and 76.5%, respectively.
Table 2.
Performance of ML algorithms without feature engineering
| Model | Accuracy(%) | Miss Rate(%) | Precision | Recall | F1 Score | ROC |
|---|---|---|---|---|---|---|
| Voting Classifier | 75.02 | 24.98 | 0.76 | 0.75 | 0.75 | 0.74 |
| RF with gridsearchCV | 76.05 | 23.95 | 0.76 | 0.76 | 0.76 | 0.75 |
| XGBoost classifier | 73.32 | 26.68 | 0.74 | 0.73 | 0.73 | 0.73 |
| NGBoost classifier | 78.02 | 21.98 | 0.78 | 0.78 | 0.78 | 0.77 |
| AdaBoost classifier | 76.4 | 23.6 | 0.76 | 0.76 | 0.76 | 0.76 |
| Bagging Classifier | 76.54 | 23.46 | 0.76 | 0.76 | 0.76 | 0.76 |
| GaussianNB | 77 | 23 | 0.77 | 0.77 | 0.77 | 0.77 |
| LGBM Classifier | 77 | 23 | 0.76 | 0.77 | 0.76 | 0.76 |
Table 3 illustrates the performances of several machine learning classifiers on processed data, considering various metrics such as accuracy, precision, and ROC. The Bagging Classifier emerged as the top-performing technique, achieving an accuracy of 92%. Conversely, the GaussianNB Classifier demonstrated the lowest accuracy, with a score of 73% across different measures. Additionally, the LGBM classifier, XGBoost classifier, NGBoost, RF with gridsearchCV, and AdaBoost classifier also delivered favorable results, attaining accuracies of 91%, 91%, 89%, 88%, and 87%, respectively.
Table 3.
Performance of ML algorithms after feature engineering
| Model | Accuracy(%) | Miss Rate(%) | Precision | Recall | F1 Score | ROC |
|---|---|---|---|---|---|---|
| Voting Classifier | 82 | 18 | 0.83 | 0.82 | 0.82 | 0.82 |
| RF with gridsearchCV | 88 | 12 | 0.88 | 0.88 | 0.88 | 0.88 |
| XGBoost classifier | 91 | 9 | 0.91 | 0.91 | 0.91 | 0.91 |
| NGBoost classifier | 89 | 11 | 0.89 | 0.89 | 0.89 | 0.89 |
| AdaBoost classifier | 87 | 13 | 0.87 | 0.87 | 0.87 | 0.87 |
| Bagging Classifier | 91.8 | 8.2 | 0.92 | 0.92 | 0.92 | 0.92 |
| GaussianNB | 73 | 27 | 0.78 | 0.73 | 0.74 | 0.72 |
| LGBM Classifier | 91 | 9 | 0.91 | 0.91 | 0.91 | 0.91 |
Upon applying the proposed feature engineering method to the raw data, we observed notable improvements in the accuracy of most algorithms, with the exception of the GaussianNB classifier. Specifically, the Voting classifier, RF with gridsearchCV, XGBoost, NGBoost, AdaBoost, LGBM, and Bagging classifiers achieved an increase in accuracy by 6.98%, 11.95%, 17.68%, 10.98%, 10.6%, 14%, and 15.34%, respectively, compared to the results obtained on raw data. These significant accuracy gains underscore the effectiveness of the feature engineering technique in enhancing the overall performance of the models. The performance of the mentioned models is presented in Table 3, and the top-performing classifiers are RF with gridsearchCV (88%), XGBoost (91%), NGBoost (89%), AdaBoost (87%), LGBM (91%), and Bagging (91.8%) classifiers. Based on their impressive accuracy and other performance metrics, we select these models as our base models for the ensemble approach.
Table 4 displays the performance of the proposed model on raw data, while Table 5 represents its performance on processed data obtained after implementing the feature engineering technique. Notably, the proposed model achieved its best performance on the processed data, with an impressive accuracy of 92.91%. This signifies a substantial increase of 16.91% compared to the accuracy of 76% achieved on raw data, along with improvements in other evaluation metrics.
Table 4.
Proposed model’s performance on raw data
| Model | Accuracy(%) | Miss Rate(%) | Precision | Recall | F1 Score | ROC |
|---|---|---|---|---|---|---|
| Proposed Classifier | 76 | 24 | 0.76 | 0.76 | 0.76 | 0.76 |
Table 5.
Proposed model’s performance after feature engineering on data
| Model | Accuracy(%) | Miss Rate(%) | Precision | Recall | F1 Score | ROC |
|---|---|---|---|---|---|---|
| Proposed Classifier | 92.91 | 7.09 | 0.93 | 0.92 | 0.93 | 0.93 |
Figures 5 and 6 depict the performance metrics for early detection of diabetes disease (yes and no) by the proposed classifier and other models. The proposed classifier exhibits exceptional precision values of 0.97 (no) and 0.87 (yes), high recalls of 0.91 (no) and 0.96 (yes), and favorable f1-scores of 0.94 (no) and 0.89 (yes). In contrast, the voting classifier demonstrates comparatively weaker performance with precisions of 0.91 (no) and 0.67 (yes), recalls of 0.83 (no) and 0.80 (yes), and f1-scores of 0.87 (no) and 0.73 (yes). Among other models, the XGBoost and AdaBoost classifiers exhibit similar performances, while the bagging classifier outperforms them with precisions of 0.96 (no) and 0.84 (yes), recalls of 0.91 (no) and 0.92 (yes), and f1-scores of 0.94 (no) and 0.88 (yes). Additionally, the LGBM classifier demonstrates favorable values as well. The Random Forest with gridsearchCV performs slightly less favorably compared to the XGBoost and AdaBoost classifiers.
Fig. 5.
Performance metrics of different ML techniques after feature engineering (No)
Fig. 6.
Performance metrics of different ML techniques after feature engineering (Yes)
Figure 7 presents a comparison of accuracy between different models before and after applying feature engineering to the data. Every model achieved improved accuracy after feature engineering. Specifically, the Voting classifier, RF with gridsearchCV, XGBoost, NGBoost, AdaBoost, LGBM, Bagging classifier, and the proposed model attained accuracy gains of 6.98%, 11.95%, 17.68%, 10.98%, 10.6%, 14%, 15.46%, and 16.78%, respectively, compared to their performance on raw data. Notably, the proposed model exhibited the highest accuracy of 92.91% in predicting diabetes disease among all other ML techniques. The Random Forest classifier with gridsearchCV, XGBoost classifier, NGBoost classifier, AdaBoost classifier, LGBM, and Bagging classifier achieved accuracies of 88%, 90.75%, 89.22%, 87%, 91%, and 91.88%, respectively.
Fig. 7.
Accuracy comparison of different models (before and after feature engineering)
Figure 8 displays the confusion matrix of the proposed work. The matrix diagonal represents the number of correctly classified data points for each class. A reliable model should exhibit a substantial value on the main diagonal and smaller values on the off-diagonal elements. In this case, the confusion matrix demonstrates significant values on the main diagonal, specifically 95 and 46, indicating successful classification for the respective classes.
Fig. 8.
Confusion matrix of the proposed stacked ensemble model
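To make the reading of a confusion matrix concrete, the sketch below derives accuracy and per-class precision and recall from a square matrix. It assumes the convention that rows are true classes and columns are predicted classes; the function name and the toy counts are hypothetical and are not the values shown in Figure 8.

```python
def per_class_metrics(matrix):
    """Compute accuracy and per-class precision/recall from a confusion matrix.

    matrix[i][j] is the number of samples whose true class is i and whose
    predicted class is j (an assumed convention).
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))  # main diagonal
    report = {"accuracy": correct / total}
    for k in range(n):
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        report[k] = {
            "precision": matrix[k][k] / predicted_k if predicted_k else 0.0,
            "recall": matrix[k][k] / actual_k if actual_k else 0.0,
        }
    return report


# Hypothetical counts for illustration only; NOT the values in Figure 8.
toy = [[8, 2],
       [1, 9]]
report = per_class_metrics(toy)
print(report["accuracy"])  # 0.85
print(report[0])
print(report[1])
```

Large diagonal entries raise every metric at once, while each off-diagonal entry lowers the recall of its row's class and the precision of its column's class.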
Figure 9 shows the ROC curves of the eight classifiers. The proposed model achieved the highest Area Under the Curve (AUC). The Bagging, LGBM, XGBoost, and NGBoost classifiers also exhibit favorable AUC values that are comparable to one another. In contrast, the AUC values of the voting classifier and Random Forest with gridsearchCV are notably lower, with the voting classifier having the lowest AUC of all. Based on these metrics, the proposed classifier is the best-performing model in this study.
Fig. 9.
ROC Curve of different classifiers after feature engineering
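The AUC used to rank the classifiers can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following sketch computes it by direct pairwise comparison; the function name, labels, and scores are hypothetical, and a library routine would normally be used on real data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

An AUC of 1.0 means every positive outranks every negative, and 0.5 corresponds to random ranking.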
Table 6 presents a comparison of different machine learning techniques employed for diabetes prediction using the PIMA Indian Diabetes Dataset in prior research articles. Our proposed model outperforms other individual models, exhibiting higher accuracy (92.91%) as well as superior performance in other evaluation metrics such as precision, recall, and f1-score. The results demonstrate that our approach offers significant improvements over existing methods, showcasing its efficacy in accurately predicting diabetes at an early stage. The higher accuracy and comprehensive performance evaluation of our model highlight its potential as a valuable tool for physicians and healthcare professionals in detecting and managing diabetes effectively.
Table 6.
Accuracy comparison with contemporary works
| Models | Dataset | Accuracy (%) |
|---|---|---|
| QML [9] | PIMA Indian Diabetes | 86 |
| Logistic Regression [14] | PIMA Indian Diabetes | 77.6 |
| LR with hyper-parameter [15] | PIMA Indian Diabetes | 91 |
| LR with grid and Randomsearch [16] | PIMA Indian Diabetes | 83 |
| SVM-RBF [17] | PIMA Indian Diabetes | 83.2 |
| Deep Extreme Learning Machine (DELM) [26] | PIMA Indian Diabetes | 92 |
| Random Forest [11] | PIMA Indian Diabetes | 82 |
| ANN [24] | PIMA Indian Diabetes | 92 |
| SVM [25] | PIMA Indian Diabetes | 82 |
| Proposed Work | PIMA Indian Diabetes | 92.91 |
We also employ SHAP (SHapley Additive exPlanations) values for model interpretation. SHAP is a game-theoretic method that can explain the output of any ML model: it quantifies how much each feature in the dataset contributes to a prediction made by our stacked ensemble model. As a result, we can discard attributes that barely affect the model’s predictions and focus on improving the more influential ones.
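The game-theoretic idea behind SHAP can be illustrated with an exact Shapley computation on a tiny example: a feature's value is its marginal contribution averaged over all orderings in which features can be added. The sketch below is a toy under stated assumptions. The value function, its numbers, and the helper names are hypothetical, and practical SHAP implementations rely on faster approximations rather than this enumeration.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = set()
        for f in order:
            before = value_fn(frozenset(coalition))
            coalition.add(f)
            after = value_fn(frozenset(coalition))
            totals[f] += after - before
    return {f: t / len(orderings) for f, t in totals.items()}


def toy_value(coalition):
    """Hypothetical model output when only the features in `coalition` are known."""
    score = 0.0
    if "Glucose" in coalition:
        score += 0.30
    if "Insulin" in coalition:
        score += 0.20
    if "Glucose" in coalition and "Insulin" in coalition:
        score += 0.10  # interaction term, shared equally by both features
    return score


phi = shapley_values(toy_value, ["Glucose", "Insulin", "Age"])
print(phi)
```

In this toy setup the attributions sum to the difference between the full-coalition output and the empty-coalition output, and a feature that never changes the output ("Age" here) receives zero, which is the property that justifies dropping low-impact attributes.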
Figure 10 shows the average impact of the selected features on the proposed model, where the horizontal axis gives the average impact value and the vertical axis lists the feature names ordered from highest to lowest effect. We can see that the feature “Insulin” has the highest effect, while one of the derived features has the lowest effect on the prediction of the model.
Fig. 10.
Impact of features to the proposed model
Figure 11 visualizes the 20 features with the greatest impact on the model’s predictions. The x-axis shows the SHAP values, while the y-axis lists the features, ordered by importance from highest to lowest. The color spectrum ranges from red, denoting higher feature values, to blue, representing lower feature values. Notably, high values of Insulin make a strong negative contribution to the prediction, whereas low values have a considerable positive impact. Similarly, the Glucose variable makes a highly positive contribution when its values are high and a moderate negative contribution when they are low. The lowest-ranked features contribute minimally to the prediction, regardless of whether their values are high or low. This view of feature importance helps explain how the model makes predictions and aids in identifying the most influential variables for diabetes prediction.
Fig. 11.
Feature Importance with SHAP Value
Discussion
This research paper introduces a novel approach, which aims to enhance the accuracy and interpretability of diabetes prediction models. The findings and insights gained from this study shed light on the potential application of machine learning techniques in the early detection of diabetes, with a specific focus on females.
One of the significant contributions of this research lies in the utilization of a stacked ensemble approach, which combines the strengths of multiple base classifiers to achieve better predictive performance. The ensemble model demonstrated a commendable accuracy of 92.91%, outperforming recent results by 1.91%. This robust accuracy underscores the effectiveness of the stacked ensemble technique in diabetes prediction and holds promise for practical applications in the healthcare sector.
Moreover, the application of SHAP (SHapley Additive exPlanations) values for model interpretation provided valuable insights into the factors influencing the predictions. However, this paper does not examine in depth the interpretability of the ensemble model or the specific impact of each feature on predictions. Future research could explore this aspect further to enhance the transparency and trustworthiness of the model’s decision-making process.
The research also acknowledges the use of a structured dataset, which might limit the model’s ability to handle unstructured data. As unstructured data sources become increasingly important in healthcare, further investigations on incorporating diverse data types would be beneficial to improve the model’s performance and versatility.
Additionally, while the proposed approach focuses on diabetes prediction in females, it would be insightful to expand the study to include male patients as well. This expansion would offer a more comprehensive understanding of the model’s generalizability and applicability across different patient populations.
The paper contributes to the growing body of research in diabetes prediction using machine learning and highlights the potential for early diagnosis, intervention, and personalized healthcare strategies. By addressing the mentioned limitations and building upon the proposed model’s strengths, future studies can continue to advance the field of diabetes prediction and contribute to improved patient outcomes and healthcare decision-making. Ultimately, the insights gained from this research have the potential to pave the way for more efficient and accurate diabetes prediction models, benefiting both clinicians and patients in the fight against this prevalent chronic disease.
Conclusions and future work
The primary objective of this research was to use various machine learning methods to predict the potential occurrence of diabetes at an early stage, specifically in females. Detecting diabetes early allows for timely lifestyle adjustments, which can help prevent complications and slow the progression of the condition, reducing the risk of severe outcomes such as heart and kidney damage. The paper presented multiple techniques for training diverse models, and the proposed classifier achieved an accuracy of 92.91%, outperforming recent results by 1.91%. To explain the ML models, Shapley additive explanation (SHAP) was employed, as it offers mathematical assurances for the accuracy and consistency of explanations.
This work has two main limitations. First, the study relied on a structured dataset, potentially overlooking the benefits of incorporating unstructured data. This might restrict the model’s ability to capture more complex patterns and insights from diverse data sources. Second, while SHAP values were used for model interpretation, the paper did not examine in depth the interpretability of the ensemble model or the individual feature contributions to predictions. A more thorough exploration of these aspects would improve the understanding of the model’s decision-making process and of the factors driving diabetes predictions. Despite these limitations, the research provides valuable contributions to the prediction of diabetes using a stacked ensemble machine learning approach.
Acknowledgements
We would like to thank the Bangladesh University of Business and Technology, and the Queensland University of Technology for providing the necessary facilities.
Appendix: Supplementary data
Algorithm 2.
Extended Feature Engineering (Dataset).
Author Contributions
Conceptualization and methodology, K.O. and M.H.R.; software, K.O., M.H.R. and M.M.I.; validation, K.O., M.H.R. and M.R.I.; formal analysis, M.W.; investigation, K.O. and M.R.I.; resources, M.H.R. and M.M.I.; writing-original draft preparation, K.O. and M.H.R.; writing-review and editing, M.W. and A.H.W.; visualization, K.O. and A.H.W.; supervision, M.W. and A.H.W. All authors have read and agreed to the published version of the manuscript.
Funding
This work received no external funding.
Data Availability
The data used to support the findings of the study are available at https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database.
Declarations
Conflicts of Interest
We have no conflicts of interest.
Financial Disclosure
No financial interests related to the material of this manuscript have been declared.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Mahedi Hasan Rasel and Md. Manzurul Islam contributed equally to this work.
References
- 1. Alam TM, Iqbal MA, Ali Y, Wahab A, Ijaz S, Baig TI, Hussain A, Malik MA, Raza MM, Ibrar S, et al. A model for early prediction of diabetes. Inform Med Unlocked. 2019;16:100204. doi: 10.1016/j.imu.2019.100204.
- 2. National Diabetes Statistics Report | Diabetes | Centers for Disease Control and Prevention. 2022. https://www.cdc.gov/diabetes/data/statistics-report/index.html. Accessed 25 Jan 2023.
- 3. Hosseini Sarkhosh SM, Esteghamati A, Hemmatabadi M, Daraei M. Predicting diabetic nephropathy in type 2 diabetic patients using machine learning algorithms. J Diabetes Metab Disord. 2022;21(2):1433–1441. doi: 10.1007/s40200-022-01076-2.
- 4. Yang MH, Hall SA, Piccolo RS, Maserejian NN, McKinlay JB. Do behavioral risk factors for prediabetes and insulin resistance differ across the socioeconomic gradient? Results from a community-based epidemiologic survey. International Journal of Endocrinology. 2015;2015.
- 5. Hemanth S, Alagarsamy S. Hybrid adaptive deep learning classifier for early detection of diabetic retinopathy using optimal feature extraction and classification. J Diabetes Metab Disord. 2023:1–15.
- 6. Nabovati E, Rangraz Jeddi F, Tabatabaeizadeh SM, Hamidi R, Sharif R. Design, development, and usability evaluation of a smartphone-based application for nutrition management in patients with type ii diabetes. J Diabetes Metab Disord. 2022:1–9.
- 7. Bukhari MM, Alkhamees BF, Hussain S, Gumaei A, Assiri A, Ullah SS. An improved artificial neural network model for effective diabetes prediction. Complexity. 2021;2021:1–10. doi: 10.1155/2021/5525271.
- 8. Khodabakhsh P, Asadnia A, Moghaddam AS, Khademi M, Shakiba M, Maher A, Salehian E. Prediction of in-hospital mortality rate in covid-19 patients with diabetes mellitus using machine learning methods. J Diabetes Metab Disord. 2023:1–14.
- 9. Gupta H, Varshney H, Sharma TK, Pachauri N, Verma OP. Comparative performance analysis of quantum machine learning with deep learning for diabetes prediction. Complex Intell Syst. 2022;8(4):3073–3087. doi: 10.1007/s40747-021-00398-7.
- 10. Maniruzzaman M, Rahman MJ, Ahammed B, Abedin MM. Classification and prediction of diabetes disease using machine learning paradigm. Health Inf Sci Syst. 2020;8:1–14. doi: 10.1007/s13755-019-0095-z.
- 11. Ramesh J, Aburukba R, Sagahyroon A. A remote healthcare monitoring framework for diabetes prediction using machine learning. Healthc Technol Lett. 2021;8(3):45–57. doi: 10.1049/htl2.12010.
- 12. Mujumdar A, Vaidehi V. Diabetes prediction using machine learning algorithms. Procedia Comput Sci. 2019;165:292–299. doi: 10.1016/j.procs.2020.01.047.
- 13. Swapna G, Vinayakumar R, Soman K. Diabetes detection using deep learning algorithms. ICT Express. 2018;4(4):243–246. doi: 10.1016/j.icte.2018.10.005.
- 14. Mohammadi G, Pezeshki F, Vatanchi YM, Moghbeli F. Application of technology in educating nursing students during covid-19: A systematic review. Front Health Inform. 2021;10(1):64. doi: 10.30699/fhi.v10i1.273.
- 15. Latchoumi T, Dayanika J, Archana G. A comparative study of machine learning algorithms using quick-witted diabetic prevention. Ann Romanian Soc Cell Biol. 2021:4249–59.
- 16. Krishnamoorthi R, Joshi S, Almarzouki HZ, Shukla PK, Rizwan A, Kalpana C, Tiwari B, et al. A novel diabetes healthcare disease prediction framework using machine learning techniques. J Healthc Eng. 2022;2022. [Retracted]
- 17. Abdulhadi N, Al-Mousa A. Diabetes detection using machine learning classification methods. In: 2021 International conference on information technology (ICIT). IEEE; 2021. pp. 350–354.
- 18. Nadeem MW, Goh HG, Ponnusamy V, Andonovic I, Khan MA, Hussain M. A fusion-based machine learning approach for the prediction of the onset of diabetes. In: Healthcare, MDPI; 2021. vol. 9, p. 1393.
- 19. Hasan MK, Alam MA, Das D, Hossain E, Hasan M. Diabetes prediction using ensembling of different machine learning classifiers. IEEE Access. 2020;8:76516–76531. doi: 10.1109/ACCESS.2020.2989857.
- 20. Naz H, Ahuja S. Deep learning approach for diabetes prediction using pima indian dataset. J Diabetes Metab Disord. 2020;19:391–403. doi: 10.1007/s40200-020-00520-5.
- 21. Juneja A, Juneja S, Kaur S, Kumar V. Predicting diabetes mellitus with machine learning techniques using multi-criteria decision making. Int J Inf Retr Res (IJIRR). 2021;11(2):38–52.
- 22. Zou Q, Qu K, Luo Y, Yin D, Ju Y, Tang H. Predicting diabetes mellitus with machine learning techniques. Front Genet. 2018;9:515. doi: 10.3389/fgene.2018.00515.
- 23. Moradifar P, Amiri MM. Prediction of hypercholesterolemia using machine learning techniques. J Diabetes Metab Disord. 2022:1–11.
- 24. Srivastava S, Sharma L, Sharma V, Kumar A, Darbari H. Prediction of diabetes using artificial neural network approach. In: Engineering vibration, communication and information processing: ICoEVCI 2018, Springer: India; 2019. pp. 679–687.
- 25. Ahmed U, Issa GF, Khan MA, Aftab S, Khan MF, Said RA, Ghazal TM, Ahmad M. Prediction of diabetes empowered with fused machine learning. IEEE Access. 2022;10:8529–8538. doi: 10.1109/ACCESS.2022.3142097.
- 26. Rehman A, Athar A, Khan MA, Abbas S, Fatima A, Saeed A, et al. Modelling, simulation, and optimization of diabetes type ii prediction using deep extreme learning machine. J Ambient Intell Smart Environ. 2020;12(2):125–138. doi: 10.3233/AIS-200554.
- 27. Pima Indians Diabetes Database, kaggle.com. https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database. Accessed 22 Nov 2022.
- 28. Data MC, Komorowski M, Marshall DC, Salciccioli JD, Crutain Y. Exploratory data analysis. Secondary Analysis of Electronic Health Records. 2016:185–203.
- 29. Ahmad GN, Fatima H, Ullah S, Saidi AS, et al. Efficient medical diagnosis of human heart diseases using machine learning techniques with and without gridsearchcv. IEEE Access. 2022;10:80151–80173. doi: 10.1109/ACCESS.2022.3165792.
- 30. Ahamed BS, Arya S, et al. Lgbm classifier based technique for predicting type-2 diabetes. Eur J Intern Med. 2021;8(3):454–467.
- 31. Wang C, Deng C, Wang S. Imbalance-xgboost: leveraging weighted and focal losses for binary label-imbalanced classification with xgboost. Pattern Recogn Lett. 2020;136:190–197. doi: 10.1016/j.patrec.2020.05.035.
- 32. Dhaliwal SS, Nahid A-A, Abbas R. Effective intrusion detection system using xgboost. Information. 2018;9(7):149. doi: 10.3390/info9070149.
- 33. Duan T, Anand A, Ding DY, Thai KK, Basu S, Ng A, Schuler A. Ngboost: natural gradient boosting for probabilistic prediction. In: International conference on machine learning. PMLR; 2020. pp. 2690–2700.
- 34. Soui M, Mansouri N, Alhamad R, Kessentini M, Ghedira K. Nsga-ii as feature selection technique and adaboost classifier for covid-19 prediction using patient’s symptoms. Nonlinear Dyn. 2021;106(2):1453–75. doi: 10.1007/s11071-021-06504-1.
- 35. Manimegalai T, Manju J, Rubiston MM, Vidhyashree B, Prabu RT. Prediction of optimized stock market trends using hybrid approach based on knn and bagging classifier (knnb). In: 2022 IEEE 11th International Conference on Communication Systems and Network Technologies (CSNT). IEEE; 2022. pp. 257–262.
- 36. Wang D, Thunéll S, Lindberg U, Jiang L, Trygg J, Tysklind M. Towards better process management in wastewater treatment plants: Process analytics based on shap values for tree-based machine learning methods. J Environ Manage. 2022;301:113941.
- 37. Sagar SP, Oliullah K, Sohan K, Patwary MFK. Prcmla: product review classification using machine learning algorithms. In: Proceedings of international conference on trends in computational and cognitive engineering: proceedings of TCCE 2020. Springer; 2021. pp. 65–75.