BMC Medical Informatics and Decision Making. 2025 Dec 31;26:36. doi: 10.1186/s12911-025-03333-9

Explainable Extra Trees Classifier model for early detection of type 2 diabetes: evidence from the PERSIAN Dena Cohort

Mustafa Ghaderzadeh 1, Zahra Rafie 2, Cirruse Salehnasab 3
PMCID: PMC12866003  PMID: 41469992

Abstract

Background

Type 2 diabetes mellitus (T2DM) develops gradually and often remains undiagnosed until complications emerge. Early detection through transparent machine-learning models can improve prevention and targeted screening. This study developed and evaluated an interpretable Extra Trees Classifier (ETC) for early detection of T2DM within the PERSIAN Dena Cohort, emphasizing probability calibration, fairness, and clinical interpretability.

Methods

Data from 3,203 adults aged 35–70 years were analyzed. Seventy-nine demographic, lifestyle, anthropometric, comorbidity, and biochemical variables were considered; fifteen informative predictors were retained after preprocessing and feature elimination. The ETC was optimized by randomized hyperparameter search and evaluated through ten-fold cross-validation with an additional 80 / 20 internal–external split. Isotonic regression was used to calibrate probability estimates. Model transparency and feature influence were examined using SHapley Additive exPlanations (SHAP) and Morris sensitivity analysis.

Results

Cross-validated performance showed mean accuracy 0.69 ± 0.03 and AUC 0.69 ± 0.04, indicating moderate discrimination and stable internal consistency. On the 20% hold-out set, the uncalibrated model achieved AUC 0.67 and F1 0.66. After isotonic calibration, AUC declined to 0.64 and the Brier score increased to 0.48 (slope 0.09; intercept − 1.50), revealing under-confident probability estimates. Excluding fasting blood sugar (FBS) improved performance (AUC 0.77), whereas categorizing FBS into deciles reduced AUC to 0.57. Across sex and age subgroups, AUCs ranged 0.63–0.70 without systematic bias. SHAP and Morris analyses identified FBS, fatty-liver status, age, kidney-stone history, and triglycerides as dominant predictors, with lifestyle factors such as beverage and vegetable intake exerting secondary, modifiable influence.

Conclusions

Although overall predictive power was limited, the calibrated ETC provided transparent insight into feature interactions, calibration behavior, and data limitations. The framework highlights that interpretability and fairness are as essential as accuracy for trustworthy clinical AI. Future research should expand predictor diversity, address class imbalance, and validate across other PERSIAN cohorts to develop a more generalizable, interpretable model for early T2DM risk prediction.

Graphical abstract

[Graphical abstract image: 12911_2025_3333_Figa_HTML.jpg]

Keywords: Type 2 diabetes mellitus, Machine learning, Extra trees classifier, Explainable AI, Probability calibration, PERSIAN Dena Cohort

Introduction

Type 2 diabetes mellitus (T2DM) is a chronic metabolic disorder characterized by impaired insulin secretion and insulin resistance, leading to persistent hyperglycemia and a wide range of complications—including cardiovascular disease, nephropathy, neuropathy, and retinopathy—that severely impact quality of life. It remains one of the most pressing public-health challenges worldwide. According to the 11th edition of the IDF Diabetes Atlas, an estimated 589 million adults were living with diabetes in 2024, and this number is projected to exceed 850 million by 2050; more than 40% of all cases remain undiagnosed [1, 2]. Complementary evidence from the NCD Risk Factor Collaboration estimated 828 million adults with diabetes globally in 2022, nearly 445 million of whom were untreated, underscoring continuing gaps in early detection and care [3]. Both the World Health Organization Global Report on Diabetes and the American Diabetes Association Standards of Care emphasize the urgent need for prevention and earlier diagnosis [4, 5]. Beyond its health toll, diabetes imposes a formidable economic burden, with annual healthcare expenditures now exceeding one trillion US dollars and projected to double by 2030 [6].

Traditional statistical models—such as logistic regression and Cox proportional-hazards models—have long served as the backbone of diabetes risk prediction due to their interpretability and analytical robustness [7, 8]. However, these approaches rely on assumptions of linearity and independence that rarely hold in complex, high-dimensional biomedical data. Machine-learning (ML) techniques offer more flexible solutions capable of modeling non-linear interactions and managing heterogeneous data structures. Ensemble tree-based algorithms, including Random Forest [9], Gradient Boosting, XGBoost [10], and the Extra Trees Classifier (ETC), have consistently achieved strong predictive performance for T2DM in population-based studies [11, 12]. Among them, XGBoost has become widely used in healthcare prediction because of its scalability and efficiency [10]. Systematic reviews confirm that tree-based methods frequently outperform—or complement—deep-learning and hybrid frameworks depending on dataset complexity [13, 14].

Despite these successes, most ML models remain “black boxes,” limiting their interpretability and clinical acceptance. Healthcare professionals must understand the rationale behind model predictions before applying them to patient care [15, 16]. Explainable AI (XAI) methods aim to bridge this gap by quantifying the contribution of each predictor to the model output. SHapley Additive exPlanations (SHAP) have gained particular prominence because they provide both global and local interpretability while maintaining theoretical consistency [17, 18]. Complementary tools such as Local Interpretable Model-agnostic Explanations (LIME) and Partial Dependence Plots (PDP) can further visualize relationships but lack SHAP’s additive fairness. Global sensitivity analysis, such as the Morris method, augments SHAP by assessing the robustness of feature influence across the entire input space [19].

Recent studies integrating ensemble classifiers with XAI frameworks have demonstrated that interpretable models not only sustain competitive accuracy but also expose meaningful clinical predictors—including fasting blood sugar (FBS), body-mass index (BMI), triglycerides (TG), and blood pressure—thereby enhancing user trust [20, 21]. While some hybrid or multi-stage models report extremely high accuracy [22, 23], these systems often sacrifice transparency and reproducibility. Human-centered interpretability therefore remains indispensable for real-world deployment [15, 16].

Against this background, the present study sought to develop, calibrate, and interpret an Extra Trees Classifier for early detection of T2DM using data from the PERSIAN Dena Cohort [24]. The ETC was selected for its robustness to noisy and correlated predictors, computational efficiency, and ease of interpretation relative to deeper or more opaque architectures [11, 12]. SHAP analysis was used to explain both individual and population-level predictions, while the Morris sensitivity method assessed the stability of feature influence. In addition to interpretability, rigorous internal–external validation and isotonic calibration were implemented to ensure the model’s reliability and fairness.

This work makes three principal contributions:

  1. Construction and validation of a robust ETC for T2DM prediction using comprehensive demographic, anthropometric, lifestyle, comorbidity, and biochemical variables.

  2. Integration of SHAP and Morris sensitivity analyses to provide complementary global and local interpretability, supported by isotonic probability calibration.

  3. Demonstration of how interpretable outputs and calibration insights can support medical decision-making by linking predictions to both biological and modifiable lifestyle risk factors.

By addressing the dual challenges of accuracy and explainability, this study advances the practical readiness of ML-based tools for diabetes prediction and contributes to their responsible integration into clinical and public-health practice.

Methods

Study design and data source

This retrospective study analyzed a structured dataset containing seventy-nine independent predictors and one binary outcome variable, Has D.M II, denoting the presence or absence of type 2 diabetes mellitus (T2DM). All records were anonymized before analysis. Predictors represented five domains—demographic, lifestyle, anthropometric, comorbidity, and biochemical. The dataset was derived from the PERSIAN Dena Cohort, one of the regional components of the national PERSIAN project designed to investigate non-communicable disease determinants across Iran [24].

Study population

Data from 3,203 adults aged 35–70 years were included. T2DM was defined as fasting-blood-sugar (FBS) ≥ 126 mg/dL or a documented physician diagnosis accompanied by antidiabetic therapy. Among the participants, 402 (12.55%) had diabetes, while 2,801 were non-diabetic. Predictors comprised demographic variables (age, sex, education, employment), lifestyle indicators (dietary pattern, physical activity, smoking, alcohol use, sleep duration), anthropometric indices (body-mass index [BMI], waist and hip circumferences), comorbidities (hypertension, cardiovascular disease, fatty liver, thyroid disorder), and biochemical measurements (triglycerides, HDL-C, LDL-C, liver enzymes, blood pressure).
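The outcome definition above can be sketched as a simple labeling rule. This is an illustrative reconstruction, assuming a pandas data frame; the column names `FBS`, `physician_dx`, and `on_antidiabetic_rx` are hypothetical placeholders, not the cohort's actual field names:

```python
import pandas as pd

def label_t2dm(df: pd.DataFrame) -> pd.Series:
    """Binary T2DM label: FBS >= 126 mg/dL, or a documented physician
    diagnosis accompanied by antidiabetic therapy. Column names here are
    illustrative placeholders, not the cohort's actual field names."""
    by_fbs = df["FBS"] >= 126
    by_diagnosis = df["physician_dx"] & df["on_antidiabetic_rx"]
    return (by_fbs | by_diagnosis).astype(int)

# Three toy participants: normoglycemic, biochemical case, treated case.
demo = pd.DataFrame({
    "FBS": [95, 130, 110],
    "physician_dx": [False, False, True],
    "on_antidiabetic_rx": [False, False, True],
})
print(label_t2dm(demo).tolist())  # [0, 1, 1]
```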

Data preprocessing

Missing values for continuous variables were imputed using the median, and categorical features were imputed using the mode. Categorical variables were encoded numerically via one-hot encoding. Because the Extra Trees Classifier (ETC) is inherently scale-invariant, no normalization or standardization was required for continuous predictors. Class imbalance was addressed using the Synthetic Minority Over-Sampling Technique (SMOTE) to balance positive and negative classes. These preprocessing steps align with best practices in previous ML studies for diabetes prediction [11, 23, 25].
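A minimal sketch of these preprocessing steps, assuming a pandas/scikit-learn stack with illustrative column names; SMOTE itself lives in the separate imbalanced-learn package and is therefore only indicated in a comment rather than executed:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the cohort data (columns are illustrative).
df = pd.DataFrame({
    "Age": [40.0, np.nan, 55.0, 62.0],
    "BMI": [27.1, 31.4, np.nan, 24.8],
    "Employmentstatus": ["employed", np.nan, "retired", "employed"],
})
num_cols = ["Age", "BMI"]
cat_cols = ["Employmentstatus"]

# Median imputation for continuous variables, mode for categorical.
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# One-hot encode categorical predictors.
df = pd.get_dummies(df, columns=cat_cols)

# No scaling is applied: Extra Trees split on thresholds, so they are
# invariant to monotone rescaling of the inputs.
# Class imbalance would then be addressed with SMOTE, e.g.:
#   from imblearn.over_sampling import SMOTE
#   X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(df.isna().sum().sum())  # 0
```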

Feature selection

To reduce dimensionality and enhance interpretability, recursive feature elimination with cross-validation (RFECV) was applied. RFECV iteratively removed less informative variables while maintaining predictive performance, resulting in a final set of 15 features. Feature selection has been shown to improve model generalization and computational efficiency in high-dimensional biomedical datasets [13, 19, 26].
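The RFECV step could be implemented as below; this is a sketch on synthetic data with illustrative hyperparameters, not the study's exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the cohort: 20 candidate features, a few informative.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=42)

selector = RFECV(
    estimator=ExtraTreesClassifier(n_estimators=50, random_state=42),
    step=1,                     # drop one feature per elimination round
    cv=StratifiedKFold(3),
    scoring="roc_auc",
)
selector.fit(X, y)
print(selector.n_features_)     # number of retained features
print(selector.support_)        # boolean mask of the retained columns
```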

Model selection and rationale

The Extra Trees Classifier (ETC) was selected as the predictive model for several reasons:

  • Demonstrated high predictive accuracy in structured healthcare data [11, 12];

  • Robustness to multicollinearity and noisy variables;

  • Intrinsic interpretability via ensemble feature-importance measures; and

  • Efficiency in high-dimensional, heterogeneous datasets.

Unlike studies comparing multiple algorithms, this work intentionally focused on a single model to ensure methodological transparency and enable detailed explainability and calibration analysis within a consistent framework.

Hyperparameter optimization

Hyperparameters were tuned using randomized search with stratified ten-fold cross-validation, iterating 100 combinations to maximize mean AUC. Parameters explored included:

  • n_estimators: 100–1,000

  • max_depth: 5–50

  • min_samples_split: 2–10

  • min_samples_leaf: 1–5

  • max_features: {sqrt, log2, or 0.1–1.0 fraction}

  • bootstrap: {True, False}

Randomization ensured efficient exploration of the parameter space and mitigated overfitting [14].
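The search described above maps directly onto scikit-learn's RandomizedSearchCV. The sketch below mirrors the listed ranges on synthetic data, but shrinks the budget (5 iterations, 3 folds, at most 300 trees) so it runs quickly; the study used 100 combinations with stratified ten-fold cross-validation:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, n_features=15, random_state=0)

# Search space mirroring the ranges listed above (n_estimators narrowed
# here from the study's 100-1,000 to keep the sketch fast).
param_distributions = {
    "n_estimators": randint(100, 301),
    "max_depth": randint(5, 51),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 6),
    "max_features": ["sqrt", "log2", 0.5],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_distributions,
    n_iter=5,                                  # study: 100 combinations
    scoring="roc_auc",                         # maximize mean AUC
    cv=StratifiedKFold(3, shuffle=True, random_state=0),  # study: 10 folds
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```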

Model evaluation

Initial performance was assessed using stratified ten-fold cross-validation, reporting mean ± SD for accuracy, AUC, recall, precision, F1-score, Cohen’s Kappa, and Matthews Correlation Coefficient (MCC). To obtain a more realistic estimate of generalizability, an 80 / 20 internal–external hold-out validation was then performed [6]. Metrics on the hold-out set included AUC, F1, specificity, Brier score, calibration slope and intercept, precision, recall, and MCC.
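A sketch of this evaluation protocol, assuming synthetic data with roughly the cohort's 12–13% positive rate; Kappa and MCC need custom scorers since they are not built-in scoring strings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import (cohen_kappa_score, make_scorer,
                             matthews_corrcoef, roc_auc_score)
from sklearn.model_selection import (StratifiedKFold, cross_validate,
                                     train_test_split)

# Synthetic stand-in with an imbalanced outcome (~13% positive).
X, y = make_classification(n_samples=500, n_features=15, weights=[0.87],
                           random_state=1)
model = ExtraTreesClassifier(n_estimators=100, random_state=1)

scoring = {
    "accuracy": "accuracy",
    "auc": "roc_auc",
    "recall": "recall",
    "precision": "precision",
    "f1": "f1",
    "kappa": make_scorer(cohen_kappa_score),
    "mcc": make_scorer(matthews_corrcoef),
}
cv_results = cross_validate(model, X, y,
                            cv=StratifiedKFold(10, shuffle=True, random_state=1),
                            scoring=scoring)
for name in scoring:
    scores = cv_results[f"test_{name}"]
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")

# 80/20 stratified hold-out for the internal-external check.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=1)
holdout_auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
print(f"hold-out AUC: {holdout_auc:.2f}")
```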

Probability calibration

Because uncalibrated tree-based ensembles may yield over- or under-confident probabilities, isotonic regression was applied for post-hoc calibration [6, 21]. The calibration model was fitted on out-of-fold predictions from cross-validation and evaluated on the hold-out set. Calibration quality was summarized by slope (ideal = 1), intercept (ideal = 0), and Brier score (lower = better). Reliability plots compared predicted and observed event frequencies, enabling visual assessment of probability accuracy.
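The calibration step can be sketched with scikit-learn's CalibratedClassifierCV. The slope/intercept computation below uses a near-unpenalized logistic recalibration of the outcome on the predicted log-odds, which is one common way to obtain these quantities; it is an assumption here, not necessarily the authors' exact procedure:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=15, weights=[0.87],
                           random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=2)

# Isotonic post-hoc calibration fitted on out-of-fold predictions (cv=10),
# mirroring the procedure described above.
calibrated = CalibratedClassifierCV(
    ExtraTreesClassifier(n_estimators=100, random_state=2),
    method="isotonic", cv=10)
calibrated.fit(X_tr, y_tr)

p = calibrated.predict_proba(X_te)[:, 1]
print("Brier:", round(brier_score_loss(y_te, p), 3))   # lower = better

# Calibration slope/intercept: near-unpenalized logistic recalibration of
# the outcome on the log-odds of the predictions (ideal slope 1, intercept 0).
eps = 1e-6
p_clip = np.clip(p, eps, 1 - eps)
logit = np.log(p_clip / (1 - p_clip))
recal = LogisticRegression(C=1e6).fit(logit.reshape(-1, 1), y_te)
print("slope:", round(float(recal.coef_[0, 0]), 2),
      "intercept:", round(float(recal.intercept_[0]), 2))
```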

Sensitivity and subgroup analyses

Model robustness was examined under two sensitivity scenarios:

  1. FBS exclusion: to assess dependence on a diagnostic variable.

  2. FBS categorization: substitution of the continuous FBS with decile-based categories.

Fairness was further evaluated by computing AUC across sex (male/female) and age (35–49, 50–59, 60–70 years) subgroups, thereby exploring potential bias in discrimination performance [15, 16].
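The subgroup comparison reduces to computing AUC on demographic slices of the hold-out predictions. A sketch with hypothetical labels, calibrated probabilities, and group assignments:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 400
# Hypothetical hold-out labels, calibrated probabilities, and demographics.
y_true = rng.integers(0, 2, n)
y_prob = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, n), 0, 1)
sex = rng.choice(["male", "female"], n)
age = rng.choice(["35-49", "50-59", "60-70"], n)

def subgroup_auc(groups):
    """AUC computed separately within each demographic subgroup."""
    return {g: round(roc_auc_score(y_true[groups == g], y_prob[groups == g]), 2)
            for g in np.unique(groups)}

print(subgroup_auc(sex))
print(subgroup_auc(age))
```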

Explainable AI approach

Model interpretability was examined using three complementary approaches:

  1. SHapley Additive exPlanations (SHAP) provided both global and local interpretability, assigning each predictor an additive contribution to the model output [17, 18, 20].

  2. Morris global sensitivity analysis quantified the stability and magnitude of feature influence under controlled perturbations [19].

  3. Baseline feature importance from the ETC’s built-in Gini-impurity measure was used as a reference for comparison.

Together, these methods revealed how biochemical, anthropometric, and lifestyle variables contributed to predicted diabetes risk, enhancing transparency and clinical interpretability [13, 15, 20].
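As a self-contained illustration of approaches (2) and (3), the sketch below reads the ETC's built-in Gini importances and computes a simplified Morris-style µ* via one-at-a-time perturbations. This is a deliberate simplification: the study used SALib's full elementary-effects design, and SHAP values (approach 1) would come from shap.TreeExplainer, neither of which is executed here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=4)
model = ExtraTreesClassifier(n_estimators=100, random_state=4).fit(X, y)

# (3) Baseline Gini-impurity importances, built into the ETC.
gini = model.feature_importances_

# (2) Simplified Morris-style mu* per feature: mean absolute change in
# predicted probability under a one-at-a-time shift of that feature.
# (SALib's elementary-effects design samples full trajectories instead.)
def morris_mu_star(model, X, delta=0.5):
    base = model.predict_proba(X)[:, 1]
    mu_star = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += delta * X[:, j].std()
        mu_star.append(np.mean(np.abs(model.predict_proba(Xp)[:, 1] - base)))
    return np.array(mu_star)

mu = morris_mu_star(model, X)
print("Gini top feature:", int(np.argmax(gini)))
print("Morris-style top feature:", int(np.argmax(mu)))
```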

Software configuration

Analyses were performed in Python 3.11. Data preprocessing utilized pandas v2.2 and NumPy v1.26; modeling was conducted with scikit-learn v1.4. Visualizations were produced using matplotlib v3.8 and seaborn v0.13. SHAP analyses used shap v0.44, and sensitivity analysis was implemented via SALib.

Ethical considerations

The study conformed to the principles of the Declaration of Helsinki for research involving human subjects. All data were fully anonymized before analysis, and no personally identifiable information was available to investigators. Because only secondary anonymized data were used, formal ethical review was waived under local regulations. Confidentiality and data-protection standards were maintained throughout.

Results

Dataset characteristics

The analysis included 3,203 adults aged 35–70 years from the PERSIAN Dena Cohort; among them, 402 (12.55%) met the criteria for type 2 diabetes mellitus (T2DM) and 2,801 (87.45%) were non-diabetic.

The curated dataset comprised 79 candidate predictors encompassing demographic, lifestyle, anthropometric, comorbidity, and biochemical domains, with Has D.M II as the binary outcome.

A full description of the predictors, their definitions, and modeling roles is presented in Table 1.

Table 1.

Features and specifications of the dataset

Row Feature Description Type Role
1 GenderID Male or Female Independent Input
2 Age Numeric Independent Input
3 LastEduID Highest Educational Degree Independent Input
4 MET_Final Metabolic condition of individuals Independent Input
5 WaistCircumference Waist circumference (numeric) Independent Input
6 HipCircumference Hip circumference (numeric) Independent Input
7 WristCircumference Wrist circumference (numeric) Independent Input
8 BMI Body mass index based on height and weight Independent Input
9 HasJob_new Recent job change Independent Input
10 Employmentstatus Employment status Independent Input
11 SleepDuration24h1 Sleep duration in 24 h Independent Input
12 SleepDurationMidDay1 Midday nap duration Independent Input
13 TV1 Hours of TV watching Independent Input
14 AtDeskWork1 Desk job hours Independent Input
15 Computer1 Computer usage hours Independent Input
16 Eating1 Amount of food consumed Independent Input
17 Cooking1 Method of food preparation Categorical Input
18 Driving1 Time spent driving Independent Input
19 Walking1 Time spent walking Independent Input
20 AerobicExercise1 Time spent on aerobic exercise Independent Input
21 DrivingHeavyVehicle1 Heavy vehicle driver Independent Input
22 LightAgricultural1 Low-maintenance agricultural work Independent Input
23 HeavyLaborAgricultJobs1 High-maintenance agricultural work Independent Input
24 DPB Diastolic blood pressure Independent Input
25 SPB Systolic blood pressure Independent Input
26 RBC Red blood cell count Independent Input
27 HGB Hemoglobin level Independent Input
28 FBS Fasting blood sugar Independent Input
29 TG Triglycerides level Independent Input
30 CHOL Cholesterol level Independent Input
31 SGOT Aspartate aminotransferase (AST) enzyme Independent Input
32 SGPT Alanine aminotransferase (ALT) enzyme Independent Input
33 ALP Alkaline phosphatase enzyme Independent Input
34 HDL.C High-density lipoprotein cholesterol (HDL-C) Independent Input
35 GGT Gamma-glutamyl transferase enzyme Independent Input
36 LDL_Calc Low-density lipoprotein cholesterol (LDL-C) Independent Input
37 WSI_total Number of imaging sessions Independent Input
38 UseDrugs Use of narcotics or opiates Independent Input
39 UseAlcohol Alcohol consumption Independent Input
40 Pizza Pizza consumption Independent Input
41 TEACOFFEE Tea and coffee consumption Independent Input
42 BAVARAGE Energy drinks consumption Independent Input
43 Pickle Pickle consumption Independent Input
44 Oil Oil consumption Independent Input
45 SimpleSugar Simple sugar consumption Independent Input
46 Vegetables Vegetable consumption Independent Input
47 Nuts Nut consumption Independent Input
48 Juice Juice consumption Independent Input
49 Fruits Fruit consumption Independent Input
50 Legumes Legume consumption Independent Input
51 Whitemeat White meat consumption Independent Input
52 Redmeat Red meat consumption Independent Input
53 Totalmeat Total meat consumption Independent Input
54 dairyproduct Dairy product consumption Independent Input
55 Wholegrain Whole grain consumption Independent Input
56 Nutrient FFQ_protein Daily protein intake Independent Input
57 Total_lipid_fat Liquid fat consumption Independent Input
58 GrilledFoodIntID Grilled food consumption Independent Input
59 FriedFoodIntID Fried food consumption Independent Input
60 PotatoFryTypeID Fried potato consumption Independent Input
61 VegFryTypeID Fried vegetable consumption Independent Input
62 UsedOilTypeID Type of oil used Independent Input
63 ReUseMold Reuse of containers Independent Input
64 UsedScratchedTeflon Use of scratched Teflon pans Independent Input
65 FoodSaltUsedID Salt used in food Independent Input
66 OnionFryTypeID Fried onion consumption Independent Input
67 CookingWareID1 Type of cookware used Independent Input
68 VegKeepDried Consumption of dried vegetables Independent Input
69 VegKeepRefrigerator Consumption of refrigerated vegetables Independent Input
70 VegKeepFreezer Consumption of frozen vegetables Independent Input
71 BreadContainerTypeID1 Bread-containing products consumption Independent Input
72 LemonContainerTypeID1 Lemon-containing products consumption Independent Input
73 CVD_History History of stroke Independent Input
74 HasDepression Presence of depression Independent Input
75 urolithiasis Presence of urolithiasis Independent Input
76 HasThyroidDisease Presence of thyroid disease Independent Input
77 HasFattyLiver Presence of fatty liver Independent Input
78 HasCardiacDisease Presence of heart disease Independent Input
79 HasHypertension Presence of hypertension Independent Input
80 Has D.M II Presence of Type 2 diabetes Dependent Target

Missing values were imputed by the median (continuous variables) or mode (categorical), and categorical predictors were one-hot encoded.

Because the Extra Trees Classifier (ETC) is scale-invariant, no normalization or standardization was applied.

Class imbalance (≈ 12% positive) was addressed by the Synthetic Minority Over-Sampling Technique (SMOTE).

Feature selection with recursive feature elimination and cross-validation (RFECV) reduced the 79 variables to 15 informative predictors, which formed the final modeling set.

Overall discrimination and calibration

Ten-fold cross-validation produced mean accuracy 0.69 ± 0.03 and AUC 0.69 ± 0.04, reflecting moderate discrimination and stable internal consistency.

On the independent 20% hold-out set, the uncalibrated model achieved AUC 0.67 and F1 0.66 (Fig. 1, ROC curve; Fig. 2, precision–recall curve).

Fig. 1.

Fig. 1

ROC curves for uncalibrated and isotonic-calibrated ETC models on the hold-out set

Fig. 2.

Fig. 2

Precision–recall curve for the calibrated ETC on the hold-out set

After isotonic calibration, discrimination declined slightly to AUC 0.64, while reliability worsened (Brier 0.48; slope 0.09; intercept − 1.50), indicating under-confident probability estimates (Fig. 3).

Fig. 3.

Fig. 3

Reliability (calibration) plot comparing predicted versus observed event probabilities

The confusion matrix (Fig. 4) revealed a strong bias toward positive predictions and negligible specificity, consistent with flattened calibrated probabilities.

Fig. 4.

Fig. 4

Confusion matrix for calibrated ETC predictions (threshold = 0.50)

Aggregate performance before and after calibration is summarized in Fig. 5, which illustrates the trade-off between discrimination and probability reliability.

Fig. 5.

Fig. 5

Comparison of AUC and Brier scores before and after calibration

Sensitivity to fasting-blood-sugar (FBS) representation

To test whether model performance depended on the diagnostic feature FBS, two sensitivity experiments were performed.

When FBS was excluded, the calibrated model’s AUC improved to 0.77 (F1 = 0.66); however, replacing FBS with decile-based categories caused AUC to drop to 0.57 (F1 = 0.66) (Fig. 6).

Fig. 6.

Fig. 6

FBS sensitivity analysis comparing models with full, excluded, and decile-binned FBS inputs

These findings indicate that the continuous form of FBS contributes major predictive information but can also inflate discrimination when used jointly with other correlated variables.

Subgroup (fairness) performance

Fairness analysis assessed discrimination by sex and age on calibrated hold-out predictions.

AUCs were 0.70 for males and 0.63 for females.

By age group, AUCs were 0.66 (35–49 y), 0.69 (50–59 y), and 0.66 (60–70 y) (Figs. 7 and 8).

Fig. 7.

Fig. 7

Subgroup performance by sex (AUC values on calibrated hold-out)

Fig. 8.

Fig. 8

Subgroup performance by age group (AUC values on calibrated hold-out)

Although absolute performance remained modest, these results show no systematic bias across demographic subgroups.

Global and local interpretability

Global SHAP analysis

Global SHAP analysis (Fig. 9) identified fasting blood sugar (FBS), HasFattyLiver, Age, HasKidneyStone, and Triglycerides (TG) as the five most influential predictors.

Fig. 9.

Fig. 9

Global SHAP summary plot showing feature importance, direction, and magnitude of influence on predicted T2DM risk

Lifestyle and dietary factors—including Vegetables, TEACOFFEE, BAVARAGE, and Juice consumption—showed secondary, modifiable effects.

Higher FBS, fatty-liver status, and kidney-stone history increased predicted diabetes probability, whereas greater vegetable intake modestly lowered it.

The wide distribution of SHAP values for the same feature indicated heterogeneous effects across individuals.

Morris global sensitivity analysis

To examine stability of feature influence, the Morris method was applied (Fig. 10).

Fig. 10.

Fig. 10

Morris sensitivity indices quantifying the magnitude and stability of feature influence for the calibrated ETC model

FBS had the largest sensitivity (µ* ≈ 0.42), followed by HasFattyLiver (≈ 0.17), HasKidneyStone (≈ 0.16), Age (≈ 0.13), and TG (≈ 0.09).

Local SHAP interpretations

To illustrate case-level reasoning, local SHAP explanations were generated for three representative individuals (Fig. 11).

Fig. 11.

Fig. 11

Local SHAP interpretations for three representative cases showing how specific features (red = positive, blue = negative influence) affect individual predictions relative to the baseline output

In high-risk cases, elevated FBS and fatty-liver status were dominant positive contributors, while Age, TG, and lifestyle variables modulated probabilities in opposite directions.

For lower-risk profiles, protective effects from normal FBS and healthy dietary patterns were evident.

These localized explanations demonstrate how the model combines biochemical and behavioral features to derive patient-specific predictions, enhancing clinical interpretability.

Comparison with the original analysis

The present findings differ markedly from those of the preliminary version, which reported AUC ≈ 0.99 and F1 ≈ 0.97 under internal-only cross-validation.

Those earlier results evaluated the model on the same data used for training and optimization, included the diagnostic variable FBS without calibration, and therefore over-estimated performance.

The current workflow—featuring internal–external validation, explicit class-balance adjustment, and isotonic calibration—provides a realistic estimate of generalizability.

The lower AUC values (≈ 0.64–0.69) reflect true model performance when evaluated properly and highlight the importance of robust validation for reproducible AI in clinical research.

Discussion

The calibrated Extra Trees Classifier (ETC) developed in this study provided a transparent yet moderately accurate framework for early detection of type 2 diabetes mellitus (T2DM) within the PERSIAN Dena Cohort. The model achieved an area under the curve (AUC) of 0.64 on the hold-out dataset after isotonic calibration and demonstrated consistent but modest discrimination across sex and age subgroups. Although these results are lower than those observed in earlier internal-only analyses, they represent a far more realistic estimate of model performance and underscore the importance of rigorous internal–external validation and probability calibration in clinical machine-learning research [6, 21].

Interpretation of model behavior

Tree-based ensemble algorithms such as the ETC are capable of capturing nonlinear and high-order interactions among biochemical, anthropometric, and lifestyle variables [11, 12, 27]. The recalibrated ETC in this work revealed that a relatively small number of predictors drive most of the discriminative power. Fasting-blood-sugar (FBS) unsurprisingly remained the dominant variable, followed by fatty-liver status, age, kidney-stone history, and triglycerides. These findings are consistent with prior population-level studies identifying hepatic and lipid metabolism as central to diabetes risk [13, 20, 25]. Notably, lifestyle and dietary factors—such as vegetable consumption, beverage and tea-coffee intake—showed secondary but interpretable effects, aligning with recent evidence linking diet quality and habitual drink choices to glycemic control [20, 25].

The global SHAP analysis provided directionality and magnitude of feature influence, confirming that higher FBS, hepatic impairment, and dyslipidemia increased predicted risk, while greater vegetable intake modestly reduced it. The Morris global sensitivity analysis reinforced the stability of these findings and quantified the robustness of each feature’s contribution across the input space. Together, these complementary interpretability methods illustrate how an ensemble model can function as an analytic microscope—exposing the hierarchy and interplay of metabolic, behavioral, and demographic determinants of diabetes [13, 14, 19, 20].

Understanding calibration and reduced performance

The decline in discrimination from the earlier version (AUC ≈ 0.99) to the present validated analysis (AUC ≈ 0.64) stems from methodological improvements rather than model degradation. The initial study evaluated performance within the same dataset used for training and relied heavily on FBS, a diagnostic marker. When probability calibration and an independent hold-out set were introduced, the model’s predictive capability was no longer inflated by information leakage. Similar patterns have been reported in other health-prediction studies where rigorous validation reduced overly optimistic internal estimates [6, 26, 27].

Isotonic calibration corrected the model’s probability scaling but exposed under-confidence in mid-range risk estimates—an expected phenomenon in limited or imbalanced datasets. The calibrated ETC thus prioritizes probability reliability over numerical AUC, aligning with the TRIPOD recommendations for risk-prediction modeling [5, 6]. This trade-off underscores that interpretability and calibration, not raw accuracy, determine the clinical utility of ML models [15–18, 28].

Clinical and translational implications

The ETC framework demonstrates how explainable ML can reveal clinically meaningful structure even when overall accuracy is modest. Identifying FBS, fatty-liver disease, and triglycerides as core drivers, alongside modifiable behaviors, mirrors the integrated nature of diabetes pathogenesis and prevention. Local SHAP explanations illustrate how individual patient features combine to shape predicted risk, enabling clinicians to visualize why a specific person is classified as high- or low-risk. Such transparency fosters trust and could facilitate integration of ML outputs into preventive counseling or electronic-health-record decision support.

At a population level, the model’s interpretability may help researchers identify which lifestyle factors exert the greatest marginal effects, guiding community-based interventions or public-health messaging. Furthermore, the fairness analysis showing similar AUCs across sex and age groups indicates that ensemble-based explainable approaches can achieve equitable performance when properly validated—a critical consideration for responsible AI in healthcare [15, 16, 29].

Comparison with related work

Several recent studies have achieved very high accuracy in T2DM prediction using deep learning, hybrid, or multi-stage ensemble models [7, 11, 22, 23, 27]. However, these methods often require complex architectures, extensive feature engineering, or proprietary preprocessing pipelines, limiting reproducibility and clinical interpretability. By contrast, the present work intentionally prioritized transparency over maximal accuracy. The inclusion of SHAP and Morris analyses positions this study within the emerging movement toward interpretable and fair medical AI [13, 15, 19, 28].

While the model’s AUC is lower than that of more complex architectures, its interpretive richness offers compensatory value. As recent reviews emphasize, explainability, reproducibility, and calibration are increasingly regarded as essential benchmarks for translating ML models from research into clinical workflows [15–17, 29].

Implications for future research

Several directions can enhance predictive performance while maintaining transparency. First, integrating additional longitudinal or genetic predictors from other PERSIAN regional cohorts could improve discrimination and generalizability. Second, semi-supervised or federated-learning frameworks may allow multi-center model training without compromising privacy. Third, calibration methods such as Bayesian binning or temperature scaling could be tested alongside isotonic regression to refine probability reliability. Finally, a prospective validation study is needed to evaluate clinical impact and user acceptance in real-world settings.

Recent studies have emphasized the importance of transparent model development, calibration, and reproducible evaluation frameworks for clinical machine learning. Our revised analytical pipeline follows these recommendations, integrating explainability, fairness, and internal–external validation within a cohesive workflow [30–32].

Summary

In summary, this study demonstrates that a calibrated, explainable Extra Trees Classifier can provide clinically interpretable insights into diabetes risk even when its numerical accuracy is moderate. The results emphasize that model transparency, calibration, and fairness are indispensable for trustworthy AI in healthcare. Rather than pursuing maximal predictive metrics, this approach champions reproducibility and scientific integrity—foundations essential for translating machine learning into practical, ethical, and equitable diabetes-prevention strategies.

Limitations

This study has several limitations that should be acknowledged.

First, the analysis was conducted using data from a single regional cohort (PERSIAN Dena), which limits external generalizability. Although an internal–external validation design was implemented, future work should incorporate additional PERSIAN subcohorts or independent national datasets to confirm robustness across populations with different socioeconomic and genetic backgrounds.

Second, while the internal–external split and isotonic calibration provided a realistic estimate of generalizability, no prospective external validation was available. As emphasized in predictive-modeling literature [6, 21], the absence of a fully independent evaluation may constrain conclusions about long-term clinical utility.

Third, the present analysis deliberately focused on a single, interpretable algorithm (Extra Trees Classifier) to ensure methodological transparency and reproducibility. Exploring alternative calibrated ensemble models—such as Random Forests, Gradient Boosting, or LightGBM—using identical validation frameworks may yield incremental improvements while retaining explainability.

Fourth, the dataset exhibited a moderate class imbalance (12.55% T2DM prevalence) and a relatively limited number of highly discriminative biochemical markers. Although SMOTE oversampling was applied, some degree of information dilution or oversmoothing may have affected calibration stability.
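The study applied SMOTE via the imbalanced-learn package; to illustrate the mechanism that can produce the oversmoothing noted above, the following is a minimal from-scratch sketch of SMOTE's core interpolation step (synthetic points are drawn on the line segments between minority samples and their nearest minority neighbours). All data and sizes are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_minority(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating each sampled
    point toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)               # idx[:, 0] is the point itself
    base = rng.integers(0, len(X_min), n_synthetic)
    nbr = idx[base, rng.integers(1, k + 1, n_synthetic)]
    gap = rng.random((n_synthetic, 1))          # interpolation weight in (0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Toy minority class: 125 samples with 15 features (~12.5% prevalence,
# roughly matching the cohort); upsample toward parity
rng = np.random.default_rng(0)
X_min = rng.normal(loc=2.0, size=(125, 15))
X_syn = smote_minority(X_min, n_synthetic=750)
```

Because every synthetic point lies strictly between two real minority points, SMOTE densifies the minority manifold rather than adding new information — the "information dilution" concern raised above.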

Fifth, fasting blood sugar (FBS)—a diagnostic biomarker—played a dominant role in model behavior. While sensitivity experiments excluding and categorizing FBS helped quantify its influence, these adjustments highlight the ongoing challenge of building models that predict risk rather than re-identify known cases. Future iterations should aim to include longitudinal and behavioral risk trajectories to capture true pre-diagnostic signals.
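The FBS sensitivity experiment is an instance of a generic feature-ablation pattern: retrain the same model with and without the dominant feature and compare hold-out AUCs. A minimal sketch on synthetic data follows; feature index 0 is an illustrative stand-in for FBS, and the numbers have no relation to the study's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=15,
                           n_informative=6, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=3)

def auc_with_features(cols):
    """Train on the given feature columns and return hold-out AUC."""
    model = ExtraTreesClassifier(n_estimators=200, random_state=3)
    model.fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te[:, cols])[:, 1])

all_cols = list(range(X.shape[1]))
auc_full = auc_with_features(all_cols)
auc_ablate = auc_with_features([c for c in all_cols if c != 0])  # drop FBS stand-in
```

A large drop after ablation signals that the model leans on the diagnostic marker; a stable AUC suggests the remaining predictors carry genuine pre-diagnostic signal.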

Finally, cross-sectional data restrict causal inference. The model identifies associations and predictive patterns but cannot determine temporal or mechanistic relationships between predictors and disease onset. Integrating prospective follow-up data and continuous physiological monitoring could strengthen causal interpretability and clinical translation.

Despite these limitations, this study contributes meaningfully to the methodological literature by illustrating the importance of validation, calibration, and explainability for achieving credible and trustworthy machine-learning predictions in medicine.

Conclusions

This work developed, validated, and interpreted an explainable Extra Trees Classifier for early detection of type 2 diabetes mellitus using data from the PERSIAN Dena Cohort.

Through rigorous internal–external validation and isotonic calibration, the study produced an honest and realistic estimate of model performance (AUC 0.64 on the hold-out set) while demonstrating how ensemble interpretability methods—SHapley Additive exPlanations and Morris sensitivity analysis—can illuminate the relative contributions of biomedical and lifestyle factors.

The results showed that fasting blood sugar, fatty-liver status, kidney-stone history, age, and triglycerides were the most influential predictors, supported by modifiable dietary indicators such as vegetable and beverage intake. Although absolute discrimination was modest, the model achieved transparent, reproducible, and fair performance across demographic subgroups, setting a methodological benchmark for interpretable cohort-based prediction.

More broadly, this research emphasizes that scientific validity in clinical AI depends as much on calibration and interpretability as on accuracy. Transparent reporting of realistic performance metrics helps prevent overfitting and promotes reproducibility—core principles of responsible machine learning.

Future research should extend this calibrated explainable-AI framework across multiple PERSIAN regions, incorporate additional behavioral and genetic predictors, and evaluate its integration into clinical and public-health workflows. By combining interpretability, fairness, and validation, explainable ensemble models like the ETC can evolve from experimental tools into practical decision-support systems that advance equitable diabetes prevention and precision public health.

Acknowledgements

The authors express their gratitude to the Dena Cohort Research Center for providing access to data and continuous collaboration. We also thank the PERSIAN Cohort Study and its steering committee for supporting national-level non-communicable-disease research. The authors appreciate the constructive comments of the peer reviewers, which substantially improved the clarity and quality of this work.

Abbreviations

AI: Artificial Intelligence

ALP: Alkaline Phosphatase

ALT: Alanine Aminotransferase

AST: Aspartate Aminotransferase

AUC: Area Under the Receiver Operating Characteristic Curve

BMI: Body Mass Index

CHOL: Total Cholesterol

CI: Confidence Interval

CV: Cross-Validation

DBP: Diastolic Blood Pressure

ETC: Extra Trees Classifier

F1: F1-Score (harmonic mean of precision and recall)

FBS: Fasting Blood Sugar

GGT: Gamma-Glutamyl Transferase

HbA1c: Hemoglobin A1c

HDL-C: High-Density Lipoprotein Cholesterol

IRB: Institutional Review Board

LDL-C: Low-Density Lipoprotein Cholesterol

LIME: Local Interpretable Model-Agnostic Explanations

LR: Logistic Regression

MCC: Matthews Correlation Coefficient

ML: Machine Learning

NHANES: National Health and Nutrition Examination Survey

PDP: Partial Dependence Plot

RFECV: Recursive Feature Elimination with Cross-Validation

ROC: Receiver Operating Characteristic

SHAP: SHapley Additive exPlanations

SMOTE: Synthetic Minority Over-Sampling Technique

SBP: Systolic Blood Pressure

T2DM: Type 2 Diabetes Mellitus

TG: Triglycerides

TRIPOD: Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis

WHO: World Health Organization

XAI: Explainable Artificial Intelligence

Author contributions

Zahra Rafie: Visualization, Investigation, Resources, Writing – Original Draft. Mustafa Ghaderzadeh: Conceptualization, Visualization, Validation, Methodology, Software, Formal Analysis, Investigation, Writing – Review & Editing. Cirruse Salehnasab: Conceptualization, Project Administration, Supervision, Funding Acquisition, Data Curation, Formal Analysis, Validation, Writing – Review & Editing. All authors reviewed and approved the final manuscript prior to submission.

Funding

No external or institutional funding was received for this research. All analyses were conducted as part of the authors’ academic and institutional responsibilities.

Data availability

The datasets analyzed in this study are part of the PERSIAN Dena Cohort, maintained by the Yasuj University of Medical Sciences, Yasuj, Iran. Because the dataset contains sensitive participant information, raw data are not publicly available. However, qualified researchers may request access to the anonymized dataset by contacting: Dr. Cirruse Salehnasab, Yasuj University of Medical Sciences, Yasuj, Iran. Email: cirruse.salehnasab@gmail.com. All data-access requests will be reviewed by the Dena Cohort Data-Access Committee and must comply with institutional and national ethical regulations. The Python code used for data preprocessing, model development, and explainability analysis is openly available at: https://github.com/salehnasab/Explainable-Extra-Trees-Machine-Learning-Model-for-Early-Detection-of-Type-2-Diabetes/blob/main/DMPrediction.ipynb.

Declarations

Ethics approval and consent to participate

This study was conducted in accordance with the ethical principles of the Declaration of Helsinki. Ethical approval was obtained from the Institutional Review Board (IRB) of Yasuj University of Medical Sciences, Yasuj, Iran (approval code: IR.YUMS.REC.1402.152). The analysis was retrospective and based on secondary use of anonymized data from the PERSIAN Dena Cohort, which enrolled adults aged 35–70 years with complete demographic, anthropometric, clinical, and lifestyle information related to type 2 diabetes mellitus. All participants in the original PERSIAN Cohort provided written informed consent for use of their data in future health research. No direct participant contact occurred in this study, and all records were de-identified prior to analysis to ensure confidentiality and compliance with international ethical standards.

Consent for publication

Not applicable. The study used anonymized secondary data with no identifiable personal information.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Sun H, Saeedi P, Karuranga S, et al. IDF Diabetes Atlas: global, regional and country-level diabetes prevalence estimates for 2021 and projections for 2045. Diabetes Res Clin Pract. 2022;183:109119. 10.1016/j.diabres.2021.109119.
2. Bommer C, Heesemann E, Sagalova V, et al. The global economic burden of diabetes in adults aged 20–79 years: a cost-of-illness study. Lancet Diabetes Endocrinol. 2017;5(6):423–30. 10.1016/S2213-8587(17)30097-9.
3. Harding JL, Pavkov ME, Magliano DJ, et al. Global trends in diabetes complications: a review of current evidence. Diabetologia. 2019;62(1):3–16. 10.1007/s00125-018-4711-2.
4. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer; 2009.
5. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. 10.1136/bmj.g7594.
6. Steyerberg EW, Harrell FE. Prediction models need appropriate internal, internal–external, and external validation. J Clin Epidemiol. 2016;69:245–7. 10.1016/j.jclinepi.2015.04.005.
7. Rustam F, Mehmood A, Ahmad M, et al. A hybrid approach for diabetes prediction using CNN and ensemble learning. Health Inf Sci Syst. 2024;12(1):10. 10.1007/s13755-024-00251-3.
8. Islam M, Ferdousi R, Rahman MM, et al. An explainable machine learning framework for early diabetes detection using ensemble classifiers. Front Public Health. 2025;13:1392103. 10.3389/fpubh.2025.1392103.
9. Breiman L. Random forests. Mach Learn. 2001;45:5–32. 10.1023/A:1010933404324.
10. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. 10.1145/2939672.2939785.
11. Matboli H, El-Morsy S, Saleh M, et al. A multi-stage machine learning model for accurate diabetes classification. Diagnostics. 2025;15(3):367. 10.3390/diagnostics15030367.
12. Hasan M, Mahmud T, Rana M, et al. A hybrid feature selection and machine learning approach for diabetes prediction. Comput Biol Med. 2025;174:108249. 10.1016/j.compbiomed.2025.108249.
13. Pang Z, Liu J, Xu J, et al. Interpretable machine learning for population-based diabetes risk prediction: a Shapley value approach. Sci Rep. 2025;15:4532. 10.1038/s41598-025-41532-7.
14. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30:4765–74.
15. Esteva A, Robicquet A, Ramsundar B, et al. A guide to deep learning in healthcare. Nat Med. 2019;25:24–9. 10.1038/s41591-018-0316-z.
16. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. 2019;380:1347–58. 10.1056/NEJMra1814259.
17. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319(13):1317–8. 10.1001/jama.2017.18391.
18. Obermeyer Z, Emanuel EJ. Predicting the future — big data, machine learning, and clinical medicine. N Engl J Med. 2016;375:1216–9. 10.1056/NEJMp1606181.
19. Ribeiro MT, Singh S, Guestrin C. "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. 10.1145/2939672.2939778.
20. Lundberg SM, Erion G, Chen H, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2:56–67. 10.1038/s42256-019-0138-9.
21. Kuhn M, Johnson K. Applied Predictive Modeling. Springer; 2013. 10.1007/978-1-4614-6849-3.
22. Ribeiro A, Antunes C, Silva D, et al. Feature selection in high-dimensional healthcare datasets: a review. Brief Bioinform. 2021;22(6):bbab263. 10.1093/bib/bbab263.
23. Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: a survey of recent advances in deep learning techniques for electronic health record analysis. IEEE J Biomed Health Inform. 2018;22(5):1589–604. 10.1109/JBHI.2017.2767063.
24. Sadeq H, Ahmed M, Mohammed B. Performance analysis of machine learning techniques for diabetes prediction. Int J Adv Comput Sci Appl. 2021;12(5):559–65. 10.14569/IJACSA.2021.0120570.
25. Cichosz SL, Johansen MD, Hejlesen OK. Toward big data analytics: review of predictive models in managing diabetes. Healthc Inform Res. 2024;30(2):83–92. 10.4258/hir.2024.30.2.83.
26. Talebi Moghaddam S, Shariatpanahi S, Hosseini R, et al. Machine learning models for type 2 diabetes prediction with class imbalance treatment. BMC Med Inform Decis Mak. 2024;24:112. 10.1186/s12911-024-02311-8.
27. Srinivasu PN, Bhoi AK, Bian G, et al. A hybrid AI framework with explainability for healthcare prediction. Comput Methods Programs Biomed. 2024;241:107642. 10.1016/j.cmpb.2024.107642.
28. Doshi-Velez F, Kim B. Towards a rigorous science of interpretable machine learning. arXiv preprint. 2017. arXiv:1702.08608.
29. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44–56. 10.1038/s41591-018-0300-7.
30. Vu T, Kokubo Y, Inoue M, Yamamoto M, Mohsen A, Martin-Morales A, Dawadi R, Inoue T, Tay JT, Yoshizaki M, Watanabe N, Kuriya Y, Matsumoto C, Arafa A, Nakao YM, Kato Y, Teramoto M, Araki M. Machine learning model for predicting coronary heart disease risk: development and validation using insights from a Japanese population-based study. JMIR Cardio. 2025;9:e68066. 10.2196/68066.
31. Thanh NT, Luan VT, Viet DC, Tung TH, Thien V. A machine learning-based risk score for prediction of mechanical ventilation in children with dengue shock syndrome: a retrospective cohort study. PLoS ONE. 2024;19(12):e0315281. 10.1371/journal.pone.0315281.
32. Sinha M, Haaland P, Krishnamurthy A, Lan B, Ramsey SA, Schmitt PL, Sharma P, Xu H, Fecho K. Causal analysis for multivariate integrated clinical and environmental exposures data. BMC Med Inform Decis Mak. 2025;25(1):27. 10.1186/s12911-025-02903-1.

