BioData Mining. 2025 Aug 21;18:57. doi: 10.1186/s13040-025-00477-2

The application of artificial intelligence models in predicting the risk of diabetic foot: a multicenter study

Yao Li 1,2,3, Siyuan Zhou 2,3,5, Bichen Ren 2,3,5, Shuai Ju 1,2,3, Xiaoyan Li 1,2,3, Wenqiang Li 1,2,3, Bingzhe Li 3,4, Yunmin Cai 1,2,3, Chunlei Chang 1,2,3, Lihong Huang 3,4, Zhihui Dong 1,2,3,5
PMCID: PMC12372307  PMID: 40841667

Abstract

This study explores diabetic foot (DF), a severe complication in diabetes, by combining deep learning (DL) and machine learning (ML) to develop a multi-model prediction tool. Early identification of high-risk DF patients can reduce disability and mortality. The research also aims to create an integrated application to assist clinicians in precise, efficient risk assessment for early intervention. In this multicenter retrospective study, 6,180 elderly diabetic patients (aged 60–85) were enrolled from 11 community hospitals in Shanghai in 2024. Lasso regression was used to identify 16 key DF risk factors, including age, MMSE score, lower limb discomfort, ABI, and hematocrit. Fourteen ML models (RF, XGBoost, CART, MLP, etc.) and three DL models (DNN, CNN, Transformer) were trained, with hyperparameters optimized via cross-validation and grid search. An application was developed integrating these models, offering both single and batch prediction options with visualization tools for clinical use.

Experimental results showed that the logistic regression stacking ensemble achieved robust performance, with AUC values of 0.943 (validation set, 95% CI: 0.935–0.951) and 0.938 (test set, 95% CI: 0.929–0.947), along with high accuracy, precision, recall, and F1 scores. SHAP analysis revealed key predictive features including ABI results, lower limb discomfort, and MMSE score. The developed app integrates multiple models, compares their predictions across clinical scenarios, and enhances prediction transparency and reliability. The multi-model approach demonstrates strong predictive performance for DF risk, offering clinicians an intuitive and accurate assessment tool tailored to individual patients. By combining multiple models, we improve result stability and clinical applicability relative to single-model approaches. Future work will focus on algorithm optimization, expanded datasets, and real-time monitoring integration to enable more precise, dynamic risk evaluation for improved DF prevention and early intervention.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13040-025-00477-2.

Keywords: Artificial intelligence, Machine learning, Prediction model

Introduction

Diabetic foot (DF) is a prevalent and serious complication of diabetes mellitus [1], characterized by ulcers, infections, and potential tissue necrosis [2], often leading to amputation. The development of DF stems from chronic hyperglycemia-induced neuropathy, peripheral vascular disease, and immune dysfunction [3]. According to WHO, approximately 15–20% of diabetic patients experience DF annually, with up to 15% requiring limb amputation [4, 5]. These complications significantly impair patients’ quality of life and impose substantial healthcare costs.

With the global rise in diabetes prevalence, the incidence of DF is increasing [6], underscoring the urgent need for improved screening and early intervention strategies. Traditional screening tools such as the Ankle-Brachial Index (ABI), foot pulse detection, loss of protective sensation (LOPS) testing, and visual foot assessment are widely used [7, 8]. However, these approaches are often subjective, time-consuming, and inadequate for identifying high-risk but asymptomatic patients due to their limited capacity to capture complex interactions among clinical variables.

In recent years, artificial intelligence (AI) has emerged as a transformative tool in medical prediction and diagnostics [9, 10]. Machine learning (ML) algorithms such as Random Forest (RF), Support Vector Machine (SVM) [11], and Gradient Boosting Decision Tree (GBDT) [12] have shown promising results in risk stratification by analyzing multifactorial patient data. Nonetheless, their reliance on manual feature engineering may limit performance with high-dimensional data.

Deep learning (DL) methods, including Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Transformer architectures, can automatically extract complex feature representations and outperform traditional ML in certain tasks [13, 14]. However, DL models are often criticized for their “black box” nature, making them difficult for clinicians to interpret and trust in decision-making.

To address these challenges, explainable AI (XAI) methods such as SHAP (Shapley Additive Explanations) have been proposed to enhance model transparency and clinical utility [15, 16]. Additionally, ensemble learning, particularly model stacking, has shown superior performance in medical risk prediction by integrating the strengths of diverse models.

In this study, we develop a comprehensive multi-model AI system for DF risk prediction that combines ML and DL algorithms with SHAP-based interpretability. We also present a clinical application that allows for single and batch predictions with user-friendly visualization tools. Our contributions include (1) a novel ensemble approach integrating both ML and DL models, (2) the incorporation of explainable AI for clinical interpretation, and (3) the development of a practical app for clinician use. This work aims to improve early DF identification and provide a robust decision-support tool in clinical settings.

Research design and methods

Study participants and data collection

This multicenter retrospective study recruited 6,180 elderly diabetic patients aged 60 to 85 years from 11 community hospitals in Shanghai between January and December 2024. Inclusion criteria were: (1) confirmed diagnosis of type 2 diabetes mellitus according to ADA criteria; (2) age ≥ 60 years; (3) completed baseline clinical, biochemical, and vascular assessments. Exclusion criteria included: (1) active foot ulcer or recent foot amputation; (2) cognitive impairment impeding informed consent; (3) incomplete key data entries. The final cohort included 1,235 (20.0%) high-risk DF cases.

Demographic, clinical (e.g., ABI, dorsalis pedis pulse, MMSE), and laboratory (e.g., ALP, hematocrit) data were collected using standardized protocols and EDC systems. Data underwent rigorous quality control and de-identification prior to analysis.

Data preprocessing and balancing

To address class imbalance (20% high-risk DF), we employed the Synthetic Minority Oversampling Technique (SMOTE) on the training dataset. Prior to training, the data were randomly split into 70% training, 15% validation, and 15% testing sets, ensuring stratification by outcome label. Missing values were imputed using median imputation. Categorical variables were one-hot encoded; continuous variables were standardized.
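The balancing step above can be sketched in plain NumPy. This is a minimal, hypothetical illustration of SMOTE-style oversampling on toy data, not the authors' pipeline (which presumably used a standard implementation such as imbalanced-learn); the data, sample counts, and the `smote_oversample` helper are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_oversample(X_min, n_new, k=5):
    """Generate synthetic minority samples by interpolating between
    each anchor sample and one of its k nearest minority neighbours."""
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours
    idx = rng.integers(0, len(X_min), n_new)     # random anchor samples
    nb = nn[idx, rng.integers(0, k, n_new)]      # one random neighbour each
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[idx] + gap * (X_min[nb] - X_min[idx])

# toy imbalanced data: 80 majority / 20 minority samples, 5 features
X_maj = rng.normal(0, 1, (80, 5))
X_min = rng.normal(2, 1, (20, 5))
X_syn = smote_oversample(X_min, n_new=60)        # balance 20 -> 80
X_bal = np.vstack([X_maj, X_min, X_syn])
y_bal = np.r_[np.zeros(80), np.ones(80)]
print(X_bal.shape, y_bal.mean())                 # (160, 5) 0.5
```

Synthetic points lie on line segments between real minority samples, so oversampling is applied only to the training split to avoid leaking synthetic information into validation or test sets.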

Clinical features and baseline assessment

The study examined DF-related characteristics encompassing demographic factors (age, gender, BMI), laboratory values (hematocrit, ALP, HbA1c), clinical symptoms (stroke history, lower limb discomfort), vascular assessments (ABI for peripheral vascular evaluation, dorsalis pedis pulse), and sociodemographic variables (foot hygiene frequency, residential region). Standardized ABI and LOPS tests were employed to evaluate vascular and neurological risks, respectively, and all data were collected under rigorous protocols to ensure accuracy.

Feature selection and Lasso regression analysis

Lasso regression with L1 regularization was applied to the training set to select significant predictive variables. The optimal penalty coefficient λ was determined via 10-fold cross-validation minimizing mean squared error. Sixteen features were retained based on non-zero coefficients and clinical relevance, including MMSE, LABI, ABI, hematocrit, ALP, lower limb discomfort, LOPS, and depression score.
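The selection procedure can be sketched with scikit-learn's `LassoCV`, which performs the cross-validated choice of λ described above. This is an illustrative sketch on synthetic data, not the study's actual code; the feature count, true coefficients, and noise level are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# synthetic stand-in: 500 patients, 30 candidate features,
# of which only a handful truly drive the outcome
X = rng.normal(size=(500, 30))
beta = np.zeros(30)
beta[[0, 3, 7, 12]] = [0.8, -0.6, 0.5, -0.4]
y = X @ beta + rng.normal(0, 0.5, 500)

Xs = StandardScaler().fit_transform(X)           # L1 penalty needs scaled inputs
# 10-fold CV picks the penalty that minimises mean squared error
lasso = LassoCV(cv=10, random_state=0).fit(Xs, y)
selected = np.flatnonzero(lasso.coef_)           # features with non-zero coefficients
print(len(selected), round(lasso.alpha_, 4))
```

As in the study, the sign of each surviving coefficient indicates the direction of association, and features shrunk exactly to zero are dropped.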

Development and comparison of machine learning models

We developed 14 ML models (e.g., RF, XGBoost, CART, MLP) and 3 DL models (CNN, DNN, Transformer). Grid search combined with 5-fold cross-validation was used for hyperparameter optimization. Model parameters (e.g., number of estimators, learning rate, hidden layer size) are detailed in Supplementary Table S1.

To ensure comparability, all deep learning models were trained with a batch size of 32 for a maximum of 10 epochs. Early stopping was applied if the validation loss did not improve for 10 consecutive epochs, and a learning-rate reduction strategy halved the rate after 5 epochs without improvement (minimum learning rate = 1e-5).

The DNN architecture consisted of three fully connected layers with 256, 128, and 64 neurons, respectively, using ReLU activation. L2 regularization (weight decay = 1e-4), dropout (0.3, 0.3, and 0.2 for the three layers), and batch normalization were applied after each layer. The optimizer was Adam (learning rate = 0.001), and the loss function was binary cross-entropy.
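The layer structure just described can be sketched as an inference-time forward pass in NumPy. This is a hedged illustration only: weights are random, the input width of 16 matches the selected feature count but is otherwise an assumption, and training-time components (dropout, batch normalization, the Adam optimizer) are deliberately omitted since they do not act at inference.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def init(n_in, n_out):
    # He initialisation, appropriate for ReLU layers
    return rng.normal(0, np.sqrt(2 / n_in), (n_in, n_out)), np.zeros(n_out)

sizes = [16, 256, 128, 64]                 # 16 inputs -> 256 -> 128 -> 64
params = [init(a, b) for a, b in zip(sizes, sizes[1:])]
W_out, b_out = init(64, 1)                 # single sigmoid output unit

def forward(x):
    h = x
    for W, b in params:                    # dense + ReLU stack; dropout and
        h = relu(h @ W + b)                # batch-norm act only in training
    return sigmoid(h @ W_out + b_out)      # binary DF-risk probability

p = forward(rng.normal(size=(4, 16)))      # a batch of 4 hypothetical patients
print(p.shape)                             # (4, 1)
```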

The CNN architecture included two convolutional blocks with 64 and 128 filters, kernel size 3, and 'same' padding. Each block was followed by batch normalization, max pooling (pool size = 2), and dropout (0.2). A fully connected layer with 64 neurons was used, with L2 regularization (1e-4) and dropout (0.3). The optimizer and loss function were the same as in the DNN.

The Transformer model employed two encoder blocks using multi-head self-attention with 4 heads (each of dimension 32). The feedforward layer had a hidden size of 128, and each layer incorporated residual connections and layer normalization. A global average pooling layer and a 64-neuron fully connected layer produced the final output. Dropout was set to 0.2. The optimizer was Adam (learning rate = 0.0005), with binary cross-entropy as the loss.

For ensemble learning, we constructed three stacking models using logistic regression, decision tree, and KNN as meta-learners. Base learners included RF, CART, XGBoost, and MLP. Stacking training followed a 5-fold cross-validation framework on the training set, ensuring out-of-fold predictions were used to train the meta-model. Base model weights were implicitly learned by the logistic regression layer.
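The stacking scheme above maps directly onto scikit-learn's `StackingClassifier`, whose `cv` argument produces exactly the out-of-fold base predictions used to fit the meta-learner. The sketch below is illustrative, not the study's code: XGBoost is swapped for a plain decision tree to keep the example self-contained, and the synthetic data stands in for the clinical cohort.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for the 16-feature clinical dataset
X, y = make_classification(n_samples=600, n_features=16, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# base learners (XGBoost replaced by CART here to avoid extra dependencies)
base = [("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("cart", DecisionTreeClassifier(random_state=0)),
        ("mlp", MLPClassifier(max_iter=500, random_state=0))]

# cv=5 -> out-of-fold base predictions train the logistic-regression meta-learner,
# whose coefficients implicitly weight the base models
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(),
                           cv=5).fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))
```

Using out-of-fold predictions prevents the meta-learner from fitting to base-model overconfidence on their own training data.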

Performance was evaluated using Accuracy, Precision, Recall, F1 score, AUC, and 95% confidence intervals on validation and test sets. Confusion matrices were generated to assess classification performance.
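The evaluation metrics and confidence intervals can be computed as sketched below. The predictions are simulated for illustration; the paper does not state how its 95% CIs were obtained, so the bootstrap shown here is one common choice, flagged as an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score, confusion_matrix)

rng = np.random.default_rng(3)
# simulated labels and predicted probabilities for 200 patients
y_true = rng.integers(0, 2, 200)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 200), 0, 1)
y_pred = (y_prob >= 0.5).astype(int)

print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

# 95% CI for AUC via bootstrap resampling (assumed method, not stated in the paper)
aucs = []
for _ in range(1000):
    i = rng.integers(0, len(y_true), len(y_true))
    if len(np.unique(y_true[i])) == 2:           # resample must contain both classes
        aucs.append(roc_auc_score(y_true[i], y_prob[i]))
lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC = {roc_auc_score(y_true, y_prob):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```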

Application development and feature design

This study developed a user-friendly clinical application integrating multiple ML algorithms (XGBoost, RF, logistic regression) to predict DF risk from nine key indicators, including age, MMSE score, LABI/RABI ratios, hematocrit, lower limb discomfort, ABI values, and dorsalis pedis pulse. The app offers individual risk predictions with visualized probability distributions, as well as batch processing via Excel/CSV uploads for automated multi-patient analysis and downloadable reports.

Model interpretability analysis

In this study, Shapley Additive Explanations (SHAP) analysis was used to enhance model interpretability. The contribution of each feature to DF risk prediction was quantified through global visualizations (summary and bar plots) and local explanations of individual cases, so that clinicians can understand how key factors such as MMSE score, lower limb discomfort, and ABI value affect a given prediction, improving the transparency and clinical practicability of the model.

Statistical analysis

Statistical analyses were performed using SPSS 26.0 and Python 3.8. Independent t-tests were used for continuous variables and chi-square tests for categorical variables (two-tailed, p < 0.05 significance threshold). Model performance was evaluated through cross-validation and independent test sets to ensure the reliability and generalizability of results.
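The two group-comparison tests can be reproduced with `scipy.stats`. The sketch below uses simulated values loosely patterned on the baseline tables (group means for age, a made-up 2x2 ABI contingency table); the numbers are illustrative, not the study's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# continuous variable (e.g. age) in the two risk groups, simulated
age_low = rng.normal(69.6, 6.3, 300)
age_high = rng.normal(72.4, 6.3, 300)
t, p_t = stats.ttest_ind(age_low, age_high)      # independent two-sample t-test

# categorical variable: hypothetical 2x2 table (normal / abnormal ABI by group)
table = np.array([[420, 45],                     # low-risk group
                  [210, 75]])                    # high-risk group
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
print(p_t < 0.05, p_chi < 0.05)
```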

Results

Baseline characteristics

Figure 1 shows the research workflow from data collection to model evaluation. Our analysis of 6,180 diabetic patients (20.0% high risk) showed significant differences between risk groups (p < 0.001) across numerous clinical parameters (Tables 1 and 2, at the end of the manuscript). The analysis identified age as a significant DF risk predictor, consistent with prior research. High-risk patients exhibited elevated hematocrit (suggesting microvascular damage), along with increased ALP levels and depression scores, indicating concurrent metabolic and psychological comorbidities. The risk group showed significantly lower LABI and RABI (indicating impaired lower-extremity perfusion) and MMSE scores (suggesting cognitive decline), along with higher rates of abnormal ABI, dorsalis pedis pulse, and LOPS results. These are established indicators of DF risk.

Fig. 1.

Fig. 1

Overview of the Research Process

Table 1.

Descriptive statistics and T-test analysis of baseline characteristics

Feature Mean SD Non-risk Mean Risk Mean t-Statistic p-Value
LABI 1.063 0.147 1.081 0.991 12.797 < 0.001
Age 70.158 6.279 69.586 72.445 -14.102 < 0.001
Intelligence Score(MMSE) 24.744 4.564 25.030 23.601 8.954 < 0.001
RABI 1.053 0.138 1.068 0.995 11.943 < 0.001
Hematocrit 12.024 15.987 11.232 15.194 -7.321 < 0.001
ALP 86.174 12.661 85.529 88.758 -7.524 < 0.001
Depression Score 5.496 4.141 5.307 6.251 -6.435 < 0.001

Table 2.

Distribution of clinical feature categories and Chi-square test results

Feature Category Proportion (%) χ² value p-value
ABI Result 0 88.58 343.802 < 0.001
1 11.42
Dorsalis Pedis Pulse 1 93.40 325.977 < 0.001
2 2.82
3 1.47
4 2.04
5 0.28
LOPS 0 82.39 30.458 < 0.001
1 17.61
Stroke 0 81.57 51.989 < 0.001
1 18.43
Lower Limb Discomfort 0 85.50 50.233 < 0.001
1 14.50
Pre-retirement Occupation 0 26.34 45.265 < 0.001
1 73.66
Types of Hypoglycemic Drugs 1 80.81 37.729 < 0.001
2 8.74
3 10.45
Current Residential Region 1 62.15 21.638 < 0.001
2 37.85
Foot Washing Frequency 1 0.36 41.214 < 0.001
2 5.08
3 94.09
4 0.47

Notes: ABI Result: 0: Normal; 1: Abnormal, indicating possible peripheral arterial disease. Dorsalis Pedis Pulse: 1: Normal pulse; 2–5: Gradual weakening of pulse from mildly to undetectable. LOPS: 0: No loss of sensation; 1: Loss of protective sensation. Stroke: 0: No history of stroke; 1: History of stroke. Lower Limb Discomfort: 0: No discomfort; 1: Discomfort present. Pre-retirement Occupation: 0: Low-risk occupation; 1: High-risk occupation. Types of Hypoglycemic Drugs: 1: Insulin; 2: Oral hypoglycemics; 3: Other drugs. Current Residential Region: 1: Urban; 2: Rural. Foot Washing Frequency: 1: Never; 2: Occasionally; 3: Frequently; 4: Daily

These findings correlate with DF-related vascular and neurological damage. The risk group also showed higher rates of stroke history, lower limb discomfort, and physically demanding occupations, suggesting long-term health impacts. Moreover, increased insulin use in this group likely reflects more advanced diabetes progression. With respect to foot-washing frequency, the risk group provided significantly less foot care, with fewer patients washing their feet frequently, leaving them more prone to DF. These findings indicate that DF risk is related to a variety of clinical characteristics, providing key information for subsequent risk prediction models. Supplementary Figure S1 shows the differences in the distribution of each feature between the risk and no-risk groups, indicating highly significant differences in clinical indicators between the two groups.

Feature selection and Lasso regression analysis

This study utilized Lasso regression (least absolute shrinkage and selection operator) for feature selection to enhance model accuracy and stability. The method's L1 regularization identified 16 clinically significant DF risk predictors, including vascular indices (LABI, RABI, ABI), metabolic markers (ALP, hematocrit), neurological assessments (MMSE, LOPS), clinical history (stroke, lower limb discomfort), and behavioral factors (foot hygiene frequency). Complete details are presented in Table 3 (at the end of the manuscript). Lasso regression quantifies the contribution of each feature by regularizing its coefficient, and the sign of the coefficient indicates the direction of the relationship between the feature and DF risk: a negative coefficient means that larger feature values correspond to lower DF risk, while a positive coefficient means that larger values correspond to higher risk. The coefficients of LABI, RABI, and MMSE are negative, so higher values of these features are associated with lower DF risk, whereas ABI result, dorsalis pedis pulse, and depression score carry positive coefficients and are associated with higher risk. Lasso regression also retained other key clinical features with positive coefficients, such as age, hematocrit, and ALP, for which higher values correspond to higher DF risk; these features play a crucial role in DF risk assessment.

Table 3.

Coefficients and absolute coefficients of Lasso regression feature selection

Type characteristic coefficient absolute coefficient
Numerical LABI -0.038 0.038
Categorical ABI Result 0.031 0.031
Numerical Age 0.029 0.029
Categorical Dorsalis Pedis Pulse 0.027 0.027
Numerical Intelligence Score(MMSE) -0.016 0.016
Numerical RABI -0.013 0.013
Numerical Hematocrit 0.011 0.011
Categorical LOPS 0.010 0.010
Categorical Stroke 0.006 0.006
Numerical ALP 0.006 0.006
Categorical Lower Limb Discomfort 0.005 0.005
Categorical Pre-retirement Occupation -0.004 0.004
Categorical Types of Hypoglycemic Drugs 0.002 0.002
Categorical Current Residential Region -0.002 0.002
Numerical Depression Score 0.001 0.001
Categorical Foot Washing Frequency 0.000 0.000

Heatmap analysis of the random forest model

To evaluate the performance of the RF model in predicting DF risk, we created a heat map of its predictions across patients. The heat map shows the contributions of clinical features such as LABI, age, and MMSE to the prediction results: each row represents a feature, each column represents a sample, and the color intensity reflects the magnitude of the feature value. The color-gradient legend indicates the value ranges of the variables, which include hematocrit, ALP, depression score, ABI result, and dorsalis pedis pulse (Fig. 2). The RF model performed consistently on the training and test sets and effectively distinguished at-risk from no-risk patients, indicating strong predictive accuracy and good generalization. Based on the feature values of different samples, the model successfully captured the differences between high-risk and low-risk patients. It relies heavily on variables such as LABI, age, and MMSE, which show pronounced differences in the heat map and play a crucial role in risk prediction. The heat map analysis also highlights how the model integrates multiple clinical characteristics, making predictions based on the interactions among variables.

Fig. 2.

Fig. 2

Heatmap Analysis of the Random Forest (RF) Model

The heatmap illustrates the performance of the RF model in predicting DF risk. Each row represents a clinical variable, and each column corresponds to a patient sample. The intensity of the color indicates the magnitude of the variable values. The color gradient legend represents the range of variable values, with categorical variables shown in different colors.

Model development and performance evaluation

After balancing the dataset of diabetic patients, we developed and evaluated 14 ML models and 3 DL models, including RF, XGBoost, CART, MLP, CNN, DNN, Transformer, and stacking ensemble models. We first trained four base models, namely RF, XGBoost, CART, and MLP, and optimized their hyperparameters via grid search. We then constructed stacking ensemble models using KNN, logistic regression, and decision tree as meta-learners, and trained three DL models: DNN, CNN, and Transformer. Supplementary Figure S2 shows the training processes of the three DL models, including their loss and accuracy curves. The specific model parameter settings are given in Supplementary Table S1. Figure 3A-C presents the ROC curves on the training, validation, and test sets, directly comparing the classification performance of the models in risk prediction. Table 4 (at the end of the manuscript) summarizes the diagnostic performance of each model on the three sets, including accuracy, precision, recall, F1 score, and area under the curve (AUC). Supplementary Figure S3 presents the risk-prediction performance of the models in a more intuitive way.

Fig. 3.

Fig. 3

Classification performance comparison of different machine learning models on training, validation, and test sets. (A, B, C) Show ROC Curves on Training, Validation, and Test Sets for Evaluating the Classification Performance of Different ML Models. (D, E, F) Show Calibration Curves, Reflecting the Consistency Between Predicted Probabilities and Actual Incidence. (G, H, I) Show DCA (Decision Curve Analysis) Curves, Used to Evaluate the Net Benefit of Different Models Across Different Thresholds

Table 4.

Diagnostic performance comparison of different machine learning models

Method Model Dataset Accuracy Precision Recall F1 Score AUC
Grid RandomForest train 1.000 1.000 1.000 1.000 1.000
Grid RandomForest validation 0.856 0.858 0.853 0.856 0.929
Grid RandomForest test 0.855 0.858 0.851 0.855 0.925
Grid XGBoost train 0.993 0.999 0.988 0.993 1.000
Grid XGBoost validation 0.875 0.908 0.835 0.870 0.930
Grid XGBoost test 0.862 0.894 0.822 0.857 0.924
Grid CART train 0.852 0.856 0.846 0.851 0.939
Grid CART validation 0.776 0.779 0.771 0.775 0.839
Grid CART test 0.754 0.758 0.745 0.752 0.832
Grid MLP train 0.922 0.888 0.966 0.926 0.984
Grid MLP validation 0.764 0.719 0.868 0.786 0.837
Grid MLP test 0.778 0.739 0.857 0.794 0.842
Original RandomForest train 0.837 0.811 0.879 0.844 0.942
Original RandomForest validation 0.753 0.735 0.793 0.763 0.857
Original RandomForest test 0.763 0.748 0.795 0.771 0.856
Original XGBoost train 0.909 0.956 0.857 0.904 0.968
Original XGBoost validation 0.858 0.907 0.799 0.849 0.917
Original XGBoost test 0.849 0.900 0.785 0.838 0.915
Original CART train 0.781 0.734 0.882 0.801 0.869
Original CART validation 0.704 0.669 0.807 0.731 0.762
Original CART test 0.715 0.685 0.799 0.737 0.762
Original MLP train 0.849 0.841 0.862 0.851 0.931
Original MLP validation 0.741 0.729 0.766 0.747 0.804
Original MLP test 0.741 0.725 0.776 0.749 0.799
Stacking KNN train 0.990 0.991 0.989 0.990 0.999
Stacking KNN validation 0.869 0.892 0.839 0.865 0.920
Stacking KNN test 0.863 0.886 0.834 0.859 0.914
Stacking LogisticRegression train 0.998 1.000 0.997 0.998 1.000
Stacking LogisticRegression validation 0.877 0.889 0.861 0.875 0.943
Stacking LogisticRegression test 0.877 0.893 0.856 0.875 0.938
Stacking DecisionTree train 0.992 0.992 0.991 0.992 0.999
Stacking DecisionTree validation 0.877 0.890 0.860 0.875 0.936
Stacking DecisionTree test 0.878 0.892 0.859 0.875 0.934
DL DNN train 0.745 0.728 0.783 0.754 0.830
DL DNN validation 0.702 0.683 0.755 0.717 0.771
DL DNN test 0.698 0.679 0.750 0.713 0.770
DL CNN train 0.739 0.726 0.769 0.747 0.817
DL CNN validation 0.689 0.675 0.730 0.701 0.761
DL CNN test 0.700 0.683 0.748 0.714 0.764
DL Transformer train 0.616 0.608 0.655 0.631 0.663
DL Transformer validation 0.602 0.598 0.620 0.609 0.646
DL Transformer test 0.618 0.611 0.650 0.630 0.659

The logistic regression stacking model demonstrated the highest AUC, 0.943 on the validation set (95% CI: 0.935–0.951) and 0.938 on the test set (95% CI: 0.929–0.947), with consistent accuracy (0.877), precision (0.889), recall (0.861), and F1 score (0.875), indicating relatively strong generalization. The KNN and decision tree stacking models also performed well across datasets, and the experimental results show that stacking ensembles generally hold a performance advantage in DF risk prediction. Figure 3D-F shows the calibration curves of the models. The calibration curve of the logistic regression stacking model nearly coincides with the diagonal, meaning that predicted probabilities and actual incidence are highly consistent and that the model is well calibrated. By contrast, the Grid CART and original CART models deviate more from the diagonal, with marked differences between predictions and observed values in the high-probability region. In terms of clinical applicability, Fig. 3G-I shows the decision curve analysis (DCA) curves for the training, validation, and test sets. Except for the Transformer model, all models showed relatively robust net benefit across threshold ranges. The stacking models outperformed the others, and Grid XGBoost and original XGBoost also showed higher net benefit. The logistic regression stacking model achieved the highest net benefit and was selected as the best model for DF risk prediction.

Model interpretation and SHAP analysis

We used SHAP (Shapley Additive Explanations) values to interpret the model’s predictions both globally and locally. Globally, the SHAP summary plot and bar chart identified ABI, MMSE score, and lower limb discomfort as the most influential predictors. This aligns with clinical understanding: a lower ABI indicates peripheral vascular disease, which is a major contributor to DF risk; reduced MMSE reflects cognitive decline, which may impair self-care; and limb discomfort reflects possible ischemia or neuropathy.

To enhance clinicians' understanding and acceptance of the DF risk prediction model, we used SHAP to provide a detailed explanation of the final model's predictions. SHAP calculates each feature's contribution to the model output and offers both global and local explanations, making it possible to see how each feature affects the prediction result. The global explanation shows each feature's overall contribution to model predictions: as shown in Fig. 4A and B, the SHAP summary dot plot and bar chart report the average contribution of each feature across all predictions, which helps determine which features have the greatest impact on DF risk prediction. For example, lower MMSE (intelligence) scores push predictions toward the high-risk group, so patients with poorer cognitive function receive higher predicted DF risk. The local explanation shows how features affect individual predictions. The SHAP dependence plot in Fig. 4C examines each feature's influence on single-patient predictions in greater depth: the horizontal axis represents the feature value, the vertical axis the corresponding SHAP value, and the point color the feature value, indicating whether a specific characteristic pushes the risk prediction up or down.

Fig. 4.

Fig. 4

Global and Local Model Interpretations Using SHAP. (A) SHAP Summary Dot Plot. The SHAP summary dot plot shows the contribution of each feature. Each point represents a patient’s feature, with the color indicating the size of the feature value, red representing higher values and blue representing lower values. The vertical Stacking of points reflects the sample density. (B) SHAP Summary Bar Chart. The SHAP summary bar chart displays the average contribution of each feature to the model’s output, sorted by the magnitude of contribution. (C) SHAP Dependence Plot. The SHAP dependence plot shows how each feature affects the model’s output. Each point represents a patient, with the horizontal axis representing the feature value and the vertical axis representing the SHAP value. The color of the point represents the actual value of the feature

A local explanation case study is presented in Supplementary Figure S2. For an individual patient with moderate DF risk, ABI and MMSE contributed significantly to the predicted probability. The SHAP force plot visualizes the positive and negative impacts of each feature, helping clinicians understand the model’s reasoning.

SHAP can also provide personalized explanations for each prediction, helping us understand how different clinical characteristics affect the model output. As shown in Supplementary Table S2, the table lists the SHAP values of different features and their impact on DF risk prediction. Features such as "lower limb discomfort" and "LOPS" showed more dispersed SHAP values, meaning their impact on prediction varies greatly among patients. By comparison, features such as "pre-retirement occupation" and "stroke history" played a more consistent role, with relatively stable contributions to the model.
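The additivity that makes these per-patient explanations possible can be verified on a toy model. For a linear model with independent features, the exact SHAP value of feature i is w_i(x_i − E[x_i]); the sketch below (invented weights and data, not the study's model) checks the local-accuracy property that the base value plus all SHAP values reproduces the prediction.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))            # background data, 3 features
w = np.array([0.5, -1.2, 0.3])           # toy linear risk model (hypothetical)
b = 0.1
f = lambda X: X @ w + b

# For an independent-feature linear model, the exact SHAP value is
#   phi_i = w_i * (x_i - E[x_i])
x = np.array([1.0, 0.5, -2.0])           # one patient
phi = w * (x - X.mean(axis=0))
base = f(X).mean()                       # expected model output over the background

# local accuracy: base value + sum of SHAP values = the model's prediction
print(np.isclose(base + phi.sum(), f(x[None])[0]))   # True
```

For non-linear models such as random forests, the same decomposition holds but the φ values must be estimated (e.g. by TreeSHAP) rather than read off the coefficients.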

Development and features of the diabetic foot risk prediction app

The DF risk prediction app we developed operates on large-scale clinical data and integrates multiple ML and DL models to identify the high-risk DF population. The system supports multi-model selection (e.g., DL ensemble, grid-optimized RF) and displays individual predictions, risk categories, and probability visualizations.

The main interface is shown in Fig. 5A-D. Features include:

Fig. 5.

Fig. 5

The workflow of the APP. (A) This application supports batch prediction and allows users to upload Excel files or CSV files containing information of multiple patients. (B) Users can select one or more prediction Models in the panel, such as DeepLearning Models, Grid Search Optimized Models, etc. (C) After entering the eigenvalues, click the “Start Single Prediction” button, and at the same time, you can randomly test the sample size. (D) Prediction probability, risk stratification results and visual probability distribution map

  1. Real-time prediction display and probability graph (Fig. 5C).

  2. Model comparison table and performance charts.

  3. SHAP-based feature importance visualization.

  4. Error handling for invalid inputs or missing values.

This design enables clinicians to perform both single-patient risk screening and bulk cohort evaluations with immediate visual feedback.

On the main interface, users input 16 key clinical indicators: age, intelligence score (MMSE), LABI/RABI, ABI result, hematocrit, stroke, lower limb discomfort, dorsalis pedis pulse, ALP, LOPS, pre-retirement occupation, types of hypoglycemic drugs, current residential region, depression score, and foot washing frequency. Alternatively, random sample data can be generated. Users select one or more prediction models in the left panel, such as deep learning models or grid-search-optimized models. After entering the feature values, clicking the "Start Single Prediction" button displays the prediction probability, risk stratification result, and a visual probability distribution graph on the right (see Fig. 5 for details). The application also supports batch prediction: users can upload Excel or CSV files containing information on multiple patients, and the system automatically performs data preprocessing, feature mapping, and multi-model batch prediction. All prediction results can be downloaded with one click, enabling users to handle multiple patients efficiently and facilitating large-scale screening by clinicians.

The main innovation of this application is the introduction of “multi-model integrated visualization” into DF risk prediction, overcoming the limitations of the traditional “single optimal model” approach. Most previous studies and tools report results from a single best-performing model, ignoring the complementarity and uncertainty of different algorithms across populations and feature combinations. Our system allows users to select multiple models simultaneously and displays each model’s risk probability and predicted label as charts and text. This design improves the transparency of the prediction results: both doctors and patients can see the consistency and differences among models, reducing the risk of misjudgment caused by reliance on a single model.
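The multi-model comparison can be illustrated with a minimal sketch: each selected model scores the same patient, and the per-model probabilities, labels, and their agreement are collected for display. The model choices, the 0.5 threshold, and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cohort; 16 features mirror the app's indicators.
X, y = make_classification(n_samples=300, n_features=16, random_state=1)

models = {
    "LR": LogisticRegression(max_iter=1000).fit(X, y),
    "CART": DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y),
    "RF": RandomForestClassifier(random_state=1).fit(X, y),
}

patient = X[:1]  # one patient's feature vector

# Per-model risk probability and label, as shown side by side in the app.
report = {name: float(m.predict_proba(patient)[0, 1]) for name, m in models.items()}
labels = {name: ("High" if p >= 0.5 else "Low") for name, p in report.items()}

# Fraction of models agreeing on the majority label: a simple consistency signal.
agreement = max(list(labels.values()).count(l) for l in set(labels.values())) / len(labels)
```

Showing `report`, `labels`, and `agreement` together is one way to surface the inter-model consistency and differences discussed above.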

Discussion

Diabetic foot (DF), a serious diabetes complication causing infections, ulcers, and potential amputations [17], significantly impacts patients’ quality of life and healthcare costs [18]. While traditional screening methods like ABI and LOPS tests remain clinically useful, their reliance on single indicators and subjective clinician judgment limits comprehensive risk assessment, driving current research toward AI-based ML/DL approaches that better account for multifactorial interactions in DF prediction. By integrating diverse algorithms with explainable AI (XAI) techniques, the proposed system balances predictive accuracy with clinical interpretability.

This study developed an integrated ML/DL approach for DF risk prediction, with the logistic regression stacking model achieving strong performance (AUC = 0.938, accuracy = 0.877, precision = 0.889, recall = 0.861, F1 = 0.875). This superior predictive capability is consistent with prior research by Zhou (2025) on stacking’s advantages [19] and with Vidivelli’s (2025) findings that ensemble learning enhances model accuracy and generalizability by combining complementary base model strengths [20].
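A stacking ensemble with a logistic regression meta-learner can be sketched as below. This is a minimal illustration in the spirit of the model described here, not a reproduction of it: the base learners, hyperparameters, and synthetic data are assumptions (the paper's stack also includes XGBoost and other learners).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 16-feature clinical dataset.
X, y = make_classification(n_samples=600, n_features=16, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("cart", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ],
    # Logistic regression combines the base models' out-of-fold predictions.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
```

The `cv=5` argument makes the meta-learner train on out-of-fold base predictions, which is what guards stacking against simply memorizing the base models' training-set fit.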

Ensemble learning methods reduce bias and overfitting by combining multiple models [21], as demonstrated in our research using RF Stacking, XGBoost, and CART. This method utilizes the advantages of each algorithm [22], such as XGBoost/RF for feature selection and CART for non-linear relationships [23, 24], to improve the accuracy and stability of predictions, especially when dealing with complex medical data with high-dimensional features or imbalanced categories [25].

While deep learning models (DNN, CNN, Transformer) demonstrate strong feature extraction capabilities [26], our results indicate that ensemble learning outperformed DL in this study. This aligns with existing research showing DL’s limitations on smaller datasets [27], including potential overfitting and reduced generalizability [28]. Deep learning requires large volumes of high-quality data; when data are scarce or of low quality, traditional machine learning algorithms such as logistic regression and support vector machines (SVM) tend to perform more stably and efficiently. The characteristic “black box” nature of DL [29] further hinders clinical adoption, as clinicians require transparent decision-making processes. By employing ensemble learning methods, we achieved both improved predictive accuracy and enhanced model interpretability, thereby increasing clinical applicability.

In terms of clinical implementation, our study contributes several key innovations. First, we integrate both ML and DL algorithms into a unified ensemble structure, leveraging complementary strengths for robust performance. Second, we employ SHAP for both global and individual-level interpretation, quantifying feature contributions to DF risk predictions [30] and allowing clinicians to understand not only which features matter, but also why and how they influence specific predictions. The analysis revealed ABI as the most influential predictor [31], consistent with Khan et al.’s (2025) findings linking ABI reduction to peripheral vascular deterioration preceding DF onset [32]. MMSE scores and lower limb discomfort, often overlooked in conventional screening, also demonstrated significant predictive value in our ensemble model. Third, we developed a user-facing application that supports both single and batch predictions with multi-model outputs, probability visualizations, and error-handling mechanisms. Wang et al. (2024) likewise noted that ensemble learning methods can yield more robust and reliable DF risk predictions [33]. Unlike existing DF calculators that rely on fixed rules or single-model outputs, our system offers comparative transparency across multiple algorithms in real time.
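The individual-level interpretation idea has a closed form for linear models: for logistic regression, a feature's SHAP value is its coefficient times its deviation from the baseline (population mean), and the contributions sum exactly to the patient's log-odds shift. The sketch below hand-rolls this linear case on synthetic data (the 16-feature count mirrors the paper's indicators; the model and data are illustrative, and the paper's tree-based models would instead use the `shap` library):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 16 clinical indicators.
X, y = make_classification(n_samples=400, n_features=16, random_state=7)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def logit(x):
    """Log-odds (decision function) for a single feature vector."""
    return clf.decision_function(x.reshape(1, -1))[0]

# Linear SHAP: coef_j * (x_j - E[x_j]) is feature j's contribution for this patient.
baseline = X.mean(axis=0)
patient = X[0]
contributions = clf.coef_[0] * (patient - baseline)

# The contributions decompose the patient's log-odds shift from the baseline.
assert np.isclose(contributions.sum(), logit(patient) - logit(baseline))
```

Ranking `contributions` by absolute value per patient is what a local SHAP bar chart displays; averaging absolute contributions over the cohort gives the global importance ranking.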

The design of our system aligns with recent advances in interpretable and clinician-centric medical AI. For example, the We-XAI framework for cardiovascular disease prediction [1] and the EfficientNetB3-based explainable model for arrhythmia detection [2] have demonstrated the importance of integrating transparency and ensemble strategies in clinical tools. Similar to these approaches, our system provides not only accurate predictions but also actionable explanations, enhancing trust and utility in high-stakes environments.

Despite the promising performance of our model, several limitations remain. First, the study is limited to a single regional dataset from Shanghai and lacks external validation. Second, while SHAP enhances interpretability, further clinical case studies are needed to verify its real-world impact on decision-making. Third, although the app is designed for clinician use, broader usability testing and interface optimization are ongoing.

Future directions include prospective validation across multiple regions and ethnic populations, integration with real-time monitoring systems such as wearable sensors, and refinement of user experience design based on clinician feedback.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (19.5KB, docx)
Supplementary Material 2 (11.9KB, docx)
Supplementary Material 3 (317KB, docx)

Author contributions

Y.L., S.Z., B.R., S.J., X.L., W.L., B.L., Y.M., C.C. and L.H. designed and implemented the study. S.Z. performed the statistical analyses. Y.L. wrote the manuscript. Y.L. and S.Z. verified the data. All authors critically revised the manuscript for important intellectual content. All authors approved the final manuscript. Z.D. is guarantor of this work and, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Funding

Support was provided by the four major chronic disease special projects (2023ZD0504300) and the Shanghai Jinshan District Fifth Cycle Outstanding Young Talents Supporting Project (JSKJ-KTYQ-2023-02).

Data availability

The de-identified dataset and analysis code are available upon request from the corresponding author (lee19920928@163.com).

Declarations

Competing interests

The authors declare no competing interests.

Duality of interest

No potential conflicts of interest relevant to this article were reported.

Ethics declarations

The study was approved by the Jinshan Institutional Ethics Committee (JIEC 2024-S35) and conducted in accordance with the Declaration of Helsinki. All participants provided informed consent.

Consent to publish

Not applicable.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Jia X, Dou Z, Zhang Y, Yu C, Yang M, Xie H, Lin Y, Liu Z. Application of a novel thermal/pH-responsive antibacterial Paeoniflorin hydrogel crosslinked with amino acids for accelerated diabetic foot ulcers healing. Mater Today Bio. 2025;32:101736.
  2. Wu FF, Wang J, Liu GB. Clinical effects of thread-dragging therapy on gangrene of non-ischemic diabetic foot ulcers. Chin J Integr Med. 2025;31(6):552–7.
  3. Swoboda L, Held J. Impaired wound healing in diabetes. J Wound Care. 2022;31(10):882–5.
  4. Borhade MB, Yashi K, Singh S. Diabetes and exercise. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2025.
  5. Elmubark M, Fahal L, Ali F, Nasr H, Mohamed A, Igbokwe K. Assessment of risk factors leading to amputation among diabetic septic foot patients in Khartoum, Sudan. Cureus. 2024;16(12):e75517.
  6. Elainein MAA, Whdan MM, Samir M, Hamam NG, Mansour M, Mohamed MAM, Snosy MM, Othman MA, Sobieh AS, Saad MG, Labna MA, Allam S. Therapeutic potential of adipose-derived stem cells for diabetic foot ulcers: a systematic review and meta-analysis. Diabetol Metab Syndr. 2025;17(1):9.
  7. Salvotelli L, Stoico V, Perrone F, Cacciatori V, Negri C, Brangani C, Pichiri I, Targher G, Bonora E, Zoppini G. Prevalence of neuropathy in type 2 diabetic patients and its association with other diabetes complications: the Verona diabetic foot screening program. J Diabetes Complications. 2015;29(8):1066–70.
  8. Alonso-Fernández M, Mediavilla-Bravo JJ, López-Simarro F, Comas-Samper JM, Carramiñana-Barrera F, Mancera-Romero J, de Santiago Nocito A; Grupo de Trabajo de Diabetes de SEMERGEN. Evaluation of diabetic foot screening in primary care. Endocrinol Nutr. 2014;61(6):311–7.
  9. Echefu G, Batalik L, Lukan A, Shah R, Nain P, Guha A, Brown SA. The digital revolution in medicine: applications in cardio-oncology. Curr Treat Options Cardiovasc Med. 2025;27(1):2.
  10. Nayana BR, Pavithra MN, Chaitra S, Bhuvana Mohini TN, Stephan T, Mohan V, Agarwal N. EEG-based neurodegenerative disease diagnosis: comparative analysis of conventional methods and deep learning models. Sci Rep. 2025;15(1):15950.
  11. Tao H, You L, Huang Y, Chen Y, Yan L, Liu D, Xiao S, Yuan B, Ren M. An interpreting machine learning models to predict amputation risk in patients with diabetic foot ulcers: a multi-center study. Front Endocrinol (Lausanne). 2025;16:1526098.
  12. Wu Y, Dong D, Zhu L, Luo Z, Liu Y, Xie X. Interpretable machine learning models for detecting peripheral neuropathy and lower extremity arterial disease in diabetics: an analysis of critical shared and unique risk factors. BMC Med Inf Decis Mak. 2024;24(1):200.
  13. Lu W, Wang M, Yu Y, Ma L, Shi Y, Huang Z, Gong M. A novel self-supervised graph clustering method with reliable semi-supervision. Neural Netw. 2025;187:107418.
  14. Wang D, Xian X, Song C. Joint learning of failure mode recognition and prognostics for degradation processes. IEEE Trans Autom Sci Eng. 2024;21(2):1421–33.
  15. Padhy SK, Mohapatra A, Patra S. WE-XAI: explainable AI for CVD prediction using weighted feature selection and ensemble classifiers. Netw Model Anal Health Inf Bioinforma. 2025;14:13.
  16. Padhy SK, Mohapatra A, Patra S. A lightweight EfficientNetB3 explainable model for enhancing prediction of cardiac arrhythmia using ECG signals. Netw Model Anal Health Inf Bioinforma. 2025;14:49.
  17. Aravindhan A, Fenwick E, Wing Dan Chan A, Eyn Kidd Man R, Ee Tang W, Chuan Tan N, Sabanayagam C, Chay J, Pui Ng L, Teen Wong W, Fern Soo W, Wei Lim S, Lamoureux EL. Nonadherence to diabetes complications screening in a multiethnic Asian population: protocol for a mixed methods prospective study. JMIR Res Protoc. 2025;14:e63253.
  18. Ferreira RC. Diabetic foot. Part 1: ulcers and infections. Rev Bras Ortop (Sao Paulo). 2020;55(4):389–96.
  19. Zhou Z, Jia Y, Yan H, Xu J, Wang S, Wen J. Risk prediction models for patients with recurrent diabetic foot ulcers: a systematic review. Public Health. 2025;244:105744.
  20. Vidivelli S, Padmakumari P, Shanthi P. Multimodal autism detection: deep hybrid model with improved feature level fusion. Comput Methods Programs Biomed. 2025;260:108492.
  21. Schran C, Brezina K, Marsalek O. Committee neural network potentials control generalization errors and enable active learning. J Chem Phys. 2020;153(10):104105.
  22. Tsai CA, Chang YJ. Efficient selection of Gaussian kernel SVM parameters for imbalanced data. Genes (Basel). 2023;14(3):583.
  23. Wu T, Yang H, Chen J, Kong W. Machine learning-based prediction models for renal impairment in Chinese adults with hyperuricaemia: risk factor analysis. Sci Rep. 2025;15(1):8968.
  24. Al-Wardy M, Zarei E, Nikoo MR. Improving index-based coastal vulnerability assessment using machine learning in Oman. Sci Total Environ. 2025;976:179311.
  25. Gao R, Hu M, Li R, Luo X, Suganthan PN, Tanveer M. Stacked ensemble deep random vector functional link network with residual learning for medium-scale time-series forecasting. IEEE Trans Neural Netw Learn Syst. 2025;PP.
  26. Escudero-Arnanz Ó, Martínez-Agüero S, Martín-Palomeque P, Marques G, Mora-Jiménez A, Álvarez-Rodríguez I, Soguero-Ruiz J. Multimodal interpretable data-driven models for early prediction of multidrug resistance using multivariate time series. Health Inf Sci Syst. 2025;13(1):35.
  27. Yue W, Han R, Wang H, Liang X, Zhang H, Li H, Yang Q. Development and validation of clinical-radiomics deep learning model based on MRI for endometrial cancer molecular subtypes classification. Insights Imaging. 2025;16(1):107.
  28. Kalemati M, Zamani Emani M, Koohi S. InceptionDTA: predicting drug-target binding affinity with biological context features and inception networks. Heliyon. 2025;11(3):e42476.
  29. Fathy W, Emeriaud G, Cheriet F. A comprehensive review of ICU readmission prediction models: from statistical methods to deep learning approaches. Artif Intell Med. 2025;165:103126.
  30. Long Y, Xu X, Chen J, Liu S, Li J, Dong Y. An explainable predictive model of direct pulp capping in carious mature permanent teeth. J Dent. 2024;149:105269.
  31. Khan Z, Zeb S, Ashraf, Rumman, Ali A, Aleem F, Omair F. The relationship between plasma fibrinogen levels and the severity of diabetic foot ulcers in diabetic patients. Cureus. 2025;17(3):e81118.
  32. Kammien AJ, Evans BG, Hu KG, Colen DL. Geographic region and insurance status predict access to salvage procedures for diabetic lower-extremity wounds in the United States. Ann Plast Surg. 2025;94(4S Suppl 2):S349–52.
  33. Xiaoling W, Shengmei Z, BingQian W, Wen L, Shuyan G, Hanbei C, Chenjie Q, Yao D, Jutang L. Enhancing diabetic foot ulcer prediction with machine learning: a focus on localized examinations. Heliyon. 2024;10(19):e37635.
