Abstract
In recent years, diabetes, a chronic disease, has risen significantly, leading to more health complications. Among these complications, the diabetic foot ulcer (DFU) is one of the most serious. A DFU is a wound on the foot of a person affected by diabetes, and it can prove fatal if left untreated. Diagnosing DFU in its early stage remains challenging due to the medical impediments caused by diabetes. Thermography serves as a promising technique for the early prediction of DFU and supports improved treatment towards the eradication of foot amputations. Still, the use of thermography images in clinical DFU treatment remains underexplored due to its computational complexity and the ambiguities present in thermal images. To overcome this challenge, this paper proposes an Intelligent Prediction System (IPS) that uses modified Swin transformers for effective segmentation and deep capsule networks for accurate prediction of DFU. In the segmentation phase, Swin transformers are used in a U-NET-based architecture to segment the lesions of foot ulcers. Deep features are extracted by the capsule networks and supplied to a deep shallow network, which works on the principle of extreme learning networks, to achieve the early prediction of DFU. Extensive experimentation is conducted on thermal foot ulcer images in Python 3 with the TensorFlow–Keras libraries. To verify the efficiency of the proposed schema, the evaluated performance is compared with other research experiments. Results show that the proposed schema achieves the highest prediction accuracy (99%) with promising segmentation performance (98.6%). Moreover, the proposed model excels over the various existing schemas and establishes a firm foothold in the early prediction of DFUs.
Keywords: Diabetic foot ulcers, Capsule networks, Swin transformers, Shallow network, Extreme learning principles
Introduction
Diabetic foot ulcer (DFU) is one of the most severe complications of diabetes, which can potentially lead to foot amputation if left untreated [1, 2]. Key indicators of DFU include changes in skin color, variations in skin temperature, foot swelling, and leg pain accompanied by dry, cracked skin. Typically, diagnosing DFU can be expensive, and delayed detection may have serious consequences; if it is not detected early, the fatality rate increases among patients [3–5]. In the contemporary scenario, several early prediction and detection methods have been proposed [6–9]. However, these methods exhibit moderate performance with low prediction accuracy. In addition, several intelligent approaches have been implemented using thermal images to detect DFU complications automatically [10].
Several studies show that thermal imaging can be considered an essential technique that utilizes foot temperature to analyse diabetic foot complications. However, thermal images exhibit intrinsically complex properties, which impacts the performance of early prediction systems [11]. To solve this problem, machine learning [12] and deep learning [13, 14] approaches are utilized as catalysts that can boost performance in diagnosing DFU. Several conventional ML algorithms, such as support vector machines (SVM), artificial neural networks (ANN), K-nearest neighbours and hybrid decision trees, were used to predict diabetic foot complications from thermal images, yet they gave poor results. Since machine learning techniques fail to handle thermal images effectively, this research adopts deep learning as a more potent means for the early prediction of diabetic foot complications.
In recent days, deep learning algorithms have shown their capability in analysing thermal images, which can be used for early prediction. Earlier deep learning algorithms [15–18] were limited to the early prediction of DFU and failed to address the prediction of different risk levels of foot ulcers. To overcome this drawback, this paper proposes an integrated design for the segmentation and prediction of different risk levels of diabetic foot ulcers. In this work, a new prediction framework is designed by an ensemble of powerful deep learning algorithms, subsequent to comprehensive testing. Besides, the proposed scheme is constructed from the ground up using thermal foot images. The main contributions of this article are:
Proposing an intelligent framework for the early prediction of risk levels of diabetic feet based on thermal images.
Proposing a novel segmentation model based on Modified Swin Transformers for diabetic foot segmentation from thermal images.
Assessing a high-quality hybrid schema for deep feature extraction and highly accurate prediction based on capsule shallow networks.
Evaluating the proposed model on thermal foot image datasets and comparing it with other existing frameworks in terms of prediction performance.
Literature review
An enhanced UNet-based foot ulcer segmentation approach was presented by L. Xing et al. in 2022. The suggested approach added an SVM at the output node and a coarse localization module, which makes decisions based on prior information, to the traditional UNet serving as its foundation. In terms of dice score (89.02%), this framework performed better, but its computational complexity remained unsolved [19]. Mahbod et al. (2022) introduced an ensemble method based on LinkNet and U-Net, two encoder-decoder CNN models. The performance of this framework in terms of dice score is average (92.07%), but it consumes more memory space, which is its biggest flaw [20].
Bouallal et al. (2022) proposed a deep learning-based segmentation approach for diabetic foot (DF) thermal images, addressing challenges such as image ambiguity and low clarity. Their model, Double Encoder-ResUnet (DE-ResUnet), integrates residual networks and U-Net architecture while fusing RGB and thermal information for enhanced accuracy. Using a dataset of 398 paired thermal and RGB images from 54 healthy subjects and 145 diabetic patients, the model was trained on 50% of the data, validated on 25%, and tested on the remaining 25%. The proposed model demonstrated superior segmentation performance, achieving an average intersection over union (IoU) of 97% and effectively delineating high-risk ulceration regions such as toes and heels. This study highlights the potential of deep learning in automating DF segmentation for early diagnosis and clinical applications [21]. Munadi et al. (2022) proposed a deep learning framework for early detection of diabetic foot ulcers (DFU) using decision fusion and thermal imaging. The study aimed to improve upon previous DFU detection models, which achieved an accuracy of 97%. The proposed framework employed ShuffleNet and MobileNetV2 as baseline classifiers, trained on plantar thermogram datasets, and integrated their outputs using a novel decision fusion method. This approach significantly enhanced classification accuracy, achieving 100% in distinguishing DFU-positive and DFU-negative cases, representing a 3.4% improvement over baseline models. The findings highlight the effectiveness of decision fusion in deep learning-based DFU detection, outperforming traditional machine learning and state-of-the-art deep learning classifiers [22]. DFU segmentation using U-Net was presented by D. Bouallal et al. in 2020. The thermal and color images provided by the FLIR ONE Pro thermal camera are combined to train U-Net. According to the results, this multimodal strategy outperforms the use of thermal images alone and is more accurate. However, it is not advised for real-time situations because system performance steadily declines as data size increases [23].
J. Amin et al. in 2020 proposed an assortment of classifiers, including KNN, DT, Ensemble, softmax, and NB, used in the classification phase to analyse the classification outcomes and select the most efficient classifiers. Deep features are retrieved during this phase and fed to these classifiers. The high-level properties of the affected regions are visualized by leveraging the gradient-weighted class activation mapping (Grad-CAM) model after classification for improved comprehension. The classified images were passed to the CNN network for locating the affected areas. Although its computational complexity is increased, this framework offered improved classification accuracy [24–26].
Alzubaidi et al. (2020) introduced DFU_QUTNet, a novel deep convolutional neural network designed for the classification of diabetic foot ulcers (DFU) using a dataset of 754 foot images containing both healthy and ulcer-affected skin. Unlike traditional CNN architectures, which suffer from performance degradation when excessively deep, DFU_QUTNet increases network width while maintaining depth, facilitating better gradient propagation and feature extraction. The extracted features were used to train Support Vector Machine (SVM) and K-Nearest Neighbors (KNN) classifiers. The study also compared DFU_QUTNet with fine-tuned versions of GoogleNet, VGG16, and AlexNet, demonstrating superior performance with an F1-score of 94.5%, highlighting its effectiveness in DFU classification [27]. Tulloch et al. (2020) conducted a systematic review on the application of machine learning in the prevention, diagnosis, and management of diabetic foot ulcers (DFUs), highlighting its potential to enhance patient care. The study systematically analyzed 37 out of 3769 papers following PRISMA-DTA guidelines, with inclusion criteria requiring studies to mention ML, DFUs, and report relevant accuracy metrics. The review found that various ML algorithms achieved at least 90% accuracy in DFU-related tasks such as image segmentation, classification, raw data analysis, and risk assessment. Despite demonstrating promising results in controlled settings, the study emphasized the need for further research, particularly in direct comparisons with standard care practices, health economic analyses, and large-scale data collection to improve ML applications in DFU management [28]. A deep learning-based approach for precise wound area segmentation was proposed by C. Cui et al. in 2019. This technique used convolutional neural networks (CNNs) to create probability maps by first processing input images to eliminate artefacts. The infected region is extracted from the probability maps as the last step. This technique also addresses the elimination of certain false positives. Studies revealed that it delivers excellent results in terms of segmentation accuracy and the Dice index; its increased memory usage, however, is its fundamental flaw [29].
Proposed methodology
In this research, a three-tier hybrid deep learning framework is proposed to segment and classify diabetic foot images. In the first stage, U-based Modified Swin Transformers (U-MST) are used to segment the diabetic foot images into varied structural pathological regions. In the second stage, deep topographic features are extracted using capsule networks, and in the third stage an extreme learning machine classifies the extracted features. The proposed schema for the integrated design of segmentation and classification is depicted in Fig. 1.
Fig. 1.
Proposed framework for segmentation, feature extraction and classification layer
The overall framework integrates three modules — (i) Modified Swin Transformer (U-MST) for segmentation, (ii) Capsule Network for feature extraction, and (iii) Extreme Learning Machine (ELM) for final classification (Fig. 2).
Fig. 2.
Overall framework of the proposed work
The total loss function combines both segmentation and classification objectives:
Ltotal = αLseg + βLcls.
where Lseg is the Dice loss and Lcls is the categorical cross-entropy. The weight coefficients α = 0.6 and β = 0.4 were empirically determined to balance optimization.
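As a minimal sketch of how this weighted objective can be computed (a NumPy illustration with hypothetical helper names; the paper's actual implementation uses TensorFlow–Keras):

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    """Soft Dice loss over a segmentation mask: 1 - 2|S∩T| / (|S|+|T|)."""
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred)
    return 1.0 - (2.0 * intersection + eps) / (union + eps)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy over one-hot labels."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=-1))

def total_loss(seg_true, seg_pred, cls_true, cls_pred, alpha=0.6, beta=0.4):
    """L_total = alpha * L_seg + beta * L_cls, with the paper's weights."""
    return alpha * dice_loss(seg_true, seg_pred) + \
        beta * categorical_crossentropy(cls_true, cls_pred)
```

Perfect predictions drive both terms, and hence the total, towards zero.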
Materials and methods
The study included patients who were clinically diagnosed with Type II diabetes mellitus for a minimum of three years, without any prior lower-limb amputation or open ulceration at the time of thermal imaging. Individuals exhibiting peripheral arterial disease, neuropathic deformities, non-diabetic foot wounds, or thermal imaging artefacts caused by skin conditions were excluded from the study. The control group consisted of healthy, non-diabetic participants with no history of vascular or neurological disorders, ensuring a clear distinction between diabetic and non-diabetic subjects for comparative analysis.
The dataset comprises thermal images (thermograms) of the foot region, sourced from [30]. It includes data collected from 122 individuals diagnosed with diabetes (DM group) and 45 non-diabetic participants (control group). Table 1 shows the demographic details of the data group.
Table 1.
Demographic details of the diabetic and non diabetic group
| Parameter | Diabetic group (n=122) | Control group (n=45) |
|---|---|---|
| Mean age (years) | 56.2 ± 7.4 | 54.8 ± 6.9 |
| Gender (M/F) | 65/57 | 23/22 |
| Mean BMI (kg/m2) | 26.8 ± 3.2 | 25.4 ± 2.7 |
This dataset illustrates the variations in temperature distribution within the foot area for both groups. Research indicates that elevated plantar temperature correlates with an increased risk of ulceration in diabetic individuals. The thermograms were captured using high-resolution thermal cameras. The acquired images are organized into folders using a distinct labelling system: 'CG' represents the control group, while 'DM' indicates the diabetic group. Each file is assigned a unique three-digit code, followed by a letter denoting the subject's gender (male or female). The dataset features thermograms of both the right and left feet from subjects in the control and diabetic groups. Figure 3 illustrates the input images employed in the classification process.
Fig. 3.

Representative thermogram images utilized for diabetic foot ulcer prediction: (a) control group, female, right; (b) control group, female, left; (c) diabetic group, male, right; (d) diabetic group, female, left
Data pre-processing
The pre-processing stage eliminates noise pixels and low-quality pixels that hinder the identification of foot ulcers. A pixel-intensity testing procedure has been implemented to discard unreliable and noisy pixels from the input thermal images. Additionally, image histogram techniques are utilized to enhance image quality, as they perform effectively across various image types.
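For illustration, histogram equalization, one of the image histogram techniques mentioned above, can be sketched in plain NumPy for an 8-bit thermal image (a simplified sketch; the exact pre-processing pipeline is not detailed in the source):

```python
import numpy as np

def equalize_histogram(img):
    """Histogram equalization for an 8-bit grayscale thermal image.

    Builds the cumulative distribution function (CDF) of pixel intensities
    and remaps intensities so the output histogram is roughly uniform,
    stretching low-contrast thermal data over the full 0-255 range.
    """
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()  # first non-zero CDF value
    lut = np.clip(
        np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255),
        0, 255,
    ).astype(np.uint8)
    return lut[img]
```

A narrow band of intensities (e.g. 100–163) is stretched to span the full dynamic range after equalization.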
Data augmentation
After the initial pre-processing of input images, the proposed architecture applies an image augmentation process. Neural networks often encounter overfitting when limited labelled data are available; a highly effective approach to tackle this issue is data augmentation. During this stage, every image is subjected to several modifications, leading to a significant boost in the quantity of training image samples. As noted in [31], affine transformations are leveraged for effective data augmentation; methods like translation, scaling, and rotation fall into this category. Generally, the training samples produced by augmentation display some level of correlation; therefore, this step is advisable to mitigate overfitting.
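A minimal sketch of an affine augmentation (rotation, scaling, translation) via inverse coordinate mapping is given below; this is an illustrative NumPy implementation, not the paper's exact augmentation code:

```python
import numpy as np

def affine_augment(img, angle_deg=0.0, scale=1.0, tx=0, ty=0):
    """Affine augmentation of a 2-D image via inverse coordinate mapping
    with nearest-neighbour sampling; out-of-bounds pixels are zero-filled."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    ys, xs = np.mgrid[0:h, 0:w]
    # Map each output pixel back to its source coordinate (inverse transform).
    x_rel, y_rel = xs - cx - tx, ys - cy - ty
    sx = np.round((cos_t * x_rel + sin_t * y_rel) / scale + cx).astype(int)
    sy = np.round((-sin_t * x_rel + cos_t * y_rel) / scale + cy).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out
```

With default parameters the transform is the identity; a non-zero `tx` shifts the image along the x-axis, and each augmented copy enlarges the training set.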
Cross-validation and data diversity
To address the limitations of the available dataset, comprising 122 diabetic and 45 control subjects, a five-fold cross-validation strategy was adopted to ensure unbiased model training and reliable performance evaluation. This technique divides the dataset into five folds of approximately equal size and diabetic/control distribution, as shown in Fig. 4.
Fig. 4.

Illustration of the five-fold cross-validation strategy used in the proposed U-MST + Capsule + ELM framework
In each iteration:
Four folds (80%) are used for training the proposed model.
One fold (20%) is reserved for testing and independent evaluation.
The process repeats five times, ensuring that each subset serves as the test set exactly once.
After completing all five iterations, the average of the five test performances (Accuracy, Dice Score, IoU, Precision, Recall, and F1-score) is computed to represent the model’s overall capability. This approach minimizes overfitting, ensures non-overlapping subject data between training and testing, and provides a more robust estimate of model generalization, particularly important for medical datasets with limited samples.
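The subject-level five-fold split described above can be sketched as follows (illustrative only; `five_fold_splits` and its round-robin stratification are assumptions, not the paper's code):

```python
import numpy as np

def five_fold_splits(subject_ids, labels, n_folds=5, seed=42):
    """Stratified subject-level folds: each subject appears in exactly one
    test fold, with diabetic/control proportions preserved per fold."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in set(labels):
        cls_ids = [s for s, y in zip(subject_ids, labels) if y == cls]
        rng.shuffle(cls_ids)
        for i, sid in enumerate(cls_ids):
            folds[i % n_folds].append(sid)  # round-robin keeps strata balanced
    for k in range(n_folds):
        test = set(folds[k])
        train = [s for s in subject_ids if s not in test]
        yield train, sorted(test)
```

Iterating the generator yields five disjoint train/test partitions, so each of the 167 subjects is tested exactly once and never leaks into its own training fold.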
Segmentation model
The segmentation model incorporated as the backbone architecture for segmenting diabetic thermal foot images is depicted in Fig. 5. A Modified Swin Transformer with an encoder-decoder architecture is introduced to achieve precise segmentation of diabetic thermal images. This architecture incorporates a feature fusion module at the end of the encoder to maximize the utilization of features across different levels. In this framework, feature maps are aggregated to enhance the transferable local attributes from various stages through multilevel feature fusion. This enables the proposed framework to localize different ulcers, improving segmentation and classification accuracy. In addition, a hybrid convolutional attention module (HCAM) is inserted into the modified Swin transformer to support dense lesion segmentation. Finally, a similar architecture is incorporated as the final decoder.
Fig. 5.
Proposed segmentation model in swin-CAP-DF NETS
Swin transformer model
The schematic representation of the Swin Transformer module is represented in Fig. 6.
Fig. 6.

Framework for swin transformer schema
Every Swin transformer block consists of layer normalization (LN), a window-based multi-headed self-attention module (regular window W-MHSA, alternating with shifted window SW-MHSA) and a multi-layer perceptron (MLP). These modules are used alternately to form complete Swin transformer blocks. The Swin transformer block can be formulated as:
$\hat{z}^{l} = \text{W-MHSA}(\text{LN}(z^{l-1})) + z^{l-1}$  (1)

$z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}$  (2)

$\hat{z}^{l+1} = \text{SW-MHSA}(\text{LN}(z^{l})) + z^{l}$  (3)

$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$  (4)
Modified swin transformer model
Figure 7 illustrates the modified Swin transformer (MST) model utilized as the backbone of the proposed structure. Each MST block contains a batch normalization (BN) layer, a shifted-window multi-head self-attention (SW-MHSA) module, a shift-window-based hybrid multi-head convolutional attention (SW-HCMA) module, residual connections, a second BN layer and an MLP with a residual connection. The SW-HCMA, introduced in the proposed block, can significantly distinguish the feature maps of the thermal foot images. It is constructed by combining channel attention and spatial attention, which yields adaptive attention focused on accurate segmentation of the images. The proposed block integrates the hybrid convolutional attention with shift-window operations, mitigating adverse impacts on modelling capability and facilitating improved accuracy. To achieve optimal performance on thermal images, the SW-MHSA and SW-HCMA modules are used alternately within the proposed Swin Transformer block. The mathematical formulation of the proposed Swin Transformer (SWT) can be represented as follows:
Fig. 7.
Framework of modified swin transformer schema
$\hat{z}^{l} = \text{SW-MHSA}(\text{BN}(z^{l-1})) + z^{l-1}$  (5)

$z^{l} = \text{MLP}(\text{BN}(\hat{z}^{l})) + \hat{z}^{l}$  (6)

$\hat{z}^{l+1} = \text{SW-HCMA}(\text{BN}(z^{l})) + z^{l}$  (7)

$z^{l+1} = \text{MLP}(\text{BN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$  (8)
U-SWIN core design
Figure 8 shows the U-SWIN model, which consists of a Modified Swin Transformer backbone with encoder, decoder and feature fusion modules. The encoder-decoder design is identical to the U-NET structure to handle thermal images.
Fig. 8.
Structure of U-SWIN for handling the input thermal images in single cycle
Encoder design
In the encoder, images are fed into the Patch Partition Layers (PPL) of the Modified Swin Transformer (MST), which divide them into non-overlapping patches. Each patch's feature is constructed by concatenating the original RGB pixel values, resulting in a feature dimension corresponding to the input size. The Modified Swin Transformer block is then utilized for hybrid attention computation, facilitating feature extraction. The Patch Merging Layers (PML) play a major role in decreasing the number of tokens, enabling the generation of feature representations at various scales, as illustrated in Fig. 8. A combination of a patch merging layer and multiple Swin Transformer blocks carries out feature extraction on the encoder side. Unlike the Tiny Swin Transformer model, the proposed architecture incorporates three stages, with the PPL and MST operating in Stage 1; Stages 2 and 3 each consist of a PML module followed by MST blocks.
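The patch partition step can be illustrated with a small NumPy sketch that splits an (H, W, C) image into non-overlapping patch tokens (the patch size of 4 is assumed here for illustration, not stated in the source):

```python
import numpy as np

def patch_partition(img, patch=4):
    """Split an (H, W, C) image into non-overlapping patch tokens,
    flattening each patch's pixel values into one feature vector (as in
    the PPL)."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image must divide by patch size"
    x = img.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return x.reshape((h // patch) * (w // patch), patch * patch * c)
```

An 8 × 8 × 3 input with patch size 4 yields 4 tokens, each a 4 × 4 × 3 = 48-dimensional concatenation of the patch's raw pixel values.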
Feature fusion module
In contrast to conventional segmentation methods, the dataset examined in this research consists of thermal images with complex backgrounds, strong refracted illumination and varied geometries. Techniques that adopt convolution operations within transformers provide deeper attributes of the input images that considerably refine segmentation quality. A feature fusion module is implemented after each stage of the Modified Swin Transformer (MST) in the encoder. This module convolves the output feature maps and merges multi-scale features with each stage of the MST in the decoder. Despite the Swin Transformer's hierarchical framework, there are no interactions among the feature maps at any stage; thus, enhancing the extracted features is crucial, particularly for the unique characteristics of thermal foot images. Figure 5 showcases four feature maps from the respective stages of the Swin Transformer backbone. The spatial dimension of each stage is reduced by half, while the channel dimension is doubled. Each stage's feature map is down-sampled using a 2 × 2 convolutional kernel with a stride of two, and then concatenated with the following stage along the channel direction.
The output from the proposed encoder is configured to match the dimensions of the feature map located on the left side of the architecture, which is denoted as O(e). Subsequently, a merging operation is executed to combine the paths. Thus, the formulation describing the interconnections among each encoder block is represented as below:
$E_{i} = \text{MST}(\text{PML}(E_{i-1}))$  (9)
For the proposed model,
$O(e) = \text{Concat}(E_{3},\, F(dp))$  (10)
where F(dp) denotes the output of the patch merging layers.
Decoder design
The decoder mainly consists of three stages. Contrary to previous variants of U-net, the proposed decoder incorporates the proposed transformer block followed by up-sampling and a skip connection. The output generated by the encoder serves as the input of the decoder. In every decoder stage, the input features are up-sampled by a factor of 2 and combined with the feature maps from the encoder at the corresponding stage, which are then processed through the MST block. This design enables the decoder to fully leverage the characteristics from the encoder, enhances up-sampling, and improves decoding efficiency.
To achieve well-defined segmented images, patch merging layers are integrated to unify features that enhance the segmentation procedure. The quantitative formulation for each stage of the decoder is presented as follows:
$D_{i} = \text{MST}(\text{Concat}(\text{Up}_{\times 2}(D_{i-1}),\, E_{i}))$  (11)
For the Proposed model, decoder stage is given as follows,
$D_{i} = \text{MST}(\text{Concat}(\text{Up}_{\times 2}(D_{i-1}),\, E_{4-i})), \quad i = 1, 2, 3$  (12)
Capsule-based feature extraction
In the subsequent phase, the images obtained from the U-SWIN module are processed through the capsule network. This network [32] was recently introduced to overcome the shortcomings of traditional CNN architectures. Capsules are clusters of neurons that represent spatial details along with the likelihood of an object's existence. Within a capsule network, a capsule is dedicated to each entity in an image, providing:
The likelihood of the entity's presence.
The instantiation parameters of the entity.
The capsule network is partitioned into three segments: a lower capsule tier, an upper capsule tier, and a classification tier. To minimize the build-up of errors, global parameter sharing is utilized, alongside an enhanced dynamic routing algorithm that updates parameters iteratively. To capture the essential spatial relationships among low-level and high-level convolutional features in the image, the product of the input vectors’ matrix and the weight matrix is computed as below:
$\hat{u}_{j|i} = W_{ij}\, u_{i}$  (13)

$s_{j} = \sum_{i} c_{ij}\, \hat{u}_{j|i}$  (14)
Non-linearity is ultimately introduced through the squashing function, which ensures that the vector’s direction is preserved while mapping its length to a maximum of one and a minimum of zero.
$v_{j} = \dfrac{\lVert s_{j} \rVert^{2}}{1 + \lVert s_{j} \rVert^{2}} \cdot \dfrac{s_{j}}{\lVert s_{j} \rVert}$  (15)
To effectively classify thermal diabetic foot ulcers (DFUs), a capsule network can capture information located at various positions and establish relationships between features through the mathematical expression provided in Eq. (13). The primary capsule layers consist of convolutional layers, the specifications of which are detailed in the accompanying table. The output weights are computed using Eq. (14) and subsequently forwarded to the primary capsule region.
The squash function retains the original vector’s orientation while compressing its length to the range of (0, 1), as defined in Eq. (15). The subsequent phase integrates the dot product between similar capsules and their outputs, utilizing optimized dynamic routing. This iterative process upgrades the network’s weights to create feature maps. The dimensionality of these feature maps is then reduced via fully connected layers, transforming them into single-dimensional feature maps by utilizing flatten layers. The mathematical representation of the features is expressed in Eq. (16).
$f = \text{Flatten}([v_{1}, v_{2}, \ldots, v_{n}])$  (16)
Classification layers
The proposed approach leverages the principle of Extreme Learning Machines (ELM) proposed by G.B. Huang [33, 34] for rapid and precise classification of the various grades. This type of neural network employs a single hidden layer that does not require mandatory tuning. ELM utilizes a kernel function to achieve high accuracy and enhance performance. The vital benefits of ELM include minimal training error and improved approximation. With its auto-tuning of weight biases and non-zero activation functions, ELM is effectively applied to classification tasks. A comprehensive explanation of the ELM's operational mechanism is laid out in references [33, 34].
In this framework, the 'L' neurons in the hidden layer must use a highly differentiable activation function (such as the sigmoid function), while the output layer employs a linear activation function. The weights of the hidden layer, including bias weights, can be allocated arbitrarily. Although the hidden nodes are relevant, their parameters can be randomly generated even before the training dataset is processed. For a single-hidden-layer ELM, the system output is described by Eq. (17).
$f_{L}(x) = \sum_{i=1}^{L} \beta_{i}\, h_{i}(x) = h(x)\,\beta$  (17)
where x denotes the input features from the encoder-decoder and $\beta$ the output weight vector:
$\beta = [\beta_{1}, \beta_{2}, \ldots, \beta_{L}]^{T}$  (18)
and h(x) denotes the hidden-layer output:
$h(x) = [h_{1}(x), h_{2}(x), \ldots, h_{L}(x)]$  (19)

$h_{i}(x) = g(w_{i} \cdot x + b_{i})$  (20)
The fundamental execution of Extreme Learning Machines (ELM) employs the simplest non-linear least squares approach, as depicted in Eq. (21).
$\beta = H^{*}\, T$  (21)
where H* denotes the Moore–Penrose generalized inverse of H, and T is the target output matrix.
$\beta = H^{T}\left(\dfrac{I}{C} + H H^{T}\right)^{-1} T$  (22)

$f(x) = h(x)\, H^{T}\left(\dfrac{I}{C} + H H^{T}\right)^{-1} T$  (23)
The input feature maps, represented as h(x), form the hidden-layer output matrix H, whose generalized inverse is computed using the Moore–Penrose theorem. In this context, C represents a regularization constant, while B and O signify the weights and bias parameters of the neural network. Ultimately, the likelihood of each class occurrence is determined using the softmax function, as portrayed in Eq. (24).
$Y' = \text{softmax}(f(x)) = \dfrac{e^{f_{k}(x)}}{\sum_{j=1}^{K} e^{f_{j}(x)}}$  (24)
The output variable Y′ is used to predict the diabetic foot ulcer (DFU) class based on the established datasets. The loss function is computed using the cross-entropy function, represented mathematically as follows:
$L_{cls} = -\sum_{k=1}^{K} y_{k}\, \log(\hat{y}_{k})$  (25)

where K is the capsule feature dimension, $y_{k}$ the ground-truth coefficient and $\hat{y}_{k}$ the predicted class probability. The functioning process of the suggested framework is outlined in Algorithm-1.
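A compact NumPy sketch of the ELM classifier described by Eqs. (17)-(24) is shown below; the hidden-layer size, random seed, and regularization constant C are illustrative assumptions, not values from the paper:

```python
import numpy as np

def train_elm(X, Y, n_hidden=200, C=1e3, seed=0):
    """Single-hidden-layer ELM: random input weights, sigmoid activations,
    output weights solved in closed form (regularized pseudo-inverse,
    Eq. 22)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random, never tuned
    b = rng.normal(size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-np.clip(X @ W + b, -500, 500)))  # h(x)
    # beta = H^T (I/C + H H^T)^{-1} Y
    beta = H.T @ np.linalg.solve(np.eye(H.shape[0]) / C + H @ H.T, Y)
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-np.clip(X @ W + b, -500, 500)))
    logits = H @ beta                                # f(x) = h(x) beta, Eq. 17
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)          # softmax, Eq. 24
```

Because the output weights are obtained in one linear solve rather than by gradient descent, training is extremely fast, which is the main appeal of ELM in this classification stage.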
Implementation
The network architecture introduced was formulated by utilizing Keras, with TensorFlow serving as the backend framework. Table 2 displays the hyperparameters utilized for training the schema. In training phase, the early stopping technique was utilized to halt the training process prematurely, thereby mitigating the risk of overfitting.
Table 2.
Hyperparameters used for training the proposed model
| Hyperparameters used | Specifications |
|---|---|
| Initial learning rate | 0.001 |
| Epoch count | 289 |
| Batch Size | 30 |
| Optimizer | ADAM |
| Momentum | 0.02 |
Evaluation metrics
The assessment of the proposed research is conducted using two groups of metrics: segmentation metrics and classification metrics. The segmentation model is evaluated with metrics that quantify the disparity between the predicted outcomes and the actual ground truth, namely the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU). Both DSC and IoU measure the overall agreement between the model's predictions and the ground truth. Classification metrics (accuracy, precision, recall, specificity, and F1-score) evaluate the algorithm's proficiency in categorizing the various levels of diabetes disorders. Table 3 provides the formulas for the segmentation metrics, while Table 4 presents the equations for the classification metrics.
Table 3.
Mathematical expressions for calculating the segmentation metrics
| Segmentation metrics | Expression |
|---|---|
| DICE (DSC) | 2\|S ∩ T\| / (\|S\| + \|T\|) |
| IoU | \|S ∩ T\| / \|S ∪ T\| |
| Segmentation precision | \|S ∩ T\| / \|T\| |
| Segmentation recall | \|S ∩ T\| / \|S\| |
Table 4.
Mathematical expression for measuring the classification metrics
| Performance metrics | Mathematical expression |
|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
| Sensitivity or recall | TP / (TP + FN) |
| Precision | TP / (TP + FP) |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) |
Here, S and T denote the actual ground truth and the model's predicted segmentation, respectively. The Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) range from 0 to 1, where 0 indicates no overlap and 1 signifies a perfect match. A greater value for these metrics indicates increased overlap between the model's predictions and the ground truth, and hence improved segmentation accuracy.
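For binary masks, the DSC and IoU of Table 3 can be computed as follows (a minimal NumPy sketch):

```python
import numpy as np

def dice_score(gt, pred):
    """DSC = 2|S ∩ T| / (|S| + |T|) for binary masks."""
    inter = np.logical_and(gt, pred).sum()
    total = gt.sum() + pred.sum()
    return 2.0 * inter / total if total else 1.0

def iou_score(gt, pred):
    """IoU = |S ∩ T| / |S ∪ T| for binary masks."""
    union = np.logical_or(gt, pred).sum()
    return np.logical_and(gt, pred).sum() / union if union else 1.0
```

Identical masks score 1.0 on both metrics; partially overlapping masks yield a DSC that is always at least as large as the IoU.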
Result and discussion
Table 5 includes recent transformer-based and hybrid deep learning models (2022–2025) relevant to medical imaging and DFU analysis. The proposed approach was assessed against numerous advanced DL schemes: TransUNet [31], Swin-UNETR [35], MedT (Gated Axial Transformer) [36], DE-ResUNet [37], hybrid CNN models [38] and ViT-Caps [39]. It is vital to note that all models were trained with identical experimental parameters, utilizing the datasets and metrics specified in the table. The trained models were subsequently validated and assessed on the test data. In all instances, the datasets were partitioned into 80% for training and 20% for testing.
Table 5.
Comparison with state-of-the-art transformer and hybrid models
| Model/Architecture | Core technique | Dataset/application | Accuracy (%) | DSC/IoU (%) | Remarks |
|---|---|---|---|---|---|
| TransUNet | CNN + Vision Transformer Hybrid | Medical lesion segmentation (retina & DFU thermography) | 95.4 | 94.6 | Combines CNN encoder with ViT decoder for improved feature attention |
| Swin-UNETR | Hierarchical Swin Transformer | Thermal image segmentation | 96.1 | 95.2 | Efficient hierarchical feature fusion; improved boundary delineation |
| MedT (Gated Axial Transformer) | Transformer with gated axial attention | Generic medical image segmentation | 96.5 | 95.4 | Improves small lesion segmentation and feature aggregation |
| DE-ResUNet | Double Encoder + Residual U-Net | DFU thermal image segmentation | 97.0 | 97.0 | Fusion of RGB and thermal data; improved ulcer boundary localization |
| Hybrid CNN-ViT (ResNet-ViT) | Residual CNN + ViT Fusion | Thermal foot ulcer recognition | 97.8 | 96.5 | Captures both global transformer context and local CNN textures |
| ViT-Caps | Vision Transformer + Capsule Network | Finger vein and skin ulcer classification | 97.2 | 96.8 | Integrates transformer attention with capsule routing for structural awareness |
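The 80%/20% partitioning protocol described above can be sketched as follows; the seed, sample count, and helper name are illustrative, not taken from the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so every compared model sees the same split

def split_80_20(n_samples):
    """Shuffled index arrays for an 80% train / 20% test partition."""
    idx = rng.permutation(n_samples)
    cut = int(0.8 * n_samples)
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_80_20(200)
print(len(train_idx), len(test_idx))  # 160 40
```

Fixing the seed and reusing the same index arrays for every model is what makes the comparison in Table 5 fair: each scheme sees exactly the same training and test images.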
Ablation experiments were conducted to quantify the contribution of each architectural component, as shown in Table 6, which demonstrates each module's effect on the network's stability and precision.
Table 6.
Ablation experiments
| Configuration | DSC (%) | IoU (%) | Accuracy (%) | Remarks |
|---|---|---|---|---|
| Without HCAM | 94.3 | 92.7 | 95.1 | Missing hybrid attention causes poor localization |
| Without Capsule Layer | 96.8 | 95.4 | 97.0 | Feature relationships under-represented |
| Without ELM Classifier | 97.2 | 96.1 | 97.8 | Slight drop in convergence speed |
| Full Model (Proposed) | 98.5 | 97.5 | 99.0 | Best performance with full integration |
Ablation analysis of the proposed algorithm (segmentation)
In this section, ablation studies demonstrate the impact of each element of the proposed architecture. During training, the loss of each variant was recorded to assess its efficacy, and the results are illustrated in Fig. 8. As shown, the modified Swin Transformer substantially improved the segmentation component of the model, yielding a lower loss throughout training. The ablation analysis validates the efficacy of the proposed architecture in segmenting diabetic thermal images and shows superior performance compared with existing algorithms. Table 7 reports the segmentation performance of the compared schemes.
Classification results
The experiments were carried out under two conditions. The first examined the impact of drop-out on the efficacy of the proposed approach; the second compared its classification performance with various existing DL algorithms. Figure 9 demonstrates the effectiveness of the proposed scheme under varied drop-out ratios. Normally, increasing drop-out degrades the performance of a learning model, but Fig. 9 clearly shows that the proposed scheme remains stable even as drop-out increases. The feed-forward layers in the proposed scheme maintain uniform effectiveness and also mitigate over-fitting during training and testing (Table 7).
Fig. 9.
Performance validation of the recommended means with the varied drop-outs
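For intuition on the drop-out experiment, the following minimal NumPy sketch of inverted drop-out (the variant commonly used in Keras-style frameworks; the helper and toy activations are hypothetical) shows how a fraction of units is zeroed and the survivors rescaled:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted drop-out: zero a fraction `rate` of units and rescale survivors
    by 1/(1 - rate) so the expected activation magnitude is unchanged."""
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep
    return activations * mask / keep

rng = np.random.default_rng(0)
activations = np.ones((4, 100))
for rate in (0.1, 0.3, 0.5):
    out = dropout(activations, rate, rng)
    print(f"rate={rate}: {(out == 0).mean():.2f} of units dropped")
```

The rescaling by 1/(1 - rate) is why a well-regularized model can tolerate higher drop-out ratios: the expected input to the next layer stays constant even as more units are silenced.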
Table 7.
Outcomes of segmentation for various algorithms following the ablation study utilizing test datasets
| Algorithms | DSC (%) | IoU (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| TransUNet | 77.31 | 77.80 | 76.4 | 73.12 |
| Swin-UNETR | 80.24 | 81.3 | 80.45 | 79.9 |
| MedT (Gated Axial Transformer) | 80.6 | 80.21 | 78.91 | 79.8 |
| DE-ResUNet | 82.3 | 82.6 | 81.59 | 80.5 |
| Hybrid CNN-ViT (ResNet-ViT) | 85.09 | 84.92 | 83.70 | 82.03 |
| ViT-Caps | 90.74 | 90.2 | 90.45 | 89.41 |
| Proposed Model | 98.5 | 97.5 | 97.07 | 98.65 |
Moreover, the performance of the proposed schema is validated in Fig. 10: the RMSE (root mean square error) between training and testing accuracy is very small (0.0001), which confirms the stability of the proposed schema in classifying normal and diabetic subjects from the thermal images.
Fig. 10.
Validation performance of the recommended approach (a) 20% testing data (b) 30% testing data
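The stability check above can be reproduced with a short helper; the accuracy curves below are illustrative placeholders, not the paper's recorded values:

```python
import numpy as np

def rmse(train_acc, test_acc):
    """Root mean square error between per-epoch training and testing accuracy curves."""
    train_acc = np.asarray(train_acc, dtype=float)
    test_acc = np.asarray(test_acc, dtype=float)
    return float(np.sqrt(np.mean((train_acc - test_acc) ** 2)))

# Illustrative accuracy curves (placeholders, not the paper's recorded values).
train_curve = [0.981, 0.985, 0.989, 0.990]
test_curve = [0.980, 0.986, 0.988, 0.990]
print(f"RMSE = {rmse(train_curve, test_curve):.6f}")
```

A near-zero RMSE between the two curves indicates that the model generalizes: training and testing accuracy track each other instead of diverging, which would signal over-fitting.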
Tables 8 and 9 present the findings of the varied schemes in classifying the thermal images. The learning model without any attention network produced the lowest classification performance, whereas the hybrid models proposed in [42–45] yielded significantly better results than conventional learning approaches. The integration of transformer and feed-forward layers using Extreme Learning Machines (ELM) achieved the highest classification accuracy of 0.987, surpassing the other approaches by a considerable margin.
Table 8.
Average performance metrics of the cutting-edge approaches for thermal images in detecting CG subjects
| Algorithms | Accuracy | Precision | Recall | F1-score | Computational cost (s/image) |
|---|---|---|---|---|---|
| TransUNet | 0.832 | 0.86 | 0.81 | 0.86 | 0.45 |
| Swin-UNETR | 0.83 | 0.70 | 0.82 | 0.733 | 0.52 |
| MedT (Gated Axial Transformer) | 0.88 | 0.812 | 0.88 | 0.825 | 0.49 |
| DE-ResUNet | 0.81 | 0.85 | 0.84 | 0.89 | 0.58 |
| Hybrid CNN-ViT (ResNet-ViT) | 0.82 | 0.80 | 0.89 | 0.89 | 0.51 |
| ViT-Caps | 0.90 | 0.85 | 0.82 | 0.845 | 0.43 |
| Proposed Model | 0.958 | 0.932 | 0.98 | 0.97 | 0.37 |
Table 9.
Average performance metrics of the cutting-edge approaches for thermal images in detecting DG subjects
| Algorithms | Accuracy | Precision | Recall | F1-score | Computational cost (s/image) |
|---|---|---|---|---|---|
| TransUNet | 0.86 | 0.78 | 0.89 | 0.83 | 0.46 |
| Swin-UNETR | 0.83 | 0.79 | 0.752 | 0.782 | 0.50 |
| MedT (Gated Axial Transformer) | 0.872 | 0.87 | 0.767 | 0.77 | 0.47 |
| DE-ResUNet | 0.80 | 0.79 | 0.79 | 0.775 | 0.57 |
| Hybrid CNN-ViT (ResNet-ViT) | 0.89 | 0.841 | 0.85 | 0.894 | 0.49 |
| ViT-Caps | 0.975 | 0.94 | 0.87 | 0.85 | 0.42 |
| Proposed Model | 0.987 | 0.97 | 0.95 | 0.952 | 0.37 |
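The ELM principle behind the classification head (a random, fixed hidden layer whose output weights are solved in closed form with the Moore-Penrose pseudo-inverse, instead of iterative back-propagation) can be sketched as follows; the class name, dimensions, and toy data are illustrative and not the paper's actual network:

```python
import numpy as np

class SimpleELM:
    """Minimal Extreme Learning Machine: a random, fixed hidden layer whose
    output weights are solved in closed form via the Moore-Penrose pseudo-inverse."""
    def __init__(self, n_hidden=64, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n_features = X.shape[1]
        # Random input weights are drawn once and never trained (the ELM principle).
        self.W = self.rng.standard_normal((n_features, self.n_hidden)) / np.sqrt(n_features)
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)      # hidden-layer activations
        self.beta = np.linalg.pinv(H) @ y     # closed-form output weights
        return self

    def predict(self, X):
        return (np.tanh(X @ self.W + self.b) @ self.beta > 0.5).astype(int)

# Toy separable task standing in for deep capsule features: label 1 when the feature sum is positive.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 8))
y = (X.sum(axis=1) > 0).astype(float)
model = SimpleELM(n_hidden=64).fit(X[:240], y[:240])
acc = (model.predict(X[240:]) == y[240:]).mean()
print(f"hold-out accuracy: {acc:.2f}")
```

Because only the output weights are solved, training reduces to a single pseudo-inverse, which is why ELM-style heads converge far faster than back-propagated classifiers.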
The computational cost analysis highlights that, while most transformer-based and CNN-based networks require 0.45–0.58 s per image for inference, the proposed U-MST + Capsule + ELM framework achieves superior accuracy with an average cost of only 0.37 s per image, demonstrating higher computational efficiency suitable for real-time clinical deployment.
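The per-image inference cost reported above can be measured with a simple wall-clock helper; the model stand-in, image sizes, and helper name below are hypothetical:

```python
import time
import numpy as np

def mean_inference_time(model_fn, images, warmup=2):
    """Average wall-clock seconds per image for `model_fn` (a stand-in callable)."""
    for img in images[:warmup]:              # warm-up passes, excluded from timing
        model_fn(img)
    start = time.perf_counter()
    for img in images:
        model_fn(img)
    return (time.perf_counter() - start) / len(images)

# A fixed matrix product stands in for a real segmentation/classification network.
dummy_model = lambda img: img @ np.ones((img.shape[1], 16))
images = [np.ones((224, 224)) for _ in range(10)]
print(f"{mean_inference_time(dummy_model, images):.4f} s/image")
```

Warm-up passes matter in practice: the first inference often pays one-time costs (memory allocation, graph compilation) that would otherwise inflate the reported per-image figure.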
Conclusion and future enhancement
A novel segmentation and classification model was proposed for the diagnosis of diabetic foot ulcers using thermal foot images. The proposed detection system combines the strength of the modified Swin transformer in capturing global context with the benefits of capsule networks integrated with dense feed-forward training networks, supporting better segmentation and classification. Thorough experiments were conducted on thermography datasets, evaluating performance indicators such as accuracy, precision, recall, F1-score, DSC, PSNR, and IoU. To demonstrate the merit of the proposed scheme, its performance was compared with various cutting-edge DL approaches and existing hybrid architectures. Results show that the proposed model achieved significant improvements over the other CNN-based models, reaching a prediction performance of 99%, 98.6% precision, 98.4% recall, 99% specificity, and 99% F1-score. The proposed ensemble deep learning architecture thus points a promising direction for DFU diagnosis systems. Although the proposed system achieved a superior segmentation and prediction accuracy of 99%, future research will aim to expand the dataset by incorporating multi-center and multi-ethnic populations to ensure wider clinical applicability. The model will also be validated on external open-access datasets to enhance its generalizability and robustness. Further work will explore model compression and quantization techniques to enable deployment on portable and low-power medical devices. Additionally, interpretability will be improved through advanced explainable AI methods such as Grad-CAM++ and SHAP to provide clearer insights into the decision-making process.
Acknowledgements
The authors wish to thank the Innovation and Commercial Center (ICC) of Universiti Tun Hussein Onn Malaysia (UTHM) for supporting this work through the consultancy project fund, an international research grant. The authors would also like to thank the management and administrative officials of UTHM, Malaysia, and M/s. Edwin Medicals and Monika Diabetic Centre, India, for their support in carrying out the research work successfully.
Author contributions
All authors contributed to the conception and design of the study. Yamunarani Thanikachalam was responsible for preparing materials, data collection, and analysis. Wan Suhaimizan Bin Wan Zaki, Ashok Vajravelu and Prabu Rathinam provided valuable input through technical discussions and contributed to the analysis of experimental results. Dr. E. Thangavelu offered insights from a clinical perspective. All authors have read and approved the final manuscript.
Funding
Open access funding provided by The Ministry of Higher Education Malaysia and Universiti Tun Hussein Onn Malaysia. This research did not receive any specific grant from any funding agency in the public, commercial or not-for-profit sector.
Data availability
The datasets used for training and evaluation of the pipeline are available upon request.
Declarations
Ethics approval
This research study involves human subject participation. Ethical approval will be obtained from the Institutional Ethics Committee prior to participant recruitment, and informed consent will be collected from all participants.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Cho NH, Kirigia J, Mbanya JC, Ogurstova K, Guariguata L, Rathmann W, Roglic G, Forouhi N, Dajani R, Esteghamatil A, et al. IDF diabetes atlas. 8th ed. Brussels, Belgium: International Diabetes Federation; 2017. [Google Scholar]
- 2.Sims DS Jr., Cavanagh PR, Ulbrecht JS. Risk factors in the diabetic foot: recognition and management. Phys Ther. 1998;68:1887–902. 10.1093/ptj/78.12.1887. [DOI] [PubMed] [Google Scholar]
- 3.Iversen M, Tell G, Riise T, Hanestad B, Østbye T, Graue M, et al. History of foot ulcer increases mortality among individuals with diabetes: ten-year follow-up of the Nord-Trøndelag Health Study, Norway. Diabetes Care. 2009;32:2193–9. 10.2337/dc09-0321. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Ring F. Thermal imaging today and its relevance to diabetes. J Diabetes Sci Technol. 2010;4:857–62. [CrossRef] [PubMed]. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Hernandez-Contreras D, Peregrina-Barreto H, Rangel-Magdaleno J, Gonzalez-Bernal J. Narrative review: diabetic foot and infrared thermography. Infrared Phys Technol. 2016;78:105–17. 10.1016/j.infrared.2016.09.001. [Google Scholar]
- 6.Martín-Vaquero J, Hernández Encinas A, Queiruga-Dios A, José Bullón J, Martínez-Nova A, Torreblanca González J, Bullón-Carbajo C. Review on wearables to monitor foot temperature in diabetic patients. Sensors. 2019;19:776. [CrossRef]. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Armstrong DG, Holtz-Neiderer K, Wendel C, Mohler MJ, Kimbriel HR, Lavery LA. Skin temperature monitoring reduces the risk for diabetic foot ulceration in high-risk patients. Am J Med. 2007;120:1042–6. 10.1016/j.amjmed.2007.06.022. [DOI] [PubMed] [Google Scholar]
- 8.Bagavathiappan S, Philip J, Jayakumar T, Raj B, Rao PNS, Varalakshmi M, Mohan V. Correlation between plantar foot temperature and diabetic neuropathy: a case study by using an infrared thermal imaging technique. J Diabetes Sci Technol. 2010;4:1386–92. [CrossRef]. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Roback K, Johansson M, Starkhammar A. Feasibility of a thermographic method for early detection of foot disorders in diabetes. Diabetes Technol Ther. 2009;11:663–7. 10.1089/dia.2009.0020. [DOI] [PubMed] [Google Scholar]
- 10.Lavery LA, Higgins KR, Lanctot DR, Constantinides GP, Zamorano RG, Athanasiou KA, Armstrong DG, Agrawal CM. Preventing diabetic foot ulcer recurrence in high-risk p. [DOI] [PubMed]
- 11.Chan AW, MacFarlane IA, Bowsher DR. Contact thermography of painful diabetic neuropathic foot. Diabetes Care. 1991;14:918–22. [CrossRef]. [DOI] [PubMed] [Google Scholar]
- 12.Sarvalingam P, Vajravelu A, Selvam J, Bin Ponniran A, Bin Wan Zaki WS, PK. A novel end-to-end learning framework based on optimised residual gated units and stacked pre-trained layers for detection of autism spectrum disorders (ASD). Brain Broad Research in Artificial Intelligence and Neuroscience. 2025;16(1):286–305. 10.70594/brain/16.1/21. [Google Scholar]
- 13.Kalaiyarasi D, Leopauline S, Krishnaveni S, Vajravelu A. Cloud computing-based computer forensics: a deep learning technique. Int J Electron Secur Digit Forensics. 2024;16(3):317–28. 10.1504/IJESDF.2024.138355. [Google Scholar]
- 14.Tasci B, Acharya MR, Baygin M, Dogan S, Tuncer T, Belhaouari SB. InCR: inception and concatenation residual block-based deep learning network for damaged building detection using remote sensing images. Int J Appl Earth Obs Geoinf. 2023;123:103483. [Google Scholar]
- 15.Nagase T, Sanada H, Takehara K, Oe M, Iizaka S, Ohashi Y, et al. Variations of plantar thermographic patterns in normal controls and non-ulcer diabetic patients: novel classification using angiosome concept. J Plast Reconstr Aesthet Surg. 2011;64:860–6. 10.1016/j.bjps.2010.12.060. [DOI] [PubMed] [Google Scholar]
- 16.Mori T, Nagase T, Takehara K, Oe M, Ohashi Y, Amemiya A, Noguchi H, Ueki K, Kadowaki T, Sanada H. Morphological pattern classification system for plantar thermography of patients with diabetes. J Diabetes Sci Technol. 2013;7:1102–12. [CrossRef] [PubMed]. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Jones BF. A reappraisal of the use of infrared thermal image analysis in medicine. IEEE Trans Med Imaging. 1998;17:1019–27. 10.1109/42.730419. [DOI] [PubMed] [Google Scholar]
- 18.Kaabouch N, Chen Y, Anderson J, Ames F, Paulson R. Asymmetry analysis based on genetic algorithms for the prediction of foot ulcers. In: Proceedings of the IS&T/SPIE electronic imaging, visualization and data analysis. San Jose, CA, USA; 2009. p. 724304.
- 19.Xing L, Li L, Wang Z, Ma H. An improved UNet model for foot ulcer image segmentation. In: 2022 7th International Conference on Image, Vision and Computing (ICIVC); 2022. pp. 393–397. 10.1109/ICIVC55077.2022.9886343
- 20.Mahbod A, Schaefer G, Ecker R, Ellinger I. Automatic foot ulcer segmentation using an ensemble of convolutional neural networks. In: 2022 26th Int Conf Pattern Recognit (ICPR); 2022. pp. 4358–64. 10.1109/ICPR56361.2022.9956253.
- 21.Bouallal D, Douzi H, Harba R. Diabetic foot thermal image segmentation using Double Encoder-ResUnet (DE-ResUnet). J Med Eng Technol. 2022;46(5):378–92. 10.1080/03091902.2022.2077997. [DOI] [PubMed] [Google Scholar]
- 22.Munadi K, Saddami K, Oktiana M, Roslidar R, Muchtar K, Melinda M, Muharar R, Syukri M, Abidin TF, Arnia F. A deep learning method for early detection of diabetic foot using decision fusion and thermal images. Appl Sci. 2022;12:7524. 10.3390/app12157524. [Google Scholar]
- 23.Bouallal D, et al. Segmentation of plantar foot thermal images: application to diabetic foot diagnosis. In: 2020 Int Conf Syst Signals Image Process (IWSSIP); 2020. pp. 116–21. 10.1109/IWSSIP48289.2020.9145167.
- 24.Amin J, Sharif M, Anjum MA, Khan HU, Malik MSA, Kadry S. An integrated design for classification and localization of diabetic foot ulcer based on CNN and YOLOv2-DFU models. IEEE Access. 2020;8:228586–97. 10.1109/ACCESS.2020.3045732. [Google Scholar]
- 25.Taşcı B, Acharya MR, Barua PD, Yildiz AM, Gun MV, Keles T, et al. A new lateral geniculate nucleus pattern-based environmental sound classification using a new large sound dataset. Appl Acoust. 2022;196:108897. [Google Scholar]
- 26.Tuncer T, Dogan S, Baygin M, Tasci I, Mungen B, Tasci B, et al. Directed Lobish-based explainable feature engineering model with TTPat and CWINCA for EEG artifact classification. Knowl Based Syst. 2024;305:112555. [Google Scholar]
- 27.Alzubaidi L, Fadhel MA, Oleiwi SR, Al-Shamma O, Zhang J. DFU_QUTNet: diabetic foot ulcer classification using novel deep convolutional neural network. Multimedia Tools Appl. 2020;79(21–22):15655–77. 10.1007/s11042-019-07820-w. [Google Scholar]
- 28.Tulloch J, Zamani R, Akrami M. Machine learning in the prevention, diagnosis and management of diabetic foot ulcers: a systematic review. IEEE Access. 2020;8:200000–17. 10.1109/ACCESS.2020.3035327. [Google Scholar]
- 29.Cui C, et al. Diabetic wound segmentation using convolutional neural networks. In: 2019 41st Annual Int Conf IEEE Eng Med Biology Soc (EMBC); 2019. pp. 1002–5. 10.1109/EMBC.2019.8856665. [DOI] [PMC free article] [PubMed]
- 30.Kaabouch N, Chen Y, Hu W-C, Anderson JW, Ames F, Paulson R. Enhancement of the asymmetry-based overlapping analysis rough features extraction. J Electron Imag. 2011;20:013012. 10.1117/1.JEI.20.1.013012. [Google Scholar]
- 31.Liu C, van Netten JJ, van Baal JG, Bus SA, van der Heijden F. Automatic detection of diabetic foot complications with infrared thermography by asymmetric analysis. J Biomed Opt. 2015;20:026003. [CrossRef]. [DOI] [PubMed] [Google Scholar]
- 32.Hernandez-Contreras D, Peregrina-Barreto H, Rangel-Magdaleno J, Ramirez-Cortes J, Renero-Carrillo F. Automatic classification of thermal patterns in diabetic foot based on morphological pattern spectrum. Infrared Phys Technol. 2015;73:149–57. 10.1016/j.infrared.2015.08.014. [Google Scholar]
- 33.Wu L, Huang R, He X, Tang L, Ma X. Advances in machine learning-aided thermal imaging for early detection of diabetic foot ulcers: a review. Biosensors. 2024;14(12):614. 10.3390/bios14120614. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Hernandez-Contreras DA, Peregrina-Barreto H, Rangel-Magdaleno J, Orihuela-Espina F. Statistical approximation of plantar temperature distribution on diabetic subjects based on beta mixture model. IEEE Access. 2019;7:28383–91. [CrossRef]. [Google Scholar]
- 35.Cruz-Vega I, Hernandez-Contreras D, Peregrina-Barreto H, Rangel-Magdaleno JJ, Ramirez-Cortes JM. Deep learning classification for diabetic foot thermograms. Sensors. 2020;20(6):1762. 10.3390/s20061762. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Goyal M, Reeves ND, Davison AK, Rajbhandari S, Spragg J, Yap MH. DFUNet: convolutional neural networks for diabetic foot ulcer classification. In: IEEE Transactions on Emerging Topics in Computational Intelligence; 2020. vol. 4, no. 5, pp. 728–739. 10.1109/TETCI.2018.2866254
- 37.da Oliveira ALC, Dantas DO. Faster R-CNN approach for diabetic foot ulcer detection. Dept. Comput, Univ. Federal de Sergipe, São Cristóvão, Brazil, pp. 1–10. Available [Online]: http://grand-challengepublic.s3.amazonaws.com/evaluationsupplementary/532/3c68e2c6-c6bd-456ebe40-68e11d5c3f6a/Faster_RCNN_DFU.pdf. Accessed 15 Dec 2020.
- 38.Rania N, Douzi H, Yves L, Sylvie T. Semantic segmentation of diabetic foot ulcer images: Dealing with small dataset in DL approaches. In: Proc Int Conf Image Signal Process Cham. Switzerland: Springer; 2020. pp. 162–169.
- 39.Li Y, Lu H, Wang Y, Gao R, Zhao C. Vit-cap: a novel vision Transformer-based capsule network model for finger vein recognition. Appl Sci. 2022;12:10364. 10.3390/app122010364. [Google Scholar]