2026 Jan 3;25(1):17. doi: 10.1007/s40200-025-01813-3

Table 5.

Comparison with state-of-the-art transformer and hybrid

| Model/Architecture | Core technique | Dataset/application | Accuracy (%) | DSC/IoU (%) | Remarks |
|---|---|---|---|---|---|
| TransUNet | CNN + Vision Transformer hybrid | Medical lesion segmentation (retina & DFU thermography) | 95.4 | 94.6 | Combines CNN encoder with ViT decoder for improved feature attention |
| Swin-UNETR | Hierarchical Swin Transformer | Thermal image segmentation | 96.1 | 95.2 | Efficient hierarchical feature fusion; improved boundary delineation |
| MedT (Gated Axial Transformer) | Transformer with gated axial attention | Generic medical image segmentation | 96.5 | 95.4 | Improves small lesion segmentation and feature aggregation |
| DE-ResUNet | Double encoder + residual U-Net | DFU thermal image segmentation | 97.0 | 97.0 | Fusion of RGB and thermal data; improved ulcer boundary localization |
| Hybrid CNN-ViT (ResNet-ViT) | Residual CNN + ViT fusion | Thermal foot ulcer recognition | 97.8 | 96.5 | Captures both global transformer context and local CNN textures |
| ViT-Caps | Vision Transformer + capsule network | Finger vein and skin ulcer classification | 97.2 | 96.8 | Integrates transformer attention with capsule routing for structural awareness |
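
The table reports a single "DSC/IoU (%)" column without stating which of the two overlap metrics each value represents. Since the Dice similarity coefficient (DSC) and intersection over union (IoU) can differ noticeably for the same prediction, the following minimal sketch shows how both might be computed from binary masks using their standard definitions. The function name `dice_and_iou`, the flat-list mask representation, and the empty-mask convention are assumptions made for illustration; this is not the authors' evaluation code.

```python
def dice_and_iou(pred, target):
    """Return (DSC, IoU) as percentages for two flat binary masks.

    `pred` and `target` are equal-length sequences of 0/1 pixel labels
    (a hypothetical representation chosen for this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty: treated here as perfect agreement
        # (an assumed convention; other conventions exist).
        return 100.0, 100.0

    dsc = 100.0 * 2 * intersection / (pred_area + target_area)
    iou = 100.0 * intersection / union
    return dsc, iou


# Toy 8-pixel example: 3 overlapping pixels, 4 foreground pixels in each mask.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 0, 1, 1]
dsc, iou = dice_and_iou(pred, target)
print(f"DSC = {dsc:.1f}%, IoU = {iou:.1f}%")  # DSC = 75.0%, IoU = 60.0%
```

For a single pair of masks the two metrics are related by IoU = DSC / (2 - DSC) when both are expressed as fractions, so the DSC is never lower than the IoU for the same prediction.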