2026 Jan 3;25(1):17. doi: 10.1007/s40200-025-01813-3

Table 5.

Comparison with state-of-the-art transformer and hybrid models

Model/Architecture	Core technique	Dataset/application	Accuracy (%)	DSC/IoU (%)	Remarks
TransUNet	CNN + Vision Transformer Hybrid	Medical lesion segmentation (retina & DFU thermography)	95.4	94.6	Combines CNN encoder with ViT decoder for improved feature attention
Swin-UNETR	Hierarchical Swin Transformer	Thermal image segmentation	96.1	95.2	Efficient hierarchical feature fusion; improved boundary delineation
MedT (Gated Axial Transformer)	Transformer with gated axial attention	Generic medical image segmentation	96.5	95.4	Improves small lesion segmentation and feature aggregation
DE-ResUNet	Double Encoder + Residual U-Net	DFU thermal image segmentation	97.0	97.0	Fusion of RGB and thermal data; improved ulcer boundary localization
Hybrid CNN-ViT (ResNet-ViT)	Residual CNN + ViT Fusion	Thermal foot ulcer recognition	97.8	96.5	Captures both global transformer context and local CNN textures
ViT-Caps	Vision Transformer + Capsule Network	Finger vein and skin ulcer classification	97.2	96.8	Integrates transformer attention with capsule routing for structural awareness
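
The DSC/IoU column reports overlap metrics. For readers unfamiliar with them, the following is a minimal sketch of how the Dice similarity coefficient (DSC) and intersection over union (IoU) are commonly computed from a predicted and a ground-truth binary mask. The function name `dice_and_iou`, the toy masks, and the handling of two empty masks are assumptions made for illustration; they are not taken from the article or from any of the models listed in the table.

```python
def dice_and_iou(pred, target):
    """Return (DSC, IoU) in percent for two flattened binary masks.

    Hypothetical helper for illustration; not from the article.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as lesion in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty. Treating this as perfect agreement is a
        # convention assumed here, not something the source specifies.
        return 100.0, 100.0

    dsc = 100.0 * 2 * intersection / (pred_area + target_area)
    iou = 100.0 * intersection / union
    return dsc, iou


# Toy flattened masks with made-up values (not data from Table 5).
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dsc, iou = dice_and_iou(pred, target)
print(f"DSC = {dsc:.2f}%, IoU = {iou:.2f}%")
```

For the same pair of masks the two metrics are related by DSC = 2·IoU / (1 + IoU), so DSC is never lower than IoU. The table reports a single combined DSC/IoU figure per model, and this sketch does not indicate which of the two each entry represents.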