Table 5.
Comparison with state-of-the-art transformer and hybrid models
| Model/Architecture | Core technique | Dataset/application | Accuracy (%) | DSC/IoU (%) | Remarks |
|---|---|---|---|---|---|
| TransUNet | CNN + Vision Transformer Hybrid | Medical lesion segmentation (retina & DFU thermography) | 95.4 | 94.6 | Combines a hybrid CNN-Transformer encoder with a cascaded upsampling decoder for improved feature attention |
| Swin-UNETR | Hierarchical Swin Transformer | Thermal image segmentation | 96.1 | 95.2 | Efficient hierarchical feature fusion; improved boundary delineation |
| MedT (Gated Axial Transformer) | Transformer with gated axial attention | Generic medical image segmentation | 96.5 | 95.4 | Improves small lesion segmentation and feature aggregation |
| DE-ResUNet | Double Encoder + Residual U-Net | DFU thermal image segmentation | 97.0 | 97.0 | Fusion of RGB and thermal data; improved ulcer boundary localization |
| Hybrid CNN-ViT (ResNet-ViT) | Residual CNN + ViT Fusion | Thermal foot ulcer recognition | 97.8 | 96.5 | Captures both global transformer context and local CNN textures |
| ViT-Caps | Vision Transformer + Capsule Network | Finger vein and skin ulcer classification | 97.2 | 96.8 | Integrates transformer attention with capsule routing for structural awareness |
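
The DSC/IoU column reports overlap between predicted and reference masks. As a minimal sketch of how these two metrics are commonly computed for binary segmentation, assuming masks are supplied as flat sequences of 0/1 values (the function name `dice_and_iou`, the input format, and the empty-mask convention are illustrative assumptions, not taken from the compared papers):

```python
def dice_and_iou(pred, target):
    """Return (DSC, IoU) in percent for two equally sized binary masks.

    Masks are flat sequences of 0/1 values (a hypothetical input format).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention,
        # not something stated in the source).
        return 100.0, 100.0

    dsc = 100.0 * 2 * intersection / (pred_area + target_area)
    iou = 100.0 * intersection / union
    return dsc, iou


# Toy example: 3 predicted pixels, 3 reference pixels, 2 shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dsc, iou = dice_and_iou(pred, target)
print(f"DSC = {dsc:.1f}%, IoU = {iou:.1f}%")
```

For the same pair of masks the two metrics are related by DSC = 2·IoU / (1 + IoU), so DSC is never lower than IoU. Values in a combined DSC/IoU column are therefore only directly comparable when they refer to the same metric.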