Author manuscript; available in PMC: 2025 Aug 11.
Published in final edited form as: Proc IEEE Int Symp Biomed Imaging. 2025 May 12;2025:10.1109/isbi60581.2025.10981108. doi: 10.1109/isbi60581.2025.10981108

EVIT-UNET: U-NET LIKE EFFICIENT VISION TRANSFORMER FOR MEDICAL IMAGE SEGMENTATION ON MOBILE AND EDGE DEVICES

Xin Li 1, Wenhui Zhu 1, Xuanzhao Dong 1, Oana M Dumitrascu 2, Yalin Wang 1
PMCID: PMC12337706  NIHMSID: NIHMS2046777  PMID: 40791942

Abstract

With the rapid development of deep learning, CNN-based U-shaped networks have succeeded in medical image segmentation and are widely applied to various tasks. However, their limited ability to capture global features hinders their performance in complex segmentation tasks. The rise of the Vision Transformer (ViT) has effectively compensated for this deficiency of CNNs and promoted the application of ViT-based U-shaped networks in medical image segmentation. However, the high computational demands of ViT make it unsuitable for many medical devices and mobile platforms, restricting its deployment on resource-constrained and edge devices. To address this, we propose EViT-UNet, an efficient ViT-based segmentation network that reduces computational complexity while maintaining accuracy, making it well suited to resource-constrained medical devices. EViT-UNet is built on a U-shaped architecture comprising an encoder, a decoder, a bottleneck layer, and skip connections, and it combines convolutional operations with self-attention mechanisms to optimize efficiency. Experimental results demonstrate that EViT-UNet achieves high accuracy in medical image segmentation while significantly reducing computational complexity. The code is available at https://github.com/Retinal-Research/EVIT-UNET.

Index Terms—UNet, Vision Transformer (ViT), Segmentation, Attention, Computational Efficiency

1. INTRODUCTION

With the rapid development of deep learning, medical image analysis has made significant progress, especially in image segmentation. U-shaped networks have become the mainstream architecture for segmentation. The classic U-shaped segmentation network [1] integrates a symmetric encoder-decoder configuration with a bottleneck layer and skip connections. It employs convolutional downsampling to capture multi-level features, skip connections to preserve spatial details, and decoder-based upsampling to restore image resolution for precise pixel-level segmentation. This structure has been highly successful in various medical image segmentation tasks, such as heart, organ, and lesion segmentation. Several works build on U-Net to enhance network performance, such as U-Net++ [2], which improves network efficiency by refining the skip connection mechanism. Furthermore, DeepLabV3 [3] enhances the processing of multi-scale features by optimizing convolutional operations. However, CNN-based U-Nets remain limited in capturing global semantic information and handling complex feature interactions.

The introduction of Vision Transformers (ViTs) addresses the limitations of CNNs in capturing global information. The self-attention mechanism in ViTs effectively captures global context [4], providing significant advantages over traditional CNNs in modeling global features and long-range dependencies. Researchers have therefore started exploring their application in medical image segmentation. For instance, Att-UNet [5] and TransUNet [6] introduced attention mechanisms and Transformers into the U-Net, and hybrid CNN-Transformer models such as HiFormer [7] and UCTransNet [8] were designed to improve network speed. Furthermore, SwinUNet [9] and MedT [10] employ specialized Transformer architectures to enhance network speed and performance. The self-attention mechanism in ViTs improves segmentation accuracy and robustness, further advancing medical image segmentation.

Although ViTs perform exceptionally well in vision tasks, their high computational complexity limits their applicability on resource-constrained devices [11, 12]. In medical image segmentation, reducing computational complexity while maintaining accuracy has long been a key goal for such devices [9, 7]. Thus, we propose EViT-UNet, an efficient ViT-based U-shaped network for medical image segmentation on mobile and edge devices. It inherits ViT’s ability to capture global information while reducing computational complexity by combining convolution with self-attention, which keeps accuracy high at a low computational cost. Tested on multiple datasets, EViT-UNet demonstrated superior segmentation accuracy and outperformed other popular segmentation frameworks. Our key contributions are as follows: (1) Developing an efficient U-shaped segmentation framework based on ViT that integrates an encoder, a decoder, and skip connections, and that shows outstanding performance across diverse datasets. (2) Achieving the best computational efficiency in comparative analyses with multiple networks. (3) Reducing computational complexity while maintaining high accuracy, enhancing the feasibility of deploying this technology in resource-limited settings for medical image segmentation tasks.

2. METHOD

2.1. Architecture overview

The overall architecture of our network is illustrated in Fig. 2(a). The design consists of an encoder, a decoder, a bottleneck layer, and skip connections used during upsampling. Both the encoder and decoder are structured into four stages, and we employ the EfficientFormerV2 block [16] as the basic unit. The input first passes through an initial feature extraction block (stem), which downsamples it to a feature map of size $40 \times \frac{H}{4} \times \frac{W}{4}$. The feature map is then downsampled by a factor of 2 after the blocks in each encoder stage. The encoder adopts the channel configuration shown in Fig. 2(a). Global feature fusion occurs in the encoder’s final stage, and the resulting features are passed to the decoder, which we design to be symmetric to the encoder. In each decoder stage, the decoder features are combined with the encoder features through skip connections to restore image features and are then upsampled by 2×. Finally, the upsampling module performs 4× upsampling and outputs pixel-level predictions.
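As a rough illustration of the resolution schedule described above, the following sketch computes the spatial size of the feature map at each of the four encoder stages. It assumes a stem stride of 4 and a stride-2 downsampling between consecutive stages (three in total); the function name `stage_resolutions` is hypothetical, and channel counts beyond the stem's 40 are left out because the text only gives them in Fig. 2(a).

```python
def stage_resolutions(height, width, num_stages=4, stem_stride=4, stage_stride=2):
    """Return the (H, W) of the feature map at each encoder stage.

    Assumption: the stem reduces the input by `stem_stride`, and every
    transition between consecutive stages reduces it by `stage_stride`.
    """
    h, w = height // stem_stride, width // stem_stride
    sizes = [(h, w)]
    for _ in range(num_stages - 1):
        h, w = h // stage_stride, w // stage_stride
        sizes.append((h, w))
    return sizes


# A 224x224 input, as used in the experiments.
encoder_sizes = stage_resolutions(224, 224)
# The symmetric decoder revisits the same resolutions in reverse order.
decoder_sizes = encoder_sizes[::-1]
# The final upsampling module applies a further 4x to reach the input size.
output_size = (decoder_sizes[-1][0] * 4, decoder_sizes[-1][1] * 4)

print(encoder_sizes)  # [(56, 56), (28, 28), (14, 14), (7, 7)]
print(output_size)    # (224, 224)
```

Under these assumptions the decoder's three 2× upsampling steps followed by the final 4× step recover the original 224×224 resolution.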

Fig. 2.


(a) The architecture of EViT-UNet, which is composed of an encoder, bottleneck, decoder, and skip connections. "Local" blocks use convolution, while "Global+Local" blocks use a combination of convolution and self-attention. (b) The details of the "Local" blocks. "DW.Conv" is the depthwise convolution [13]. (c) The details of the "Global+Local" blocks.

2.2. EfficientFormer block

Unlike models that use pure Transformer and self-attention encoders, our network adopts a hybrid approach that combines convolution with self-attention modules. In the high-resolution stages, self-attention would require calculating interactions between all pixels, leading to significant computational overhead [4]. In these stages, our blocks therefore use depthwise (DW) convolution [13] to construct their feed-forward network (FFN) and extract local features, as illustrated in Fig. 2(b). Compared to standard convolution, DW convolution applies one filter per input channel, significantly reducing the computational complexity while enhancing the local features. The process can be described as:

$$X_{i+1,j} = S_{i,j} \cdot \mathrm{FFN}(X_{i,j}) + X_{i,j}$$

where $i$ denotes the $i$-th layer in stage $j$, and $S$ is a learnable layer scale [17].
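To make the efficiency argument concrete, the sketch below counts multiply-accumulate operations (MACs) for a standard convolution, a depthwise convolution, and the two matrix products of self-attention. The 3×3 kernel size, the use of the same channel count at both resolutions, and all function names are assumptions made for illustration; only the 40-channel, 56×56 feature map (stem output for a 224×224 input) comes from the text.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs of a k x k standard convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k


def depthwise_conv_macs(h, w, channels, k):
    """MACs of a k x k depthwise convolution (one filter per channel)."""
    return h * w * channels * k * k


def attention_macs(num_tokens, channels):
    """MACs of Q @ K^T plus (attention weights) @ V for one head."""
    return 2 * num_tokens * num_tokens * channels


h = w = 56       # stem output resolution for a 224x224 input
channels = 40    # stem output channels stated in the text
kernel = 3       # assumed kernel size

std = standard_conv_macs(h, w, channels, channels, kernel)
dw = depthwise_conv_macs(h, w, channels, kernel)
# The saving factor of depthwise over standard convolution equals the channel count.
conv_saving = std // dw

# Attention cost grows quadratically with the token count, so moving from
# 56x56 tokens to 7x7 tokens shrinks it by (3136 / 49) ** 2.
attn_high = attention_macs(56 * 56, channels)
attn_low = attention_macs(7 * 7, channels)
attn_saving = attn_high // attn_low

print(conv_saving, attn_saving)
```

The quadratic growth of the attention term is why the design keeps self-attention out of the high-resolution stages and relies on depthwise convolution there.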

In the low-resolution stages, the computational burden of the self-attention mechanism is significantly reduced. Here our blocks introduce the multi-head self-attention (MHSA) mechanism [4] (Fig. 2(c)), which enhances the ability to capture global features and enriches multi-scale features in the encoder. In the decoder, MHSA improves the accuracy of image reconstruction by aggregating global and local features. This approach balances accuracy and computational efficiency, allowing the model to capture complex global dependencies without significantly increasing the computational load. This process can be described as follows:

$$X_{i+1,j} = S_{i,j} \cdot \mathrm{MHSA}(\mathrm{Proj}(X_{i,j})) + X_{i,j}$$

The input features are projected through the mapping function $\mathrm{Proj}(X_{i,j})$ via linear transformations to obtain the query ($Q$), key ($K$), and value ($V$) in the attention mechanism:

$$\mathrm{MHSA}(Q, K, V) = \mathrm{Softmax}(Q \cdot K^{T} + \mathrm{ab}) \cdot V$$

where $\mathrm{ab}$ is the learnable attention bias for position encoding.
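The attention formula above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it follows the formula as written (no scaling by the head dimension), treats the bias as a plain array of shape (heads, N, N), and uses hypothetical names and sizes throughout.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_with_bias(q, k, v, ab):
    """Compute Softmax(Q K^T + ab) V.

    q, k, v: arrays of shape (heads, N, d)
    ab:      attention bias of shape (heads, N, N); learnable in a real model
    Returns the output (heads, N, d) and the attention weights (heads, N, N).
    """
    scores = q @ np.swapaxes(k, -1, -2) + ab
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
heads, n, d = 2, 49, 8  # e.g. a 7x7 low-resolution feature map flattened to 49 tokens
q = rng.standard_normal((heads, n, d))
k = rng.standard_normal((heads, n, d))
v = rng.standard_normal((heads, n, d))
ab = np.zeros((heads, n, n))  # zero bias stands in for the learned values

out, weights = attention_with_bias(q, k, v, ab)

# With all-zero queries and bias, every token attends uniformly,
# so each output token is the mean of the value vectors.
uniform_out, _ = attention_with_bias(np.zeros_like(q), k, v, ab)
```

Because the bias is added before the softmax, it shifts how much each token attends to every other token independently of the content of Q and K, which is how it can encode position.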

2.3. Downsampling and Upsampling

For downsampling, convolution is likewise employed in the high-resolution stages to reduce the size of the feature maps efficiently. In the low-resolution stages, we adopt self-attention for downsampling, which adjusts the number of query tokens so that global dependencies and multi-scale features are captured while the resolution is reduced. Because the resolution in these stages is already low, the computational cost of self-attention remains manageable. We propose a symmetric design for the decoder: self-attention upsampling is used in the low-resolution stages by adjusting the number of query tokens, while convolution operations upsample the high-resolution stages. This design reduces computational complexity while ensuring accurate image reconstruction. The self-attention downsampling/upsampling can be described as:

$$\mathrm{Out}_{[B,H,r \times N,C]} = \left(Q_{[B,H,r \times N,C]} \cdot K^{T}_{[B,H,C,N]}\right) \cdot V_{[B,H,N,C]}$$

where $r$ is the scaling factor: $r = \frac{1}{2}$ corresponds to downsampling, and $r = 2$ corresponds to upsampling.
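The shape bookkeeping in this formula can be checked with a small NumPy sketch. The text does not specify how the r×N query tokens are produced, so the helper `make_queries` below is purely hypothetical: it averages adjacent token pairs for r = 1/2 and repeats each token for r = 2. The attention product follows the formula literally, without a softmax.

```python
import numpy as np


def make_queries(x, r):
    """Hypothetical helper: build r*N query tokens from x of shape (B, H, N, C)."""
    b, h, n, c = x.shape
    if r == 0.5:
        # Average adjacent token pairs: N tokens -> N/2 tokens (N must be even).
        return x.reshape(b, h, n // 2, 2, c).mean(axis=3)
    if r == 2:
        # Repeat each token twice: N tokens -> 2N tokens.
        return np.repeat(x, 2, axis=2)
    raise ValueError("this sketch only supports r = 0.5 or r = 2")


def resample_attention(q, k, v):
    """Out[B,H,rN,C] = (Q[B,H,rN,C] @ K^T[B,H,C,N]) @ V[B,H,N,C]."""
    return (q @ np.swapaxes(k, -1, -2)) @ v


rng = np.random.default_rng(0)
batch, heads, n, c = 1, 2, 16, 8
x = rng.standard_normal((batch, heads, n, c))
k = rng.standard_normal((batch, heads, n, c))
v = rng.standard_normal((batch, heads, n, c))

down = resample_attention(make_queries(x, 0.5), k, v)  # fewer query tokens
up = resample_attention(make_queries(x, 2), k, v)      # more query tokens

print(down.shape, up.shape)  # (1, 2, 8, 8) (1, 2, 32, 8)
```

The number of output tokens is set entirely by the number of queries, while the keys and values still cover all N input tokens, so every output token can draw on the full feature map.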

2.4. Skip Connection

In U-shaped segmentation networks, skip connections play a crucial role by passing the features collected by the encoder to the decoder, which helps retain low-level features [1]. However, recent studies have identified limitations of traditional skip connections [8]. Simply concatenating the encoder and decoder features may introduce redundancy, and since skip connections primarily pass local features, they struggle to capture global dependencies in more complex segmentation tasks. Some studies have introduced attention mechanisms into skip connections and achieved promising results [8, 5]. Building on these findings, we introduce channel attention [8] into the skip connections. This method enhances feature fusion by applying attention to emphasize important feature channels and suppress redundancy. It also facilitates better global dependency modeling across different feature scales while introducing minimal additional computational overhead. The channel-based skip connection can be described as follows:

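$$\mathrm{att}_{\mathrm{avg}} = \frac{\mathrm{MLP}_x(\mathrm{AVGPool}(x)) + \mathrm{MLP}_g(\mathrm{AVGPool}(g))}{2}$$

$$\mathrm{out} = \mathrm{ReLU}\left(x \times \mathrm{Sigmoid}(\mathrm{att}_{\mathrm{avg}})\right)$$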

Here, x is the current feature map from the previous layer, and g is the skip connection feature map from the encoder during the downsampling process.
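A minimal NumPy sketch of this channel-attention skip connection is given below. The structure of the two MLPs is not specified in the text, so a single hidden layer with random weights is assumed here, and `x` and `g` are assumed to have the same number of channels; all names and sizes are illustrative.

```python
import numpy as np


def mlp(pooled, w1, w2):
    """Assumed two-layer MLP with a ReLU in between; pooled has shape (B, C)."""
    return np.maximum(pooled @ w1, 0.0) @ w2


def channel_attention_skip(x, g, params_x, params_g):
    """out = ReLU(x * Sigmoid((MLP_x(AVGPool(x)) + MLP_g(AVGPool(g))) / 2)).

    x: current (decoder-side) feature map, shape (B, C, H, W)
    g: skip-connection feature map from the encoder, shape (B, C, H, W)
    """
    pooled_x = x.mean(axis=(2, 3))  # global average pooling -> (B, C)
    pooled_g = g.mean(axis=(2, 3))
    att_avg = (mlp(pooled_x, *params_x) + mlp(pooled_g, *params_g)) / 2.0
    scale = 1.0 / (1.0 + np.exp(-att_avg))  # sigmoid, one weight per channel
    return np.maximum(x * scale[:, :, None, None], 0.0)


rng = np.random.default_rng(0)
batch, channels, hidden, size = 2, 16, 4, 14
x = rng.standard_normal((batch, channels, size, size))
g = rng.standard_normal((batch, channels, size, size))
# Random weights stand in for the learned MLP parameters.
params_x = (rng.standard_normal((channels, hidden)), rng.standard_normal((hidden, channels)))
params_g = (rng.standard_normal((channels, hidden)), rng.standard_normal((hidden, channels)))

out = channel_attention_skip(x, g, params_x, params_g)
print(out.shape)  # (2, 16, 14, 14)
```

Since the sigmoid produces one weight in (0, 1) per channel, the operation can only attenuate channels of `x`, which is how redundant channels are suppressed at the cost of two small matrix products per skip connection.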

3. EXPERIMENTS AND RESULTS

Synapse multi-organ segmentation dataset (Synapse)

The Synapse multi-organ segmentation dataset (Synapse) [19] consists of 30 cases, with a total of 3779 axial abdominal clinical CT images. The dataset is split into 18 training cases and 12 testing cases. Our method is evaluated on eight abdominal organs: the aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, and stomach.

Gland segmentation dataset (GlaS) [20] and Multi-Organ Nucleus Segmentation (MoNuSeg) [21]

The gland segmentation dataset (GlaS) [20] contains 85 images for training and 80 for testing, and the Multi-Organ Nucleus Segmentation dataset (MoNuSeg) [21] contains 30 images for training and 14 for testing. We perform 5-fold cross-validation on both datasets.
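For readers unfamiliar with the protocol, a 5-fold split can be sketched as below. The text does not state which images are folded or how they are shuffled, so this example assumes the folds are drawn from the training images (85 for GlaS, 30 for MoNuSeg) with a fixed random seed; the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(num_items, k=5, seed=0):
    """Return k (train_indices, val_indices) pairs covering all items."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits


glas_splits = k_fold_indices(85)     # 85 GlaS training images
monuseg_splits = k_fold_indices(30)  # 30 MoNuSeg training images

for train, val in glas_splits:
    print(len(train), len(val))  # 68 17 for every fold
```

Each image serves as validation data exactly once, and the reported mean and standard deviation are then taken across the five runs.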

Implementation details

Our network is implemented in Python 3.10 and PyTorch 2.0. Input images have a size of 224×224, the batch size is 32, and training is conducted on a single Nvidia A100 GPU. We use the ImageNet pre-trained weights of EfficientFormerV2, adapting and loading them into both the encoder and decoder to initialize as many model parameters as possible. Optimization is performed using the SGD optimizer with backpropagation.

Results

The results are shown in Table 1 for the Synapse dataset and in Table 2 for the GlaS and MoNuSeg datasets. On Synapse, our model achieves an average DSC of 80.87%, surpassing the best comparison method (SelfReg + SwinUNet, 80.54%) by 0.33%. On the GlaS dataset, our model achieves the best DSC of 92.44% and the best IOU of 86.50%. On the MoNuSeg dataset, our model achieves a DSC of 79.27% and an IOU of 65.87%, outperforming most comparison methods; only SelfReg + SwinUNet reports a higher DSC (79.38%). We also provide visualization results for the Synapse (Fig. 3(A)), GlaS (Fig. 3(B)), and MoNuSeg (Fig. 3(C)) datasets to illustrate the performance of our method. Most importantly, we compare computational complexity on the Synapse dataset: our method is the most computationally efficient of all compared approaches, requiring only 6.39 GMac (Fig. 1).
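The two metrics reported in the tables, the Dice similarity coefficient (DSC) and intersection over union (IOU), can be computed for a pair of binary masks as in the sketch below. This uses the standard definitions rather than the authors' evaluation script; the function name and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice_and_iou(pred, target):
    """Return (DSC, IOU) for two flat binary masks given as sequences of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    # Convention assumed here: two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_sum + target_sum) if pred_sum + target_sum else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(round(dice, 4), iou)  # 0.6667 0.5
```

For a single pair of masks the two metrics are linked by IOU = DSC / (2 − DSC); this identity does not carry over exactly to scores that are averaged over many images, as in the tables.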

Table 1.

Comparison with other methods on the Synapse multi-organ CT dataset. DSC denotes the Dice similarity coefficient (%).

| Methods | Average DSC | Aorta | Gallbladder | Kidney(L) | Kidney(R) | Liver | Pancreas | Spleen | Stomach |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R50 U-Net [1] | 74.68 | 87.74 | 63.66 | 80.60 | 78.19 | 93.74 | 56.90 | 85.87 | 74.16 |
| U-Net [1] | 76.85 | 89.07 | 69.72 | 77.77 | 68.60 | 93.43 | 53.98 | 86.67 | 75.58 |
| R50 Att-UNet [5] | 75.57 | 55.92 | 63.91 | 79.20 | 72.71 | 93.56 | 49.37 | 87.19 | 74.95 |
| Att-UNet [5] | 77.77 | 89.55 | 68.88 | 77.98 | 71.11 | 93.57 | 58.04 | 87.30 | 75.75 |
| TransUNet [6] | 77.48 | 87.23 | 63.13 | 81.87 | 77.02 | 94.08 | 55.86 | 85.08 | 75.62 |
| SwinUNet [9] | 79.13 | 85.47 | 66.53 | 83.28 | 79.61 | 94.29 | 56.58 | 90.66 | 76.60 |
| LeViT-UNet [14] | 78.53 | 78.53 | 62.23 | 84.61 | 80.25 | 93.11 | 59.07 | 88.86 | 72.76 |
| DeepLabv3 [3] | 77.63 | 88.04 | 66.51 | 82.76 | 74.21 | 91.23 | 58.32 | 87.43 | 73.53 |
| HiFormer [7] | 80.29 | 85.63 | 73.29 | 82.39 | 64.84 | 94.22 | 60.84 | 91.03 | 78.07 |
| SelfReg + UNet [15, 1] | 80.34 | 88.74 | 71.78 | 85.32 | 80.71 | 93.80 | 62.22 | 84.78 | 75.39 |
| SelfReg + SwinUNet [15, 9] | 80.54 | 86.07 | 69.65 | 85.12 | 82.58 | 94.18 | 61.08 | 87.42 | 78.22 |
| Ours | 80.87 | 87.13 | 66.53 | 85.45 | 83.14 | 94.92 | 62.92 | 89.66 | 77.18 |
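As a quick consistency check on Table 1, the average DSC column can be recomputed as the unweighted mean of the eight per-organ scores. The sketch below does this for the two top rows, using the values copied from the table; treating the average as an unweighted mean is an assumption, although it reproduces the reported figures to two decimals.

```python
# Per-organ DSC values copied from Table 1, in the order: aorta, gallbladder,
# kidney (L), kidney (R), liver, pancreas, spleen, stomach.
organ_dsc = {
    "Ours": [87.13, 66.53, 85.45, 83.14, 94.92, 62.92, 89.66, 77.18],
    "SelfReg + SwinUNet": [86.07, 69.65, 85.12, 82.58, 94.18, 61.08, 87.42, 78.22],
}


def average_dsc(scores):
    """Unweighted mean over organs (assumed to be how the column is computed)."""
    return sum(scores) / len(scores)


ours = average_dsc(organ_dsc["Ours"])
runner_up = average_dsc(organ_dsc["SelfReg + SwinUNet"])
margin = ours - runner_up

print(round(ours, 2), round(runner_up, 2), round(margin, 2))  # 80.87 80.54 0.33
```

The recomputed means agree with the reported 80.87% and 80.54%, and their difference matches the 0.33% margin quoted in the results.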

Table 2.

Comparison of different methods on the GlaS and MoNuSeg datasets.

| Method | GlaS DSC (%) | GlaS IOU (%) | MoNuSeg DSC (%) | MoNuSeg IOU (%) |
| --- | --- | --- | --- | --- |
| U-Net [1] | 85.45±1.25 | 74.78±1.67 | 76.45±2.62 | 62.86±3.00 |
| UNet++ [2] | 87.56±1.17 | 79.13±1.70 | 77.01±2.10 | 63.04±2.54 |
| AttUNet [5] | 88.80±1.07 | 80.69±1.66 | 76.67±1.06 | 63.47±1.16 |
| MRUNet [18] | 88.73±1.17 | 80.89±1.67 | 78.22±2.47 | 64.83±2.87 |
| TransUNet [6] | 88.40±0.74 | 80.40±1.04 | 78.53±1.06 | 65.05±1.28 |
| MedT [10] | 85.92±2.93 | 75.47±3.46 | 77.46±2.38 | 63.37±3.11 |
| SwinUNet [9] | 89.58±0.57 | 82.06±0.73 | 77.69±0.94 | 63.77±1.15 |
| UCTransNet [8] | 90.18±0.71 | 82.96±1.06 | 79.08±0.67 | 65.50±0.91 |
| SelfReg + SwinUNet [15] | 91.62±0.16 | 85.29±0.30 | 79.38±0.15 | 65.87±0.2 |
| Ours | 92.44±0.23 | 86.50±0.38 | 79.27±0.24 | 65.87±0.21 |

Fig. 3.


Comparison of segmentation results on the Synapse (A), GlaS (B), and MoNuSeg (C) datasets.

Fig. 1.


Comparison of model size, FLOPs, and performance (Dice on Synapse dataset). The area of each circle is proportional to the number of parameters (model size).

4. CONCLUSION AND DISCUSSION

In conclusion, we have developed a segmentation framework that achieves outstanding performance and offers superior computational efficiency. Our model surpasses numerous state-of-the-art methods in accuracy while maintaining a lower computational burden, making it particularly suited to devices with limited computational resources, such as medical devices. These qualities emphasize the model’s suitability for performance-critical, real-world applications.

Despite the model’s outstanding performance and high efficiency, some limitations remain regarding adaptability and practicality for medical devices. Medical devices have complex requirements, and while our framework performs well in experiments, further optimization is needed for broader deployment in embedded and portable systems. Future work can focus on fine-tuning the model for specific hardware implementations, thereby enhancing its applicability in real-world medical devices.

ACKNOWLEDGMENTS

This work was supported by grants from the National Institutes of Health (R01EY032125 and R01DE030286) and the State of Arizona via the Arizona Alzheimer Consortium.

Footnotes

COMPLIANCE WITH ETHICAL STANDARDS

This research study was conducted retrospectively using human subject data made available in open access by Synapse [19], GlaS [20], and MoNuSeg [21]. Ethical approval was not required, as confirmed by the license attached to the open-access data.

REFERENCES

  • [1] Ronneberger Olaf, Fischer Philipp, and Brox Thomas, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241.
  • [2] Zhou Zongwei, Siddiquee Md Mahfuzur Rahman, Tajbakhsh Nima, and Liang Jianming, “Unet++: A nested unet architecture for medical image segmentation,” in MICCAI. Springer, 2018, pp. 3–11.
  • [3] Chen Liang-Chieh, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
  • [4] Vaswani A, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.
  • [5] Oktay Ozan, Schlemper Jo, Le Folgoc Loic, Lee Matthew, Heinrich, et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.
  • [6] Chen Jieneng, Lu Yongyi, Yu Qihang, Luo Xiangde, Adeli Ehsan, Wang, et al., “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.
  • [7] Heidari Moein, Kazerouni Amirhossein, Soltany Milad, Azad, et al., “Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation,” in CVPR, 2023, pp. 6202–6212.
  • [8] Wang Haonan, Cao Peng, Wang Jiaqi, and Zaiane Osmar R, “Uctransnet: rethinking the skip connections in unet from a channel-wise perspective with transformer,” in AAAI, 2022, vol. 36, pp. 2441–2449.
  • [9] Cao Hu, Wang Yueyue, Chen Joy, Jiang Dongsheng, Zhang Xiaopeng, Tian Qi, and Wang Manning, “Swinunet: Unet-like pure transformer for medical image segmentation,” in ECCV. Springer, 2022, pp. 205–218.
  • [10] Qi Qingfu, Lin Liyuan, Zhang Rui, and Xue Chengrong, “Medt: Using multimodal encoding-decoding network as in transformer for multimodal sentiment analysis,” IEEE Access, vol. 10, pp. 28750–28759, 2022.
  • [11] Fan Qihang, Huang Huaibo, Zhou Xiaoqiang, and He Ran, “Lightweight vision transformer with bidirectional interaction,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [12] Valanarasu Jeya Maria Jose and Patel Vishal M, “Unext: Mlp-based rapid medical image segmentation network,” in MICCAI. Springer, 2022, pp. 23–33.
  • [13] Howard Andrew G, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [14] Xu Guoping, Zhang Xuan, He Xinwei, and Wu Xinglong, “Levit-unet: Make faster encoders with transformer for medical image segmentation,” in PRCV. Springer, 2023, pp. 42–53.
  • [15] Zhu Wenhui, Chen Xiwen, Qiu Peijie, Farazi Mohammad, et al., “Selfreg-unet: Self-regularized unet for medical image segmentation,” arXiv preprint arXiv:2406.14896, 2024.
  • [16] Li Yanyu, Hu Ju, Wen Yang, Evangelidis Georgios, Salahi Kamyar, Wang Yanzhi, Tulyakov Sergey, and Ren Jian, “Rethinking vision transformers for mobilenet size and speed,” in CVPR, 2023, pp. 16889–16900.
  • [17] Yu Weihao, Luo Mi, Zhou Pan, Si Chenyang, Zhou Yichen, Wang, et al., “Metaformer is actually what you need for vision,” in CVPR, 2022, pp. 10819–10829.
  • [18] Ding Hongwei, Cui Xiaohui, Chen Leiyang, and Zhao Kun, “Mru-net: a u-shaped network for retinal vessel segmentation,” Applied Sciences, vol. 10, no. 19, pp. 6823, 2020.
  • [19] Landman Bennett, Xu Zhoubing, J Igelsias Martin Styner, Langerak Thomas, and Klein Arno, “Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge,” in Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, 2015, vol. 5, p. 12.
  • [20] Sirinukunwattana K, Pluim JPW, Chen H, et al., “Gland segmentation in colon histology images: The glas challenge contest,” Med Image Anal, vol. 35, pp. 489–502, Jan 2017.
  • [21] Kumar N, Verma R, Sharma S, et al., “A Dataset and a Technique for Generalized Nuclear Segmentation for Computational Pathology,” IEEE Trans Med Imaging, vol. 36, no. 7, pp. 1550–1560, Jul 2017.
