Scientific Reports
. 2026 Jan 20;16:5807. doi: 10.1038/s41598-026-35771-4

Lightweight SwiM-UNet with multi-dimensional adaptor for efficient on-device medical image segmentation

Yeonwoo Noh 1, Seongwook Lee 2, Seyong Jin 2, Yunyoung Chang 3, Dong-Ok Won 4,7,8, Minwoo Lee 6,8, Wonjong Noh 5,8
PMCID: PMC12894686  PMID: 41559378

Abstract

For medical image segmentation, transformer-based models have demonstrated superior performance. However, their high computational complexity remains a significant challenge. In contrast, Mamba provides a more computationally efficient alternative, though its segmentation performance is generally inferior to that of transformers. This study proposes a novel lightweight hybrid model based on U-Net, named SwiM-UNet, which represents the first Mamba–transformer hybrid model specifically designed for processing three-dimensional data. Specifically, efficient TSMamba (eTSMamba) blocks are incorporated in the early stages of the U-Net architecture to effectively manage computational overhead, while efficient Swin transformer (eSwin) blocks are employed in the later stages to capture long-range dependencies and local contextual information. Additionally, the model strategically integrates both the Mamba and Swin transformer architectures through a Mamba–Swin adapter (MS-adapter). The proposed MS-adapter comprises three sub-adapters that emphasize local information along the x-, y-, and z-axes, as well as channel-wise features between the eTSMamba and eSwin modules, and includes gating mechanisms to balance the contributions of the sub-adapters. Moreover, a low-rank MLP is utilized in the encoder, and channel reduction is applied in the decoder to further enhance computational efficiency. Performance evaluations conducted on the publicly available BraTS2023 and BraTS2024 datasets demonstrate that the proposed model surpasses state-of-the-art benchmark models while maintaining low computational complexity.

Keywords: Adaptor, Brain tumor segmentation, Hybrid model, Lightweight model, Mamba, Swin transformer

Subject terms: Computational biology and bioinformatics, Engineering, Mathematics and computing

Introduction

Backgrounds

Medical image segmentation plays a critical role in clinical decision-making, supporting essential tasks such as disease diagnosis, treatment planning, surgical navigation, and radiation therapy. As modern healthcare increasingly relies on imaging across diverse modalities, including CT, MRI, and ultrasound, the demand for segmentation methods that are both highly accurate and fast has grown substantially.

Recent advances in artificial intelligence have significantly improved segmentation performance, with deep learning models achieving remarkable precision in delineating anatomical structures and pathological regions. This progress has been largely driven by advanced neural architectures, ranging from Convolutional Neural Networks (CNNs) to Transformer-based models, which have enabled more effective and scalable visual representation learning.

Among these, Transformer-based architectures, such as Vision Transformers (ViTs)1 and Swin Transformers2, have had a profound impact on computer vision. Since the introduction of ViT, Transformers have shifted the paradigm of visual modeling by leveraging self-attention to capture global contextual information. However, self-attention inherently struggles to model fine-grained local features. To address this, enhanced local self-attention mechanisms3 have been integrated into architectures like the Swin Transformer, which employs a shifted-window strategy. The effectiveness of Swin Transformers has been demonstrated in practice, including their role in the winning model of the 2023 BraTS challenge4. Consequently, Swin-based and other Transformer-enhanced U-Net variants have become state-of-the-art approaches in medical image segmentation.

Despite these strengths, Transformers suffer from a notable drawback: their computational cost scales quadratically with image size due to the attention mechanism. This makes real-time or resource-constrained deployment challenging, especially in clinical environments requiring fast, reliable inference.

To overcome these computational limitations, the Mamba architecture5 was recently introduced. Mamba replaces the traditional attention mechanism with a state-space model (SSM)6, achieving far greater computational efficiency, particularly for long sequences. Building on this, Vision Mamba7 has been developed for visual tasks and has gained significant attention in imaging applications. However, despite its efficiency advantages, Mamba's segmentation performance still generally lags behind that of leading Transformer-based models.

Given these constraints, recent research has increasingly focused on improving efficiency through lightweight architectures, model compression, and structural optimization. These advancements have made it more feasible to perform segmentation directly on local devices.

With the rise of edge computing, on-device medical image segmentation, in which inference is executed on-site rather than relying on cloud servers, has become an especially attractive solution. This approach reduces latency, mitigates issues related to network instability, and enhances patient data privacy by keeping sensitive information within the clinical environment. As a result, on-device segmentation is emerging as a key technology for enabling real-time, secure, and dependable medical AI across a wide range of clinical workflows.

Related work

Transformer-based models in vision

Recent works have substantially expanded the modeling capabilities and robustness of Transformer-based architectures. Shi et al.8 propose a robust foveal mechanism, TransNeXt, to enhance visual perception under distribution shifts, while Han et al.9 provide a theoretical and empirical comparison between linear attention and state-space models, offering new insights into long-range representation learning. Additionally, Ye et al.10 unify softmax and linear attention mechanisms through a principled agent-based formulation, improving both efficiency and expressiveness. Lou et al.11 further advance hybrid token-mixing approaches with TransXNet, which combines global and local dynamics in a dual-path design. These advances collectively show that hierarchical Transformers remain highly effective for global context modeling.

Some studies have investigated integrating Swin transformers into other architectures, such as U-Net12 and U-Net++13. Hatamizadeh et al.14 proposed the Swin UNETR model, which incorporates the Swin transformer into the encoder of the U-Net architecture. This model demonstrated exceptional performance, securing first place in the BraTS 2023 Adult Glioma challenge by combining nn-UNet15 with Generative Adversarial Networks (GANs)16. ZongRen et al.17 applied the Swin transformer to the encoder of U-Net++13, an enhanced variant of the U-Net architecture. Nonetheless, despite these advancements, the computational inefficiency of transformers remains an unresolved challenge.

State-space models and Mamba in vision

Mamba was introduced to overcome the computational limitations of Transformers. Built on State-Space Model (SSM) principles, it enables efficient sequence processing with significantly reduced complexity. The recent rise of Mamba-based architectures has opened new avenues for modeling long-range dependencies in a more scalable and efficient manner. Li et al.18 introduce a non-causal dual-state formulation, VSSD, to bridge temporal and spatial contexts in visual sequences, while Lou et al.19 demonstrate the power of sparse cross-layer connections in enhancing hierarchical state-space models. Dong et al.20 extend the Mamba architecture into a multi-resolution hierarchy, Multi-Scale VMamba, capturing cross-scale interactions in a Transformer-like fashion but with significantly improved computational efficiency. Fu et al.21 combine omni-scale state-space modeling with local attention in SegMAN, demonstrating strong performance in semantic segmentation. These developments support our architectural decision to place Mamba in the early, high-resolution encoder stages, where linear-time state-space operations yield substantial benefits in computational efficiency without sacrificing representation quality.

Additionally, Vision Mamba7 has recently been investigated and applied to medical image segmentation, and several studies have focused on further enhancing its effectiveness in this domain. Lai et al.22 employed transfer learning23 to further improve the performance of Vision Mamba. Dang et al.24 proposed LoG-VMamba, an architecture that integrates a local token extractor and a global token extractor within Mamba’s SSM. This design enables the simultaneous learning of local and global features, effectively capturing multiple levels of information. Xing et al.25 incorporated Vision Mamba into the encoder of the U-Net architecture for segmentation tasks. Their model includes several enhancements, such as gated spatial convolution, tri-orientated Mamba (ToM), and feature-level uncertainty estimation, to improve segmentation performance. Despite these advancements, Vision Mamba still underperforms compared with transformers; consequently, ongoing research is focused on strategies to bridge this performance gap in brain tumor segmentation.

CNN-based large-kernel and dynamic-kernel architectures in vision

Despite the rise of Transformers and Mamba, convolutional architectures continue to evolve significantly. Ding et al.26 and Zhang27 proposed large-kernel networks such as UniRepLKNet and showed that extremely large receptive fields (e.g., 31×31) can rival or exceed Transformer performance by capturing long-range spatial structure using purely convolutional computation. More recently, Yu et al.28 introduced dynamic context-mixing kernels, OverLoCK, that mimic human visual attention, providing strong performance on dense prediction tasks. These works reveal a growing convergence between convolutional, state-space, and attention-based models: all aim to blend local detail extraction with global dependency modeling.

Hybrid Mamba model

Owing to the performance limitations of using Mamba alone, active research has focused on hybrid models that integrate additional architectures. Zhou et al.29 combined Mamba with CNNs to leverage the strengths of both architectures. This approach employed cascade residual multi-scale convolutions with filters of varying sizes to effectively capture tumors at different scales. Hatamizadeh et al.30 proposed MambaVision, which uses residual convolution blocks in the initial stages for rapid feature extraction while incorporating both Mamba and transformers in the later stages. Zhang et al.31 introduced HMT-UNet, integrating the MambaVision structure into both the encoder and decoder of a U-Net architecture. However, both MambaVision30 and HMT-UNet31 are restricted to processing two-dimensional data. To overcome this limitation, Cao et al.32 proposed MedSegMamba, a CNN-Mamba hybrid model capable of handling 3D input data. Despite these advances, research on hybrid models that combine Mamba with other architectures remains limited, largely because Mamba is still a relatively recent development.

Motivations and contributions

In medical imaging in particular, Transformer-based models such as Swin UNETR and Mamba-based models such as SegMamba have demonstrated strong performance, but typically at the cost of either high computational overhead (Transformers) or limited global reasoning capability (Mamba). Meanwhile, recent breakthroughs across CNNs, Transformers, and state-space models provide strong theoretical and architectural motivation for a hybrid design strategy. In this study, we aimed to address the following research questions:

  • Can the inferior performance of Mamba be improved by integrating it with a Transformer?

  • Is it possible to design a novel lightweight segmentation model that can run on-device while achieving superior performance compared to state-of-the-art (SOTA) models?

The main contributions of this study are summarized as follows:

  • We proposed a novel U-Net-based hybrid model, SwiM-UNet, which integrates both Mamba and Transformer architectures to overcome their respective limitations. It is the first Mamba–Transformer hybrid model specifically designed for three-dimensional (3D) medical image segmentation. It is motivated by the convergence observed in recent computer vision research: efficient early-stage modeling through state-space architectures and semantically rich global modeling through hierarchical Transformers.

  • Computational complexity was reduced in two ways: (1) by employing efficient TSMamba (eTSMamba) blocks and efficient Swin transformer (eSwin) blocks, and (2) by implementing a bottleneck structure with channel reduction in the decoder.

  • eTSMamba blocks were used in the early stages of the U-Net, where computational efficiency is most critical, while eSwin blocks are applied in the later stages to effectively capture long-range dependencies and local contextual information.

  • We designed an MS-adapter that integrates multi-scale spatial aggregation and channel-wise modulation as a bridge between Mamba and Swin Transformer modules. It comprises sub-adapters that process different axes of the eTSMamba block’s output before passing it to the eSwin block, along with gating mechanisms that balance the contributions of each sub-adapter.

  • Extensive experiments were conducted using the publicly available BraTS2023 and BraTS2024 datasets. Results demonstrate that SwiM-UNet outperforms SOTA benchmark models in terms of Dice score and HD95, while maintaining low computational complexity.

The proposed model: SwiM-UNet

We propose the lightweight SwiM-UNet model consisting of three components: (1) an encoder that incorporates efficient TSMamba (eTSMamba) blocks in the early stages, efficient Swin transformer (eSwin) blocks in the later stages, and an MS-adapter between them; (2) a CNN-based decoder that predicts the segmentation results; and (3) skip connections that link the encoder and decoder. We assume a four-stage U-Net architecture as the backbone. The overall architecture is shown in Fig. 1.

Fig. 1.

Fig. 1

Overall architecture of the proposed SwiM-UNet model.

Encoder

The encoder takes a preprocessed 3D brain MRI volume as input. The first component of the encoder is the stem layer, a depth-wise convolution whose kernel, padding, and stride map the input volume to the feature resolution of the first stage.

The output of the stem layer is processed by the proposed eTSMamba blocks. The eTSMamba block is a modified version of the conventional TSMamba block introduced by Xing et al.25. The original TSMamba block comprises ToM, which computes feature dependencies along three directions, along with layer normalization (LN) and a multi-layer perceptron (MLP). In our design, we replace the conventional MLP with a low-rank MLP (lr-MLP) to enhance computational efficiency.

Specifically, the input feature map is first projected into a lower-dimensional space using a 1×1×1 convolution with reduced rank, and subsequently expanded to the desired MLP dimension via another 1×1×1 convolution. This approach substantially reduces the number of parameters and FLOPs, particularly in deeper layers with large channel sizes, while preserving non-linear transformation capability through GELU activation.
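The lr-MLP described above can be sketched in a few lines of PyTorch: two pointwise 3D convolutions squeeze the channels to a reduced rank and expand them back, with GELU in between. The class name and the rank ratio of 4 are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LowRankMLP(nn.Module):
    """Sketch of the lr-MLP: compress channels to a reduced rank with a
    1x1x1 convolution, apply GELU, then expand back to the original width.
    The rank_ratio default is an assumption for illustration."""
    def __init__(self, channels: int, rank_ratio: int = 4):
        super().__init__()
        rank = max(channels // rank_ratio, 1)
        self.down = nn.Conv3d(channels, rank, kernel_size=1)  # project down
        self.act = nn.GELU()
        self.up = nn.Conv3d(rank, channels, kernel_size=1)    # expand back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))

x = torch.randn(1, 64, 8, 8, 8)
m = LowRankMLP(64)
y = m(x)
print(tuple(y.shape))
```

Because both convolutions are pointwise, the spatial dimensions are untouched; the parameter count drops roughly by the rank ratio relative to a dense channel-to-channel layer.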

As shown in Fig. 1, the eTSMamba block is applied twice at each stage of the encoder, indexed by $l$. Denoting the output from the previous upper stage as $z^{l-1}$, the transformation through the eTSMamba block is expressed by the following equation:

$$\hat{z}^{l} = \mathrm{ToM}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, \qquad z^{l} = \text{lr-MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}. \tag{1}$$

The computational process of ToM is as follows:

$$\mathrm{ToM}(z) = \mathrm{Mamba}(z_f) + \mathrm{Mamba}(z_r) + \mathrm{Mamba}(z_s), \tag{2}$$

where Mamba is a module from the mamba_ssm library [https://github.com/state-spaces/mamba/tree/main/mamba_ssm], and $z_f$, $z_r$, and $z_s$ represent the forward, reverse, and inter-slice sequences of the input z, respectively. The forward and reverse sequences are obtained by flattening the 3D volume along the depth axis (z-axis) in opposite orders, whereas the inter-slice sequence is constructed by extracting voxel values at the same spatial location across slices.
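The three scan orders can be sketched as tensor reshapes. This is a minimal sketch assuming an input of shape (B, C, D, H, W) and Mamba-style (B, L, C) sequences; the function name and exact axis conventions are illustrative assumptions.

```python
import torch

def tom_sequences(x: torch.Tensor):
    """Sketch of the three ToM scan orders for a (B, C, D, H, W) volume:
    - forward: flatten the volume depth-major into a (B, L, C) sequence
    - reverse: the same sequence in the opposite order
    - inter-slice: group each spatial location's voxels across slices,
      so the slice index varies fastest."""
    b, c, d, h, w = x.shape
    forward = x.flatten(2).transpose(1, 2)                      # (B, D*H*W, C)
    reverse = torch.flip(forward, dims=[1])                     # opposite order
    inter = x.permute(0, 3, 4, 2, 1).reshape(b, h * w * d, c)   # depth fastest
    return forward, reverse, inter

x = torch.arange(2 * 3 * 4 * 4, dtype=torch.float32).reshape(1, 2, 3, 4, 4)
f, r, s = tom_sequences(x)
print(f.shape, r.shape, s.shape)
```

Each sequence would then be fed to its own Mamba module and the three outputs summed, as in Eq. (2).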

The proposed adapter model

The output obtained after passing through all the eTSMamba blocks is then fed into the MS-adapter, as shown in Fig. 2. The proposed MS-adapter consists of three sub-adapters (sub-adapters A, B, and C) and gates, each designed to enhance the feature representation in a different way.

Fig. 2.

Fig. 2

The architecture of the proposed MS-adapter.

Sub-adapter A aims to capture local and contextual information along the x and y axes while preserving the structural integrity of the z-axis (slice dimension). It consists of three parallel branches: (i) a 3D average pooling layer, (ii) a 3D max pooling layer, and (iii) a second 3D average pooling layer with a different kernel size. All pooling layers use a stride of 1 and appropriate padding to maintain spatial dimensions. The outputs from these branches are concatenated with the original input and passed through two consecutive 3D convolutional layers, each followed by batch normalization and GELU activation. This structure enables the module to enhance intra-slice spatial representation without altering inter-slice dependencies.
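Sub-adapter A can be sketched as follows. The pooling kernel sizes (1×3×3 and 1×5×5, spanning only the in-plane axes) and the 3×3×3 fusion kernels are assumptions; the source fixes only the branch types and that the z-axis is left untouched.

```python
import torch
import torch.nn as nn

class SubAdapterA(nn.Module):
    """Sketch of sub-adapter A: three parallel stride-1 pooling branches
    over the in-plane (x, y) axes, concatenated with the input and fused
    by two 3D convolutions. All kernel sizes here are assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.AvgPool3d((1, 3, 3), stride=1, padding=(0, 1, 1)),
            nn.MaxPool3d((1, 3, 3), stride=1, padding=(0, 1, 1)),
            nn.AvgPool3d((1, 5, 5), stride=1, padding=(0, 2, 2)),
        ])
        self.fuse = nn.Sequential(
            nn.Conv3d(4 * channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels), nn.GELU(),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels), nn.GELU(),
        )

    def forward(self, x):
        feats = [b(x) for b in self.branches] + [x]  # concat with input
        return self.fuse(torch.cat(feats, dim=1))

x = torch.randn(1, 16, 4, 8, 8)
y = SubAdapterA(16)(x)
print(tuple(y.shape))
```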

Sub-adapter B is designed to refine feature representations by focusing exclusively on inter-slice (z-axis) contextual information. It adopts a bottleneck architecture comprising three 3D convolutional layers, batch normalization, and GELU activation. The first and last convolutional layers use 1×1×1 kernels to adjust the channel dimensions. The intermediate convolutional layer is specialized to capture slice-wise dependencies, employing a kernel (with matching padding) that spans only the z-axis. During the forward pass, the input is added to the transformed features via a residual connection, allowing the model to enhance inter-slice representations without affecting intra-slice spatial features.
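A minimal sketch of sub-adapter B, assuming a 3×1×1 middle kernel for the z-axis-only convolution and a channel reduction factor of 4 inside the bottleneck (both assumptions, since the source elides the exact values):

```python
import torch
import torch.nn as nn

class SubAdapterB(nn.Module):
    """Sketch of sub-adapter B: a channel bottleneck whose middle
    convolution mixes information only along the slice (z) axis.
    The (3, 1, 1) kernel and reduction factor are assumptions."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.body = nn.Sequential(
            nn.Conv3d(channels, mid, kernel_size=1),            # squeeze
            nn.BatchNorm3d(mid), nn.GELU(),
            nn.Conv3d(mid, mid, kernel_size=(3, 1, 1),
                      padding=(1, 0, 0)),                       # z-axis only
            nn.BatchNorm3d(mid), nn.GELU(),
            nn.Conv3d(mid, channels, kernel_size=1),            # restore
            nn.BatchNorm3d(channels), nn.GELU(),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection, as in the text

x = torch.randn(1, 32, 8, 16, 16)
y = SubAdapterB(32)(x)
print(tuple(y.shape))
```

Because the middle kernel is 1×1 in-plane, intra-slice spatial features pass through untouched except via the channel mixing.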

Sub-adapter C aims to emphasize important feature channels by applying a channel-wise attention mechanism. To generate channel attention weights, global contextual information is first aggregated using adaptive average pooling, followed by two consecutive 1×1×1 convolutional layers with GELU activation and a sigmoid function. The resulting attention vector, shaped C×1×1×1, is then used to reweight the output of a feature transformation block, which consists of a 3D convolution, batch normalization, and GELU activation. The re-weighted feature is finally added to the original input through a residual connection, allowing the model to selectively enhance important channels while maintaining stability.
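Sub-adapter C follows a squeeze-and-excitation-style pattern, which can be sketched as below. The hidden width of the attention branch and the 1×1×1 transformation kernel are assumptions.

```python
import torch
import torch.nn as nn

class SubAdapterC(nn.Module):
    """Sketch of sub-adapter C: channel attention weights from globally
    pooled context reweight a transformed feature, which is then added
    back residually. The reduction factor is an assumption."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),                   # global context
            nn.Conv3d(channels, mid, 1), nn.GELU(),
            nn.Conv3d(mid, channels, 1), nn.Sigmoid(), # weights in (0, 1)
        )
        self.transform = nn.Sequential(
            nn.Conv3d(channels, channels, 1),
            nn.BatchNorm3d(channels), nn.GELU(),
        )

    def forward(self, x):
        return x + self.attn(x) * self.transform(x)

x = torch.randn(2, 32, 4, 8, 8)
y = SubAdapterC(32)(x)
print(tuple(y.shape))
```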

In summary, sub-adapter A is designed to capture spatial context along the x and y axes, sub-adapter B focuses on contextual modeling along the z-axis, and sub-adapter C targets channel-wise re-weighting, enabling the model to comprehensively enhance feature representations across spatial, depth, and channel dimensions.

The gate module adaptively balances the contributions of the sub-adapter-transformed features and the original features. Consequently, the network can selectively suppress or enhance the adapter's influence based on the current input content and context. The gate module comprises a 3D adaptive average pooling layer with an output size of 1×1×1, followed by a 3D convolutional layer with a kernel size of 1×1×1 and a sigmoid activation function. The feature transformation through the gate module is given by:

$$g = \sigma\big(\mathrm{Conv}(\mathrm{AvgPool}(x))\big), \qquad y = g \odot \mathcal{A}(x) + (1 - g) \odot x, \tag{3}$$

where $\mathcal{A}(x)$ denotes the sub-adapter output and $\sigma$ the sigmoid function.

To further enhance flexibility, each sub-adapter can be modulated by a dedicated gate. This multi-gate mechanism enables the network to control the contribution of each sub-adapter independently, facilitating more precise and dynamic feature adaptation.
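One such gate can be sketched as follows, under the assumption that the gate blends adapter output and original features as a per-channel convex combination (the class name and this blending form are illustrative):

```python
import torch
import torch.nn as nn

class AdapterGate(nn.Module):
    """Sketch of a single gate: a per-channel weight g in (0, 1),
    derived from globally pooled context, blends the adapter output
    with the original feature: y = g * adapted + (1 - g) * x."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),           # output size 1x1x1
            nn.Conv3d(channels, channels, 1),  # 1x1x1 convolution
            nn.Sigmoid(),
        )

    def forward(self, x, adapted):
        g = self.gate(x)
        return g * adapted + (1 - g) * x

x = torch.randn(1, 16, 4, 4, 4)
adapted = torch.randn_like(x)
y = AdapterGate(16)(x, adapted)
print(tuple(y.shape))
```

With three independent gates of this form, each sub-adapter's contribution can be modulated separately, as the multi-gate mechanism describes.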

eSwin block

The output enhanced with local information through the MS-adapter serves as the input to the eSwin block. The eSwin block is a modified version of the conventional Swin Transformer block introduced by Hatamizadeh et al.14. Similar to the eTSMamba block, the MLP layer of the Swin Transformer block is replaced by the lr-MLP.

Let $z^{l-1}$ denote the output of the previous upper stage. The outputs are computed as follows:

$$\begin{aligned} \hat{z}^{l} &= \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, & z^{l} &= \text{lr-MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l},\\ \hat{z}^{l+1} &= \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}, & z^{l+1} &= \text{lr-MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}. \end{aligned} \tag{4}$$

Figure 1 illustrates that the eSwin block consists of two sequential mechanisms: one incorporating a window-based multi-head self-attention (W-MSA) mechanism and the other utilizing a shifted W-MSA (SW-MSA) mechanism. Each mechanism includes layer normalization (LN) and an lr-MLP. In the first mechanism, in which W-MSA is applied, the 3D token map is partitioned into non-overlapping regions using windows of size $M \times M \times M$. In the following mechanism, in which SW-MSA is applied, the partitioned windows are shifted by $\lfloor M/2 \rfloor$ voxels along each axis. To efficiently implement the shifted window mechanism in 3D, a 3D cyclic shifting strategy is employed. As shown in Fig. 1, the eSwin block is repeated twice, which is indexed by $l$ in (4).
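The 3D cyclic shift underlying SW-MSA can be sketched with `torch.roll`: the volume is rolled by the shift amount along each spatial axis, attention is computed on the rolled windows, and rolling back with the negative shift restores the original token layout. The shift of 2 voxels below is illustrative.

```python
import torch

def cyclic_shift_3d(x: torch.Tensor, shift: int) -> torch.Tensor:
    """Sketch of 3D cyclic shifting for shifted-window attention.
    Rolls a (B, C, D, H, W) volume by `shift` voxels along each
    spatial axis; the negative shift inverts the operation exactly."""
    return torch.roll(x, shifts=(shift, shift, shift), dims=(2, 3, 4))

x = torch.arange(2 * 4 * 4 * 4, dtype=torch.float32).reshape(1, 2, 4, 4, 4)
shifted = cyclic_shift_3d(x, 2)
restored = cyclic_shift_3d(shifted, -2)
print(torch.equal(restored, x))
```

Cyclic shifting avoids padding: windows that wrap around the volume boundary are handled with an attention mask rather than extra computation.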

Table 1.

Variants of the proposed SwiM-UNet model.

Methods Composition of encoder
Model I eTSMamba Block × 1 + eSwin Block × 3
Model II eTSMamba Block × 2 + eSwin Block × 2
Model III eTSMamba Block × 3 + eSwin Block × 1

Decoder

The decoder utilizes skip connections to link with the encoder, enabling the transfer of feature representations between corresponding layers. These feature representations are first processed through Residual Block 1, which consists of a 3D convolutional layer, instance normalization, and Leaky ReLU (LReLU) activation. Subsequently, the feature processed by the residual block is concatenated with the output from the previous stage. After concatenation, the output passes through Residual Block 2, which is essential for computational efficiency. Unlike conventional decoders, Residual Block 2 adopts a bottleneck structure with channel reduction and GELU activation for improved efficiency. The final output of the decoder passes through the segmentation head, which consists of a 1×1×1 convolutional layer followed by a sigmoid activation function, completing the segmentation task.
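The bottleneck structure of Residual Block 2 can be sketched as below. The reduction factor of 4 and the 3×3×3 middle kernel are assumptions; the source states only that channels are reduced inside the block and that GELU is used.

```python
import torch
import torch.nn as nn

class ResidualBlock2(nn.Module):
    """Sketch of the decoder's bottleneck residual block: reduce channels
    with a pointwise convolution, process at the reduced width, then
    restore the channel count, with a residual connection around it."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.body = nn.Sequential(
            nn.Conv3d(channels, mid, 1),             # channel reduction
            nn.InstanceNorm3d(mid), nn.GELU(),
            nn.Conv3d(mid, mid, 3, padding=1),       # cheap spatial mixing
            nn.InstanceNorm3d(mid), nn.GELU(),
            nn.Conv3d(mid, channels, 1),             # restore channels
        )

    def forward(self, x):
        return x + self.body(x)

x = torch.randn(1, 32, 4, 8, 8)
y = ResidualBlock2(32)(x)
print(tuple(y.shape))
```

The costly 3×3×3 convolution runs at a quarter of the channel width, which is where the decoder's FLOP savings come from.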

Experiments

Datasets and implementation details

Dataset

In this study, we used the publicly available BraTS 2023 [https://www.kaggle.com/datasets/shakilrana/brats-2023-adult-glioma] (1251 cases) and BraTS 2024 (1350 cases) datasets. Each MRI volume has a size of 240×240×155 voxels and comprises four modalities: native (T1), post-contrast T1-weighted (T1Gd), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (T2-FLAIR). We divided the dataset into training, validation, and test sets in a ratio of 7:1:2. Segmentation targets were classified into three types: whole tumor (WT), tumor core (TC), and enhancing tumor (ET).

Pre-processing

The training data underwent spatial augmentations, including random rotations about each axis and random scaling (with factors ranging from 0.7 to 1.4, using constant border padding). Intensity augmentations comprised the addition of Gaussian noise (applied with 10% probability per sample), Gaussian blur (with sigma sampled from [0.5, 1.0], applied at 20% probability per sample and 50% probability per channel), multiplicative brightness modulation (with factors between 0.75 and 1.25 at 15% probability per sample), contrast adjustment (15% probability per sample), simulation of low-resolution imaging via downsampling followed by upsampling (with zoom factors from 0.5 to 1.0 at 25% probability per sample and 50% probability per channel), and two stages of gamma correction (with parameter ranges of [0.7, 1.5] applied at probabilities of 10% and 30%, respectively).
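As an example of one pipeline stage, gamma correction can be sketched as follows: with some probability, intensities are rescaled to [0, 1], raised to a randomly drawn gamma, and mapped back to the original range. The helper name and range-restoration scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_correct(img: np.ndarray, gamma_range=(0.7, 1.5), p=0.3):
    """Sketch of one gamma-correction augmentation stage: with
    probability p, normalize intensities to [0, 1], apply a gamma
    drawn from gamma_range, and restore the original intensity range."""
    if rng.random() >= p:
        return img
    lo, hi = img.min(), img.max()
    norm = (img - lo) / (hi - lo + 1e-8)
    gamma = rng.uniform(*gamma_range)
    return norm ** gamma * (hi - lo) + lo

img = rng.normal(size=(4, 4)).astype(np.float32)
out = gamma_correct(img, p=1.0)  # force application for the demo
print(out.shape)
```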

Training setup

The models were implemented using PyTorch 2.5.1 with CUDA 11.8 and MONAI 1.4.0. Cross-entropy loss was employed as the objective function for SwiM-UNet. All models were trained using a stochastic gradient descent (SGD) optimizer with momentum of 0.99, weight decay, and a polynomial learning-rate scheduler. Training was conducted for 1000 epochs across all datasets, incorporating data augmentations such as additive brightness, gamma correction, rotation, scaling, mirroring, and elastic deformation. A batch size of 2 was used, and training was performed on an NVIDIA GeForce RTX 4090 with 36 GB of memory. All models were executed under the same computing environment and training protocol to ensure a fair comparison.
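The optimizer and polynomial schedule can be sketched as below, using the decay form lr_t = lr0 · (1 − t/T)^p. The initial learning rate, weight decay value, and exponent p = 0.9 are illustrative assumptions (the source does not recover the exact values); the momentum of 0.99 and 1000-epoch horizon match the text.

```python
import torch

# Sketch of the training setup: SGD with momentum and a polynomial
# learning-rate decay. lr0, weight_decay, and the 0.9 exponent are
# assumptions for illustration; the dummy model stands in for SwiM-UNet.
model = torch.nn.Linear(4, 2)
lr0, weight_decay, momentum, total_epochs = 1e-2, 3e-5, 0.99, 1000
optimizer = torch.optim.SGD(model.parameters(), lr=lr0,
                            momentum=momentum, weight_decay=weight_decay)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: (1 - epoch / total_epochs) ** 0.9)

lrs = []
for epoch in range(3):
    optimizer.step()          # one training epoch would go here
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
print(lrs)
```

The schedule decays smoothly toward zero at the final epoch, which is the standard choice in nnU-Net-style training recipes.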

Evaluation metrics

The models were evaluated in terms of the Dice score and Hausdorff Distance 95 (HD95). In the following, $P$ represents the predicted segmentation and $G$ represents the ground truth. The Dice score measures how well the model captures tumor regions compared with the ground truth and is defined as

$$\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}. \tag{5}$$

The HD measures the maximum distance between two point sets and is defined as

$$\mathrm{HD}(P, G) = \max\Big\{ \sup_{p \in P} \inf_{g \in G} d(p, g),\ \sup_{g \in G} \inf_{p \in P} d(p, g) \Big\}. \tag{6}$$

HD95 is the 95th percentile of the boundary distances underlying the HD, which reduces sensitivity to outliers and provides insight into how well the segmented tumor shape aligns with the ground truth.
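Both metrics can be sketched directly from their definitions. The example below uses small 2D masks for brevity (the definitions carry over to 3D unchanged); the brute-force pairwise-distance HD95 is fine for small boundary sets but would need a KD-tree for full volumes.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2|P ∩ G| / (|P| + |G|) on binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def hd95(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Sketch of HD95: the 95th percentile of directed point-to-set
    distances, symmetrized by taking the maximum of both directions."""
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    d_pg = d.min(axis=1)   # each predicted point to nearest ground truth
    d_gp = d.min(axis=0)   # each ground-truth point to nearest prediction
    return max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))

pred = np.zeros((8, 8), bool); pred[2:6, 2:6] = True
gt = np.zeros((8, 8), bool); gt[2:6, 2:7] = True
print(round(dice_score(pred, gt), 3))
print(hd95(np.argwhere(pred), np.argwhere(gt)))
```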

Experimental results and discussion

Encoder composition comparison

The proposed model is categorized into three variants, Model I, Model II, and Model III, based on the composition of the encoder with Mamba and Swin transformer layers. Model I incorporates an eTSMamba block in the first layer, followed by eSwin blocks in the second, third, and fourth layers. Model II employs eTSMamba blocks in the first and second layers, with eSwin blocks in the third and fourth layers. Model III consists of eTSMamba blocks in the first three layers and an eSwin block in the fourth layer. An initial comparison was conducted among these variants without integrating the proposed MS-adapter module. Table 1 summarizes the configurations of the different model variants.

Table 2 compares the number of parameters and FLOPs of the proposed models. Model III, comprising the largest number of TSMamba blocks, had the lowest number of parameters, FLOPs, and latency. Conversely, Model I, comprising the largest number of Swin transformer blocks, exhibited the highest number of parameters, FLOPs, and latency.

Table 2.

Comparison of computational complexity across proposed models.

Methods Parameters FLOPs Latency (Inference)
Model I 42.72 M 253.29 G 56.06 ms
Model II 41.56 M 240.50 G 51.09 ms
Model III 41.02 M 233.74 G 44.95 ms

Table 3 lists the performance of each model. In terms of the Dice score, Model III achieved the highest scores for WT, TC, and ET, resulting in the highest mean Dice score overall. This is followed by Models I and II. Regarding HD95, Model I recorded the lowest WT and ET values, leading to the lowest mean HD95. Models III and II followed in performance.

Table 3.

Comparison of segmentation performance across different models.

Methods Dice score (↑) HD95 (mm, ↓)
WT TC ET Mean WT TC ET Mean
Model I 0.933 0.890 0.854 0.892 4.086 4.415 4.503 4.335
Model II 0.926 0.888 0.849 0.888 4.197 4.759 4.840 4.599
Model III 0.933 0.890 0.855 0.893 4.314 4.510 4.753 4.526

Considering both computational complexity and performance, as presented in Tables 2 and 3, Model III demonstrates the best balance of efficiency and effectiveness.

Remark 1

In this paper, the design choices were determined empirically. The number of interconnected layers reflects a trade-off between performance and system overhead and should therefore be chosen based on the specific task objectives and device capabilities. Determining an optimal and generalizable layer configuration remains an open problem, which we plan to investigate further in future work.

Ablation study of MS-adapter

Ablation studies were performed on the MS-adapter. Here, the baseline model is Model III, which was identified as the most effective model in the previous experiments. First, sub-adapters A, B, and C were applied individually to evaluate their respective contributions to model performance. Second, models incorporating all three sub-adapters (A, B, and C) simultaneously were compared, both with and without the addition of a gating mechanism.

According to Table 4, the increase in the number of parameters and FLOPs over the baseline was largest for sub-adapter A, followed by sub-adapters C and B. The addition of the gating mechanism resulted in only a marginal further increase in both parameters and FLOPs. Simultaneous use of sub-adapters A, B, and C with the gating mechanism exhibited the highest parameter count and FLOPs: the number of parameters increased by 40.273% compared to the baseline model, while the increase in FLOPs was limited to only 3.393%. Importantly, the inference time remained nearly identical to that of the baseline model, even with the inclusion of the MS-adapter.

Table 4.

Ablation study of MS-adapter: computational complexity.

A B C Gate Parameters FLOPs Latency (Inference)
✗ ✗ ✗ ✗ 41.02 M 233.74 G 44.95 ms
✓ ✗ ✗ ✗ 50.16 M 238.42 G 44.87 ms
✗ ✓ ✗ ✗ 43.38 M 234.95 G 44.92 ms
✗ ✗ ✓ ✗ 45.04 M 235.80 G 44.97 ms
✓ ✓ ✓ ✗ 57.10 M 241.67 G 45.63 ms
✓ ✓ ✓ ✓ 57.54 M 241.67 G 45.02 ms

Table 5 compares the segmentation performance in terms of the Dice score and HD95. Regarding the Dice score, applying a single sub-adapter did not yield an improvement in the mean Dice score. In contrast, the Dice score improved when all three sub-adapters (A, B, and C) were used, with the best performance achieved when the multi-gate mechanism was additionally applied. Notably, the Dice scores significantly increased in the TC and ET regions. Regarding HD95, only the addition of sub-adapter A led to a reduction, whereas adding sub-adapter B or C did not result in any improvement. Similar to the Dice score, incorporating all three sub-adapters led to a reduction in HD95, with the lowest HD95 obtained when the multi-gate mechanism was further applied. In summary, the model achieved its highest performance when all sub-adapters and the multi-gate mechanism were implemented. This indicates that the sub-adapter modules complement each other, particularly in improving performance for local lesions, such as TC and ET.

Table 5.

Ablation study of MS-adapter: dice score and HD95.

A B C Gate Dice score (↑) HD95 (mm, ↓)
WT TC ET Mean WT TC ET Mean
✗ ✗ ✗ ✗ 0.933 0.890 0.855 0.893 4.314 4.510 4.753 4.526
✓ ✗ ✗ ✗ 0.934 0.891 0.855 0.893 3.943 4.194 4.844 4.327
✗ ✓ ✗ ✗ 0.933 0.885 0.855 0.891 4.250 4.863 5.311 4.808
✗ ✗ ✓ ✗ 0.933 0.884 0.850 0.889 4.278 4.974 5.204 4.819
✓ ✓ ✓ ✗ 0.934 0.895 0.859 0.896 4.185 4.127 4.591 4.301
✓ ✓ ✓ ✓ 0.934 0.896 0.861 0.897 3.912 4.113 4.294 4.106

Comparison with SOTA

We compared the proposed model with the following: (1) VT-UNet33, a transformer-based U-Net architecture tailored for 3D medical image segmentation, (2) nnMamba34, an SSM-based architecture designed for efficient and scalable 3D biomedical image segmentation, (3) SegMamba25, which uses only conventional TSMamba blocks, (4) Swin UNETR14, which employs only conventional Swin transformer blocks, and (5) MedSegMamba32, a hybrid Mamba-CNN model designed to process 3D data.

First, we compared the computational complexity in terms of the number of parameters, FLOPs, and inference latency, as shown in Table 6. Among the compared models, MedSegMamba32 exhibited the largest number of parameters, the highest FLOPs, and the highest latency, while nnMamba34 had the lowest values for all three metrics. The proposed model ranked third lowest in terms of the number of parameters and FLOPs, and second lowest in terms of latency, among the six models. Compared to Swin UNETR14, the proposed model reduces the number of parameters, FLOPs, and latency by 17.351%, 69.025%, and 62.676%, respectively. In comparison with SegMamba25, which uses the TSMamba block, the proposed model reduces parameters by 2.309%, FLOPs by 66.728%, and latency by 36.078%. Relative to the Mamba-based hybrid model MedSegMamba, the reductions reach 31.881% in parameters, 80.782% in FLOPs, and 72.778% in latency.

Table 6.

Comparison of computational complexity across different models.

Methods Parameters FLOPs Latency (Inference)
VT-UNet33 20.75 M 165.16 G 65.36 ms
Swin UNETR14 69.62 M 780.22 G 120.62 ms
nnMamba34 17.89 M 110.67 G 16.03 ms
SegMamba25 58.90 M 726.34 G 70.43 ms
MedSegMamba32 84.47 M 1,257.54 G 165.38 ms
SwiM-UNet (proposed) 57.54 M 241.67 G 45.02 ms
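The percentage reductions quoted in the text can be reproduced directly from the entries of Table 6. The short Python sketch below (dictionary layout is ours) recomputes them; note that the parameter reduction relative to MedSegMamba works out to 31.881% of MedSegMamba's parameter count.

```python
# Recompute the complexity reductions of SwiM-UNet from Table 6.
# Tuples: (parameters in M, FLOPs in G, latency in ms).
table6 = {
    "Swin UNETR":  (69.62, 780.22, 120.62),
    "SegMamba":    (58.90, 726.34, 70.43),
    "MedSegMamba": (84.47, 1257.54, 165.38),
    "SwiM-UNet":   (57.54, 241.67, 45.02),
}

def reduction(baseline, ours):
    """Percentage reduction of each metric relative to `baseline`."""
    return tuple(round(100 * (b - o) / b, 3) for b, o in zip(baseline, ours))

ours = table6["SwiM-UNet"]
print(reduction(table6["Swin UNETR"], ours))   # (17.351, 69.025, 62.676)
print(reduction(table6["SegMamba"], ours))     # (2.309, 66.728, 36.078)
print(reduction(table6["MedSegMamba"], ours))  # (31.881, 80.782, 72.778)
```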

Second, we compared the segmentation performance of the models using both the BraTS 2023 and BraTS 2024 datasets, as shown in Table 7. On the BraTS 2023 dataset, the proposed SwiM-UNet outperformed the other models by achieving the highest Dice score. While nnMamba34 achieved the lowest HD95 in the WT region, the proposed model outperformed the others in the TC and ET regions, as well as in the mean HD95. Compared to MedSegMamba, which had the highest number of parameters and the highest FLOPs, the proposed model achieved a 0.671% increase in the mean Dice score and a 5.020% reduction in the mean HD95. In comparison with nnMamba, the model with the lowest parameter count and FLOPs, the proposed model improved the mean Dice score by 1.701% and reduced the mean HD95 by 9.400%. On the BraTS 2024 dataset, SwiM-UNet demonstrated a superior Dice score and HD95 compared to the other models in all regions, including WT, TC, and ET. Compared to MedSegMamba, the proposed model achieved a 0.903% increase in the mean Dice score and a 2.735% reduction in the mean HD95. In comparison with nnMamba, the proposed model improved the mean Dice score by 1.822% and reduced the mean HD95 by 8.017%. Several factors contribute to this outcome: (1) the integration of the Swin transformer helps overcome the performance shortcomings of Mamba; (2) the MS-adapter learns the local context of the eTSMamba block’s output along the x-, y-, and z-axes, as well as in a channel-wise manner; and (3) the combination of transformers and Mamba, which employ different feature extraction methods, yields a more robust feature representation.

Table 7.

Performance comparison on BraTS 2023 and BraTS 2024 datasets.

Methods Dice score (%, ↑) HD95 (mm, ↓)
WT TC ET Mean WT TC ET Mean
Performance on BraTS 2023
 VT-UNet33 0.928 0.889 0.849 0.889 4.126 4.284 4.774 4.395
 Swin UNETR14 0.929 0.877 0.842 0.883 4.315 5.026 5.200 4.847
 nnMamba34 0.930 0.874 0.843 0.882 3.908 4.752 4.935 4.532
 SegMamba25 0.917 0.873 0.843 0.878 5.317 4.898 5.226 5.087
 MedSegMamba32 0.930 0.893 0.850 0.891 4.093 4.205 4.670 4.323
 SwiM-UNet (proposed) 0.934 0.896 0.861 0.897 3.912 4.113 4.294 4.106
Performance on BraTS 2024
 VT-UNet33 0.926 0.886 0.842 0.885 4.172 4.331 4.863 4.455
 Swin UNETR14 0.918 0.864 0.833 0.872 4.573 5.047 5.367 4.996
 nnMamba34 0.926 0.871 0.837 0.878 3.952 4.876 5.091 4.640
 SegMamba25 0.915 0.867 0.843 0.875 5.614 4.967 5.315 5.299
 MedSegMamba32 0.929 0.884 0.844 0.886 4.127 4.394 4.642 4.388
 SwiM-UNet (proposed) 0.930 0.893 0.860 0.894 4.121 4.326 4.357 4.268
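The relative gains over nnMamba quoted above follow directly from the mean Dice and mean HD95 columns of Table 7; the quick check below (formatting choices ours) reproduces them.

```python
# Recompute the relative gains of SwiM-UNet over nnMamba quoted in the
# text, from the mean Dice and mean HD95 columns of Table 7.
def rel_change_pct(baseline, ours):
    """Absolute relative change, as a percentage string with 3 decimals."""
    return f"{100 * abs(ours - baseline) / baseline:.3f}"

# (mean Dice, mean HD95) pairs from Table 7
nnmamba = {"2023": (0.882, 4.532), "2024": (0.878, 4.640)}
swim    = {"2023": (0.897, 4.106), "2024": (0.894, 4.268)}

for year in ("2023", "2024"):
    dice = rel_change_pct(nnmamba[year][0], swim[year][0])
    hd95 = rel_change_pct(nnmamba[year][1], swim[year][1])
    print(f"BraTS {year}: Dice +{dice}%, HD95 -{hd95}%")
# BraTS 2023: Dice +1.701%, HD95 -9.400%
# BraTS 2024: Dice +1.822%, HD95 -8.017%
```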

Qualitative evaluation

We provide a qualitative evaluation of the proposed SwiM-UNet across a diverse set of challenging cases drawn from the BraTS 2023 and BraTS 2024 datasets.

First, Fig. 3 presents qualitative brain tumor segmentation results for representative cases from the BraTS 2023 validation set. In the segmentations, the green, red, and blue regions represent the WT, TC, and ET, respectively. Figure 3(a) shows a multi-focal or heterogeneous lesion comprising multiple spatially separated or compositionally heterogeneous tumor regions, which requires consistent segmentation across spatially disjoint areas. The proposed model captures all subregions consistently, preserving both the extent and shape of each lesion despite their disjoint appearance. Figure 3(b) shows small or sparsely distributed tumors, where accurate localization is typically difficult. The model accurately detects these subtle lesions without introducing false positives, demonstrating strong sensitivity to fine-grained structures. Figure 3(c) shows tumors with ambiguous or low-contrast boundaries, caused by low contrast between the lesion and the surrounding tissue, which often leads baseline models to over-segment or blur edge regions. The model maintains precise boundary delineation and avoids over-segmentation, showing robustness to ambiguous intensity gradients.

Fig. 3.

Fig. 3

Qualitative evaluation on challenging BraTS cases.

Second, Fig. 4 presents a qualitative comparison across different methods, with the proposed SwiM-UNet displayed in the rightmost column. From the enlarged views (bottom row), it is evident that the proposed method’s segmentation boundaries (colored regions) align most closely with the ground truth, accurately preserving both the shape and extent of the tumor subregions (green, red, and blue areas). In the orange-box region, the other models (VT-UNet, Swin UNETR, nnMamba, and SegMamba) tend to produce over-segmentation or irregular boundaries, whereas the proposed method closely follows the true contour. In the yellow-box region, the proposed method effectively captures small, fine-structured areas that the other models partially miss or distort. In the black-box region, the proposed method better suppresses false positives, reducing the spurious predictions present in the other methods. Overall, the proposed SwiM-UNet achieves higher segmentation fidelity, capturing both large-scale tumor shapes and subtle details more accurately than the competing approaches. This aligns with its quantitative superiority in Dice score, HD95, and the performance-efficiency metric (PEM).

Fig. 4.

Fig. 4

Segmentation images of the proposed SwiM-UNet and other models.

Discussion, limitations and future works

Hybrid Mamba–Swin architecture

In this work, the number of interconnected layers represents a fundamental trade-off between performance and system overhead, and should therefore be selected according to task requirements and device constraints. Identifying an optimal and universally generalizable layer configuration remains an open problem, which we leave for future investigation. Although the final architectural choice was guided by empirical evaluation, it is also supported by clear structural reasoning. Specifically, the encoder follows a hierarchical design in which the first three stages process progressively downsampled yet still high-resolution volumetric feature maps. At these stages, Mamba is employed to maximize computational efficiency, as its state-space modeling offers linear-time complexity that is particularly advantageous when spatial–volumetric dimensions are large. As the encoder depth increases, spatial resolution is significantly reduced through downsampling. At this lowest-resolution stage, global context modeling becomes both computationally feasible and semantically valuable. We therefore introduce the Swin Transformer at the final encoder stage, where its attention-based mechanism can effectively capture long-range dependencies and high-level semantic patterns with manageable computational cost. This Mamba-to-Swin transition reflects a principled allocation of architectural components across spatial scales, aligning with the hierarchical nature of 3D medical image representations. Model III thus leverages the complementary strengths of Mamba and Swin: computational efficiency in early high-resolution stages and powerful global representation learning in deeper, low-resolution stages.
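The scale argument above can be made concrete with a back-of-envelope token count. Assuming, purely for illustration, a 128³ input whose spatial side halves at each encoder stage (the actual SwiM-UNet stage sizes may differ), the gap between linear-time state-space modeling and dense global attention shrinks rapidly with depth. Swin's windowed attention mitigates the quadratic term in practice, but the contrast still shows why attention is reserved for the smallest feature maps.

```python
# Tokens per stage for an illustrative 128^3 volume halved at each stage.
# Mamba-style SSMs scale ~O(N) in token count N; dense global attention
# scales ~O(N^2), so its relative overhead per stage is a factor of N.
for stage, side in enumerate([64, 32, 16, 8], start=1):
    n = side ** 3  # number of voxels (tokens) at this stage
    print(f"stage {stage}: N = {n:>7,}  attention/SSM cost ratio ~ {n:,}x")
```

At the first stage the quadratic term is roughly 262,144 times the linear one, while at the final stage it is only 512 times, which is exactly the regime where attention becomes affordable.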

Ablation study of MS-adapter

First, each sub-adapter was intentionally designed to capture only one structural aspect of the 3D volume: A for intra-slice spatial context (x–y), B for inter-slice contextual relations (z-axis), and C for channel-wise feature reweighting. When applied alone, each module enhances only one dimension of the representation. This can lead to unbalanced feature amplification, where enriching only a single axis may disturb the encoder–decoder feature alignment or the local–global balance within the hybrid Mamba–Transformer pipeline. Thus, isolated application may not reliably improve the Dice score, which is consistent with the behavior observed in Table 5. When A, B, and C are used together, the model benefits from complementary multi-dimensional enhancement, capturing spatial, depth-wise, and channel-wise structure simultaneously. The gating mechanism significantly stabilizes this interaction by preventing any single branch from dominating the representation. Thus, the performance gain arises from synergistic integration, not from parameter count alone.
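The exact gate equations are not repeated here; as a minimal NumPy sketch of one common multi-gate scheme consistent with the description above (function and variable names are ours, not the paper's), softmax weights keep the three branch contributions positive and normalized, so no single sub-adapter can dominate the fused representation.

```python
import numpy as np

def multi_gate(branch_a, branch_b, branch_c, logits):
    """Blend three sub-adapter outputs with softmax gate weights.

    The softmax keeps the three weights positive and summing to one,
    preventing any single branch from dominating the representation.
    """
    w = np.exp(logits - np.max(logits))  # numerically stable softmax
    w = w / w.sum()
    return w[0] * branch_a + w[1] * branch_b + w[2] * branch_c

# Toy check: equal logits weight the branches equally.
x = np.ones((4, 2, 2, 2))                  # C x D x H x W feature map
fused = multi_gate(x, 2 * x, 3 * x, np.zeros(3))
print(round(float(fused.mean()), 6))       # average of 1, 2, 3 -> 2.0
```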

Second, while the proposed MS-adapter substantially improves segmentation accuracy, it also introduces a non-trivial parameter overhead (a 40.27% increase), which may raise concerns that the observed performance gains simply stem from increased model capacity rather than architectural effectiveness. Such a concern is particularly relevant for deployment on devices with strict memory constraints. In the context of on-device clinical applications, however, this overhead remains acceptable for mid-range edge hardware, such as portable MRI consoles or bedside accelerators. More importantly, an analysis of overparameterization based on the results in Tables 4 and 5 suggests that the performance improvement cannot be attributed solely to an increase in the number of parameters. Specifically, although the A+B+C configuration introduces a substantially larger number of parameters compared to the baseline without sub-adapters, it yields only marginal performance gains, indicating that increased model capacity alone does not translate into meaningful accuracy improvements. In contrast, when the gating mechanism is introduced (A+B+C+Gate), the number of parameters remains nearly unchanged relative to A+B+C, yet a significant performance improvement is observed. This clearly highlights the critical role of the gating mechanism in adaptively balancing the contributions of components A, B, and C. Taken together, these results demonstrate that the observed performance gains are not primarily driven by overparameterization, but rather by the careful architectural design of the sub-adapters and the effective gating-based coordination among them. Furthermore, if overparameterization were the dominant factor, one would expect a noticeable increase in computational load or inference latency, which is not observed in practice.

Third, Table 4 indicates an increase in parameters after incorporating all sub-adapters and the gating mechanism, yet the inference latency remains almost unchanged. The key reason lies in the fundamental distinction between the parameter count and the computational workload during inference. The majority of the additional parameters introduced by the MS-adapter reside in computationally lightweight components, which contribute minimally to the overall inference time. (i) The increase in parameters is primarily due to the use of 1×1×1 convolutions for channel mixing, small-kernel convolutions, and adaptive average pooling layers. Although these layers substantially raise the number of trainable weights, they introduce only a negligible number of additional FLOPs because of their small spatial footprint. This explains why the parameter count increases by 40.27%, while FLOPs increase by only 3.39%. (ii) Runtime latency on modern GPUs is dominated by the total floating-point operations, tensor sizes, memory bandwidth, and the parallelization efficiency of the operations. Since the MS-adapter preserves the original spatial tensor dimensions and contributes only a small number of additional computations, modern GPUs can process these lightweight operations in parallel with almost no measurable delay. (iii) The operations introduced by the MS-adapter occur either after spatial downsampling, where tensor sizes are small, or in parallel sub-branches that GPUs can execute efficiently. As a result, they do not introduce new computational bottlenecks during inference. (iv) Profiling conducted on an NVIDIA RTX 4090 demonstrates that the observed increase in inference time falls within the measurement noise margin, indicating that the overhead introduced by the MS-adapter is practically insignificant.
In summary, although the inclusion of the full MS-adapter increases the number of parameters, these parameters correspond mostly to cheap convolutional and pooling operations with minimal FLOPs. Consequently, the inference latency remains nearly unchanged.
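The decoupling of parameter count from FLOPs described in point (i) is easy to quantify for 3D convolutions. The shapes below are illustrative, not SwiM-UNet's actual layer sizes: a wide 1×1×1 channel mixer on a downsampled map carries far more weights, yet far fewer operations, than a narrow 3×3×3 convolution on a high-resolution map.

```python
# Parameter count vs. FLOPs for a 3D convolution (illustrative shapes).
def conv3d_cost(c_in, c_out, k, d, h, w):
    params = c_in * c_out * k**3 + c_out          # weights + biases
    flops = 2 * c_in * c_out * k**3 * d * h * w   # 2 ops per multiply-add
    return params, flops

# Wide 1x1x1 channel mixer on a small 8^3 feature map
p1, f1 = conv3d_cost(512, 512, 1, 8, 8, 8)
# Narrow 3x3x3 convolution on a large 64^3 feature map
p2, f2 = conv3d_cost(32, 32, 3, 64, 64, 64)
print(f"1x1x1 @ 8^3:  {p1/1e6:.2f} M params, {f1/1e9:.2f} GFLOPs")
print(f"3x3x3 @ 64^3: {p2/1e6:.2f} M params, {f2/1e9:.2f} GFLOPs")
# The channel mixer holds ~9.5x the parameters of the 3x3x3 layer,
# yet costs ~54x fewer FLOPs.
```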

Generalization

In this work, our main experiments focus on brain tumor segmentation. However, the underlying design rationale is not task-specific. The principle—using efficient sequential modeling in high-resolution stages and attention-based global modeling in low-resolution semantic stages—is consistent with established hybrid CNN–Transformer and CNN–state-space architectures used in diverse 3D segmentation tasks. The results presented in Table 8 and Fig. 5 further demonstrate the generalizability and robustness of the proposed SwiM-UNet beyond brain tumor segmentation. Although the model was primarily designed and optimized for 3D brain MRI data, its strong performance on the BTCV (Synapse) multi-organ CT dataset [https://www.synapse.org/Synapse:syn3193805] indicates that the hybrid Mamba–Transformer architecture, together with the MS-adapter, can effectively adapt to substantially different anatomical structures and imaging modalities.

Table 8.

Performance comparison on BTCV (Synapse) dataset.

Methods Dice score (%, ↑) HD95 (mm, ↓)
TransUNet35 77.48 31.69
UNETR36 78.35 21.48
Swin UNETR37 79.13 21.55
nnU-Net38 81.96 10.68
SegMamba39 81.57 4.19
SwiM-UNet (proposed) 81.34 4.02

Fig. 5.

Fig. 5

Qualitative comparison on BTCV (Synapse).

Remark 2

BTCV (Synapse) is a contrast-enhanced abdominal CT benchmark released for the MICCAI Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) challenge. The scans are acquired in the portal venous phase with variable volume sizes and voxel spacing, and manual annotations are provided for 13 abdominal structures. Following the widely adopted Synapse evaluation protocol used by prior work (e.g. TransUNet35), we use the fixed split of 18 training and 12 testing cases and report performance on 8 organs (spleen, right/left kidney, gallbladder, liver, stomach, aorta, and pancreas).

As reported in Table 8, SwiM-UNet achieves a Dice score of 81.34%, which is competitive with state-of-the-art models, closely approaching nnU-Net and outperforming several transformer-based architectures such as TransUNet, UNETR, and Swin UNETR. Notably, SwiM-UNet obtains the lowest HD95 value (4.02 mm) among all compared methods, demonstrating its superior ability to produce spatially accurate and smooth segmentation boundaries.

The qualitative examples in Fig. 5 corroborate these quantitative findings. The predicted organ contours generated by SwiM-UNet closely match the ground truth, particularly for organs with complex shapes or ambiguous boundaries. The model successfully preserves fine-grained structures while avoiding boundary distortions, which are common failure modes in CNN-only or Transformer-only architectures. These observations highlight the advantage of integrating Mamba’s efficient long-range modeling with Swin’s hierarchical attention, while the MS-adapter ensures robust spatial and depth-wise feature alignment across diverse anatomical contexts.

Overall, the results on BTCV confirm that the proposed framework is not limited to a specific modality or region of interest. Instead, it shows strong potential as a general-purpose 3D medical image segmentation backbone, capable of achieving accurate and stable performance even under cross-modality or cross-anatomy shifts. Future work will seek to further validate this generalizability across additional datasets and clinical scenarios.

Trade-off between performance and computation

We compare the trade-off between computational efficiency and segmentation performance, as shown in Fig. 6. Figure 6a plots segmentation performance (Dice score, %) against computational cost (FLOPs) for the compared models. SwiM-UNet (ours), marked by the large red star, achieves the highest Dice score while maintaining a relatively low FLOPs budget (under 250 G), indicating that SwiM-UNet offers the best trade-off between accuracy and efficiency. Figure 6b plots the HD95 metric (lower is better) against computational cost. The proposed SwiM-UNet, again represented by the large red star, achieves the lowest HD95 value while maintaining the same low FLOPs budget, highlighting the most favorable balance between segmentation boundary accuracy and computational efficiency.

Fig. 6.

Fig. 6

Performance vs. computation.

In addition, we propose a novel metric, termed the performance-efficiency metric (PEM), to demonstrate that the proposed model achieves the best balance between performance and efficiency. The PEM is calculated as the average of two normalized accuracy-related metrics, the DSC and HD95, divided by the relative computational cost of the model:

PEM = [(DSC_norm + (1 - HD95_norm)) / 2] / C_rel, (7)

where DSC_norm and HD95_norm denote the mean Dice score and the mean HD95 of each model after Min-Max normalization (with HD95 inverted so that a higher normalized value indicates better accuracy), and C_rel denotes the relative computational cost of the model. A higher PEM means the model achieves better accuracy (in both DSC and HD95) while using relatively fewer computational resources.
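Since Eq. (7) appears only as an image in the published version, the sketch below implements our reading of the verbal definition: min-max normalization over the compared models, HD95 inverted so that higher is better, and cost taken as FLOPs relative to the cheapest model. These normalization and cost choices are our assumptions, so the absolute values need not match the scale of Fig. 7; under these assumptions, however, the induced ranking does reproduce the one reported (SwiM-UNet first, VT-UNet second, nnMamba third).

```python
# Illustrative PEM computation (our reading of the verbal definition;
# the paper's exact normalization and cost reference may differ).
table = {  # (mean Dice, mean HD95, FLOPs in G) on BraTS 2023
    "VT-UNet":     (0.889, 4.395, 165.16),
    "Swin UNETR":  (0.883, 4.847, 780.22),
    "nnMamba":     (0.882, 4.532, 110.67),
    "SegMamba":    (0.878, 5.087, 726.34),
    "MedSegMamba": (0.891, 4.323, 1257.54),
    "SwiM-UNet":   (0.897, 4.106, 241.67),
}

def minmax(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

dice_n = minmax([v[0] for v in table.values()])
hd95_n = minmax([v[1] for v in table.values()])
cheapest = min(v[2] for v in table.values())

pem = {}
for (name, (_, _, flops)), d, h in zip(table.items(), dice_n, hd95_n):
    accuracy = (d + (1 - h)) / 2   # normalized DSC and inverted HD95
    pem[name] = accuracy / (flops / cheapest)

for name in sorted(pem, key=pem.get, reverse=True):
    print(f"{name:12s} {pem[name]:.3f}")
```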

Figure 7 compares the PEM values of the different segmentation models. A higher PEM value indicates a better trade-off between segmentation performance and computational efficiency. Our proposed SwiM-UNet achieves the highest PEM value (approximately 4.1), indicating the most balanced and superior performance when considering both accuracy and efficiency. VT-UNet follows closely behind but still falls short of SwiM-UNet’s performance-efficiency balance, and nnMamba ranks third, showing a moderate balance. MedSegMamba, Swin UNETR, and SegMamba have significantly lower PEM scores, indicating weaker trade-offs between accuracy and computational cost. The proposed model achieves PEM gains of approximately 7%, 20%, 610%, and 1,300% over VT-UNet, nnMamba, MedSegMamba, and Swin UNETR, respectively. In summary, SwiM-UNet achieves state-of-the-art segmentation quality with minimal computational overhead, offering the best trade-off among the compared models.

Fig. 7.

Fig. 7

PEM values of the proposed SwiM-UNet and other models.

Remark 3

Regarding nnMamba and SwiM-UNet, the design objective of our proposed SwiM-UNet differs fundamentally from that of nnMamba. Our goal is not extreme parameter minimization, but rather to achieve a balanced integration of computational efficiency, global context modeling, and robustness for clinically challenging tumor structures, while still remaining suitable for practical on-device deployment. Several factors clarify this distinction: (i) Although nnMamba performs competitively overall, it consistently underperforms SwiM-UNet in the ET and TC subregions (Table 7), which are clinically critical and particularly challenging. Our qualitative results also show that nnMamba struggles with faint boundaries and heterogeneous textures, where hybrid global modeling is advantageous. (ii) While nnMamba has fewer parameters, its recurrent selective state-space operations exhibit hardware-dependent latency behavior that does not always scale directly with FLOPs. In contrast, SwiM-UNet maintains a stable inference profile and achieves near-real-time latency while offering higher segmentation accuracy. (iii) SwiM-UNet is intentionally positioned between heavy transformer architectures and minimal Mamba-only models. It is significantly more efficient than computation-heavy models such as Swin UNETR and MedSegMamba, yet demonstrates substantially higher robustness and accuracy compared to pure Mamba variants. This balance is quantitatively reflected in the PEM (performance-efficiency metric), where our model attains the highest score among all baselines. (iv) Lastly, pure Mamba models such as nnMamba are well suited for ultra-low-power deployments (e.g. microcontrollers or minimal edge devices). SwiM-UNet, although still lightweight, is designed for more capable but resource-constrained platforms such as portable MRI systems, bedside edge accelerators, or intraoperative workstations. Thus, the two models serve different tiers of on-device deployment.

Real-world on-device medical image segmentation

Regarding the potential for real-world on-device medical image segmentation, Tables 9 and 10 contextualize the computational complexity results in Table 6 by mapping them to representative on-device and edge deployment scenarios and explicitly linking these scenarios to concrete hardware constraints, including processing units, memory budgets, and real-time latency requirements. Table 10 further summarizes indicative computational capabilities of commonly used clinical edge systems, such as portable MRI consoles, bedside edge accelerators, and point-of-care workstations. The device categories considered, ranging from handheld ultrasound probes to operating-room edge workstations, reflect distinct clinical use cases with varying real-time, memory, and computational constraints.

Table 9.

Representative on-device/edge medical service platforms.

Device category Primary use case
Handheld Ultrasound Probe Real-time lesion or organ boundary segmentation on probe/tablet
Portable MRI Console Bedside brain lesion pre-segmentation and QA
Point-of-Care Edge Box Mobile clinics, ER triage, telemedicine kits
Ultra–Low-Power (ULP) Edge Accelerator Kiosk or cart-based POC segmentation with privacy constraints
Operating-Room (OR) Workstation Edge Intraoperative guidance with low latency needs

Table 10.

Representative on-device and edge medical service platforms with indicative specifications.

Device category On-device computing Latency constraint Existing methods SwiM-UNet
Handheld Ultrasound Probe 1–10 TOPS (INT8, mobile CPU/GPU/NPU) < 50 ms Generally challenging for transformer-heavy models due to high FLOPs Likely feasible within real-time latency budgets
ULP Edge Accelerator Ultra-low-power edge accelerator (INT8) < 100 ms Generally infeasible for 3D transformer-based models due to extremely limited compute resources Conditionally feasible; further optimization (e.g. pruning or quantization) likely required
Portable MRI Console 0.5–5 TFLOPS (FP16, embedded CPU + optional GPU) < 100 ms Transformer-heavy models often exceed real-time latency budgets Feasible within real-time latency budgets
Point-of-Care Edge Box 20–100 TOPS (INT8, edge GPU/DLA) < 80 ms Deployable, but with significant computational overhead Feasible within real-time latency budgets; efficient and stable deployment
OR Workstation Edge 5–40 TFLOPS (FP16, workstation-class GPU) < 50 ms Deployable but computationally inefficient in terms of the latency–accuracy trade-off Feasible within real-time latency budgets; preferred due to improved latency–accuracy efficiency

The representative specifications in Table 10 are based on publicly available technical documentation of embedded and mobile SoCs, as well as prior studies on on-device and edge medical image analysis40–45.

When examined alongside the computational complexity results in Table 6, it becomes clear that transformer-heavy models such as Swin UNETR and MedSegMamba, which require 780–1,257 GFLOPs and exhibit inference latencies exceeding 120 ms, generally exceed the practical resource budgets of most edge platforms, rendering real-time on-device 3D segmentation challenging without substantial optimization. In contrast, the proposed SwiM-UNet requires only approximately 242 GFLOPs and achieves an inference latency of around 45 ms, placing it within the indicative feasible operating range of several representative edge systems, including portable MRI consoles, point-of-care edge boxes, and latency-sensitive intraoperative workstations. As further reflected in Table 10, SwiM-UNet consistently meets or closely approaches the memory and latency constraints across a broad spectrum of edge platforms, while a more conservative assessment is adopted for ultra-low-power accelerators, where additional optimizations such as pruning or quantization may still be required.
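These feasibility assessments can be sanity-checked with a roofline-style bound: a model's theoretical latency floor on a device is its FLOP count divided by the device's peak throughput. Real deployments sustain only a fraction of peak, so these floors are optimistic and useful only for ruling platforms in or out; the 5-TFLOPS figure below is the upper end of the portable-MRI-console range in Table 10.

```python
# Roofline-style latency floor: model FLOPs / device peak throughput.
# (Optimistic: assumes 100% sustained utilization; real systems are slower.)
def latency_floor_ms(gflops, peak_tflops):
    return gflops / peak_tflops  # (G flops) / (T flops per s) = ms

PEAK = 5.0  # TFLOPS (FP16), upper end for a portable MRI console (Table 10)
for name, gflops in [("SwiM-UNet", 241.67),
                     ("Swin UNETR", 780.22),
                     ("MedSegMamba", 1257.54)]:
    ms = latency_floor_ms(gflops, PEAK)
    verdict = "within" if ms < 100 else "exceeds"
    print(f"{name:12s} floor ~{ms:6.1f} ms -> {verdict} the < 100 ms budget")
```

Even under this best-case assumption, the transformer-heavy baselines exceed the 100 ms budget on such hardware, while SwiM-UNet's floor of roughly 48 ms leaves headroom for realistic utilization losses.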

Overall, the combined analysis of Tables 6, 9 and 10 demonstrates that the primary advantage of SwiM-UNet lies in its well-balanced trade-off between segmentation accuracy and computational efficiency. By significantly reducing computational overhead and inference latency, the proposed model is well aligned with the capabilities of modern clinical edge systems, such as portable MRI consoles and point-of-care accelerators, making it practically viable for realistic on-device deployment rather than merely competitive in abstract benchmark settings. While extremely resource-constrained devices may still require additional optimization, SwiM-UNet provides a reliable foundation for real-world clinical edge applications. These feasibility assessments are indicative and based on relative complexity and latency trends rather than device-specific deployment benchmarks.

Future works

While this study demonstrates the effectiveness of the proposed approach, several limitations motivate future research. First, we will investigate parameter-efficient techniques, such as pruning, quantization, and knowledge distillation, to reduce the computational complexity of the model while preserving its performance. Building upon these efforts, we will further optimize the model for deployment on edge devices by improving its computational efficiency and memory footprint, thereby enabling practical on-device operation in resource-constrained clinical environments. Second, as discussed earlier, the current layer configuration was empirically determined to balance model performance and system overhead. A systematic exploration of optimal and task-adaptive layer interconnection strategies, while accounting for heterogeneous device capabilities, remains an important direction for future research. Third, a deeper investigation into the interdependencies between the individual sub-adapters and the gating mechanism, including an analysis of potential redundancies and identification of the minimal subset of components required to achieve optimal performance, is also warranted. Fourth, we intend to incorporate medical domain knowledge, including tumor-specific priors and multi-modal fusion strategies, into the model design to enhance its robustness in real-world clinical applications. Lastly, additional stress evaluations, such as variations in scanner types, noise levels, and anisotropic spatial resolutions, were not considered here and remain essential for assessing real-world clinical generalizability. Future work will therefore extend the proposed framework beyond brain MRI to segmentation tasks involving other anatomical regions, including the liver, lungs, and heart, as well as additional imaging modalities such as ultrasound and X-ray, in order to broaden its applicability across diverse clinical environments.

Conclusion

In this study, we presented a novel lightweight Mamba–transformer hybrid model, SwiM-UNet. The model integrates the proposed eTSMamba and eSwin blocks, both incorporating low-rank MLPs. The eTSMamba blocks are employed in the early stages to maximize computational efficiency, whereas the eSwin blocks are used in the later stages to effectively capture long-range dependencies and local context. These two blocks are connected via the MS-adapter, which consists of three sub-adapters that emphasize local information along the x-, y- and z-axes, as well as in a channel-wise manner. The MS-adapter further incorporates a multi-gate mechanism to balance the contributions of the sub-adapters. Additionally, channel reduction was applied to the residual blocks of the decoder to decrease the number of parameters and FLOPs. Performance evaluation confirmed that the proposed SwiM-UNet outperforms SOTA benchmark models in terms of Dice score and HD95 while maintaining low computational complexity.

Author contributions

Y. N. (Yeonwoo Noh) conceived and designed the study, developed the methodology, implemented the software, and performed the experiments. S. L. (Seongwook Lee) and S. J. (Seyong Jin) supported the data curation and validation process. Y. C. (Yunyoung Chang) contributed to visualization of the results. D. W. (Dong-Ok Won) contributed to analyzing real-world on-device and edge deployment scenarios. Y. N. and W. N (Wonjong Noh) wrote the original draft of the manuscript. W. N., M. L. (Minwoo Lee) and D. W. contributed to supervision, critical review of the manuscript, and funding acquisition. W. N. was responsible for project administration. All authors reviewed and approved the final manuscript.

Funding

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00223501, No. 2022R1A5A8019303), and partly supported by Hallym University MHC (Mighty Hallym 4.0 Campus) project, 2025 (MHC-202502-002).

Data availability

The BraTS 2023 and 2024 datasets used in this study are publicly available. The BraTS 2023 data were accessed through the official challenge page on the Synapse platform under Synapse ID syn51156910. The BraTS 2024 data were accessed via the Synapse portal under Synapse ID syn53708249. Additionally, the Beyond the Cranial Vault (BTCV) multi-organ abdominal CT segmentation data were accessed via the official challenge page on the Synapse platform under Synapse ID syn3193805.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Change history

3/7/2026

The original online version of this Article was revised: In this article, the grant number “MHC-MHC-202502-002” listed in the funding information section is incorrect and is corrected to read “MHC-202502-002”. The original article has been corrected.

References

  • 1.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint at arXiv:2010.11929 (2020).
  • 2.Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proc. of the IEEE/CVF International Conference on Computer Vision, 10012–10022 (2021).
  • 3.Ghazouani, F., Vera, P. & Ruan, S. Efficient brain tumor segmentation using swin transformer and enhanced local self-attention. Int. J. Comput. Assist. Radiol. Surg. 19, 273–281 (2024).
  • 4.Ferreira, A. et al. How we won brats 2023 adult glioma challenge? just faking it! enhanced synthetic data augmentation and model ensemble for brain tumour segmentation. Preprint at arXiv:2402.17317 (2024).
  • 5.Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. Preprint at arXiv:2312.00752 (2023).
  • 6.Gu, A., Goel, K. & Ré, C. Efficiently modeling long sequences with structured state spaces. Preprint at arXiv:2111.00396 (2021).
  • 7.Zhu, L. et al. Vision mamba: Efficient visual representation learning with bidirectional state space model. Preprint at arXiv:2401.09417 (2024).
  • 8.Shi, D. TransNeXt: Robust foveal visual perception for vision transformers. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17773–17783, 10.1109/CVPR52733.2024.01683 (IEEE Computer Society, 2024).
  • 9.Han, D. et al. Demystify mamba in vision: a linear attention perspective. In Proc. of the 38th International Conference on Neural Information Processing Systems, NIPS ’24 (Curran Associates Inc., 2024).
  • 10.Han, D. et al. Agent attention: On the integration of softmax and linear attention. In Leonardis, A. et al. (eds.) Computer Vision – ECCV 2024, 124–140 (Springer Nature, 2025).
  • 11.Lou, M. et al. Transxnet: Learning both global and local dynamics with a dual dynamic token mixer for visual recognition. IEEE Trans. Neural Netw. Learn. Syst. 36, 11534–11547. 10.1109/TNNLS.2025.3550979 (2025).
  • 12.Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, 234–241 (Springer, 2015).
  • 13.Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N. & Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, 3–11 (Springer, 2018).
  • 14.Hatamizadeh, A. et al. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI brainlesion workshop, 272–284 (Springer, 2021).
  • 15.Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. nnu-net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211 (2021).
  • 16.Goodfellow, I. et al. Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014).
  • 17.ZongRen, L., Silamu, W., Yuzhen, W. & Zhe, W. Densetrans: Multimodal brain tumor segmentation using swin transformer. IEEE Access 11, 42895–42908 (2023).
  • 18.Shi, Y., Li, M., Dong, M. & Xu, C. Vssd: Vision mamba with non-causal state space duality. In Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 10819–10829 (2025).
  • 19.Lou, M., Fu, Y. & Yu, Y. Sparx: A sparse cross-layer connection mechanism for hierarchical vision mamba and transformer networks. In Proc. of the AAAI Conference on Artificial Intelligence, vol. 39, 10.1609/aaai.v39i18.34103 (2025).
  • 20.Shi, Y., Dong, M. & Xu, C. Multi-scale vmamba: Hierarchy in hierarchy visual state space model. In Globerson, A. et al. (eds.) Advances in Neural Information Processing Systems, vol. 37, 25687–25708, 10.52202/079017-0808 (Curran Associates, Inc., 2024).
  • 21.Fu, Y., Lou, M. & Yu, Y. Segman: Omni-scale context modeling with state space models and local attention for semantic segmentation. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19077–19087, 10.1109/CVPR52734.2025.01777 (2025).
  • 22.Lai, Y. et al. Advancing efficient brain tumor multi-class classification–new insights from the vision mamba model in transfer learning. Preprint at arXiv:2410.21872 (2024).
  • 23.Bozinovski, S. & Fulgosi, A. The influence of pattern similarity and transfer of learning upon training of a base perceptron b2. In Proc. Symp. Informatica, 3–121–5 (Bled, 1976). Original in Croatian: Utjecaj slicnosti likova i transfera ucenja na obucavanje baznog perceptrona B2.
  • 24.Dang, T. D. Q., Nguyen, H. H. & Tiulpin, A. Log-vmamba: Local-global vision mamba for medical image segmentation. In Proc. of the Asian Conference on Computer Vision, 548–565 (2024).
  • 25.Xing, Z., Ye, T., Yang, Y., Liu, G. & Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 578–588 (Springer, 2024).
  • 26.Ding, X., Zhang, X., Han, J. & Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11953–11965, 10.1109/CVPR52688.2022.01166 (2022).
  • 27.Ding, X. et al. Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5513–5524, 10.1109/CVPR52733.2024.00527 (2024).
  • 28.Lou, M. & Yu, Y. Overlock: An overview-first-look-closely-next convnet with context-mixing dynamic kernels. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/CVPR52734.2025.00021 (2025).
  • 29.Zhou, R. et al. Cascade residual multiscale convolution and mamba-structured unet for advanced brain tumor image segmentation. Entropy 26, 385 (2024).
  • 30.Hatamizadeh, A. & Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. Preprint at arXiv:2407.08083 (2024).
  • 31.Zhang, M., Chen, Z., Ge, Y. & Tao, X. Hmt-unet: A hybird mamba-transformer vision unet for medical image segmentation. Preprint at arXiv:2408.11289 (2024).
  • 32.Cao, A., Li, Z., Jomsky, J., Laine, A. F. & Guo, J. Medsegmamba: 3d cnn-mamba hybrid architecture for brain segmentation. Preprint at arXiv:2409.08307 (2024).
  • 33.Peiris, H., Hayat, M., Chen, Z., Egan, G. & Harandi, M. A robust volumetric transformer for accurate 3d tumor segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, 162–172 (Springer, 2022).
  • 34.Gong, H. et al. nnmamba: 3d biomedical image segmentation, classification and landmark detection with state space model. In 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI), 1–5 (IEEE, 2025).
  • 35.Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. Med. Image Anal. 97, 103280. 10.1016/j.media.2024.103280 (2024).
  • 36.Hatamizadeh, A. et al. Unetr: Transformers for 3d medical image segmentation. In Proc. of the IEEE/CVF Winter Conference on Applications of Computer Vision, 574–584, 10.1109/WACV51458.2022.00181 (2022).
  • 37.Hatamizadeh, A. et al. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop, 272–284, 10.1007/978-3-031-08999-2_22 (Springer, 2022).
  • 38.Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. nnu-net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211. 10.1038/s41592-020-01008-z (2021).
  • 39.Xing, Z., Ye, T., Yang, Y., Liu, G. & Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 578–588, 10.1007/978-3-031-69023-9_57 (Springer, 2024).
  • 40.Esteva, A. et al. A guide to deep learning in healthcare. IEEE Trans. Med. Imaging 38, 2650–2660 (2019).
  • 41.NVIDIA Corporation. Nvidia jetson xavier nx developer kit: Technical specifications (2023). (Source for computing specifications (TOPS/TFLOPS) for portable MRI and edge boxes.)
  • 42.NVIDIA Corporation. Nvidia jetson orin nano and orin nx product brief (2024). (Specifications for next-generation edge GPU and DLA units mentioned in Table 10.)
  • 43.Google LLC. Edge tpu system architecture (2022). (Technical basis for Ultra-Low-Power (ULP) Edge Accelerator specs.)
  • 44.ARM Ltd. Arm ethos npu technical overview (2023). (Source for NPU computing capabilities in handheld devices.)
  • 45.Maier-Hein, L. et al. Surgical data science for next-generation interventions. Nat. Biomed. Eng. 1, 691–696 (2017).


