Scientific Reports. 2025 Mar 14;15:8883. doi: 10.1038/s41598-025-93353-2

Alternate encoder and dual decoder CNN-Transformer networks for medical image segmentation

Lin Zhang 1, Xinyu Guo 2, Hongkun Sun 2, Weigang Wang 2,3, Liwei Yao 1
PMCID: PMC11909241  PMID: 40087352

Abstract

Accurately extracting lesions from medical images is a fundamental but challenging problem in medical image analysis. In recent years, methods based on convolutional neural networks (CNNs) and Transformers have achieved great success in medical image segmentation. Combining the powerful perception of local information by CNNs with the efficient capture of global context by Transformers is crucial for medical image segmentation. However, the unique characteristics of many lesion tissues often lead to poor performance, and most previous models fail to fully extract effective local and global features. Therefore, based on an encoder-decoder architecture, we propose a novel alternate encoder dual decoder CNN-Transformer network, AD2Former, with two attractive designs: (1) an alternating learning encoder that achieves real-time interaction between local and global information, allowing the two to guide each other's learning; (2) a dual decoder architecture in which two sub-decoders decode independently and their outputs are then fused. To efficiently fuse the different feature information from the two sub-decoders during decoding, we introduce a channel attention module to reduce redundant feature information. Driven by these two designs, AD2Former demonstrates a strong ability to capture target regions and fuzzy boundaries. Experiments on multi-organ segmentation and skin lesion segmentation datasets demonstrate the effectiveness and superiority of AD2Former.

Keywords: Medical Image Segmentation, Vision Transformer, Convolutional Neural Networks, Channel Attention, Alternate Encoder, Dual Decoder

Subject terms: Computational science, Statistics, Machine learning

Introduction

Automatic segmentation of anatomical structures (such as organs) and lesion regions in medical images has become a critical task in medical image analysis and is widely used in clinical diagnosis and treatment. Precise segmentation enables doctors to identify and measure the size and shape of organs or pathological tissues more accurately, which is vital for diagnosing diverse diseases, especially dermatological conditions. Before surgery, segmented images help physicians plan surgical approaches, assess risks, and predict outcomes. During radiotherapy, accurate segmentation is indispensable for precisely defining the radiation dose distribution, thereby minimizing damage to surrounding healthy tissues. However, manual annotation not only requires a significant amount of time and effort but also carries the risk of human error. Therefore, developing more precise medical image segmentation techniques has become an imperative undertaking.

In recent years, propelled by rapid advances in deep learning and computer vision, approaches leveraging Convolutional Neural Networks (CNNs) have emerged as highly effective solutions for image classification and segmentation tasks. Although CNN-based methods have demonstrated satisfactory capability in local feature representation, they often struggle to explicitly model global information due to the inherent inductive bias and limited receptive field of convolution, which weakens the performance of such architectures. To address the limitations of CNNs in handling long-range dependencies, the Vision Transformer (ViT) model has been proposed, drawing on the immense success of Transformers in machine translation and natural language processing. ViT divides images into multiple patches and converts each patch into tokens, which are then fed into the multi-head self-attention mechanism (MSA) along with positional encodings to effectively model global context. Given the respective strengths of CNNs and Transformers, combining the advantages of these two architectures to achieve more powerful image processing capabilities has become a clear trend in research.

Inspired by the powerful local and global modeling capabilities of CNNs and Transformers, we propose a novel medical image segmentation framework, named AD2Former, which fundamentally differs from the model frameworks used in previous research, as shown in Fig. 1. Currently, most segmentation methods based on CNNs and Transformers simply adopt a single CNN encoder (as shown in (a)), a single Transformer encoder (as shown in (c)), or connect the feature maps generated by the two in a cascaded manner (as shown in (b)); they cannot achieve interactive guidance between local and global information during feature extraction and therefore cannot fully exploit the extracted features. To make better use of encoder-extracted features, this paper employs a more sophisticated alternating learning strategy. During encoding, instead of simply combining the functions of CNNs and Transformers, it enables them to collaborate: in each alternating step, CNNs first extract local features, which are then processed by Transformers as sequence information to capture global context. This facilitates the effective extraction and integration of local and global information. Inspired by the skip connections of the U-Net architecture, a dual-decoder structure is proposed. It independently decodes the feature maps generated by the CNN and Transformer layers of the encoder and ultimately classifies the feature maps by fusing features via an attention mechanism. To address the feature redundancy that may arise from directly fusing local and global information, we introduce a simple channel attention module. Finally, we evaluate the proposed AD2Former on two publicly available datasets: the Synapse dataset for abdominal multi-organ segmentation and the ISIC2018 dataset for skin lesion segmentation. Ablation studies and comparisons with state-of-the-art methods demonstrate the superiority of our approach. The contributions of this paper can be summarized as follows:

  • We propose a novel U-shaped medical image segmentation network, named AD2Former, which consists of an interactive encoder and a novel dual decoder. It comprehensively extracts both local and global information from images and promotes the effective supplementation of detailed information, which is crucial for generating accurate segmentation prediction maps.

  • We propose an alternate learning strategy that alternately trains the CNN and Transformer layers to repeatedly optimize the feature extractors, allowing the model to better understand semantic information.

  • We propose a novel dual-branch decoder for feature map decoding. The first branch independently decodes the features from the CNN and Transformer layers of the encoder, generating coarse position information. The second branch fuses local and global semantic information, emphasizing more accurate boundary details. To enhance the fusion of local and global information, we introduce a channel attention module to selectively activate and leverage relevant features.

Fig. 1

Comparison of popular CNN- and Transformer-based architectures for medical image segmentation. (a) The standard pure convolutional structure (e.g., U-Net 1). (b) The cascaded hybrid CNN-Transformer structure (e.g., TransUNet 2). (c) The pure Transformer structure (e.g., Swin-Unet 3). (d) Our proposed alternate encoder and dual decoder architecture.

Related work

CNN-based methods for medical image segmentation

Early medical image segmentation relied primarily on traditional machine learning methods. With the advancement of deep learning, however, CNNs demonstrated outstanding performance in medical image segmentation. Among these approaches, U-Net was commonly used and proved highly effective 1. More recently, various segmentation networks based on U-Net have emerged, including U-Net++ 4 and V-Net 5. These networks improved segmentation accuracy and robustness by introducing different refinements and have been widely used in various scenarios. Zhou et al. 4 improved on U-Net by combining multiple U-Net sub-modules to construct a deep, multi-resolution network structure. Huang et al. 6 utilized a multi-resolution cascaded architecture and proposed new skip connection strategies, which effectively handled targets of different scales and improved segmentation accuracy. Huang et al. 7 introduced the concept of dense connections, where the output of each layer is connected to the inputs of all subsequent layers, ensuring sufficient propagation and reuse of information. Yuan et al. 8 proposed a U-Net-based multi-scale context fusion algorithm for medical image segmentation, which extracts rich contextual information via semantic extraction at different encoding stages and uses a TBSFF module to assign varying weights to semantic information of different scales, enhancing the network's feature learning ability. Isensee et al. 9 proposed a robust and adaptive medical image segmentation framework based on 2D and 3D U-Net architectures, which supports multi-resolution feature map input and adopts adaptive data augmentation. V-Net was designed for 3D medical image segmentation, improving segmentation accuracy while retaining anisotropic information in 3D images 5. However, CNN-based methods overlook the importance of global contextual information.

Transformer-based methods for medical image segmentation

To capture long-range dependencies, researchers began to introduce Transformers into the field of computer vision. A typical representative of this trend is the ViT 10 network, which applies the Transformer idea and has achieved good results in visual tasks. To address ViT's performance gap when training data is limited, researchers have proposed several ViT training strategies.

Wang et al. 11 introduced the Pyramid Vision Transformer by grouping weights in the model; by introducing the Spatial Reduction Attention (SRA) module, the computational cost of the Transformer is greatly reduced. Liu et al. 12 employed a window-based self-attention (WSA) mechanism and a staged feature extraction method: by dividing the input image into multiple fixed-size patches and extracting features for each patch separately, it achieved information interaction among local features, improving on earlier Transformer designs. Although this greatly reduced computation, the self-attention within each window only considers interactions between local features and may fail to capture global information for certain tasks. Xie et al. 13 introduced a simple and efficient design for semantic segmentation. To fully leverage the local feature extraction capability of CNNs and the global context modeling of Transformers, researchers have started exploring combinations of the two architectures. Xu et al. 14 combined local features extracted by CNNs with long-range dependencies across different levels captured by a parallel ViT, which allows tumor features to be captured more effectively. Gu et al. 15 proposed a Weighted Medical Transformer (WMT) model that imposes low requirements on dataset size. Khan et al. 16 proposed a framework consisting of three key modules: a Local-Global Convolutional Neural Network (LG-CNN) that extracts local fine-grained and global context features, an Endoscopy-Lesion Attention module (ELA) that highlights crucial regions and filters out noise, and a Gastrointestinal Endoscopy CNN (GE-CNN) that leverages the previous two modules to classify input images. Chen et al. 2 proposed the first network to combine U-Net and Transformer for medical image segmentation, achieving impressive results; however, this approach led to a large number of parameters and low computational efficiency. In addition, there is a class of network architectures that are purely Transformer-based, such as Swin-Unet 3, nnFormer 17, MISSFormer 18, TransDeepLab 19, and DAE-Former 20. Swin-Unet and TransDeepLab adopt a full Transformer architecture, embedding Transformers into both the encoder and decoder, which enables them to capture more global feature representations from input to output.

Datasets

We conducted comprehensive experiments on two publicly available datasets to validate the effectiveness of our method. Below, we provide a detailed explanation of the two datasets used in our experiments.

  • Synapse multi-organ segmentation: The dataset comprises 30 abdominal CT scans with a total of 3779 axial contrast-enhanced clinical images 21. Each volume consists of a varying number of axial slices of 512×512 pixels. Each image includes some or all of eight abdominal organs: the aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach. Following the approach in 2,3, we divided the samples into a training set (18 scans) and a test set (12 scans).

  • ISIC 2018: The dataset was collected by the International Skin Imaging Collaboration (ISIC) and consists of 2594 RGB images of skin lesions. We follow the same evaluation setup as in 22. One challenge of this dataset is that the lesion areas have irregular geometric shapes as well as various interferences, such as differing skin textures, hair, and rulers.

Methods

The proposed AD2Former framework consists of an alternate encoder with both CNN and Transformer layers and a dual decoder divided into the DUD and CFB sub-decoders, as shown in Fig. 2. Because foreground information occupies a relatively small proportion of medical images, we use a pre-trained ResNet for feature extraction in the early layers of the encoder to better exploit the available information 23. To enhance the network's ability to process global information, Transformer layers are introduced for feature learning. During training, the CNN layers can use gradient information propagated through the Transformer layers to update their parameters and enhance their learning capacity, while the Transformer layers can extract global context more effectively by using information from the CNN layers; the two components complement one another. The decoding process in AD2Former consists of two parts. On the one hand, the multi-scale feature maps output by the CNN and Transformer layers in the encoder are independently decoded through upsampling and skip connections in DUD. On the other hand, to preserve more detailed information and capture features at different scales, the two independently decoded parts are fused in CFB. To achieve effective feature fusion, we design a channel attention module to selectively activate useful channel features while suppressing redundant ones. Finally, the prediction with the same resolution as the input image is obtained through upsampling and convolution. In the following sections, we describe our method in more detail.

Fig. 2

The overall architecture of our proposed AD2Former includes alternate encoder and dual decoder. The encoder alternates learning between CNNs and ViT models to extract more discriminative features and better achieve real-time interaction between local and global semantic information. The decoders process local and global information separately and combine channel attention to improve the model’s prediction performance.

CNN-Transformer alternate encoder

In the encoder, the input image $x \in \mathbb{R}^{H \times W \times 3}$ has a spatial resolution of $H \times W$ and 3 channels. In the first stage, we downsample the input image using the first and second layers of a pre-trained ResNet encoder to obtain the initial feature map $C_1$. To capture long-range dependencies in the feature representations, we introduce Transformer layers that relate pixel information at different positions through a self-attention mechanism. Subsequently, we connect the next ResNet layer to enable interactive guidance between local and global information. The ResNet layer not only extracts better local information but also serves as a patch operation for the following Transformer layers. This architecture better leverages the advantages of both designs, thereby improving the model's ability to predict lesion regions. Specifically, we implement three local-global alternating modules. By exploiting the complementary strengths of CNNs and ViT, more detailed and semantically richer feature maps can be produced. The feature maps output by the CNN layers are denoted as $C_1$, $C_2$, $C_3$, $C_4$, with progressively halved spatial resolutions, and the feature maps output by the Transformer layers are denoted as $T_1$, $T_2$, $T_3$, each matching the resolution of the CNN stage that precedes it.
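To make the alternating scheme concrete, the following is a minimal PyTorch sketch of an encoder that interleaves convolutional stages with Transformer stages. The module names (ConvStage, TransformerStage, AlternateEncoder), channel widths, and depths are illustrative assumptions; the actual model uses pre-trained ResNet34 stages as the convolutional part.

```python
import torch
import torch.nn as nn


class ConvStage(nn.Module):
    """CNN stage: extracts local features and halves the spatial resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class TransformerStage(nn.Module):
    """Transformer stage: treats each spatial position as a token to model global context."""
    def __init__(self, dim, depth=3, heads=4, mlp_ratio=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_ratio * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        tokens = self.encoder(tokens)                  # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class AlternateEncoder(nn.Module):
    """Alternates CNN and Transformer stages and keeps both sets of feature maps."""
    def __init__(self, chs=(64, 128, 256, 512), depths=(3, 3, 3)):
        super().__init__()
        self.stem = ConvStage(3, chs[0])                                        # produces C1
        self.convs = nn.ModuleList([ConvStage(chs[i], chs[i + 1]) for i in range(3)])
        self.transformers = nn.ModuleList([TransformerStage(chs[i + 1], depths[i])
                                           for i in range(3)])

    def forward(self, x):
        c = self.stem(x)
        c_feats, t_feats, h = [c], [], c
        for conv, trans in zip(self.convs, self.transformers):
            c = conv(h)       # local features; also acts as the patch operation
            t = trans(c)      # global context over the CNN features
            c_feats.append(c)
            t_feats.append(t)
            h = t             # the Transformer output feeds the next CNN stage
        return c_feats, t_feats                        # [C1..C4], [T1..T3]


if __name__ == "__main__":
    enc = AlternateEncoder()
    c_maps, t_maps = enc(torch.randn(1, 3, 224, 224))
    print([tuple(f.shape) for f in c_maps])
    print([tuple(f.shape) for f in t_maps])
```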

Dual decoder

To improve the accuracy of the prediction results, we design a decoder consisting of two sub-decoders: DUD and CFB. In DUD, the multi-scale feature maps output by the CNN and ViT layers of the encoder are decoded independently. In CFB, the output feature maps of the CNN sub-decoder and the Transformer sub-decoder are integrated. Both adopt a U-shaped decoding process that starts from the outputs $C_4$ and $T_3$ of the last encoder module and gradually fuses features through upsampling and skip-connection operations, restoring low-resolution feature maps to high resolution while combining low-level and high-level features. Specifically, in DUD, the deeper features are progressively fused with the shallower encoder features $C_i$ and $T_i$ via skip connections, and the calculations in DUD can be represented as follows:

  • For the CNN sub-decoder:
    $\hat{D}^{c}_{i} = \mathrm{Cat}\big(\mathrm{Up}(D^{c}_{i+1}),\, C_{i}\big)$  (1)
    $D^{c}_{i} = \mathrm{Conv}\big(\hat{D}^{c}_{i}\big)$  (2)
  • For the Transformer sub-decoder:
    $\hat{D}^{t}_{i} = \mathrm{Cat}\big(\mathrm{Up}(D^{t}_{i+1}),\, T_{i}\big)$  (3)
    $D^{t}_{i} = \mathrm{Conv}\big(\hat{D}^{t}_{i}\big)$  (4)

Where $\mathrm{Up}(\cdot)$ represents the upsampling operator and $\mathrm{Cat}(\cdot)$ the concatenation operation. $\mathrm{Conv}(\cdot)$ represents a convolution block with two $3\times 3$ filters. $D^{c}_{i}$ and $D^{t}_{i}$ are the feature maps generated by the CNN and Transformer sub-decoders, initialized with $D^{c}_{4} = C_{4}$ and $D^{t}_{3} = T_{3}$. The formula for CFB can be represented as follows:

$\hat{F} = \mathrm{CAM}\big(\mathrm{Cat}(D^{c},\, D^{t})\big)$  (5)
$F = \mathrm{Conv}_{1\times 1}\big(\mathrm{Up}(\hat{F})\big)$  (6)

Where $\mathrm{CAM}(\cdot)$ represents the channel attention (CAM) module. Ultimately, the output $\hat{F}$ is upsampled and passed through a $1\times 1$ convolution to obtain a prediction map $F$ with the same size as the input image.
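Equations (1)-(4) amount to the familiar upsample-concatenate-convolve decoding step applied separately in each sub-decoder. The block below is a minimal sketch of one such step under our notation; the class name SubDecoderBlock and the channel arithmetic are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubDecoderBlock(nn.Module):
    """One DUD decoding step (Eqs. 1-4): upsample the deeper feature, concatenate the
    encoder skip connection, then refine with two 3x3 convolutions."""
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(deep_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, deep, skip):
        up = F.interpolate(deep, size=skip.shape[-2:],
                           mode="bilinear", align_corners=False)    # Up(.)
        return self.refine(torch.cat([up, skip], dim=1))             # Conv(Cat(Up(.), skip))


if __name__ == "__main__":
    # e.g. the CNN sub-decoder fusing C4 (512 ch, 14x14) with the C3 skip (256 ch, 28x28)
    block = SubDecoderBlock(deep_ch=512, skip_ch=256, out_ch=256)
    d3 = block(torch.randn(1, 512, 14, 14), torch.randn(1, 256, 28, 28))
    print(tuple(d3.shape))   # (1, 256, 28, 28)
```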

In CFB, local and global information are fused by concatenating the features generated by the CNN and Transformer sub-decoders. A straightforward way to process the concatenated features is to employ convolutional layers. However, basic convolutional layers may struggle to adequately describe the critical correlations between feature channels. To enhance the representation capability of the features and capture inter-channel dependencies more effectively, we introduce CAM modules 24. By performing subspace transformation and adaptive weighted averaging on the feature maps from the CNN and ViT sub-decoders along the channel dimension, the network suppresses task-irrelevant noise and enhances useful information, improving the accuracy and robustness of the model, as shown in Fig. 3. In detail, we first compress the feature map $X \in \mathbb{R}^{H \times W \times C}$ into a channel descriptor $z \in \mathbb{R}^{1 \times 1 \times C}$ along the spatial dimension using global average pooling. Then, we implement a gate mechanism that generates weights for each feature channel using a parameter $W$, explicitly modeling the correlations between feature channels. Finally, the output weights of the excitation step are used as the importance of each feature channel, and the original features are reweighted element-wise along the channel dimension, completing the selection of important feature channels. As a result, we achieve better fusion of local and global contexts.

Fig. 3

Channel Attention: CAM block. $X$ denotes the concatenated features output by the CNN and ViT sub-decoders, $\tilde{X}$ denotes the channel-weighted features, and $\sigma$ indicates the sigmoid function.
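The CAM described above follows the squeeze-and-excitation pattern of reference 24: global average pooling, a two-layer bottleneck, a sigmoid gate, and channel-wise reweighting. The sketch below is our reading of that design; the reduction ratio of 16 is an assumption.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (CAM), cf. Hu et al. 24."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.gate = nn.Sequential(                     # excitation: per-channel gate in [0, 1]
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.gate(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                             # reweight channels, damping redundant ones


if __name__ == "__main__":
    # fuse same-scale CNN and Transformer sub-decoder outputs, then reweight channels
    d_cnn = torch.randn(1, 128, 56, 56)
    d_vit = torch.randn(1, 128, 56, 56)
    cam = ChannelAttention(channels=256)
    fused = cam(torch.cat([d_cnn, d_vit], dim=1))
    print(tuple(fused.shape))   # (1, 256, 56, 56)
```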

Loss function

We use a weighted combination of the Dice loss 5 and the cross-entropy loss as the training objective, which can be expressed as:

$\mathcal{L} = \lambda_{1}\, \mathcal{L}_{\mathrm{Dice}} + \lambda_{2}\, \mathcal{L}_{\mathrm{CE}}, \qquad \mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{i} p_{i}\, g_{i}}{\sum_{i} p_{i} + \sum_{i} g_{i}}, \qquad \mathcal{L}_{\mathrm{CE}} = -\sum_{i} g_{i} \log p_{i}$  (7)

Where $i$ indexes the pixels, $p_{i}$ is the predicted probability that the $i$-th pixel belongs to the segmentation region, $g_{i}$ is the ground-truth value of the $i$-th pixel, and $\lambda_{1}$ and $\lambda_{2}$ are weighting coefficients. In this paper, all experiments are conducted with the default values $\lambda_{1} = \lambda_{2} = 0.5$.
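A minimal sketch of the combined objective in Eq. (7) is given below, with the stated defaults λ1 = λ2 = 0.5; the smoothing constant and the multi-class handling are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiceCELoss(nn.Module):
    """Weighted sum of soft Dice loss and cross-entropy, following Eq. (7)."""
    def __init__(self, lambda_dice=0.5, lambda_ce=0.5, smooth=1e-5):
        super().__init__()
        self.lambda_dice, self.lambda_ce, self.smooth = lambda_dice, lambda_ce, smooth

    def forward(self, logits, target):
        # logits: (B, K, H, W); target: (B, H, W) with class indices 0..K-1
        ce = F.cross_entropy(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        dims = (0, 2, 3)                                   # sum over batch and space
        inter = (probs * one_hot).sum(dims)
        dice = (2 * inter + self.smooth) / (probs.sum(dims) + one_hot.sum(dims) + self.smooth)
        dice_loss = 1.0 - dice.mean()                      # averaged over classes
        return self.lambda_dice * dice_loss + self.lambda_ce * ce


if __name__ == "__main__":
    loss_fn = DiceCELoss()
    logits = torch.randn(2, 9, 224, 224)                   # e.g. 8 organs + background
    target = torch.randint(0, 9, (2, 224, 224))
    print(loss_fn(logits, target).item())
```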

Experiments

Implementation details

The implementation of our proposed AD2Former was based on the PyTorch library and Python 3.8. The experiments were conducted on a single NVIDIA RTX 3090 GPU. In our experiments, all input images were resized to a fixed resolution. To better initialize our model, we used a pre-trained ResNet34. After a thorough comparison, we selected the following hyperparameters for each dataset:

  • For the Synapse multi-organ dataset: We preprocessed the original images to the network input resolution and applied rotation and flipping data augmentation. The model was trained using the SGD optimizer with a batch size of 4 for 150 epochs, with momentum 0.9, a weight decay of 1e-4, and a learning rate of 0.01.

  • For the ISIC2018 dataset: The learning rate was set to 0.02, and the model was trained with the Adam optimizer for 150 epochs with a batch size of 4 and a weight decay of 1e-4.
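For reference, the two training configurations above translate roughly into the following optimizer setups; the stand-in model is a placeholder, and no learning-rate schedule is shown since none is specified.

```python
import torch

model = torch.nn.Conv2d(3, 9, 1)   # placeholder for an AD2Former instance

# Synapse: SGD, lr 0.01, momentum 0.9, weight decay 1e-4 (batch size 4, 150 epochs)
synapse_optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                    momentum=0.9, weight_decay=1e-4)

# ISIC 2018: Adam, lr 0.02, weight decay 1e-4 (batch size 4, 150 epochs)
isic_optimizer = torch.optim.Adam(model.parameters(), lr=0.02, weight_decay=1e-4)
```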

Evaluation metrics

We employ the Dice score (DSC) and the Hausdorff Distance (HD) as evaluation metrics to measure segmentation performance on the Synapse dataset. For the skin lesion dataset, we use the Dice score (DSC), specificity (SP), sensitivity (SE), and accuracy (ACC). The specific formulas are as follows:

$\mathrm{DSC} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$  (8)
$\mathrm{SP} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}$  (9)
$\mathrm{SE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$  (10)
$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$  (11)

Where FP and FN denote the numbers of incorrectly predicted foreground and background pixels, and TP and TN denote the numbers of correctly classified foreground and background pixels. Note that since the Synapse dataset contains multiple classes, we first compute the DSC for each class and then average the per-class Dice scores to obtain the final overall DSC. The formula for HD is:

$\mathrm{HD}(X, Y) = \max\big(h(X, Y),\, h(Y, X)\big)$  (12)

Where $h(X, Y)$ is defined as $\max_{x \in X} \min_{y \in Y} d(x, y)$ and $h(Y, X)$ as $\max_{y \in Y} \min_{x \in X} d(y, x)$, where $X$ and $Y$ represent the predicted and ground-truth segmentation maps, $x$ and $y$ represent pixels in the predicted and ground-truth maps, and $d(x, y)$ denotes the distance between $x$ and $y$.
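The pixel-wise metrics in Eqs. (8)-(11) and the Hausdorff distance in Eq. (12) can be computed from binary masks as sketched below. This is an illustrative NumPy/SciPy version, not the authors' evaluation script; it computes the plain symmetric Hausdorff distance of Eq. (12) over foreground pixel coordinates.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff


def pixel_metrics(pred, gt, eps=1e-8):
    """DSC, SP, SE and ACC for a pair of binary masks (Eqs. 8-11)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    dsc = 2 * tp / (2 * tp + fp + fn + eps)
    sp = tn / (tn + fp + eps)
    se = tp / (tp + fn + eps)
    acc = (tp + tn) / (tp + tn + fp + fn + eps)
    return dsc, sp, se, acc


def hausdorff(pred, gt):
    """Symmetric Hausdorff distance between foreground pixel sets (Eq. 12)."""
    x = np.argwhere(pred)   # coordinates of predicted foreground pixels
    y = np.argwhere(gt)     # coordinates of ground-truth foreground pixels
    return max(directed_hausdorff(x, y)[0], directed_hausdorff(y, x)[0])


if __name__ == "__main__":
    pred = np.zeros((64, 64), dtype=np.uint8); pred[10:40, 10:40] = 1
    gt = np.zeros((64, 64), dtype=np.uint8);   gt[12:42, 12:42] = 1
    print(pixel_metrics(pred, gt))
    print(hausdorff(pred, gt))
```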

Ablation studies

To demonstrate the effectiveness of our proposed network, we conducted ablation studies focusing on two aspects: the validity of the core modules and the effect of the model parameters. Due to limited space, the main ablation studies were conducted on the Synapse dataset.

Effectiveness of model parameters

To explore the optimal parameters, we studied the impact of the Transformer depth and the MLP size in each Transformer sub-module on model performance and conducted two sets of ablation studies. For the three stages that use the ViT encoder, we use a layer list $[l_1, l_2, l_3]$ to indicate the layer settings of the three Transformer modules in the encoder. For the MLP size, we compared our model with the original dimensionality expansion factor, using M to denote the expansion factor of the first fully connected layer in the MLP block of the Transformer. The results of the different configurations are shown in Table 1 and Table 2. They show that the [3,3,3] configuration for the depth of the Transformer layers in the three stages achieves the best performance; increasing the Transformer depth in the second and third stages did not lead to a significant improvement in segmentation accuracy. It can also be seen that an MLP block with an expansion factor of M = 2 achieves a higher DSC than the M = 4 configuration, bringing a 1.36% performance improvement.

Table 1.

Ablation studies on the depth of the Transformer layers on the Synapse dataset. The best results are in bold.

Method Layers DSC (%) ↑ HD (mm) ↓
AD2Former:A [3,3,3] 83.18 20.89
AD2Former:B [3,6,3] 81.28 25.25
AD2Former:C [3,5,8] 82.22 22.77
AD2Former:D [3,6,9] 83.03 23.40
Table 2.

Ablation studies on MLP block dimension multiplier on the Synapse dataset. The best results are in bold.

Method DSC (%) ↑ HD (mm) ↓
AD2Former w/ M=2 (ours) 83.18 20.17
AD2Former w/ M=4 81.82 20.89

Validity of core modules

In this section, we first validate the effectiveness of CFB. The objective of this ablation study was to test how changing the number of skip connections in CFB affects performance. We compared the results of skip-0 (no CFB), skip-1 (ours), and skip-2. The experimental results are presented in Table 3. Our findings show that adding skip-1 to CFB increased the DSC by 1.26%, yielding the best performance. However, performance decreased when skip-2 was added, mainly because additional skip connections introduce noise that degrades the information transmitted by each connection. To illustrate the impact of different numbers of skip connections on segmentation performance more clearly, we ran the three configurations on the Synapse dataset and present the comparison results in Fig. 5.

Table 3.

Ablation studies of the number of skip connections in CFB on the Synapse dataset. The best results are in bold.

Method DSC (%) ↑ HD (mm) ↓ Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach
skip-0 81.92 18.17 88.57 69.47 84.68 79.94 95.04 68.97 90.05 78.67
skip-1 83.18 20.89 88.94 71.76 83.54 80.00 95.16 69.71 92.14 84.23
skip-2 81.18 19.19 89.43 71.12 82.27 74.25 95.51 68.31 89.55 78.97
Fig. 5

Visual comparison of ablation studies with different numbers of skip connections in CFB on the Synapse dataset. (a) Ground Truth. (b) skip-0 (no CFB). (c) skip-2. (d) skip-1 (ours). The white boundaries highlight the regions with the most distinct differences.

For example, in the first row of segmentation results, skip-0 and skip-2 frequently misclassify background information as foreground, whereas our skip-1 approach classifies pixels accurately in the vicinity of the target area. In the second row, skip-1 segments the edges of the segmentation region more accurately. In the third and fourth rows, compared with the Ground Truth, skip-1 (ours) clearly performs better than the other two approaches in segmenting the stomach: skip-0 fails to detect some regions in the third row, and skip-2 captures only a small portion of the stomach, whereas skip-1 effectively identifies the majority of the area.

Furthermore, we conducted ablation studies to examine the effectiveness of the channel attention module, as shown in Table 4. The experiments show that AD2Former outperformed AD2Former w/o CAM, achieving an average DSC improvement of 0.54%. This suggests that without the channel attention module, directly fusing the same-scale CNN and Transformer feature maps output by DUD through a convolution and feeding them to CFB can cause feature redundancy, which degrades segmentation performance. By adding the channel attention module, the network learns the correlations between feature channels, making the extraction and utilization of local and global information more effective. To compare the feature learning behavior of the model with and without the channel attention module graphically, we show the key feature maps of the two approaches in Fig. 4. We visualized the liver and right kidney feature maps before and after channel attention. We found that with the CAM module, the network concentrates more on the target organ area in the initial stage of segmentation. In the second stage, compared with the model without the channel attention module, our model attends better to the relevant semantic information. For right kidney segmentation in stage 1, although the features generated by the ViT sub-decoder misclassify the left kidney as the right kidney, feature selection through channel attention allows a greater focus on the region corresponding to the right kidney.

Table 4.

Ablation studies of the CAM module on the Synapse dataset. The best results are in bold.

Method DSC (%) ↑ HD (mm) ↓ Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach
w/o CAM 82.64 16.44 88.65 73.40 84.22 77.92 95.17 68.99 91.56 81.19
w/ CAM 83.18 20.89 88.94 71.76 83.54 80.00 95.16 69.71 92.14 84.23
Fig. 4

To compare the feature learning performance w/ and w/o the CAM module on the Synapse dataset, we used heatmaps to visualize the before- and after-channel attention feature maps. The first and third rows show the feature maps w/ the CAM module, while the second and fourth rows represent those w/o the CAM module. The symbol “C” represents feature concatenation, with the left side being the output feature of the CNNs sub-decoder and the right side being the output feature of the ViT sub-decoder.

Results of multi-organ segmentation

To validate the superiority of our proposed method in multi-organ image segmentation tasks, we compared it with current state-of-the-art CNN- and Transformer-based methods. To ensure the consistency of the experimental results, the comparative experiments were conducted under the same experimental settings. The results are shown in Table 5, with the best results highlighted in bold. Note that AD2Former is a framework for 2D images, so only 2D networks are included in the comparison, not 3D networks.

Table 5.

Statistical comparison with different state-of-the-art methods on the Synapse dataset.

Method DSC (%) ↑ HD (mm) ↓ Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach
V-Net 68.81±2.36 - 75.34 51.87 77.10 80.75 87.84 40.05 80.56 56.98
DARR 69.77±2.45 - 74.74 53.77 72.31 73.24 94.08 54.18 89.90 45.96
R50 U-Net 74.68±2.37 36.87±2.57 87.47 66.36 80.60 78.19 93.74 56.90 85.87 74.16
U-Net 76.85±2.41 39.70±1.21 89.07 69.72 77.77 68.60 93.43 53.98 86.67 75.58
R50 Att-UNet 75.57±2.49 36.97±1.60 55.92 63.91 79.20 72.71 93.56 49.37 87.19 74.95
Att-UNet 77.77±2.52 36.02±1.64 89.55 68.88 77.98 71.11 93.57 58.04 87.30 75.75
R50 ViT 71.29±2.37 32.87±1.73 73.73 55.13 75.80 72.20 91.51 45.99 81.99 73.95
TransUNet 77.48±2.39 31.69±2.01 87.23 63.13 81.87 77.02 94.08 55.86 85.08 75.62
Swin-Unet 79.13±3.50 21.55±1.43 85.47 66.53 83.28 79.61 94.29 56.58 90.66 76.60
TransDeepLab 80.16±4.55 21.25±1.48 86.04 69.16 84.08 79.88 93.53 61.19 89.00 78.40
HiFormer 80.39±2.62 14.70±2.62 86.21 65.69 85.23 79.77 94.61 59.52 90.99 81.08
MISSFormer 81.96±3.41 18.20±1.63 86.99 68.65 85.21 82.00 94.41 65.67 91.92 80.81
TransCeption 82.24±3.37 20.89±2.29 87.60 71.82 86.23 80.29 95.01 65.27 91.68 80.02
DAE-Former 82.43±3.44 17.46±1.76 88.96 72.30 86.08 80.88 94.98 65.12 91.94 79.19
AD2Former(Ours) 83.18±2.32 20.89±1.79 88.94 71.76 83.54 80.00 95.16 69.71 92.14 84.23

As presented in Table 5, our AD2Former outperformed the other network architectures with a DSC of 83.18% and an HD of 20.89. Moreover, our method achieved the best performance for liver, pancreas, spleen, and stomach segmentation. Specifically, for the pancreas and stomach, our method outperformed the second-best methods, MISSFormer 18 and HiFormer 25, by 4.04% and 3.15%, respectively. To further illustrate the merits of our approach, a paired Student's t-test was carried out against the second-best results; the obtained p-value (p<0.05) indicates a statistically significant difference between our method and the comparison methods. We also conducted a visual comparison of different methods on the Synapse dataset. We randomly selected six slices and divided them into two groups: one to compare the overall segmentation effect and the other to observe the segmentation effect at the edges, as shown in Fig. 7. As can be seen from the first group of slices (the first three rows), our method achieved better organ segmentation than the other five networks. Compared with CNN-based methods, Transformer-based methods pay more attention to global context but have limitations in handling local details, especially in accurately judging details at the edges of the segmentation region. As shown in the second row, other models misclassified the peripheral background of the liver and stomach as foreground. Although DAE-Former combines self-attention and convolution, it still has issues handling boundary information, such as misclassifying parts of the boundaries of the pancreas and stomach. In contrast, our proposed AD2Former, through the real-time interaction of self-attention and convolution and its dual decoding process, classifies the pancreas and stomach more accurately. In the second group of slices, taking the first row as an example, other models misclassify irrelevant regions as liver, while our model is unaffected by such interference. In the second row, when faced with highly blurred edges, other models struggle to achieve precise segmentation, producing jagged and indistinct segmentation lines, whereas AD2Former delineates organ edges relatively clearly, with distinct segmentation boundaries.
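The significance claim above rests on a paired Student's t-test over per-case scores. The snippet below illustrates such a test with SciPy using placeholder arrays; the actual per-case Dice values are not reproduced here.

```python
import numpy as np
from scipy import stats

# Placeholder per-case Dice scores for AD2Former and the second-best method
# (illustrative values only; the real per-case results are not reproduced here).
ours = np.array([0.84, 0.82, 0.86, 0.81, 0.85, 0.83, 0.84, 0.80, 0.86, 0.82, 0.83, 0.85])
baseline = np.array([0.82, 0.81, 0.84, 0.79, 0.83, 0.82, 0.82, 0.79, 0.84, 0.80, 0.82, 0.83])

t_stat, p_value = stats.ttest_rel(ours, baseline)   # paired t-test over the 12 test cases
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")        # p < 0.05 -> statistically significant
```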

Fig. 7

Qualitative results of different methods on the Synapse dataset. (a) Ground Truth. (b) U-Net 1. (c) TransUNet 2. (d) Swin-Unet 3. (e) TransDeepLab 19. (f) DAE-Former 20. (g) Ours. The white boundaries highlight the regions with the most distinct differences.

However, some failure cases during segmentation are also worth discussing. As shown in Fig. 6, when the samples are severely imbalanced, the model may also produce inaccurate segmentations.

Fig. 6

Failure cases: When the samples are severely imbalanced, the performance of the algorithm is mediocre.

Results of skin lesion segmentation

We also evaluated AD2Former on the ISIC2018 skin lesion dataset and compared it with existing state-of-the-art (SOTA) approaches to demonstrate its generalizability. All competitors were tested under the same computing environment. Table 6 reports the Dice score (DSC), specificity (SP), sensitivity (SE), accuracy (ACC), inference speed (images/s), and memory utilization of each method, with the best results highlighted in bold. Our method outperformed the other segmentation networks on all accuracy metrics, achieving the highest DSC and SE of 91.28% and 92.00%; SP and ACC are also the highest, at 98.82% and 96.49%, respectively. The inference speed refers to the number of images that can be processed per second: on average, AD2Former infers 36 images per second, and the memory utilization during training reaches 99%. Furthermore, the p-value (p<0.05) from a paired Student's t-test indicates that our method differs significantly from these comparative methods. The comparison with attention-based methods further shows that our method has better robustness and generalization ability. We also conducted a visual comparison of the segmentation results of different models, selecting U-Net, TransUNet, FAT-Net, and Swin-Unet as representative methods, as shown in Fig. 8.

Table 6.

Statistical comparison with different state-of-the-art methods on the ISIC 2018 dataset.

Method DSC ↑ SP ↑ SE ↑ ACC ↑ Inference Speed (images/s) Memory Utilization
U-Net 0.8545±2.86 0.9697±3.32 0.8800±3.47 0.9404±3.51 79 98%
TransUNet 0.8499±3.55 0.9653±2.70 0.8578±2.76 0.9452±3.25 34 93%
FAT-Net 0.8566±3.01 0.9863±3.43 0.8674±3.03 0.9376±2.72 45 97%
Swin-Unet 0.8946±3.62 0.9798±2.34 0.9056±2.48 0.9645±2.46 49 96%
MISSFormer 0.8657±2.94 0.9742±3.50 0.8371±3.64 0.9453±2.92 24 98%
TransCeption 0.9124±3.38 0.9744±3.73 0.9192±2.89 0.9628±3.61 9 98%
AD2Former(Ours) 0.9128±2.52 0.9882±3.69 0.9200±2.81 0.9649±2.58 36 99%

Fig. 8

Visual comparison with different state-of-the-art methods on the ISIC 2018 dataset. (a) Input images. (b) Ground Truth. (c) U-Net 1. (d) TransUNet 2. (e) FAT-Net 26. (f) Swin-Unet 3. (g) Ours. The red contours are the ground truth, and the blue contours are the segmentation results of the various methods.

Based on our observations, our method generally outperforms the other competitors and achieves the best segmentation results. Without global context, U-Net struggles to distinguish foreground from background, leading to inaccurate predictions at skin lesion edges. Although FAT-Net combines global and local information to compensate for weak global context extraction, it can be overly sensitive and produce a larger prediction region, reducing its accuracy. Swin-Unet introduces a window mechanism to reduce computational complexity, but its global information interaction is still insufficient, particularly for samples with complex boundaries, causing mis-segmentation. In contrast, our model adopts an alternating learning strategy that overcomes the lack of local-global interaction, and the dual decoder refines detail information, enabling it to better cope with challenging scenarios such as low contrast and blurred boundaries. Therefore, as shown in the figure above, our model achieves better segmentation results than the other competitors in such cases.

Discussion and limitation

Accurate medical image segmentation is of crucial importance in clinical diagnosis and treatment. In this paper, a multi-scale information fusion network based on alternating encoding is proposed to provide a reliable medical image segmentation method. The method was validated on the Synapse and ISIC2018 datasets. On the Synapse dataset, it achieved a Dice coefficient of 83.18% and an HD of 20.89. On the ISIC2018 dataset, it reached a Dice coefficient of 91.28%, a specificity (SP) of 98.82%, a sensitivity (SE) of 92.00%, and an accuracy (ACC) of 96.49%. The proposed network extracts more effective information from the feature maps through the alternating encoder and reduces information loss through the multi-scale information fusion mechanism of the dual decoder. Extensive experiments and analyses demonstrate the advantages of the proposed model.

However, the proposed model still has some deficiencies. For example, when the samples are severely imbalanced, the performance of the algorithm is mediocre. In the future, we plan to address this problem by integrating Transformer and GAN (Generative Adversarial Network) components into the network.

Acknowledgements

This work was supported by the project of the Economic Forecasting and Policy Simulation Laboratory, Zhejiang Gongshang University (No. 2024SYS015).

Data availability

The datasets generated and/or analysed during the current study are available in the following repository. [https://challenge.isic-archive.com/data/#2018 and https://pan.baidu.com/s/14yCfniLPlEO8y2Es-0JcTQ extract code(ezxa)].

Code availability

The code can be found here: https://github.com/Data-Guo/AD2Former.git

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Xinyu Guo, Email: gxy2711554980@163.com.

Hongkun Sun, Email: 22020040140@pop.zjgsu.edu.cn.

Weigang Wang, Email: wangweigang@zjgsu.edu.cn.

Liwei Yao, Email: ylwmed@163.com.

References

  • 1.Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 234–241 (Springer, 2015).
  • 2.Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
  • 3.Cao, H. et al. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision, 205–218 (Springer, 2022).
  • 4.Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N. & Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, 3–11 (Springer, 2018).
  • 5.Milletari, F., Navab, N. & Ahmadi, S.-A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), 565–571 (IEEE, 2016).
  • 6.Huang, H. et al. Unet 3+: A full-scale connected unet for medical image segmentation. In ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), 1055–1059 (IEEE, 2020).
  • 7.Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4700–4708 (2017).
  • 8.Yuan, Y. & Cheng, Y. Medical image segmentation with unet-based multi-scale context fusion. Scientific Reports 14, 15687 (2024).
  • 9.Isensee, F. et al. nnu-net: Self-adapting framework for u-net-based medical image segmentation. arXiv preprint arXiv:1809.10486 (2018).
  • 10.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • 11.Wang, W. et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, 568–578 (2021).
  • 12.Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022 (2021).
  • 13.Xie, E. et al. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34, 12077–12090 (2021).
  • 14.Xu, R. et al. Boundary guidance network for medical image segmentation. Scientific Reports 14, 17345 (2024).
  • 15.Gu, Q., Zhang, H., Cai, R., Sui, S. Y. & Wang, R. Segmentation of liver ct images based on weighted medical transformer model. Scientific Reports 14, 9887 (2024).
  • 16.Khan, S. D., Basalamah, S. & Lbath, A. Multi-module attention-guided deep learning framework for precise gastrointestinal disease identification in endoscopic imagery. Biomedical Signal Processing and Control 95, 106396 (2024).
  • 17.Zhou, H.-Y. et al. nnformer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201 (2021).
  • 18.Huang, X., Deng, Z., Li, D., Yuan, X. & Fu, Y. Missformer: An effective transformer for 2d medical image segmentation. IEEE Transactions on Medical Imaging (2022).
  • 19.Azad, R. et al. Transdeeplab: Convolution-free transformer-based deeplab v3+ for medical image segmentation. In International Workshop on PRedictive Intelligence In MEdicine, 91–102 (Springer, 2022).
  • 20.Azad, R., Arimond, R., Aghdam, E. K., Kazerouni, A. & Merhof, D. Dae-former: Dual attention-guided efficient transformer for medical image segmentation. arXiv preprint arXiv:2212.13504 (2022).
  • 21.Azad, R., Jia, Y., Aghdam, E. K., Cohen-Adad, J. & Merhof, D. Enhancing medical image segmentation with transception: A multi-scale feature fusion approach. arXiv preprint arXiv:2301.10847 (2023).
  • 22.Alom, M. Z., Yakopcic, C., Hasan, M., Taha, T. M. & Asari, V. K. Recurrent residual u-net for medical image segmentation. Journal of Medical Imaging 6, 014006 (2019).
  • 23.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
  • 24.Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7132–7141 (2018).
  • 25.Heidari, M. et al. Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 6202–6212 (2023).
  • 26.Wu, H. et al. Fat-net: Feature adaptive transformers for automated skin lesion segmentation. Medical Image Analysis 76, 102327 (2022).
