IEEE Journal of Translational Engineering in Health and Medicine
2022 Sep 26;10:1800909. doi: 10.1109/JTEHM.2022.3210047

MPSHT: Multiple Progressive Sampling Hybrid Model for Multi-Organ Segmentation

Yiyang Zhao 1, Jinjiang Li 2, Zhen Hua 1
PMCID: PMC9704745  PMID: 36457896

Abstract

Background: In recent years, computer-assisted diagnosis has become an increasingly common topic. Multi-organ segmentation of clinical Computed Tomography (CT) images of the patient’s abdomen and magnetic resonance images (MRI) of the patient’s heart is a challenging task in medical image segmentation, and the accurate segmentation of multiple organs is an important prerequisite for disease diagnosis and treatment planning. Methods: In this paper, we propose a new method for multi-organ segmentation of CT and MRI images. The method is based on a CNN-Transformer hybrid model, to which a progressive sampling module is added. Results: We performed multi-organ segmentation on the CT and MRI images provided by two public datasets, the Synapse multi-organ CT dataset (Synapse) and the Automated Cardiac Diagnosis Challenge dataset (ACDC), using the Dice Similarity Coefficient (DSC) and the 95th-percentile Hausdorff distance (HD95) as evaluation metrics for the Synapse dataset. On the Synapse CT dataset, the average DSC reached 79.76% and the HD95 reached 21.55 mm; the DSC scores for Kidney (R), Pancreas, and Stomach reached 80.77%, 59.84%, and 81.13%, respectively. The average DSC on the ACDC MRI dataset reaches 91.8%, exceeding other state-of-the-art techniques. Conclusion: We propose a multiple-sampling vision transformer, MPSHT, based on the CNN-Transformer structure. The model combines the advantages of the CNN and the Transformer, and the added progressive sampling module makes the segmentation of organs more accurate, making up for the shortcomings of previous CNN-Transformer hybrid models.

Keywords: Convolutional neural network, segmentation, multi-organ, CNN-Transformer, CT, MRI

I. Introduction

Computer vision technology has been increasingly used in medical image analysis. Within translational engineering, medical image segmentation is an indispensable step of medical image analysis. Currently, medical image segmentation relies heavily on U-shaped fully convolutional networks (FCNs) [1], [2]. A representative example, U-Net [1], comprises a symmetric encoder-decoder network. This network structure not only makes good use of local spatial features but also reduces the computational cost effectively. However, as deep learning research has developed, the FCN model has gradually revealed some defects. In the encoder, deep features are extracted from the image through a series of convolution and downsampling layers. In the decoder, the extracted deep features are restored to the input image resolution by cascaded upsampling, and the local spatial information lost during downsampling is recovered through skip connections, which enhance image detail and thereby mitigate the loss of spatial information caused by downsampling. This approach has been used with great success in a variety of medical and translational-engineering applications. The high performance of FCN-based algorithms such as Res-UNet [3], UNet++ [4], and UNet 3+ [5] in heart recognition, organ segmentation, and lesion segmentation confirms the CNN’s effectiveness and universality.

While current CNN-based approaches perform well in medical image analysis for translational engineering, and accurate medical image segmentation plays an important role in clinical applications [6], they do not fully meet the high precision that clinical diagnosis demands of image segmentation. Medical image segmentation thus remains one of the most difficult problems in medical image analysis. Because of the inherent locality of the convolution operation, it is very difficult for CNNs to learn long-range and global semantic information. More recently, researchers have turned to the Transformer, a model that has achieved great success in natural language processing (NLP). The Vision Transformer (ViT) [7] was proposed to apply the Transformer to image recognition. Since self-attention is the core component of the Transformer, and the attention mechanism models the relevance of all input tokens, Transformers can learn global and long-range dependencies more effectively.

Following the Transformer’s success, Chen et al. proposed TransUNet [8], introducing a CNN-Transformer hybrid model to the field of medical image segmentation. The overall architecture of TransUNet is based on U-Net [1] and exploits the Transformer’s sequence-to-sequence self-attention mechanism. This compensates for the inherent locality of the CNN’s convolution operation, while the convolutional features ease the Transformer’s difficulty in accurately extracting low-resolution and multi-scale features. TransUNet and its followers [9], [10], [11] first obtain a feature map by convolution and then encode the resulting feature map into patches, with the Transformer enhancing global connections. However, this naïve partitioning has its limitations: hard segmentation can damage the more semantically related parts of the image, reducing the information carried by the input patches, as shown in Fig. 1(a), and turning the whole task into a problem of recognizing a part of an image. In contrast, in Fig. 1(b), the sampling positions are updated by a progressive sampling module to avoid damaging organ integrity.

FIGURE 1.

Comparison of the naïve tokenization scheme of ViT [7] with the progressive sampling module of the MPSHT presented in this study. (a) The naïve segmentation scheme divides the input image into a series of patches fed into the Transformer. (b) MPSHT, as designed in this paper, mitigates the damage to image integrity caused by hard segmentation through iterative updates of the sampling locations and progressive sampling. *N denotes performing N sampling iterations.

Specifically, our contributions can be summarized as follows:

  • (1)

    A CNN-Transformer hybrid model is used in this study. It makes up for the fact that a CNN’s convolution kernel attends only to each position and its immediate neighborhood and does not integrate features over large regions well; conversely, it makes up for the Transformer’s neglect of correlations between samples and its difficulty in extracting low-resolution, multi-scale features.

  • (2)

    This paper adds a progressive sampling module. Most CNN-Transformer hybrid models (such as TransUNet [8], TransFuse [9], Medical Transformer [12], TransClaw U-Net [10], and TransAttUnet [11]) do not consider that the naïve partitioning method can destroy image structure and admit too much interfering information. The progressive sampling module we add implements an iterative, progressive sampling strategy that adaptively learns the locations of interest, which alleviates the above problems.

II. Related Work

A. CNN-Based Methods

Early medical image segmentation used traditional machine learning and algorithms based on organ contours [13], [14]. Convolutional neural networks were then widely adopted for their robustness and adaptability, and the CNN has now replaced earlier traditional machine learning methods as the de facto standard for medical image segmentation. Since U-Net was introduced in 2015 and demonstrated a remarkable ability to perform medical image segmentation tasks, researchers have proposed further U-shaped networks, including ResNet34-UNet [3], Dense-UNet [15], AttnUNet [16], and UNet++ [4], as well as 3D U-shaped networks for medical image segmentation such as 3D-UNet [17] and V-Net [18]. The CNN, together with the networks developed on its foundation, has become the backbone of medical image segmentation thanks to its superior representational power.

B. Vision Transformer

The Transformer architecture was originally proposed for sequence modeling; characterized by a strong ability to model global relationships, it became the backbone of natural language processing (NLP) tasks and achieved state-of-the-art results [19].

Because of the Transformer’s success on NLP tasks, many researchers have tried to apply its attention mechanism to the visual domain. Recently, ViT [7] applied the attention mechanism to images and achieved good results. However, ViT has inherent limitations: its strong global modeling ability comes at the expense of computation, which grows with the token sequence length, so ViT needs to be pre-trained on large datasets. To alleviate ViT’s dependence on long pre-training, DeiT [20] presents several training strategies that enable a ViT to be trained effectively on mid-sized datasets.

C. CNN-Transformer Hybrid Model

Recently, researchers have further improved network performance by combining traditional convolutional neural networks with self-attention mechanisms [21].

In [12], the traditional U-shaped skip connection is replaced by a skip connection fused with additive attention gates, and medical images are segmented on this basis; CNNs, however, remain the primary tool. More researchers are now trying to integrate CNN-Transformer hybrid models more tightly in the hope of breaking the CNN’s dominance in the field of medical image segmentation [8]. In [8], [12], [22], the researchers kept the CNN’s encoder-decoder structure and added a Transformer to the encoder, improving on the respective limitations of the CNN and the Transformer and producing powerful encoders for two-dimensional medical image segmentation. However, deep learning with a single structure has drawbacks that make it difficult to achieve good results on complex problems, so how to improve accuracy through the complementary use of multiple network structures has become an important research topic. Similarly to [2], [9], [12], we exploit the CNN-Transformer hybrid model to further enhance the segmentation capability of the network. At present, CNN-Transformer hybrid models are widely used in multimodal brain-tumor segmentation [23] and in two- and three-dimensional medical image segmentation [2], [24].

D. Progressive Sampling

The Transformer shines in many fields; however, due to its inherent limitations, ViT requires pre-training on large datasets, and when input information is insufficient, image segmentation becomes difficult. In medical image segmentation, naïve partitioning can disrupt image structure and draw attention to meaningless interference signals, making organs hard to delineate. Inspired by ViT [7], we add a progressive sampling strategy that avoids the structural damage caused by ViT’s hard segmentation of an image while, like the human visual system, focusing attention on the regions of interest to reduce the interference of useless information with the segmented regions. We take full advantage of the progressive sampling module to update the sampling locations: rather than sampling at fixed positions, the module samples iteratively, and in each iteration the current sampling locations are updated with the offset vectors of the previous iteration. Using the Transformer’s strength at learning global information and combining local context with the current tokens’ locations, the regions of interest are learned. Focusing more on the regions belonging to the organ reduces the distraction of uninteresting background information compared with traditional attention. Progressive sampling (PS-ViT) is similar to hard visual attention [25], [26], [27], [28] but has its own advantages. First, unlike the hard visual attention presented earlier, the progressive sampling module is differentiable, which makes it easy to train end-to-end. Second, progressive sampling avoids the underperformance on complex datasets that the reinforcement learning (RL) approach suffers during training. The progressive sampling module is also related to deformable convolution [29], [30] and deformable attention mechanisms [31], but it is distinguished from them by sampling at the pixel level, making predictions based solely on the location information of the input image, and adopting a new progressive sampling strategy.

The MPSHT structure proposed in this paper adds a progressive sampling module to the traditional CNN-Transformer hybrid model to mitigate the inaccuracy caused by ViT’s traditional hard segmentation of images. Moreover, the model plays a significant role in constructing global semantic connections and further optimizing the extracted features. It is intended to provide a new benchmark for CNN-Transformer models in medical image segmentation.

III. Method

The traditional U-Net [1] has no attention mechanism, whereas our architecture introduces attention through the Transformer; however, the Transformer’s traditional naïve partitioning of the image does not take into account the integrity of important elements and organs in the picture. To make the network segment organs as completely as possible and to mitigate damage to organ integrity, we add a progressive sampling module [32] to the network, iterating so that the sampled locations converge as much as possible to the important regions of the image.

We first describe the overall MPSHT architecture in Section III-A. This is followed by a detailed implementation of the progressive sampling module in Section III-B. The CNN-Transformer hybrid model applied directly as the encoder is then described in Section III-C. Finally, in Section III-D, we introduce our loss function.

A. Overall Architecture

This section describes the overall structure of MPSHT, as shown in Fig. 2. The network is based on an encoder-decoder structure. Typically, the shallower layers contain more low-level information, which is conducive to the generation of segmentation masks. Therefore, in the decoder of this network, feature information from the shallower convolution layers is fused into the image via U-Net-style skip-connections, which preserves the integrity of the low-level features. Compared with convolution operations, the Transformer can enhance network performance but increases computation significantly. To reduce the computational cost, we still use convolution for feature extraction and generate a feature map of the input image at the top of the encoder. The feature map obtained by the convolution operation is first divided into 16 patches of the same size, following the hard segmentation in ViT, and the original image and feature map are then passed into the progressive sampling module, which iterates the sampling locations; the number of iterations in this paper is four. The progressive sampling module appends a classification token to its final output tokens and feeds them through the Transformer layer iterations, then passes the resulting feature map to the decoder. In the decoder, the feature map obtained by the encoder is reshaped by a 1×1 convolution that reduces the number of channels to the number of classes, the original resolution is recovered by bilinear upsampling, and the loss of low-level features is compensated by the skip-connections; a minimal sketch of one such decoder stage follows Fig. 2. Finally, the prediction of the segmentation result is completed.

FIGURE 2.

The overall structure of MPSHT.
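One decoder stage of this kind can be sketched as follows. This is a minimal illustration under our reading of Fig. 2; the module name, channel sizes, and the 3×3 refinement convolution are assumptions, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """One decoder stage: 2x bilinear upsampling, fusion of a skip feature,
    refinement by convolution. A sketch, not the released implementation."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip=None):
        # Bilinear 2x upsampling recovers resolution lost in the encoder.
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        if skip is not None:
            # Skip connection re-injects high-resolution low-level features.
            x = torch.cat([x, skip], dim=1)
        return self.conv(x)
```

Cascading four such stages would match the four 2× upsampling modules mentioned in the implementation details of Section IV-B.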

B. Progressive Sampling Module

In ViT-B/16, for example, the input images are linearly divided into a series of 16 × 16 patches, without considering that such hard segmentation compromises the integrity of vital organs in CT images. To enable the network to segment organs as accurately as possible and to mitigate the damage to organ integrity caused by hard splitting, we introduce a local progressive sampling module, which aims to ensure that highly relevant regions of an organ fall into the same patch as far as possible.

As shown in Fig. 3, the feature map $F \in \mathbb{R}^{C \times H \times W}$ produced by the preceding convolutional network has resolution $H \times W$ and $C$ channels, while the iterated progressive sampling module outputs a sequence of tokens $T_N \in \mathbb{R}^{n \times C}$, where $n$ is the number of tokens generated for a single image.

FIGURE 3.

The implementation of the progressive sampling module. First, the feature map obtained from the preceding network is fed in as $F$, and sampling at the initial positions $p_1$ yields $T'_1$. The positions $p_t$ are passed through the position embedding to get $E_t$, which is added to the sampled tokens $T'_t$ and the output tokens $T_{t-1}$ of the previous iteration; the prediction then yields the tokens $T_t$ of this iteration. From $T_t$, a fully-connected layer generates an offset $o_t$ that is added to $p_t$ to obtain the sampling locations $p_{t+1}$ for the next iteration. This module iterates $N$ times.

As shown in Fig. 3, in each iteration the sampling locations for the next iteration are obtained by adding the predicted offsets to the current sampling locations $p_t$:

$$p_{t+1} = p_t + o_t, \quad t \in \{1, \dots, N-1\},$$

where $p_t$ and $o_t$ denote the predicted sampling positions and offsets at iteration $t$.

The tokens at iteration $t$ are sampled from the feature map at the current positions:

$$T'_t = F(p_t), \quad t \in \{1, \dots, N\},$$

where $T'_t$ denotes the tokens sampled from the feature map $F$ at iteration $t$. Since sampling is performed by bilinear interpolation, the elements of $p_t$ may take fractional values, and the operation is differentiable with respect to both the feature map $F$ and the sampling positions $p_t$.

The tokens sampled at the current iteration, the output tokens of the previous iteration, and the positional encoding of the current sampling locations are added together and fed into the Transformer encoder layer to obtain the output tokens of this iteration:

$$T_t = \mathrm{MSA}\left(T'_t \oplus T_{t-1} \oplus E_t\right), \quad E_t = W_p\, p_t,$$

where $W_p$ projects the sampled locations $p_t$ onto the position-encoding matrix $E_t$, $\oplus$ denotes element-wise addition, and $\mathrm{MSA}$ is the Multi-head Self-Attention module. Because the sampling positions are no longer equidistant once the progressive sampling module is applied, the coordinates of the sampled locations are mapped into an embedding space and used as the position embedding. Except in the last iteration, the offsets used to predict the next sampling locations are produced by the fully-connected (FC) layer:

$$o_t = \mathrm{FC}(T_t), \quad t \in \{1, \dots, N-1\}.$$

Iteratively updating the sampling locations gradually concentrates attention on the regions of interest in the image and enables more precise segmentation of organ shapes and boundaries.
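The loop defined by the three equations above can be sketched in PyTorch roughly as follows. This is a hedged illustration: the regular-grid initialization, the single encoder layer, and all dimensions are assumptions, and `F.grid_sample` stands in for the bilinear sampling $F(p_t)$ (it expects coordinates normalized to $[-1, 1]$).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveSampling(nn.Module):
    """Sketch of the progressive sampling loop (PS-ViT style [32]).
    Assumes `dim` equals the channel count C of the input feature map."""
    def __init__(self, dim, n=16, num_iters=4):
        super().__init__()
        self.num_iters = num_iters
        self.pos_embed = nn.Linear(2, dim)    # maps (x, y) to an embedding E_t
        self.offset_head = nn.Linear(dim, 2)  # FC layer predicting o_t from T_t
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        g = int(n ** 0.5)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, g),
                                torch.linspace(-1, 1, g), indexing="ij")
        # p_1: a regular grid in normalized [-1, 1] coordinates.
        self.register_buffer("p_init", torch.stack([xs, ys], -1).view(1, n, 2))

    def forward(self, feat):                  # feat: (B, C, H, W)
        B, C = feat.size(0), feat.size(1)
        p = self.p_init.expand(B, -1, -1)     # current sampling positions p_t
        tokens = torch.zeros(B, p.size(1), C, device=feat.device)
        for t in range(self.num_iters):
            # T'_t = F(p_t): bilinear sampling keeps positions differentiable.
            s = F.grid_sample(feat, p.unsqueeze(2), align_corners=False)
            s = s.squeeze(-1).transpose(1, 2)           # (B, n, C)
            # T_t = MSA(T'_t + T_{t-1} + E_t)
            tokens = self.encoder(s + tokens + self.pos_embed(p))
            if t < self.num_iters - 1:
                # p_{t+1} = p_t + o_t (offsets skipped in the last iteration)
                p = p + self.offset_head(tokens)
        return tokens
```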

C. CNN-Transformer Hybrid Model as Encoder

First, our model uses neither a pure CNN nor a pure Transformer as the encoder, but a CNN-Transformer hybrid. The hybrid compensates for the fact that the limited receptive field of a CNN cannot capture global information well, and that a Transformer does not extract local information well. Combining the two lets the CNN extract local information effectively, while the Transformer, thanks to its self-attention structure, has a natural advantage in extracting global information, as described in more detail below.

The input is an image $x \in \mathbb{R}^{H \times W \times C}$ with spatial resolution $H \times W$ and $C$ channels. Using the convolutional stem and the first three residual stages of ResNet-50 [33] as the feature extraction module for the input image, we feed the resulting feature map into the progressive sampling module; the output tokens of the last iteration are concatenated with a classification token and refined as follows:

$$T_{out} = \mathcal{E}\left([t_{cls};\, T_N]\right),$$

where $\mathcal{E}$ denotes the Transformer encoder module. Since the sampling locations are already embedded into $T_N$ inside the progressive sampling module, no additional position embedding is needed. Finally, the classification token refined through the iterations of the Transformer encoder module is used to predict the final image categories.
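A rough sketch of this hybrid encoder, assuming the torchvision ResNet-50 and a standard Transformer encoder stack; the truncation point after `layer3` and the classification-token handling follow our reading of the text, not a released implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridEncoder(nn.Module):
    """CNN part: ResNet-50 stem plus its first three residual stages,
    followed by a Transformer encoder stack. Sizes are illustrative."""
    def __init__(self, dim=1024, depth=12):
        super().__init__()
        r = resnet50(weights="IMAGENET1K_V1")   # ImageNet pre-training
        # Stem + layer1..layer3: a 224x224 input yields a 14x14 map, 1024 ch.
        self.cnn = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                 r.layer1, r.layer2, r.layer3)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x, sampler):
        feat = self.cnn(x)                      # (B, 1024, 14, 14)
        tokens = sampler(feat)                  # progressive sampling module
        cls = self.cls_token.expand(x.size(0), -1, -1)
        # [t_cls; T_N] refined by the Transformer encoder stack.
        return self.transformer(torch.cat([cls, tokens], dim=1))
```

For a 224 × 224 input this pairs with `ProgressiveSampling(dim=1024)` from the sketch above.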

D. Loss Function

To ensure the integrity and detail of the organ after image segmentation, a loss function combining cross entropy loss and Dice loss is used to optimize MPSHT.

The two loss functions are combined because Dice loss is prone to instability during training, especially for small targets, and extreme cases can lead to gradient saturation; on the other hand, Dice loss mitigates the negative impact of foreground-background imbalance in the samples and focuses more on mining the foreground regions. Cross-entropy loss, by contrast, weights the loss of every pixel equally, and the loss at each point depends only on the distance between the predicted value and the true label. Dice loss alone often does not work well, so we combine it with cross-entropy loss. The loss function is calculated as follows:

$$\mathcal{L} = \lambda\, \mathcal{L}_{CE} + (1 - \lambda)\, \mathcal{L}_{Dice},$$

where $\lambda$ balances the two terms.

Here $\mathcal{L}_{CE}$ measures how close the actual output is to the expected output, while $\mathcal{L}_{Dice}$ measures how similar the two samples are. They are defined as follows:

$$\mathcal{L}_{CE} = -\sum_{i} p(x_i) \log q(x_i), \qquad \mathcal{L}_{Dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}.$$

$|X|$ and $|Y|$ denote the numbers of elements in the two sets, and $X \cap Y$ is the intersection of $X$ and $Y$. The probability distribution $p$ is the expected output and the probability distribution $q$ is the actual output.
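A minimal sketch of this combined objective follows. The equal 0.5/0.5 weighting ($\lambda = 0.5$) and the smoothing constant are assumptions; the paper states the combination but not these constants.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceCELoss(nn.Module):
    """Cross-entropy + Dice loss, equally weighted (assumed lambda = 0.5)."""
    def __init__(self, num_classes, smooth=1e-5):
        super().__init__()
        self.num_classes = num_classes
        self.smooth = smooth

    def forward(self, logits, target):  # logits: (B,K,H,W); target: (B,H,W) long
        ce = F.cross_entropy(logits, target)
        probs = torch.softmax(logits, dim=1)
        onehot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        # Dice = 2|X ∩ Y| / (|X| + |Y|), computed per class then averaged.
        inter = (probs * onehot).sum(dim=(0, 2, 3))
        union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
        dice = (2 * inter + self.smooth) / (union + self.smooth)
        return 0.5 * ce + 0.5 * (1 - dice.mean())
```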

IV. Experiments

A. Dataset

Synapse (Synapse multi-organ segmentation dataset)1: The dataset includes scans of 30 abdominal clinical CT cases. Each case consists of 85 to 198 images of $512 \times 512$ pixels. Following the splits in [8] and [34], 18 of these samples were randomly selected as the training set, while the remaining 12 samples were used as the test set. Each abdominal clinical CT case is annotated with eight abdominal organs (aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, stomach). The Synapse dataset is evaluated in this paper using the Dice Similarity Coefficient (DSC) and the 95th-percentile Hausdorff distance (HD95).

ACDC (Automated Cardiac Diagnosis Challenge): The ACDC challenge provides a public MRI dataset for heart segmentation, obtained by scanning different patients with two MRI scanners of different magnetic field strengths; each patient’s left ventricle (LV), right ventricle (RV), and myocardium (Myo) are manually labeled. We randomly selected 70 cases of the ACDC dataset as training samples, 10 as validation samples, and the remaining 20 as test samples.

B. Implementation Details

For MPSHT network training, the PyTorch deep learning framework was used on the Synapse multi-organ dataset and the ACDC dataset, with simple data augmentation (e.g., random rotation and flipping). The CNN-Transformer hybrid model combines ResNet-50 and ViT: the ResNet-50 convolutional stem and its first three residual stages serve as the feature extraction module for the input image. The resulting feature map is then passed into the progressive sampling module and iterated four times. Finally, the position-aware tokens produced by the progressive sampling module are concatenated with a classification token and refined through 12 Transformer layers. Both the ResNet-50 and the Transformer backbone were pre-trained on ImageNet [35]. The input image resolution was set to 224 × 224 and the patch size to 16 × 16. Under this design, the decoder needs to cascade four 2× upsampling modules to restore the image to full resolution. The model uses the stochastic gradient descent algorithm [36] with an initial learning rate of 0.01 and a momentum of 0.9; the weight decay is set to 1e-4. The number of training epochs was set to 300, and the parameter selections for this experiment are summarized in Table 1.

TABLE 1. Hyperparametric Settings.

Epochs 300
Optimizer SGD
Patch size 16
Learning rate 0.01
Weight decay 1e-4

For fairness, all comparison models were trained with the same parameter settings. During training, the model with the best performance on the validation set is kept as the final model. All trials were performed on an NVIDIA Titan RTX.
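Under the hyperparameters of Table 1, the training setup can be reproduced roughly as follows; `model`, `criterion`, and `train_loader` below are tiny stand-ins for the MPSHT network, the CE + Dice loss above, and the real data pipeline.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 9, 1)            # stand-in network; MPSHT would go here
criterion = nn.CrossEntropyLoss()     # stand-in for the combined CE + Dice loss
train_loader = [(torch.randn(2, 1, 224, 224),
                 torch.zeros(2, 224, 224, dtype=torch.long))]

# Table 1 settings: SGD, lr 0.01, momentum 0.9, weight decay 1e-4, 300 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
for epoch in range(300):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```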

Inspired by [17] and [32], we sliced all 3D volumes layer by layer along the z-axis; at prediction time, the 2D slice predictions are stacked in their original order and reconstructed into a 3D volume for evaluation.
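A minimal sketch of this slice-and-restack protocol, assuming volumes are stored as NumPy arrays with the z-axis first:

```python
import numpy as np

def volume_to_slices(volume):
    """Split a 3D volume (D, H, W) into 2D slices along the z-axis."""
    return [volume[z] for z in range(volume.shape[0])]

def slices_to_volume(pred_slices):
    """Restack per-slice predictions, in their original order, into a
    3D volume for evaluation."""
    return np.stack(pred_slices, axis=0)

# Example: 100 slices in, 100 slice predictions out, restacked for 3D metrics.
vol = np.zeros((100, 512, 512), dtype=np.float32)
preds = [s for s in volume_to_slices(vol)]  # replace with per-slice inference
assert slices_to_volume(preds).shape == vol.shape
```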

C. Comparison With Other Advanced Techniques on the Synapse Dataset

Experiments were performed primarily on the preprocessed Synapse multi-organ segmentation dataset, comparing our proposed MPSHT with four previously proposed advanced models: 1) SwinUNet [34]; 2) TransUNet [8]; 3) AttnUNet [16]; 4) UNet [1]. TABLE 2 shows the results obtained by each model on the test set. The data clearly show that our MPSHT performs best on the evaluation metrics, with a DSC of 79.76% and an HD95 of 21.55 mm. This is a clear improvement over previous advanced methods: every other network scores at most 79.13% in DSC and no lower than 21.55 mm in HD95. Compared to the recently published SwinUNet, our model is 0.63% more accurate in DSC, and this improvement demonstrates the role of the progressive sampling module in segmenting organ-margin detail. The experimental data also show that our model achieves the best segmentation metrics on Kidney (R), Pancreas, and Stomach, demonstrating the real-world value of the proposed model.

TABLE 2. Experimental Results on the Synapse Multi-Organ Segmentation Dataset. (Results of DSC Experiments for Each Different Organ are Shown).

Method DSC (%) HD95(mm) Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach
V-Net 68.81 - 75.34 51.87 77.10 80.75 87.84 40.50 80.56 56.98
DARR 69.77 - 74.74 53.77 72.31 73.24 94.08 54.18 89.90 45.96
R50 UNet 74.68 36.87 84.18 62.84 79.16 71.29 93.35 48.23 84.41 73.92
R50 AttnUNet 75.57 36.97 55.92 63.91 79.20 72.71 93.56 49.37 87.19 74.95
UNet 76.85 39.70 89.07 69.72 77.77 68.60 93.43 53.98 86.67 75.58
AttnUNet 77.77 36.02 89.55 68.88 77.98 71.11 93.57 58.04 87.30 75.75
R50 ViT 71.29 32.87 73.73 55.13 75.80 72.20 91.51 45.99 81.99 73.95
ViT 61.50 39.61 44.38 39.59 67.46 62.94 89.21 43.14 75.45 69.78
TransUNet 77.48 31.69 87.23 63.16 81.87 77.02 94.08 55.86 85.08 75.62
MT-UNet 78.59 26.59 87.92 64.99 81.47 77.29 93.06 59.46 87.75 76.82
SwinUNet 79.13 21.55 85.47 66.53 83.28 79.61 94.29 56.58 90.66 76.60
Ours 79.76 21.55 87.74 65.36 80.55 80.77 93.64 59.84 89.46 81.13

Fig. 4 visualizes the segmentation results of our proposed network and the comparison models on the Synapse multi-organ segmentation dataset. For (e) AttnUNet and (f) UNet, over-segmentation of organs is evident, first and foremost because of the CNN’s inherent limitation: it cannot effectively learn the interaction of global and long-distance semantic information. Compared to (d) TransUNet and (c) SwinUNet, our proposed model refines organ margins better, especially when the organ occupies a small area of the CT image; for example, our MPSHT reaches 81.13% on the Stomach index, exceeding the next best result by 4.31%. These results reaffirm our initial view: although our proposed MPSHT is, like TransUNet and SwinUNet, a Transformer-based segmentation model, our network minimizes damage to organ integrity when partitioning images, through the progressive sampling module that places highly relevant regions of an organ in the same patch as far as possible, an operation that allows our network to segment smaller organs with greater accuracy.

FIGURE 4.

The effects of the different models are compared by visualization. From left to right: (a) Ground Truth, (b) MPSHT, (c) SwinUNet, (d) TransUNet, (e) AttnUNet, (f) UNet.

D. Experimental Results of ACDC Dataset

To demonstrate the generalizability of our MPSHT across different datasets, we trained the proposed MPSHT on the ACDC dataset and performed organ segmentation testing on it. The experimental results are shown in TABLE 3; all evaluation metrics surpass those of the comparison networks. This shows that our MPSHT also performs well on a different type of image input, with a DSC of 91.80%. The results indicate that the proposed method has not only excellent generalization ability but also robustness.

TABLE 3. Experimental Results on the ACDC Dataset.

Method DSC (%) RV Myo LV
R50 UNet 87.60 84.62 84.52 93.68
R50 AttnUNet 86.90 83.27 84.33 93.53
ViT-CUP 83.41 80.93 78.12 91.17
R50 ViT 86.19 82.51 83.01 93.05
TransUNet 89.71 86.67 87.27 95.18
MT-UNet 90.43 86.64 89.04 95.62
SwinUNet 90.00 88.55 85.62 95.83
Ours 91.80 90.13 89.34 95.92

E. Ablation Experiments

To test and validate the performance of our proposed MPSHT model under different input parameters, we designed several ablation experiments, covering the number of skip-connections and the patch size, and finally validated the role of the progressive sampling module in the network.

1). The Number of Skip-Connections

Since the network presented in this paper uses the CNN-Transformer hybrid model, skip-connections can enhance the fine segmentation of organ details by fusing feature information from the shallow convolution layers into the image. In the MPSHT presented here, skip-connections are added at the 1/2, 1/4, and 1/8 scales of the original image resolution, as spatial-domain information is very important for segmentation tasks. The pooling layers in the encoder reduce the resolution of the feature map, which is unfavorable for accurate mask segmentation, whereas shallow features can be transmitted through the skip-connections. These high-resolution shallow features contain a lot of low-level information, and the upsampling path needs the skip-connections to provide sufficient auxiliary information to facilitate segmentation of the mask. We set the number of skip-connections to 0, 2, and 3; the resulting metrics show the extent to which different numbers of skip-connections affect performance, as shown in Fig. 5, where the organ-segmentation DSC varies with the number of skip-connections. It is clear from the bar chart that the more skip-connections there are, the better the accuracy of organ segmentation. DSC scores for smaller organs such as the gallbladder, kidneys, and pancreas were more strongly correlated with the number of skip-connections than those for larger organs such as the liver, spleen, and stomach. The reason for this gap is that the skip-connections transmit richer low-level information to the decoder, and this interplay gives the model better performance, allowing the network to recover more segmentation detail. In the time dimension, skip-connections can also reduce computation and improve efficiency. Visualization details are shown in Fig. 6. Therefore, in this paper we set the number of skip-connections to three.

FIGURE 5.

Comparison of ablation experiments for the effect of number of skip connections on multi-organ DSC (%).

FIGURE 6.

For visualization of the number of skip connections on multi-organ segmentation for experimental comparison. Where (a) stands for Ground Truth. (b) indicates that the number of skip connections is set to 0. (c) indicates that the number of skip connections is set to 2. (d) indicates that the number of skip connections is set to 3.

Patch size impact. For the training model, the patch size is directly related to training efficiency and output performance. To test how the patch size influences the trained model, we set it to $16 \times 16$, $24 \times 24$, and $32 \times 32$; the results are shown in TABLE 4. Since the sequence length varies with the patch size, a patch size that is too small makes the sequence too long, greatly increasing the parameters to be learned and significantly increasing computation time and resource requirements. Conversely, if the patch size is set too high, the effective sequence is shortened and the model cannot learn global features well from it. It is therefore important to choose the right patch size, and we adopt an appropriate range to ensure that the algorithm converges to a good solution. As can be seen from TABLE 4, the DSC was 79.76% for $16 \times 16$ patches, 80.04% for $24 \times 24$ patches, and 80.17% for $32 \times 32$ patches. The DSC growth is thus extremely weak, while the increase in computation and memory occupancy is many times greater. Therefore, this paper defaults to $16 \times 16$ patches; a short token-count check follows Fig. 7. For a more detailed comparison, see Fig. 7: as the patch size grows, the encoder learns less and less effective information during training, and the larger patch sizes over-segment organs or cause segmentation errors.

TABLE 4. Comparison of Ablation Experiments for Patch Size.
Patch Size DSC (%) Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach
16×16 79.76 87.74 65.36 80.55 80.77 93.64 59.84 89.46 81.13
24×24 80.04 88.80 66.54 80.84 81.92 95.89 60.51 84.95 82.63
32×32 80.17 88.58 66.32 79.73 81.03 94.21 60.23 89.83 81.85
FIGURE 7.

Visualization comparison of the effect of patch size on organ-segmentation accuracy. (a) Ground Truth. (b) Patch size set to $16 \times 16$. (c) Patch size set to $24 \times 24$. (d) Patch size set to $32 \times 32$.
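As a worked check on the sequence-length argument in the patch-size discussion above: the token count for a 224 × 224 input scales with the inverse square of the patch size. Note that 224 is not divisible by 24, so that setting presumably pads or resizes the input; the paper does not specify which.

```python
import math

# Token count for a 224x224 input at each tested patch size.
for p in (16, 24, 32):
    n = math.ceil(224 / p) ** 2
    print(f"patch {p}x{p}: {n} tokens")
# -> 196, 100, and 49 tokens respectively
```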

2). Effects of Progressive Sampling Modules

To verify that the progressive sampling module added in this paper functions well in medical image segmentation tasks, we built the same network without the progressive sampling module and compared it with the MPSHT presented in this paper. As TABLE 5 shows, the metrics make clear that the progressive sampling module improves medical image segmentation. This is because traditional hard segmentation may destroy the integrity of vital organs in medical images, so that the model can only recognize an organ through particular feature points. With the progressive sampling module added, attention is focused on the organs through iteration, and damage to organ integrity is avoided as much as possible. Of note are the 4.38% and 5.51% improvements in the DSC scores for Spleen and Stomach, respectively, after adding the progressive sampling module. The visual results of removing the progressive sampling module in Fig. 8 show organs being misclassified or missing. The precision gains from the progressive sampling module are most significant for medium and large organs such as the liver, spleen, and stomach.

TABLE 5. Ablation Comparison of the Effect of the Progressive Sampling Module.
Method DSC (%) Aorta Gallbladder Kidney(L) Kidney(R) Liver Pancreas Spleen Stomach
Without progressive sampling 77.84 87.23 63.13 81.87 77.92 94.08 55.86 85.08 75.62
With progressive sampling 79.76 87.74 65.36 80.55 80.77 93.64 59.84 89.46 81.13
FIGURE 8.

Visual comparison of the progressive sampling module for organ segmentation. (a) Ground Truth. (b) Network with the progressive sampling module removed. (c) Network with the progressive sampling module.

V. Discussion

It is well known that how a model is pre-trained can strongly affect its effectiveness. This work is not optimal in this respect, because we directly use CNN and Transformer backbones pre-trained on ImageNet to initialize the network encoder and decoder.

In addition, because of the patch size, the progressive sampling module in this paper works better for segmenting smaller objects; for larger objects, it can still damage object integrity and does not achieve the ideal effect. Finally, the images input in this paper are 2D images from preprocessed datasets, whereas most medical images today are 3D. Therefore, MPSHT will be applied to 3D medical image segmentation datasets in the future.

VI. Conclusion

In this paper, we propose MPSHT, a CNN-Transformer-based design, in order to give full expression to the respective advantages of the CNN and the Transformer; by adding a progressive sampling module it makes the segmentation of organs more accurate, compensating for the shortcomings of previous CNN-Transformer hybrid models. Extensive experiments have been conducted on the Synapse multi-organ CT dataset (Synapse) and the Automated Cardiac Diagnosis Challenge dataset (ACDC), and the results show that the proposed model is robust and generalizes well.

Funding Statement

This work was supported in part by the National Natural Science Foundation of China under Grant 61772319, Grant 62002200, Grant 62202268, and Grant 62272281; in part by the Shandong Provincial Science and Technology Support Program of Youth Innovation Team in Colleges under Grant 2021KJ069 and Grant 2019KJN042; and in part by the Yantai Science and Technology Innovation Development Plan under Grant 2022JCYJ031.

Footnotes

1

The datasets we used are provided by TransUNet’s authors. Please send an Email to jienengchen01 AT gmail.com to request the preprocessed data.

References

  • [1].Ronneberger O., Fischer P., and Brox T., “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Munich, Germany: Springer, 2015, pp. 234–241. [Google Scholar]
  • [2].Long J., Shelhamer E., and Darrell T., “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440. [DOI] [PubMed] [Google Scholar]
  • [3].Xiao X., Lian S., Luo Z., and Li S., “Weighted res-UNet for high-quality retina vessel segmentation,” in Proc. 9th Int. Conf. Inf. Technol. Med. Educ. (ITME), Oct. 2018, pp. 327–331. [Google Scholar]
  • [4].Zhou Z., Rahman Siddiquee M. M., Tajbakhsh N., and Liang J., “UNet++: A nested U-Net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Granada, Spain: Springer, 2018, pp. 3–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Huang H. et al., “UNet 3+: A full-scale connected unet for medical image segmentation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 1055–1059. [Google Scholar]
  • [6].Hu J. et al., “S-UNet: A bridge-style U-Net framework with a saliency mechanism for retinal vessel segmentation,” IEEE Access, vol. 7, pp. 174167–174177, 2019. [Google Scholar]
  • [7].Dosovitskiy A. et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” 2020, arXiv:2010.11929. [Google Scholar]
  • [8].Chen J. et al., “TransUNet: Transformers make strong encoders for medical image segmentation,” 2021, arXiv:2102.04306. [Google Scholar]
  • [9].Zhang Y., Liu H., and Hu Q., “Transfuse: Fusing transformers and CNNs for medical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Springer, 2021, pp. 14–24. [Google Scholar]
  • [10].Chang Y., Menghan H., Guangtao Z., and Xiao-Ping Z., “TransClaw U-Net: Claw U-Net with transformers for medical image segmentation,” 2021, arXiv:2107.05188. [Google Scholar]
  • [11].Chen B., Liu Y., Zhang Z., Lu G., and Wai Kin Kong A., “TransAttUnet: Multi-level attention-guided U-Net with transformer for medical image segmentation,” 2021, arXiv:2107.05274. [Google Scholar]
  • [12].Valanarasu J. M. J., Oza P., Hacihaliloglu I., and Patel V. M., “Medical transformer: Gated axial-attention for medical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Strasbourg, France: Springer, 2021, pp. 36–46. [Google Scholar]
  • [13].Tsai A. et al., “A shape-based approach to the segmentation of medical imagery using level sets,” IEEE Trans. Med. Imag., vol. 22, no. 2, pp. 137–154, Feb. 2003. [DOI] [PubMed] [Google Scholar]
  • [14].Held K., Kops E. R., Krause B. J., Wells W. M., III, Kikinis R., and Müller-Gartner H.-W., “Markov random field segmentation of brain MR images,” IEEE Trans. Med. Imag., vol. 16, no. 6, pp. 878–886, Dec. 1997. [DOI] [PubMed] [Google Scholar]
  • [15].Li X., Chen H., Qi X., Dou Q., Fu C.-W., and Heng P.-A., “H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes,” IEEE Trans. Med. Imag., vol. 37, no. 12, pp. 2663–2674, Dec. 2017. [DOI] [PubMed] [Google Scholar]
  • [16].Schlemper J. et al., “Attention gated networks: Learning to leverage salient regions in medical images,” Med. Image Anal., vol. 53, pp. 197–207, Apr. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [17].Çiçek O., Abdulkadir A., Lienkamp S. S., Brox T., and Ronneberger O., “3D U-Net: Learning dense volumetric segmentation from sparse annotation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Istanbul, Turkey: Springer, 2016, pp. 424–432. [Google Scholar]
  • [18].Milletari F., Navab N., and Ahmadi S.-A., “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in Proc. 4th Int. Conf. 3D Vis. (3DV), Oct. 2016, pp. 565–571. [Google Scholar]
  • [19].Devlin J., Chang M.-W., Lee K., and Toutanova K., “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018, arXiv:1810.04805. [Google Scholar]
  • [20].Touvron H., Cord M., Douze M., Massa F., Sablayrolles A., and Jégou H., “Training data-efficient image transformers & distillation through attention,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347–10357. [Google Scholar]
  • [21].Zhao H., Shi J., Qi X., Wang X., and Jia J., “Pyramid scene parsing network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2881–2890. [Google Scholar]
  • [22].Hatamizadeh A. et al., “UNETR: Transformers for 3D medical image segmentation,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., Jan. 2022, pp. 574–584. [Google Scholar]
  • [23].Wang W., Chen C., Ding M., Yu H., Zha S., and Li J., “TransBTS: Multimodal brain tumor segmentation using transformer,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Strasbourg, France: Springer, 2021, pp. 109–119. [Google Scholar]
  • [24].Xie Y., Zhang J., Shen C., and Xia Y., “CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Strasbourg, France: Springer, 2021, pp. 171–180. [Google Scholar]
  • [25].Ba J., Mnih V., and Kavukcuoglu K., “Multiple object recognition with visual attention,” 2014, arXiv:1412.7755. [Google Scholar]
  • [26].Mnih V. et al., “Recurrent models of visual attention,” in Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014, pp. 1–9. [Google Scholar]
  • [27].Elsayed G., Kornblith S., and Le Q. V., “Saccader: Improving accuracy of hard attention models for vision,” in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–13. [Google Scholar]
  • [28].Xu K. et al., “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 2048–2057. [Google Scholar]
  • [29].Dai J. et al., “Deformable convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 764–773. [Google Scholar]
  • [30].Zhu X., Hu H., Lin S., and Dai J., “Deformable ConvNets v2: More deformable, better results,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9308–9316. [Google Scholar]
  • [31].Zhu X., Su W., Lu L., Li B., Wang X., and Dai J., “Deformable DETR: Deformable transformers for end-to-end object detection,” 2020, arXiv:2010.04159. [Google Scholar]
  • [32].Yue X. et al., “Vision transformer with progressive sampling,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2021, pp. 387–396. [Google Scholar]
  • [33].He K., Zhang X., Ren S., and Sun J., “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778. [Google Scholar]
  • [34].Cao H. et al., “Swin-Unet: Unet-like pure transformer for medical image segmentation,” 2021, arXiv:2105.05537. [Google Scholar]
  • [35].Deng J., Dong W., Socher R., Li L.-J., Li K., and Fei-Fei L., “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255. [Google Scholar]
  • [36].Robbins H. and Monro S., “A stochastic approximation method,” Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, 1951. [Google Scholar]
