Abstract.
Purpose
Colon cancer is one of the three most common gastrointestinal cancers, and colon polyps are an important precursor of colon cancer. Early diagnosis and removal of colon polyps can prevent the development of colon cancer. Currently, colon polyp removal is mainly based on colonoscopy assisted by artificial intelligence (AI), with deep-learning technology helping doctors locate and remove colon polyps. With the development of deep learning, using advanced AI to assist medical diagnosis has become mainstream; it makes the most of doctors' diagnostic time and helps them formulate better treatment plans.
Approach
We propose a deep-learning model for segmenting colon polyps. The model adopts a dual-branch structure that combines a convolutional neural network (CNN) with a transformer and, building on ResNet, replaces ordinary convolution with depthwise separable convolution; a strip pooling module is introduced to obtain more effective information. An aggregated attention module (AAM) is proposed for high-dimensional semantic information, effectively fusing the high-dimensional information of the two different structures. Deep supervision and multi-scale training are added during model training to enhance the learning effect and generalization performance of the model.
Results
The experimental results show that the proposed dual-branch structure is significantly better than a single-branch structure and that the model using the AAM achieves a significant performance improvement over the model without it. Our model leads the state-of-the-art models by 1.1% in mIoU and 1.5% in mDice in a fivefold cross-validation on the Kvasir-SEG dataset.
Conclusions
We propose and validate a deep learning model for segmenting colon polyps that uses a dual-branch network structure. Our results demonstrate that traditional CNNs and transformers can complement each other, and we verified the feasibility of fusing the high-dimensional semantics of the two structures while effectively retaining the high-dimensional information of each.
Keywords: medical image analysis, deep learning, neural networks, polyp segmentation
1. Introduction
With the development of society, people's diets and living habits have changed, and the accompanying health problems are becoming increasingly serious. Poor diet and lifestyle habits may lead to colon cancer, and colon polyps are precursors of colon cancer. Identifying colon polyps and removing them surgically is an effective means of preventing colon cancer. Modern medicine uses deep learning technology to assist doctors in identifying and segmenting colon polyps, which improves doctors' efficiency and increases the probability that colon polyps are detected.
In the field of semantic segmentation of colon polyps, the commonly used models are FCN,1 SegNet,2 UNet,3 and variants of UNet such as UNet++4 and ResUNet.5 These models use a convolutional neural network (CNN)-based encoder-decoder structure to learn the input image pixel by pixel. Subsequently, models built on ResNet6 and PSPNet7 with feature pyramid structures8 were proposed; these utilize feature maps from different layers of the model for supervised learning, giving more freedom in the choice of model outputs so that more complex tasks can be handled. However, the local nature of CNN feature extraction prevents this kind of structure from learning global relationships well.
With the proposal of the transformer structure, researchers found that it can realize global attention through the attention mechanism.9 A number of transformer-based models, such as ViT,10 pyramid vision transformer (PVT),11,12 and Swin Transformer,13,14 have been proposed. Except for PVT, these models first require CNN feature extraction to obtain a feature image that is then fed into the network for learning, so they cannot completely escape the limitations of CNNs. PVT, in contrast, takes the original image directly as the input of the network, learns from it directly, and uses the transformer structure independently as the backbone of the model. Although these transformer structures can realize global attention, they lack the learning of local details (Fig. 1).
Fig. 1.
Overall network structure of CMNet. The top is the SPResNet branch, the bottom is the PVT branch, and the center is the upsampling branch.
To solve the problems of the single structures mentioned above, this paper proposes a new colon polyp segmentation model, CMNet, that adopts a dual-branch structure, using SPResNet15 and PVT to form the internal networks, so as to take into account both detail extraction and global feature correlation and to combine the advantages of the different structures to improve the overall performance of the network. During training, the fused images of each layer are deeply supervised16 to assist training and guide model learning. The aggregated attention module (AAM) is designed at the top layer of the model to directly correlate the two structures while performing the attention computation; it makes better use of the high-dimensional semantic information, improves the accuracy of the pixel information in the up-sampling stage, and enhances the model's learning ability.
In summary, the dual-branch colon polyp segmentation network proposed in this paper has the following innovations.
1. A new network structure is formed by combining a traditional CNN with the emerging transformer structure.
2. SPResNet is improved internally by replacing ordinary convolution with depthwise separable convolution, which speeds up network training, and an embedding module is designed for the layer-by-layer linking so that each layer's own information is better retained while the information from the layer above is fused.
3. The AAM is designed; the attention computation carried out in the module correlates the top-layer information of the two pathways, so the high-dimensional information from the two different pathways can better assist the up-sampling of the feature maps.
4. Multi-scale training strengthens the learning ability of the network, adapting it to images of different sizes and to disease regions of different sizes, which reinforces the model.
5. Deep supervision during training effectively guides the learning of the model, so the model converges to stability faster, and it ensures that each layer converges toward the real image.
2. Methods
2.1. SPResNet
The SPResNet structure that we adopt is divided into four layers from bottom to top, and the feature map is reduced by 1/2 between successive layers. ResNet50 is used as the backbone network of SPResNet for feature extraction, depthwise separable convolution replaces the ordinary convolutional layers in each residual block to speed up computation, and strip pooling is introduced at the end of the module to obtain more image information. Traditional global average pooling and max pooling operations rely only on a local window to obtain the image information within that window; although the information inside the window can be extracted, the relationship with neighboring pixels is not considered. The operational logic of the strip pooling module is similar to self-attention9 and non-local attention;17 the difference is that strip pooling replaces the original square pooling region with a strip-shaped region, i.e., the pooling kernel becomes 1 × N or N × 1. This effectively allows the backbone to capture long-distance dependencies and, because the pooling kernel is long and narrow, it is conducive to capturing contextual relationships while preventing interference from irrelevant regions. The module consists of two pooling parts, vertical strip pooling and horizontal strip pooling, which are summed to obtain the long-range image information as
(1) $y_i^h = \frac{1}{W}\sum_{0 \le j < W} x_{i,j}$
(2) $y_j^v = \frac{1}{H}\sum_{0 \le i < H} x_{i,j}$
Equation (1) denotes the horizontal pooling operation, which is transformed into an average pooling operation for rows. Similarly, Eq. (2) represents the vertical pooling operation, which is transformed into an average pooling operation for columns. This approach effectively mitigates the impact of irrelevant information by performing pooling operations in different directions.
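For illustration, a minimal PyTorch sketch of such a strip pooling block follows; the channel configuration and the sigmoid gating are assumptions in the spirit of Hou et al.,15 not the exact SPResNet layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripPooling(nn.Module):
    """Illustrative strip pooling block: directional average pooling along
    rows and columns, refinement with narrow convolutions, and a gating of
    the input with the summed long-range cues."""
    def __init__(self, c):
        super().__init__()
        self.conv_h = nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0), bias=False)
        self.conv_w = nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1), bias=False)
        self.fuse = nn.Conv2d(c, c, kernel_size=1, bias=False)

    def forward(self, x):
        _, _, h, w = x.shape
        # Eq. (1): average each row -> an (H, 1) strip, refine, expand back.
        row = self.conv_h(F.adaptive_avg_pool2d(x, (h, 1))).expand(-1, -1, -1, w)
        # Eq. (2): average each column -> a (1, W) strip, refine, expand back.
        col = self.conv_w(F.adaptive_avg_pool2d(x, (1, w))).expand(-1, -1, h, -1)
        # Sum the two directional maps and gate the input with them,
        # injecting long-range context into every pixel.
        return x * torch.sigmoid(self.fuse(F.relu(row + col)))
```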
2.2. PVT
The PVT structure11,12 also uses the feature pyramid as its main feature extraction structure and adopts PVTv212 as the backbone network, using weights pre-trained on ImageNet.18 Compared with other transformer structures that require the original image to be processed by a CNN to form a feature map, PVT can take the original image directly as input and convert it into the patches required by the transformer encoder through patch embedding and position encoding. The overall structure is divided into four stages, each with two units, namely, patch embedding and position encoding. Patch embedding divides the entire original image or the extracted feature map into patches of the same size. Position encoding adds a positional code to each patch to represent its position in the entire image, helping the network learn the relationships between patches. The position encoding is learnable: the positions of the patches are learned during training through initialized learnable parameters. The transformer encoder serializes the input patch sequence through the standard transformer encoder structure and performs a linear mapping. After the first stage, the feature map is 1/4 of the original image size, and each subsequent stage halves it, so the last stage outputs a feature map at 1/32 of the original resolution.
Because the PVT network downsamples the feature map after each stage, scaling it to 1/2 of its previous size, the patch embedding and position encoding learned in the previous stage are invalid for the new feature map. Therefore, each stage requires new patch embedding and position encoding to ensure correct block partitioning and position encoding of the new feature map. The transformer inside PVT uses a multi-head attention module to compute attention for each input patch, calculate the correlations between different patches, and obtain the global feature relationships of the entire image.
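The patch embedding and learnable position encoding described above can be sketched as follows; the image size, patch size, and embedding dimension are placeholders rather than the PVTv2 configuration.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Illustrative patch embedding with a learnable position encoding."""
    def __init__(self, img_size=352, patch_size=4, in_ch=3, dim=64):
        super().__init__()
        # Non-overlapping patches via a strided convolution.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        n_patches = (img_size // patch_size) ** 2
        # Learnable position encoding, optimized jointly with the network.
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))

    def forward(self, x):
        x = self.proj(x)                   # (B, dim, H/ps, W/ps)
        b, d, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)   # (B, N, dim) patch sequence
        return x + self.pos, (h, w)
```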
2.3. CMNet
CMNet is the novel dual-branch semantic segmentation network that we propose; it uses SPResNet and PVT as the upper and lower feature extraction pathways, respectively. In the SPResNet branch, each layer contains multiple residual blocks that associate the information between pixels within the convolution kernel during feature extraction, and the strip pooling module increases the receptive field during pooling and acquires long-distance information, enabling the network to extract locally correlated features. The PVT branch extracts features from a global perspective and captures the overall contour and position information of the segmentation regions in the image.
We link the two network structures layer by layer and perform dimensionality reduction before each connection. We found that the information contributing to the segmentation task in high-dimensional images is not significant, and dimensionality reduction saves computational resources. Moreover, the information received from the upstream layers can compensate for the information lost by dimensionality reduction. In the training phase, we adopt deep supervision, computing the loss function for the output of each layer to ensure that each layer's output is closer to the ground truth and to facilitate downstream information fusion, making the received information more accurate. We also found that input images of different sizes affect the results during training because the transformer structure pays different amounts of attention to objects of different sizes. Therefore, we use multi-scale images for training to achieve good learning for objects of different sizes.
The overall structure of CMNet consists of four layers. Except for the top layer, whose output is upsampled after passing through the AAM, each layer reduces the dimensionality of the output feature maps of the two networks, concatenates them, and adds them to the upstream information. The resulting feature map is then processed by the concatenate block (CB), which contains a residual block (RB) to fuse the original feature information with the extracted information and compensate for the information lost during feature extraction. Because our training batch size is small, the CB module embeds group normalization19 and the SiLU activation function.20 Group normalization is more stable and accurate than batch normalization21 when training with small batch sizes. SiLU adds linear weighting on top of ReLU22 and widens the buffer zone near zero, making it smoother when processing negative values and retaining more of the input information.
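A rough sketch of such a concatenate block is given below; the channel numbers and the exact residual layout are illustrative assumptions rather than the CMNet definition.

```python
import torch.nn as nn

class ConcatBlock(nn.Module):
    """Fuse reduced branch features plus upstream information with a
    residual 3x3 convolution stack using GroupNorm and SiLU."""
    def __init__(self, in_ch, out_ch, groups=8):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.body = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.GroupNorm(groups, out_ch),   # stable for small batch sizes
            nn.SiLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.GroupNorm(groups, out_ch),
        )
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        x = self.reduce(x)
        # Residual connection keeps the original fused features (RB-style).
        return self.act(x + self.body(x))
```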
In our experiments, we found that high-dimensional semantic information cannot be used directly to obtain the prediction output needed for the segmentation task: although high-dimensional feature maps contain extremely rich semantic information, the expansion of the receptive field compresses the image into a very small range, making it possible only to judge the position of pixel blocks relative to the original image. However, fusing high-dimensional feature maps with the information carried by low-dimensional feature maps is crucial for judging the segmentation area. Therefore, we designed the AAM, in which we fuse the high-dimensional semantic information of the top layers of the two networks and use attention mechanisms9,17 to obtain the attention points of the two structures with their different characteristics. The AAM structure is shown in Fig. 2. The AAM is expressed as
(3)
(4)
Fig. 2.
Overall structure of the AAM.
Here, the top-layer feature map of SPResNet and the top-layer feature map of PVT are the two inputs of the module, and the matrices corresponding to the query, key, and value of the fused features are denoted as Q, K, and V, respectively.
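A hypothetical sketch of the AAM along these lines is shown below; it fuses the two top-layer maps with a 1 × 1 convolution and applies multi-head self-attention to the fused features, but the exact CMNet design may differ.

```python
import torch
import torch.nn as nn

class AAM(nn.Module):
    """Sketch of an aggregated attention module: concatenate the two
    top-layer maps (assumed to share spatial size), project to a common
    dimension, and apply scaled dot-product attention with Q = K = V
    taken from the fused map."""
    def __init__(self, c_cnn, c_tr, dim=256, heads=8):
        super().__init__()
        self.fuse = nn.Conv2d(c_cnn + c_tr, dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_cnn, x_tr):
        f = self.fuse(torch.cat([x_cnn, x_tr], dim=1))   # (B, dim, H, W)
        b, d, h, w = f.shape
        seq = f.flatten(2).transpose(1, 2)               # (B, HW, dim)
        out, _ = self.attn(seq, seq, seq)                # self-attention on fused map
        out = self.norm(out + seq)                       # residual + norm
        return out.transpose(1, 2).reshape(b, d, h, w)
```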
2.4. Dataset
We used the Kvasir-SEG dataset23 and the CVC-ClinicDB dataset24 to evaluate the performance of our model against state-of-the-art models. The Kvasir-SEG23 colon polyp dataset contains 1000 images, and the CVC-ClinicDB24 colon polyp dataset contains 612 images.
To better evaluate the performance metrics between the models, we set up five evaluation scenarios, which are described below:
Scenario 1: fivefold cross-validation on the CVC-ClinicDB dataset. The dataset is divided into a training set and a test set; the training data are split into five equal parts, with four used for training and one for validation in each training session (a sketch of this splitting procedure is given after the scenario list); after completing the fivefold cross-validation, the model is tested and evaluated on the test set.
Scenario 2: fivefold cross-validation on the Kvasir-SEG dataset, following the same protocol: the dataset is divided into a training set and a test set, the training data are split into five equal parts with four used for training and one for validation in each session, and the model is tested and evaluated on the test set after the fivefold cross-validation.
Scenario 3: the Kvasir-SEG and CVC-ClinicDB datasets are combined into what we call the merged dataset; 10% of the data from each dataset is held out separately, and the remaining data are blended for training with the same fivefold cross-validation procedure, after which the model is tested and evaluated on the held-out test set. This is the data combination proposed in PraNet.25
Scenario 4: the Kvasir-SEG dataset is used as the training set and the CVC-ClinicDB dataset as the test set to test model performance.
Scenario 5: the CVC-ClinicDB dataset is used as the training set and the Kvasir-SEG dataset as the test set to test model performance.
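The fivefold protocol of scenarios 1 to 3 can be sketched as follows; the directory layout is a placeholder, not the actual dataset structure.

```python
from glob import glob
from sklearn.model_selection import KFold

# Placeholder list of training images after holding out the test set.
train_items = sorted(glob("data/train/images/*.jpg"))

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(train_items), start=1):
    train_fold = [train_items[i] for i in train_idx]   # four parts for training
    val_fold = [train_items[i] for i in val_idx]       # one part for validation
    # ... train on train_fold, validate on val_fold, then evaluate the
    # fold's checkpoint on the held-out test set ...
```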
Due to the small size of the colon polyp dataset, we performed stochastic preprocessing on the data, including affine transformations and color contrast enhancement, to increase the variety of the data and alleviate the overfitting problem caused by the limited data volume.
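An example of such stochastic preprocessing is sketched below; the parameter ranges are assumptions, and in a real pipeline the geometric transforms must be applied with identical parameters to each image and its mask.

```python
import torchvision.transforms as T

# Affine transformations plus color/contrast jitter, as described above.
augment = T.Compose([
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])
# Note: this Compose only transforms the image; paired image/mask
# augmentation needs shared transform parameters for the geometric steps.
```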
2.5. Loss
Our model is trained using a joint loss function that incorporates weighted intersection over union (WIoU) and weighted binary cross entropy.26 The loss function takes both the ground truth and predicted results as inputs, with the ground truth encoded using one-hot encoding and integrated into the second dimension. To facilitate the network’s learning, a category-based mask is generated for the entire binarized image. The loss function is formulated as
(5) $\mathcal{L}_{\mathrm{total}} = \sum_{i} \left[ \mathcal{L}_{\mathrm{IoU}}^{w}(G, P_i) + \mathcal{L}_{\mathrm{BCE}}^{w}(G, P_i) \right]$
We use a combination of weighted intersection over union (IoU) loss and weighted binary cross entropy (BCE) loss as our objective function. The ground truth and the predicted image of the $i$'th layer are denoted as $G$ and $P_i$, respectively, where the subscript $i$ indexes the layers of the feature pyramid. We compute the loss for each layer's output and aggregate the per-layer losses into the back-propagation loss.
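A sketch of this objective in PyTorch is given below; the boundary-emphasizing pixel weights follow the public PraNet/F3Net reference implementations and are an assumption rather than a detail stated here.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU loss.
    pred: raw logits (B, 1, H, W); mask: binary ground truth (B, 1, H, W)."""
    # Pixels near object boundaries get larger weights (assumed weighting).
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()
```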
2.6. Evaluations
We use consistent evaluation metrics, namely recall, precision, intersection over union (IoU), and Dice Score, for various models. These metrics are computed based on the confusion matrix, which summarizes the classification prediction results of deep learning models. The equations for evaluation indicators are derived from the confusion matrix and expressed as
(6) $\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$
(7) $\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$
(8) $\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$
(9) $\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$
where TP, FN, FP, and TN denote true positives, false negatives, false positives, and true negatives, respectively.
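These metrics can be computed directly from the pixel-level confusion matrix, as in the following sketch (not the exact evaluation script used here).

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-8):
    """Recall, precision, IoU, and Dice from boolean prediction/ground-truth
    masks of the same shape."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    recall = tp / (tp + fn + eps)
    precision = tp / (tp + fp + eps)
    iou = tp / (tp + fp + fn + eps)
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    return recall, precision, iou, dice
```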
3. Experiments and Results
3.1. Implementation
The CMNet model was implemented in PyTorch and accelerated using an NVIDIA RTX 3090 GPU with 24 GB of memory. To avoid exceeding the GPU memory capacity, we set the batch size to 4. For polyp segmentation, we resized the input images to a fixed resolution and employed a multi-scale training strategy with scales of 0.75, 1.0, and 1.25. We used the AdamW optimizer27 with a fixed initial learning rate.
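A hypothetical training step combining the multi-scale strategy and deep supervision is sketched below; `model` is assumed to return one upsampled prediction per pyramid level, and `structure_loss` is the weighted BCE + weighted IoU loss from Sec. 2.5.

```python
import torch.nn.functional as F

def train_step(model, optimizer, images, masks, scales=(0.75, 1.0, 1.25)):
    """One multi-scale, deeply supervised optimization step (illustrative)."""
    model.train()
    total = 0.0
    for s in scales:
        size = (int(images.shape[-2] * s), int(images.shape[-1] * s))
        x = F.interpolate(images, size=size, mode='bilinear', align_corners=False)
        y = F.interpolate(masks, size=size, mode='nearest')
        preds = model(x)                                  # list of per-layer outputs
        loss = sum(structure_loss(p, y) for p in preds)   # deep supervision
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(scales)
```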
3.2. Baseline
The following advanced models were selected for comparative testing: UNet,3 AG-CUResNeSt,28 FCBFormer,29,30 HarDNet-DFUS,31 PVT-CASCADE,32 and PraNet.25
3.3. Ablation Study
To verify the improvement brought by depthwise separable convolution, we measured the floating point operations (FLOPs) and the number of parameters for all of the models. In these experiments, we used input data of shape (1, 3, 382, 382) for all models. As Table 1 shows, the CMNet model with depthwise separable convolution has 10% lower FLOPs and a 36% smaller number of parameters than the CMNet model with ordinary convolution.
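The parameter saving comes from factorizing each 3 × 3 convolution into a depthwise and a pointwise convolution, as the following toy comparison illustrates (the channel count is chosen arbitrarily).

```python
import torch.nn as nn

def params(m):
    return sum(p.numel() for p in m.parameters())

c = 256
# Standard 3x3 convolution vs. a depthwise separable equivalent.
standard = nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c, bias=False),  # depthwise
    nn.Conv2d(c, c, kernel_size=1, bias=False),                       # pointwise
)
print(params(standard), params(separable))  # 589824 vs. 67840 (~8.7x fewer)
```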
Table 1.
Comparison of FLOPs and number of parameters for different models using inputs of the same shape.
Method | FLOPs (G) | Parameters (M) |
---|---|---|
UNet3 | 70.018 | 13.395 |
SPResNet15 | 39.417 | 33.388 |
PVT12 | 21.135 | 44.958 |
PraNet25 | 15.649 | 30.498 |
PVT-CASCADE32 | 18.332 | 35.273 |
HarDNet-DFUS31 | 78.044 | 49.855 |
FCBFormer29 | 86.377 | 52.944 |
AG-CUResNeSt28 | 153.921 | 293.163 |
CMNet (ordinary convolution) | 150.948 | 75.27 |
CMNet (depthwise separable convolution) | 136.302 | 48.257 |
Note: bold values represent the optimal values, and italic values represent the optimal values within the same method.
In scenarios 1, 2, and 3, we examine the effect of the AAM on model performance. The experimental data are presented in Table 2, and we performed a t-test on the data to assess the significance of the comparison. The t-test results are shown in Table 3. Experimental comparison plots are shown in Figs. 5 and 6.
Table 2.
5-fold cross-validation of the CMNet model with and without the AAM module under scenarios 1, 2, and 3 with the validation averages for each fold.
Dataset | Method | Fold | mIoU↑ | mDice↑
---|---|---|---|---
Kvasir-SEG | CMNet (with AAM) | 1 | 0.923 | 0.954
Kvasir-SEG | CMNet (with AAM) | 2 | 0.930 | 0.959
Kvasir-SEG | CMNet (with AAM) | 3 | 0.925 | 0.956
Kvasir-SEG | CMNet (with AAM) | 4 | 0.924 | 0.956
Kvasir-SEG | CMNet (with AAM) | 5 | 0.926 | 0.955
Kvasir-SEG | CMNet (with AAM) | KFold mean | 0.926 | 0.956
Kvasir-SEG | CMNet (without AAM) | 1 | 0.889 | 0.918
Kvasir-SEG | CMNet (without AAM) | 2 | 0.898 | 0.923
Kvasir-SEG | CMNet (without AAM) | 3 | 0.885 | 0.922
Kvasir-SEG | CMNet (without AAM) | 4 | 0.889 | 0.919
Kvasir-SEG | CMNet (without AAM) | 5 | 0.878 | 0.915
Kvasir-SEG | CMNet (without AAM) | KFold mean | 0.888 | 0.919
CVC-ClinicDB | CMNet (with AAM) | 1 | 0.936 | 0.957
CVC-ClinicDB | CMNet (with AAM) | 2 | 0.933 | 0.953
CVC-ClinicDB | CMNet (with AAM) | 3 | 0.924 | 0.949
CVC-ClinicDB | CMNet (with AAM) | 4 | 0.934 | 0.956
CVC-ClinicDB | CMNet (with AAM) | 5 | 0.930 | 0.952
CVC-ClinicDB | CMNet (with AAM) | KFold mean | 0.932 | 0.954
CVC-ClinicDB | CMNet (without AAM) | 1 | 0.899 | 0.918
CVC-ClinicDB | CMNet (without AAM) | 2 | 0.897 | 0.896
CVC-ClinicDB | CMNet (without AAM) | 3 | 0.876 | 0.900
CVC-ClinicDB | CMNet (without AAM) | 4 | 0.894 | 0.916
CVC-ClinicDB | CMNet (without AAM) | 5 | 0.895 | 0.917
CVC-ClinicDB | CMNet (without AAM) | KFold mean | 0.892 | 0.909
Merge | CMNet (with AAM) | 1 | 0.931 | 0.957
Merge | CMNet (with AAM) | 2 | 0.933 | 0.958
Merge | CMNet (with AAM) | 3 | 0.928 | 0.955
Merge | CMNet (with AAM) | 4 | 0.927 | 0.955
Merge | CMNet (with AAM) | 5 | 0.936 | 0.960
Merge | CMNet (with AAM) | KFold mean | 0.931 | 0.957
Merge | CMNet (without AAM) | 1 | 0.901 | 0.923
Merge | CMNet (without AAM) | 2 | 0.892 | 0.918
Merge | CMNet (without AAM) | 3 | 0.891 | 0.917
Merge | CMNet (without AAM) | 4 | 0.897 | 0.921
Merge | CMNet (without AAM) | 5 | 0.890 | 0.916
Merge | CMNet (without AAM) | KFold mean | 0.894 | 0.919
Note: bold values represent the optimal values.
Table 3.
Evaluation of the models with and without the AAM on the test set; the results were analyzed for significance using a t-test.
Dataset | Method | mIoU t-statistic | mIoU p-value | mDice t-statistic | mDice p-value
---|---|---|---|---|---
Kvasir-SEG | CMNet (with AAM) vs. CMNet (without AAM) | 10.91 | | 22.03 |
CVC-ClinicDB | CMNet (with AAM) vs. CMNet (without AAM) | 8.45 | | 8.94 |
Merge | CMNet (with AAM) vs. CMNet (without AAM) | 13.86 | | 23.56 |
Fig. 5.
Experimental results of UNet, PraNet, PVT-CASCADE, HarDNet-DFUS, FCBFormer, AG-CUResNeSt, CMNet (without AAM), and CMNet under scenario 1.
Fig. 6.
Experimental results of UNet, PraNet, PVT-CASCADE, HarDNet-DFUS, FCBFormer, AG-CUResNeSt, CMNet (without AAM), and CMNet under scenario 2.
Tables 2 and 3 show the results of our significance experiments on the AAM. From the experimental data in the tables, we can see that the model using the AAM improves mIoU and mDice on the Kvasir-SEG dataset by 4.1% and 3.8%, respectively, and on the CVC-ClinicDB dataset by 4.3% and 4.7%, respectively. On the merged dataset, both mIoU and mDice improve by 3.97%. For the t-test, we used the evaluation results of the two models on the test set as the source of the test data. The statistics of both metrics are greater than 0, showing that the model with the AAM outperforms the model without it on the test mean, and the p-values of both metrics are less than 0.1, showing that this performance difference is significant.
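The comparison can be reproduced with a standard two-sample t-test on the per-image test scores, as in the following sketch with placeholder values.

```python
from scipy.stats import ttest_ind

# Placeholder per-image test-set IoU scores of the two model variants.
iou_with_aam = [0.93, 0.92, 0.95, 0.91, 0.94]
iou_without_aam = [0.88, 0.87, 0.90, 0.86, 0.89]

stat, p_value = ttest_ind(iou_with_aam, iou_without_aam)
print(f"t-statistic = {stat:.2f}, p-value = {p_value:.4f}")
# A positive t-statistic with a small p-value indicates that the AAM variant
# scores higher on average and that the gap is unlikely to be due to chance.
```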
To demonstrate the performance improvement of dual-branch networks over single-branch networks, we compare the mIoU and mDice obtained when using SPResNet and PVT alone under scenarios 1, 2, and 3. To ensure fairness, the same output head is used for SPResNet and PVT in these experiments.
We used the same training method and number of iterations for all models in the ablation test, and the comparison results in Figs. 3 and 4 clearly show the limitations and shortcomings of using a single network. In Tables 4–6, our model shows significant improvement over SPResNet and PVT in the tests on the Kvasir-SEG dataset, the CVC-ClinicDB dataset, and the merged dataset, respectively; our model not only learns the local details but also achieves effective learning of the global contour.
Fig. 3.
Experimental results of SPResNet, PVT, and CMNet under scenario 1.
Fig. 4.
Experimental results of SPResNet, PVT, and CMNet under scenario 2.
Table 4.
Performance metrics of UNet, SPResNet, PVT, PraNet, PVT-CASCADE, HarDNet-DFUS, FCBFormer, AG-CUResNeSt, and CMNet in scenario 1.
Method | mIoU↑ | mDice↑ | mRecall↑ | mPrecision↑ |
---|---|---|---|---|
UNet3 | 0.712 ± 0.027 | 0.797 ± 0.022 | 0.813 ± 0.048 | 0.854 ± 0.028 |
SPResNet15 | 0.794 ± 0.021 | 0.860 ± 0.017 | 0.890 ± 0.018 | 0.881 ± 0.021 |
PVT12 | 0.841 ± 0.024 | 0.897 ± 0.012 | 0.915 ± 0.017 | 0.911 ± 0.010 |
PraNet25 | 0.899 ± 0.008 | 0.933 ± 0.008 | 0.945 ± 0.009 | 0.941 ± 0.009 |
PVT-CASCADE32 | 0.905 ± 0.007 | 0.928 ± 0.009 | 0.943 ± 0.005 | 0.940 ± 0.007 |
HarDNet-DFUS31 | 0.889 ± 0.005 | 0.927 ± 0.006 | 0.933 ± 0.012 | 0.937 ± 0.010 |
FCBFormer29 | 0.921 ± 0.005 | 0.937 ± 0.007 | 0.948 ± 0.003 | 0.945 ± 0.009 |
AG-CUResNeSt28 | 0.881 ± 0.007 | 0.925 ± 0.005 | 0.938 ± 0.004 | 0.923 ± 0.006 |
CMNet (ours) | 0.932 ± 0.008 | 0.954 ± 0.005 | 0.967 ± 0.012 | 0.959 ± 0.004 |
Note: bold values represent optimal values and italic values represent the suboptimal values.
Table 5.
Performance metrics of UNet, SPResNet, PVT, PraNet, PVT-CASCADE, HarDNet-DFUS, FCBFormer, AG-CUResNeSt, and CMNet in scenario 2.
Method | mIoU↑ | mDice↑ | mRecall↑ | mPrecision↑ |
---|---|---|---|---|
UNet3 | 0.752 ± 0.015 | 0.834 ± 0.023 | 0.848 ± 0.012 | 0.873 ± 0.022 |
SPResNet15 | 0.761 ± 0.014 | 0.835 ± 0.012 | 0.885 ± 0.011 | 0.874 ± 0.01 |
PVT12 | 0.813 ± 0.029 | 0.875 ± 0.036 | 0.890 ± 0.019 | 0.900 ± 0.024 |
PraNet25 | 0.832 ± 0.012 | 0.895 ± 0.015 | 0.900 ± 0.02 | 0.907 ± 0.013 |
PVT-CASCADE32 | 0.870 ± 0.01 | 0.924 ± 0.009 | 0.943 ± 0.005 | 0.932 ± 0.008 |
HarDNet-DFUS31 | 0.822 ± 0.02 | 0.897 ± 0.012 | 0.907 ± 0.011 | 0.912 ± 0.015 |
FCBFormer29 | 0.913 ± 0.006 | 0.941 ± 0.004 | 0.949 ± 0.005 | 0.959 ± 0.007 |
AG-CUResNeSt28 | 0.852 ± 0.005 | 0.910 ± 0.004 | 0.907 ± 0.006 | 0.929 ± 0.005 |
CMNet (ours) | 0.926 ± 0.004 | 0.956 ± 0.003 | 0.962 ± 0.004 | 0.960 ± 0.004 |
Note: bold values represent optimal values and italic values represent the suboptimal values.
Table 6.
Performance metrics of UNet, SPResNet, PVT, PraNet, PVT-CASCADE, HarDNet-DFUS, FCBFormer, AG-CUResNeSt, and CMNet in scenario 3.
Method | Kvasir-SEG | CVC-ClinicDB | ||
---|---|---|---|---|
mIoU↑ | mDice↑ | mIoU↑ | mDice↑ | |
UNet3 | 0.751 | 0.835 | 0.826 | 0.876 |
SPResNet15 | 0.742 | 0.821 | 0.778 | 0.841 |
PVT12 | 0.824 | 0.882 | 0.872 | 0.918 |
PraNet25 | 0.841 | 0.892 | 0.859 | 0.909 |
PVT-CASCADE32 | 0.871 | 0.918 | 0.890 | 0.919 |
HarDNet-DFUS31 | 0.833 | 0.897 | 0.859 | 0.911 |
FCBFormer29 | 0.894 | 0.923 | 0.911 | 0.936 |
AG-CUResNeSt28 | 0.842 | 0.899 | 0.866 | 0.909 |
CMNet (ours) | 0.897 | 0.927 | 0.928 | 0.954 |
Note: bold values represent optimal values and italic values represent the suboptimal values.
In the experiments above, we verified that the dual-branch model outperforms the single-branch model and, through a significance test, that the AAM improves model performance. To verify that our proposed model is state of the art, we compare it with the following state-of-the-art models: UNet, PraNet, PVT-CASCADE, HarDNet-DFUS, FCBFormer, and AG-CUResNeSt.
We tested all of the models under scenario 1. As the results in Fig. 5 show, our model segments continuous and regular colon polyps with high accuracy and correctly identifies the segmentation region. As can be seen from the second row of the figure, our model performs excellently in recognizing small targets. Compared with the second-ranked FCBFormer, our model outperforms it by 1.1% in mIoU and 1.7% in mDice, as shown in Table 4.
In the scenario 2 test shown in Fig. 6, we chose cases with multiple small targets, as fragmented lesion regions place high demands on the segmentation performance of a model. In the first row, our proposed model does not recognize all colon polyps, but neither does it mark redundant regions; in contrast to the other models, it restricts the recognition area to the lesion regions. Second, for lesion regions that are close together, existing models tend to recognize multiple targets as one target, as shown in the third row, whereas our model segments them well and accurately. Our model improves on the second-ranked FCBFormer by 1.1% in mIoU and 1.5% in mDice, as shown in Table 5.
Considering the model's adaptability to different scenarios, we tested our model against the other models in scenarios 4 and 5. From Fig. 7, we can see that our model still maintains good performance in complex scenarios, segmenting both small and large targets well, which shows that it adapts well to different situations (Table 7).
Fig. 7.
Experimental results of UNet, PraNet, PVT-CASCADE, HarDNet-DFUS, FCBFormer, AG-CUResNeSt, and CMNet under scenarios 4 and 5. The images in the first and second rows represent scenario 4. The images in the third and fourth rows represent scenario 5.
Table 7.
Performance metrics of UNet, PraNet, PVT-CASCADE, HarDNet-DFUS, FCBFormer, AG-CUResNeSt, and CMNet in scenarios 4 and 5.
Dataset | Model | mIoU↑ | mDice↑
---|---|---|---
Train on Kvasir-SEG, test on CVC-ClinicDB | UNet3 | 0.6676 | 0.7366
Train on Kvasir-SEG, test on CVC-ClinicDB | PraNet25 | 0.7369 | 0.7890
Train on Kvasir-SEG, test on CVC-ClinicDB | HarDNet-DFUS31 | 0.7616 | 0.8450
Train on Kvasir-SEG, test on CVC-ClinicDB | PVT-CASCADE32 | 0.7765 | 0.8589
Train on Kvasir-SEG, test on CVC-ClinicDB | AG-CUResNeSt28 | 0.7654 | 0.8533
Train on Kvasir-SEG, test on CVC-ClinicDB | FCBFormer29 | 0.8038 | 0.8735
Train on Kvasir-SEG, test on CVC-ClinicDB | CMNet (ours) | 0.8071 | 0.9060
Train on CVC-ClinicDB, test on Kvasir-SEG | UNet3 | 0.5446 | 0.6616
Train on CVC-ClinicDB, test on Kvasir-SEG | PraNet25 | 0.7073 | 0.7923
Train on CVC-ClinicDB, test on Kvasir-SEG | HarDNet-DFUS31 | 0.8147 | 0.8979
Train on CVC-ClinicDB, test on Kvasir-SEG | PVT-CASCADE32 | 0.7884 | 0.8652
Train on CVC-ClinicDB, test on Kvasir-SEG | AG-CUResNeSt28 | 0.8072 | 0.8833
Train on CVC-ClinicDB, test on Kvasir-SEG | FCBFormer29 | 0.8214 | 0.8848
Train on CVC-ClinicDB, test on Kvasir-SEG | CMNet (ours) | 0.8385 | 0.9190
Note: bold values represent optimal values and italic values represent the suboptimal values.
Although our model achieves good results on the vast majority of dataset images, as shown in Fig. 8, it still produces poor segmentation results when dealing with large, continuous, irregular targets and complex folds. This is a common problem for current colon polyp segmentation models and even for junior colonoscopists.
Fig. 8.
Some cases of segmentation failure of our model in the Kvasir-SEG dataset.
4. Conclusion
Research on colon polyp segmentation using deep learning techniques is very active, and various effective solutions have been proposed. Addressing the shortcomings of existing techniques, we proposed our own solution to the colon polyp segmentation problem, aiming to promote further development in this area. We propose a new structure, CMNet, that realizes our original design goal through structural innovation, i.e., retaining the advantages of CNNs and transformers while minimizing their shortcomings. We compare CMNet with state-of-the-art models and experimentally demonstrate that it achieves state-of-the-art performance in multiple test environments, showing that our proposed model and idea are effective.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 62271456) and the National Key Science and Technology Program 2030 (Grant No. 2021ZD0110600).
Biographies
Xuguang Cao is a postgraduate student at the School of Electronic Engineering and Automation, Guilin University of Electronic Technology. His current research interests include image processing, 3D reconstruction, and artificial intelligence.
Kefeng Fan received his PhD from the Xidian University, Xi’an, China. He is currently director at the Information Technology Research Center, China Electronics Standardization Institute. He has authored more than 130 papers. His current research interests include image processing and 3D reconstruction.
Cun Xu is a postgraduate student at the School of Electronic Engineering and Automation, Guilin University of Electronic Technology. His current research interests include image processing and artificial intelligence.
Huilin Ma is a postgraduate student at the college of electronic information, Guangxi Minzu University. Her current research interests include wireless charging and artificial intelligence.
Kaijie Jiao is a postgraduate student at the School of Electronic Engineering and Automation, Guilin University of Electronic Technology. His current research interests include image processing and machine learning.
Contributor Information
Xuguang Cao, Email: 18174708539@163.com.
Kefeng Fan, Email: fankf@126.com.
Cun Xu, Email: 18899599339@163.com.
Huilin Ma, Email: 18176462476@163.com.
Kaijie Jiao, Email: jiaokj98@163.com.
Disclosures
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Code and Data Availability
The code is available at https://github.com/MaybeMarvel/CMNet.
References
1. Long J., Shelhamer E., Darrell T., "Fully convolutional networks for semantic segmentation," in IEEE Conf. Comput. Vision and Pattern Recognit. (CVPR) (2014).
2. Badrinarayanan V., Kendall A., Cipolla R., "SegNet: a deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495 (2015). 10.1109/TPAMI.2016.2644615
3. Ronneberger O., Fischer P., Brox T., "U-Net: convolutional networks for biomedical image segmentation," Lect. Notes Comput. Sci. 9351, 234–241 (2015). 10.1007/978-3-319-24574-4_28
4. Zhou Z., et al., "UNet++: a nested U-Net architecture for medical image segmentation," Lect. Notes Comput. Sci. 11045, 3–11 (2018). 10.1007/978-3-030-00889-5_1
5. Diakogiannis F. I., et al., "ResUNet-a: a deep learning framework for semantic segmentation of remotely sensed data," ISPRS J. Photogramm. Remote Sens. 162, 94–114 (2020). 10.1016/j.isprsjprs.2020.01.013
6. He K., et al., "Deep residual learning for image recognition," in IEEE Conf. Comput. Vision and Pattern Recognit. (CVPR) (2015). 10.1109/CVPR.2016.90
7. Zhao H., et al., "Pyramid scene parsing network," in IEEE Conf. Comput. Vision and Pattern Recognit. (CVPR) (2016). 10.1109/CVPR.2017.660
8. Lin T.-Y., et al., "Feature pyramid networks for object detection," in IEEE Conf. Comput. Vision and Pattern Recognit. (CVPR) (2016). 10.1109/CVPR.2017.106
9. Vaswani A., et al., "Attention is all you need," in Proc. 31st Int. Conf. Neural Inf. Process. Syst. (2017).
10. Dosovitskiy A., et al., "An image is worth 16x16 words: transformers for image recognition at scale," arXiv:2010.11929 (2020).
11. Wang W., et al., "Pyramid vision transformer: a versatile backbone for dense prediction without convolutions," in IEEE/CVF Int. Conf. Comput. Vision (ICCV) (2021). 10.1109/ICCV48922.2021.00061
12. Wang W., et al., "PVT v2: improved baselines with pyramid vision transformer," Comput. Vis. Media 8, 415–424 (2021). 10.1007/s41095-022-0274-8
13. Liu Z., et al., "Swin transformer v2: scaling up capacity and resolution," in IEEE/CVF Conf. Comput. Vision and Pattern Recognit. (CVPR) (2021). 10.1109/CVPR52688.2022.01170
14. Liu Z., et al., "Swin transformer: hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vision, pp. 10012–10022 (2021).
15. Hou Q., et al., "Strip pooling: rethinking spatial pooling for scene parsing," in IEEE/CVF Conf. Comput. Vision and Pattern Recognit. (CVPR) (2020). 10.1109/CVPR42600.2020.00406
16. Lee C. Y., et al., "Deeply-supervised nets," in Artif. Intell. Statistics, pp. 562–570, PMLR (2015).
17. Wang X., et al., "Non-local neural networks," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., pp. 7794–7803 (2018).
18. Deng J., et al., "ImageNet: a large-scale hierarchical image database," in IEEE Conf. Comput. Vision and Pattern Recognit., pp. 248–255 (2009). 10.1109/CVPR.2009.5206848
19. Wu Y., He K., "Group normalization," in Proc. Eur. Conf. Comput. Vision (ECCV), pp. 3–19 (2018).
20. Elfwing S., Uchibe E., Doya K., "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning," Neural Networks 107, 3–11 (2018). 10.1016/j.neunet.2017.12.012
21. Ioffe S., Szegedy C., "Batch normalization: accelerating deep network training by reducing internal covariate shift," in Int. Conf. Mach. Learn. (2015).
22. Glorot X., Bordes A., Bengio Y., "Deep sparse rectifier neural networks," in Proc. Fourteenth Int. Conf. Artif. Intell. Statistics, JMLR Workshop and Conference Proceedings, pp. 315–323 (2011).
23. Jha D., et al., "Kvasir-SEG: a segmented polyp dataset," Lect. Notes Comput. Sci. 11962, 451–462 (2020). 10.1007/978-3-030-37734-2_37
24. Bernal J., et al., "WM-DOVA maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians," Comput. Med. Imaging Graphics 43, 99–111 (2015). 10.1016/j.compmedimag.2015.02.007
25. Fan D.-P., et al., "PraNet: parallel reverse attention network for polyp segmentation," in Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, Springer, pp. 263–273 (2020).
26. Wei J., Wang S., Huang Q., "F3Net: fusion, feedback and focus for salient object detection," Proc. AAAI Conf. Artif. Intell. 34, 12321–12328 (2020). 10.1609/aaai.v34i07.6916
27. Loshchilov I., Hutter F., "Decoupled weight decay regularization," arXiv:1711.05101 (2017).
28. Sang D. V., et al., "AG-CUResNeSt: a novel method for colon polyp segmentation," arXiv:2105.00402 (2021).
29. Sanderson E., Matuszewski B. J., "FCN-transformer feature fusion for polyp segmentation," Lect. Notes Comput. Sci. 13413, 892–907 (2022). 10.1007/978-3-031-12053-4_65
30. Fitzgerald K., Matuszewski B., "FCB-SwinV2 transformer for polyp segmentation," arXiv:2302.01027 (2023).
31. Liao T.-Y., et al., "HarDNet-DFUS: an enhanced harmonically-connected network for diabetic foot ulcer image segmentation and colonoscopy polyp segmentation," arXiv:2209.07313 (2022).
32. Rahman M., Marculescu R., "Medical image segmentation via cascaded attention decoding," in IEEE/CVF Winter Conf. Appl. Comput. Vision (WACV) (2023).