Abstract
Endoscopic image segmentation plays a key role in helping doctors accurately locate focal areas and in improving diagnostic efficiency. However, existing methods struggle to exploit local details and global semantic information simultaneously, which makes it difficult to segment organs and tissues with complex morphology, fuzzy boundaries, and similar textures. We therefore propose a progressive fusion network (PFNet). First, PFNet uses the Transformer-based Pvtv2 as its backbone encoder to capture multi-scale global features. Second, a noise filtering attention module (NFAM) is designed to suppress noise and enhance the semantics of the multi-level features output by the encoder. Third, a boundary and location awareness module (BLAM) is proposed to generate high-quality boundary and position information by blending deep global features with shallow local details. Fourth, an auxiliary information embedding module (AIEM) dynamically embeds the boundary and position information into each feature level to strengthen context awareness during decoding. Finally, the feature fusion module (FFM) supplements the boundary and location information through layer-by-layer iteration to ensure the collaborative recovery of global semantics and local details. Through extensive experiments, we demonstrate that our proposed PFNet outperforms current state-of-the-art (SOTA) methods in segmentation performance on the Ureter, Re-TMRS, Kvasir, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300 datasets. In particular, mDice reaches 91.07% on the Re-TMRS dataset and 93.09% on CVC-ClinicDB.
Subject terms: Engineering, Mathematics and computing
Introduction
Medical image segmentation (MIS), as a critical technique within medical image analysis, leverages computer vision to precisely delineate specific regions within medical images. This technology is indispensable for clinical diagnostics by enabling accurate identification of anatomical structures including tissues, pathological regions, and organs, thereby facilitating disease assessment and therapeutic strategy development1. As a prevalent category in medical imaging, endoscopic visualization provides direct visualization of pathological alterations in anatomical structures. As a standard diagnostic tool, endoscopic systems have been comprehensively integrated into routine practice within general surgery. Furthermore, automated endoscopic image segmentation facilitates precise intraoperative identification of both surgical instruments and pathological regions, thereby enhancing procedural efficacy critical for optimizing clinical outcomes2.
The U-Net architecture3, leveraging convolutional neural networks (CNN), has emerged as the de facto benchmark framework through its symmetrical encoder-decoder topology. This configuration employs cascaded convolutional layers in both pathways, with lateral skip connections enabling multi-scale semantic fusion across shallow and deep feature hierarchies. The inherent architectural extensibility of U-Net has spawned numerous evolutionary variants including ResUNet4, ResUNet++5, UNet++6, and Attention-UNet7, each progressively optimizing baseline performance metrics. While convolutional operations excel in localized feature extraction through kernel-based spatial filtering, their fundamental limitation resides in constrained receptive fields, a critical impediment to modeling long-range contextual dependencies and integrating holistic image semantics. Although sequential downsampling operations theoretically expand receptive coverage to capture broader spatial patterns, this paradigm introduces dual challenges: progressive erosion of fine-grained anatomical details and escalated computational redundancy through feature map dimensionality reduction. Particularly in endoscopic image segmentation scenarios, where clinical workflows demand high-resolution visualization (typically 1280 × 1024 pixels), conventional CNN-based architectures frequently suffer segmentation fidelity degradation when critical morphological features are compromised during aggressive downsampling.
The emergence of the Vision Transformer (ViT)8 brought the Transformer9 architecture into computer vision. Subsequent architectural innovations include Swin-Transformer10, which implements hierarchical feature learning through shifted window partitioning, achieving computational efficiency while preserving multi-scale representational capacity. In medical imaging domains, TransUNet11 strategically concatenates CNN-derived low-level features with Transformer modules to synergize local texture patterns with global semantic context. UCTransNet12 introduces Channel Transformer (CTrans) modules as an advanced alternative to conventional skip connections, effectively bridging semantic disparities between encoder-decoder feature hierarchies. DS-TransUnet13 further extends this paradigm through dual-scale Swin-Transformer encoders that perform multi-resolution feature fusion prior to decoder integration. However, a critical limitation persists in these Transformer-based approaches: the self-attention mechanism, while powerful for capturing long-range dependencies, exhibits an intrinsic deficiency in modeling fine-grained local details and pixel-wise spatial correlations. This limitation is particularly detrimental in medical image segmentation tasks, such as endoscopic analysis, where precise delineation of subtle tissue boundaries, lesion margins, and anatomical structures at the micron level is paramount. The inability to effectively capture these intricate local features frequently manifests as boundary ambiguity and inaccurate localization in segmentation outputs, directly impacting diagnostic reliability and therapeutic planning. Consequently, bridging this gap in local modeling capability while leveraging Transformer’s strength in global context understanding represents a fundamental challenge and a core objective in advancing MIS for endoscopic imagery.
To address this critical challenge of Transformer’s local detail modeling deficiency in endoscopic MIS, we introduce a novel Progressive Fusion Network (PFNet). The central guiding principle of PFNet is the progressive and effective fusion of multi-scale features, specifically designed to overcome the local modeling gap inherent in standard Transformer architectures. Our core hypothesis is that explicitly and gradually integrating high-resolution local features (rich in boundary and positional information) with deep, semantically strong global features derived from the Transformer encoder can synergistically compensate for Transformer’s weakness in fine-grained perception.
This study introduces a novel progressive fusion network (PFNet) designed for medical endoscopic image segmentation. Building upon this core principle of progressive fusion to overcome Transformer’s local modeling limitation, our framework systematically integrates multi-scale contextual features to enhance both structural completeness and boundary precision in clinical imaging analysis. The primary innovations of this research are fourfold:
This work introduces a progressive fusion network (PFNet) for endoscopic medical image segmentation. Driven by the core objective of mitigating Transformer’s local detail deficiency, PFNet leverages a Pvtv214-based Transformer encoder for global feature extraction. Adopting a progressive fusion strategy, the network generates high-quality boundary and position features, which are then gradually embedded into each decoder output layer and subsequent fusion processes. This approach yields endoscopic image segmentation results characterized by accurate positioning and clear boundaries.
A noise filtering attention module (NFAM) is developed to further extract potential local detail features of organs and tissues while enhancing semantic features and filtering background noise. This module provides a cleaner foundation essential for the progressive fusion process.
Instrumental to the progressive fusion strategy are the proposed boundary and position awareness module (BLAM) and auxiliary information embedding module (AIEM). BLAM fuses deep global features with shallow local details to generate high-quality boundary and position features. AIEM then dynamically embeds this boundary and position information into each level of decoder features, enhancing context-aware ability during decoding and progressively refining the feature representation with crucial local cues.
A novel feature fusion module (FFM) is proposed. Unlike conventional approaches that directly aggregate the multi-scale features of every layer, FFM synchronously fuses the high-quality boundary and position features while recovering global features layer by layer from the deep representations. This design improves segmentation accuracy, prevents loss of semantic information, and embodies the progressive fusion principle by iteratively incorporating enhanced local information during feature reconstruction.
Related works
CNN in medical image segmentation
Convolutional Neural Networks (CNNs) have established themselves as foundational architectures in Medical Image Segmentation (MIS) research1,15,16. The field has witnessed progressive evolution from early Fully Convolutional Networks (FCNs)17 for universal pixel-wise segmentation – achieved through end-to-end convolutional operations replacing dense layers – to the seminal U-Net framework3. This paradigm shift introduced a symmetric encoder-decoder topology augmented with skip connections, enabling multi-scale feature map fusion during upsampling to optimize detail preservation and segmentation fidelity.
Subsequent architectural innovations have focused on mitigating CNN limitations in MIS contexts. UNet++6 addresses encoder-decoder semantic disparity through nested dense skip pathways, while Attention U-Net7 implements dynamic region-of-interest prioritization via integrated attention gates. The residual learning paradigms in Res-UNet4 preserve gradient flow through dedicated skip connections across encoder stages. Multi-task frameworks like SFANet18 employ dual decoders for concurrent region/boundary prediction, whereas SKNets19 enhance scale adaptability through selective kernel feature fusion. To handle anatomical variability, ACSNet20 deploys Adaptive Selection Modules (ASM) for context-aware integration of global semantics and local textures. Advanced approaches include PraNet’s21 two-stage coarse-to-fine segmentation with reverse attention mechanisms, HarDNet-MSEG’s22 computational optimization via pruned dense connections, and DCRNet’s23 cross-image similarity learning. Recent innovations further encompass EU-Net’s24 multi-scale non-local attention blocks with spatial cross-layer filtering, and SANet’s25 color-invariant segmentation through chromatic normalization and probabilistic refinement.
Nevertheless, persistent challenges stem from CNN architectures’ intrinsic constraints in modeling global contextual relationships and long-range spatial dependencies. These limitations frequently manifest as partial anatomical detail degradation and boundary delineation inaccuracies, ultimately compromising segmentation robustness–particularly in clinically critical scenarios demanding micron-level precision.
Vision transformer in medical image segmentation
The Transformer architecture8, initially developed for sequence modeling through self-attention mechanisms, has transcended its natural language processing (NLP) origins to reshape computer vision paradigms. By computing pairwise correlations across all input positions and generating context-aware representations, this framework demonstrates exceptional capacity for capturing global dependencies. Vision Transformer (ViT)9 pioneers this adaptation by processing images as sequences of positional-encoded non-overlapping patches, achieving state-of-the-art classification performance. Building upon this foundation, SETR26 employs pure Transformer encoders for feature extraction coupled with specialized decoders for pixel-level segmentation. Subsequent innovations focus on hybrid architectures that integrate CNN-inspired hierarchical structures with Transformer components10,27–32, exemplified by Swin Transformer’s10 linear-complexity shifted window attention and PVTv2’s14 overlapping patch embedding strategy for enhanced feature representation.
In medical image segmentation (MIS), Transformer-based architectures demonstrate growing clinical relevance. TransUNet11 establishes the viability of CNN-Transformer hybrid encoders for global context modeling, while UCTransNet12 replaces conventional skip connections with Channel Transformer (CTrans) modules to bridge semantic gaps. DS-TransUNet13 advances multi-scale feature extraction through dual Swin Transformer encoders, and TransFuse33 introduces cross-modality fusion via parallel CNN-Transformer branches. Clinical specializations emerge through architectures like Polyp-PVT34, which leverages PVTv2 for enhanced polyp segmentation, and PVT-CASCADE35 implementing hierarchical cascaded attention decoding with feature prioritization mechanisms. HSNet36 further refines this paradigm through cross-semantic attention filtering and hybrid semantic complementation for anatomical detail preservation.
This architectural evolution reveals fundamental tradeoffs: While CNNs excel at extracting localized patterns through spatial inductive biases, their limited receptive fields constrain global semantic integration crucial for precise anatomical delineation. Conversely, ViT-derived models8,9 achieve comprehensive contextual modeling via self-attention mechanisms, yet confront dual challenges of local feature discriminability limitations and quadratic computational scaling with input resolution – particularly problematic in high-resolution endoscopic imaging. These complementary strengths and weaknesses underscore the critical need for architectures that synergize CNN’s local processing proficiency with Transformer’s global relational reasoning capabilities in medical imaging applications.
Methods
In this section, we present an overview of the overall architecture and related modules of PFNet. Specifically, we first describe the entire network structure, then provide detailed explanations for each module within it. These include the Pvtv2 backbone network for feature extraction and our proposed four modules: the Noise Filtering Attention Module (NFAM), the Boundary and Location Awareness Module (BLAM), the Auxiliary Information Embedding Module (AIEM), and the Feature Fusion Module (FFM). Finally, we present the composition of our loss function.
Overall architecture
The architecture of our designed PFNet is shown in Fig. 1. We utilize the Pvtv2 encoder to extract multi-level semantic features from the image. The output feature layers are then passed through the Noise Filtering Attention Module (NFAM) to filter out background noise and further refine the features. Subsequently, we construct the Boundary and Location Awareness Module (BLAM) to obtain complete boundary and location information. Through our proposed Auxiliary Information Embedding Module (AIEM), this boundary and location information is integrated into each feature layer. Finally, using our designed Feature Fusion Module (FFM), we fuse the boundary and location information while progressively restoring global context from deep features. This ensures that the boundary and location information are not lost, resulting in target regions with accurate, clear boundaries and correct positional localization.
Fig. 1.
The overall structure of the PFNet. We adopt Pvtv2 as the encoder of the network. In the decoder part, a progressive fusion structure is proposed, which first generates high-quality boundary and location features and then progressively embeds these features into each feature layer of the decoder output and the subsequent fusion process. To this end, we propose four modules: BLAM (Boundary and Location Awareness Module), NFAM (Noise Filtering Attention Module), AIEM (Auxiliary Information Embedding Module), and FFM (Feature Fusion Module).
NFAM’s handling of low-level noise is the foundation of progressive fusion; injecting boundary and location information at different stages through BLAM and AIEM is its key link; and FFM realizes the gradual fusion of features across levels.
Specifically, the image to be segmented is first processed by the Pvtv2 encoder. Leveraging the Vision Transformer’s strong ability to capture global information, we obtain rich multi-scale global features from endoscopic images of organs and tissues. The four feature layers of different scales output by Pvtv2 are first denoised: the Noise Filtering Attention Module (NFAM) filters out irrelevant background noise and enhances semantic information, yielding higher-quality multi-scale feature layers. Among these, the deep layers are high-level features that contain abundant global information, such as the positions of organs and tissues, while the shallow layers are low-level features that capture rich detail such as texture, color, and boundary characteristics. After denoising, the Boundary and Location Awareness Module (BLAM) fuses the positional information of the high-level features with the boundary information of the low-level features, producing a high-quality boundary-and-position feature layer. Next, the Auxiliary Information Embedding Module (AIEM) embeds this boundary-and-position feature layer into each of the original multi-scale features, enhancing the semantic information of every level and producing improved features. These enhanced features are then upsampled so that all feature layers share the same spatial dimensions. Finally, each feature layer passes through three Feature Fusion Modules (FFM): starting from the high-level features, this hierarchical process progressively restores the global information of organs and tissues to generate a prediction map. Notably, unlike traditional approaches, our FFM continuously re-injects the high-quality boundary and positional information at every fusion step, ensuring that the semantic integrity of organs and tissues is preserved throughout processing of the endoscopic image.
Encoder
Inspired by the outstanding performance of Transformer in image processing tasks, we chose Pvtv2 as our primary feature extraction tool. Pvtv2 significantly reduces the computational resources required for attention mechanisms through its use of a pyramid structure and a technique called linear spatial reduction attention. Additionally, Pvtv2 employs an overlapping block embedding approach, where adjacent image blocks share a 50% overlap. This allows each block to contain more contextual information, thereby enhancing the local continuity of features and improving boundary awareness. In essence, Pvtv2 optimizes its structure and incorporates overlapping strategies to achieve both computational efficiency and improved feature extraction quality.
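To make the overlap concrete: with a patch window larger than its stride, adjacent patches share pixels, and patch embedding reduces to a strided convolution. The sketch below illustrates this idea only; the class name and the exact kernel/stride values (7 and 4, giving roughly 50% overlap) are our assumptions for illustration, not code from the Pvtv2 repository.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding: a strided conv whose kernel is larger
    than its stride, so neighbouring patches share pixels and each token
    carries more local context."""
    def __init__(self, in_ch=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                  # B, C, H/stride, W/stride
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)  # B, H*W, C token sequence
        return self.norm(x), (H, W)

tokens, (H, W) = OverlapPatchEmbed()(torch.randn(1, 3, 352, 352))
```

A 352 × 352 image thus becomes an 88 × 88 grid of 64-dimensional tokens, each computed from a 7 × 7 window that overlaps its neighbours.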
Noise filtering attention module
While the Transformer’s self-attention mechanism demonstrates superior long-range dependency modeling through its global receptive field, enabling enhanced feature representation and contextual correlation learning, its vanilla implementation encounters critical limitations in medical imaging scenarios. The quadratic computational overhead inherent in standard self-attention proves particularly prohibitive when processing high-resolution endoscopic data, while fixed-scale attention weights fail to capture the heterogeneous anatomical structures characteristic of multi-organ endoscopic environments. Therefore, drawing on the ideas of PSPNet37 and ANNNet38, we optimize the traditional self-attention mechanism and design our NFAM module, which better captures complete semantic information and filters out background noise. As shown in Fig. 2, the input feature layer undergoes multi-scale pyramid pooling operations37. This not only reduces the computational load but also lets receptive fields of different sizes capture features of different scales, which is particularly useful for identifying organs or tissues of varying size in endoscopic images. Specifically, we apply several pyramid pooling operations to downsample the feature layer, then reshape and concatenate the pooled features. In this design, the query Q is obtained directly by reshaping the input feature layer, while the keys K and values V are derived from the reshaped pyramid-pooled features. After multi-head attention is computed over Q, K, and V, the result is reshaped back to the spatial layout:

$$Q = W_Q\,\mathrm{Reshape}(f),\qquad K = W_K\,\mathrm{Reshape}\big(\mathrm{PP}(f)\big),\qquad V = W_V\,\mathrm{Reshape}\big(\mathrm{PP}(f)\big) \tag{1}$$

$$\mathrm{head}_j = \mathrm{softmax}\!\left(\frac{Q_j K_j^{\top}}{\sqrt{d_h}}\right)V_j,\qquad j = 1,\dots,n \tag{2}$$

$$f' = \mathrm{Reshape}\big(\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_n)\,W_O\big) \tag{3}$$

where $W_{(\cdot)}$ represents the linear projection, $\mathrm{PP}(\cdot)$ the pyramid pooling operation, $n$ denotes the number of heads, and $d_h$ is the dimension of each head, which equals $C/n$.
Fig. 2.
Description of the noise filtering attention module (NFAM).
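The computational trick NFAM borrows from PSPNet/ANN-style attention can be sketched as follows: queries keep full resolution while keys and values come from pyramid-pooled features, so the attention matrix is much smaller than in vanilla self-attention. This is a hedged illustration of the mechanism only, not the authors’ implementation; the pooling scales, channel width, and head count below are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooledAttention(nn.Module):
    """Multi-head attention whose keys/values are built from pyramid-pooled
    features, shrinking the K/V sequence (here 1+9+36+64 = 110 tokens)
    relative to the full-resolution query sequence."""
    def __init__(self, dim=64, heads=4, pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.pool_sizes = pool_sizes

    def forward(self, x):                                  # x: B, C, H, W
        B, C, H, W = x.shape
        q = self.q(x.flatten(2).transpose(1, 2))           # B, HW, C
        pooled = [F.adaptive_avg_pool2d(x, s).flatten(2) for s in self.pool_sizes]
        kv = torch.cat(pooled, dim=2).transpose(1, 2)      # B, sum(s*s), C
        k, v = self.kv(kv).chunk(2, dim=-1)

        def split(t):   # B, N, C -> B, heads, N, dh
            return t.view(B, -1, self.heads, self.dh).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) / self.dh ** 0.5  # B, heads, HW, 110
        out = attn.softmax(-1) @ v                         # B, heads, HW, dh
        out = out.transpose(1, 2).reshape(B, H * W, C)
        return out.transpose(1, 2).view(B, C, H, W)        # back to spatial

y = PyramidPooledAttention()(torch.randn(2, 64, 22, 22))
```

For a 22 × 22 feature map, each of the 484 query positions attends to only 110 pooled key/value tokens instead of all 484 positions.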
Boundary and location awareness module
Inspired by Amulet39 and EGNet40, good boundary priors and position priors are conducive to effective target segmentation and localization. Although the low-level features contain rich edge detail, they lack global position information. Therefore, to effectively extract boundary and positional features of organs and tissues, they must be fused with semantic information from the high-level features. As shown in Fig. 3, we fuse low-level and high-level features to perceive and model the complete boundary and location of organs and tissues. Specifically, we first reduce the channel dimensions of the high-level features T and the low-level features L with two 1 × 1 convolution layers, to 64 channels ($T'$) and 32 channels ($L'$) respectively, which reduces the computational load without hurting feature extraction. Next, we upsample $T'$ to match the size of $L'$ and concatenate the two. Finally, the concatenated features pass through a 3 × 3 convolution layer followed by a 1 × 1 convolution layer, and a Sigmoid function produces the single-channel boundary-and-position features $B$. The details of our BLAM layer are as follows:

$$T' = \mathrm{Conv}_{1\times 1}(T),\qquad L' = \mathrm{Conv}_{1\times 1}(L) \tag{4}$$

$$F_{b} = \mathrm{Concat}\big(\mathrm{Up}(T'),\,L'\big) \tag{5}$$

$$B = \sigma\big(\mathrm{Conv}_{1\times 1}(\mathrm{Conv}_{3\times 3}(F_{b}))\big) \tag{6}$$

where $\mathrm{Up}(\cdot)$ denotes upsampling and $\sigma$ is the Sigmoid function.
Fig. 3.
Description of the boundary and location awareness module (BLAM).
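Under our reading of the description above, BLAM can be sketched as below. The channel counts (64 and 32) follow the prose; the kernel sizes, class name, and input channel widths are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class BLAMSketch(nn.Module):
    """Boundary/location awareness sketch: reduce channels of deep (T) and
    shallow (L) features, upsample T to L's resolution, concatenate, and map
    the result to a single-channel sigmoid boundary/position map."""
    def __init__(self, t_ch=512, l_ch=64):
        super().__init__()
        self.reduce_t = nn.Conv2d(t_ch, 64, 1)   # T -> 64 channels
        self.reduce_l = nn.Conv2d(l_ch, 32, 1)   # L -> 32 channels
        self.fuse = nn.Sequential(
            nn.Conv2d(96, 32, 3, padding=1),     # 3x3 conv on the concat
            nn.Conv2d(32, 1, 1),                 # 1x1 conv to one channel
            nn.Sigmoid())

    def forward(self, t, l):
        t = nn.functional.interpolate(self.reduce_t(t), size=l.shape[2:],
                                      mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([t, self.reduce_l(l)], dim=1))

b = BLAMSketch()(torch.randn(1, 512, 11, 11), torch.randn(1, 64, 88, 88))
```

The output is a single-channel map at the shallow feature’s resolution, with values in [0, 1] acting as a boundary/position prior.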
Auxiliary information embedding module
O. Oktay et al. proposed Attention U-Net in 20187. This model suppresses regions outside the segmentation target, highlights features of the target regions, and significantly improves the accuracy of medical image segmentation. Its core module, the Attention Gate, dynamically adjusts weights to focus on regions of interest through a soft-attention mechanism. However, the Attention Gate is computationally intensive on high-resolution endoscopic images and relies on strict data alignment. To this end, our AIEM introduces group convolution to build a lightweight attention structure: group convolution replaces traditional convolution for intra-group feature fusion, at a computational cost significantly lower than that of traditional convolution. We additionally add a BatchNorm layer and a ReLU layer after each input convolution, yielding the New Attention Gate (NAG). As shown in Fig. 4, the high-quality boundary-and-position feature layer T is first restored to the same number of channels as the feature layer N (the layer into which auxiliary information is to be embedded) using a 1 × 1 convolution layer. The two feature layers are then fed into the NAG, which embeds the boundary and positional information into N. Finally, we process the outputs of the four feature layers through ECA41, SA42, and GHAP43 to further enhance the fused features. The details of our AIEM (Auxiliary Information Embedding Module) are as follows:

$$\hat{T} = \mathrm{Conv}_{1\times 1}(T) \tag{7}$$

$$\alpha = \sigma\Big(\psi\big(\mathrm{ReLU}\big(\mathrm{BN}(\mathrm{GConv}(\hat{T})) + \mathrm{BN}(\mathrm{GConv}(N))\big)\big)\Big) \tag{8}$$

$$\tilde{N} = N \otimes \alpha \tag{9}$$

$$N' = \mathrm{GHAP}\big(\mathrm{SA}(\mathrm{ECA}(\tilde{N}))\big) \tag{10}$$

where $\mathrm{GConv}(\cdot)$ denotes group convolution, $\mathrm{BN}$ batch normalization, $\psi$ a $1\times 1$ convolution, $\sigma$ the Sigmoid function, and $\otimes$ element-wise multiplication.
Fig. 4.
Description of the auxiliary information embedding module (AIEM).
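A minimal sketch of the New Attention Gate idea follows: group convolutions (cheaper than dense convolutions) with BatchNorm and ReLU feed a sigmoid gate that reweights the feature layer N using boundary/position cues from T. The exact wiring inside AIEM is not fully specified in the text, so the structure, class name, and group count below are assumptions, not the authors’ code.

```python
import torch
import torch.nn as nn

class NewAttentionGate(nn.Module):
    """Lightweight attention gate in the spirit of AIEM's NAG: group convs
    followed by BatchNorm+ReLU, then a 1x1 conv + sigmoid producing a gate
    map that reweights the target feature layer N."""
    def __init__(self, ch=64, groups=4):
        super().__init__()
        self.proj_t = nn.Sequential(nn.Conv2d(ch, ch, 1, groups=groups),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.proj_n = nn.Sequential(nn.Conv2d(ch, ch, 1, groups=groups),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, t, n):
        # gate map from the summed projections, then reweight N pixel-wise
        alpha = self.gate(torch.relu(self.proj_t(t) + self.proj_n(n)))
        return n * alpha

out = NewAttentionGate()(torch.randn(2, 64, 44, 44), torch.randn(2, 64, 44, 44))
```

With `groups=4`, each 1 × 1 group convolution uses a quarter of the parameters and multiply-adds of its dense counterpart, which is the efficiency argument made above.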
Feature fusion module
In order to dynamically balance global semantics and local details during decoding and to avoid losing boundary and position information, we introduce the channel attention module CA44, the spatial attention module SA42, and the pixel attention module PA45, and design the FFM. As shown in Fig. 5, the high-quality boundary-and-position feature layer B is first restored to the same number of channels as the input feature layers using a 1 × 1 convolution. We concatenate the result with the high-level feature layer $F_h$ (to be fused) and the low-level feature layer $F_l$, forming a concatenated feature layer F. Next, F passes through both the channel attention module CA44 and the spatial attention module SA42 to obtain $F_{ca}$ and $F_{sa}$. The channel attention selects channels representing the global information of organs and tissues while ignoring those representing background or noise; the spatial attention identifies the approximate location of organs and tissues, ignoring blank regions and irrelevant structures. We then add these two outputs together, combine them with the original feature layer F, and pass the result through the pixel attention PA45, which precisely identifies the boundaries of organs and tissues while assigning each interior pixel a weight representing its importance; this yields the weight map P. Subsequently, we multiply P with the high-level feature layer $F_h$ (which carries more global information) and multiply $1-P$ with the low-level feature layer $F_l$ (which carries more local detail), sum the two products, and pass the result through a final convolution layer to generate the output. The details of our FFM (Feature Fusion Module) are as follows:

$$F = \mathrm{Concat}\big(\mathrm{Conv}_{1\times 1}(B),\,F_h,\,F_l\big) \tag{11}$$

$$F_{ca} = \mathrm{CA}(F),\qquad F_{sa} = \mathrm{SA}(F) \tag{12}$$

$$P = \mathrm{PA}\big(F_{ca} + F_{sa} + F\big) \tag{13}$$

$$F_{out} = \mathrm{Conv}\big(P \otimes F_h + (1-P)\otimes F_l\big) \tag{14}$$
Fig. 5.
Description of the feature fusion module (FFM).
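The final blending step of FFM, as described above, weights the high-level layer by the pixel-attention map P and the low-level layer by 1 − P. A tiny standalone sketch of just that step (function name ours; the attention modules producing P are omitted):

```python
import torch

def ffm_blend(p, f_high, f_low):
    """Core FFM blending: a pixel-attention weight map P mixes the
    high-level layer (weight P) with the low-level layer (weight 1 - P)."""
    return p * f_high + (1.0 - p) * f_low

p = torch.full((1, 1, 4, 4), 0.75)   # gate favouring global cues
f_high = torch.ones(1, 8, 4, 4)      # stand-in high-level features
f_low = torch.zeros(1, 8, 4, 4)      # stand-in low-level features
out = ffm_blend(p, f_high, f_low)    # every value is 0.75
```

Because the two weights sum to one at every pixel, the blend is a convex combination: wherever P is high the output follows the global features, and wherever P is low it follows the local detail.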
Theoretical rationale for progressive fusion
The effectiveness of progressive fusion compared to single-stage fusion stems from its ability to systematically reconcile the fundamental disparity between feature hierarchies. Low-level features possess high spatial fidelity crucial for boundary delineation but are inherently noisy and semantically ambiguous. Conversely, high-level features offer robust semantic context but lack fine spatial detail due to downsampling. Directly merging these divergent representations often results in mutual interference: coarse semantics suppress vital local details, while noise contaminates high-level context. Our progressive strategy circumvents this through sequential refinement: the NFAM module first attenuates noise in low-level features. The BLAM module then leverages this cleaner input, fusing it with deep semantics to generate explicit, high-quality boundary and position guidance. Critically, the AIEM module embeds this guidance dynamically and progressively into each corresponding decoder layer during the feature reconstruction process. This allows the network to iteratively assimilate and refine structural cues step-by-step. Finally, the FFM synchronizes the reconstruction of global semantics with the layer-wise infusion of these progressively enhanced local signals. This orchestrated sequence enables controlled transformation and harmonious integration, preserving essential spatial details for precise segmentation—a key advantage over the blunt approach of single-stage fusion, particularly in demanding endoscopic imaging scenarios.
Loss function
Our loss function L is calculated between the final segmentation result Prediction and Ground truth, and can be expressed as:
$$L = L_{IoU}^{w} + L_{wbce}^{w} \tag{15}$$

where $L_{IoU}^{w}$ and $L_{wbce}^{w}$ are the weighted IoU (intersection over union) loss46 and the weighted binary cross-entropy (BCE) loss46, respectively; they constrain the prediction map at the global-structure (object) level and at the local-detail (pixel) level.
Experiments
Datasets
To rigorously assess the segmentation capabilities of our proposed PFNet framework in endoscopic imaging applications, we developed two original datasets, the Ureter dataset derived from ureteroplasty surgical recordings and the Re-TMRS dataset obtained through renal tumor endoscopic examinations, and additionally used five open-access colonoscopic polyp datasets. Detailed specifications of these biomedical image collections are systematically presented in Table 1. For experimental validation, the Ureter dataset was randomly divided into training (2,292 images) and testing subsets (572 images) for ureteral structure segmentation. Similarly, the Re-TMRS dataset underwent stratified partitioning with 2,258 samples allocated for model training and 565 reserved for renal tumor segmentation evaluation. In alignment with PraNet’s experimental configuration for polyp detection, we employed 900 Kvasir47 and 550 CVC-ClinicDB48 endoscopic images to establish the training cohort. The validation phase utilized the remaining 100 Kvasir and 62 CVC-ClinicDB images to examine the model’s feature learning capacity. Notably, three additional benchmarks (CVC-ColonDB49, ETIS50, and CVC-30051) were strictly excluded from training protocols to objectively evaluate the framework’s cross-domain generalization potential.
Table 1.
Detailed information of the seven datasets used in this work.
| Datasets | Segmentation target | Images | Original size | Input size | Train | Validation | Test |
|---|---|---|---|---|---|---|---|
| Ureter | Ureter | 2864 | N/A | N/A | 2292 | N/A | 572 |
| Re-TMRS | Renal tumor | 2823 | N/A | N/A | 2258 | N/A | 565 |
| Kvasir | Polyp | 1000 | Variable | N/A | 900 | N/A | 100 |
| CVC-ClinicDB | Polyp | 612 | 384 × 288 | N/A | 550 | N/A | 62 |
| CVC-ColonDB | Polyp | 380 | 574 × 500 | N/A | N/A | N/A | 380 |
| ETIS | Polyp | 196 | 1225 × 966 | N/A | N/A | N/A | 196 |
| CVC-300 | Polyp | 60 | 574 × 500 | N/A | N/A | N/A | 60 |
Evaluation metrics
We selected nine evaluation metrics that are widely used to assess medical image segmentation performance: Mean Absolute Error (MAE), S-measure ($S_\alpha$), mean Dice (mDice), mean Intersection over Union (mIoU), Enhanced-alignment measure ($E_\phi$), Weighted F-measure ($F_\beta^{w}$), Balanced error rate (Ber), Classification accuracy (Acc), and Hausdorff distance (HD):

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|p_i - g_i\right| \tag{16}$$

$$S_\alpha = \alpha\,S_o + (1-\alpha)\,S_r \tag{17}$$

$$\mathrm{mDice} = \frac{2\,TP}{2\,TP + FP + FN} \tag{18}$$

$$\mathrm{mIoU} = \frac{TP}{TP + FP + FN} \tag{19}$$

$$E_\phi = \frac{1}{N}\sum_{i=1}^{N}\phi\left(p_i, g_i\right) \tag{20}$$

$$F_\beta^{w} = \frac{(1+\beta^{2})\,\mathrm{Precision}^{w}\cdot\mathrm{Recall}^{w}}{\beta^{2}\,\mathrm{Precision}^{w} + \mathrm{Recall}^{w}} \tag{21}$$

$$\mathrm{Ber} = 1 - \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) \tag{22}$$

$$\mathrm{Acc} = \frac{TP+TN}{TP+TN+FP+FN} \tag{23}$$

$$\mathrm{HD}(A,B) = \max\big(h(A,B),\, h(B,A)\big) \tag{24}$$

where $N$ is the total number of pixels, $p_i$ is the predicted value of the $i$-th pixel, $g_i$ is the true value of the $i$-th pixel, $S_o$ measures object-level similarity, $S_r$ measures region-level similarity, $\alpha$ is a weight coefficient (set to 0.5), and $\phi$ is a pixel-level alignment function. Precision and Recall are associated with four values, i.e., true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN); $\mathrm{Precision}^{w}$ and $\mathrm{Recall}^{w}$ are the weighted Precision and Recall, and we set $\beta$ to 1. $A$ and $B$ are two regions, $\partial A$ and $\partial B$ are their boundary curves, and $h(A,B)$ and $h(B,A)$ are the distance functions between the two curves, defined as:

$$h(A,B) = \max_{a\in\partial A}\,\min_{b\in\partial B}\,\lVert a-b\rVert \tag{25}$$
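Several of these metrics are straightforward to compute directly from binary masks and boundary point sets. The snippet below sketches Dice, IoU, and the symmetric Hausdorff distance exactly as defined above (helper names are ours):

```python
import numpy as np

def dice_iou(pred, gt):
    """Binary Dice and IoU from overlap counts, matching the standard
    TP/FP/FN definitions above."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2 * inter / (pred.sum() + gt.sum())
    iou = inter / np.logical_or(pred, gt).sum()
    return dice, iou

def hausdorff(a_pts, b_pts):
    """Symmetric Hausdorff distance between two boundary point sets:
    HD = max(h(A,B), h(B,A)) with h(A,B) = max_a min_b ||a - b||."""
    d = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

pred = np.zeros((8, 8), int); pred[2:6, 2:6] = 1   # 16-pixel square
gt = np.zeros((8, 8), int);   gt[2:6, 3:7] = 1     # shifted one column
dice, iou = dice_iou(pred, gt)                     # -> 0.75, 0.6
```

On these toy masks the overlap is 12 pixels, giving Dice = 24/32 = 0.75 and IoU = 12/20 = 0.6, which is a quick sanity check for any metric implementation.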
Implementation details
The proposed PFNet is implemented with PyTorch 2.0.0 and Python 3.8.0. All models are trained on an NVIDIA A800 GPU with 80 GB of memory. The AdamW optimizer52 is used with a learning rate of
and a weight decay of
. We adopt a unified training strategy, first adjusting the size of the input images to
, and then employing a multi-scale strategy of [0.75, 1.0, 1.25] to enable the network to process organs of different sizes. In addition, the batch size of the network is set to 16, and the maximum number of training epochs is set to 100.
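The multi-scale strategy above can be sketched as follows: each training batch is resized by a factor drawn from {0.75, 1.0, 1.25} before the forward pass. The base resolution of 352 and the divisible-by-32 rounding are our assumptions (the resized input size is not recoverable from this copy of the text), so treat them as illustrative.

```python
import random
import torch
import torch.nn.functional as F

def multiscale_batch(images, masks, base=352, scales=(0.75, 1.0, 1.25)):
    """One multi-scale training step: resize the whole batch to base*s for
    a randomly drawn scale s, so the network sees organs at several sizes.
    base=352 is an assumption, not a value confirmed by the paper."""
    s = random.choice(scales)
    size = int(round(base * s / 32) * 32)   # keep sizes divisible by 32
    images = F.interpolate(images, size=(size, size), mode='bilinear',
                           align_corners=False)
    masks = F.interpolate(masks, size=(size, size), mode='nearest')
    return images, masks

imgs, msks = multiscale_batch(torch.randn(2, 3, 352, 352),
                              torch.rand(2, 1, 352, 352))
```

Nearest-neighbour interpolation is used for the masks so that label values stay binary after resizing.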
Experimental results and analysis
In this section, we conducted experiments on seven datasets: Ureter, Re-TMRS, Kvasir, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300, to verify the effectiveness of our network and compare it with previous state-of-the-art (SOTA) methods, including U-Net3, PraNet21, TransUNet11, SwinUnet53, TGDAUNet54, DA-TransUNet55, MSA2Net56, NPD-Net57, CIFG-Net58, and VMUNet59. As shown in Tables 2, 3, and 4, the PFNet network outperformed all previous SOTA methods on the kidney and ureter datasets, and also achieved better performance in terms of mDice, mIoU, and HD evaluation metrics on the four publicly available polyp datasets: Kvasir, CVC-ColonDB, CVC-ClinicDB, and CVC-300. Although our network did not surpass previous SOTA methods on the ETIS dataset, the gap is small. Overall, our PFNet network performs better than existing endoscopic image segmentation networks on the 7 endoscopic image datasets. Below, we provide quantitative and qualitative comparisons for each dataset to demonstrate the performance of our method.
Table 2.
Quantitative results and comparison of segmentation on the Re-TMRS dataset.
| Dataset | Model | Year | MAE ↓ | $F_\beta^w$ ↑ | $S_\alpha$ ↑ | $E_\phi$ ↑ | mDice ↑ | mIoU ↑ | Ber ↓ | Acc ↑ | HD ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Re-TMRS | UNet | 2015 | 0.0156 | 0.8402 | 0.8927 | 0.9553 | 0.8586 | 0.7868 | 0.0806 | 0.9845 | 11.3301 |
| PraNet | 2020 | 0.0143 | 0.8610 | 0.9158 | 0.9677 | 0.8857 | 0.8186 | 0.0583 | 0.9863 | 11.0000 | |
| TransUNet | 2021 | 0.0156 | 0.8423 | 0.8884 | 0.9565 | 0.8629 | 0.7818 | 0.0758 | 0.9844 | 11.2312 | |
| SwinUNet | 2022 | 0.0204 | 0.7754 | 0.8540 | 0.9434 | 0.8169 | 0.7209 | 0.0828 | 0.9796 | 12.3411 | |
| TGDAUNet | 2023 | 0.0157 | 0.8444 | 0.8998 | 0.9611 | 0.8717 | 0.7992 | 0.0606 | 0.9843 | 11.2302 | |
| DA-TransUNet | 2023 | 0.0156 | 0.8430 | 0.8920 | 0.9546 | 0.8665 | 0.7893 | 0.0683 | 0.9844 | 11.2923 | |
| MSA2Net | 2023 | 0.0122 | 0.8764 | 0.9201 | 0.9725 | 0.8996 | 0.8357 | 0.0481 | 0.9880 | 10.8512 | |
| NPD-Net | 2024 | 0.0120 | 0.8824 | 0.9221 | 0.9745 | 0.9054 | 0.8431 | 0.0442 | 0.9882 | 10.6721 | |
| CIFG-Net | 2024 | 0.0154 | 0.8405 | 0.9232 | 0.9420 | 0.8792 | 0.8293 | 0.0395 | 0.9871 | 10.7822 | |
| VMUNet | 2024 | 0.0178 | 0.8263 | 0.8888 | 0.9601 | 0.8620 | 0.7788 | 0.0621 | 0.9823 | 11.6902 | |
| PFNet(ours) | 2025 | 0.0115 | 0.8881 | 0.9249 | 0.9715 | 0.9107 | 0.8490 | 0.0416 | 0.9886 | 10.4501 |
↑ indicates that higher is better, ↓ indicates that lower is better.
Bold represents the best results, italic represents the second best results.
Table 3.
Quantitative results and comparison of segmentation on the Ureter dataset.
| Dataset | Model | Year | MAE ↓ | $F_\beta^w$ ↑ | $S_\alpha$ ↑ | $E_\phi$ ↑ | mDice ↑ | mIoU ↑ | Ber ↓ | Acc ↑ | HD ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ureter | UNet | 2015 | 0.0087 | 0.8588 | 0.8926 | 0.9710 | 0.8792 | 0.8029 | 0.0637 | 0.9913 | 9.7301 |
| PraNet | 2020 | 0.0077 | 0.8746 | 0.9046 | 0.9769 | 0.8964 | 0.8229 | 0.0495 | 0.9923 | 9.4112 | |
| TransUNet | 2021 | 0.0099 | 0.8403 | 0.8767 | 0.9684 | 0.8615 | 0.7731 | 0.0746 | 0.9901 | 10.0821 | |
| SwinUNet | 2022 | 0.0114 | 0.8154 | 0.8584 | 0.9630 | 0.8399 | 0.7402 | 0.0881 | 0.9886 | 10.7012 | |
| TGDAUNet | 2023 | 0.0098 | 0.8368 | 0.8824 | 0.9620 | 0.8635 | 0.7805 | 0.0643 | 0.9902 | 9.9421 | |
| DA-TransUNet | 2023 | 0.0090 | 0.8512 | 0.8839 | 0.9657 | 0.8719 | 0.7904 | 0.0652 | 0.9910 | 9.8622 | |
| MSA2Net | 2023 | 0.0080 | 0.8749 | 0.9052 | 0.9774 | 0.8953 | 0.8225 | 0.0535 | 0.9921 | 9.3931 | |
| NPD-Net | 2024 | 0.0077 | 0.8772 | 0.9080 | 0.9778 | 0.9005 | 0.8291 | 0.0454 | 0.9924 | 9.2500 | |
| CIFG-Net | 2024 | 0.0111 | 0.8133 | 0.9016 | 0.9217 | 0.8575 | 0.8072 | 0.0445 | 0.9912 | 9.6521 | |
| VMUNet | 2024 | 0.0107 | 0.8310 | 0.8811 | 0.9692 | 0.8614 | 0.7725 | 0.0664 | 0.9894 | 10.4511 | |
| PFNet(ours) | 2025 | 0.0073 | 0.8859 | 0.9115 | 0.9815 | 0.9069 | 0.8379 | 0.0436 | 0.9928 | 9.1523 |
↑ indicates that higher is better, ↓ indicates that lower is better.
Bold represents the best results, italic represents the second best results.
Table 4.
Quantitative results and comparison of segmentation on the Kvasir dataset.
| Dataset | Model | Year | mDice ↑ | mIoU ↑ | HD ↓ |
|---|---|---|---|---|---|
| Kvasir | UNet | 2015 | 0.8111 | 0.7293 | 9.0212 |
| PraNet | 2020 | 0.9004 | 0.8475 | 6.7811 | |
| TransUNet | 2021 | 0.8914 | 0.8325 | 6.7421 | |
| SwinUNet | 2022 | 0.8691 | 0.7961 | 7.4811 | |
| TGDAUNet | 2023 | 0.9207 | 0.8699 | 6.4921 | |
| DA-TransUNet | 2023 | 0.8447 | 0.7731 | 7.2023 | |
| MSA2Net | 2023 | 0.9122 | 0.8600 | 6.5500 | |
| NPD-Net | 2024 | 0.9064 | 0.8530 | 6.5012 | |
| CIFG-Net | 2024 | 0.9080 | 0.8720 | 6.4721 | |
| VMUNet | 2024 | 0.8620 | 0.7952 | 7.4511 | |
| PFNet(ours) | 2025 | 0.9234 | 0.8732 | 6.4021 |
↑ indicates that higher is better, ↓ indicates that lower is better.
Bold represents the best results, italic represents the second best results.
Quantitative Analysis. As shown in Table 2, our network is superior to previous SOTA methods on nearly all of the adopted evaluation metrics. Specifically, compared with the classic U-Net (traditional convolutional backbone), TransUNet (Transformer backbone), and VMUNet (Mamba backbone), our network improves mDice by 5.21%, 4.78%, and 4.87% and mIoU by 6.22%, 6.72%, and 7.02%, respectively, and reduces HD by 0.8800, 0.7811, and 1.2401. Furthermore, compared with the strongest prior network, NPD-Net, our network still improves mDice and mIoU by 0.53% and 0.59%, and it achieves at least a 1% improvement over the other recent networks. We can therefore conclude that the performance of PFNet on the kidney tumor dataset is superior to all current SOTA methods. As shown in Table 3, our network likewise improves on the current SOTA methods across the adopted evaluation metrics on the Ureter dataset: Acc reaches 99.28%, HD is reduced to 9.1523, and mDice and mIoU improve by 0.64% and 0.88% over NPD-Net. Hence, the performance of PFNet on the Ureter dataset is also superior to all current SOTA methods. The mDice results on the Ureter and Re-TMRS datasets are shown in Fig. 6. As shown in Tables 4, 5, 6 and 7, our network also outperforms all previous SOTA methods on the Kvasir, CVC-ColonDB, CVC-ClinicDB, and CVC-300 datasets. For the mIoU metric, our network improves by 0.33% and 0.74% on the Kvasir and CVC-ColonDB datasets over the second-best network; on CVC-ClinicDB, mDice reaches 93.09%, an increase of 0.47% over the second-best NPD-Net. As shown in Table 8, a certain gap remains on the ETIS dataset, but it does not affect the overall performance of our network.
Fig. 6.
Ureter and Re-TMRS experimental results of mDice.
Table 5.
Quantitative results and comparison of segmentation on the CVC-300 dataset.
| Dataset | Model | Year | mDice ↑ | mIoU ↑ | HD ↓ |
|---|---|---|---|---|---|
| CVC-300 | UNet | 2015 | 0.7653 | 0.6816 | 6.0421 |
| PraNet | 2020 | 0.8727 | 0.8039 | 5.0722 | |
| TransUNet | 2021 | 0.8504 | 0.7601 | 5.2711 | |
| SwinUNet | 2022 | 0.8276 | 0.7333 | 5.7721 | |
| TGDAUNet | 2023 | 0.8730 | 0.8084 | 5.1929 | |
| DA-TransUNet | 2023 | 0.8277 | 0.7619 | 5.0708 | |
| MSA2Net | 2023 | 0.8857 | 0.8182 | 5.0531 | |
| NPD-Net | 2024 | 0.8708 | 0.8068 | 5.1112 | |
| CIFG-Net | 2024 | 0.8470 | 0.7895 | 5.0714 | |
| VMUNet | 2024 | 0.8488 | 0.7706 | 5.3621 | |
| PFNet(ours) | 2025 | 0.8868 | 0.8196 | 4.9909 |
↑ indicates that higher is better, ↓ indicates that lower is better.
Bold represents the best results, italic represents the second best results.
Table 6.
Quantitative results and comparison of segmentation on the CVC-ColonDB dataset.
| Dataset | Model | Year | mDice ↑ | mIoU ↑ | HD ↓ |
|---|---|---|---|---|---|
| CVC-ColonDB | UNet | 2015 | 0.6228 | 0.5265 | 8.4420 |
| PraNet | 2020 | 0.7180 | 0.6466 | 6.8322 | |
| TransUNet | 2021 | 0.7535 | 0.6682 | 6.8114 | |
| SwinUNet | 2022 | 0.6896 | 0.5943 | 7.6532 | |
| TGDAUNet | 2023 | 0.7743 | 0.7034 | 6.3321 | |
| DA-TransUNet | 2023 | 0.6818 | 0.5926 | 7.2626 | |
| MSA2Net | 2023 | 0.7486 | 0.6691 | 6.8012 | |
| NPD-Net | 2024 | 0.8165 | 0.7353 | 6.3422 | |
| CIFG-Net | 2024 | 0.8047 | 0.7317 | 6.3720 | |
| VMUNet | 2024 | 0.7093 | 0.6239 | 7.3610 | |
| PFNet(ours) | 2025 | 0.8169 | 0.7391 | 6.2720 |
↑ indicates that higher is better, ↓ indicates that lower is better.
Bold represents the best results, italic represents the second best results.
Table 7.
Quantitative results and comparison of segmentation on the CVC-ClinicDB dataset.
| Dataset | Model | Year | mDice ↑ | mIoU ↑ | HD ↓ |
|---|---|---|---|---|---|
| CVC-ClinicDB | UNet | 2015 | 0.8640 | 0.7894 | 5.1221 |
| PraNet | 2020 | 0.9016 | 0.8576 | 4.0721 | |
| TransUNet | 2021 | 0.8652 | 0.8079 | 4.4523 | |
| SwinUNet | 2022 | 0.8416 | 0.7679 | 4.7912 | |
| TGDAUNet | 2023 | 0.9213 | 0.8700 | 4.0332 | |
| DA-TransUNet | 2023 | 0.8620 | 0.7988 | 4.5431 | |
| MSA2Net | 2023 | 0.9219 | 0.8731 | 4.0731 | |
| NPD-Net | 2024 | 0.9262 | 0.8802 | 3.9411 | |
| CIFG-Net | 2024 | 0.9153 | 0.8817 | 3.7814 | |
| VMUNet | 2024 | 0.8549 | 0.7859 | 4.7823 | |
| PFNet(ours) | 2025 | 0.9309 | 0.8826 | 3.7617 |
↑ indicates that higher is better, ↓ indicates that lower is better.
Bold represents the best results, italic represents the second best results.
Table 8.
Quantitative results and comparison of segmentation on the ETIS dataset.
| Dataset | Model | Year | mDice ↑ | mIoU ↑ | HD ↓ |
|---|---|---|---|---|---|
| ETIS | UNet | 2015 | 0.5227 | 0.4247 | 11.9123 |
| PraNet | 2020 | 0.6300 | 0.5757 | 9.1512 | |
| TransUNet | 2021 | 0.6578 | 0.5760 | 9.3332 | |
| SwinUNet | 2022 | 0.5683 | 0.4798 | 10.4542 | |
| TGDAUNet | 2023 | 0.7600 | 0.6789 | 8.6612 | |
| DA-TransUNet | 2023 | 0.6360 | 0.5579 | 8.7812 | |
| MSA2Net | 2023 | 0.6692 | 0.6006 | 8.4927 | |
| NPD-Net | 2024 | 0.7806 | 0.7025 | 8.4613 | |
| CIFG-Net | 2024 | 0.7345 | 0.6806 | 8.5719 | |
| VMUNet | 2024 | 0.5585 | 0.4851 | 10.6006 | |
| PFNet(ours) | 2025 | 0.7727 | 0.6938 | 8.4027 |
↑ indicates that higher is better, ↓ indicates that lower is better.
Bold represents the best results, italic represents the second best results.
Qualitative Analysis. To observe the segmentation quality of our network on endoscopic images more intuitively, we visualize the segmentation results of our network and the ten previous methods in Fig. 7. The first four columns show results on the Ureter dataset, and the last four columns show results on the kidney tumor dataset. From the figure, we can clearly observe that the previous SOTA methods produce false positives and missed detections when segmenting endoscopic images, and the boundaries of their segmented targets are slightly blurred, whereas our PFNet produces more accurate results with fewer such errors. Specifically, in the first column, U-Net, PraNet, TGDAUNet, and MSA2Net segment the correct areas but also mistake internal tissues that do not belong to the ureter for ureter regions. Other networks, such as TransUNet, make both false positives and missed detections, failing to segment areas that belong to the ureter. Our PFNet correctly segments the ureter region while excluding light reflections, fat, and surgical instruments, without false positives, demonstrating stronger robustness. From the sixth and last columns, we can clearly see that the previous ten SOTA methods can hardly segment the kidney tumor target regions accurately and completely: when the target regions are large, they tend to miss parts of them, whereas our PFNet still segments the kidney tumors accurately even in those cases. Overall, PFNet segments tissues and organs of different sizes and shapes (such as ureters and kidney tumors) more precisely, with clearer segmentation boundaries.
Fig. 7.
Visualization results of the segmentation of the Ureter and Re-TMRS dataset.
Analysis of parameter count, computational complexity, and inference time. As shown in Table 9, while our model achieves the best segmentation performance, its parameter count, computational complexity, and inference time remain competitive. Although PFNet is not the best on these three measures, it meets the requirements for practical application in medical image segmentation.
Table 9.
The results of the parameter quantity, computational complexity and inference time of the model adopted in the experiment.
| Model | Params(M) | FLOPs(G) | Time(ms) |
|---|---|---|---|
| UNet | 7.81 | 14.59 | 16.88 |
| PraNet | 32.52 | 13.18 | 72.35 |
| TransUNet | 105.13 | 24.57 | 124.87 |
| SwinUNet | 27.15 | 5.86 | 42.57 |
| TGDAUNet | 62.51 | 21.13 | 75.87 |
| DA-TransUNet | 112.29 | 23.12 | 89.78 |
| MSA2Net | 28.56 | 19.92 | 61.12 |
| NPD-Net | 25.57 | 14.55 | 72.16 |
| CIFG-Net | 24.29 | 16.79 | 71.98 |
| VMUNet | 44.27 | 7.56 | 32.43 |
| PFNet(ours) | 25.59 | 15.32 | 45.12 |
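The parameter counts and timings reported in Table 9 can be reproduced in spirit with a small PyTorch helper. This is a generic sketch: the toy network below is a stand-in, not PFNet, and FLOPs counting would additionally need a profiling tool:

```python
import time
import torch
import torch.nn as nn

def profile(model, input_size=(1, 3, 64, 64), warmup=2, runs=5):
    """Return (parameter count in millions, average forward time in ms) on CPU."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(input_size)
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm up allocators/caches before timing
            model(x)
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
        ms = (time.perf_counter() - t0) / runs * 1e3
    return params_m, ms

# Toy stand-in network, for demonstration only.
toy = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
```

Replacing `toy` with any of the compared models (at the paper's input resolution) would yield the Params(M) and Time(ms) columns of Table 9.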
Ablation study
In our PFNet network, we introduce four new modules (NFAM, BLAM, AIEM, and FFM) to improve the segmentation performance on endoscopic images. To verify the impact and contribution of each module to the final segmentation performance, we performed an ablation study on the segmentation tasks of the Re-TMRS (kidney tumor) and Ureter datasets. Specifically, we removed the NFAM entirely, and replaced each of the BLAM, AIEM, and FFM with simple convolutions that adjust the channel numbers followed by simple addition operations. Table 10 presents the quantitative results of these ablation experiments, denoted "w/o NFAM", "w/o BLAM", "w/o AIEM", and "w/o FFM". From Table 10, we observe that in the "w/o NFAM" case, the performance metrics on both the kidney tumor and Ureter datasets decrease, and the visualization results in Fig. 8 show over-segmentation and unsegmented regions; we therefore conclude that NFAM effectively filters out irrelevant noise and extracts the correct semantic information. In the "w/o BLAM" case, mDice and mIoU drop and HD rises, and the segmented results contain unclear and inaccurate areas: simply adding high-level features to low-level features before the subsequent progressive fusion does not yield good segmentation. This demonstrates that BLAM effectively fuses high-level and low-level feature layers to generate a high-quality boundary and position information layer. In the "w/o AIEM" and "w/o FFM" cases, mDice and mIoU decrease markedly and HD increases; in particular, for "w/o AIEM", the mDice on the kidney tumor dataset drops by more than 1%. Fig. 8 also shows clearly that "w/o AIEM" and "w/o FFM" lead to obvious over- or under-segmentation, with unclear and discontinuous boundary regions. We therefore conclude that our progressive fusion method, combined with the designed AIEM and FFM, segments the organs and tissues in endoscopic images more accurately and with clearer boundaries. Combining the qualitative and quantitative results, each module is essential and increases the accuracy and robustness of the model.
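The ablation protocol above, replacing a learned module with a channel-matching convolution plus simple addition, can be pictured with a minimal NumPy sketch. A 1×1 kernel is our assumption for the channel-matching convolution (it is just a per-pixel linear map over channels), and the shapes and weights are illustrative only:

```python
import numpy as np

def conv1x1(x, w):
    """Per-pixel linear map over channels: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def fuse_ablated(high, low, w):
    """'w/o module' variant: match the high-level feature's channels, then add."""
    return conv1x1(high, w) + low
```

In the full model, `fuse_ablated` would be replaced by the learned BLAM/AIEM/FFM fusion; the ablation keeps only this trivial channel projection and addition.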
Table 10.
Quantitative results of the ablation experiments.
| Datasets | Metric | w/o NFAM | w/o AIEM | w/o BLAM | w/o FFM | PFNet(ours) |
|---|---|---|---|---|---|---|
| Re-TMRS | mDice ↑ | 0.9079 | 0.9009 | 0.9092 | 0.9013 | 0.9107 |
| | mIoU ↑ | 0.8404 | 0.8350 | 0.8462 | 0.8413 | 0.8490 |
| | HD ↓ | 10.5121 | 10.6425 | 10.6022 | 10.5532 | 10.4516 |
| Ureter | mDice ↑ | 0.9028 | 0.9004 | 0.9045 | 0.9038 | 0.9069 |
| | mIoU ↑ | 0.8323 | 0.8236 | 0.8341 | 0.8342 | 0.8379 |
| | HD ↓ | 9.3232 | 9.2122 | 9.1939 | 9.2223 | 9.1504 |
↑ indicates that higher is better, ↓ indicates that lower is better.
Bold represents the best results.
Fig. 8.
Visualization results of the ablation experiment for the Ureter and Re-TMRS dataset.
Attention ablation experiment
To verify the influence of the attention mechanisms on the experimental results, we conducted an attention ablation experiment, where "w/o NFAM-att", "w/o AIEM-att", and "w/o FFM-att" denote removing the attention from the corresponding module. Taking the Ureter dataset as an example, we report the mDice value for each variant, together with the overall parameter count and computational complexity. As shown in Table 11, the mDice values of all three variants decrease, while the savings in parameters and computation are only modest. We therefore conclude that adopting an attention mechanism in each module is a very effective trade-off.
Table 11.
Results of the attention ablation experiment (Ureter dataset).
| Model | mDice | FLOPs(G) | Params(M) |
|---|---|---|---|
| PFNet(ours) | 0.9069 | 15.32 | 25.59 |
| w/o NFAM-att | 0.9035 | 12.93 | 23.43 |
| w/o AIEM-att | 0.9018 | 13.45 | 24.12 |
| w/o FFM-att | 0.9049 | 13.55 | 24.19 |
Hyperparameter sensitivity study
We analyzed the fusion weights in the FFM (denoted here as $w_1$, $w_2$, and $w_3$, one per input feature layer) and the number of attention heads $h$ in the NFAM; the results are shown in Tables 12 and 13. As the results in Table 12 show, the FFM achieves the best result when the three input feature layers are fused with equal weights, whereas raising the weight of any single feature layer degrades the segmentation performance. As the results in Table 13 show, increasing the number of attention heads improves segmentation performance up to a point; once the number of heads exceeds 4, overfitting sets in and performance declines. We therefore adopt 4 attention heads in the NFAM.
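The fusion-weight experiment can be pictured with a small sketch. The normalization of the weights is our assumption (the experiment only varies their relative values, as in Table 12), and the feature maps are illustrative:

```python
import numpy as np

def ffm_fuse(f1, f2, f3, w=(1.0, 1.0, 1.0)):
    """Weighted sum of three same-shape feature maps; weights are normalized
    so that only their relative values matter."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return w[0] * f1 + w[1] * f2 + w[2] * f3
```

With equal weights the fusion reduces to the mean of the three feature maps; doubling any single weight biases the output toward that layer, which is the configuration varied in Table 12.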
Table 12.
Experimental results of fusion weights in FFM module.
| Datasets | $w_1$ | $w_2$ | $w_3$ | mDice |
|---|---|---|---|---|
| Re-TMRS | 1 | 1 | 1 | 0.9107 |
| 2 | 1 | 1 | 0.9052 | |
| 1 | 2 | 1 | 0.9041 | |
| 1 | 1 | 2 | 0.9019 | |
| Ureter | 1 | 1 | 1 | 0.9069 |
| 2 | 1 | 1 | 0.9005 | |
| 1 | 2 | 1 | 0.8978 | |
| 1 | 1 | 2 | 0.8988 |
Bold represents the best results.
Table 13.
Experimental results of the number of attention heads in the NFAM module.
| Datasets | h | mDice |
|---|---|---|
| Re-TMRS | 1 | 0.9081 |
| 2 | 0.9090 | |
| 3 | 0.9091 | |
| 4 | 0.9107 | |
| 5 | 0.9093 | |
| 6 | 0.8989 | |
| Ureter | 1 | 0.9028 |
| 2 | 0.9034 | |
| 3 | 0.9049 | |
| 4 | 0.9069 | |
| 5 | 0.9060 | |
| 6 | 0.8891 |
Bold represents the best results.
Conclusion
In this paper, we propose PFNet, an endoscopic image segmentation network based on progressive feature fusion, which addresses the limitations of existing methods in global semantic modeling and local detail preservation through a multi-module collaboration mechanism. The NFAM enhances the semantic representation of features through noise filtering and a cross-layer attention mechanism; the BLAM integrates multi-scale features to generate high-precision boundary and position cues; the AIEM dynamically embeds boundary and position information into the decoding process to improve context awareness; and the FFM realizes complementary optimization of global and local information through a progressive fusion strategy. Experiments show that PFNet significantly outperforms ten SOTA methods on multiple endoscopic datasets, especially when dealing with small targets, fuzzy boundaries, and light interference. However, PFNet still has limitations on extremely low-contrast images, and its inference speed can be further optimized. Future work will explore lightweight designs to improve real-time performance and extend the method to 3D medical image segmentation tasks. This study provides an efficient tool for endoscopy-assisted diagnosis, and the code and models have been open-sourced to promote the development of the field.
Author contributions
All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by L.F. and Y.C. The first draft of the manuscript was written by L.F., and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Data availability
The datasets generated and/or analysed during the current study are not publicly available because the data are owned by a third party and the authors do not have permission to share them; they are, however, available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Azad, R. et al. Medical image segmentation review: The success of u-net. IEEE Trans. Pattern Anal. Mach. Intell. (2024). [DOI] [PubMed]
- 2.Jia, X., Xing, X., Yuan, Y. & Meng, M.Q.-H. Deep learning-driven models for endoscopic image analysis. Adv. Artif. Intell. Comput. Data Sci. Med. Life Sci.271–300, (2021).
- 3.Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18 (eds Ronneberger, O. et al.) 234–241 (Springer, 2015). [Google Scholar]
- 4.Xiao, X., Lian, S., Luo, Z. & Li, S. Weighted res-unet for high-quality retina vessel segmentation. In 2018 9th International Conference on Information Technology in Medicine and Education (ITME) (eds Xiao, X. et al.) 327–331 (IEEE, 2018). [Google Scholar]
- 5.Jha, D. et al. A comprehensive study on colorectal polyp segmentation with resunet++, conditional random field and test-time augmentation. IEEE J. Biomed. Health Inform.25, 2029–2040 (2021). [DOI] [PubMed] [Google Scholar]
- 6.Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N. & Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support: 4th international workshop, DLMIA 2018, and 8th international workshop, ML-CDS 2018, held in conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, proceedings 4, 3–11 (Springer, 2018). [DOI] [PMC free article] [PubMed]
- 7.Oktay, O. et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018).
- 8.Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems30 (2017).
- 9.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
- 10.Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022 (2021).
- 11.Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
- 12.Wang, H., Cao, P., Wang, J. & Zaiane, O. R. Uctransnet: rethinking the skip connections in u-net from a channel-wise perspective with transformer. Proc. AAAI Conf. Artif. Intell.36, 2441–2449 (2022). [Google Scholar]
- 13.Lin, A. et al. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Trans. Instrum. Meas.71, 1–15 (2022). [Google Scholar]
- 14.Wang, W. et al. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media8, 415–424 (2022). [Google Scholar]
- 15.Wang, R. et al. Medical image segmentation using deep learning: A survey. IET Image Proc.16, 1243–1267 (2022). [Google Scholar]
- 16.Sekou, T. B., Hidane, M., Olivier, J. & Cardot, H. From patch to image segmentation using fully convolutional networks–application to retinal images. arXiv preprint arXiv:1904.03892 (2019).
- 17.Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. In Proc. IEEE conference on computer vision and pattern recognition, 3431–3440 (2015). [DOI] [PubMed]
- 18.Fang, Y., Chen, C., Yuan, Y. & Tong, K.-Y. Selective feature aggregation network with area-boundary constraints for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention-MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part I 22 (eds Fang, Y. et al.) 302–310 (Springer, 2019). [Google Scholar]
- 19.Li, X., Wang, W., Hu, X. & Yang, J. Selective kernel networks. In Proc. IEEE/CVF conference on computer vision and pattern recognition, 510–519 (2019).
- 20.Zhang, R. et al. Adaptive context selection for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention-MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part VI 23 (eds Zhang, R. et al.) 253–262 (Springer, 2020). [Google Scholar]
- 21.Fan, D.-P. et al. Pranet: Parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (eds Fan, D.-P. et al.) 263–273 (Springer, 2020). [Google Scholar]
- 22.Huang, C.-H., Wu, H.-Y. & Lin, Y.-L. Hardnet-mseg: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps. arXiv preprint arXiv:2101.07172 (2021).
- 23.Yin, Z., Liang, K., Ma, Z. & Guo, J. Duplex contextual relation network for polyp segmentation. In 2022 IEEE 19th international symposium on biomedical imaging (ISBI), 1–5 (IEEE, 2022).
- 24.Patel, K., Bur, A. M. & Wang, G. Enhanced u-net: A feature enhancement network for polyp segmentation. In 2021 18th conference on robots and vision (CRV), 181–188 (IEEE, 2021). [DOI] [PMC free article] [PubMed]
- 25.Wei, J. et al. Shallow attention network for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention-MICCAI 2021: 24th International Conference, Strasbourg, France, September 27-October 1, 2021, Proceedings, Part I 24 (eds Wei, J. et al.) 699–708 (Springer, 2021). [Google Scholar]
- 26.Zheng, S. et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 6881–6890 (2021).
- 27.Wang, W. et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proc. IEEE/CVF international conference on computer vision, 568–578 (2021).
- 28.Pan, Z., Zhuang, B., Liu, J., He, H. & Cai, J. Scalable vision transformers with hierarchical pooling. In: Proc. IEEE/cvf international conference on computer vision, 377–386 (2021).
- 29.Dong, X. et al. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proc. IEEE/CVF conference on computer vision and pattern recognition, 12124–12134 (2022).
- 30.Tu, Z. et al. Maxvit: Multi-axis vision transformer. In European Conference on Computer Vision (ed. Tu, Z.) 459–479 (Springer, 2022). [Google Scholar]
- 31.Rao, Y. et al. Hornet: Efficient high-order spatial interactions with recursive gated convolutions. Adv. Neural. Inf. Process. Syst.35, 10353–10366 (2022). [Google Scholar]
- 32.Wu, Y.-H., Liu, Y., Zhan, X. & Cheng, M.-M. P2t: Pyramid pooling transformer for scene understanding. IEEE Trans. Pattern Anal. Mach. Intell.45, 12760–12771 (2022). [DOI] [PubMed] [Google Scholar]
- 33.Zhang, Y., Liu, H. & Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical image computing and computer assisted intervention–MICCAI 2021: 24th international conference, Strasbourg, France, September 27–October 1, 2021, proceedings, Part I 24, 14–24 (Springer, 2021).
- 34.Dong, B. et al. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932 (2021).
- 35.Rahman, M. M. & Marculescu, R. Medical image segmentation via cascaded attention decoding. In: Proc. IEEE/CVF winter conference on applications of computer vision, 6222–6231 (2023).
- 36.Zhang, W. et al. Hsnet: A hybrid semantic network for polyp segmentation. Comput. Biol. Med.150, 106173 (2022). [DOI] [PubMed] [Google Scholar]
- 37.Zhao, H., Shi, J., Qi, X., Wang, X. & Jia, J. Pyramid scene parsing network. In: Proc. IEEE conference on computer vision and pattern recognition, 2881–2890 (2017).
- 38.Zhu, Z., Xu, M., Bai, S., Huang, T. & Bai, X. Asymmetric non-local neural networks for semantic segmentation. In: Proc. IEEE/CVF international conference on computer vision, 593–602 (2019).
- 39.Zhang, P., Wang, D., Lu, H., Wang, H. & Ruan, X. Amulet: Aggregating multi-level convolutional features for salient object detection. In: Proc. IEEE international conference on computer vision, 202–211 (2017).
- 40.Zhao, J.-X. et al. Egnet: Edge guidance network for salient object detection. In: Proc. IEEE/CVF international conference on computer vision, 8779–8788 (2019).
- 41.Wang, Q. et al. Eca-net: Efficient channel attention for deep convolutional neural networks. In: Proc. IEEE/CVF conference on computer vision and pattern recognition, 11534–11542 (2020).
- 42.Wang, H., Fan, Y., Wang, Z., Jiao, L. & Schiele, B. Parameter-free spatial attention network for person re-identification. arXiv preprint arXiv:1811.12150 (2018).
- 43.Ruan, J., Xie, M., Gao, J., Liu, T. & Fu, Y. Ege-unet: an efficient group enhanced unet for skin lesion segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (eds Ruan, J. et al.) 481–490 (Springer, 2023). [Google Scholar]
- 44.Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In: Proc. European conference on computer vision (ECCV), 3–19 (2018).
- 45.Qin, X., Wang, Z., Bai, Y., Xie, X. & Jia, H. Ffa-net: Feature fusion attention network for single image dehazing. Proc.AAAI Conf. Artif. Intell.34, 11908–11915 (2020). [Google Scholar]
- 46.Wei, J., Wang, S. & Huang, Q. F3Net: Fusion, feedback and focus for salient object detection. Proc. AAAI Conf. Artif. Intell.34, 12321–12328 (2020). [Google Scholar]
- 47.Jha, D. et al. Kvasir-seg: A segmented polyp dataset. In MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26 (ed. Jha, D.) 451–462 (Springer, 2020). [Google Scholar]
- 48.Bernal, J. et al. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph.43, 99–111 (2015). [DOI] [PubMed] [Google Scholar]
- 49.Tajbakhsh, N., Gurudu, S. R. & Liang, J. Automated polyp detection in colonoscopy videos using shape and context information. IEEE Trans. Med. Imaging35, 630–644 (2015). [DOI] [PubMed] [Google Scholar]
- 50.Silva, J., Histace, A., Romain, O., Dray, X. & Granado, B. Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. Int. J. Comput. Assist. Radiol. Surg.9, 283–293 (2014). [DOI] [PubMed] [Google Scholar]
- 51.Vázquez, D. et al. A benchmark for endoluminal scene segmentation of colonoscopy images. J. Healthc. Eng.2017, 4037190 (2017). [DOI] [PMC free article] [PubMed]
- 52.Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (2019).
- 53.Cao, H. et al. Swin-unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision (ed. Cao, H.) 205–218 (Springer, 2022). [Google Scholar]
- 54.Song, P., Li, J., Fan, H. & Fan, L. Tgdaunet: Transformer and gcnn based dual-branch attention unet for medical image segmentation. Comput. Biol. Med.167, 107583 (2023). [DOI] [PubMed] [Google Scholar]
- 55.Sun, G. et al. Da-transunet: Integrating spatial and channel dual attention with transformer u-net for medical image segmentation. Front. Bioeng. Biotechnol.12, 1398237 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Kolahi, S. G. et al. MSA2Net: Multi-scale adaptive attention-guided network for medical image segmentation. arXiv preprint arXiv:2407.21640 (2024).
- 57.Yu, Z. et al. A novel non-pretrained deep supervision network for polyp segmentation. Pattern Recogn.154, 110554 (2024). [Google Scholar]
- 58.Li, W., Huang, Z., Li, F., Zhao, Y. & Zhang, H. Cifg-net: Cross-level information fusion and guidance network for polyp segmentation. Comput. Biol. Med.169, 107931 (2024). [DOI] [PubMed] [Google Scholar]
- 59.Ruan, J., Li, J. & Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv preprint arXiv:2402.02491 (2024).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The datasets generated and/or analysed during the current study are not publicly available because the data are owned by a third party and the authors do not have permission to share them; they are, however, available from the corresponding author on reasonable request.