Scientific Reports. 2025 Nov 28;15:42793. doi: 10.1038/s41598-025-27007-8

A progressive fusion network for endoscopic medical image segmentation

Lihong Fu 1,2, Zhengping Li 1,2, Chao Xu 1,2, Yixuan Chen 1,3
PMCID: PMC12663145  PMID: 41315489

Abstract

Endoscopic image segmentation plays a key role in helping doctors accurately locate focal areas and in improving diagnostic efficiency. However, existing methods do not exploit local details and global semantic information simultaneously, which makes it difficult to segment organs and tissues with complex morphology, fuzzy boundaries, and similar textures. We therefore propose a progressive fusion network (PFNet). First, PFNet uses the Transformer-based Pvtv2 as the backbone encoder to capture multi-scale global features. Second, a noise filtering attention module (NFAM) is designed to suppress noise in, and enhance the semantics of, the multilevel features output by the encoder. Third, a boundary and location awareness module (BLAM) is proposed to generate high-quality boundary and position information by blending deep global features with shallow local details. Fourth, an auxiliary information embedding module (AIEM) dynamically embeds the boundary and position information into the features at each level to enhance context awareness during decoding. Finally, the feature fusion module (FFM) supplements the boundary and location information through layer-by-layer iteration, ensuring the collaborative recovery of global semantics and local details. Through extensive experiments, we demonstrate that our proposed PFNet outperforms current state-of-the-art (SOTA) methods in segmentation performance on the Ureter, Re-TMRS, Kvasir, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300 datasets. In particular, the mDice reaches 91.07% on the Re-TMRS dataset and 93.09% on CVC-ClinicDB.

Subject terms: Engineering, Mathematics and computing

Introduction

Medical image segmentation (MIS), as a critical technique within medical image analysis, leverages computer vision to precisely delineate specific regions within medical images. This technology is indispensable for clinical diagnostics by enabling accurate identification of anatomical structures including tissues, pathological regions, and organs, thereby facilitating disease assessment and therapeutic strategy development1. As a prevalent modality in medical imaging, endoscopy provides direct visualization of pathological alterations in anatomical structures. As a standard diagnostic tool, endoscopic systems have been comprehensively integrated into routine practice within general surgery. Furthermore, automated endoscopic image segmentation facilitates precise intraoperative identification of both surgical instruments and pathological regions, thereby enhancing procedural efficacy and optimizing clinical outcomes2.

The U-Net architecture3, leveraging convolutional neural networks (CNN), has emerged as the de facto benchmark framework through its symmetrical encoder-decoder topology. This configuration employs cascaded convolutional layers in both pathways, with lateral skip connections enabling multi-scale semantic fusion across shallow and deep feature hierarchies. The inherent architectural extensibility of U-Net has spawned numerous evolutionary variants including ResUNet4, ResUNet++5, UNet++6, and Attention-UNet7, each progressively optimizing baseline performance metrics. While convolutional operations excel in localized feature extraction through kernel-based spatial filtering, their fundamental limitation resides in constrained receptive fields – a critical impediment to modeling long-range contextual dependencies and integrating holistic image semantics. Although sequential downsampling operations theoretically expand receptive coverage to capture broader spatial patterns, this paradigm introduces dual challenges: progressive erosion of fine-grained anatomical details and escalated computational redundancy through feature map dimensionality reduction. Particularly in endoscopic image segmentation scenarios, where clinical workflows demand high-resolution visualization (typically 1280 × 1024 pixels), conventional CNN-based architectures frequently suffer segmentation fidelity degradation when critical morphological features are compromised during aggressive downsampling.

The emergence of the Vision Transformer (ViT)8 enabled the Transformer9 to be applied effectively in computer vision. Subsequent architectural innovations include Swin-Transformer10, which implements hierarchical feature learning through shifted window partitioning, achieving computational efficiency while preserving multi-scale representational capacity. In medical imaging domains, TransUNet11 strategically concatenates CNN-derived low-level features with Transformer modules to synergize local texture patterns with global semantic context. UCTransNet12 introduces Channel Transformer (CTrans) modules as an advanced alternative to conventional skip connections, effectively bridging semantic disparities between encoder-decoder feature hierarchies. DS-TransUnet13 further extends this paradigm through dual-scale Swin-Transformer encoders that perform multi-resolution feature fusion prior to decoder integration. However, a critical limitation persists in these Transformer-based approaches: the self-attention mechanism, while powerful for capturing long-range dependencies, exhibits an intrinsic deficiency in modeling fine-grained local details and pixel-wise spatial correlations. This limitation is particularly detrimental in medical image segmentation tasks, such as endoscopic analysis, where precise delineation of subtle tissue boundaries, lesion margins, and anatomical structures at the micron level is paramount. The inability to effectively capture these intricate local features frequently manifests as boundary ambiguity and inaccurate localization in segmentation outputs, directly impacting diagnostic reliability and therapeutic planning. Consequently, bridging this gap in local modeling capability while leveraging the Transformer’s strength in global context understanding represents a fundamental challenge and a core objective in advancing MIS for endoscopic imagery.

To address this critical challenge of Transformer’s local detail modeling deficiency in endoscopic MIS, we introduce a novel Progressive Fusion Network (PFNet). The central guiding principle of PFNet is the progressive and effective fusion of multi-scale features, specifically designed to overcome the local modeling gap inherent in standard Transformer architectures. Our core hypothesis is that explicitly and gradually integrating high-resolution local features (rich in boundary and positional information) with deep, semantically strong global features derived from the Transformer encoder can synergistically compensate for Transformer’s weakness in fine-grained perception.

This study introduces a novel progressive fusion network (PFNet) designed for medical endoscopic image segmentation. Building upon this core principle of progressive fusion to overcome Transformer’s local modeling limitation, our framework systematically integrates multi-scale contextual features to enhance both structural completeness and boundary precision in clinical imaging analysis. The primary innovations of this research are fourfold:

  1. This work introduces a progressive fusion network (PFNet) for endoscopic medical image segmentation. Driven by the core objective of mitigating Transformer’s local detail deficiency, PFNet leverages a Pvtv214-based Transformer encoder for global feature extraction. Adopting a progressive fusion strategy, the network generates high-quality boundary and position features, which are then gradually embedded into each decoder output layer and subsequent fusion processes. This approach yields endoscopic image segmentation results characterized by accurate positioning and clear boundaries.

  2. A noise filtering attention module (NFAM) is developed to further extract potential local detail features of organs and tissues while enhancing semantic features and filtering background noise. This module provides a cleaner foundation essential for the progressive fusion process.

  3. Instrumental to the progressive fusion strategy are the proposed boundary and position awareness module (BLAM) and auxiliary information embedding module (AIEM). BLAM fuses deep global features with shallow local details to generate high-quality boundary and position features. AIEM then dynamically embeds this boundary and position information into each level of decoder features, enhancing context-aware ability during decoding and progressively refining the feature representation with crucial local cues.

  4. A novel feature fusion module (FFM) is proposed, diverging from conventional approaches by abandoning the one-shot aggregation of multi-scale features from every layer. Instead, FFM synchronously fuses high-quality boundary and position features while recovering global features layer by layer from deep representations. This design improves segmentation accuracy, prevents semantic information loss, and embodies the progressive fusion principle by iteratively incorporating enhanced local information during feature reconstruction.

Related works

CNN in medical image segmentation

Convolutional Neural Networks (CNNs) have established themselves as foundational architectures in Medical Image Segmentation (MIS) research1,15,16. The field has witnessed progressive evolution from early Fully Convolutional Networks (FCNs)17 for universal pixel-wise segmentation – achieved through end-to-end convolutional operations replacing dense layers – to the seminal U-Net framework3. This paradigm shift introduced a symmetric encoder-decoder topology augmented with skip connections, enabling multi-scale feature map fusion during upsampling to optimize detail preservation and segmentation fidelity.

Subsequent architectural innovations have focused on mitigating CNN limitations in MIS contexts. UNet++6 addresses encoder-decoder semantic disparity through nested dense skip pathways, while Attention U-Net7 implements dynamic region-of-interest prioritization via integrated attention gates. The residual learning paradigms in Res-UNet4 preserve gradient flow through dedicated skip connections across encoder stages. Multi-task frameworks like SFANet18 employ dual decoders for concurrent region/boundary prediction, whereas SKNets19 enhance scale adaptability through selective kernel feature fusion. To handle anatomical variability, ACSNet20 deploys Adaptive Selection Modules (ASM) for context-aware integration of global semantics and local textures. Advanced approaches include PraNet’s21 two-stage coarse-to-fine segmentation with reverse attention mechanisms, HarDNet-MSEG’s22 computational optimization via pruned dense connections, and DCRNet’s23 cross-image similarity learning. Recent innovations further encompass EU-Net’s24 multi-scale non-local attention blocks with spatial cross-layer filtering, and SANet’s25 color-invariant segmentation through chromatic normalization and probabilistic refinement.

Nevertheless, persistent challenges stem from CNN architectures’ intrinsic constraints in modeling global contextual relationships and long-range spatial dependencies. These limitations frequently manifest as partial anatomical detail degradation and boundary delineation inaccuracies, ultimately compromising segmentation robustness – particularly in clinically critical scenarios demanding micron-level precision.

Vision transformer in medical image segmentation

The Transformer architecture8, initially developed for sequence modeling through self-attention mechanisms, has transcended its natural language processing (NLP) origins to reshape computer vision paradigms. By computing pairwise correlations across all input positions and generating context-aware representations, this framework demonstrates exceptional capacity for capturing global dependencies. Vision Transformer (ViT)9 pioneers this adaptation by processing images as sequences of positional-encoded non-overlapping patches, achieving state-of-the-art classification performance. Building upon this foundation, SETR26 employs pure Transformer encoders for feature extraction coupled with specialized decoders for pixel-level segmentation. Subsequent innovations focus on hybrid architectures that integrate CNN-inspired hierarchical structures with Transformer components10,27–32, exemplified by Swin Transformer’s10 linear-complexity shifted window attention and PVTv2’s14 overlapping patch embedding strategy for enhanced feature representation.

In medical image segmentation (MIS), Transformer-based architectures demonstrate growing clinical relevance. TransUNet11 establishes the viability of CNN-Transformer hybrid encoders for global context modeling, while UCTransNet12 replaces conventional skip connections with Channel Transformer (CTrans) modules to bridge semantic gaps. DS-TransUNet13 advances multi-scale feature extraction through dual Swin Transformer encoders, and TransFuse33 introduces cross-modality fusion via parallel CNN-Transformer branches. Clinical specializations emerge through architectures like Polyp-PVT34, which leverages PVTv2 for enhanced polyp segmentation, and PVT-CASCADE35 implementing hierarchical cascaded attention decoding with feature prioritization mechanisms. HSNet36 further refines this paradigm through cross-semantic attention filtering and hybrid semantic complementation for anatomical detail preservation.

This architectural evolution reveals fundamental tradeoffs: While CNNs excel at extracting localized patterns through spatial inductive biases, their limited receptive fields constrain global semantic integration crucial for precise anatomical delineation. Conversely, ViT-derived models8,9 achieve comprehensive contextual modeling via self-attention mechanisms, yet confront dual challenges of local feature discriminability limitations and quadratic computational scaling with input resolution – particularly problematic in high-resolution endoscopic imaging. These complementary strengths and weaknesses underscore the critical need for architectures that synergize CNN’s local processing proficiency with Transformer’s global relational reasoning capabilities in medical imaging applications.

Methods

In this section, we present an overview of the overall architecture and related modules of PFNet. Specifically, we first describe the entire network structure, then provide detailed explanations for each module within it. These include the Pvtv2 backbone network for feature extraction and our proposed four modules: the Noise Filtering Attention Module (NFAM), the Boundary and Location Awareness Module (BLAM), the Auxiliary Information Embedding Module (AIEM), and the Feature Fusion Module (FFM). Finally, we present the composition of our loss function.

Overall architecture

The architecture of our designed PFNet is shown in Fig. 1. We utilize the Pvtv2 encoder to extract multi-level semantic features from the image. The output feature layers are then passed through the Noise Filtering Attention Module (NFAM) to filter out background noise and further refine the features. Subsequently, we construct the Boundary and Location Awareness Module (BLAM) to obtain complete boundary and location information. Through our proposed Auxiliary Information Embedding Module (AIEM), this boundary and location information is integrated into each feature layer. Finally, using our designed Feature Fusion Module (FFM), we fuse the boundary and location information while progressively restoring global context from deep features. This ensures that the boundary and location information are not lost, resulting in target regions with accurate, clear boundaries and correct positional localization.

Fig. 1. The overall structure of the PFNet. We adopt Pvtv2 as the encoder of the network. In the decoder part, a progressive fusion structure is proposed, which first generates high-quality boundary and location features, and then progressively embeds these features into each feature layer of the decoder output and the subsequent fusion process. To this end, we propose four modules: BLAM (Boundary and Location Awareness Module), NFAM (Noise Filtering Attention Module), AIEM (Auxiliary Information Embedding Module), and FFM (Feature Fusion Module).

NFAM’s filtering of low-level noise is the foundation of progressive fusion. The injection of boundary and location information at different stages through BLAM and AIEM is the key link of progressive fusion. FFM realizes the gradual fusion of features across levels.

Specifically, we first process the image to be segmented using a Pvtv2 encoder. Leveraging the Vision Transformer’s strong ability to capture global information, we obtain rich multiscale global information from endoscopic images of organs and tissues. Initially, we perform denoising operations on the four different-scale feature layers Inline graphic output by Pvtv2. Using the Noise Filtering Attention Module (NFAM), we filter out irrelevant background noise, enhance semantic information, and obtain higher-quality multiscale feature layers Inline graphic. Inline graphic represents high-level features that contain abundant global information, such as organ and tissue positional information, while Inline graphic represents low-level features that capture rich detailed information such as texture, color, and boundary characteristics. After denoising, we utilize the Boundary and Location Awareness Module (BLAM) to fuse positional information from Inline graphic with boundary information from Inline graphic. This results in a high-quality boundary and position information feature layer Inline graphic. Next, we employ the Auxiliary Information Embedding Module (AIEM) to incorporate this high-quality boundary and position feature layer Inline graphic into the original multiscale features Inline graphic. This enhances the semantic information of each feature layer, producing improved features Inline graphic. We then upsample these enhanced features so that each feature layer has the same dimensions as Inline graphic. Finally, we pass each feature layer Inline graphic through three Feature Fusion Modules (FFM). Starting with high-level features, this hierarchical approach progressively restores the global information of organs and tissues to generate a prediction map. Notably, our FFM differs from traditional approaches by continuously integrating the high-quality boundary and positional information from Inline graphic during each fusion step. This ensures that the semantic integrity of organs and tissues is preserved throughout the endoscopic image processing.
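The dataflow above can be summarized in a shape-level PyTorch sketch. Everything inside the modules here is an illustrative assumption (a strided-convolution stub in place of the Pvtv2 encoder, 1×1-convolution stand-ins for NFAM/BLAM/AIEM/FFM, and arbitrary channel widths); only the order of operations follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PFNetSketch(nn.Module):
    """Shape-level sketch of the PFNet dataflow. The real encoder is Pvtv2;
    here a stack of strided convs stands in, producing four pyramid levels."""
    def __init__(self, dims=(64, 128, 320, 512)):
        super().__init__()
        self.stem = nn.ModuleList([
            nn.Conv2d(3 if i == 0 else dims[i - 1], d, 3,
                      stride=2 if i else 4, padding=1)
            for i, d in enumerate(dims)
        ])
        # stand-ins for NFAM / BLAM / AIEM / FFM (the real modules are richer)
        self.nfam = nn.ModuleList([nn.Conv2d(d, d, 1) for d in dims])
        self.blam = nn.Conv2d(dims[0] + dims[3], 1, 1)       # boundary/location map
        self.aiem = nn.ModuleList([nn.Conv2d(d + 1, 64, 1) for d in dims])
        self.ffm = nn.ModuleList([nn.Conv2d(64 * 2 + 1, 64, 1) for _ in range(3)])
        self.head = nn.Conv2d(64, 1, 1)

    def forward(self, x):
        feats = []
        for conv in self.stem:                                # encoder pyramid
            x = conv(x)
            feats.append(x)
        feats = [m(f) for m, f in zip(self.nfam, feats)]      # NFAM: denoise each level
        hi = F.interpolate(feats[3], size=feats[0].shape[2:],
                           mode='bilinear', align_corners=False)
        t = torch.sigmoid(self.blam(torch.cat([feats[0], hi], 1)))  # BLAM guidance
        dec = []
        for m, f in zip(self.aiem, feats):                    # AIEM: embed guidance
            tf = F.interpolate(t, size=f.shape[2:], mode='bilinear',
                               align_corners=False)
            dec.append(m(torch.cat([f, tf], 1)))
        out = dec[3]
        for ffm, f in zip(self.ffm, dec[2::-1]):              # FFM: deep-to-shallow fusion,
            out = F.interpolate(out, size=f.shape[2:],        # re-injecting t each step
                                mode='bilinear', align_corners=False)
            tf = F.interpolate(t, size=f.shape[2:], mode='bilinear',
                               align_corners=False)
            out = ffm(torch.cat([out, f, tf], 1))
        return self.head(out)
```

The key point the sketch tries to make concrete is that the boundary/location map `t` is injected repeatedly, once per AIEM branch and again inside every FFM step, rather than only once at the end.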

Encoder

Inspired by the outstanding performance of Transformer in image processing tasks, we chose Pvtv2 as our primary feature extraction tool. Pvtv2 significantly reduces the computational resources required for attention mechanisms through its use of a pyramid structure and a technique called linear spatial reduction attention. Additionally, Pvtv2 employs an overlapping block embedding approach, where adjacent image blocks share a 50% overlap. This allows each block to contain more contextual information, thereby enhancing the local continuity of features and improving boundary awareness. In essence, Pvtv2 optimizes its structure and incorporates overlapping strategies to achieve both computational efficiency and improved feature extraction quality.
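The overlapping embedding can be sketched as a strided convolution whose kernel is larger than its stride, so that neighbouring patches share pixels. The kernel-7/stride-4 pair below follows the published PVTv2 stage-1 configuration; it is a minimal stand-in, not the paper's code.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding in the style of PVTv2: because the
    kernel (7) exceeds the stride (4), adjacent patches overlap, giving
    each token extra local context and better boundary continuity."""
    def __init__(self, in_ch=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                      # (B, C, H/4, W/4)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)      # (B, H*W/16, C) token sequence
        return self.norm(x), (h, w)
```

Compared with ViT's non-overlapping patchify, this keeps a convolutional inductive bias at the tokenization stage, which is what the text credits for the improved local continuity of features.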

Noise filtering attention module

While the Transformer’s self-attention mechanism demonstrates superior long-range dependency modeling through its global receptive field, enabling enhanced feature representation and contextual correlation learning, its vanilla implementation encounters critical limitations in medical imaging scenarios. The quadratic computational overhead inherent in standard self-attention operations proves particularly prohibitive when processing high-resolution endoscopic data, while the fixed-scale attention weights fail to effectively capture the heterogeneous anatomical structures characteristic of multi-organ endoscopic environments. Therefore, drawing on the ideas of PSPNet37 and ANNNet38, we optimized the traditional self-attention mechanism and designed our NFAM module, which better captures complete semantic information and filters out background noise. As shown in Fig. 2, the input feature layer Inline graphic undergoes multi-scale pyramid pooling operations37. This method not only reduces the computational load but also ensures that features of different sizes are captured by different receptive fields, which is particularly useful for identifying organs or tissues of different sizes in endoscopic images. Specifically, we applied Inline graphic pyramid pooling operations to downsample the feature layer. These processed features were then reshaped and concatenated. In this design, Inline graphic is directly obtained by reshaping the input feature layer Inline graphic, while Inline graphic are derived from the reshaped features after the pyramid pooling operation. After computing the attention over Q, K, and V, the results are reshaped:

[Equations (1)-(3): the NFAM attention computation; rendered as images in the original]

Inline graphic represents the linear projection, n denotes the number of heads, and Inline graphic is the dimension of each head, which equals Inline graphic.
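The asymmetric-attention idea behind NFAM (queries from the full feature map, keys/values from a short sequence of pyramid-pooled tokens, as in PSPNet/ANNNet) can be sketched as follows. The pool sizes (1, 3, 6, 8), single-head form, and channel width are assumptions for illustration, not the paper's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolAttention(nn.Module):
    """Sketch of NFAM-style attention: Q comes from all H*W positions,
    while K and V come from S pyramid-pooled tokens (S = sum of s*s over
    the pool sizes), cutting the cost from O((HW)^2) to O(HW * S)."""
    def __init__(self, dim, pools=(1, 3, 6, 8)):
        super().__init__()
        self.pools = pools
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
        pooled = torch.cat([                       # multi-scale pooled tokens
            F.adaptive_avg_pool2d(x, s).flatten(2).transpose(1, 2)
            for s in self.pools], dim=1)           # (B, S, C)
        k, v = self.kv(pooled).chunk(2, dim=-1)
        attn = (q @ k.transpose(1, 2)) * self.scale
        out = self.proj(attn.softmax(-1) @ v)      # (B, H*W, C)
        return out.transpose(1, 2).reshape(b, c, h, w)
```

Because the key/value set is a small, multi-receptive-field summary of the map, attention weights concentrate on pooled regional statistics rather than on individual noisy pixels, which is the noise-filtering intuition described above.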

Fig. 2. Description of the noise filtering attention module (NFAM).

Boundary and location awareness module

As shown in Amulet39 and EGNet40, good boundary priors and position priors are conducive to effective target segmentation and localization. Although the underlying features contain rich edge detail information, they lack global position information. Therefore, to effectively extract boundary and positional features of organs and tissues, it is essential to fuse them with semantic information from high-level features. As shown in Fig. 3, we achieve this by fusing low-level and high-level features to perceive and model the complete boundary and location features of organs and tissues. Specifically, we first reduce the channel dimensions of high-level features T and low-level features L using two Inline graphic convolution layers to 64 (Inline graphic) and 32 (Inline graphic), respectively. This reduces computational load without affecting feature extraction. Next, we upsample Inline graphic to match the size of Inline graphic, then concatenate it with Inline graphic. Finally, we pass the concatenated features through a Inline graphic convolution layer followed by a Inline graphic convolution layer, and then apply a Sigmoid function to produce single-channel edge features Inline graphic. The details of our BLAM layer are as follows:

[Equations (4)-(6): the BLAM computation; rendered as images in the original]
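The steps above can be sketched directly in PyTorch. The 64/32 channel reductions and the T/L fusion order follow the text; the convolution kernel sizes (1×1 for reduction, 3×3 then 1×1 for fusion) are assumptions, since the extraction dropped those numbers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BLAMSketch(nn.Module):
    """Sketch of BLAM: reduce high-level features T to 64 channels and
    low-level features L to 32, upsample T to L's resolution, concatenate,
    and map to a single-channel sigmoid boundary/location map."""
    def __init__(self, t_ch=512, l_ch=64):
        super().__init__()
        self.reduce_t = nn.Conv2d(t_ch, 64, 1)   # assumed 1x1 reduction
        self.reduce_l = nn.Conv2d(l_ch, 32, 1)
        self.fuse = nn.Sequential(               # assumed 3x3 -> 1x1 fusion
            nn.Conv2d(96, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1))

    def forward(self, t, l):
        t = self.reduce_t(t)
        t = F.interpolate(t, size=l.shape[2:], mode='bilinear',
                          align_corners=False)   # match low-level resolution
        l = self.reduce_l(l)
        return torch.sigmoid(self.fuse(torch.cat([t, l], dim=1)))
```

The output is a single-channel map in [0, 1] at the low-level resolution, which is what AIEM and FFM later re-inject as guidance.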

Fig. 3. Description of the boundary and location awareness module (BLAM).

Auxiliary information embedding module

Attention U-Net7, proposed by Oktay et al. in 2018, can suppress regions outside the segmentation target, highlight the features of the target regions, and significantly improve the accuracy of medical image segmentation. Its core module, the Attention Gate, dynamically adjusts weights to focus on areas of interest through a soft-attention mechanism. However, the Attention Gate is computationally intensive on high-resolution endoscopic images and relies on strict data alignment. To this end, our AIEM introduces group convolution to build a lightweight attention structure. We use group convolution instead of traditional convolution for feature fusion within groups; its computational complexity is significantly lower than that of traditional convolution. At the same time, we add a BatchNorm layer and a ReLU layer after the convolution of the input features, yielding the New Attention Gate (NAG) structure. As shown in Fig. 4, the high-quality boundary and positional information feature layer T is restored to the same number of channels as the feature layer N (the layer to be embedded with auxiliary information) using a Inline graphic convolution layer. The two feature layers are then input into the modified NAG to embed boundary and positional information into the feature layer N. Finally, we process the outputs of the four feature layers through ECA41, SA42, and GHAP43 to further enhance the fused features. The detailed formulas for our AIEM are as follows:

[Equations (7)-(10): the AIEM computation; rendered as images in the original]
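A minimal sketch of the NAG idea, an Attention-Gate-style gating of feature layer N by guidance T, with grouped convolutions plus BatchNorm/ReLU for the lightweight branches. The group count, intermediate width, and additive-gating form are illustrative assumptions modelled on the original Attention Gate, not the paper's exact design.

```python
import torch
import torch.nn as nn

class NewAttentionGate(nn.Module):
    """Sketch of the lightweight NAG: two grouped-conv + BN + ReLU branches
    transform N and T, their sum is squashed to a single-channel gate, and
    N is reweighted by that gate (cheaper than a full-channel plain conv)."""
    def __init__(self, ch, inter_ch=None, groups=4):
        super().__init__()
        inter_ch = inter_ch or ch // 2
        def branch(c_in):
            return nn.Sequential(
                nn.Conv2d(c_in, inter_ch, 1, groups=groups),  # grouped 1x1
                nn.BatchNorm2d(inter_ch), nn.ReLU(inplace=True))
        self.theta = branch(ch)   # transforms N
        self.phi = branch(ch)     # transforms T (already matched to N's channels)
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, 1), nn.Sigmoid())

    def forward(self, n, t):
        # t: guidance map restored to n's channel count by an upstream 1x1 conv
        alpha = self.psi(self.theta(n) + self.phi(t))  # per-pixel gate in [0, 1]
        return n * alpha
```

Relative to a plain convolution over `ch` channels, the grouped branch divides the multiply-accumulate count by the group number, which is the complexity saving the text refers to.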

Fig. 4. Description of the auxiliary information embedding module (AIEM).

Feature fusion module

In order to dynamically balance global semantics and local details during decoding and avoid the loss of boundary and position information, we introduce the channel attention module CA44, the spatial attention module SA42, and the pixel attention module PA45, and design the FFM. As shown in Fig. 5, we restore the high-quality boundary and positional information feature layer Inline graphic to the same number of channels as the input feature layer using a Inline graphic convolution, resulting in Inline graphic. We then concatenate this with the high-level feature layer Inline graphic (to be fused) and the low-level feature layer Inline graphic, forming a concatenated feature layer F. Next, we pass F through both the channel attention module CA44 and the spatial attention module SA42 to obtain Inline graphic and Inline graphic. The channel attention CA selects features representing the global information of organs and tissues, ignoring those representing background or noise. The spatial attention SA identifies the approximate location of organs and tissues in the image, ignoring blank regions or irrelevant organs and tissues. We then add these two output feature layers together and combine them with the original feature layer Inline graphic to pass through a pixel attention PA45, which precisely identifies the boundaries of organs and tissues while assigning each pixel inside the organ or tissue a weight representing its importance. This results in a weight layer P. Subsequently, we multiply this weight P with the high-level feature layer Inline graphic (which contains more global information), and multiply Inline graphic with the low-level feature layer Inline graphic (which contains richer local detail). We then sum these two results to produce an output. Finally, we pass this through a Inline graphic convolution layer to generate the final result. The details of our FFM (Feature Fusion Module) are as follows:

[Equations (11)-(14): the FFM computation; rendered as images in the original]
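The FFM wiring can be sketched as below. The CA/SA/PA sub-modules are simple stand-ins (an SE-style channel gate, a 7×7-conv spatial gate, a 1×1-conv pixel gate) rather than the cited CA44/SA42/PA45 implementations, and the `(1 - P)` weight applied to the low-level branch is an assumption, since the extraction leaves the second weight unspecified.

```python
import torch
import torch.nn as nn

class FFMSketch(nn.Module):
    """Sketch of FFM: concatenate boundary guidance T' with high-level Fh
    and low-level Fl, apply channel (CA) and spatial (SA) attention, derive
    a pixel-attention weight map P, and blend P*Fh + (1-P)*Fl."""
    def __init__(self, ch=64):
        super().__init__()
        cat_ch = 2 * ch + 1
        self.ca = nn.Sequential(                 # channel-attention stand-in
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(cat_ch, cat_ch, 1), nn.Sigmoid())
        self.sa = nn.Sequential(                 # spatial-attention stand-in
            nn.Conv2d(cat_ch, 1, 7, padding=3), nn.Sigmoid())
        self.pa = nn.Sequential(                 # pixel attention -> weight map P
            nn.Conv2d(cat_ch, 1, 1), nn.Sigmoid())
        self.out = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, fh, fl, t):
        f = torch.cat([fh, fl, t], dim=1)        # F = [Fh, Fl, T']
        f_attn = f * self.ca(f) + f * self.sa(f) # CA + SA outputs, summed
        p = self.pa(f_attn + f)                  # combine with original F -> P
        return self.out(p * fh + (1 - p) * fl)   # weighted blend, then conv
```

The map P acts as a per-pixel mixing coefficient: pixels the network judges to belong to the target lean on the semantically strong high-level branch, while boundary pixels lean on the detail-rich low-level branch.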

Fig. 5. Description of the feature fusion module (FFM).

Theoretical rationale for progressive fusion

The effectiveness of progressive fusion compared to single-stage fusion stems from its ability to systematically reconcile the fundamental disparity between feature hierarchies. Low-level features possess high spatial fidelity crucial for boundary delineation but are inherently noisy and semantically ambiguous. Conversely, high-level features offer robust semantic context but lack fine spatial detail due to downsampling. Directly merging these divergent representations often results in mutual interference: coarse semantics suppress vital local details, while noise contaminates high-level context. Our progressive strategy circumvents this through sequential refinement: the NFAM module first attenuates noise in low-level features. The BLAM module then leverages this cleaner input, fusing it with deep semantics to generate explicit, high-quality boundary and position guidance. Critically, the AIEM module embeds this guidance dynamically and progressively into each corresponding decoder layer during the feature reconstruction process. This allows the network to iteratively assimilate and refine structural cues step-by-step. Finally, the FFM synchronizes the reconstruction of global semantics with the layer-wise infusion of these progressively enhanced local signals. This orchestrated sequence enables controlled transformation and harmonious integration, preserving essential spatial details for precise segmentation—a key advantage over the blunt approach of single-stage fusion, particularly in demanding endoscopic imaging scenarios.

Loss function

Our loss function L is calculated between the final segmentation result (Prediction) and the ground truth, and can be expressed as:

L = L_wIoU + L_wBCE (15)

where Inline graphic and Inline graphic are the weighted IoU (intersection over union) loss46 and the weighted binary cross-entropy (BCE) loss46, respectively. They constrain the prediction map in terms of global structure (object level) and local detail (pixel level).
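A common implementation of this weighted IoU + weighted BCE combination from the cited literature (the PraNet-style "structure loss") is sketched below; the paper may differ in the weighting constants, which are assumptions here.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted IoU + weighted BCE, matching the L = L_wIoU + L_wBCE form.
    pred: raw logits (B, 1, H, W); mask: binary ground truth of same shape."""
    # Pixel weights: 1 + 5 * |local mean of mask - mask|, i.e. larger near
    # object boundaries, where the local average deviates from the pixel.
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    # Weighted BCE over logits (pixel-level constraint).
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    # Weighted soft IoU (object-level constraint).
    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()
```

The boundary-heavy weighting is what couples the two terms to the paper's stated goals: the IoU part guards the global object structure while the BCE part, amplified near edges, sharpens the local boundary detail.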

Experiments

Datasets

To rigorously assess the segmentation capabilities of our proposed PFNet framework in endoscopic imaging applications, we developed two original datasets, the Ureter dataset, derived from ureteroplasty surgical recordings, and the Re-TMRS dataset, obtained through renal tumor endoscopic examinations, and additionally used five open-access colonoscopic polyp datasets. Detailed specifications of these biomedical image collections are systematically presented in Table 1. For experimental validation, the Ureter dataset was randomly divided into training (2,292 images) and testing (572 images) subsets for ureteral structure segmentation. Similarly, the Re-TMRS dataset underwent stratified partitioning, with 2,258 samples allocated for model training and 565 reserved for renal tumor segmentation evaluation. In alignment with PraNet’s experimental configuration for polyp detection, we employed 900 Kvasir47 and 550 CVC-ClinicDB48 endoscopic images to establish the training cohort. The validation phase utilized the remaining 100 Kvasir and 62 CVC-ClinicDB specimens to examine the model’s feature learning capacity. Notably, three additional benchmarks (CVC-ColonDB49, ETIS50, and CVC-30051) were strictly excluded from training protocols to objectively evaluate the framework’s cross-domain generalization potential.

Table 1.

Detailed information of the seven datasets used in this work.

Datasets Segmentation target Image Original size Input size Train Validation Test
Ureter Ureter 2864 Inline graphic Inline graphic 2292 N/A 572
Re-TMRS Renal tumor 2823 Inline graphic Inline graphic 2258 N/A 565
Kvasir Polyp 1000 Inline graphic Inline graphic 900 N/A 100
CVC-ClinicDB Polyp 612 Inline graphic Inline graphic 550 N/A 62
CVC-ColonDB Polyp 380 Inline graphic Inline graphic N/A N/A 380
ETIS Polyp 196 Inline graphic Inline graphic N/A N/A 196
CVC-300 Polyp 60 Inline graphic Inline graphic N/A N/A 60

Evaluation metrics

We selected nine evaluation metrics that are widely used to assess medical image segmentation performance: Mean Absolute Error (MAE), S-measure (Inline graphic), mean Dice (mDice), mean Intersection over Union (mIoU), Enhanced-alignment measure (Inline graphic), Weighted F-measure (Inline graphic), Balanced error rate (Ber), Classification accuracy (Acc), and Hausdorff distance (HD).

[Equations (16)-(24): formal definitions of the nine metrics above; rendered as images in the original]

where N is the total number of pixels, Inline graphic is he predicted value of the bottom i pixels, Inline graphic is the true value of the i th pixels, Inline graphic is based on object-level similarity, Inline graphic is based on regional level similarity, Inline graphic is the weight coefficient (take 0.5), Inline graphic is a pixel-level alignment function,Precision and Recall are associated with four values i.e., true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN): Inline graphic, Inline graphic, Inline graphic and Inline graphic are the weighted Precision and Recall. We set Inline graphic to 1. A and B are two regions and Inline graphic and Inline graphic are their boundary curves. Inline graphic and Inline graphic are the distance functions between the two curves, which is defined as:

graphic file with name d33e1347.gif 25
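As a concrete reference, the pixel-level metrics above can be sketched in NumPy. This is an illustrative implementation, not the authors' evaluation code; the S-measure and E-measure involve structural similarity terms and are omitted, and the Hausdorff distance is shown for explicit boundary point sets.

```python
import numpy as np

def confusion(pred, gt):
    """Pixel counts TP, TN, FP, FN for binary masks (values in {0, 1})."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return (np.sum(pred & gt), np.sum(~pred & ~gt),
            np.sum(pred & ~gt), np.sum(~pred & gt))

def mae(pred, gt):
    return np.mean(np.abs(pred.astype(float) - gt.astype(float)))

def dice(pred, gt, eps=1e-8):
    tp, _, fp, fn = confusion(pred, gt)
    return 2 * tp / (2 * tp + fp + fn + eps)

def iou(pred, gt, eps=1e-8):
    tp, _, fp, fn = confusion(pred, gt)
    return tp / (tp + fp + fn + eps)

def accuracy(pred, gt):
    tp, tn, fp, fn = confusion(pred, gt)
    return (tp + tn) / (tp + tn + fp + fn)

def ber(pred, gt, eps=1e-8):
    tp, tn, fp, fn = confusion(pred, gt)
    return 0.5 * (fn / (tp + fn + eps) + fp / (tn + fp + eps))

def hausdorff(a_pts, b_pts):
    """Symmetric Hausdorff distance between boundary point sets (N,2), (M,2):
    the larger of the two directed distances d(A,B) and d(B,A)."""
    d = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

In practice these would be averaged over all test images to obtain the reported mDice, mIoU, etc.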

Implementation details

The proposed PFNet is implemented in PyTorch 2.0.0 with Python 3.8.0. All models are trained on an NVIDIA A800 GPU with 80 GB of memory. We use the AdamW optimizer [52] with a learning rate of Inline graphic and a weight decay of Inline graphic. We adopt a unified training strategy: the input images are first resized to Inline graphic, and a multi-scale strategy with ratios [0.75, 1.0, 1.25] is then applied so that the network can handle organs of different sizes. The batch size is set to 16, and the maximum number of training epochs is set to 100.
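The multi-scale resizing step can be sketched as follows. The base resolution of 352×352 and the stride of 32 are illustrative assumptions (the paper's exact input size is not rendered in this version); the idea is to rescale the base size by each ratio and snap it to a multiple of the backbone's overall downsampling factor.

```python
def multiscale_sizes(base=352, scales=(0.75, 1.0, 1.25), stride=32):
    """For each training scale, rescale the base input size and snap it to
    the nearest multiple of `stride`, since hierarchical backbones such as
    PVTv2 downsample by a fixed overall factor."""
    return [max(stride, round(base * s / stride) * stride) for s in scales]
```

At each training step one of the resulting sizes would be picked (e.g. at random) and the batch resized accordingly before the forward pass.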

Experimental results and analysis

In this section, we conduct experiments on seven datasets (Ureter, Re-TMRS, Kvasir, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300) to verify the effectiveness of our network and compare it with previous state-of-the-art (SOTA) methods, including U-Net [3], PraNet [21], TransUNet [11], SwinUnet [53], TGDAUNet [54], DA-TransUNet [55], MSA2Net [56], NPD-Net [57], CIFG-Net [58], and VMUNet [59]. As shown in Tables 2, 3, and 4, PFNet outperforms all previous SOTA methods on the kidney-tumor and ureter datasets, and also achieves better mDice, mIoU, and HD scores on the four publicly available polyp datasets Kvasir, CVC-ColonDB, CVC-ClinicDB, and CVC-300. Although our network does not surpass previous SOTA methods on the ETIS dataset, the gap is small. Overall, PFNet performs better than existing endoscopic image segmentation networks across the seven endoscopic image datasets. Below, we provide quantitative and qualitative comparisons for each dataset to demonstrate the performance of our method.

Table 2.

Quantitative results and comparison of segmentation on the Re-TMRS dataset.

Dataset Model Year MAE↓ S_α↑ F_β^w↑ E_φ↑ mDice↑ mIoU↑ Ber↓ Acc↑ HD↓
Re-TMRS UNet 2015 0.0156 0.8402 0.8927 0.9553 0.8586 0.7868 0.0806 0.9845 11.3301
PraNet 2020 0.0143 0.8610 0.9158 0.9677 0.8857 0.8186 0.0583 0.9863 11.0000
TransUNet 2021 0.0156 0.8423 0.8884 0.9565 0.8629 0.7818 0.0758 0.9844 11.2312
SwinUNet 2022 0.0204 0.7754 0.8540 0.9434 0.8169 0.7209 0.0828 0.9796 12.3411
TGDAUNet 2023 0.0157 0.8444 0.8998 0.9611 0.8717 0.7992 0.0606 0.9843 11.2302
DA-TransUNet 2023 0.0156 0.8430 0.8920 0.9546 0.8665 0.7893 0.0683 0.9844 11.2923
MSA2Net 2023 0.0122 0.8764 0.9201 0.9725 0.8996 0.8357 0.0481 0.9880 10.8512
NPD-Net 2024 0.0120 0.8824 0.9221 0.9745 0.9054 0.8431 0.0442 0.9882 10.6721
CIFG-Net 2024 0.0154 0.8405 0.9232 0.9420 0.8792 0.8293 0.0395 0.9871 10.7822
VMUNet 2024 0.0178 0.8263 0.8888 0.9601 0.8620 0.7788 0.0621 0.9823 11.6902
PFNet(ours) 2025 0.0115 0.8881 0.9249 0.9715 0.9107 0.8490 0.0416 0.9886 10.4501

↑ indicates that higher is better, ↓ indicates that lower is better.

Bold represents the best results, italic represents the second best results.

Table 3.

Quantitative results and comparison of segmentation on the Ureter dataset.

Dataset Model Year MAE↓ S_α↑ F_β^w↑ E_φ↑ mDice↑ mIoU↑ Ber↓ Acc↑ HD↓
Ureter UNet 2015 0.0087 0.8588 0.8926 0.9710 0.8792 0.8029 0.0637 0.9913 9.7301
PraNet 2020 0.0077 0.8746 0.9046 0.9769 0.8964 0.8229 0.0495 0.9923 9.4112
TransUNet 2021 0.0099 0.8403 0.8767 0.9684 0.8615 0.7731 0.0746 0.9901 10.0821
SwinUNet 2022 0.0114 0.8154 0.8584 0.9630 0.8399 0.7402 0.0881 0.9886 10.7012
TGDAUNet 2023 0.0098 0.8368 0.8824 0.9620 0.8635 0.7805 0.0643 0.9902 9.9421
DA-TransUNet 2023 0.0090 0.8512 0.8839 0.9657 0.8719 0.7904 0.0652 0.9910 9.8622
MSA2Net 2023 0.0080 0.8749 0.9052 0.9774 0.8953 0.8225 0.0535 0.9921 9.3931
NPD-Net 2024 0.0077 0.8772 0.9080 0.9778 0.9005 0.8291 0.0454 0.9924 9.2500
CIFG-Net 2024 0.0111 0.8133 0.9016 0.9217 0.8575 0.8072 0.0445 0.9912 9.6521
VMUNet 2024 0.0107 0.8310 0.8811 0.9692 0.8614 0.7725 0.0664 0.9894 10.4511
PFNet(ours) 2025 0.0073 0.8859 0.9115 0.9815 0.9069 0.8379 0.0436 0.9928 9.1523

↑ indicates that higher is better, ↓ indicates that lower is better.

Bold represents the best results, italic represents the second best results.

Table 4.

Quantitative results and comparison of segmentation on the Kvasir dataset.

Dataset Model Year mDice↑ mIoU↑ HD↓
Kvasir UNet 2015 0.8111 0.7293 9.0212
PraNet 2020 0.9004 0.8475 6.7811
TransUNet 2021 0.8914 0.8325 6.7421
SwinUNet 2022 0.8691 0.7961 7.4811
TGDAUNet 2023 0.9207 0.8699 6.4921
DA-TransUNet 2023 0.8447 0.7731 7.2023
MSA2Net 2023 0.9122 0.8600 6.5500
NPD-Net 2024 0.9064 0.8530 6.5012
CIFG-Net 2024 0.9080 0.8720 6.4721
VMUNet 2024 0.8620 0.7952 7.4511
PFNet(ours) 2025 0.9234 0.8732 6.4021

↑ indicates that higher is better, ↓ indicates that lower is better.

Bold represents the best results, italic represents the second best results.

Quantitative Analysis. As shown in Table 2, our network surpasses previous SOTA methods on nearly all of the nine evaluation metrics adopted. Specifically, compared with the classic U-Net (traditional convolutional backbone), TransUNet (Transformer backbone), and VMUNet (Mamba backbone), our network improves mDice by 5.21%, 4.78%, and 4.87% and mIoU by 6.22%, 6.72%, and 7.02%, respectively, while reducing HD by 0.8800, 0.7811, and 1.2401. Furthermore, compared with the strongest competitor, NPD-Net, our network still improves mDice and mIoU by 0.53% and 0.59%, and it achieves at least a 1% improvement over the other recent networks. We can therefore conclude that PFNet outperforms all current SOTA methods on the kidney-tumor dataset. As shown in Table 3, our network also improves on the current SOTA methods across the nine evaluation metrics on the ureter dataset: Acc reaches 99.28%, HD is reduced to 9.1523, and mDice and mIoU improve by 0.64% and 0.88% over NPD-Net. Therefore, the performance of PFNet on the ureter dataset is likewise superior to all current SOTA methods. The mDice results on the Ureter and Re-TMRS datasets are shown in Fig. 6. As shown in Tables 4, 5, 6 and 7, our network outperforms all previous SOTA methods on the Kvasir, CVC-ColonDB, CVC-ClinicDB, and CVC-300 datasets. For the mIoU metric, it improves by 0.33% and 0.74% on Kvasir and CVC-ColonDB over the second-best network, and the mDice on CVC-ClinicDB reaches 93.09%, an increase of 0.47% over the second-best NPD-Net. As shown in Table 8, a gap remains on the ETIS dataset, but it does not affect the overall performance of our network.

Fig. 6.


Ureter and Re-TMRS experimental results of mDice.

Table 5.

Quantitative results and comparison of segmentation on the CVC-300 dataset.

Dataset Model Year mDice↑ mIoU↑ HD↓
CVC-300 UNet 2015 0.7653 0.6816 6.0421
PraNet 2020 0.8727 0.8039 5.0722
TransUNet 2021 0.8504 0.7601 5.2711
SwinUNet 2022 0.8276 0.7333 5.7721
TGDAUNet 2023 0.8730 0.8084 5.1929
DA-TransUNet 2023 0.8277 0.7619 5.0708
MSA2Net 2023 0.8857 0.8182 5.0531
NPD-Net 2024 0.8708 0.8068 5.1112
CIFG-Net 2024 0.8470 0.7895 5.0714
VMUNet 2024 0.8488 0.7706 5.3621
PFNet(ours) 2025 0.8868 0.8196 4.9909

↑ indicates that higher is better, ↓ indicates that lower is better. Bold represents the best results, italic represents the second best results.

Table 6.

Quantitative results and comparison of segmentation on the CVC-ColonDB dataset.

Dataset Model Year mDice↑ mIoU↑ HD↓
CVC-ColonDB UNet 2015 0.6228 0.5265 8.4420
PraNet 2020 0.7180 0.6466 6.8322
TransUNet 2021 0.7535 0.6682 6.8114
SwinUNet 2022 0.6896 0.5943 7.6532
TGDAUNet 2023 0.7743 0.7034 6.3321
DA-TransUNet 2023 0.6818 0.5926 7.2626
MSA2Net 2023 0.7486 0.6691 6.8012
NPD-Net 2024 0.8165 0.7353 6.3422
CIFG-Net 2024 0.8047 0.7317 6.3720
VMUNet 2024 0.7093 0.6239 7.3610
PFNet(ours) 2025 0.8169 0.7391 6.2720

↑ indicates that higher is better, ↓ indicates that lower is better.

Bold represents the best results, italic represents the second best results.

Table 7.

Quantitative results and comparison of segmentation on the CVC-ClinicDB dataset.

Dataset Model Year mDice↑ mIoU↑ HD↓
CVC-ClinicDB UNet 2015 0.8640 0.7894 5.1221
PraNet 2020 0.9016 0.8576 4.0721
TransUNet 2021 0.8652 0.8079 4.4523
SwinUNet 2022 0.8416 0.7679 4.7912
TGDAUNet 2023 0.9213 0.8700 4.0332
DA-TransUNet 2023 0.8620 0.7988 4.5431
MSA2Net 2023 0.9219 0.8731 4.0731
NPD-Net 2024 0.9262 0.8802 3.9411
CIFG-Net 2024 0.9153 0.8817 3.7814
VMUNet 2024 0.8549 0.7859 4.7823
PFNet(ours) 2025 0.9309 0.8826 3.7617

↑ indicates that higher is better, ↓ indicates that lower is better.

Bold represents the best results, italic represents the second best results.

Table 8.

Quantitative results and comparison of segmentation on the ETIS dataset.

Dataset Model Year mDice↑ mIoU↑ HD↓
ETIS UNet 2015 0.5227 0.4247 11.9123
PraNet 2020 0.6300 0.5757 9.1512
TransUNet 2021 0.6578 0.5760 9.3332
SwinUNet 2022 0.5683 0.4798 10.4542
TGDAUNet 2023 0.7600 0.6789 8.6612
DA-TransUNet 2023 0.6360 0.5579 8.7812
MSA2Net 2023 0.6692 0.6006 8.4927
NPD-Net 2024 0.7806 0.7025 8.4613
CIFG-Net 2024 0.7345 0.6806 8.5719
VMUNet 2024 0.5585 0.4851 10.6006
PFNet(ours) 2025 0.7727 0.6938 8.4027

↑ indicates that higher is better, ↓ indicates that lower is better.

Bold represents the best results, italic represents the second best results.

Qualitative Analysis. To observe the segmentation quality of our network on endoscopic images more intuitively, we visualize the results of our network and the previous ten methods in Fig. 7. The first four columns show segmentation results on the ureter dataset, and the last four columns on the kidney-tumor dataset. From the figure, we can clearly observe that previous SOTA methods produce false positives and missed regions when segmenting endoscopic images, and the boundaries of their segmented targets are slightly blurred, whereas our PFNet produces more accurate results with fewer such errors. Specifically, in the first column, U-Net, PraNet, TGDAUNet, and MSA2Net segment the correct areas but also mistake internal tissue that does not belong to the ureter for ureteral regions; other networks, such as TransUNet, additionally miss areas that do belong to the ureter. Our PFNet correctly segments the ureteral region while excluding interfering light, fat, and surgical instruments, demonstrating stronger robustness. From the sixth and last columns, we can clearly see that the previous ten SOTA methods can hardly segment the kidney-tumor targets accurately and completely; when the target region is large, they tend to miss parts of it, whereas our PFNet still segments the kidney tumor accurately. Overall, PFNet segments tissues and organs of different sizes and shapes (such as ureters and kidney tumors) more precisely, with clearer segmentation boundaries.

Fig. 7.


Visualization results of the segmentation of the Ureter and Re-TMRS dataset.

Analysis of the model’s parameter count, computational complexity and inference time. As shown in Table 9, our model achieves the best segmentation performance while keeping its parameter count, computational complexity, and inference time competitive. Although it does not achieve the best results on these three measures, it already meets the requirements for application in medical image segmentation.

Table 9.

The results of the parameter quantity, computational complexity and inference time of the model adopted in the experiment.

Model Params(M) FLOPs(G) Time(ms)
UNet 7.81 14.59 16.88
PraNet 32.52 13.18 72.35
TransUNet 105.13 24.57 124.87
SwinUNet 27.15 5.86 42.57
TGDAUNet 62.51 21.13 75.87
DA-TransUNet 112.29 23.12 89.78
MSA2Net 28.56 19.92 61.12
NPD-Net 25.57 14.55 72.16
CIFG-Net 24.29 16.79 71.98
VMUNet 44.27 7.56 32.43
PFNet(ours) 25.59 15.32 45.12

Ablation study

In our PFNet network, we introduce four new modules (NFAM, BLAM, AIEM, and FFM) to improve the segmentation performance on endoscopic images. To verify the impact and contribution of each module, we performed an ablation study on the kidney-tumor and ureter segmentation tasks. Specifically, we removed the NFAM entirely, and replaced the BLAM, AIEM, and FFM with simple channel-adjusting convolutions followed by element-wise addition. Table 10 presents the quantitative results of these ablation experiments, denoted “w/o NFAM”, “w/o BLAM”, “w/o AIEM”, and “w/o FFM”. From Table 10, we observe that in the “w/o NFAM” case, the performance metrics decrease on both the kidney-tumor and ureter datasets; the visualizations in Fig. 8 also show over-segmented and unsegmented regions. We therefore conclude that the NFAM effectively filters out irrelevant noise and extracts the correct semantic information. In the “w/o BLAM” case, the mDice, mIoU, and HD metrics all degrade, and the segmented results contain unclear and inaccurate areas. Simply adding high-level features to low-level features before the subsequent progressive fusion therefore does not yield good segmentation results, which demonstrates that our BLAM effectively fuses high-level and low-level feature layers to generate high-quality boundary and positional information. In the “w/o AIEM” and “w/o FFM” cases, the mDice, mIoU, and HD metrics degrade significantly; for “w/o AIEM” in particular, mDice on the kidney-tumor dataset drops by more than 1%. Figure 8 also clearly shows that “w/o AIEM” and “w/o FFM” produce obvious over- or under-segmentation, with unclear and discontinuous boundaries. We can therefore conclude that our progressive fusion method, combined with the designed AIEM and FFM, segments organs and tissues in endoscopic images more accurately and with clearer boundaries. The combined qualitative and quantitative results show that each module is essential and improves the accuracy and robustness of the model.
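For clarity, the replacement used in the “w/o” baselines (a plain channel-adjusting convolution plus element-wise addition in place of the module) can be sketched as below. This is an illustrative NumPy sketch, not the authors' code; the 1×1 kernel and nearest-neighbour upsampling are assumptions.

```python
import numpy as np

def conv1x1(x, w):
    """A 1x1 convolution is a per-pixel linear map over channels.
    x: (C_in, H, W), w: (C_out, C_in) -> output (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def ablated_fuse(high, low, w):
    """'w/o module' baseline: project the high-level feature map to the
    low-level channel count, upsample it to the low-level resolution by
    nearest-neighbour repetition, then fuse by simple addition."""
    proj = conv1x1(high, w)
    rh = low.shape[1] // proj.shape[1]   # spatial upsampling factors
    rw = low.shape[2] // proj.shape[2]
    up = proj.repeat(rh, axis=1).repeat(rw, axis=2)
    return low + up
```

In the full model, this naive projection-and-add is replaced by the attention-based fusion of the corresponding module.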

Table 10.

Quantitative results of the ablation experiments.

Datasets Metric w/o NFAM w/o AIEM w/o BLAM w/o FFM PFNet(ours)
Re-TMRS mDice↑ 0.9079 0.9009 0.9092 0.9013 0.9107
mIoU↑ 0.8404 0.8350 0.8462 0.8413 0.8490
HD↓ 10.5121 10.6425 10.6022 10.5532 10.4516
Ureter mDice↑ 0.9028 0.9004 0.9045 0.9038 0.9069
mIoU↑ 0.8323 0.8236 0.8341 0.8342 0.8379
HD↓ 9.3232 9.2122 9.1939 9.2223 9.1504

↑ indicates that higher is better, ↓ indicates that lower is better.

Bold represents the best results.

Fig. 8.


Visualization results of the ablation experiment for the Ureter and Re-TMRS dataset.

Attention ablation experiment

To verify the influence of the attention mechanisms on the experimental results, we conducted an attention ablation experiment, where “w/o NFAM-att”, “w/o AIEM-att”, and “w/o FFM-att” denote removing the attention mechanism from the corresponding module. Taking the Ureter dataset as an example, we report the mDice value for each experiment along with the overall parameter count and computational complexity. As shown in Table 11, the mDice values of all three ablated variants decrease, while the savings in parameters and computation are only marginal. We can therefore conclude that adopting an attention mechanism in each module is a highly effective design choice.

Table 11.

Results of the attention ablation experiment (Ureter dataset).

Model mDice FLOPs(G) Params(M)
PFNet(ours) 0.9069 15.32 25.59
w/o NFAM-att 0.9035 12.93 23.43
w/o AIEM-att 0.9018 13.45 24.12
w/o FFM-att 0.9049 13.55 24.19

Hyperparameter sensitivity study

We analyzed the fusion weights in the FFM and the number of attention heads h in the NFAM; the results are shown in Tables 12 and 13. From Table 12, the best result in the FFM is achieved by fusing the three input feature layers with equal weights; if the fusion weight of any one feature layer is too high, segmentation performance declines. From Table 13, increasing the number of attention heads improves segmentation performance up to a point, but beyond 4 heads overfitting begins to occur and performance declines. We therefore use 4 attention heads in the NFAM.

Table 12.

Experimental results of fusion weights in FFM module.

Datasets Weight 1 Weight 2 Weight 3 mDice
Re-TMRS 1 1 1 0.9107
2 1 1 0.9052
1 2 1 0.9041
1 1 2 0.9019
Ureter 1 1 1 0.9069
2 1 1 0.9005
1 2 1 0.8978
1 1 2 0.8988

Bold represents the best results.

Table 13.

Experimental results of the number of attention heads in the NFAM module.

Datasets h mDice
Re-TMRS 1 0.9081
2 0.9090
3 0.9091
4 0.9107
5 0.9093
6 0.8989
Ureter 1 0.9028
2 0.9034
3 0.9049
4 0.9069
5 0.9060
6 0.8891

Bold represents the best results.

Conclusion

In this paper, we propose PFNet, an endoscopic image segmentation network based on progressive feature fusion that addresses the limitations of existing methods in global semantic modeling and local detail preservation through a multi-module collaboration mechanism. The NFAM enhances semantic feature representation through noise filtering and a cross-layer attention mechanism; the BLAM integrates multi-scale features to generate high-precision boundary and position cues; the AIEM dynamically embeds boundary and position information into the decoding process to improve context awareness; and the FFM realizes complementary optimization of global and local information through a progressive fusion strategy. Experiments show that PFNet significantly outperforms ten SOTA methods on multiple endoscopic datasets, particularly when dealing with small targets, fuzzy boundaries, and light interference. However, PFNet still has limitations in processing extremely low-contrast images, and its inference speed needs optimization. Future work will explore lightweight designs to improve real-time performance and extend the approach to 3D medical image segmentation tasks. This study provides an efficient tool for endoscopy-assisted diagnosis, and the code and models have been open-sourced to promote development in the field.

Author contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by L.F. and Y.C. The first draft of the manuscript was written by L.F., and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Data availability

The datasets generated and/or analysed during the current study are not publicly available because the data are owned by a third party and the authors do not have permission to share them; the data are, however, available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Azad, R. et al. Medical image segmentation review: The success of u-net. IEEE Trans. Pattern Anal. Mach. Intell. (2024). [DOI] [PubMed]
  • 2.Jia, X., Xing, X., Yuan, Y. & Meng, M.Q.-H. Deep learning-driven models for endoscopic image analysis. Adv. Artif. Intell. Comput. Data Sci. Med. Life Sci.271–300, (2021).
  • 3.Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18 (eds Ronneberger, O. et al.) 234–241 (Springer, 2015). [Google Scholar]
  • 4.Xiao, X., Lian, S., Luo, Z. & Li, S. Weighted res-unet for high-quality retina vessel segmentation. In 2018 9th International Conference on Information Technology in Medicine and Education (ITME) (eds Xiao, X. et al.) 327–331 (IEEE, 2018). [Google Scholar]
  • 5.Jha, D. et al. A comprehensive study on colorectal polyp segmentation with resunet++, conditional random field and test-time augmentation. IEEE J. Biomed. Health Inform.25, 2029–2040 (2021). [DOI] [PubMed] [Google Scholar]
  • 6.Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N. & Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support: 4th international workshop, DLMIA 2018, and 8th international workshop, ML-CDS 2018, held in conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, proceedings 4, 3–11 (Springer, 2018). [DOI] [PMC free article] [PubMed]
  • 7.Oktay, O. et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018).
  • 8.Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems30 (2017).
  • 9.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • 10.Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022 (2021).
  • 11.Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
  • 12.Wang, H., Cao, P., Wang, J. & Zaiane, O. R. Uctransnet: rethinking the skip connections in u-net from a channel-wise perspective with transformer. Proc. AAAI Conf. Artif. Intell.36, 2441–2449 (2022). [Google Scholar]
  • 13.Lin, A. et al. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Trans. Instrum. Meas.71, 1–15 (2022). [Google Scholar]
  • 14.Wang, W. et al. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media8, 415–424 (2022). [Google Scholar]
  • 15.Wang, R. et al. Medical image segmentation using deep learning: A survey. IET Image Proc.16, 1243–1267 (2022). [Google Scholar]
  • 16.Sekou, T. B., Hidane, M., Olivier, J. & Cardot, H. From patch to image segmentation using fully convolutional networks–application to retinal images. arXiv preprint arXiv:1904.03892 (2019).
  • 17.Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. In Proc. IEEE conference on computer vision and pattern recognition, 3431–3440 (2015). [DOI] [PubMed]
  • 18.Fang, Y., Chen, C., Yuan, Y. & Tong, K.-Y. Selective feature aggregation network with area-boundary constraints for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention-MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part I 22 (eds Fang, Y. et al.) 302–310 (Springer, 2019). [Google Scholar]
  • 19.Li, X., Wang, W., Hu, X. & Yang, J. Selective kernel networks. In Proc. IEEE/CVF conference on computer vision and pattern recognition, 510–519 (2019).
  • 20.Zhang, R. et al. Adaptive context selection for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention-MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part VI 23 (eds Zhang, R. et al.) 253–262 (Springer, 2020). [Google Scholar]
  • 21.Fan, D.-P. et al. Pranet: Parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (eds Fan, D.-P. et al.) 263–273 (Springer, 2020). [Google Scholar]
  • 22.Huang, C.-H., Wu, H.-Y. & Lin, Y.-L. Hardnet-mseg: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps. arXiv preprint arXiv:2101.07172 (2021).
  • 23.Yin, Z., Liang, K., Ma, Z. & Guo, J. Duplex contextual relation network for polyp segmentation. In 2022 IEEE 19th international symposium on biomedical imaging (ISBI), 1–5 (IEEE, 2022).
  • 24.Patel, K., Bur, A. M. & Wang, G. Enhanced u-net: A feature enhancement network for polyp segmentation. In 2021 18th conference on robots and vision (CRV), 181–188 (IEEE, 2021). [DOI] [PMC free article] [PubMed]
  • 25.Wei, J. et al. Shallow attention network for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention-MICCAI 2021: 24th International Conference, Strasbourg, France, September 27-October 1, 2021, Proceedings, Part I 24 (eds Wei, J. et al.) 699–708 (Springer, 2021). [Google Scholar]
  • 26.Zheng, S. et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 6881–6890 (2021).
  • 27.Wang, W. et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proc. IEEE/CVF international conference on computer vision, 568–578 (2021).
  • 28.Pan, Z., Zhuang, B., Liu, J., He, H. & Cai, J. Scalable vision transformers with hierarchical pooling. In: Proc. IEEE/cvf international conference on computer vision, 377–386 (2021).
  • 29.Dong, X. et al. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proc. IEEE/CVF conference on computer vision and pattern recognition, 12124–12134 (2022).
  • 30.Tu, Z. et al. Maxvit: Multi-axis vision transformer. In European Conference on Computer Vision (ed. Tu, Z.) 459–479 (Springer, 2022). [Google Scholar]
  • 31.Rao, Y. et al. Hornet: Efficient high-order spatial interactions with recursive gated convolutions. Adv. Neural. Inf. Process. Syst.35, 10353–10366 (2022). [Google Scholar]
  • 32.Wu, Y.-H., Liu, Y., Zhan, X. & Cheng, M.-M. P2t: Pyramid pooling transformer for scene understanding. IEEE Trans. Pattern Anal. Mach. Intell.45, 12760–12771 (2022). [DOI] [PubMed] [Google Scholar]
  • 33.Zhang, Y., Liu, H. & Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical image computing and computer assisted intervention–MICCAI 2021: 24th international conference, Strasbourg, France, September 27–October 1, 2021, proceedings, Part I 24, 14–24 (Springer, 2021).
  • 34.Dong, B. et al. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932 (2021).
  • 35.Rahman, M. M. & Marculescu, R. Medical image segmentation via cascaded attention decoding. In: Proc. IEEE/CVF winter conference on applications of computer vision, 6222–6231 (2023).
  • 36.Zhang, W. et al. Hsnet: A hybrid semantic network for polyp segmentation. Comput. Biol. Med.150, 106173 (2022). [DOI] [PubMed] [Google Scholar]
  • 37.Zhao, H., Shi, J., Qi, X., Wang, X. & Jia, J. Pyramid scene parsing network. In: Proc. IEEE conference on computer vision and pattern recognition, 2881–2890 (2017).
  • 38.Zhu, Z., Xu, M., Bai, S., Huang, T. & Bai, X. Asymmetric non-local neural networks for semantic segmentation. In: Proc. IEEE/CVF international conference on computer vision, 593–602 (2019).
  • 39.Zhang, P., Wang, D., Lu, H., Wang, H. & Ruan, X. Amulet: Aggregating multi-level convolutional features for salient object detection. In: Proc. IEEE international conference on computer vision, 202–211 (2017).
  • 40.Zhao, J.-X. et al. Egnet: Edge guidance network for salient object detection. In: Proc. IEEE/CVF international conference on computer vision, 8779–8788 (2019).
  • 41.Wang, Q. et al. Eca-net: Efficient channel attention for deep convolutional neural networks. In: Proc. IEEE/CVF conference on computer vision and pattern recognition, 11534–11542 (2020).
  • 42.Wang, H., Fan, Y., Wang, Z., Jiao, L. & Schiele, B. Parameter-free spatial attention network for person re-identification. arXiv preprint arXiv:1811.12150 (2018).
  • 43.Ruan, J., Xie, M., Gao, J., Liu, T. & Fu, Y. Ege-unet: an efficient group enhanced unet for skin lesion segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (eds Ruan, J. et al.) 481–490 (Springer, 2023). [Google Scholar]
  • 44.Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In: Proc. European conference on computer vision (ECCV), 3–19 (2018).
  • 45.Qin, X., Wang, Z., Bai, Y., Xie, X. & Jia, H. Ffa-net: Feature fusion attention network for single image dehazing. Proc.AAAI Conf. Artif. Intell.34, 11908–11915 (2020). [Google Scholar]
  • 46.Wei, J., Wang, S. & Huang, Q. F³Net: Fusion, feedback and focus for salient object detection. Proc. AAAI Conf. Artif. Intell.34, 12321–12328 (2020). [Google Scholar]
  • 47.Jha, D. et al. Kvasir-seg: A segmented polyp dataset. In MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26 (ed. Jha, D.) 451–462 (Springer, 2020). [Google Scholar]
  • 48.Bernal, J. et al. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph.43, 99–111 (2015). [DOI] [PubMed] [Google Scholar]
  • 49.Tajbakhsh, N., Gurudu, S. R. & Liang, J. Automated polyp detection in colonoscopy videos using shape and context information. IEEE Trans. Med. Imaging35, 630–644 (2015). [DOI] [PubMed] [Google Scholar]
  • 50.Silva, J., Histace, A., Romain, O., Dray, X. & Granado, B. Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. Int. J. Comput. Assist. Radiol. Surg.9, 283–293 (2014). [DOI] [PubMed] [Google Scholar]
  • 51.Vázquez, D., et al. A benchmark for endoluminal scene segmentation of colonoscopy images. J. Healthc. Eng.2017, 4037190 (2017). [DOI] [PMC free article] [PubMed]
  • 52.Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (2019).
  • 53.Cao, H. et al. Swin-unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision (ed. Cao, H.) 205–218 (Springer, 2022). [Google Scholar]
  • 54.Song, P., Li, J., Fan, H. & Fan, L. Tgdaunet: Transformer and gcnn based dual-branch attention unet for medical image segmentation. Comput. Biol. Med.167, 107583 (2023). [DOI] [PubMed] [Google Scholar]
  • 55.Sun, G. et al. Da-transunet: Integrating spatial and channel dual attention with transformer u-net for medical image segmentation. Front. Bioeng. Biotechnol.12, 1398237 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Kolahi, S. G. et al. MSA²Net: Multi-scale adaptive attention-guided network for medical image segmentation. arXiv preprint arXiv:2407.21640 (2024).
  • 57.Yu, Z. et al. A novel non-pretrained deep supervision network for polyp segmentation. Pattern Recogn.154, 110554 (2024). [Google Scholar]
  • 58.Li, W., Huang, Z., Li, F., Zhao, Y. & Zhang, H. Cifg-net: Cross-level information fusion and guidance network for polyp segmentation. Comput. Biol. Med.169, 107931 (2024). [DOI] [PubMed] [Google Scholar]
  • 59.Ruan, J., Li, J. & Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv preprint arXiv:2402.02491 (2024).


