Abstract.
Purpose
In recent years, the continuous advancement of convolutional neural networks (CNNs) has led to the widespread integration of deep neural networks as a mainstream approach in clinical diagnostic support. Particularly, the utilization of CNN-based medical image segmentation has delivered favorable outcomes for aiding clinical diagnosis. Within this realm, network architectures based on the U-shaped structure and incorporating skip connections, along with their diverse derivatives, have gained extensive utilization across various medical image segmentation tasks. Nonetheless, two primary challenges persist. First, certain organs or tissues present considerable complexity, substantial morphological variations, and size discrepancies, posing significant challenges for achieving highly accurate segmentation. Second, the predominant focus of current deep neural networks on single-resolution feature extraction limits the effective extraction of feature information from complex medical images, thereby contributing to information loss via continuous pooling operations and contextual information interaction constraints within the U-shaped structure.
Approach
We propose a five-layer pyramid segmentation network (PS5-Net), a multiscale segmentation network with diverse resolutions founded on the U-Net architecture. First, this network leverages the distinct features of images at varying resolutions across different dimensions, departing from prior single-resolution feature extraction methods to adapt to intricate and variable segmentation scenarios. Second, to comprehensively integrate feature information from diverse resolutions, a kernel selection module is proposed to assign weights to features across different dimensions, enhancing the fusion of feature information from various resolutions. Within the feature extraction network, denoted PS-UNet, we preserve the classical structure of the traditional U-Net while enhancing it through the incorporation of dilated convolutions.
Results
PS5-Net attains a Dice score of 0.9613 for liver segmentation on the CHLISC dataset and 0.8587 on the ISIC2018 dataset for skin lesion segmentation. Comparative analysis with diverse medical image segmentation methodologies in recent years reveals that PS5-Net has achieved the highest scores and substantial advancements.
Conclusions
PS5-Net effectively harnesses the rich semantic information available at different resolutions, facilitating a comprehensive and nuanced understanding of the input medical images. By capitalizing on global contextual connections, the network adeptly captures the intricate interplay of features and dependencies across the entire image, resulting in more accurate and robust segmentation outcomes. The experimental validation of PS5-Net underscores its superior performance in medical image segmentation tasks, offering promising prospects for enhancing diagnostic and analytical processes within clinical settings. These results highlight the potential of PS5-Net to significantly contribute to the advancement of medical imaging technologies and ultimately improve patient care through more precise and reliable image analysis.
Keywords: multiscale feature fusion, semantic segmentation, artificial intelligence, convolutional neural network, deep learning
1. Introduction
The primary objective of medical image segmentation is to enhance the visualization of organs, tissues, or pathological structures within the images. The continuous advancement of deep neural networks has led to an increasingly significant role for medical image segmentation in computer-aided diagnosis and intelligent medical treatment. This advancement has resulted in improved efficiency and accuracy in disease or pathological examination by aiding clinicians in making accurate diagnoses through the segmentation of key targets in medical images and subsequent feature extraction.
Early techniques for medical image segmentation relied on methods such as edge detection, template matching, statistical shape models, active contours, and traditional machine learning.1,2 However, with the continuous evolution of deep neural networks, convolutional neural network (CNN)-based medical image segmentation methods have gained popularity and demonstrated promising results across various fields. Nonetheless, inherent challenges exist in medical images, including blurring, noise, low contrast compared with traditional RGB images, and complex feature representation. For instance, liver organ segmentation encounters difficulties due to variations in shapes and sizes at different body positions, as well as challenges in accurately extracting boundary information. Consequently, liver organ image segmentation research has emerged as a focal point within the medical image segmentation domain. Similarly, skin disease segmentation presents challenges due to the irregular nature of lesion areas, irrespective of whether a specific type of disease is segmented. These two forms of segmentation are of great clinical importance, particularly in the context of using computed tomography (CT) images of the liver area for diagnosing liver diseases and employing skin disease segmentation to aid in determining diseases and lesion areas in clinical medicine.
However, the majority of experimental images utilized in clinical settings are still segmented manually, placing significant demands on doctors’ segmentation expertise and skills. Manual segmentation is also labor intensive, which can compromise efficiency and accuracy. Consequently, research on the segmentation of liver CT images and skin lesions holds substantial academic and clinical significance.
This paper presents the five-layer pyramid segmentation network (PS5-Net), a medical image segmentation network designed to adapt to complex scenarios. The network leverages the multidimensional feature information present in input feature maps at various resolutions and integrates this information for optimized selection to achieve adaptability. Specifically, the original input image serves as the input to PS5-Net and is downsampled into five feature maps of different resolutions that are processed in parallel by the internal feature extraction network. While maintaining the foundational framework of the traditional U-shaped network, substantial experimentation has demonstrated the superior structural performance of this design. To further enhance the network’s inductive bias and address the traditional U-shaped network’s limited processing of local contextual features, a contextual information interaction module named no sigmoid atrous fusion (NSAF), built on dilated convolution, is incorporated at the original skip connection location to enable effective multiscale feature representation. To counteract the information loss resulting from continuous pooling operations, the pooling fusion block (PFB) is integrated into the internal feature extraction network to provide a nonlinear downsampling process that retains maximal information; the resulting feature extraction network is denoted PS-UNet. The five original feature maps at different resolutions are then processed in parallel through PS-UNet to capture information at five distinct scales, which is the core focus of this study. To this end, an attention kernel selecting module is devised to linearly fuse the feature maps in each layer of the pyramid using convolutional kernels of varying sizes, thereby enabling different feature maps to be represented by convolutional kernels of differing dimensions.
In addition, feature weight assignment is modulated by introducing an attention mechanism to effectively extract global information while preserving local details with greater efficiency.
Quantitative experiments conducted on the CHLISC public medical image dataset and the ISIC2018 dataset demonstrate that PS5-Net yields substantial performance enhancements. Its efficacy has been verified in the segmentation of liver organs and skin lesion areas in medical images, bearing clinical significance. The primary contributions of this study are enumerated as follows.
1. In this paper, we propose the PS5-Net medical image segmentation model, featuring a five-layer pyramidal segmentation architecture in which each layer of the pyramid aligns with a distinct resolution scale. Through the fusion of resolution features across the five scales, effective interplay between global and local information is realized.
2. An attention kernel selection module capable of adaptively fusing features across various scales is proposed. Through the incorporation of the attention mechanism, a linear combination approach is employed to merge multiple pyramid branches utilizing convolutional kernels of differing sizes, thereby enabling feature maps at different scales to be represented by distinct kernels.
3. Within the feature extraction network, the foundational architecture of the U-shaped network serves as the fundamental framework, while the contextual interaction module NSAF with dilated convolution is integrated to attain efficient multiscale information representation. This approach upholds the easy deployment and strong generalization capabilities inherent in the U-shaped network.
4. This paper validates the performance of PS5-Net using the CHLISC and ISIC2018 datasets. The experimental results demonstrate substantial enhancements in precise medical image segmentation achieved by PS5-Net, yielding superior evaluation scores compared with other established U-Net models and their variants. Furthermore, it contributes valuable insights for the advancement of medical image segmentation methodologies.
2. Related Work
Currently, within the domain of medical image segmentation, Shelhamer et al.3 first introduced the fully convolutional network to address pixel-level segmentation tasks. Subsequently, the segmentation model represented by the U-shaped network4 has become the mainstream model in the field, and various improved U-shaped networks have been derived from it, such as R2U-Net,5 Nested-UNet,6 Attention U-Net,7 Focus Netv2,8 and Denseres-UNet.9 Many subsequent attempts have been made, but none depart from the classical U-shaped segmentation network. Its main innovation is a segmentation structure comprising an encoder and a decoder, equipped with downsampling and upsampling functions, respectively; these are symmetrical to each other, creating the familiar U-shaped structure. The entire network uses convolution to extract local features: the encoder uses cascaded convolutional layers to extract high-level semantic representations, and the decoder reuses the encoder’s high-resolution feature maps through skip connections to recover the spatial information lost in the high-level representation. These skip connections recover some of the spatial information lost in the encoder, similar to the deep residual network proposed by He et al.10 in 2016. Based on the U-Net architecture, U-Net++11 and U-Net3+12 introduce a dense connection mechanism that allows for flexible encoder and decoder depths, reducing semantic depth while enhancing segmentation accuracy within the U-Net network. However, because the effective receptive field of a CNN is much smaller than its theoretical one, especially at higher levels, global information is not fully integrated when extracting features.
Although this densely connected training scheme reduces the number of hyperparameters, it consumes more memory and introduces partial redundancy in the training dataset. To address the aforementioned challenges, TCC-MSFCN13 leverages a multiscale architecture to extract diverse receptive-field features tailored to objects of varying resolutions. To mitigate the loss of global information resulting from procedures such as pooling, specialized high-resolution branches are employed to compensate for edge pixels. TCC-MSFCN demonstrates exceptional performance on both the Warwick-QU dataset and the colorectal adenocarcinoma gland (CRAG) dataset. Research analogous to this approach14–17 has demonstrated noteworthy efficacy on formidable datasets. Traditional U-Net networks may perform well on relatively straightforward segmentation tasks; however, when dealing with complex organs or tissues, these networks are prone to losing crucial global information, particularly as a result of downsampling. This loss can significantly impede segmentation accuracy, highlighting the difficulty faced by CNNs that consider only single-channel features, or that treat all channel-wise features equally, in achieving optimal results.
In recent years, the remarkable advancement of hardware technology and deep neural networks has substantially enhanced the contextual representation capabilities of deep learning. Consequently, the research focus has shifted from single-scale considerations to multiscale feature extraction. In 2019, Mou et al.18 proposed CS-Net for retinal lesion segmentation, which combines spatial attention and channel attention for adaptive feature aggregation. In 2020, Fang and Yan19 proposed the pyramid input pyramid output feature abstraction network (PIPO-FAN) for multi-organ segmentation, which integrates multiscale strategies and pyramidal feature analysis into a U-shaped structure to enhance model performance. In the domain of medical imaging, characterized by significant variation in organ shape and size, the precise segmentation of small targets presents a challenge. To address this, Lyu et al.20 employed a multiscale contextual linking module to extract feature information while preserving resolution, thus facilitating accurate segmentation of small targets. A multiresolution fusion model was introduced to refine the final result, incorporating a decoder-implemented refinement process that merges and predicts coarse, deep features alongside fine, shallow features. Ultimately, a weighted fusion layer was employed to automatically amalgamate outputs from multiple scales.
In the context of enhancing segmentation precision, multiresolution fusion represents an effective approach. For instance, Zhuang21 maximized the information gleaned from feature maps at various resolutions by employing a sequentially cascaded U-Net to iteratively present feature maps at different resolutions, resulting in improved segmentation accuracy. In Ref. 22, a more sophisticated utilization of cross-connections is employed to predict three distinct pathways, effectively integrating information across different stages, with empirical experiments robustly demonstrating the efficacy of the model.
The key to multiscale resolution fusion lies in the adoption of an efficient fusion method. To tackle the challenge of effectively fusing multiscale resolution features, Qin et al.23 proposed a multiscale feature fusion module with multiple information interaction paths based on residual learning. This approach enables the adaptive fusion of image features at different scales, addressing the difficulty networks commonly have in utilizing low-resolution image features. In Refs. 24 and 25, after connecting multiscale resolution images, the image information is fused by a fusion module with an attention mechanism. These experiments all demonstrate the importance of an attention-equipped multiscale resolution feature fusion module for image feature fusion. In 2019, Zhao et al.26 proposed a kernel selection module in the field of low-level computer vision that applies convolutional kernels of different sizes to each resolution feature map for linear fusion, allowing different feature maps to be represented by kernels of different sizes. Follow-up experiments show that the model performs efficiently in low-level vision tasks.
The aforementioned studies have made substantial contributions to the field of medical image segmentation, offering numerous valuable methods and concepts. In Sec. 3, we integrate established techniques and methodologies to enhance the segmentation performance of the model, substantiating the efficacy of the proposed method through theoretical elaboration and empirical validation.
3. Methods
3.1. Network Architecture
Figure 1 illustrates the complete network architecture of the proposed PS5-Net, which comprises a multiscale feature extraction stage and a feature fusion stage. Within the multiscale feature extraction stage, we devised a more efficient feature extraction network by leveraging U-Net as the benchmark network, thereby inheriting its advantages. This modified version is denoted PS-UNet and corresponds to the yellow cube in Fig. 1. Subsequently, in the second stage, we designed a kernel selection module featuring an attention mechanism, represented by the blue cube in Fig. 1. Further elaboration on the entire network is provided in Secs. 3.2–3.4.
Fig. 1.
The overall structure of our proposed PS5-Net, with yellow circles representing the feature maps at different scales and the symbol || indicating the concat operation; the channel dimension of the concatenated feature map is 18.
3.2. Multiscale Feature Extraction Stage
The concept of a pyramid has found extensive application in the realm of computer vision, particularly in tasks related to image super-resolution27 and scene parsing.28 To the best of our knowledge, the utilization of the pyramid concept in the domain of medical image segmentation has been relatively limited. In addition, in medical image segmentation tasks, the effective receptive field of a CNN is notably smaller than its theoretical counterpart, limiting the capture of high-level information. This limitation hinders the extraction of comprehensive global information from the image.
To address the above problem, a five-level pyramid structure is defined to downsample the original input feature map and obtain receptive fields in different dimensions; the semantic information carried by the different resolutions contains both local and global features. Five downsampling pooling kernels of increasing size are applied, and the downsampled results correspond to the five pyramid inputs in Fig. 1. These five feature maps of different resolutions are then fed uniformly into PS-UNet for feature extraction; PS-UNet is a feature extraction network based on an improved U-Net, described in detail below. Please note that the five PS-UNets here do not share weights. Following feature extraction by PS-UNet, the resulting feature maps undergo individual upsampling to reach a uniform resolution, after which they are integrated. Notably, we employ the original feature maps as a form of residual learning in conjunction with the outputs of the initial stage. Subsequent focus is directed to the feature extraction network PS-UNet in Sec. 3.3.
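As a sketch of this pyramid construction (the paper's exact pooling kernel sizes are given in Fig. 1; the factors 1 through 16 below are our illustrative assumption), the downsampling and the post-extraction upsampling back to a uniform resolution might look like:

```python
import torch
import torch.nn.functional as F

def build_pyramid(x, factors=(1, 2, 4, 8, 16)):
    """Downsample the input into a five-level pyramid via average pooling.
    The pooling factors here are illustrative assumptions."""
    return [x if f == 1 else F.avg_pool2d(x, kernel_size=f) for f in factors]

x = torch.randn(1, 3, 256, 256)
pyr = build_pyramid(x)
# each level would be processed by its own (non-weight-shared) PS-UNet,
# then upsampled back to the input resolution before fusion
up = [F.interpolate(p, size=x.shape[-2:], mode="bilinear", align_corners=False)
      for p in pyr]
```

In this sketch, the five upsampled maps would then be concatenated with the original image before the fusion stage.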
3.3. PS-UNet
Figure 2 shows the complete structure of PS-UNet, in which the encoder–decoder structure of U-Net is retained for more efficient extraction of contextual information. The encoder encodes the features of the original image to obtain multiscale contextual features. Each unit in the encoder block consists of two convolutions, each followed by a rectified linear unit (ReLU),29 followed by a downsampling operation. In the coding stage of the whole network, successive downsampling operations cause information loss. To address this, Refs. 30 and 31 use different pooling functions to reduce information loss; drawing on the nonlinear fusion pooling module (PFB module) proposed in Ref. 30, we use it in place of the original pooling function. The NSAF module is proposed to optimize the original skip connection for a more efficient multiscale feature representation. The PFB module and the NSAF module are described in detail in Secs. 3.3.1 and 3.3.2.
Fig. 2.
The network structure of PS-UNet.
3.3.1. Pooling fusion block
Pooling operations are integral to the downsampling of features in CNNs. Through the integration of convolution and pooling operations, more abstract contextual features can be derived. Nevertheless, with an increase in the number of network layers, there is a risk of losing minor feature information or finer details, consequently impacting the accuracy of liver image segmentation.
To mitigate the information loss stemming from the successive pooling operations of PS-UNet, we utilize the PFB module instead of conventional downsampling. This facilitates nonlinear fusion pooling aimed at minimizing the loss of information during downsampling. The structure of the PFB module is shown in Fig. 3. The input feature maps pass through maximum pooling and average pooling operations in parallel, each of which downsamples the feature maps. The downsampled feature maps are then concatenated and fused to detect more contextual features. Finally, the number of channels is adjusted by a 1×1 convolutional layer.
Fig. 3.
The structure of PFB.
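A minimal PyTorch sketch of the PFB as described above: max and average pooling in parallel, concatenation along the channel axis, then a 1×1 convolution to restore the channel count. The pooling stride of 2 is our assumption.

```python
import torch
import torch.nn as nn

class PFB(nn.Module):
    """Pooling fusion block sketch (Fig. 3): parallel max/avg pooling,
    channel concatenation, then a 1x1 convolution for channel adjustment."""
    def __init__(self, channels):
        super().__init__()
        self.max_pool = nn.MaxPool2d(2)
        self.avg_pool = nn.AvgPool2d(2)
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        # both pooled maps are half resolution; fuse them before projecting
        fused = torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1)
        return self.proj(fused)

y = PFB(16)(torch.randn(2, 16, 64, 64))
```

The block halves the spatial resolution while keeping the channel count, so it can drop in wherever the original pooling layer sat.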
3.3.2. No sigmoid atrous fusion module
To mitigate the semantic gap issue, the original U-Net incorporates skip connections to integrate image features across different scales. However, this approach inevitably gives rise to inadequate processing of local contextual features, and limitations in multiscale feature representation due to the U-Net’s restricted receptive field. Consequently, the NSAF module is devised to enhance the original skip connection for more effective multiscale feature representation (refer to Fig. 4).
Fig. 4.
The structure of NSAF.
For multiscale feature representation, the conventional method involved utilizing convolutional kernels of varying sizes to capture different receptive fields and amalgamate them for multiscale feature representation.32 However, large convolutional kernels entail increased hardware demands. To conserve computational resources and enhance the model’s computational efficiency, we enable the network to attain diverse receptive fields using dilated convolutional layers with varied dilation rates.33 This approach ensures that multiscale features are adequately expressed.
Specifically, in this module, the input derives from two sources: the high- and low-level image features within the PS-UNet network (refer to Fig. 2). The deeper features are upsampled using interpolation to ensure uniform feature map sizes. This is followed by the integration of the high- and low-level image features, and their subsequent convolution with dilation rates of 1, 2, and 4, respectively, to acquire varying receptive field sizes. Subsequently, the resulting three feature maps are concatenated to achieve multiscale feature representation.
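The dilated-branch portion of NSAF can be sketched as follows; the channel sizes and bilinear upsampling are our assumptions, but padding each 3×3 convolution by its dilation rate keeps all three receptive-field branches at the same spatial size so they can later be fused.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedBranches(nn.Module):
    """NSAF front end sketch: upsample the deeper features, concatenate
    with the shallow features, then apply parallel 3x3 convolutions with
    dilation rates 1, 2, and 4 for different receptive fields."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d)
            for d in (1, 2, 4)
        ])

    def forward(self, low, high):
        # interpolate the high-level map to match the low-level spatial size
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                             align_corners=False)
        x = torch.cat([low, high], dim=1)
        return [b(x) for b in self.branches]

m = DilatedBranches(32, 16)
outs = m(torch.randn(1, 16, 64, 64), torch.randn(1, 16, 32, 32))
```

Each branch output has the same shape, which is what allows the element-wise summation in Eq. (1).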
Subsequently, to acquire discriminative features, conventional approaches often incorporate some form of attention module. Among these, the squeeze-and-excitation (SE) block34 stands out as the most widespread attention block and has demonstrated impressive performance across various deep convolutional neural network (DCNNs).35 Notably, it is lightweight and can be seamlessly integrated into diverse network models as an additional module. Nonetheless, the SE block exclusively considers weight information on a single scale and does not allocate weights to features across multiple scales. Drawing inspiration from Ref. 36, we introduce a selective attention module capable of assigning weights to corresponding scales for convolutional layers with distinct expansion rates. This enables more efficient utilization of the benefits offered by dilation convolution.
Specifically, the feature map is passed through the dilated convolution branches with dilation rates of 1, 2, and 4 to obtain $U_1$, $U_2$, and $U_4$, respectively. Then, we combine the information of the three branches by element-wise summation:

$U = U_1 + U_2 + U_4$.  (1)

Then, $U$ sequentially passes through a global average pooling37 and two fully connected layers, manipulated in the same way as traditional channel attention, with the difference that no sigmoid function is used at the output. $\alpha$, $\beta$, and $\gamma$, respectively, are the outputs of the second fully connected layer, and they are normalized by the softmax function, acting like a gated switch:

$a_c = \dfrac{e^{\alpha_c}}{e^{\alpha_c} + e^{\beta_c} + e^{\gamma_c}}, \quad b_c = \dfrac{e^{\beta_c}}{e^{\alpha_c} + e^{\beta_c} + e^{\gamma_c}}, \quad c_c = \dfrac{e^{\gamma_c}}{e^{\alpha_c} + e^{\beta_c} + e^{\gamma_c}}$,  (2)

where $\alpha, \beta, \gamma \in \mathbb{R}^C$, and $a$, $b$, $c$ correspond to the soft attention of $U_1$, $U_2$, and $U_4$, respectively. Note that $a_c$ is the $c$’th element of $a$, likewise $b_c$ and $c_c$. The final output feature map $V$ is computed by combining the different dilation rates and their attention weights as follows:

$V_c = a_c \cdot U_{1,c} + b_c \cdot U_{2,c} + c_c \cdot U_{4,c}$,  (3)

where $a_c$, $b_c$, and $c_c$ need to satisfy $a_c + b_c + c_c = 1$ and $a_c, b_c, c_c \in [0, 1]$. Finally, a convolution layer is used to compress the dimensionality for feature fusion.
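Equations (1)–(3) follow the selective-kernel pattern of Ref. 36; a minimal sketch is given below, where the reduction width of the first fully connected layer is our assumption.

```python
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    """Softmax selection over three dilation branches in the style of
    Eqs. (1)-(3). The reduction ratio r is an assumption."""
    def __init__(self, channels, r=4):
        super().__init__()
        hidden = max(channels // r, 4)
        self.fc1 = nn.Linear(channels, hidden)
        self.fc2 = nn.Linear(hidden, channels * 3)   # alpha, beta, gamma

    def forward(self, branches):                     # three (N, C, H, W) maps
        u = branches[0] + branches[1] + branches[2]  # Eq. (1): element-wise sum
        z = u.mean(dim=(2, 3))                       # global average pooling
        logits = self.fc2(torch.relu(self.fc1(z)))   # no sigmoid, per NSAF
        n, c = z.shape
        attn = logits.view(n, 3, c).softmax(dim=1)   # Eq. (2): softmax gate
        # Eq. (3): attention-weighted combination of the branches
        return sum(attn[:, i].view(n, c, 1, 1) * branches[i] for i in range(3))

branches = [torch.randn(2, 8, 16, 16) for _ in range(3)]
v = SelectiveFusion(8)(branches)
```

The softmax over the branch axis enforces the constraint that the three per-channel weights sum to 1.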
3.4. Attention Kernel Selecting Module
The above content provides a detailed description of the multiscale feature extraction stage; the following describes the multiscale feature fusion part in detail. The five layers of original feature maps are connected together after feature extraction at different scales. Referring to Refs. 10 and 23, which use residual connections to reduce information loss during feature extraction, we connect the five layers of different scales with the original feature maps and merge them as the input of the subsequent feature fusion part. To elaborate, this is a concatenation operation, corresponding to the concatenation node in Fig. 1. Please note that the purpose of PS-UNet in the middle of the process is not to produce an output image as in the traditional U-Net structure; instead, it exploits the U-Net architecture to extract feature maps. Therefore, the final output channel number of each PS-UNet is 3, and as a result, the channel number of the concatenated feature map is 18 (five three-channel PS-UNet outputs concatenated with the original three-channel image).
Then we feed the combined feature maps into the attention kernel selection module, which corresponds to the blue cube in Fig. 1. The attention kernel selection module is the core module of the multiscale feature fusion stage; Ref. 38 demonstrates through extensive experiments that channel attention can more fully enhance useful features and weaken useless ones, thus providing a better feature screening and enhancement effect. Combining inspiration from Refs. 26, 39, and 40 to select kernels of different sizes for each channel in the concatenated multiscale results, we devise a parallel attention kernel selection module with channel attention, whose structure is depicted in Fig. 5.
Fig. 5.
Structure of the attention kernel selection module. GAP denotes global average pooling, and FC denotes the fully connected layer, followed by the use of the ReLU activation function.
Specifically, as shown in Fig. 5, for the channel attention part,41–43 the incoming feature map is convolved three times in parallel with kernel sizes of 3, 5, and 7 to obtain three branch feature maps. Next, the information from all branches is integrated by element-wise summation, and the subsequent channel attention operations are the same as those in Sec. 3.3.2, referring to Eqs. (1)–(3).
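A hedged sketch of this module: three parallel convolutions with kernel sizes 3, 5, and 7 over the 18-channel concatenated pyramid output, fused by the same softmax channel attention scheme as in Sec. 3.3.2. The hidden width and the absence of a final projection layer are our assumptions.

```python
import torch
import torch.nn as nn

class AttnKernelSelect(nn.Module):
    """Attention kernel selection sketch (Fig. 5): parallel 3/5/7 kernels,
    global average pooling, two FC layers, softmax over the branch axis."""
    def __init__(self, in_ch=18, out_ch=18, hidden=8):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (3, 5, 7)
        ])
        self.fc1 = nn.Linear(out_ch, hidden)
        self.fc2 = nn.Linear(hidden, out_ch * 3)

    def forward(self, x):
        branches = [conv(x) for conv in self.convs]
        u = sum(branches)                  # element-wise summation of branches
        z = u.mean(dim=(2, 3))             # GAP, as in Fig. 5
        n, c = z.shape
        attn = self.fc2(torch.relu(self.fc1(z))).view(n, 3, c).softmax(dim=1)
        return sum(attn[:, i].view(n, c, 1, 1) * b
                   for i, b in enumerate(branches))

y = AttnKernelSelect()(torch.randn(1, 18, 32, 32))
```

Padding each kernel by half its size keeps the three branches spatially aligned for summation.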
4. Experiment Data and Preprocessing
4.1. Data Preparation and Experimental Environment
4.1.1. CHAOS
The combined (CT-MR) healthy abdominal organ segmentation challenge (CHAOS) dataset was released in 2017 by the Medical Image Computing and Computer-Assisted Intervention Society. It contains 1862 CT and magnetic resonance imaging (MRI) liver slice images and corresponding segmentation labels. The dataset aims to develop and evaluate liver segmentation models.
The dataset uses the medical-standard digital imaging and communications in medicine (DICOM) format at a high native resolution to capture the delicate structure of the human liver. For convenience of model training, we cropped the images to a smaller uniform size. The labels are in binary PNG format, containing only the liver (value 1) and background (value 0) regions.
To achieve a 7:2:1 split of the training, validation, and test sets, we randomly selected 1862 images for training, 514 images for validation, and 257 images for testing from the total 2569 images. This split prevents data leakage and evaluates the model’s generalization ability.
4.1.2. ISIC2018 dataset
The ISIC2018 dataset corresponds to task 1 of the ISIC2018 challenge. The original task aims at automatic binary skin lesion segmentation from dermoscopic images. The ISIC2018 dataset contains 2594 cases. Similarly, we resize the images to a uniform resolution. Finally, we evaluate the model using the open test dataset of the original task, which contains 1000 cases.
In terms of the experimental environment, the computer has an Intel(R) Xeon(R) Platinum 8260C CPU @ 2.30 GHz, an NVIDIA GeForce RTX 3090 with 24 GB of graphics memory, and 30.1 GB of RAM, running Ubuntu 18.04.
The software configuration is Anaconda, PyCharm, and Python 3.9, and the deep learning framework used is PyTorch 1.9.0. The Adam algorithm is used as the optimizer for the loss function, with its initial learning rate set to 0.001 in the experiments.
4.2. Model Training and Accuracy Assessment
We use a combination of BCEWithLogitsLoss and DiceLoss44,45 as the loss function of the neural network. Reference 44 experimentally verified that combining multiple loss functions benefits network performance in medical segmentation. DiceLoss is well suited to liver data with segmentation regions of varying size, while preventing overfitting by directly optimizing the actual segmentation region. From the Dice coefficient formula, it can be observed that when |GT| and |PR| are excessively large or small, the loss value can fluctuate significantly, resulting in unstable training. To make the training process more stable, and considering the respective advantages and disadvantages of BCEWithLogitsLoss and DiceLoss, we define a variable λ to control the contribution of each loss function during training. The loss function with the variable λ is
$\mathrm{Loss} = \lambda \cdot L_{\mathrm{BCEWithLogits}} + (1 - \lambda) \cdot L_{\mathrm{Dice}}$,  (4)
where the variable λ takes values in the interval [0, 1].
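Equation (4) can be written directly in PyTorch; the smoothing constant `eps` and computing the Dice term over the whole batch are our assumptions.

```python
import torch
import torch.nn as nn

def combined_loss(logits, target, lam=0.5, eps=1e-6):
    """lam * BCEWithLogitsLoss + (1 - lam) * DiceLoss, per Eq. (4).
    lam in [0, 1]; its tuned value is not stated here."""
    bce = nn.functional.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = (2 * inter + eps) / (probs.sum() + target.sum() + eps)
    return lam * bce + (1 - lam) * (1 - dice)

logits = torch.zeros(2, 1, 8, 8)
target = torch.ones(2, 1, 8, 8)
loss = combined_loss(logits, target)
```

Setting `lam=1.0` recovers pure BCE and `lam=0.0` pure Dice loss, which makes the trade-off easy to sweep.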
The experimental accuracy is evaluated using the evaluation metrics commonly used in semantic segmentation: the Dice, precision, recall, and Hausdorff distance (HD).
1. Dice coefficient
The Dice coefficient is a common evaluation index for segmentation models; its value ranges over [0, 1], and the closer the result is to 1, the better the model’s segmentation. The Dice coefficient combines precision and recall to evaluate segmentation performance. It accounts for variations in the size of segmented areas within a dataset by considering both the predicted and actual regions of interest, such as the liver area in medical imaging, thereby providing a comprehensive assessment of the model’s segmentation effectiveness, as shown in Eq. (5).
$\mathrm{Dice} = \dfrac{2TP}{2TP + FP + FN}$,  (5)
where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.
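Eq. (5) in code, for binary masks (a small NumPy example of our own):

```python
import numpy as np

def dice_score(pred, gt):
    """Dice = 2*TP / (2*TP + FP + FN) for binary masks, per Eq. (5)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [0, 0]])
# TP=1, FP=1, FN=0 -> Dice = 2/3
```

The empty-mask case returns 1.0 by convention, which avoids a division by zero on all-background slices.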
2. Precision
Precision is the proportion of true positives (TP) among all cases judged positive by the model (TP + FP). A high precision indicates that most of the results detected by the model are indeed positive. In liver segmentation, precision represents the proportion of the true liver region within the segmented region, as shown in Eq. (6)
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$.  (6)
3. Recall
Recall is the proportion of true positives among all actual positive samples, i.e., the probability that an actual positive sample is predicted as positive. A high recall means there may be more false positives, but the model tries to find every object that should be found, as shown in Eq. (7)
Recall = TP / (TP + FN). (7)
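Eqs. (6) and (7) reduce to two one-line helpers over the confusion counts (illustrative only; names are ours):

```python
def precision(tp, fp):
    """Eq. (6): fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (7): fraction of actual positives that the model recovers."""
    return tp / (tp + fn)
```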
-
4.
Hausdorff distance
HD is used to evaluate the quality of image segmentation and model performance in terms of surface distance. It measures the degree of similarity between two point sets and is a defined form of distance between them. Suppose there are two point sets A = {a_1, a_2, …, a_p} and B = {b_1, b_2, …, b_q}; then the HD between these two point sets is defined as
HD(A, B) = max(h(A, B), h(B, A)), (8)
where h(A, B) = max_{a ∈ A} min_{b ∈ B} ‖a − b‖, h(B, A) = max_{b ∈ B} min_{a ∈ A} ‖b − a‖, and ‖·‖ denotes the Euclidean distance between two points.
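A direct, brute-force translation of Eq. (8) for small point sets might look like the sketch below; production code would use an optimized routine such as SciPy's directed Hausdorff implementation.

```python
import math

def directed_hausdorff(A, B):
    """h(A, B): for each point in A, take its nearest point in B; keep the worst case."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff_distance(A, B):
    """Eq. (8): HD(A, B) = max(h(A, B), h(B, A))."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

Note that h(A, B) and h(B, A) generally differ, which is why Eq. (8) takes their maximum.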
-
5.
IOU and RVD
For the ISIC2018 dataset, we introduced two more targeted evaluation metrics: intersection over union (IOU) and relative volume difference (RVD). IOU is used to evaluate the similarity between the prediction and the ground truth, and RVD is used to measure the error rate
IOU = |A ∩ B| / |A ∪ B|, (9)
RVD = (|A| − |B|) / |B|, (10)
where A and B represent the target region in the predicted segmentation map and the real label, respectively.
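Eqs. (9) and (10) over flat binary masks can be sketched as follows (illustrative helpers, not the paper's code; the signed RVD convention is our assumption):

```python
def iou(pred, gt):
    """Eq. (9): |A ∩ B| / |A ∪ B| over two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    union = sum(1 for p, g in zip(pred, gt) if p == 1 or g == 1)
    return inter / union

def rvd(pred, gt):
    """Eq. (10): relative volume difference (|A| - |B|) / |B|,
    where A is the predicted region and B the ground-truth region."""
    return (sum(pred) - sum(gt)) / sum(gt)
```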
4.3. Ablation Study
To evaluate the influence of each module proposed in this paper on actual segmentation performance, an ablation study is conducted to test each module. Specifically, the fundamental four-layer classical U-Net model serves as the base network. By integrating various network blocks, such as NSAF blocks and PFB blocks, different configurations of the PS-UNet are constructed for the ablation analysis. Table 1 shows the ablation experiment results on the CHAOS dataset.
Table 1.
Experiment results of the ablation study on CHAOS dataset (Base U-Net).
| NSAF block | PFB block | Dice | Precision | Recall | Hausdorff distance |
|---|---|---|---|---|---|
| × | × | 0.9269 | 0.9523 | 0.9029 | 64.88 |
| ✓ | × | 0.9340 | 0.9621 | 0.9076 | 55.67 |
| × | ✓ | 0.9272 | 0.9534 | 0.9025 | 57.74 |
| ✓ | ✓ | 0.9399 | 0.9676 | 0.9138 | 39.96 |
Note: bold values indicate the optimal values.
The overall result is that incorporating both modules leads to improvements on both the training and test sets. Specifically, the Dice coefficient improves by 0.0071 after adding the NSAF block alone, an acceptable gain for a single added module. With the PFB block added alongside the NSAF block, the Dice coefficient improves by 0.013 and the HD by 24.92 relative to the baseline. Therefore, with the PFB block the model provides richer feature information, which allows the NSAF block to better incorporate contextual information. Combined with these ablation results, it is well demonstrated that the incorporated modules effectively improve the performance of the U-Net network.
Visual examination of Table 1 reveals that both the NSAF module and the PFB module individually enhance U-Net segmentation performance, albeit without substantial improvements in Dice and precision. First, adding the PFB module alone does not produce a significant improvement in the model's performance. This suggests that both average pooling and max pooling are lossy downsampling methods with inherent, unavoidable loss; although our intention was to mitigate information loss during downsampling by combining the advantages of both pooling methods through the PFB module, the experiments indicate that this enhancement is not highly effective. Second, when solely incorporating the NSAF module with dilated convolutions, the fusion of diverse receptive fields enhances the model's feature extraction capacity and thus its objective accuracy. The experimental data show a Dice increase of 0.0071, along with notable improvements in precision and Hausdorff distance (HD). This indicates that modifying the model's receptive field and engaging with the U-Net context significantly improves the model. In light of the findings in Table 1, a more comprehensive analysis and discussion is presented in Sec. 4.4.
Furthermore, feature maps of different resolutions carry different feature information for the same convolution kernel size. We therefore considered whether a pattern similar to the U-Net++ idea might emerge, that is, whether the final performance of the model would improve if only three or two feature-map resolutions were used. To this end, we conducted an ablation analysis of the number of different-resolution feature maps entering PS-UNet; the results on the ISIC2018 dataset are shown in Table 2.
Table 2.
The experimental results of the quantity of feature maps with different resolutions on ISIC2018 dataset.
| Quantity | Sampling rate | Dice | Precision | Miou | RVD |
|---|---|---|---|---|---|
| 2 | [1, 2] | 0.5852 | 0.6090 | 0.3217 | 0.4863 |
| 3 | [1, 2, 4] | 0.6927 | 0.7741 | 0.4824 | 0.3201 |
| 4 | [1, 2, 4, 8] | 0.7361 | 0.8345 | 0.5851 | 0.1829 |
| 5 | [1, 2, 4, 8, 16] | 0.8587 | 0.8796 | 0.6832 | 0.0932 |
Note: bold values indicate the optimal values.
The results in Table 2 show that five layers of downsampling still perform best; three and four layers are clearly insufficient by comparison. The reason is that if not enough feature maps of different resolutions enter the kernel selection module, the information it selects tends to be local. Only when feature maps of both large and small resolutions are input simultaneously can the kernel selection module function properly. This phenomenon fits well with the description in Ref. 26. Regrettably, this ablation experiment did not yield the surprise we hoped for: we expected that feature maps at different resolutions could act like downsampling modules of different depths, with a small number of resolutions perhaps providing some special information for learning, but the experimental results did not show such a trend. We retain this idea for future work. Follow-up experiments and explanations for this part are described in detail in Sec. 5.2.
4.4. Comparison with Other Advanced Models
4.4.1. Evaluation on CHAOS
To effectively showcase the superiority of the proposed PS5-Net through empirical evidence, we have compared it with several outstanding network models from recent years. The experimental results on the CHAOS dataset, comparing the performance of different networks, are presented in Table 3 along with the introduced evaluation metrics.
Table 3.
The comparative results for different segmentation methods on CHAOS dataset.
| Method | Year | Dice | Precision | Recall | Hausdorff distance |
|---|---|---|---|---|---|
| U-Net4 | 2015 | 0.9269 | 0.9523 | 0.9029 | 64.88 |
| U-Net++11 | 2018 | 0.9521 | 0.9710 | 0.9341 | 29.47 |
| nnU-Net46 | 2018 | 0.9457 | 0.9663 | 0.9261 | 39.26 |
| GIU-Net47 | 2019 | 0.9405 | 0.9722 | 0.9109 | 36.74 |
| MC-Net48 | 2019 | 0.9429 | 0.9619 | 0.9248 | 38.31 |
| I-UNet49 | 2020 | 0.9514 | 0.9713 | 0.9324 | 34.61 |
| MA-UNet50 | 2020 | 0.9554 | 0.9795 | 0.9326 | 34.67 |
| BS-UNet51 | 2020 | 0.9606 | 0.9795 | 0.9426 | 32.14 |
| MS-Net52 | 2021 | 0.9421 | 0.9734 | 0.9441 | 29.72 |
| SFF-Net53 | 2021 | 0.9560 | 0.9803 | 0.9329 | 27.64 |
| DP-Net54 | 2022 | 0.9233 | 0.9451 | 0.9192 | 34.57 |
| LiM-Net55 | 2022 | 0.9614 | 0.9756 | 0.9477 | 26.32 |
| Proposed | — | 0.9613 | 0.9815 | 0.9420 | 25.45 |
Note: bold values indicate the optimal values.
As shown in Table 3, the proposed PS5-Net achieves a precision of 0.9815 and an HD of 25.45, the best scores among the compared models. Compared with the best results of the other networks, the Dice and recall coefficients of PS5-Net differ from the highest scores by only 0.0001 and 0.0057, respectively, showing that our proposed network suffers no significant decrease.
Specifically, the highest Dice score comes from the model proposed by Kushnure et al.,55 an improvement on U-Net++ that incorporates multiscale attention. The starting point of our proposed model is likewise improved multiscale feature extraction, and the experimental results show that segmentation networks introducing multiscale attention score higher. The core idea of U-Net++ is a form of network-depth selection, which can indeed improve the segmentation ability of a model in some respects. By contrast, the classical U-Net proposed by Ronneberger et al.,4 which is also the reference network for this work, suffers a certain amount of information loss due to downsampling. The network model proposed by Zhou et al.11 confirms the validity of their idea through this experiment: our experimental data support the finding that different network depths yield distinct segmentation effects on different datasets, a notion worth considering when examining network depth in other works. Liu et al.47 improved segmentation performance by incorporating a liver distribution function and replicating features of the pooling layer, achieving a Dice coefficient of 0.9405. Su and Fang48 introduced a multichannel segmentation model that used gradient calculation to generate the training set, achieving a Dice coefficient of 0.9429. Huang et al.49 proposed transposed-resize (TR) convolution, which minimized information loss during downsampling and introduced a new loss function, leading to a Dice coefficient of 0.9514. Aslani et al.17 enhanced the U-Net network with a channel attention mechanism and multiscale information extraction, achieving a Dice score of 0.9554 on our test set.
Song et al.51 proposed bottleneck feature supervision, employing an encoding network and a segmentation network to enhance information flow and increase the perceptual field, resulting in a Dice coefficient of 0.9606. Liu et al.53 developed a spatial feature fusion convolutional network that merged spatial and high-level semantic features from early and late layers, achieving a Dice coefficient of 0.9560. Zhang et al.52 introduced MS-Net, a multiscale feature pyramid fusion network that utilized a codec structure formed by the fusion of the multiscale attention module and stacked feature pyramid module. This approach demonstrates promising processing capabilities, and the authors achieved commendable scores while adopting a concept similar to ours. However, the absence of multiscale image input limited the author to attain only a Dice score of 0.9421. Van Quyen and Kim54 proposed a dual prediction method to effectively capture both thin and large objects within complex street scenes. Employing a pyramid-like approach, this method fell short in achieving a high score, as the target environment did not involve medical images, resulting in a final score of only 0.9233.
To compare the experimental results more intuitively, and to account relatively fairly for the parameter count of each model, we plot the relationship between the four evaluation metrics and model volume in Figs. 6–9. The size of each circle represents the volume of the model, and the dashed line corresponds to the indicator on the y-axis. In the paper, we present the results on the CHAOS dataset; the corresponding plots for the ISIC2018 dataset are placed in the Appendix.
Fig. 6.
Precision comparison of segmentation results on CHAOS dataset.
Fig. 7.
Dice comparison of segmentation results on CHAOS dataset.
Fig. 8.
Recall comparison of segmentation results on CHAOS dataset.
Fig. 9.
HD comparison of segmentation results on CHAOS dataset.
4.4.2. Evaluation on ISIC2018
In the experiments on the ISIC2018 dataset, the maximum number of training epochs is 30. The experimental results are shown in Table 4. As shown in Table 4, the network proposed in this paper achieves the highest scores in Dice and precision; in particular, its Dice coefficient is significantly higher than those of the other U-Net-series networks. In terms of Miou and RVD, Swin-UNet achieves better scores.
Table 4.
The comparative results for different segmentation methods on ISIC2018 dataset.
| Method | Year | Dice | Precision | Miou | RVD |
|---|---|---|---|---|---|
| U-Net4 | 2015 | 0.7731 | 0.8642 | 0.6256 | 0.15331 |
| U-Net++11 | 2018 | 0.7968 | 0.8505 | 0.6557 | 0.11924 |
| nnU-Net46 | 2018 | 0.7977 | 0.8690 | 0.6430 | 0.13265 |
| U-Net3+12 | 2020 | 0.8177 | 0.8734 | 0.6634 | 0.11523 |
| MS-Net52 | 2021 | 0.8403 | 0.8691 | 0.6742 | 0.08776 |
| MSRF-Net56 | 2021 | 0.8392 | 0.8567 | 0.6679 | 0.09217 |
| Swin-UNet57 | 2021 | 0.8273 | 0.8774 | 0.6933 | 0.06518 |
| DP-Net54 | 2022 | 0.8169 | 0.8544 | 0.6570 | 0.14532 |
| Proposed | — | 0.8587 | 0.8796 | 0.6832 | 0.09329 |
Note: bold values indicate the optimal values.
Our comparative analysis of three pyramid-like segmentation networks52,54,56 revealed that they did not attain the highest scores for several reasons. First, they persist in utilizing traditional single-image pyramid parameter sharing during training. Furthermore, despite acquiring feature maps of varying scales, they did not implement more appropriate weight allocation for these feature maps.
Considering that the transformer-based network performed well on this dataset with fewer errors, it can be concluded that the model proposed in this paper relies on powerful contextual information modeling, combined with screening through the kernel selection module, to improve segmentation accuracy.
The comparative analysis of the data demonstrates that our proposed network structure consistently achieves higher scores in evaluation metrics compared with previous approaches, particularly due to advancements in feature utilization across different scales. These enhancements collectively showcase the segmentation superiority of our proposed network in this experiment.
Reviewing the experimental findings in Table 1, it is evident that improving the U-Net structure alone does not yield a notably significant enhancement, particularly when compared with the results in Tables 3 and 4. In essence, a single U-Net structure inadequately extracts crucial features from intricate medical data, with limited capacity to extract image features, particularly high-frequency information. Conversely, our complete PS5 structure, after integrating multiscale information, exhibits a considerable increase in scores. This underscores the primary role of multiscale information in the experiment's effectiveness. The central focus of this study is the integration of such multidimensional information and the judicious selection of these multidimensional details, which is crucial for surpassing the accuracy of the original single U-Net model. By feeding feature maps of varying resolutions into a single U-Net equipped with dilated convolutions, we maximize the utilization of each individual U-Net. Thus, superior scores are not achieved solely by adding more networks, but by incorporating multiscale information and selecting it prudently. This conclusion aligns with the fundamental focus and objectives of our research.
4.5. Qualitative Analysis
4.5.1. Analysis on CHAOS
To analyze the segmentation effect of different models more intuitively, we present segmentation outputs of selected models in Fig. 10. Four representative liver CT datasets were selected from the test set and segmented using three different models: the U-Net proposed by Ronneberger et al.,4 the LiM-Net proposed by Kushnure et al.,55 and our proposed PS5-Net. The original image row shows the liver CT image from the original dataset. The GT row shows the manually annotated ground-truth images. The proposed row displays the segmentation results of PS5-Net on the liver test set, the Kushnure et al.55 row shows the results of LiM-Net, and the Ronneberger et al.4 row presents the results of U-Net.
Fig. 10.
Comparison of segmentation results on CHAOS dataset. (a)–(d) represent liver CT images taken at different positions.
The comparison in Fig. 10 shows that the segmentation effect of the original U-Net on the liver dataset is average, which is closely related to U-Net's ability to process contextual information. The single splicing operation of U-Net does not compensate well for the information loss caused by downsampling, and it does not account for the information connection between multiple scales, which limits its feature extraction ability on this dataset. LiM-Net, proposed by Kushnure et al.,55 is a feature extraction network that adds multiscale attention to an improved U-Net++. Analysis of the network structure shows that LiM-Net has a remarkably flexible framework: it compares different depths to determine the most suitable segmentation depth for different samples, an observation our experiments further validate. To be more specific, the theoretical feature extraction ability of U-Net_L2 (a U-Net network with a depth of 2) is expected to be weaker than that of U-Net_L3 and U-Net. However, our experiments show that for particular categories of sample data, U-Net_L2 exhibits markedly superior feature extraction compared with both U-Net_L3 and U-Net. In general, U-Net_L2 does not achieve effective feature extraction on liver datasets from other body types, being limited by its own depth. This verifies that the network-depth idea proposed by U-Net++ does produce different effects on the segmentation of different datasets; this part of the experiment is discussed in detail in Sec. 5.2. LiM-Net also incorporates a multiscale attention mechanism, making the network more capable of interacting with contextual information, but LiM-Net only enhances the multiscale feature extraction of the encoder; there is no targeted fusion operation for the extracted multiscale features.
PS5-Net combines five layers of feature maps at different scales and uses the attention kernel selection module to adaptively fuse the features extracted at each scale, enabling full interaction between global and local information; the NSAF module inside the feature extraction network provides more effective contextual information interaction. Information loss of feature maps due to downsampling is thus avoided to the greatest extent. Although it may not be optimal on every evaluation metric, the specific results show that our proposed network extracts more feature information and fuses it more effectively. Even though some unavoidable defects remain in certain boundary details, the segmentation effect is significantly improved compared with previous work. In conclusion, the above analysis shows that our proposed PS5-Net fits the ground-truth map more closely in segmentation shape and size and can be applied more effectively to segmenting liver data from different body types.
As shown in Fig. 11, to visualize the regions of interest of the above methods, we compared visualizations of the last feature layer of each network on the CHAOS dataset, rendered as heat maps. Figure 11 clearly shows that the U-Net proposed by Ronneberger et al.4 focuses on many unnecessary features; such a region of interest may eventually lead to blurred segmentation boundaries. The problem arises mainly because U-Net's skip connection directly combines the shallow features of the encoder with the deep features of the decoder, which can lead to insufficient contextual information interaction and loss of some information. The LiM-Net proposed by Kushnure et al.55 and the SFF-Net proposed by Liu et al.53 also focus on a significantly smaller liver region than PS5-Net. Such a focus region eventually leads to segmentation results with lost or incomplete detail: a large receptive field reduces the accuracy of segmentation-region localization, whereas a small receptive field causes the loss of detail. PS5-Net focuses on the segmentation target more accurately and clearly, removing redundant feature areas while retaining local detail information, resulting in a final segmentation that is more accurate and complete.
Fig. 11.
Results of the last feature layer heat map under different methods on CHAOS dataset.
4.5.2. Analysis on ISIC2018
As shown in Fig. 12, we compared the segmentation results of PS5-Net with the results of three other efficient segmentation networks. It can be seen that the network proposed in this paper has the best segmentation performance. In the details of the green and red regions, the network proposed in this paper can restore more details. This also depends on the powerful contextual information modeling capability. Compared with Swin-UNet, the network proposed in this paper further reduces the area of segmentation errors.
Fig. 12.
Comparison of segmentation results on ISIC2018 dataset.
5. Discussion
5.1. Exploration of the Loss Function Factor
In this section, we will analyze the impact of the loss function on the segmentation performance of the model.
Following the experimental setup in Sec. 4.2, a certain proportion of DiceLoss is added to the binary cross-entropy to alleviate the impact of sample imbalance on network training. However, DiceLoss tends to make training unstable. Considering the respective advantages and disadvantages of BCEWithLogitsLoss and DiceLoss, we therefore examine the influence of the DiceLoss ratio λ on the evaluation results. The loss function with this variable is shown in Eq. (11)
Loss = λ · Loss_Dice + (1 − λ) · Loss_BCE. (11)
To find the optimal variable factor λ and optimize the feature extraction capability of the network model, comparison experiments were conducted on PS5-Net with different values of λ and then evaluated on the test set. The evaluation results are shown in Table 5. When λ is set to 0.6, the obtained Dice coefficient, precision, and HD values are all optimal, and the model achieves its best segmentation.
Table 5.
Different ratio evaluation results of network model on CHAOS dataset.
| Value of λ | Dice | Precision | Recall | Hausdorff distance |
|---|---|---|---|---|
| 0.8 | 0.9293 | 0.9551 | 0.9049 | 41.21 |
| 0.7 | 0.9236 | 0.9522 | 0.8968 | 42.17 |
| 0.6 | 0.9399 | 0.9676 | 0.9138 | 39.96 |
| 0.5 | 0.9339 | 0.9625 | 0.9071 | 39.21 |
| 0.4 | 0.9365 | 0.9613 | 0.9131 | 40.26 |
| 0.3 | 0.9307 | 0.9585 | 0.9045 | 40.79 |
Note: bold values indicate the optimal values.
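The λ selection reported in Table 5 amounts to a small grid search over candidate loss weights. As a sketch (Dice values copied from Table 5; variable names are ours):

```python
# Dice scores from Table 5, indexed by the loss-weight lambda
dice_by_lambda = {0.8: 0.9293, 0.7: 0.9236, 0.6: 0.9399,
                  0.5: 0.9339, 0.4: 0.9365, 0.3: 0.9307}

# Pick the lambda that maximizes the Dice coefficient
best_lambda = max(dice_by_lambda, key=dice_by_lambda.get)
```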
5.2. Effect of Network Depth on Experimental Results
As can be seen from the data in the experimental part of Sec. 4, U-Net++ and other models that use U-Net++ as the base network achieve good segmentation results. For this reason, we drew on the idea of U-Net++ and tested U-Net models of different depths, finding that on some particular data, networks of different depths do achieve different segmentation effects. The segmentation results are shown in Fig. 13.
Fig. 13.
Results of U-Net network segmentation at different depths.
Figure 13 shows intuitively that U-Net_L2 (a U-Net network with depth 2) segments this class of data better than a U-Net of normal depth, which verifies the idea proposed by U-Net++ and explains the good results of the U-Net++-based improved models in the experiments above. Regrettably, this study does not utilize this idea, because our proposed framework revolves around multiscale feature maps with parallel input. Nevertheless, it raises the question of whether the impact of U-Net's depth also varies across images of different resolutions. With limited resources, we retain this idea and will choose a more suitable way to incorporate it in future work.
5.3. Exploration of the NSAF Module
To increase the model’s receptive field and the interaction between model contexts within the U-shaped structure, we chose to use the NSAF module. In this section, we will specifically discuss the effectiveness of the NSAF module.
Extensive research has revealed that the chief constraint of CNN networks lies in their limited receptive field. To circumvent augmenting the intricacy of the foundational model, dilated convolutions have proven to be an efficacious means for expanding the receptive field. A critical inquiry pertains to whether the U-shaped structure necessitates an influx of additional contextual information and, if so, whether such augmentation is indeed effective. To address this, we present Figs. 14(a) and 14(b), portraying the feature maps at the antecedent layer of the inference process for both the original U-shaped structure and the U-shaped structure enhanced with the inclusion of the NSAF module. The outcomes indicate that the integration of the NSAF module within a solitary U-shaped structure facilitates the extraction of a more substantial breadth of valuable information, particularly high-frequency details, thereby constituting an effective enhancement for images characterized by intricate boundary delineations. In addition, we conducted testing on the input feature map of the NSAF module, which amalgamates the feature maps from the current layer and the subsequent layer. As depicted in Fig. 15, it is evident that the fusion of information from these two layers results in a more comprehensive representation of global information within the feature map. Notably, for the specific task at hand, the network is supplied with feature maps of diverse resolutions, affording it the capacity to comprehensively assimilate the latent intelligence embedded within the images, all without necessitating an expansion of the network’s depth. Through the amalgamation of these parallel five-layer structures, we are able to optimize the efficacy of the individual U-shaped structure. 
Looking ahead, our attention may potentially gravitate toward maximizing the model’s potential by manipulating the external input, a direction that continues to represent a meritorious avenue for research within the domain of contemporary CNN networks.
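The receptive-field gain from dilated convolutions, which motivates the NSAF module, can be checked with the standard recurrence (effective kernel d·(k−1)+1). This is a generic sketch, not tied to the exact PS5-Net configuration:

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    Each layer is (kernel_size, stride, dilation); a dilated conv's
    effective kernel size is dilation * (kernel_size - 1) + 1.
    """
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1
        rf += (k_eff - 1) * jump  # grow by the effective kernel, scaled by stride product
        jump *= s
    return rf
```

For example, a single 3×3 convolution with dilation 2 covers the same 5-pixel extent as two stacked undilated 3×3 convolutions, at the cost of only one layer, which is how dilation enlarges the receptive field without deepening the network.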
Fig. 14.
(a) The intermediate feature map processed by a single U-Net structure combined with the NSAF module. (b) The intermediate feature map processed by the original U-Net structure.
Fig. 15.
Intermediate feature map of the NSAF module, i.e., the feature map after incorporating contextual information.
6. Conclusion
In this paper, we introduce a multiscale resolution medical image feature extraction network, PS5-Net, designed for image segmentation and feature extraction in complex liver images. Leveraging the U-Net network as a baseline, our aim is to enhance the extraction of multiscale image features and prioritize them based on their significance from local to global aspects, while effectively fusing feature information across different scales. Experimental results on the segmentation datasets CHAOS and ISIC2018 demonstrate that the PS5-Net network outperforms other advanced segmentation models. Specifically, the Dice scores reached 0.9613 for liver segmentation and 0.8587 for skin lesion segmentation. PS5-Net has significantly enhanced the segmentation efficacy of liver CT datasets captured in various postures, particularly in processing image detail information, surpassing other mainstream segmentation networks. These findings indicate its potential to better assist clinicians in clinical examinations and diagnoses.
After conducting this experiment and analysis, we acknowledge potential areas for improvement in this work in the future. First, the varying quality of medical image dataset annotations due to subjective factors poses a challenge, as manual annotation by professional doctors requires significant human and material resources. This limitation results in a relatively small number of medical image datasets. Future improvements could focus on enhancing annotation quality, expanding datasets, and including a broader range of liver data from different body types. Second, our experiment verified the impact of different network layers on segmentation results as proposed by the authors of U-Net++. Considering the experimental findings, we anticipate that this concept will serve as a key research direction for future enhancements. PS5-Net achieves more precise medical image segmentation from a novel perspective, combining fundamental vision approaches and transforming the application of the original U-Net. We believe that this work, together with the experimental outcomes, will offer valuable reference and inspiration for future endeavors.
7. Appendix: Visual Comparative Experiments on the ISIC2018 Dataset
Figures 16–19 depict the comparative experimental results on the ISIC2018 dataset. The size of the circle in each figure represents the volume of the model, and the dashed line corresponds to the indicator on the y-axis.
Fig. 16.
Dice comparison of segmentation results on ISIC2018 dataset.
Fig. 17.
Precision comparison of segmentation results on ISIC2018 dataset.
Fig. 18.
Miou comparison of segmentation results on ISIC2018 dataset.
Fig. 19.
RVD comparison of segmentation results on ISIC2018 dataset.
Acknowledgments
The research was supported by the Center for Ocean Mega-Science, Chinese Academy of Sciences (Grant No. KEXUE2019GZ04) and GHfund B (Grant No. GHFUNd202107021586).
Biographies
Fuchen Li is a master’s student at Qingdao University of Science and Technology, mainly engaged in image segmentation.
Yong Liu, PhD, is an associate professor and a master’s supervisor. During this period, she participated in two projects: “Research on the Construction Method of Marine Knowledge Ontology” under the New Century Excellent Talents Support Program of the Ministry of Education, and research on the topic “Industrialization and Application Demonstration of Digital Home Terminal Equipment” under the National Science and Technology Support Program.
JianBo Qi is a master's student at Qingdao University of Science and Technology, mainly engaged in computer vision and deep learning.
Yansong Du is a master’s student at National University of Singapore, mainly focused on computational imaging and new camera design.
QingYue Wang is a master's student at Qingdao University of Science and Technology, mainly engaged in computer vision and object detection.
WenBo Ma is a master’s student at Harbin Institute of Technology, mainly focused on computer vision.
XianChong Xu is a master’s student at Qingdao University of Science and Technology, mainly engaged in computer vision and ocean object detection.
ZhongQi Zhang is a master’s student at Qingdao University of Science and Technology, mainly engaged in computer vision and object segmentation.
Contributor Information
Fuchen Li, Email: lfchen@yeah.net.
Yong Liu, Email: liuyongqust@163.com.
JianBo Qi, Email: qingningya@icloud.com.
Yansong Du, Email: e0682691@u.nus.edu.
QingYue Wang, Email: m1360171813@163.com.
WenBo Ma, Email: 22s121106@stu.hit.edu.cn.
XianChong Xu, Email: 2268346832@qq.com.
ZhongQi Zhang, Email: qgg12138@gmail.com.
Disclosures
The authors confirm that there are no competing interests to disclose in relation to this work.
Code and Data Availability
The code associated with this article has been released as open source and is available in a GitHub repository at https://github.com/andre20000131/PS5. The data come from two publicly available medical image segmentation datasets; specific details are provided in the article.
Author Contributions
F.C.L. primarily focused on thesis writing, experimental design, and code implementation. Y.L. provided guidance on thesis ideas and algorithms, as well as the necessary experimental equipment and facilities. J.B.Q., Y.D., Q.Y.W., W.B.M., X.C.X., and Z.Q.Z. contributed to algorithm guidance, code testing, and various other aspects.