PDRF-Net: a progressive dense residual fusion network for COVID-19 lung CT image segmentation

Xiaoyan Lu; Yang Xu; Wenhao Yuan

doi:10.1007/s12530-023-09489-x

. 2023 Feb 17:1–17. Online ahead of print. doi: 10.1007/s12530-023-09489-x

PDRF-Net: a progressive dense residual fusion network for COVID-19 lung CT image segmentation

Xiaoyan Lu ¹, Yang Xu ^1,^2,^✉, Wenhao Yuan ¹

PMCID: PMC9936947 PMID: 38625320

Abstract

The lungs of patients with COVID-19 exhibit distinctive lesion features in chest CT images. Fast and accurate segmentation of lesion sites from CT images of patients’ lungs is significant for the diagnosis and monitoring of COVID-19 patients. To this end, we propose a progressive dense residual fusion network named PDRF-Net for COVID-19 lung CT segmentation. Dense skip connections are introduced to capture multi-level contextual information and compensate for the feature loss problem in network delivery. The efficient aggregated residual module is designed for the encoding-decoding structure, which combines a visual transformer and the residual block to enable the network to extract richer and minute-detail features from CT images. Furthermore, we introduce a bilateral channel pixel weighted module to progressively fuse the feature maps obtained from multiple branches. The proposed PDRF-Net obtains good segmentation results on two COVID-19 datasets. Its segmentation performance is superior to baseline by 11.6% and 11.1%, and outperforming other comparative mainstream methods. Thus, PDRF-Net serves as an easy-to-train, high-performance deep learning model that can realize effective segmentation of the COVID-19 lung CT images.

Keywords: COVID-19, CT image segmentation, Progressive feature fusion, Vision transformer, Dense skip connections

Introduction

As a serious infectious disease, COVID-19 is one of the greatest threats facing humanity currently (Wu et al. 2020; Wang et al. 2020a). COVID-19 causes inflammatory changes, exudative changes, and interstitial changes in the lungs. In severe cases, breathing difficulties or even respiratory failure may occur. Therefore, early detection and diagnosis for COVID-19 patients has important research value and practical significance.

With the rapid popularity of convolutional neural networks (CNNs) in the field of computer vision, many excellent networks have achieved great success in medical image segmentation tasks (Munusamy et al. 2021).For medical images, CNNs are superior to traditional label-based segmentation methods. To reduce external interference, Wang et al. (2020b) proposed a novel network (COPLE-Net) for COVID-19 lesion segmentation. The framework utilizes the proposed robust Dice loss function to learn COVID-19 lesion segmentation from noisy labels and achieves excellent performance. Wang et al. (2021a) proposed an AI system for rapid diagnosis of COVID-19. The system locates lung lesion regions and identifies COVID-19 infection features through segmentation models and classification models, respectively. To better capture rich contextual relationships, Zhou et al. (2021b) proposed a U-Net segmentation model that combines spatial and channel attention, and created Tversky loss to deal with small region segmentation. Elharrouss et al. (2022) proposed a multi-task learning network for re-segmentation of possible lesion areas in CT images. However, associating the original image with the segmentation map preserves the interference from irrelevant regions. Therefore, the method is less generalizable.

Many of the proposed image segmentation methods based on CNNs perform well. However, owing to the limitation of the receptive field and weight sharing of the convolution operation, they still lack the interactive modeling ability for the global and long-distance semantic information in the image (Dosovitskiy et al. 2021; Zhou et al. 2021a). With the widespread adoption of transformers in image recognition processing tasks, the problem of modeling global and long-range dependencies can be effectively addressed. Compared with a CNN’s coarse segmentation visualization, the transformer’s token has fine-grained attention. Its self-attention graph structure has the ability to correlate long-distance interactions between image regions, unlike CNNs that are limited only to the range of receptive fields. Chen et al. (2021) first proposed TransUNet, which explored the feasibility of a transformer in medical image segmentation. It transforms images into sequences and encodes global information. It also combines the structural features of U-Net to effectively utilize low-level CNN features. Liu et al. (2021) designed the Swin transformer by referring to the feature pyramid used in CNNs and introduced a hierarchical construction approach to learn information at different scales.

In general, although CNN has been widely used in medical image segmentation, there are still some limitations on the model. In order to improve the model’s segmentation performance, many researchers have made improvements in terms of model structure, objective function, and multi-task. Although reconstructing the objective function with robustness (Wang et al. 2020b) speeds up the convergence of the model, the segmentation accuracy improvement is not significant enough. For the improvement of model structure (Wang et al. 2021a; Zhou et al. 2021b), the attention mechanism is a popular research target. Generally, the class activation propagation in the feature map is enhanced by boosting model context relationship. However, the blurred boundary of the segmentation region is still an urgent problem to be solved. In addition, the multi-task learning method (Elharrouss et al. 2022) has led to a significant improvement in segmentation accuracy, but the emergence of large models for practical applications has brought inconvenience. With the advent of visual transformer, its long-distance feature dependence compensates for the shortcomings of CNNs (Chen et al. 2021; Liu et al. 2021). However, a large number of transformer layers make the inference time of the model increase significantly, which is very harsh for hardware requirements.

In this study, we aimed to improve the segmentation accuracy of lung lesions without significantly increasing the model size. Based on the long-distance interaction modeling of visual transformer, we combine a small number of transformer layers with CNN to propose a novel progressive dense residual fusion network (PDRF-Net). The proposed model also extends to dense connections based on the advantages of U-Net’s (Ronneberger et al. 2015) skip connections. Our method captures richer and more detailed image information from multiple aspects to improve network segmentation performance. Experimental results show that compared with other CNNs, PDRF-Net not only solves the problem of feature transmission loss but also increases the long-distance feature dependence. The main contributions of our paper are fourfold:

Based on U-Net, we proposed a new progressive dense residual fusion network for COVID-19 lung CT infection segmentation: PDRF-Net. This network ensures a moderate number of model parameters and achieves high accuracy in segmentation.
We introduced dense skip connections that allow the data flow of the model to move away from a single pass. This connection method can bring more convolutional processing units and interconnect them, reducing the loss caused by image features during transmission.
We designed two aggregated residual modules: down-sampling aggregated residual (DAR) module and up-sampling aggregated residual (UAR) module. The DAR module combines the remote dependence of the visual transformer and the residual structure of CNN; the UAR module uses a multi-branch residual structure, combined with channel attention enhancement feature representation. They not only can learn global and long-range semantic information interaction well, but also result in precise infection location.
Achieving low-loss fusion of feature information from different branches is one of the keys to achieving high-performance segmentation in multi-branch networks. We designed bilateral channel pixel weighted (BCPW) module, which can fuse multiple branches step by step to obtain high-performance feature fusion performance.

The rest of this paper is organized as follows: Sect. 2 presents the related work. Section 3 introduces the structure of the proposed PDRF-Net, the basic building blocks of the model, and our loss function. Section 4 presents the experiment settings, qualitative and quantitative results. The last section summarizes the work.

Related work

In this section, we describe three related medical image segmentation works, i.e., transformer-based medical image segmentation, residual blocks, and dense skip connections.

Transformers in medical image segmentation tasks

Recently, transformer-based architectures have been increasingly used in medical image segmentation tasks. Based on the Swin transformer, Cao et al. (2021) constructed a symmetric encoder-decoder architecture with skip connections. Self-attention from local to global is implemented in the encoder; in the decoder, global features are up-sampled to the input resolution for corresponding pixel-level segmentation prediction. Wang et al. (2021b) designed a new segmentation framework that effectively combines a 3D CNN and transformer for the first time. The encoder uses the 3D CNN to extract local contextual information and uses the transformer to globally model the input features. The decoder uses the transformer-embedded features for up-sampling to perform lesion segmentation prediction. The transformer proposed by Peiris et al. (2021) has a U-shaped encoding-decoding structure that can directly process the 3D volume data of medical images to improve the computational efficiency of volume semantic segmentation. Based on Swin transformer, Dong et al. (2022) designed the Cross-Shaped Window self-attention mechanism to calculate the elemental correlations in horizontal and vertical directions in parallel to establish long-distance dependencies. In addition, Locally-enhanced Positional Encoding (LePE) is introduced into the self-attention mechanism to process local position information. Hatamizadeh et al. (2022) redefine the 3D medical image segmentation task as a sequence-to-sequence prediction problem. The encoder with transformer allows feature extraction at different resolutions, and then the features at each resolution are fed into decoder based on full convolutional neural networks (FCNNs) for prediction using skip connections.

To sum up, the above methods rely on model pretraining, or only focus on single disease lesion segmentation or ignore the high memory and operation cost of transformer with deep superposition. Considering these problems, we combine the few-layer transformer with CNNs and embed it into encoder’s each stage to form a progressive feature extraction architecture. The hybrid CNN-Transformer architecture enables medical image segmentation without pretraining. In addition, the feature maps of different stages provide more detailed information for target region segmentation by progressive combination.

Residual blocks in medical image segmentation tasks

In recent years, models based on residual blocks have achieved good performance in image segmentation. Inspired by residual connections, Xiao et al. (2018) proposed a model similar to U-Net, which added a residual connection between the sub-modules of the U-Net to perform retinal image segmentation. Bhalerao and Thakur (2019) used 3D U-Net extended with residual connections to reduce the impact of gradient explosion and performance degradation caused by deep networks. The network also utilizes 3D spatial information to the fullest possible extent. Based on U-Net, Alom et al. (2019) designed a new recursive residual block, which helps the network to better obtain feature information through the combination of residual connection and recursive convolution. Chen et al. (2020) improved the network by adding a residual structure and an attention mechanism to U-Net to improve segmentation of lung images, but this did not solve the problem of insufficient training samples. Zhang et al. (2020) proposed a 3D residual network with context residual blocks to extract inter-slice pixel information. Mu et al. (2022) proposed an attentional residual U-Net model based on differential preprocessing and geometric postprocessing, which uses residual-based skip connection to achieve deep supervision. In addition, a multiscale supervision strategy was designed to integrate multiscale semantic information to complete angiographic image segmentation.

The above algorithms are similar in some aspects, such as using a combination between convolution, global pooling, down-sampling, and residual branches to enhance feature extraction and delivery in the network. However, due to the limited receptive field of convolution kernels, it is difficult for CNN-based networks to model the long-distance dependence, which hinders the learning of global semantic information. To this end, we introduce the transformer layer into the CNN-based residual module, and use the transformer’s ability to establish long-distance dependencies in feature mapping as an auxiliary. Local and global feature learning makes the overall residual module have a more powerful feature representation capability.

Dense skip connections in medical image segmentation tasks

Dense skip connections not only ensure maximum information flow between layers and improve gradient flow but also enhance feature propagation. Tang et al. (2021) proposed a dual attention-based (DA-DSUnet) framework for automatic head-and-neck tumor segmentation in MRI images. To combat the gradient disappearance problem, dense blocks are introduced instead of traditional convolution, and thus, feature fusion is achieved. Banerjee et al. (2022) designed a novel hybrid CNN architecture (SIU-Net) for fully automated bone feature detection, in which dense convolution operations help to further assess the severity of scoliosis. Wang et al. (2022) proposed dense skip connections in multimodal medical image segmentation, thus retaining more contextual information and using multilevel features to help recover images. To facilitate the classification of breast cancer in histopathological images, Chattopadhyay et al. (2022) proposed a dense residual dual-shuffle attention network, whose dense connections not only solve the overfitting and gradient disappearance problems but also enable the trained CNN to obtain depth features. Chen et al. (2022) introduced a fuzzy skip connection module in the U-Net architecture to transform low-level features into high-level semantic features, which is conducive to segmenting small and variable target regions. Based on fuzzy feature mapping, a target attention mechanism is designed to improve the sensitivity of the network to target features.

Although the above methods have good cross-feature reconstruction ability, they have numerous network branches and high calculation cost. Different from the above methods, we use the same coding branch to decode the features of different stages, which ensures the dense transmission of feature information while maintaining moderate network complexity.

Methodology

Overview of PDRF-Net

Progressive feature fusion can gradually extract more abundant fine-grained feature information and reduce information loss. As a typical progressive feature fusion network, PMCNet (He et al. 2022) gradually integrates the multiscale features of adjacent encoding layers, and learns the features of each layer by aggregating fine-grained details and high-level semantics. Different from PMCNet, our proposed PDRF-Net pays more attention to the dense fusion and decoding of fine-grained details and high-level semantics of each stage, and then progressively fuses the features of each branch to achieve optimal prediction.

Figure 1 illustrates the architecture of our proposed PDRF-Net for COVID-19 lung CT image segmentation. It is an end-to-end architecture with a four-stage encoding-decoding part (as shown in stage 1–stage 4 of Fig. 1) and BCPW modules. The encoding part of stage 1–stage 3 utilizes the DAR module, which combines the residuals with the transformer to extract long-distance semantic information. We elicit the corresponding decoding branch for each encoding node for multi-scale pixel restoration. The decoding part then restores the infected region stage by stage based on the extracted semantic information, where the stage 1–stage 3 branches perform up-sampling aggregated residual operations on the features. Each sampling node corresponding to stage 1–stage 4 uses the dense skip connections to process the input feature tensor, and each dense concatenation can further enhance the feature information movement at each layer. Thus, the high-level semantic features and low-level contour features are effectively combined to better segment the lesion regions. The output features of stage 1–stage 4 are stacked step by step into the BCPW modules to maximize the recovery of image information, highlight the most discriminative feature areas, and accomplish segmentation of lung CT images.

Fig. 1 — Overview of the proposed PDRF-Net for COVID-19 CT segmentation, which adopts a progressive encoding-decoding structure, captures contextual feature information stage by stage, and then feeds it stage by stage to the BCPW modules to obtain the final segmentation result

Dense skip connections

The skip connections between the encoding and decoding parts in U-Net preserve the spatial information that is lost during the encoding process. However, the skip connections between the decoding and encoding parts are one-to-one, resulting in details such as location and boundary information still being lost.

To solve this problem, we propose dense skip connections. We add more nodes without extending the network depth, which makes each stage no longer has only features from both ends, and retains more detailed image information in the encoding part. From the relation of input and output between layers, all decoding nodes combine details from the symmetric encoding node and the corresponding upper-level encoding node. In addition, the sampled features of the encoding nodes can be reused in the multi-level structure, allowing the mining of small-sized semantic information hidden in the shallow network and improving feature flow delivery to enhance multi-scale information in the higher-level network. The equation is defined as:

\begin{matrix} y_{k, i} = \{\begin{matrix} U A R (y_{k, i - 1}) \oplus y_{k - 1, i - 1}, & if k \leq 3 \\ U p (y_{k, i - 1}) \oplus y_{k - 1, i - 1}, & if k = 4 \end{matrix}) \end{matrix}

where $x_{in}$ is input, $y_{k, i}$ is the feature map in the k stage and i is the Sampling node; $\oplus$ denotes concatenating feature maps on the channel dimension; $U A R (\cdot)$ and $U p (\cdot)$ are up-sampling. The process of the densely connected convolutional layer is shown in Fig. 2.

Fig. 2 — Illustration of dense skip connections at different complexities. From upper to lower, stage 1 has a one-layer connection shown as yellow; stage 2 has two-layer connections; layer 2 is shown in green; stage 3 has three-layer connections; layer 3 is shown in purple; and stage 4 has four-layer connections; layer 4 is shown in red (colour figure online)

Aggregated residual module

In the down-sampling and up-sampling of stage 1–stage 3 of the architecture, the feature tensor passes through the aggregated residual modules in turn. Since the convolution paths of different branches focus on different feature information, the same feature map is obtained differently after different convolution paths. However, the coupling of the feature maps obtained from different paths is low. Combining the feature maps of all convolutional paths and fusing them can complement each other with the missing feature information, which is helpful to obtain more complete image feature information. In this study, two aggregated residual convolution structures with different sizes are designed, as shown in Figs. 3 and 4.

Fig. 3 — a Illustration of the DAR module, b illustration of the detailed structure of the transformer block

Figure 3 shows the DAR module for the encoding part of stage 1–stage 3 in the network. Specifically, the input feature passes through four branches. In the topmost branch, the feature $F_{1}$ is obtained by $1 \times 1$ convolution and atrous convolution (Yu and Koltun 2015). Different dilation rate can maximize the extraction of feature information in different receptive fields. The receptive field calculation of atrous convolution is similar to that of standard convolution. Since the atrous convolution can be regarded as filling 0 in the standard convolution kernel, it can be imagined as a standard convolution kernel with enlarged size. Then, the receptive field’ size of atrous convolution can be calculated by using the way that the standard convolution kernel calculates the receptive field. For the atrous convolution with convolution kernel size k and dilation rate r, the calculation formula of receptive field F is as follows:

\begin{matrix} F = k + (k - 1) (r - 1) \end{matrix}

where k is the actual size of the convolution kernel after expansion; and r is the convolution’s dilation rate, which increases exponentially, being 4, 8, and 12 respectively.

For reducing the loss of input feature information, the input feature X is operated by $1 \times 1$ convolution, $3 \times 3$ convolution, and $5 \times 5$ convolution respectively in the first three branches, and then dimension concatenation is implemented to obtain feature $F_{2} .$ In the bottom branch, the input feature X goes through $1 \times 1$ convolution and transformer block, while the ReLU activation function is used to suppress the negative propagation to obtain the feature $F_{3} .$ Here, we used a transformer block similar to MobileViT (Mehta and Rastegari 2021) based on the embedded transformer in MobileNetV2 (Sandler et al. 2018), which combines the CNN and transformer to achieve high-quality performance in joint feature extraction. Finally, $F_{1},$ $F_{2},$ and $F_{3}$ are concatenated in the channel to obtain the final output feature Y.

The detailed structure of the transformer block in the DAR module is shown in Fig. 3b. For a given input tensor $X \in R^{H \times W \times C},$ where H, W, and C are the height, width, and number of channels of the feature tensor X, respectively. The input tensor X first by $3 \times 3$ convolution and $1 \times 1$ convolution operations to produce $X_{U} \in R^{P \times N \times d} .$ To enable MobileViT to learn global representations with spatial inductive bias, we unfold $X_{L}$ into N non-overlapping flattened patches $X_{U} \in R^{P \times N \times d} .$ Here, $P = w h,$ $N = \frac{WH}{p},$ w, h is the width and height of a patch. $X_{G} \in R^{P \times N \times d}$ is obtained by applying transformer to $X_{U},$ which can be formulated as:

\begin{matrix} X_{G} (p) = Transformer (X_{U} (p)), 1 \leq p \leq P . \end{matrix}

Then, we can fold $X_{G} \in R^{P \times N \times d}$ to obtain $X_{F} \in R^{H \times W \times d},$ and obtain a feature map with the same dimensions as the original input feature map. After a $1 \times 1$ convolution operation, it is concatenated with the input feature, and finally, a $3 \times 3$ convolution is used to obtain the final output feature Y.

Effective integration of low-level sampling features in network up-sampling can further improve the accuracy of pixel classification. In view of this, we design the up-sampling residual module, as shown in Fig. 4. Firstly, the high-level feature is convolved by $1 \times 1$ convolution and up-sampling, and then the feature $F_{1}$ is obtained by channel concatenating with the low-level feature after $1 \times 1$ convolution. The obtained $F_{1}$ feature passes through two branches, namely $1 \times 1$ convolution and twice $3 \times 3$ convolution, and obtains the feature $F_{2},$ $F_{3}$ in turn. BN is used to accelerate the convergence speed of the network after each convolution. $F_{3}$ uses PReLU activation function to convolve with $1 \times 1$ convolution and BN to obtain the feature $F_{4} .$ After $F_{2}$ and $F_{4}$ are concatenated and through the PReLU activation function, the global average pooling is used to aggregate the global environment information of the input features. Finally, we use two $1 \times 1$ convolutional layer followed by ReLU and Sigmoid functions to generate the weights of each layer along the channel dimension, and obtain the final output feature Y. PReLU is defined as:

\begin{matrix} PReLU (x_{i}) = \{\begin{matrix} x_{i} & i f x_{i} > 0 \\ a_{i} x_{i} & i f x_{i} \leq 0 . \end{matrix}) \end{matrix}

It can introduce a learning parameter to prevent the problem that the gradient is zero and the network cannot be updated. The feature $F_{1},$ $F_{2},$ $F_{3},$ $F_{4}$ can be defined as:

\begin{matrix} F_{1} = f_{1} (X_{L}) \oplus (U_{P} (f_{1} (X_{H}))) \end{matrix}

\begin{matrix} F_{2} = BN (f_{1} (F_{1})) \end{matrix}

\begin{matrix} F_{3} = {[BN (f_{2} (F_{1}))]}^{*} \end{matrix}

\begin{matrix} F_{4} = BN (f_{1} (PReLU (F_{3}))) . \end{matrix}

The final output feature Y can be defined as:

\begin{matrix} Y = (PReLU (F_{2} \oplus F_{4})) \times f_{GAP} (X_{H}) . \end{matrix}

In the formula, $X_{L},$ $X_{H}$ refer to low-level features and high-level features, $f_{1},$ $f_{2},$ $f_{3}$ refer to $1 \times 1$ convolution, $3 \times 3$ convolution, and $5 \times 5$ convolution, $\oplus$ refers to channel concatenation, $*$ refers to secondary convolution, $\times$ refers to the multiplication of corresponding elements, $U_{p}$ refers to up-sampling, $f_{GAP}$ refers to global average pooling.

Bilateral channel pixel weighted module

Dense skip connections help the encoder to preserve boundary information. While the feature map of the encoder also generates noise and shadows of pseudo-infection. To further capture more information about the fused contextual features, we propose a parallel transmission strategy using the BCPW module to minimize the loss of feature transmission caused by serial convolution. The BCPW module is shown in Fig. 5.

Fig. 5 — a Illustration of the detailed structure of BCPW, b illustration of PW block

Specifically, given the low-level feature and high-level feature from the decoder layer, where H, W, and C are the height, width, and the number of channels of $X_{L}$ and $X_{H},$ respectively. Firstly, the low-level feature and the high-level feature are concatenated to obtain the feature F and then are sent to two branches, where the $L_{1}$ branch is $1 \times 1$ convolution and $3 \times 3$ convolution to obtain the feature $F_{I} .$ Then $F_{I}$ is sent to pixel weighted (PW) block to generate pixel weight:

\begin{matrix} W_{i} = softmax (\frac{e^{x_{i}}}{\sum_{i = 1}^{HW} e^{x_{i}}}) \end{matrix}

\begin{matrix} W_{F} = Fold ((\begin{matrix} w_{11} & \dots & w_{1 i} \\ ⋮ & ⋱ & ⋮ \\ w_{c 1} & \dots & w_{ci} \end{matrix})) \end{matrix}

\begin{matrix} F_{K} = F_{I} \cdot W_{F} . \end{matrix}

The $L_{2}$ branch operation is the same as above. Then, the feature obtained from $L_{1}$ and $L_{2}$ branches are multiplied by the input features, and then the elements are added. Finally, the obtained features and the input features complete channel concatenation, and the final segmentation result is obtained by the ReLU activation function.

First, the convolutions of three different sizes of receptive fields, $1 \times 1,$ $3 \times 3,$ and $5 \times 5$ are alternately used in parallel to reduce the number of parameters. Then, batch normalization and average pooling processing are performed. The obtained feature blocks are multiplied with the original features. Finally, the dual-channel information is spliced in the channel dimension to enhance the interactivity of the cross-channel information. The BCPW module realizes feature fusion of different sizes of receptive fields and different levels and maximizes the extraction of more feature information from the image from each dimension.

The first feature fusion block BCPW1 in Fig. 1 does not adopt a double-ended input. Specifically, we split the input feature map into two subblocks and then adopt parallel passing. First, the two branches are operated by convolutional layers with receptive field sizes of $1 \times 1,$ $3 \times 3,$ and $1 \times 1,$ $5 \times 5,$ respectively, and then passed through the batch normalization process and ReLU to suppress negative values. After that, the same fusion method as in Fig. 5 is performed. The overall flow is shown in Fig. 6.

Fig. 6 — Illustration of the detailed structure of BCPW1

Loss function

The choice of loss function has a great influence on the network model. An appropriate loss function can guide the network to train in the right direction and improve the segmentation ability of the model. In this paper, we utilize a combined loss function composed of Dice coefficient (Dice) loss (Milletari et al. 2016) and binary cross-entropy (BCE) loss to segment COVID-19 lung infected regions.

Dice loss

The Dice loss function is generally chosen for image segmentation. It mitigates the problem of imbalance in the number of positive and negative samples. The Dice loss function is defined as:

\begin{matrix} L_{D ice} = 1 - \sum_{a = 1}^{I} \frac{\sum_{b = 1}^{N} p (a, b) g (a, b)}{\sum_{b = 1}^{N} p (a, b) + \sum_{b = 1}^{N} g (a, b)} \end{matrix}

where N is the total number of pixels in the image, and I is the number of categories. In this study, $I (= 2)$ represents the COVID-19 lung infection regions and background. $p (a, b) \in [0, 1]$ represents the probability that pixel b in the prediction image belongs to class a, and $g (a, b) \in [0, 1]$ represents the true probability that pixel b in the image belongs to class a.

BCE loss

Training of the segmentation model means to obtain a maximized probability of the accuracy of segmenting lung infected regions. The BCE loss is widely used in the performance evaluation of binary segmentation tasks. The mathematical formula is defined as:

\begin{matrix} L_{BCE} = - \frac{1}{N} \sum_{i = a}^{N} [g_{a} log p_{a} + (1 - g_{a}) log (1 - p_{a})] \end{matrix}

where N represents the total number of pixels in the image; g represents the ground truth value of the pixel; and p represents the predicted value of the pixel. In this way, when g is 0, the first term of the formula is 0, and p needs to be 0 as far as possible to make the second term of the formula smaller. When g is 1, the second term is 0, and p needs to be 1 as much as possible to make the value of the first term smaller to achieve the desired effect of making p as close to g as possible. To ensure that the network’s output is between 0 and 1, we usually add a sigmoid function.

To better improve the stability of model training, we combine Dice loss and BCE loss to propose a new loss function. The overall loss function for PDRF-Net training is as follows:

\begin{matrix} L_{total} = L_{Dice} + L_{BCE} . \end{matrix}

Experimental design

Dataset and experimental details

Compared with other pathological images, COVID-19 CT images are more complex and easily confused with other lung diseases. It is very difficult for encoders to extract effective segmentation features. Secondly, the COVID-19 infection regions are diffuse, uncertain in location, and variable in shape, which requires high ability to extract detailed features of the segmentation model. Therefore, we used the COVID-19-1 dataset and the COVID-19-2 dataset to evaluate the proposed PDRF-Net. In our experiments, the distribution of training set and test set is 8:2.

COVID-19-1: We select COVID-19 CT scans from Kaggle as dataset 1. This dataset contains 20 CT scans of patients diagnosed with COVID-19 as well as specialists’ manual segmentation of lung infections. There are 11,191 slices.

COVID-19-2: We take the dataset consisting of the COVID-19 CT segmentation dataset and the COVID-19 CT Segmentation dataset nr.2 as dataset 2. The COVID-19 CT segmentation dataset includes 100 axial CT images of 20 COVID-19 patients collected by the Italian Medical Society and other institutions. The COVID-19 CT Segmentation dataset nr.2 is provided by Radiopaedia Institute. The composed dataset 2 has a total of 6804 slices.

To reduce the impact of resizing images on image quality, we first resample the original image slices and then crop all slices uniformly to $128 \times 128 .$ Combined with experimental experience and control variable method to adjust the model parameters, we selected the Adam optimizer to converge our loss function with a learning rate of 0.001. Due to GPU performance limitations, we set the training batch to 2, and the epoch is 120.

Our PDRF-Net does not require pretraining, and the model architecture is based on an encoding-decoding structure. Our model is implemented using PyTorch on an Ubuntu 20.04 server. We use Intel (R) Core (TM) i7-11700K CPU with NVIDIA RTX 3080 Ti GPU to accelerate model training process. Furthermore, the programming language is Python 3.8.

Evaluation metrics

To evaluate the proposed PDRF-Net segmentation performance of infected lung CT regions, we adopted five evaluation indexes: Sensitivity (Sen), Specificity (Spe), Intersection over union (IoU), Dice Similarity Coefficient (DSC) and Accuracy (ACC). They are defined as:

\begin{matrix} S e n = \frac{TP}{T P + F N} \end{matrix}

\begin{matrix} S p e = \frac{TN}{T N + F P} \end{matrix}

\begin{matrix} I o U = \frac{TP}{F P + T P + F N} \end{matrix}

\begin{matrix} D S C = \frac{2 T P}{F P + 2 T P + F N} \end{matrix}

\begin{matrix} A C C = \frac{T P + T N}{T P + F P + T N + F N} . \end{matrix}

For lung CT image segmentation task, K denotes the category; $K + 1$ denotes the plus background class; and I denotes the true value. TP refers to the number of pixels in the infected region of the lung that are correctly identified. And the number of non-lung infected pixels that are correctly identified as non-lung infected is defined as TN. FP is the number of lung non-infected pixels but incorrectly identified as infected. lung infected pixels wrongly identified as uninfected are defined as FN.

Comparison of different methods

In this section, we compare the proposed method PDRF-Net with other mainstream methods, including SegNet (Badrinarayanan et al. 2017), DeepLabv3+ (Chen et al. 2018), R2U-Net (Alom et al. 2019), Attention U-Net (Oktay et al. 2018), and U²-Net (Qin et al. 2020), as well as Attention R2U-Net, which combines Attention U-Net and R2U-Net. All segmentation models were run separately in the same experimental setting. The experimental results for the two datasets are shown in Table 1.

Table 1.

Comparison of the performance of different models on COVID-19-1 and COVID-19-2

Dataset	Methods	Params.(M)	FLOPs(G)	Inference(ms)	Sen(%)	Spe(%)	IoU(%)	DSC(%)	ACC(%)
COVID-19-1	SegNet (Badrinarayanan et al. 2017)	29.6	10.2	9	80.6	94.3	67.2	80.4	85.6
	DeepLabv3+ (Chen et al. 2018)	40.8	12.4	16	82.5	95.1	69.5	82.0	86.9
	R2U-Net (Alom et al. 2019)	17.3	35.6	21	83.7	95.9	71.4	83.3	88.3
	Attention U-Net (Oktay et al. 2018)	8.5	15.5	12	85.4	96.2	74.2	85.2	90.5
	Attention R2U-Net	36.9	52.6	24	89.2	96.4	76.3	86.6	91.3
	U²-Net (Qin et al. 2020)	48.7	57.3	28	87.5	96.8	77.5	87.3	92.6
	PDRF-Net	54.9	66.1	23	88.9	96.3	79.6	88.6	94.1
COVID-19-2	SegNet (Badrinarayanan et al. 2017)	29.6	10.2	9	75.6	87.4	60.3	75.2	76.3
	DeepLabv3+ (Chen et al. 2018)	40.8	12.4	16	76.3	89.5	61.2	75.9	79.5
	R2U-Net (Alom et al. 2019)	17.3	35.6	21	79.2	95.4	64.7	78.6	80.1
	Attention U-Net (Oktay et al. 2018)	8.5	15.5	12	86.4	92.3	68.5	81.3	84.2
	Attention R2U-Net	36.9	52.6	24	83.6	93.5	71.4	83.3	86.7
	U²-Net (Qin et al. 2020)	48.7	57.3	28	85.3	94.7	73.9	85.0	87.2
	PDRF-Net	54.9	66.1	23	86.2	95.2	75.3	85.9	89.4

Open in a new tab

Bold indicates the best result

Generally, the number of parameters in the model is positively correlated with the computational complexity. As can be seen from the table, Compared with U²-Net with high computational complexity, the parameters of PDRF-Net increase by 12.7%, the computational complexity increases by 15.4%, and the inference time decreases by 17.9%. With a slight increase in network parameters and computational complexity, our well-designed PDRF-Net reaches the maximum value in IoU, DSC and ACC evaluation metrics. We also found that the overall experimental results of all the models on COVID-19-2 are worse than those on COVID-19-1. This may be because the infections in the two datasets are very different. Compared with SegNet, our network improves by 18.5% and 24.9% on IoU, 9.9% and 17.2% on ACC. The segmentation performance of SegNet and DeepLabv3+ is significantly lower than other comparison networks, which is because these networks do not include skip connections. This makes the segmentation network unable to consider the connection between shallow features and deep abstract features when extracting features, resulting in a coarse final segmentation result. However, the U-Net-based networks consider this defect in the network structure and add skip connections, which is suitable for the high-precision segmentation task of medical images. On the COVID-19-1 dataset, the proposed PDRF-Net differs from the Attention R2U-Net that obtains the best Sen by 0.3%, and differs from the U²-Net that obtains the best Spe by 0.5%. While on the COVID-19-2 dataset, the proposed PDRF-Net differs from the Attention U-Net that gets the best Sen by 0.2%, and also differs from the R2U-Net that performs best on the Spe by 0.2%. In general, considering the network parameters, computational complexity and other evaluation metrics, our proposed PDRF-Net still has strong competitiveness.

In addition, we randomly selected six images in the test set of COVID-19-1 and COVID-19-2 for visualization. The visualizations are shown in Fig. 7. The first row and the second row are the original images and the ground truth, respectively. The third to eighth rows are the segmentation results of SegNet, DeepLabv3+, R2U-Net, Attention U-Net, Attention R2U-Net, and U²-Net, respectively. The last row is the result of PDRF-Net, which was close to manual segmentation. It can be seen from the above figure that SegNet, DeepLabv3+, and R2U-Net all only segment the infected area with a thicker main part, and there are different degrees of under-segmentation. Attention U-Net, Attention R2U-Net, and U²-Net handle the details slightly better. For subtle changes in the lesion regions, our network has a greater learning ability and achieves the best segmentation outcome.

The dynamic trends during model training are very important for evaluating the excellence of the model. Thus, we plotted the loss graph and the accuracy graph of the training of each comparison algorithm, as shown in Fig. 8. From the figure, we can find that our model fuses shallow feature information and deep feature information by the dense skip connections; the training process converges faster than the other networks; and the accuracy after convergence is higher than the other six algorithms in all cases. Therefore, we can conclude that PDRF-Net can learn lung infection segmentation better.

Fig. 8 — Comparison of the loss curve and ACC curve of lung infection segmentation obtained by different models on the target dataset. a is the loss comparison curve for COVID-19-1; b is the loss comparison curve for COVID-19-2; c is the ACC comparison curve for COVID-19-1; d is the ACC comparison curve for COVID-19-2

Ablation study

To illustrate the ability of all modules in this paper to contribute towards the overall network, in this section, we perform detailed ablation experiments by an incremental method. We take the traditional U-Net as the baseline and sequentially add DAR module, UAR module, dense skip connections, and BCPW module. The results of the ablation experiments on COVID-19-1 and COVID-19-2 are presented in Table 2.

Table 2.

Ablation study of our PDRF-Net on the COVID-19-1 dataset

Dataset	Baseline	DAR	UAR	Dense skip connections	BCPW	Sen(%)	Spe(%)	IoU(%)	DSC(%)	ACC(%)
COVID-19-1	$\sqrt$					83.6	92.7	71.3	83.2	82.9
	$\sqrt$	$\sqrt$				84.5	93.2	72.9	84.3	85.2
	$\sqrt$		$\sqrt$			84.7	93.8	72.4	84.0	84.6
	$\sqrt$			$\sqrt$		85.2	94.2	73.5	84.7	86.3
	$\sqrt$			$\sqrt$	$\sqrt$	86.1	94.9	75.1	85.8	89.4
	$\sqrt$	$\sqrt$		$\sqrt$		87.3	95.4	76.8	86.9	91.3
	$\sqrt$	$\sqrt$	$\sqrt$	$\sqrt$		88.2	95.8	78.2	87.8	92.7
	$\sqrt$	$\sqrt$	$\sqrt$	$\sqrt$	$\sqrt$	88.9	96.3	79.6	88.6	94.1
COVID-19-2	$\sqrt$					81.2	85.3	67.8	80.8	82.7
	$\sqrt$	$\sqrt$				82.3	89.5	69.2	81.8	84.3
	$\sqrt$		$\sqrt$			81.7	87.6	68.7	81.4	83.6
	$\sqrt$			$\sqrt$		82.8	90.1	70.3	82.6	84.9
	$\sqrt$			$\sqrt$	$\sqrt$	83.9	90.6	71.6	83.4	85.2
	$\sqrt$	$\sqrt$		$\sqrt$		84.7	91.8	72.5	84.1	86.5
	$\sqrt$	$\sqrt$	$\sqrt$	$\sqrt$		85.4	92.7	74.2	85.2	87.6
	$\sqrt$	$\sqrt$	$\sqrt$	$\sqrt$	$\sqrt$	86.2	95.2	75.3	85.9	89.4

Open in a new tab

The best results are shown in bold font

Compared with the baseline, our PDRF-Net improves by 13.5% and 8.1% on ACC and 11.6% and 11.0% on IoU. After introducing DAR into the baseline, it increased by 2.2% and 2.1% on IoU and 2.8% and 1.9% on ACC, respectively. While after introducing UAR into the baseline, the network increased by 1.5% and 0.6% on IoU, and increased by 2.1% and 1.1% on ACC respectively. This is due to the addition of aggregated residual modules (DAR, UAR) in the encoding-decoding part, which not only increases the network depth, but also improves the problem of network forgetting. The dense skip connections in our network induce feature information transfer at different scales. The BCPW module after the final output of the decoding part enhances the feature learning capability of the network, ensuring that the model extracts more feature information from images in all dimensions. The segmentation performance of the network has been greatly improved. Figure 9 further clearly shows the specific contribution of each module in the model.

Fig. 9 — The 3D histogram more intuitively presents the performance of different components on COVID-19-1 dataset, COVID-19-2 dataset

The corresponding visualization results for the ablation experiments on COVID-19-1 and COVID-19-2 are shown in Fig. 10. The introduction of the DAR and UAR modules improves our baseline in terms of segmentation results. After continuing to add dense skip connections, the effect is significantly improved, reflecting the importance of dense skip connections. Finally, after adding the BCPW modules, the results are convincing.

Ablation study for down-sampling

Using U-Net as the baseline, we compare the performance of the residual block of D2AUNet (Zhao et al. 2021) with the proposed DAR module. The experimental results are shown in Table 3. It can be seen from the table that the evaluation metrics of DAR module are better than residual block. Our DAR module is 2.0% and 1.7% higher than residual block on IoU. In terms of DSC, the DAR module is 1.9% and 1.8% higher than the residual block. In addition, the DAR module on ACC compared to the residual block increased by 1.5% and 1.6%. It can be seen that after combining transformer in the DAR module, the residual module has stronger feature representation ability, and the overall segmentation performance is better than the residual block.

Table 3.

Performance comparison of different down-sampling methods on COVID-19-1 and COVID-19-2

Dataset	Methods	Sen(%)	Spe(%)	IoU(%)	DSC(%)	ACC(%)
COVID-19-1	Residual block (Zhao et al. 2021)	82.3	91.6	71.5	82.7	83.9
	DAR	84.5	93.2	72.9	84.3	85.2
COVID-19-2	Residual block (Zhao et al. 2021)	83.1	92.4	71.2	82.5	83.3
	DAR	84.7	93.8	72.4	84.0	84.6

Open in a new tab

Ablation study for up-sampling

Table 4 shows the segmentation performance comparison between the proposed UAR module and the Gate Att module (Zhao et al. 2021). We replace the traditional up-sampling in U-Net with the UAR module and the Gate Att module respectively.

Table 4.

Performance comparison of different up-sampling methods on COVID-19-1 and COVID-19-2

Dataset	Methods	Sen(%)	Spe(%)	IoU(%)	DSC(%)	ACC(%)
COVID-19-1	Gate Att module (Zhao et al. 2021)	81.9	87.9	68.4	80.5	82.7
	DAR	82.3	89.5	69.2	81.8	84.3
COVID-19-2	Gate Att module (Zhao et al. 2021)	80.6	88.2	67.5	80.2	81.8
	DAR	81.7	87.6	68.7	81.4	83.6

Open in a new tab

The experimental results show that the UAR module is 1.2% and 1.8% higher than the Gate Att module on IoU. In terms of DSC, the UAR module is 1.6% and 1.5% higher than the Gate Att module. Finally, the UAR module compared to the Gate Att module increased by 1.9% and 2.2%. When segmenting the lesion regions, the UAR module improves the fusion quality of low-level and high-level semantic features. In addition, the segmentation results obtained by U-Net combined with the UAR module are closer to the ground truth.

Ablation study for dense skip connections

Dense skip connections can improve the diversity and representation ability of the extracted features by information fusion between different encoding layers and decoding layers. As shown in Table 5, based on U-Net, we introduced U-Net++’s (Zhou et al. 2018) skip connections and our dense skip connections respectively. As can be seen from the table, the evaluation results on the two datasets are improved compared with U-Net++. The U-Net + dense skip connections method has significantly improved the evaluation metrics Sen and DSC, and it has still improved in the COVID-19-2 dataset, while the improvement is relatively small.

Table 5.

Ablation study for dense skip connections

Dataset	Methods	Sen(%)	Spe(%)	IoU(%)	DSC(%)	ACC(%)
COVID-19-1	Skip connections (Zhou et al. 2018)	83.9	92.5	71.6	83.4	84.6
	Dense skip connections	85.2	94.2	73.5	84.7	86.3
COVID-19-2	Skip connections (Zhou et al. 2018)	82.1	88.7	68.9	81.6	82.5
	Dense skip connections	82.8	90.1	70.3	82.6	84.9

Open in a new tab

Generalization ability

Finally, we investigate the generalization ability of PDRF-Net in different fields. This is a task of retinal vessel segmentation, which is different from the segmentation of lung lesions and challenging. The adopted DRIVE dataset (Staal et al. 2004) contains 40 images with a resolution of $584 \times 565,$ which has divided the dataset in advance. There are 20 images for training and 20 for testing, but the validation set is not available. Therefore, we divide the original training set into 16 images for training and 4 images for verification. Finally, the partition of the training set, verification set, and test set in our experiment is 16:4:20.

Based on the same experimental environment and parameter Settings, we used SegNet (Oktay et al. 2018), DeepLabV3+ (Chen et al. 2018), R2U-Net (Alom et al. 2019), U-Net (Ronneberger et al. 2015) and Attention U-Net (Oktay et al. 2018) to perform model evaluation on the test set. Table 6 demonstrates that PDRF-Net outperforms other mainstream models in the evaluation metrics of IoU, DSC and ACC, reaching 65.1%, 78.7% and 95.8% respectively. It has a strong competitive advantage in Sen and Spe, which is 4.3% and 0.3% different from the optimal SegNet on Sen and the optimal Attention U-Net on Spe, respectively. The corresponding segmentation results are shown in Fig. 11. The results indicate that our proposed method still has strong generalization ability in the task of fundus retinal vessel segmentation.

Table 6.

Comparison of the effect of each method on fundus retinal vessel segmentation

Methods	Sen(%)	Spe(%)	IoU(%)	DSC(%)	ACC(%)
SegNet (Oktay et al. 2018)	80.7	83.3	31.8	48.3	81.9
DeepLabv3+ (Chen et al. 2018)	65.3	94.6	52.7	69.0	93.6
R2U-Net (Alom et al. 2019)	67.8	96.1	58.4	73.7	94.7
U-Net (Ronneberger et al. 2015)	75.6	95.8	62.6	77.0	94.5
Attention U-Net (Oktay et al. 2018)	74.2	97.5	64.5	78.4	95.2
PDRF-Net	76.4	97.2	65.1	78.7	95.8

Open in a new tab

The best results are shown in bold font

Fig. 11 — Visual results of fundus retinal vessel segmentation

Conclusion

In this work, we proposed a novel lung infected region segmentation model (PDRF-Net) for COVID-19. First, we introduce dense skip connections to enable the sampled features of the encoded nodes to be reused in a multi-level structure to mine the small-sized semantic information hidden in the shallow layer, such as the tiny infected regions of lung lesions and boundary information. Second, we design the DAR module for down-sampling in the encoder to capture multi-scale spatial information, and we design the UAR module in the decoder of the first three stages to further refine the feature information and better recover the prominent regions that are infected. In addition, we use the BCPW module after all decoder branches to achieve feature fusion with different-sized perceptual fields and different levels and enable the model to extract optimal image feature information from all dimensions. The experimental results reflect that our PDRF-Net is better than the current mainstream segmentation methods for the lung infection region of COVID-19 segmentation. However, our method still has some limitations. For example, this paper converts 3D medical images into 2D slices for operation, which may ignore the spatial information between slices that facilitates fine segmentation. In addition, our method still has room for improvement in the number of parameters and computational complexity. Therefore, we hope to optimize the model lightweight in future work and explore the application of PDRF-Net in 3D medical image segmentation.

Author contributions

All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by Guizhou Science and Technology Planning Project (Guizhou Science and Technology Cooperation Support [2021] General 176).

Availability of data and materials

Publicly available dataset was used in this study. The COVID-19-1 dataset and COVID-19-2 dataset can be found here: https://www.kaggle.com/datasets/andrewmvd/covid19-ct-scans, https://medicalsegmentation.com/covid19/.

Declarations

Conflict of interest

The authors declare no conflict of interest.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Alom MZ, Yakopcic C, Hasan M, Taha TM, Asari VK (2019) Recurrent residual u-net for medical image segmentation. J Med Imaging 6(1):014006 [DOI] [PMC free article] [PubMed] [Google Scholar]
Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder–decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495 [DOI] [PubMed] [Google Scholar]
Banerjee S, Lyu J, Huang Z, Leung FH, Lee T, Yang D, Su S, Zheng Y, Ling SH (2022) Ultrasound spine image segmentation using multi-scale feature fusion skip-inception U-Net (SIU-Net). Biocybern Biomed Eng 42(1):341–361 [Google Scholar]
Bhalerao M, Thakur (2019) Brain tumor segmentation based on 3d residual u-net. In: International MICCAI brainlesion workshop. Springer, Berlin, pp 218–225
Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M (2021) Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv preprint. arXiv:2105.05537
Chattopadhyay S, Dey A, Singh PK, Sarkar R (2022) DRDA-Net: dense residual dual-shuffle attention network for breast cancer classification using histopathological images. Comput Biol Med 145:105437 [DOI] [PubMed] [Google Scholar]
Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder–decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV), Munich, pp 801–818
Chen X, Yao L, Zhang Y (2020) Residual attention U-Net for automated multi-class segmentation of COVID-19 chest CT images. arXiv preprint. arXiv:2004.05645
Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y (2021) Transunet: transformers make strong encoders for medical image segmentation. arXiv preprint. arXiv:2102.04306
Chen Y, Xu C, Ding W, Sun S, Yue X, Fujita H (2022) Target-aware U-Net with fuzzy skip connections for refined pancreas segmentation. Applied Soft Computing 131:109818
Dong X, Bao J, Chen D, Zhang W, Yu N, Yuan L, Chen D, Guo B (2022) Cswin transformer: a general vision transformer backbone with cross-shaped windows. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12114–12124. 10.1109/CVPR52688.2022.01181
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR)
Elharrouss O, Subramanian N, Al-Maadeed S (2022) An encoder–decoder-based method for segmentation of COVID-19 lung infection in CT images. SN Comput Sci 3(1):1–12 [DOI] [PMC free article] [PubMed] [Google Scholar]
Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D (2022) Swin UNETR: swin transformers for semantic segmentation of brain tumors in MRI images. In: International MICCAI brainlesion workshop. Springer, Berlin, pp 272–284
He A, Wang K, Li T, Bo W, Kang H, Fu H (2022) Progressive multi-scale consistent network for multi-class fundus lesion segmentation. IEEE Trans Med Imaging 41(11):3146–3157 [DOI] [PubMed]
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE, Montreal, pp 10012–10022
Mehta S, Rastegari M (2022) Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In: International Conference on Learning Representations (ICLR)
Milletari F, Navab N, Ahmadi S-A (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth international conference on 3D vision (3DV). IEEE, Stanford, pp 565–571
Mu N, Lyu Z, Rezaeitaleshmahalleh M, Tang J, Jiang J (2022) An attention residual u-net with differential preprocessing and geometric postprocessing: learning how to segment vasculature including intracranial aneurysms. Med Image Anal 84:102697 [DOI] [PMC free article] [PubMed]
Munusamy H, Muthukumar KJ, Gnanaprakasam S, Shanmugakani TR, Sekar A (2021) FractalCovNet architecture for COVID-19 chest X-ray image classification and CT-scan image segmentation. Biocybern Biomed Eng 41(3):1025–1038 [DOI] [PMC free article] [PubMed] [Google Scholar]
Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B et al (2018) Attention U-Net: Learning Where to Look for the Pancreas. In: Medical Imaging with Deep Learning (MIDL), Amsterdam
Peiris H, Hayat M, Chen Z, Egan G, Harandi M (2021) A volumetric transformer for accurate 3d tumor segmentation. arXiv preprint. arXiv:2111.13300
Qin X, Zhang Z, Huang C, Dehghan M, Zaiane OR, Jagersand M (2020) U2-net: going deeper with nested u-structure for salient object detection. Pattern Recognit 106:107404 [Google Scholar]
Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, Berlin, pp 234–241
Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). IEEE, Salt Lake City, pp 4510–4520
Staal J, Abràmoff MD, Niemeijer M, Viergever MA, Van Ginneken B (2004) Ridge-based vessel segmentation in color images of the retina. IEEE Trans Med Imaging 23(4):501–509 [DOI] [PubMed] [Google Scholar]
Tang P, Zu C, Hong M, Yan R, Peng X, Xiao J, Wu X, Zhou J, Zhou L, Wang Y (2021) DA-DSUnet: dual attention-based dense SU-net for automatic head-and-neck tumor segmentation in MRI images. Neurocomputing 435:103–113 [Google Scholar]
Wang C, Horby PW, Hayden FG, Gao GF (2020a) A novel coronavirus outbreak of global health concern. Lancet 395(10223):470–473 [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang G, Liu X, Li C, Xu Z, Ruan J, Zhu H, Meng T, Li K, Huang N, Zhang S (2020b) A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images. IEEE Trans Med Imaging 39(8):2653–2663 [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang B, Jin S, Yan Q, Xu H, Luo C, Wei L, Zhao W, Hou X, Ma W, Xu Z et al (2021a) AI-assisted CT imaging analysis for COVID-19 screening: building and deploying a medical AI system. Appl Soft Comput 98:106897 [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang W, Chen C, Ding M, Yu H, Zha S, Li J (2021b) Transbts: multimodal brain tumor segmentation using transformer. In: International conference on medical image computing and computer-assisted intervention. Springer, Berlin, pp 109–119
Wang X, Li Z, Huang Y, Jiao Y (2022) Multimodal medical image segmentation using multi-scale context-aware network. Neurocomputing 486:135–146 [Google Scholar]
Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song Z-G, Hu Y, Tao Z-W, Tian J-H, Pei Y-Y et al (2020) A new coronavirus associated with human respiratory disease in china. Nature 579(7798):265–269 [DOI] [PMC free article] [PubMed] [Google Scholar]
Xiao X, Lian S, Luo Z, Li S (2018) Weighted res-unet for high-quality retina vessel segmentation. In: 2018 9th International conference on information technology in medicine and education (ITME). IEEE, Hangzhou, pp 327–331
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint. arXiv:1511.07122
Zhang J, Xie Y, Wang Y, Xia Y (2020) Inter-slice context residual learning for 3D medical image segmentation. IEEE Trans Med Imaging 40(2):661–672 [DOI] [PubMed] [Google Scholar]
Zhao X, Zhang P, Song F, Fan G, Sun Y, Wang Y, Tian Z, Zhang L, Zhang G (2021) D2a u-net: Automatic segmentation of covid-19 ct slices based on dual attention and hybrid dilated convolution. Comput Biol Med 135:104526. 10.1016/j.compbiomed.2021.104526 [DOI] [PMC free article] [PubMed]
Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J (2018) Unet++: a nested u-net architecture for medical image segmentation. In: Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, 2018, pp 3–11 [DOI] [PMC free article] [PubMed]
Zhou H-Y, Lu C, Yang S, Yu Y (2021a) ConvNets vs. transformers: whose visual representations are more transferable? In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE, Montreal, pp 2230–2238
Zhou T, Canu S, Ruan S (2021b) Automatic COVID-19 CT segmentation using U-Net integrated spatial and channel attention mechanism. Int J Imaging Syst Technol 31(1):16–27 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

[CR2] Alom MZ, Yakopcic C, Hasan M, Taha TM, Asari VK (2019) Recurrent residual u-net for medical image segmentation. J Med Imaging 6(1):014006 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder–decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495 [DOI] [PubMed] [Google Scholar]

[CR4] Banerjee S, Lyu J, Huang Z, Leung FH, Lee T, Yang D, Su S, Zheng Y, Ling SH (2022) Ultrasound spine image segmentation using multi-scale feature fusion skip-inception U-Net (SIU-Net). Biocybern Biomed Eng 42(1):341–361 [Google Scholar]

[CR5] Bhalerao M, Thakur (2019) Brain tumor segmentation based on 3d residual u-net. In: International MICCAI brainlesion workshop. Springer, Berlin, pp 218–225

[CR6] Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M (2021) Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv preprint. arXiv:2105.05537

[CR7] Chattopadhyay S, Dey A, Singh PK, Sarkar R (2022) DRDA-Net: dense residual dual-shuffle attention network for breast cancer classification using histopathological images. Comput Biol Med 145:105437 [DOI] [PubMed] [Google Scholar]

[CR8] Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder–decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV), Munich, pp 801–818

[CR9] Chen X, Yao L, Zhang Y (2020) Residual attention U-Net for automated multi-class segmentation of COVID-19 chest CT images. arXiv preprint. arXiv:2004.05645

[CR10] Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y (2021) Transunet: transformers make strong encoders for medical image segmentation. arXiv preprint. arXiv:2102.04306

[CR11] Chen Y, Xu C, Ding W, Sun S, Yue X, Fujita H (2022) Target-aware U-Net with fuzzy skip connections for refined pancreas segmentation. Applied Soft Computing 131:109818

[CR12] Dong X, Bao J, Chen D, Zhang W, Yu N, Yuan L, Chen D, Guo B (2022) Cswin transformer: a general vision transformer backbone with cross-shaped windows. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 12114–12124. 10.1109/CVPR52688.2022.01181

[CR13] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR)

[CR14] Elharrouss O, Subramanian N, Al-Maadeed S (2022) An encoder–decoder-based method for segmentation of COVID-19 lung infection in CT images. SN Comput Sci 3(1):1–12 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D (2022) Swin UNETR: swin transformers for semantic segmentation of brain tumors in MRI images. In: International MICCAI brainlesion workshop. Springer, Berlin, pp 272–284

[CR16] He A, Wang K, Li T, Bo W, Kang H, Fu H (2022) Progressive multi-scale consistent network for multi-class fundus lesion segmentation. IEEE Trans Med Imaging 41(11):3146–3157 [DOI] [PubMed]

[CR17] Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE, Montreal, pp 10012–10022

[CR18] Mehta S, Rastegari M (2022) Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In: International Conference on Learning Representations (ICLR)

[CR19] Milletari F, Navab N, Ahmadi S-A (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth international conference on 3D vision (3DV). IEEE, Stanford, pp 565–571

[CR20] Mu N, Lyu Z, Rezaeitaleshmahalleh M, Tang J, Jiang J (2022) An attention residual u-net with differential preprocessing and geometric postprocessing: learning how to segment vasculature including intracranial aneurysms. Med Image Anal 84:102697 [DOI] [PMC free article] [PubMed]

[CR21] Munusamy H, Muthukumar KJ, Gnanaprakasam S, Shanmugakani TR, Sekar A (2021) FractalCovNet architecture for COVID-19 chest X-ray image classification and CT-scan image segmentation. Biocybern Biomed Eng 41(3):1025–1038 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B et al (2018) Attention U-Net: Learning Where to Look for the Pancreas. In: Medical Imaging with Deep Learning (MIDL), Amsterdam

[CR23] Peiris H, Hayat M, Chen Z, Egan G, Harandi M (2021) A volumetric transformer for accurate 3d tumor segmentation. arXiv preprint. arXiv:2111.13300

[CR24] Qin X, Zhang Z, Huang C, Dehghan M, Zaiane OR, Jagersand M (2020) U2-net: going deeper with nested u-structure for salient object detection. Pattern Recognit 106:107404 [Google Scholar]

[CR25] Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, Berlin, pp 234–241

[CR26] Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). IEEE, Salt Lake City, pp 4510–4520

[CR27] Staal J, Abràmoff MD, Niemeijer M, Viergever MA, Van Ginneken B (2004) Ridge-based vessel segmentation in color images of the retina. IEEE Trans Med Imaging 23(4):501–509 [DOI] [PubMed] [Google Scholar]

[CR28] Tang P, Zu C, Hong M, Yan R, Peng X, Xiao J, Wu X, Zhou J, Zhou L, Wang Y (2021) DA-DSUnet: dual attention-based dense SU-net for automatic head-and-neck tumor segmentation in MRI images. Neurocomputing 435:103–113 [Google Scholar]

[CR29] Wang C, Horby PW, Hayden FG, Gao GF (2020a) A novel coronavirus outbreak of global health concern. Lancet 395(10223):470–473 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR30] Wang G, Liu X, Li C, Xu Z, Ruan J, Zhu H, Meng T, Li K, Huang N, Zhang S (2020b) A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images. IEEE Trans Med Imaging 39(8):2653–2663 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR31] Wang B, Jin S, Yan Q, Xu H, Luo C, Wei L, Zhao W, Hou X, Ma W, Xu Z et al (2021a) AI-assisted CT imaging analysis for COVID-19 screening: building and deploying a medical AI system. Appl Soft Comput 98:106897 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] Wang W, Chen C, Ding M, Yu H, Zha S, Li J (2021b) Transbts: multimodal brain tumor segmentation using transformer. In: International conference on medical image computing and computer-assisted intervention. Springer, Berlin, pp 109–119

[CR33] Wang X, Li Z, Huang Y, Jiao Y (2022) Multimodal medical image segmentation using multi-scale context-aware network. Neurocomputing 486:135–146 [Google Scholar]

[CR34] Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song Z-G, Hu Y, Tao Z-W, Tian J-H, Pei Y-Y et al (2020) A new coronavirus associated with human respiratory disease in china. Nature 579(7798):265–269 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR35] Xiao X, Lian S, Luo Z, Li S (2018) Weighted res-unet for high-quality retina vessel segmentation. In: 2018 9th International conference on information technology in medicine and education (ITME). IEEE, Hangzhou, pp 327–331

[CR36] Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint. arXiv:1511.07122

[CR37] Zhang J, Xie Y, Wang Y, Xia Y (2020) Inter-slice context residual learning for 3D medical image segmentation. IEEE Trans Med Imaging 40(2):661–672 [DOI] [PubMed] [Google Scholar]

[CR38] Zhao X, Zhang P, Song F, Fan G, Sun Y, Wang Y, Tian Z, Zhang L, Zhang G (2021) D2a u-net: Automatic segmentation of covid-19 ct slices based on dual attention and hybrid dilated convolution. Comput Biol Med 135:104526. 10.1016/j.compbiomed.2021.104526 [DOI] [PMC free article] [PubMed]

[CR39] Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J (2018) Unet++: a nested u-net architecture for medical image segmentation. In: Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, 2018, pp 3–11 [DOI] [PMC free article] [PubMed]

[CR40] Zhou H-Y, Lu C, Yang S, Yu Y (2021a) ConvNets vs. transformers: whose visual representations are more transferable? In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE, Montreal, pp 2230–2238

[CR41] Zhou T, Canu S, Ruan S (2021b) Automatic COVID-19 CT segmentation using U-Net integrated spatial and channel attention mechanism. Int J Imaging Syst Technol 31(1):16–27 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

PDRF-Net: a progressive dense residual fusion network for COVID-19 lung CT image segmentation

Xiaoyan Lu

Yang Xu

Wenhao Yuan

Abstract

Introduction

Related work

Transformers in medical image segmentation tasks

Residual blocks in medical image segmentation tasks

Dense skip connections in medical image segmentation tasks

Methodology

Overview of PDRF-Net

Fig. 1.

Dense skip connections

Fig. 2.

Aggregated residual module

Fig. 3.

Fig. 4.

Bilateral channel pixel weighted module

Fig. 5.

Fig. 6.

Loss function

Dice loss

BCE loss

Experimental design

Dataset and experimental details

Evaluation metrics

Comparison of different methods

Table 1.

Fig. 7.

Fig. 8.

Ablation study

Table 2.

Fig. 9.

Fig. 10.

Ablation study for down-sampling

Table 3.

Ablation study for up-sampling

Table 4.

Ablation study for dense skip connections

Table 5.

Generalization ability

Table 6.

Fig. 11.

Conclusion

Author contributions

Funding

Availability of data and materials

Declarations

Conflict of interest

Footnotes

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases