Pattern Recognition. 2021 Nov 25;124:108452. doi: 10.1016/j.patcog.2021.108452

Deep co-supervision and attention fusion strategy for automatic COVID-19 lung infection segmentation on CT images

Haigen Hu a,b, Leizhao Shen a,b, Qiu Guan a,b, Xiaoxin Li a,b, Qianwei Zhou a,b, Su Ruan c
PMCID: PMC8612757  PMID: 34848897

Abstract

Due to the irregular shapes, various sizes and indistinguishable boundaries between normal and infected tissues, it is still a challenging task to accurately segment the infected lesions of COVID-19 on CT images. In this paper, a novel segmentation scheme is proposed for COVID-19 infections by enhancing supervised information and fusing multi-scale feature maps of different levels based on the encoder-decoder architecture. To this end, a deep collaborative supervision (Co-supervision) scheme is proposed to guide the network in learning edge and semantic features. More specifically, an Edge Supervised Module (ESM) is first designed to highlight low-level boundary features by incorporating edge supervised information into the initial stage of down-sampling. Meanwhile, an Auxiliary Semantic Supervised Module (ASSM) is proposed to strengthen high-level semantic information by integrating mask supervised information into the later stage. Then an Attention Fusion Module (AFM) is developed to fuse multi-scale feature maps of different levels by using an attention mechanism to reduce the semantic gaps between high-level and low-level feature maps. Finally, the effectiveness of the proposed scheme is demonstrated on four different COVID-19 CT datasets. The results show that the three proposed modules are all promising. Based on the baseline (ResUnet), using ESM, ASSM, or AFM alone increases the Dice metric by 1.12%, 1.95% and 1.63% respectively on our dataset, while integrating the three modules together increases it by 3.97%. Compared with existing approaches on various datasets, the proposed method obtains better segmentation performance in the main metrics, and achieves the best generalization and comprehensive performance.

Keywords: Semantic segmentation, Multi-scale features, Attention mechanism, Feature fusion, COVID-19

1. Introduction

Since the outbreak of COVID-19 in December 2019, the disease has spread rapidly around the world, causing millions of casualties and enormous economic losses. Rapid diagnosis of COVID-19 is of great significance for the diagnosis, assessment and staging of COVID-19 infection [1], [2], [3]. Nucleic acid testing is the “gold standard” for the diagnosis of COVID-19, but the results are easily influenced by the quality of sample collection, and the test is also time-consuming. Therefore, imaging methods such as CT and X-ray are still commonly used for diagnosis. In particular, artificial intelligence (AI) methods have been combined with medical imaging to aid the diagnosis of COVID-19 in clinical practice, and deep learning-based methods are becoming hot spots in the detection and segmentation of COVID-19 infected areas. For example, a modified Inception neural network was proposed to train on Regions of Interest (RoI) instead of whole CT images for classifying COVID-19 patients from the control group [7]. Amyar et al. [5] proposed a multitask deep learning model to jointly identify COVID-19 patients and segment COVID-19 lesions from chest CT images. Oulefki et al. [10] presented an automated tool for segmenting and measuring COVID-19 lung infection using chest CT imagery. Since lung infected region segmentation is a necessary initial step for lung image analysis, image segmentation algorithms have also been proposed for specific application scenarios. For instance, Inf-Net was proposed to segment the infection areas of the novel coronavirus, and a semi-supervised training method was put forward to address the insufficient amount of labeled CT data and improve the segmentation performance [6]. Currently, most methods focus on detection and classification tasks rather than the semantic segmentation of infections on CT slices [4], so the assessment and staging of COVID-19 infection are greatly limited. Therefore, according to CT imaging characteristics, it is necessary to develop segmentation methods for COVID-19 infection regions, so that quantitative analysis of the lesions can be further achieved.

However, it is still a challenging task to accurately segment the infected lesions of COVID-19 on CT images owing to the following facts.

  • 1.

    The infections have irregular boundaries, and their sizes and shapes vary from slice to slice on CT images (Fig. 1a). This easily leads to missing small ground-glass lesions or excessively over-segmenting the infections on CT images.

  • 2.

    There is often no discernible difference between infections and normal tissues (Fig. 1b). This hardly affects detection or classification, but it can decrease segmentation accuracy and the quality of quantification.

  • 3.

    Existing semantic segmentation approaches based on the encoder-decoder structure suffer from a “semantic gap” between low-level visual features and high-level semantic concepts, which greatly limits the efficiency of semantic segmentation.

Fig. 1.

An illustration of the challenge of identifying the infected lesions (contours in red) of COVID-19 on CT images. (a) The infections have various scales and shapes. (b) There is no obvious difference between normal and infected tissues. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

To address these issues, a novel segmentation scheme is proposed in this paper for COVID-19 infections based on the encoder-decoder architecture [11]; the proposed scheme collaboratively enhances supervised information of different levels and fuses feature maps of different scales. For the proposed deep collaborative supervision scheme, we propose an Auxiliary Semantic Supervised Module (ASSM) and an Edge Supervised Module (ESM) to guide the network in learning semantic and edge features in the encoding stage, respectively. As for multi-scale feature maps, an Attention Fusion Module (AFM), following the decoding stage, is proposed to reduce the semantic gaps between high-level and low-level feature maps. The proposed attention fusion strategy can take full advantage of context information at different scales. Finally, a series of experiments are conducted on COVID-19 datasets to verify the effectiveness of the proposed scheme. The results show that our method obtains better performance for the segmentation of COVID-19 infections than existing approaches. The main contributions of this paper are listed as follows.

  • An ESM is put forward to highlight low-level boundary features. The edge supervised information is incorporated into the initial stage of down-sampling, and the proposed edge supervised loss function allows the network to capture rich spatial information at various scales.

  • An ASSM is proposed to enhance high-level semantics from feature maps of different scales. The mask supervised information is introduced into the later stage of down-sampling, and the corresponding auxiliary semantic loss function is defined to explore sufficient semantic information from infections of various scales on COVID-19 CT images.

  • An AFM is developed to fuse various scale feature maps from the up-sampling stage. An attention mechanism is utilized to reduce the semantic gaps between high-level and low-level feature maps, so as to strengthen and supplement the lost detailed information in high-level representations.

  • A joint loss function is constructed by combining the edge supervised loss, the auxiliary semantic supervised loss and the fusion loss. It guides the network to achieve deep collaborative supervision on edges and semantics, and promotes the efficiency of fusing multi-scale feature maps from different levels.

This paper is organized as follows. Section 2 introduces the related works. Section 3 describes details about the proposed methods, including Edge Supervised Module (ESM), Auxiliary Semantic Supervised Module (ASSM) and Attention Fusion Module (AFM). Section 4 presents experiments, results and discussions, and Section 5 concludes this work.

2. Related works

In this section, we provide a short review of previous studies on network models, edge supervision, multi-scale object recognition, and attention mechanism.

2.1. Network models

Deep network models are hierarchical feature learning methods that learn multiple levels of representation to model complex relationships among data; higher-level features and concepts are thus defined in terms of lower-level ones, and such a hierarchy of features is called a deep architecture [12]. Usually, the first layers learn low-level features such as intensity, color, lines, dots and curves, and the closer the layers are to the output, the more they learn high-level features such as objects and shapes in the feature extraction pipeline. For example, from AlexNet [13] and VGG [14] to ResNet [15], the ability of feature extraction has become more and more powerful with increasing network depth. Accordingly, deeper networks can provide powerful feature extraction for semantic segmentation tasks and can greatly improve segmentation accuracy.

Since FCN [16] was proposed, other semantic segmentation networks have attempted to improve this architecture by adding new modules to address the lack of spatial and contextual information. For example, U-Net [11] greatly improves on FCN simply by adding skip connections. PSPNet [17] employs a pyramid pooling module to explore global context information, which improves the accuracy of target segmentation at different scales. Besides, DeepLabV3+ [18] combines the advantages of the Spatial Pyramid Pooling (SPP) module and the encoder-decoder structure, further explores the Xception model, and applies depthwise separable convolution to both the Atrous Spatial Pyramid Pooling (ASPP) and decoder modules. PSANet [19] captures pixel-level relationships and relative position information in the spatial dimension through convolution layers. In addition, EncNet [21] introduces a channel attention mechanism to capture the global context.

Although many advanced network structures have emerged for semantic segmentation tasks, U-Net and its derivatives are still the most popular architectures and have been widely applied in the medical imaging community [27], [37]. However, despite their outstanding overall performance in segmenting medical images, U-Net-based architectures are still lacking in certain aspects. For example, although the high-level feature map can be optimized by concatenating low-level and high-level feature maps via skip connections, it remains very difficult to reduce the semantic gap between low-level visual features and high-level semantic features. Thus, we select ResUNet as the backbone and attempt to develop a novel segmentation architecture for the COVID-19 segmentation task in this work.
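As a concrete reference for the backbone described above and in Section 4.1 (a U-Net whose down-sampling path is replaced with ResNet blocks), the following PyTorch sketch shows one plausible five-stage ResUNet encoder. The block design, channel widths and names (ResidualBlock, ResUNetEncoder) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convs plus a projection shortcut when shapes change."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.skip = (nn.Identity() if in_ch == out_ch and stride == 1 else
                     nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                   nn.BatchNorm2d(out_ch)))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

class ResUNetEncoder(nn.Module):
    """Five-stage encoder (S1..S5); every stage after the first halves the resolution."""
    def __init__(self, in_ch=1, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        stages, prev = [], in_ch
        for i, w in enumerate(widths):
            stages.append(ResidualBlock(prev, w, stride=1 if i == 0 else 2))
            prev = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)      # feats[0..4] correspond to stages S1..S5
        return feats

# e.g. ResUNetEncoder()(torch.randn(1, 1, 256, 256)) returns five maps from 256x256 down to 16x16
```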

2.2. Edge supervision and multi-scale object recognition

Edge information, as an important image feature, is drawing more and more attention in the deep learning community because it is conducive to extracting object contours in segmentation tasks. For example, explicit edge attention is utilized to model the boundaries and enhance the representations in Fan et al. [6]. Wu et al. [22] proposed a novel edge-aware salient object detection method, which passes messages between the two tasks in both directions and refines multi-level edge and segmentation features. ET-Net [23] integrates edge detection and object segmentation into a deep learning network, and the edge attention representation is embedded to supervise the segmentation prediction. Normally, edge information can provide useful fine-grained constraints to guide feature extraction in semantic segmentation tasks. However, high-level feature maps contain little edge information, while low-level layers contain richer object boundaries.

For the multi-scale object recognition problem, it is common practice in computer vision to exploit multiple levels of coarse and fine-grained semantic features by adopting different network structures. For example, convolution and pooling operations on the original image are used to obtain feature maps of different sizes, which is similar to constructing pyramids in the feature space of images. Feature Pyramid Networks (FPN) [24] is one of the most typical examples; it adopts a top-down architecture with lateral connections to build high-level semantic feature maps at all scales. It has demonstrated significant improvements as a generic feature extractor in detection tasks, and has been widely applied in different detection architectures, such as Faster R-CNN [25] and Mask R-CNN [26].

It is widely known that low-level feature maps pay more attention to detailed information, while high-level feature maps focus on semantic information. More specifically, the encoding pathway is mainly used for feature extraction, and the features form a hierarchy: the spatial resolution decreases and the semantics strengthen as down-sampling deepens. Notably, FPN [24] and U-Net [11] both adopt an encoder-decoder architecture, but they are applied to object detection and semantic segmentation, respectively. The main difference is that FPN [24] has multiple prediction layers for features of various scales. Inspired by this, we attempt to exploit sufficient multi-scale context information from different levels of the encoder in this work. Low-level detailed feature maps can exploit rich spatial information and strengthen the boundaries of the infected regions, while high-level semantic feature maps carry position information and can locate the infected regions.

2.3. Attention mechanism

Attention is a mechanism that emphasizes the features of interest based on the context of feature maps. Normally, an attention mechanism is used to highlight the important context channel-wise or space-wise [7], [8], while suppressing context information irrelevant to the content. For example, Fu et al. [28] proposed the Dual Attention Network (DAN), in which two attention modules capture the spatial dependence between any two positions in the feature maps; a similar self-attention mechanism captures the channel dependence between any two channels, and the weighted sum of all channels is used to update each channel. Huang et al. [29] proposed Criss-Cross Net (CCNet) to capture this important information more effectively; specifically, for each pixel, CCNet obtains the context information on its criss-cross path through a Criss-Cross attention module. Non-local operations, proposed by Wang et al. [30], can directly capture remote dependencies by calculating the interaction between any two locations. Besides, attention mechanisms are also used to aggregate features of different levels in order to bridge the semantic gaps between low-level features and high-level semantics. For example, Li et al. [31] proposed Gated Fully Fusion (GFF) to fully fuse multi-level feature maps controlled by learned gate maps, and this module can bridge the gap between high resolution with low semantics and low resolution with high semantics. Inspired by this, we adopt an attention mechanism to fuse feature maps of various levels, and the proposed AFM can reduce the semantic gaps between high-level and low-level feature maps, so as to strengthen and supplement the missing detailed information in high-level representations.

3. Methods

In this section, we first present the proposed network architecture. Then we introduce in detail the three proposed modules: ESM, ASSM and AFM.

3.1. Proposed network architecture

As mentioned above, U-Net [11] and FPN [24] both have a similar encoder-decoder structure for multi-scale vision tasks, consisting of a contracting path to capture context and a symmetric expanding path that enables precise localization. U-Net [11] creates a path for information propagation between low and high levels by copying low-level features to the corresponding high levels. Despite the good segmentation performance of U-Net and its variations, however, edge information decreases and the number of channels increases along the down-sampling of the contracting path. Both cases can lead to the loss of useful information, so that sufficient information cannot be exploited from all scales and segmentation performance degrades. FPN [24] can overcome these drawbacks and retain multi-scale contextual information by using multiple prediction layers, one for each up-sampling layer. Based on this idea, we propose a novel segmentation scheme for the infections of COVID-19.

Fig. 2 illustrates the proposed network architecture. Firstly, we collaboratively enhance the supervised information by introducing edge and semantic information into the encoding stage. Note that the initial stages are used for edge supervision, while the later stages are used for semantic supervision. Together they occupy the whole down-sampling path; more precisely, the sum of the low-level and high-level layers is equal to the total number of encoder layers. In particular, low-level feature maps from shallow layers have high resolution but limited semantics, whereas high-level feature maps from deep layers have low spatial resolution and lack detailed information (such as object boundaries). When various levels are selected to enhance the supervised information, there is a trade-off between edge supervision and semantic supervision; thus we call it “collaborative supervision” (“Co-supervision”). Then we fuse multi-scale feature maps of different levels from the decoding stage in an encoder-decoder framework (like U-Net). Considering that low-level detailed feature maps have high resolution and can capture rich spatial information such as object boundaries, we design an ESM to highlight low-level boundary features by incorporating the edge supervised information into the initial stages (S1 and S2 in Fig. 2) of down-sampling in the encoder. Since high-level semantic feature maps embody position information such as object concepts, we present an ASSM to strengthen high-level semantic information by integrating object mask supervised information into the later stages (S3 to S5 in Fig. 2). Finally, the feature maps of various scales obtained from the up-sampling stage are fused by an attention mechanism to achieve good segmentation performance for infections of COVID-19.

Fig. 2.

An illustration of the overall network architecture. The proposed architecture comprises ASSM, ESM and AFM based on an encoder-decoder structure. (1) ESM is used to further highlight the low-level features in the initial shallow layers of the encoder, and it can capture more detailed information such as object boundaries. (2) ASSM is employed to strengthen high-level semantic information by integrating object mask supervised information into the later stages of the encoder. (3) Finally, AFM is utilized to fuse multi-scale feature maps of different levels in the decoder.

3.2. Edge supervised module (ESM)

Many studies [22], [23] show that edge information can provide effective constraints for feature extraction in segmentation tasks. To supplement the edge information lost during down-sampling, we propose the ESM to further highlight object boundary features in the low-level layers. Because low-level feature maps from shallow layers have high resolution and detailed information (including edge information), and this detailed information is easily lost during the initial stage of down-sampling, the proposed ESM can capture more detailed information such as object boundaries. Specifically, we guide the network to extract edge features from the initial stages, such as S1 and S2 (shown in Fig. 2), by defining an edge supervised loss function. To this end, the output feature maps of the initial stages are first resized to the size H×W of the original image by bilinear interpolation up-sampling. Then the obtained large feature maps of each layer in the ESM are reduced to a single feature map by a 1×1 convolution. Finally, each pixel value is converted to a probability by the Sigmoid function σ(·) (shown in Fig. 3a), and an edge prediction image of size H×W is obtained. Accordingly, the edge supervised loss function is defined based on the Dice coefficient as follows.

Ledge = 1 - 2 × Σ_{i=1}^{l} ζi (Sedge^i ∩ Gedge) / (Sedge^i + Gedge)   (1)

where Sedge^i is the edge prediction image obtained by bilinear interpolation up-sampling in the ith stage, and Gedge is the corresponding edge Ground Truth (GT), which is generated from the segmentation mask. l is the number of stages used for edge supervision in the ESM, and ζi (i = 1, …, l) is the weight coefficient of the ith stage. Through the skip connections and the AFM, the edge features in the high-level feature maps can also be strengthened.
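A minimal PyTorch sketch of the ESM head and of Eq. (1) is given below, assuming per-stage feature maps such as those produced by the encoder sketch above. The module and function names (SupervisionHead, dice_term, edge_loss) and the channel sizes in the usage lines are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisionHead(nn.Module):
    """Head shared by the ESM/ASSM sketches: upsample a stage's feature map to H x W,
    reduce it to one channel with a 1x1 conv, and map pixel values to probabilities."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feat, out_size):
        x = F.interpolate(feat, size=out_size, mode="bilinear", align_corners=False)
        return torch.sigmoid(self.conv1x1(x))          # prediction map in [0, 1]

def dice_term(pred, target, eps=1e-6):
    """Soft Dice coefficient 2|P∩G| / (|P|+|G|) per sample."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (2.0 * inter + eps) / (union + eps)

def edge_loss(edge_preds, edge_gt, zetas):
    """Eq. (1): Ledge = 1 - sum_i zeta_i * Dice(Sedge^i, Gedge)."""
    terms = [z * dice_term(p, edge_gt).mean() for p, z in zip(edge_preds, zetas)]
    return 1.0 - torch.stack(terms).sum()

# hypothetical usage with the two low-level stages S1 and S2 (zeta_i = 1, as in Table 1)
heads = nn.ModuleList([SupervisionHead(64), SupervisionHead(128)])
s1, s2 = torch.randn(2, 64, 256, 256), torch.randn(2, 128, 128, 128)
edge_gt = (torch.rand(2, 1, 256, 256) > 0.9).float()
preds = [h(f, (256, 256)) for h, f in zip(heads, [s1, s2])]
loss = edge_loss(preds, edge_gt, zetas=[1.0, 1.0])
```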

Fig. 3.

An illustration of ESM and ASSM. Firstly, the low-resolution feature maps from stage Si are resized to the same size H×W as the input image by bilinear interpolation up-sampling. Then all high-resolution feature maps are reduced to a single feature map by 1×1 convolutions. Finally, each pixel value of the obtained feature map is converted to a probability by the Sigmoid function σ(·), and the prediction image of stage Si is obtained. (a) ESM: the edge supervision is achieved by comparing the obtained edge prediction image Sedge^i with the corresponding edge Ground Truth (GT) Gedge based on Eq. (1). (b) ASSM: the auxiliary semantic supervision is achieved by comparing the obtained coarse segmented image Smask^i with the corresponding segmentation mask Ground Truth (GT) Gmask based on Eq. (2).
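The paper states that the edge GT Gedge is generated from the segmentation mask but does not specify how. A common choice, shown below as an assumption, is the morphological gradient (dilation minus erosion) of the binary mask, which yields a thin contour.

```python
import torch
import torch.nn.functional as F

def edge_gt_from_mask(mask, thickness=1):
    """Derive an edge ground truth from a binary mask of shape (N, 1, H, W).
    Morphological gradient implemented with max-pooling: dilation minus erosion."""
    k = 2 * thickness + 1
    dilated = F.max_pool2d(mask, kernel_size=k, stride=1, padding=thickness)
    eroded = -F.max_pool2d(-mask, kernel_size=k, stride=1, padding=thickness)
    return (dilated - eroded).clamp(0.0, 1.0)

# example: a square lesion mask yields a thin contour around its border
mask = torch.zeros(1, 1, 64, 64)
mask[..., 20:40, 20:40] = 1.0
edge = edge_gt_from_mask(mask)
```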

3.3. Auxiliary semantic supervised module (ASSM)

For multi-scale object segmentation, a multi-level loss function is used to build receptive fields of different sizes for different layers of the network. For example, FPN [24] uses a multi-level auxiliary loss to detect objects at different scales, and it is a great breakthrough in multi-scale object detection. Inspired by this, we develop an ASSM based on a similar strategy in our network. Specifically, the semantic information is gradually strengthened along the down-sampling process in the encoder, and the high-level feature maps have rich semantics but low spatial resolution and no detailed information. Different layers contain semantic features of different levels according to the feature hierarchy of the contracting path. We therefore define an auxiliary semantic loss function to reduce the semantic gaps between high-level and low-level feature maps in the later stages (i.e., S3 to S5) of the encoder. Eventually, the semantic features in the low-level maps can be strengthened through the multi-scale skip connections and the AFM, which also reduces the background noise in the low-level feature maps.

Similar to the steps in the ESM, we can obtain a coarse segmented image of size H×W with a probability for each pixel through a series of operations, namely bilinear interpolation, 1×1 convolution, and the Sigmoid function σ(·) (shown in Fig. 3b). The auxiliary semantic loss function is then defined based on the Dice coefficient as follows.

Lsemantic = 1 - 2 × Σ_{i=l+1}^{5} ωi (Smask^i ∩ Gmask) / (Smask^i + Gmask)   (2)

where Smask^i and Gmask are the coarse segmented image obtained in the ith stage of the encoder and the Ground Truth (GT) segmentation mask, respectively. ωi (i = l+1, …, 5) is the weight coefficient of the ith stage.
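Eq. (2) mirrors Eq. (1) but compares the coarse per-stage predictions of the later encoder stages (S3 to S5) with the segmentation mask. A short sketch, reusing the same kind of supervision head as in the ESM example above, could look as follows; the names are illustrative, not the authors' code.

```python
import torch

def soft_dice(pred, target, eps=1e-6):
    """Soft Dice coefficient 2|P∩G| / (|P|+|G|) per sample."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (2.0 * inter + eps) / (union + eps)

def semantic_loss(mask_preds, mask_gt, omegas):
    """Eq. (2): Lsemantic = 1 - sum_{i=l+1}^{5} omega_i * Dice(Smask^i, Gmask),
    where mask_preds are the coarse predictions of the ASSM-supervised stages."""
    terms = [w * soft_dice(p, mask_gt).mean() for p, w in zip(mask_preds, omegas)]
    return 1.0 - torch.stack(terms).sum()
```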

3.4. Attention fusion module (AFM)

As mentioned above, high-level features are very effective in semantic segmentation tasks. However, high-level feature maps easily lead to inferior results for small or thin objects, because the convolution and pooling operations cause detailed information to be lost and leave high-level feature maps at coarse resolution. To compensate for the lost detailed information in high-level representations, it is necessary to import low-level features. However, full-scale skip connections can only combine low-level details with high-level semantics from feature maps at different scales of the same level, and the semantic gaps among the various levels hamper the effectiveness of semantic segmentation. Thus, we propose the AFM to fuse multi-scale feature maps of different levels by an attention mechanism, so as to strengthen and supplement the lost detailed information in high-level representations.

Gated Fully Fusion (GFF) [31] can selectively fuse features from multiple levels using gates in a fully connected way, and adds weights to each spatial position by using skip connections. Inspired by this idea, an attention mechanism is incorporated into the AFM by aggregating features of different levels, aiming at reducing the semantic gaps between low-level and high-level features. The corresponding attention mechanism is illustrated in Fig. 4. In general, in the standard U-Net, the segmentation maps can be obtained directly from the top feature map X1 (∈ R^{C×H×W}, where C, H and W are the channel number, height and width, respectively) of the expansive path. X1 has high spatial resolution because the output needs to have the same resolution as the input image; in practice, however, the multiple down-sampling and up-sampling operations in a deep network introduce errors and loss of detailed information. Therefore, in addition to strengthening the top feature map X1, we aggregate feature maps of the other levels (i.e., X2 to X5) to supplement the detailed information lost in the convolution and pooling operations.

Fig. 4.

An illustration of the attention mechanism. Xi^u represents the intermediate result of up-sampling the feature map Xi by bilinear interpolation; its 2D size is the same as that of the input image.

More precisely, we obtain a confidence map P1 (∈ R^{C×H×W}) through the attention block (AB) applied to the top feature map X1. Points with high confidence are more likely to retain the original feature map values, and vice versa. Similarly, the lost detailed information is represented by the confidence map 1 - P1, in which a higher value indicates less object information. Thus, we can strengthen the top feature map X1 through the dot product between the confidence map P1 and X1, and supplement the lost detailed information through the dot product between the confidence map 1 - P1 and the sum of the other feature maps. The procedure of the attention block is illustrated in Fig. 5, and the final prediction result Sp is defined as follows.

Sp = X1 + Y1 + (1 - P1) · Σ_{i=2}^{5} Yi   (3)

where Yi is the output obtained by processing the corresponding Xi with the attention block. Xi is first up-sampled to the same size as the input image by bilinear interpolation, and then Yi is obtained by processing the up-sampled intermediate result Xi^u with the attention block, which is defined as follows.

Yi = ΦA(Xi) = Pi · Xi^u   (4)

where ΦA(·) is the attention function.

Fig. 5.


The procedure of the attention block. The color bar represents the trends of confidence values, and the red and blue denote 1 and 0, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The specific process is as follows.

  • 1.

    Each up-sampling feature map (Xi) is processed through an attention block.

  • 2.

    After a 1×1 convolution operation, the channels are reduced to 64, and we obtain the ith-level feature maps.

  • 3.

    Then the resolution is resized to H×W by using bilinear interpolation.

  • 4.

    After a convolution and the Sigmoid function σ(·), we obtain the confidence output Yi through the dot product Yi = Pi · Xi^u. Note that the top feature map X1 is selected as the main prediction, while the other confidence outputs only serve as supplements to Y1 = ΦA(X1). When P1 is small, the corresponding confidence is low, and we can therefore compensate for the lost information by taking the dot product between (1 - P1) and the sum of the confidence outputs of the other feature maps Xi (i = 2, …, 5).

  • 5.

    Finally, the prediction result Sp is obtained by summing the residuals of X1. The specific process is shown in Algorithm 1, and a code sketch is given after it.

Algorithm 1.


Fusion algorithm.
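Since Algorithm 1 is only reproduced as an image, the following PyTorch sketch restates steps 1 to 5 and Eqs. (3)-(4) under a few assumptions: all feature maps are reduced to 64 channels, the confidence map Pi is produced by a 3×3 convolution plus Sigmoid, and a final 1×1 convolution with Sigmoid turns the fused map into a one-channel prediction. Module names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBlock(nn.Module):
    """Steps 1-4: reduce channels to 64 with a 1x1 conv, upsample to H x W,
    predict a confidence map Pi with a conv + Sigmoid, and return Yi = Pi * Xi^u."""
    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.confidence = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)

    def forward(self, x, out_size):
        x = self.reduce(x)
        x_up = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
        p = torch.sigmoid(self.confidence(x_up))
        return p * x_up, p                              # Yi, Pi

class AttentionFusionModule(nn.Module):
    """Step 5 / Eq. (3): Sp = X1 + Y1 + (1 - P1) * sum_{i=2..5} Yi, followed here
    by a 1x1 conv + Sigmoid to obtain a one-channel prediction (an assumption)."""
    def __init__(self, in_channels_list, mid_channels=64):
        super().__init__()
        self.blocks = nn.ModuleList([AttentionBlock(c, mid_channels) for c in in_channels_list])
        self.top_align = nn.Conv2d(in_channels_list[0], mid_channels, kernel_size=1)
        self.head = nn.Conv2d(mid_channels, 1, kernel_size=1)

    def forward(self, feats, out_size):
        # feats[0] is the top decoder map X1; feats[1:] are the lower levels X2..X5
        outs = [blk(f, out_size) for blk, f in zip(self.blocks, feats)]
        (y1, p1), rest = outs[0], outs[1:]
        x1 = F.interpolate(self.top_align(feats[0]), size=out_size,
                           mode="bilinear", align_corners=False)
        supplement = torch.stack([y for y, _ in rest]).sum(dim=0)
        fused = x1 + y1 + (1.0 - p1) * supplement       # Eq. (3)
        return torch.sigmoid(self.head(fused))          # Sp

# hypothetical usage with decoder maps of decreasing resolution
feats = [torch.randn(1, 64, 256, 256), torch.randn(1, 128, 128, 128),
         torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32),
         torch.randn(1, 1024, 16, 16)]
afm = AttentionFusionModule([64, 128, 256, 512, 1024])
sp = afm(feats, out_size=(256, 256))                    # shape (1, 1, 256, 256)
```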

The loss function for fusion is defined as follows:

Lfusion = 1 - 2 × (Sp ∩ G) / (Sp + G)   (5)

where G represents the ground truth of COVID-19.

Ltotal = θ Ledge + β Lsemantic + Lfusion   (6)

where θ and β are weight coefficients.

Considering that the cross-entropy loss function behaves poorly in the case of category imbalance, we select the Dice loss to supervise the predictions and labels in our experiments. To achieve deep fusion and supervision of features at different levels, the overall loss function integrates ESM, ASSM and AFM, as given in Eq. (6).
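Putting the pieces together, a minimal sketch of the fusion loss of Eq. (5) and the joint loss of Eq. (6) is shown below, with θ = 0.8 and β = 0.4 as in Table 1; edge_l and semantic_l stand for the values returned by the ESM and ASSM loss sketches above.

```python
import torch

def fusion_loss(fused_pred, gt, eps=1e-6):
    """Eq. (5): Lfusion = 1 - 2 |Sp ∩ G| / (|Sp| + |G|)."""
    inter = (fused_pred * gt).sum()
    union = fused_pred.sum() + gt.sum()
    return 1.0 - (2.0 * inter + eps) / (union + eps)

def total_loss(edge_l, semantic_l, fused_pred, gt, theta=0.8, beta=0.4):
    """Eq. (6): Ltotal = theta * Ledge + beta * Lsemantic + Lfusion."""
    return theta * edge_l + beta * semantic_l + fusion_loss(fused_pred, gt)
```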

4. Experiments

4.1. Datasets and baselines

We collect the COVID-19 segmentation dataset from two sources. One is [32], which includes more than 900 CT images, about 400 of which are slices with infections. The other is [33], which contains 3D CT images of 20 patients; we obtain 3686 images by converting the 3D volumes into 2D slices. Because each source is small, the two are combined into a total of 4449 2D slices, of which 4000 are used for training and 449 for testing. The GT contains four categories: 0 to 3 represent background, ground glass, consolidation and pleural effusion, respectively. Owing to the imbalance of infection categories in the dataset (for example, only a few slices contain pleural effusion), we treat all types of infection as one class. Considering the limitation of GPU memory, we resize the images from 512×512 to 256×256 by bilinear interpolation, and then apply Z-score normalization. Besides, to further verify the effectiveness and generalization ability of the proposed method, we select three additional public COVID-19 datasets for testing and comparison: MosMedData [42], UESTC-COVID-19 [41] and COVID-ChestCT [43]. MosMedData is a dataset of 100 axial CT images from more than 40 patients with COVID-19, including 829 slices of size 512×512 (see Morozov et al. [42] for details), and UESTC-COVID-19 contains CT scans (3D volumes) of 50 patients diagnosed with COVID-19 from 10 different hospitals (see Wang et al. [41] for details). COVID-ChestCT is a small dataset containing 20 CT scans of patients diagnosed with COVID-19, together with expert segmentations of lungs and infections (see Cohen et al. [43] for details).
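The slice preprocessing described above (bilinear resizing from 512×512 to 256×256 followed by Z-score normalization) can be sketched as follows; computing the statistics per slice, rather than over the whole training set, is an assumption.

```python
import numpy as np
import torch
import torch.nn.functional as F

def preprocess_slice(ct_slice: np.ndarray) -> torch.Tensor:
    """Resize a 512x512 CT slice to 256x256 with bilinear interpolation,
    then apply Z-score normalization (per slice, as an assumption)."""
    x = torch.from_numpy(ct_slice.astype(np.float32))[None, None]     # (1, 1, H, W)
    x = F.interpolate(x, size=(256, 256), mode="bilinear", align_corners=False)
    x = (x - x.mean()) / (x.std() + 1e-8)
    return x[0]                                                       # (1, 256, 256)
```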

We select ResUNet as the backbone of the proposed network, in which the down-sampling path of U-Net is replaced with ResNet. To verify the effectiveness of the proposed scheme, we compare against a series of popular models from the medical image segmentation area, such as U-Net [11], UNet++ [9] and Attention U-Net [34], as well as two cutting-edge models from semantic segmentation: DeepLabV3+ [18] and PSPNet [17].

4.2. Evaluation metrics and experimental settings

We adopt three metrics to evaluate our methods, namely the Dice similarity coefficient, Sensitivity (Sens.) and Precision (Prec.). Besides, we also introduce three golden metrics from the object detection field to verify the detection and segmentation performance, namely the Structure Measure [35], the Enhanced-alignment Measure [36] and the Mean Absolute Error. In our evaluation, we select Sp as the final output prediction, and measure the similarity/dissimilarity between Sp and the ground truth G, formulated as follows.

  • Dice similarity coefficient: it measures the proportion of intersection between Sp and G, defined as follows.
    Dice = 2 × (Sp ∩ G) / (Sp + G)   (7)
  • Structure Measure (Sα): it measures the structural similarity between a prediction Sp and the ground truth G, which is more consistent with the human visual system.
    Sα = (1 - α) × So(Sp, G) + α × Sr(Sp, G)   (8)
    where So and Sr are the object-aware similarity and region-aware similarity, respectively, and α is a balance factor between them. We report Sα using the default setting (α = 0.5) suggested in the original paper.
  • Sensitivity (Sens.): it measures the percentage of the ground-truth infection that is detected, i.e., the probability of no missed diagnosis. The formulation is given as follows.
    Sens. = (Sp ∩ G) / G   (9)
  • Precision (Prec.): it measures the percentage of the predicted infection that is correct, i.e., the probability of not misdiagnosing. The formulation is given as follows.
    Prec. = (Sp ∩ G) / Sp   (10)
  • Enhanced-alignment Measure (Eϕ^mean): it is a recently proposed metric for evaluating both local and global similarity between two binary maps. The formulation is given as follows.
    Eϕ = (1 / (w × h)) Σ_{x=1}^{w} Σ_{y=1}^{h} ϕ(Sp(x, y), G(x, y))   (11)
    where w and h are the width and height of the ground truth G, and (x, y) denotes the coordinate of each pixel in G. The symbol ϕ is the enhanced alignment matrix. We obtain a set of Eϕ values by converting the prediction Sp into binary masks with thresholds from 0 to 255, and we report the mean Eϕ computed over all thresholds.
  • Mean Absolute Error (MAE): it measures the pixel-wise error between Sp and G, defined as:
    MAE = (1 / (w × h)) Σ_{x=1}^{w} Σ_{y=1}^{h} |Sp(x, y) - G(x, y)|   (12)
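The pixel-counting metrics of Eqs. (7), (9), (10) and (12) can be computed as in the sketch below; binarizing the prediction at 0.5 is an assumption, and Sα and Eϕ are omitted here because they follow the definitions in [35] and [36].

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, thr: float = 0.5):
    """pred holds probabilities in [0, 1]; gt is a binary mask of the same shape."""
    p = (pred >= thr).astype(np.float64)
    g = gt.astype(np.float64)
    inter, eps = (p * g).sum(), 1e-8
    return {
        "Dice": 2.0 * inter / (p.sum() + g.sum() + eps),   # Eq. (7)
        "Sens": inter / (g.sum() + eps),                   # Eq. (9)
        "Prec": inter / (p.sum() + eps),                   # Eq. (10)
        "MAE": float(np.abs(pred - g).mean()),             # Eq. (12)
    }
```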

The hyper-parameters used in the experiments are given in Table 1 and were chosen by trial and error. Note that the learning rate is initially set to 1e-4, and is then reduced by a factor of 0.5 when the test loss has not improved within 25 epochs. Early stopping is used to avoid over-fitting. All experiments are conducted on a desktop computer with an E3-1230 v5 3.40 GHz 8-core processor and a GeForce GTX 1070 graphics card. A GPU implementation accelerates the forward and back propagation routines, using the Adam optimizer under the PyTorch framework. Each experiment is run three times, and the mean and standard deviation (±) are reported.

Table 1.

Hyperparameter setting.

Parameters Values
Input image size H×W 256×256
batch_size 8
learning rate 1e-4
Early stopping 25 epochs
θ 0.8
β 0.4
ζi (i = 1, …, l) 1
ωi (i = l+1, …, 5) 1
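The optimization settings of Section 4.2 and Table 1 (Adam, initial learning rate 1e-4, halved after 25 epochs without improvement of the test loss, with early stopping) map naturally onto PyTorch as follows; the epoch-loop skeleton and the maximum number of epochs are assumptions, and train_step/evaluate stand for user-supplied routines.

```python
import torch

def build_optimization(model, lr=1e-4, patience=25):
    """Adam plus a plateau scheduler that halves the learning rate after `patience`
    epochs without improvement of the monitored (test) loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=patience)
    return optimizer, scheduler

def fit(model, train_step, evaluate, max_epochs=500, patience=25):
    """Skeleton of the training loop with early stopping on the test loss."""
    optimizer, scheduler = build_optimization(model)
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(model, optimizer)          # one training epoch (user-supplied)
        test_loss = evaluate(model)           # loss on the held-out slices (user-supplied)
        scheduler.step(test_loss)
        if test_loss < best:
            best, best_epoch = test_loss, epoch
        elif epoch - best_epoch >= patience:  # early stopping
            break
    return best
```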

4.3. Experimental results

4.3.1. Quantitative results

A series of comparison experiments are implemented on our dataset, and the results are shown in Table 2. From Table 2, the proposed method achieves the best performance among these methods in Dice, Sens. and Prec. In particular, our method improves the main metric, the Dice coefficient, by around 4.4% and 1.44% compared with U-Net [11] and Inf-Net [6], respectively. UNet++ [9] and Attention U-Net [34] represent the best U-Net-based methods in the medical image processing area, while Inf-Net [6], CE-Net [38] and CPFNet [40] are the newest and best methods for medical image segmentation. This suggests that the proposed scheme is effective and competitive, and can effectively fuse multi-scale and multi-level features to accurately segment COVID-19 infections.

Table 2.

Comparisons between different networks on our dataset. Bold black text represents the best results.

Methods Dice(%) Sens.(%) Prec.(%) MAE(%) Eϕ(%) Sα(%)
U-Net [11] 85.56±0.33 85.38±1.53 85.76±0.89 0.72±0.01 94.21±0.02 81.23±0.23
UNet++ [9] 86.71±1.25 90.27±0.61 88.30±1.05 0.60±0.02 94.50±0.63 84.61±1.00
Attention U-Net [34] 87.40±0.26 89.48±0.49 89.88±0.53 0.58±0.03 94.74±0.87 84.71±0.87
PSPNet [17] 87.45±0.31 88.32±1.25 89.89±1.11 0.60±0.05 93.84±0.27 83.81±0.20
DeepLabV3+ [18] 87.81±0.19 89.24±0.96 90.72±0.66 0.58±0.02 95.58±0.22 86.03±0.95
Inf-Net [6] 88.49±0.17 90.07±0.35 90.39±0.18 0.55±0.01 95.70±0.24 86.55±0.09
SCRN [22] 86.24±0.08 83.64±0.36 89.65±0.59 0.60±0.015 95.02±0.40 84.09±0.26
F3Net [20] 87.99±1.45 85.14±2.23 91.08±0.17 0.58±0.025 93.51±0.63 86.35±2.03
DANet [28] 88.94±0.29 85.48±2.74 90.50±0.53 0.57±0.015 94.11±0.91 86.90±1.33
ACFNet [39] 83.25±0.18 83.88±0.10 83.06±0.25 0.34±0.001 85.21±0.15 90.62±0.05
CE-Net [38] 81.49±0.75 84.21±0.85 84.18±0.34 0.30±0.005 85.06±0.26 92.00±0.10
CPFNet [40] 85.19±0.14 84.66±1.32 85.22±1.05 0.31±0.002 86.38±0.07 92.09±0.20
ResUNet_C2F(Ours) 89.93±0.09 90.29±0.66 91.91±0.97 0.52±0.01 95.69±0.10 86.75±0.07

Besides, we further analyze the influence of edge supervision at different levels on segmentation performance by adding or removing level-wise edge supervision in the low-level features. To facilitate the analysis, ResUNet with Co-supervision and Fusion Model (ResUNet_CiF) denotes that the first i low-level stages (i.e., S1, …, Si) use ESM, while the remaining stages (i.e., Si+1, …, Sn) adopt ASSM in the Co-supervision, where n is the number of down-sampling stages (n=5 here). The results are shown in Table 3, and the Dice coefficient clearly first rises and then declines as the first level number i goes from 1 to 5. When i=2 (i.e., ResUNet_C2F), the proposed method obtains the best segmentation performance. This means that low-level boundary features are strengthened as i increases, while high-level semantic features are strengthened as i decreases. When i=2, there is a trade-off between the number of low-level and high-level stages (i.e., between the use of context and localization accuracy); consequently, ResUNet_C2F surpasses the other ResUNet_CiF variants in most metrics, such as Dice, MAE, Eϕ and Sα. More precisely, the proposed ESM and ASSM can incorporate low-level details with high-level semantics from feature maps of different levels through the AFM.

Table 3.

The results of different numbers of edge supervised stages on our dataset. Bold black text represents the best results.

Methods Dice(%) Sens.(%) Prec.(%) MAE(%) Eϕ(%) Sα(%)
ResUNet_C1F 89.16±0.49 88.03±1.49 92.08±1.05 0.56±0.03 95.27±0.12 85.59±0.80
ResUNet_C2F 89.93±0.09 90.29±0.66 91.91±0.97 0.52±0.01 95.69±0.10 86.75±0.07
ResUNet_C3F 89.44±0.14 90.15±0.88 91.90±1.13 0.55±0.01 95.30±0.49 85.41±0.09
ResUNet_C4F 89.40±0.33 90.66±0.45 91.12±0.92 0.58±0.02 95.32±0.24 85.35±0.43
ResUNet_C5F 88.33±0.89 90.28±0.67 90.05±0.45 0.58±0.06 95.05±1.30 85.29±1.58

4.3.2. Qualitative results

To further demonstrate the effectiveness of the proposed scheme, we visualize the prediction results of different networks. As shown in Fig. 6, our method remarkably outperforms the baseline methods in lung infection segmentation. Specifically, our segmentation results have much less mis-segmented tissue, while the baseline U-Net and other methods show a great deal of missing and improper segmentation. For the infection edge marked with a red box, for instance, our method obtains a complete edge that is much closer to the real label in edge detail, which benefits from the more detailed edge information provided by the proposed ESM. Besides, in the regions marked by the blue boxes, our method efficiently avoids over-segmentation, under-segmentation and incorrect segmentation. Especially in the 4th row, only our method and DeepLabV3+ can correctly detect the small infection (marked with the blue box). It can also be clearly observed that our method is better than DeepLabV3+ in the edge details of large targets (marked with the red box), because our method provides receptive fields of different sizes and has good segmentation performance for objects of different scales.

Fig. 6.


Visual qualitative comparison of lung infection segmentation results among U-Net, PSPNet, DeepLabv3+, Inf-Net and the proposed method. Column 1: the original CT image; Column 2: U-Net; Column 3: PSPNet; Column 4: DeepLabv3+; Column 5: Inf-Net; Column 6: our method; Column 7: the corresponding ground truth (GT).

Along the down-sampling process in U-Net, edge feature information becomes less and less, while semantic information becomes richer and richer. For further verification, we visualize the feature maps of different levels (i.e., from S1 to S5) in ResUNet_C5F. As shown in Fig. 7, the feature maps of the low-level outputs (S1 and S2) contain more details, and the feature map in S3 is the closest to the edge GT. With the deepening of down-sampling, edge feature information becomes less obvious. Through back propagation, more semantic information can be extracted from the high-level feature maps, as shown in S5. This demonstrates that our ESM at low levels and ASSM at high levels are very efficient for dealing with such a difficult segmentation task.

Fig. 7.


Visualization of each stage supervised by ESM. Column 1: the original CT image; Columns 2 to 6: S1 to S5; Column 7: the corresponding edge ground truth (GT).

4.4. Ablation experiments

To further analyse and test the validity of the proposed modules, a series of comparison experiments are conducted on our dataset using various combinations of ESM, ASSM and AFM on top of the baseline ResUNet. The experimental results are shown in the third row of Table 4, and each module can independently improve the Dice coefficient of infection segmentation. Among them, compared with the baseline ResUNet without any other modules, ASSM yields the greatest independent performance improvement, followed by AFM. The various combinations of ESM, ASSM and AFM also outperform the individual modules, and the combination of ASSM and AFM obtains slightly better performance than that of ESM and AFM. Finally, the combination of all three modules obtains the best performance; the reason is that the integration takes full advantage of them and obtains the optimal segmentation effect. Our network can be generalized to other segmentation applications owing to the effectiveness of its architecture.

Table 4.

Ablation experiments on our dataset. Bold black text represents the best results.

Baseline ESM ASSM ASSM* ESM* AFM Dice(%)
ResUnet 85.96±0.03
87.08±0.45
87.91±0.83
87.59±1.07
88.33±0.89
88.70±0.25
89.93±0.09
87.17±0.58
86.47±0.46
87.31±0.58
87.99±0.36
88.86±0.31
86.95±0.37
85.19±0.23
85.55±0.47
85.11±0.09
85.63±0.51

To test the effects of the proposed modules in the decoder, ESM and ASSM are applied separately or jointly in the up-sampling path. For convenience, * indicates the corresponding modules and stages in the up-sampling path (shown in Fig. 2). Owing to the symmetric structure between the encoder and decoder, ESM and ASSM are placed symmetrically in the low level (i.e., S1* to S2*) and high level (i.e., S3* to S5*) of the up-sampling path, respectively. The experimental results are shown in the fourth row of Table 4. Compared with the baseline method, the Dice performance can be improved to a certain extent when these modules are adopted separately or jointly in the up-sampling path; in particular, the combination of the three modules obtains the second best segmentation performance. However, the performance obtained in the up-sampling path is in general slightly worse than that of the corresponding down-sampling path. This means that the proposed Co-supervision scheme can guide the network in learning edge and semantic features in both the down-sampling and up-sampling paths, but the effect is more appreciable when the supervision modules are applied in the down-sampling path. The reason is that the levels of the down-sampling path contain richer primitive feature information than those of the up-sampling path, since the encoder is closer to the original input data, whereas the edge and semantic information suffer some loss and noise when higher-resolution layers are reconstructed by bilinear interpolation up-sampling. Accordingly, the supervision at the levels of the down-sampling path is stronger than that of the up-sampling path.

However, interestingly, the segmentation performance even decreases compared with the baseline method when the proposed Co-supervision scheme is applied simultaneously in the down-sampling and up-sampling paths, as shown in the fourth row of Table 4. Except for the combination of ESM, ESM* and AFM, all combinations across the down-sampling and up-sampling paths obtain poorer segmentation performance than the baseline method, while the combination of ESM, ESM* and AFM increases performance by about 1% over the baseline. The most probable cause is conflict and interference of the Co-supervision between the down-sampling and up-sampling paths. For example, the down-sampling path (i.e., the encoder) encodes the input image into feature representations at multiple different levels, thereby capturing the context of the image, such as edge details, while the up-sampling path (i.e., the decoder) semantically projects the discriminative features (lower resolution) learned by the encoder onto the pixel space (higher resolution) to obtain precise localization. Correspondingly, the loss function puts more emphasis on edge details in the encoder path, while highlighting localization information in the decoder path. But all the feature maps of the decoder come from the encoder by concatenation and up-sampling, which results in conflict and interference between the encoder and decoder when the Co-supervision modules are applied simultaneously in the two paths.

4.5. Comparison of fusion methods

Multi-level feature fusion means that feature maps of different levels are integrated to enrich the feature information; traditional fusion approaches usually use feature addition or concatenation. Addition sums multiple feature maps into one, which increases the amount of information carried by each feature describing the image. Concatenation combines feature maps along the channel dimension, which increases the number of features describing the image but not the information carried by each feature. To further verify the advantage of the proposed AFM, a series of comparison experiments are carried out using only different fusion approaches, and the segmentation results are shown in Table 5. It can be seen that the proposed AFM surpasses the other two methods in all metrics except Sα(%). The reason is that in the adding or concatenating process all feature maps are fused evenly with the same importance. This is clearly unreasonable, because there are great differences between the feature representations of different levels, and a simple addition or concatenation is not sufficient to adaptively compensate high-level semantic features with low-level fine details. Meanwhile, the concatenation operation allows the subsequent convolution layers to reduce the weight of feature maps with poor semantics while retaining rich semantic features in the channels, whereas the addition operation can weaken the discrimination of features because of the simple pixel-wise summation of feature maps. Therefore, the concatenation fusion method surpasses the addition operation.

Table 5.

The results of different fusion methods on our dataset.

Methods Dice(%) Sens.(%) Prec.(%) MAE(%) Eϕ(%) Sα(%)
Add 83.59±2.14 85.16±1.91 81.07±0.83 0.85±0.13 93.66±0.10 80.12±0.84
Concatenate 86.75±1.38 87.00±0.92 86.93±1.03 0.64±0.08 94.39±1.04 84.12±1.27
Attention 87.59±1.07 88.04±1.12 87.18±1.36 0.59±0.05 95.05±1.30 83.89±1.21
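For reference, the two traditional fusion baselines compared in Table 5 can be sketched as follows (the attention-based fusion is the AFM of Section 3.4); both sketches assume the level-wise feature maps have already been brought to a common channel count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_by_addition(feats, out_size):
    """Element-wise sum after upsampling: every level contributes with equal importance."""
    ups = [F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
           for f in feats]
    return torch.stack(ups).sum(dim=0)

class FuseByConcatenation(nn.Module):
    """Channel concatenation followed by a 1x1 conv, which lets the following
    convolution re-weight levels with poor semantics."""
    def __init__(self, channels, num_levels):
        super().__init__()
        self.project = nn.Conv2d(channels * num_levels, channels, kernel_size=1)

    def forward(self, feats, out_size):
        ups = [F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
               for f in feats]
        return self.project(torch.cat(ups, dim=1))
```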

Fig. 8 illustrates the visual results of the fusion process utilizing the attention mechanism. Y1 is processed only by the attention block (AB), and is thus the output closest to the segmentation prediction of the baseline. Sp is the segmentation result obtained by fusing multiple levels of feature maps, which achieves both high resolution and rich semantics by combining the complementary strengths of the multiple levels. It is obvious that Sp is more complete than Y1, and it loses less information. The obtained confidence map P1 attaches importance to the main prediction to ensure that most of the information is retained. As a complement, the confidence map 1 - P1 pays attention to the lost detailed information, and it can exploit sufficient spatial and semantic features to supplement the lost details by fusing the different levels. Thus, the proposed method overcomes the under-segmentation problem of the baseline and retains multi-scale contextual information from multiple levels.

Fig. 8.


Visual results of the fusion process based on the proposed AFM. Column 1: the original CT image; Column 2: the obtained confidence map P1; Column 3: the confidence map 1 - P1 of the lost detailed information; Column 4: the major result Y1 from the top feature map X1 through the attention block (AB); Column 5: the final prediction result Sp; Column 6: the corresponding ground truth (GT).

4.6. Comparisons on other COVID-19 datasets

To further verify the effectiveness and generalization ability, a series of comparison experiments are conducted on MosMedData [42], UESTC-COVID-19 [41] and COVID-ChestCT [43], respectively. We select three important metrics for the evaluation of COVID-19 lung infection segmentation: Dice, Sens. and MAE. The results are shown in Table 6, Table 7 and Table 8. On the MosMedData dataset, our method is only slightly superior to Attention U-Net [34] and UNet++ [9] in the Dice metric, but it is 3.06% better than its nearest competitor F3Net [20] in Sensitivity (Sens.), and it achieves the best performance among all methods in all three metrics (shown in Table 6). On the UESTC-COVID-19 dataset, our method is slightly better than its nearest competitor in the Dice and MAE metrics, and slightly lower than its nearest competitor in Sensitivity (Sens.). Overall, our method obtains the best comprehensive performance among these methods (shown in Table 7). As for COVID-ChestCT, our method achieves the first, first and third best performance in Sensitivity (Sens.), MAE and Dice, respectively. Compared with the other methods, our method also achieves the best overall performance (shown in Table 8). From the above results, our method ranks among the top three for the various datasets in all metrics, and has the best comprehensive performance compared with the other methods.

Table 6.

Performance comparisons between different methods on MosMedData. Bold black text represents the best results.

Methods Dice(%) Sens.(%) MAE(%)
U-Net [11] 80.39±9.865 64.32±19.197 8.26±10.959
UNet++ [9] 87.27±2.102 74.80±7.508 1.60±2.223
Attention U-Net [34] 87.42±0.395 81.92±3.566 0.30±0.059
PSPNet [17] 82.38±1.048 79.11±2.323 0.37±0.040
DeepLabV3+ [18] 83.70±1.166 80.01±0.350 0.34±0.026
Inf-Net [6] 78.64±0.390 78.17±4.277 0.44±0.015
SCRN [22] 87.13±0.070 82.42±3.957 0.26±0.008
F3Net [20] 83.66±0.337 82.58±1.121 0.33±0.006
DANet [28] 84.84±4.435 70.65±8.948 2.94±4.268
ACFNet [39] 79.40±1.114 79.07±3.978 0.42±0.023
CE-Net [38] 86.11±0.371 76.17±0.121 0.28±0.012
CPFNet [40] 86.79±0.266 73.50±4.910 0.42±0.101
ResUNet_C2F(Ours) 87.43±0.165 85.64±3.013 0.253±0.006

Table 7.

Performance comparisons between different methods on UESTC-COVID-19. Bold black text represents the best results.

Methods Dice(%) Sens.(%) MAE(%)
U-Net [11] 85.48±0.125 75.90±3.466 0.47±0.060
UNet++ [9] 85.13±0.296 76.41±1.636 0.47±0.010
Attention U-Net [34] 84.62±0.539 77.04±2.484 0.50±0.035
PSPNet [17] 82.67±0.248 76.45±1.246 0.56±0.006
DeepLabV3+ [18] 80.13±1.212 70.88±2.573 0.63±0.049
Inf-Net [6] 83.26±0.440 77.45±1.810 0.54±0.015
SCRN [22] 83.78±0.402 77.75±2.949 0.52±0.017
F3Net [20] 83.58±0.974 78.56±1.178 0.53±0.026
DANet [28] 85.40±0.745 79.40±2.175 0.47±0.026
ACFNet [39] 84.31±0.435 79.83±1.799 0.50±0.020
CE-Net [38] 85.45±0.420 77.38±7.962 0.47±0.016
CPFNet [40] 85.36±0.182 77.35±1.151 0.47±0.015
ResUNet_C2F(Ours) 85.52±0.081 79.46±2.286 0.47±0.006

Table 8.

Performance comparisons between different methods on COVID-ChestCT. Bold black text represents the best results.

Methods Dice(%) Sens.(%) MAE(%)
U-Net [11] 71.86±0.240 79.31±0.950 0.75±0.012
UNet++ [9] 71.62±0.412 72.77±5.927 0.93±0.173
Attention U-Net [34] 70.44±1.420 75.27±7.900 0.95±0.303
PSPNet [17] 65.95±1.979 80.95±4.126 0.93±0.068
DeepLabV3+ [18] 59.41±1.213 74.27±7.419 1.05±0.095
Inf-Net [6] 59.63±2.735 78.32±0.131 1.09±0.107
SCRN [22] 68.46±1.345 74.33±6.288 1.15±0.471
F3Net [20] 68.62±0.477 83.30±5.989 0.83±0.031
DANet [28] 71.10±0.731 72.45±6.953 0.92±0.182
ACFNet [39] 68.69±1.979 76.36±8.352 0.84±0.042
CE-Net [38] 73.72±0.583 71.38±1.666 0.75±0.012
CPFNet [40] 74.65±1.163 81.24±1.700 1.01±0.411
ResUNet_C2F(Ours) 72.81±0.148 83.89±1.358 0.74±0.017

5. Conclusion

It is still a challenging task to accurately segment the infected lesions of COVID-19 on CT images owing to their irregular shapes, various sizes and the indistinguishable boundaries between normal and infected tissues. In this paper, a novel segmentation scheme is proposed for the infection segmentation of COVID-19 on CT images. To achieve this, we propose three modules for deep collaborative supervision and attention fusion based on ResUnet. To verify the effectiveness of the proposed scheme, a series of experiments are conducted on four COVID-19 datasets. The results show that our method achieves the best performance for most of the datasets and metrics, such as Dice, Sensitivity (Sens.) and MAE, and has better generalization performance compared with existing approaches.

The proposed technique has four advantages. Firstly, it is able to capture rich spatial information at various scales through an edge supervised module (ESM), which incorporates the edge supervised information into the initial stage of down-sampling in the ResUnet framework. As low-level layers contain richer object boundaries, they are used to define the edge supervised loss function to capture rich spatial information. The main benefit of this module is to highlight low-level boundary features and provide useful fine-grained constraints to guide feature extraction in semantic segmentation tasks. Secondly, the proposed method can explore semantic information from infections of various scales on COVID-19 CT images by using an auxiliary semantic supervised module (ASSM) that integrates the mask supervised information into the later stage of down-sampling. The main advantage of this module is to strengthen high-level semantic information during the feature extraction process. Thirdly, we propose an attention fusion module (AFM) to fuse multi-scale feature maps of different levels from the up-sampling stage in order to reduce the semantic gaps between high-level and low-level feature maps. The main advantage of this module is to strengthen and supplement the lost detailed information in high-level representations. Lastly, we construct a joint loss function by combining the edge supervised loss, the auxiliary semantic supervised loss and the fusion loss. The joint loss guides the network in learning the features of COVID-19 infections, thereby achieving deep collaborative supervision on edges and semantics. Meanwhile, it also acts as an incentive to effectively fuse multi-scale feature maps of different levels.

Although our network achieves good results in segmenting the overall infection region, it is not sufficient for estimating the severity of COVID-19 infection, because finer segmentation of the different infection regions is required. In the future, we plan to collect a larger amount of COVID-19 data and consider further recognizing the severity of COVID-19 according to the area, size, and location of infections. The code is publicly available at https://github.com/slz674763180/COVID19. The package includes the proposed three modules and the joint loss function for reproducibility purposes.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to express their appreciation to the referees for their helpful comments and suggestions. This work was supported in part by the Zhejiang Provincial Natural Science Foundation of China (Grant nos. LGF20H180002 and GF22F037921), in part by the National Natural Science Foundation of China (Grant nos. 61802347, 61801428 and 61972354), in part by the National Key Research and Development Program of China (Grant no. 2018YFB1305202), and in part by the Microsystems Technology Key Laboratory Foundation of China.

Biographies


Haigen Hu received the Ph.D. degree in Control Theory and Control Engineering from Tongji University, Shanghai, China, in 2013. He completed a post-doctoral fellowship at the LITIS Laboratory, Université de Rouen, France, in 2019. He is currently an associate professor in the College of Computer Science and Technology at Zhejiang University of Technology, China. His current research interests include deep learning, computer vision and medical image processing.


Leizhao Shen was born in Zhejiang, China, in 1996, and received a bachelor's degree from Zhejiang University of Technology in 2018. He has been a postgraduate student at Zhejiang University of Technology, Zhejiang, China, since 2018. His current research interests include machine learning (deep learning) and medical image processing.


Qiu Guan is currently a Professor and Ph.D. supervisor with the College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China. Her research interests include computer vision and medical image computing and understanding. Sponsored by the China Scholarship Council in 2007, she was a visiting scholar at University College London, UK, for one year, where she focused on medical image processing. She has carried out a number of national (NSF 61103140, U20A20171, 60870002) and provincial (LY21F020027, 2015C33073, 2014C33110, 2010C33095) research projects as a PI or co-PI in these fields, resulting in a number of papers published in international journals and conference proceedings.


Xiaoxin Li received the B.E. and M.E. degrees from the Wuhan University of Technology, Wuhan, China, in 2002 and 2005, respectively, and the Ph.D. degree from the South China University of Technology, Guangzhou, China, in 2009. Since 2009, he has been a Postdoctoral Research Fellow with the Center for Computer Vision, School of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou, China. He is also currently a Lecturer with the College of Computer Science and Technology, Faculty of Information Technology, Zhejiang University of Technology, Hangzhou, China. His current research interests include image processing, statistical data analysis, and bioinformatics.


Qianwei Zhou (A'17–M'19) received the Ph.D. degree in communication and information systems from the Shanghai Institute of Microsystem and Information Technology, University of Chinese Academy of Sciences, Shanghai, China, in 2014. In July 2014, he joined Zhejiang University of Technology, Hangzhou, China, where he is currently a Research Scientist in the College of Computer Science. His research has been published in international journals including the IEEE Transactions on Industrial Electronics, IEEE Transactions on Magnetics, IEEE Signal Processing Letters, IEEE Transactions on Instrumentation and Measurement, and the Journal of Sound and Vibration. His research interests include the intersection of machine learning and computer-aided design, Internet-of-Things-related signal processing, pattern recognition, and algorithms for the understanding of medical images.


Su Ruan (Senior Member, IEEE) received the M.S. and Ph.D. degrees in image processing from the University of Rennes, France, in 1989 and 1993, respectively. From 2003 to 2010, she was a Full Professor with the University of Reims Champagne-Ardenne, France. She is currently a Full Professor with the Department of Medicine and the Leader of the QuantIF Team, LITIS Research Laboratory, University of Rouen, France. Her research interests include pattern recognition, machine learning, information fusion, and medical imaging.

References

1. Wang C., Horby P.W., Hayden F.G., Gao G.F. A novel coronavirus outbreak of global health concern. Lancet. 2020;395(10223):470–473. doi: 10.1016/S0140-6736(20)30185-9.
2. Huang C., Wang Y., et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet. 2020;395(10223):497–506. doi: 10.1016/S0140-6736(20)30183-5.
3. He K., Zhao W., Xie X., Ji W., Liu M., Tang Z., Shi Y., Shi F., Gao Y., Liu J., Zhang J., Shen D. Synergistic learning of lung lobe segmentation and hierarchical multi-instance classification for automated severity assessment of COVID-19 in CT images. Pattern Recognit. 2021;113:107828. doi: 10.1016/j.patcog.2021.107828.
4. Rorat M., Jurek T., Simon K., Guziński M. Value of quantitative analysis in lung computed tomography in patients severely ill with COVID-19. PLoS One. 2021;16(5):e0251946. doi: 10.1371/journal.pone.0251946.
5. Amyar A., Modzelewski R., Li H., Ruan S. Multi-task deep learning based CT imaging analysis for COVID-19 pneumonia: classification and segmentation. Comput. Biol. Med. 2020;126:104037. doi: 10.1016/j.compbiomed.2020.104037.
6. Fan D.-P., Zhou T., Ji G.-P., et al. Inf-Net: automatic COVID-19 lung infection segmentation from CT scans. IEEE Trans. Med. Imaging. 2020;39(8):2626–2637. doi: 10.1109/TMI.2020.2996645.
7. Wang S., Kang B., Ma J., et al. A deep learning algorithm using CT images to screen for corona virus disease (COVID-19). Eur. Radiol. 2021;31(8):6096–6104. doi: 10.1007/s00330-021-07715-1.
8. Chen J., Wu L., Zhang J., et al. Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography. Sci. Rep. 2020;10(1):19196. doi: 10.1038/s41598-020-76282-0.
9. Zhou Z., Rahman Siddiquee M.M., Tajbakhsh N., Liang J. UNet++: a nested U-Net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Vol. 11045. 2018. pp. 3–11.
10. Oulefki A., Agaian S., Trongtirakul T., Laouar A.K. Automatic COVID-19 lung infected region segmentation and measurement using CT-scans images. Pattern Recognit. 2021;114:107747. doi: 10.1016/j.patcog.2020.107747.
11. Ronneberger O., Fischer P., Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany. Vol. 9351. 2015. pp. 234–241.
12. Deng L., Yu D. Deep learning: methods and applications. Found. Trends Signal Process. 2013;7(3–4):197–387.
13. Krizhevsky A., Sutskever I., Hinton G.E. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, California, USA. 2012. pp. 1097–1105.
14. Simonyan K., Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Computer Vision and Pattern Recognition (CVPR), Columbus, USA. 2014.
15. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA. 2016. pp. 770–778.
16. Long J., Shelhamer E., Darrell T. Fully convolutional networks for semantic segmentation. In: Computer Vision and Pattern Recognition (CVPR), Boston, USA. 2015. pp. 3431–3440.
17. Zhao H., Shi J., Qi X., Wang X., Jia J. Pyramid scene parsing network. In: Conference on Computer Vision and Pattern Recognition (CVPR), Hawaii, USA. 2017. pp. 2881–2890.
18. Chen L.-C., Zhu Y., Papandreou G., Schroff F., Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: European Conference on Computer Vision (ECCV), Munich, Germany. 2018. pp. 801–818.
19. Zhao H., Zhang Y., Liu S., Shi J., Loy C.C., Lin D., Jia J. PSANet: point-wise spatial attention network for scene parsing. In: European Conference on Computer Vision (ECCV), Munich, Germany. 2018. pp. 267–283.
20. Wei J., Wang S., Huang Q. F3Net: fusion, feedback and focus for salient object detection. In: AAAI Conference on Artificial Intelligence (AAAI 2020). 2020. pp. 12321–12328.
21. Zhang H., Dana K., Shi J., Zhang Z., Wang X., Tyagi A., Agrawal A. Context encoding for semantic segmentation. In: Computer Vision and Pattern Recognition (CVPR), Utah, USA. 2018. pp. 7151–7160.
22. Wu Z., Su L., Huang Q. Stacked cross refinement network for edge-aware salient object detection. In: International Conference on Computer Vision (ICCV), Seoul, South Korea. 2019. pp. 7264–7273.
23. Zhang Z., Fu H., Dai H., Shen J., Pang Y., Shao L. ET-Net: a generic edge-attention guidance network for medical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Shenzhen, China. Vol. 11764. 2019.
24. Lin T.-Y., Dollár P., Girshick R.B., He K., Hariharan B., Belongie S.J. Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017. pp. 936–944.
25. Ren S., He K., Girshick R., Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. 2015. pp. 91–99.
26. He K., Gkioxari G., Dollár P., Girshick R. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. pp. 2961–2969.
27. Hu H., Guan Q., Chen S., Ji Z., Yao L. Detection and recognition for life state of cell cancer using two-stage cascade CNNs. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020;17(3):887–898. doi: 10.1109/TCBB.2017.2780842.
28. Fu J., Liu J., Tian H., Li Y., Bao Y., Fang Z., Lu H. Dual attention network for scene segmentation. In: Conference on Computer Vision and Pattern Recognition (CVPR), California, USA. 2019. pp. 3146–3154.
29. Huang Z., Wang X., Huang L., Huang C., Wei Y., Liu W. CCNet: criss-cross attention for semantic segmentation. In: International Conference on Computer Vision (ICCV), Seoul, South Korea. 2019. pp. 603–612.
30. Wang X., Girshick R., Gupta A., He K. Non-local neural networks. In: Conference on Computer Vision and Pattern Recognition (CVPR), Utah, USA. 2018. pp. 7794–7803.
31. Li X., Zhao H., Han L., Tong Y., Yang K. GFF: gated fully fusion for semantic segmentation. In: Association for the Advancement of Artificial Intelligence (AAAI), New York, USA. 2019.
32. COVID-19 CT segmentation dataset. 2020. https://medicalsegmentation.com/covid19/.
33. COVID-19 CT segmentation dataset. 2020. https://gitee.com/junma11/COVID-19-CT-Seg-Benchmark.
34. Oktay O., Schlemper J., et al. Attention U-Net: learning where to look for the pancreas. In: International Conference on Medical Imaging with Deep Learning (MIDL). 2018.
35. Fan D.-P., Cheng M.-M., Liu Y., Li T., Borji A. Structure-measure: a new way to evaluate foreground maps. In: IEEE International Conference on Computer Vision (ICCV). 2017. pp. 4548–4557.
36. Fan D.-P., Gong C., Cao Y., Ren B., Cheng M.-M., Borji A. Enhanced-alignment measure for binary foreground map evaluation. In: International Joint Conference on Artificial Intelligence (IJCAI), Stockholm. 2018. pp. 698–704.
37. Hu H., Liu A., Zhou Q., Guan Q., Li X., Chen Q. An adaptive learning method of anchor shape priors for biological cells detection and segmentation. Comput. Methods Programs Biomed. 2021;208:106260. doi: 10.1016/j.cmpb.2021.106260.
38. Gu Z., Cheng J., Fu H., et al. CE-Net: context encoder network for 2D medical image segmentation. IEEE Trans. Med. Imaging. 2019;38(10):2281–2292. doi: 10.1109/TMI.2019.2903562.
39. Zhang F., Chen Y., Li Z., Hong Z., Liu J., Ma F., Han J., Ding E. ACFNet: attentional class feature network for semantic segmentation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019. pp. 6797–6806.
40. Feng S., Zhao H., Shi F., et al. CPFNet: context pyramid fusion network for medical image segmentation. IEEE Trans. Med. Imaging. 2020;39:3008–3018. doi: 10.1109/TMI.2020.2983721.
41. Wang G., Liu X., Li C., Xu Z., Ruan J., Zhu H., Meng T., Li K., Huang N., Zhang S. A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images. IEEE Trans. Med. Imaging. 2020;39(8):2653–2663. doi: 10.1109/TMI.2020.3000314.
42. Morozov S.P., Andreychenko A.E., Blokhin I.A., et al. MosMedData: data set of 1110 chest CT scans performed during the COVID-19 epidemic. Digit. Diagn. 2020;1(1):49–59. doi: 10.17816/DD46826.
43. Cohen J.P., Morrison P., Dao L., et al. COVID-19 image data collection: prospective predictions are the future. 2020. https://github.com/ieee8023/covid-chestxray-dataset. arXiv:2006.11988.
