Two-Stage CNN Whole Heart Segmentation Combining Image Enhanced Attention Mechanism and Metric Classification

Xuchu Wang; Fusheng Wang; Yanmin Niu

doi:10.1007/s10278-022-00708-6

2022 Sep 29;36(1):124–142. doi: 10.1007/s10278-022-00708-6

PMCID: PMC9984643 PMID: 36175600

Abstract

Accurate segmentation of multiple tissues and organs in cardiac medical imaging is of great value in computer-aided cardiovascular diagnosis. However, it is challenging due to the complex distribution of the various tissues and organs in cardiac MRI (magnetic resonance imaging) slices, the low discriminability of some organs, and the large size span of the organs. To handle these problems, a two-stage CNN (convolutional neural network) segmentation method based on the combination of a Log-Gabor filter attention mechanism and metric classification is proposed. The Log-Gabor filterbank is applied to selectively enhance the texture and contour information of each tissue and organ, and the spatial and channel attention mechanism, jointly with the varying kernel size of the Log-Gabor filterbank, is incorporated into the codec structure to adaptively extract target features of different sizes and focus on the discriminative features in the network. To solve the problem of insufficient segmentation on subtle and adherent edges involving different tissues, a metric classification network is incorporated to finely optimize the hard-to-segment boundaries. The proposed method was tested on a cardiac MRI data set to segment 7 cardiac tissues, and the rationality and effectiveness of the method were verified. In comparison to a series of deep learning-based segmentation models, the proposed method achieves competitive performance.

Keywords: Cardiac multi-target segmentation, Metric classification, Image enhancement, Attention mechanism, Convolutional neural network

Introduction

The segmentation of various tissues and organs in cardiac magnetic resonance imaging (MRI) slices is a key step in computer-aided diagnosis and treatment. The precise segmentation of clinical targets is of great significance to assist doctors in judging the health of tissues and organs. An inaccurate segmentation result often brings poor clinical experience to patients. However, due to the complex distribution and close arrangement of various organs and tissues in MRI, the obscure gray level distinction of cardiac organs, and the characteristics of large variability in the morphological size of various tissues and organs in different slices, accurate segmentation of multiple tissues and organs in cardiac MRI is often full of challenges [1, 2].

Related Work

From the viewpoint of machine learning-based segmentation strategy, cardiac image segmentation can be coarsely categorized into traditional learning-based methods and deep learning-based ones. Traditional methods use the gray-scale features, edge features and hand-crafted features of the image itself to help the classifier or regressor separate tissues. Due to the poor representational power of traditional features, their segmentation accuracy is limited. In recent years, convolutional neural networks (CNNs) have promoted the development of various segmentation models. According to the connection style, existing segmentation models can be roughly divided into methods based on nesting and skip connections, methods based on collecting global scale information, and methods based on variants of conditional random field (CRF) or recurrent neural network (RNN) [3].

The methods based on nesting and skip connections mainly include the fully convolutional network (FCN) [4], SegNet [5], and the U-Net [6] series. FCN is the first segmentation network in which the feature resolution is gradually restored in the up-sampling stage to achieve pixel-level classification, while both U-Net and SegNet are symmetrical codec networks. The down-sampling stage extracts shallow image detail information, the up-sampling stage gradually restores the image spatial information, and skip connections are incorporated for feature fusion. U-Net concatenates feature channels from different levels, while SegNet uses recorded pooling indices and deconvolution to fuse shallow features, which brings a certain improvement in fine-grained segmentation results.

Recently, an attention gate structure was proposed in Attention U-Net [7]; it focuses on learning target features without additional supervision that would significantly increase the number of model parameters. Furthermore, an encoding and decoding network structure based on dense skip connections is introduced in U-Net++ [8], which captures features at different levels through the dense skip connections and reduces the semantic differences between the encoding and decoding processes. The segmentation speed is improved by using different branches according to different segmentation tasks. U-Net3+ [9] changed the interconnection between the encoder and the decoder and the internal connection between the decoder subnets. In this model, a full-scale skip connection is designed to connect the output in two sides with a mixed loss for supervised extraction of features. The skip connections in these networks have been proven effective in recovering fine-grained details of the target, but over a series of encoding and decoding steps, the edge information of large objects and the small objects themselves are easily lost, resulting in blurred object edge details.

The methods based on global scale information mainly include PspNet [10], ParseNet [11], RefineNet [12], and the DeepLab series [13–15]. PspNet proposes a global pyramid pooling module to search for regional context information and embed difficult-to-analyze scene features into the FCN-based prediction framework to improve segmentation accuracy. ParseNet obtains global information through global average pooling, and merges the pooled global information with convolutional features. However, excessive acquisition of global information can easily misclassify the occluded part as a target. RefineNet introduces a multi-path refinement network to further utilize the finer information along the down-sampling path. DeepLabv1 proposes atrous (dilated) convolution to solve the problem of resolution degradation caused by down-sampling and pooling. The atrous convolution enlarges the receptive field and makes it easy to capture the global scale information of the target. At the same time, DeepLabv1 introduces the conditional random field to improve the model's ability to capture fine details. Both DeepLabv2 and DeepLabv3 use atrous convolution instead of down-sampling pooling, and introduce spatial pyramids to expand the receptive field and extract multi-scale information. In addition, some studies rely on the attention mechanism to obtain multi-scale global context information [16–18]. Although global scale information and different levels of semantic information can improve the accuracy of segmentation, how to model the global context more effectively and use semantics for explicit reasoning requires further exploration.

The methods based on CRF or RNN variants mainly include CRFasRNN [19], DPN [20], DAG-RNN [21], and DD-RNN [22]. CRFasRNN introduces a fully connected CRF into the back end of the network as a recurrent calculation layer. Through joint training with CNNs, the prediction accuracy is improved. DPN uses a local convolutional layer to model the smoothness terms in CRFs, and integrates them to generate a deep parsing network, which alleviates the time-consuming problem of recursive calculation in the combination of RNN and CNN. However, for computational reasons, the structured labeling energy used in CRFs usually cannot enforce high-order contextual consistency, because it only constrains label compatibility on small cliques (for example, pairwise compatibility). In addition, CRFs may smooth away small-sized objects, which is not ideal for scene segmentation. DAG-RNN proposes a context aggregation module to assemble the semantic information between contexts and extract the dependencies between pixels. DD-RNN proposes a denser RNN to mine the rich dependencies of images, and simultaneously introduces the attention mechanism into the network to select useful dependencies for improving the accuracy of segmentation. RNNs can be used to capture semantic dependencies in images and improve the recognition of fuzzy areas at the edges of objects. However, dense RNNs cause feature redundancy and require a lot of computing time, while sparse RNNs cause useful dependency information to be lost. How to choose a reasonable density in image modeling for specific target objects still needs further investigation.

Traditional medical image segmentation is feature-oriented, requiring manual intervention and special medical knowledge to define contours, such as intensity distribution thresholds [23] or active contour deformation models [24], which usually have low accuracy and limited robustness. To avoid the shortcomings of traditional segmentation methods, many researchers have migrated the CNN model to the field of medical segmentation, including lung [25], liver [26], heart [27, 28], and coronary angiography [29]. Tran et al. [30] first used FCN for left and right ventricle segmentation. In their method, a 15-layer FCN model is used, and mean-variance normalization is applied after each convolution layer to maintain the uniformity of the data. However, the model has many parameters and long training and testing times, and it is not ideal for patients with myocardial hypertrophy. Avendi et al. [31] combined a CNN with a deformable model to automatically segment the left ventricle with improved accuracy and robustness. Similarly, Ngo et al. [32] proposed an improved segmentation network by combining a CNN and a level set algorithm. Payer et al. [33] applied the contextual information of the heart substructure to the U-Net network by adding a tag transformation network, and achieved competitive performance. Xia et al. [34] proposed an uncertainty-aware multi-view collaborative training framework, and achieved good results in pancreas and lung segmentation using a semi-supervised strategy.

Motivation and Contribution

The segmentation models above are usually designed independently of the image category, so their performance varies with the characteristics of the data set. Generally, because natural images are cheap to acquire and available through many channels, natural image segmentation is often driven by a large amount of data, which reduces the influence of remarkable target differences and outliers. At present, many algorithms have reached satisfactory performance levels in multi-class segmentation. However, due to the complexity and variety of cardiac medical images, it is often difficult to obtain accurate segmentation results on small-sample data sets. Many methods were designed for general purposes, whereas real applications require a number of specific adjustments for a specific problem with specific equipment or a specific data set. The U-Net series has achieved good results in spleen and liver segmentation, but the algorithm focuses on two-class segmentation, which is difficult to apply to multi-target segmentation scenes in medical-assisted diagnosis. In addition, the U-Net series only considers optimization or improvement of the network structure instead of starting from the complexity and particularity of the original images themselves, so the problem of edge information loss inevitably occurs in the encoding and decoding process.

To this end, we propose a two-stage CNN method for segmenting multiple tissues in cardiac MRI images by integrating a Log-Gabor filterbank, attention mechanisms and metric classification. Starting from the complexity of the medical image itself, the Log-Gabor filter is used to mine the rich detailed texture and edge information at different scales and directions in the image. Combined with the attention mechanism, it can selectively enhance the original image to highlight the characteristics of different tissues and organs. Because the edge information is enhanced, it helps to solve the problem of edge information loss in the encoding and decoding process. Furthermore, the kernel size attention mechanism [35] and the spatial and channel attention mechanism [36] are introduced, so that the receptive field size is adaptively adjusted to capture targets of different sizes and the network can focus on learning the discriminative features. A target edge ring is then constructed based on the first-stage segmentation, and the segmentation accuracy is further improved by pixel classification within this boundary ring. The proposed method is evaluated and verified on a small-sample cardiac MRI data set containing seven tissues.

Specifically, the contributions of the proposed method are listed as follows.

In view of feature enhancement, aiming at the characteristics of low discrimination and small differences in some organs in cardiac MRI, this paper proposes an image enhancement module based on Log-Gabor filter and channel attention mechanism, which combines multi-scale and multi-direction Log-Gabor filtering with attention mechanism, and reasonably learns the attention to the filtering results for adaptively enhancing the edge information of different organs and tissues, and strengthening the distinction among them.
In view of feature learning, considering the large size span of the target area in cardiac MRI, this paper proposes an encoding-decoding (codec) CNN network with dense skip connections and an adaptive kernel size attention mechanism. It uses multiple convolution kernels to extract features of targets at different scales in the down-sampling stage, then aggregates information from the different convolution kernel branches. Considering that CNN features are not equally important in segmentation tasks, this paper also introduces spatial and channel attention mechanisms to focus on learning discriminative features for segmentation, and uses deep supervision mechanisms to accelerate network convergence.
In view of hard region segmentation, aiming at the fact that the target edge information will inevitably be lost in the encoding and decoding process and result in insufficient segmentation of the edges, this paper proposes a fine optimization method for the target edge region by incorporating metric learning which classifies the edge ring of the target region pixel by pixel to further improve the edge segmentation accuracy.

The remainder of this paper is organized as follows. The “Proposed Method” section describes the details of the proposed framework, which includes the first segmentation stage, the design of the CNN architecture, and the second segmentation stage. The “Experimental Results and Discussion” section presents the data set, experimental platform, evaluation metrics, parameter settings, and the experimental results and discussion with comparison to state-of-the-art methods. Finally, we conclude this work and its future research potential in the “Conclusion and Future Works” section.

Proposed Method

Overview

The proposed overall framework of multi-target segmentation in cardiac MRI is shown in Fig. 1, and consists of a two-stage coarse-to-fine segmentation. In the first stage, the Log-Gabor filterbank is used to extract the rich details and edge contour features in the image, and the original image is selectively enhanced with the help of a learned channel attention mechanism. At the same time, in the codec CNN network based on dense skip connections, the spatial and channel attention mechanism and the adaptive kernel size attention mechanism are jointly introduced to improve the network performance. In the second stage, the first-stage segmented area mask is dilated and eroded to form the edge zone area, and metric classification is used to classify the zone pixel by pixel, which further optimizes the edge segmentation result and improves the segmentation accuracy. The following subsections present the details of the two-stage segmentation.

Coarse Segmentation Stage

Due to the tight arrangement of tissues and organs in cardiac MRI, the gray-scale distribution is complicated, and some organs are characterized by low discrimination. This paper considers the enhancement of the original image to make the outline of the target tissues and organs in the image clearer, and the detailed texture information to be richer, and to strengthen the discrimination of different tissues and organs. First, the Log-Gabor filter is used to extract the rich low-frequency detail texture information and high-frequency edge contour information in the image in multi-scale and multi-direction, and the extracted features are superimposed back to the original image to enhance the image, but it is difficult to obtain discriminatory features by direct superposition. In this paper, a channel attention mechanism is introduced to focus on some important discriminative features and ignore some unimportant features, so as to reasonably allocate the attention of the network to features of different scales and directions, and selectively enhance the original image. At the same time, considering the different sizes of the various tissues and organs in the image and the single size of the convolution kernel in the network, the receptive field is remarkably limited and it is difficult to extract multi-scale information of different targets. Based on the above considerations, this paper introduces spatial and channel attention mechanisms and adaptive kernel size attention mechanisms into the network to achieve end-to-end joint training and adaptive attention fusion of convolution features.

Image Enhancement Module

The image enhancement module in this paper mainly includes Log-Gabor filtering and channel attention mechanism fusion: Log-Gabor filtering selectively enhances the original image with texture and edge information in multiple directions and at multiple scales, and channel attention mechanism fusion further enhances the filtered image.

The Log-Gabor filter is a traditional filter with no direct current component, unlimited bandwidth, wide frequency coverage, and a good representation of the local space-frequency characteristics of the image. Compared with a CNN, it needs no training, is fast to compute, and has very few parameters. Through frequency and directional adjustment, it can extract multi-level texture and edges in a cardiac image, and overlaying this information back onto the image can effectively alleviate the weak texture and unclear edges of the original image. The Log-Gabor filter is a special filter whose transfer function is a Gaussian function on a logarithmic frequency scale. It can be expressed as the product of a radial component and an angular component, as shown in the following formula (1):

\begin{matrix} L (f, θ) & = L (f) L (θ) \\ & = \exp \left\{- \frac{[\ln (f / f_{0})]^{2}}{2 [\ln (σ_{f} / f_{0})]^{2}}\right\} \exp \left[- \frac{(θ - θ_{0})^{2}}{2 σ_{θ}^{2}}\right] \end{matrix}

In the above formula, $f_{0}$ is the center frequency; $θ_{0}$ is the center direction angle of the filter; $σ_{f}$ determines the radial bandwidth $B_{f} = 2 \sqrt{2 / \ln 2} \, |\ln (σ_{f} / f_{0})|$ of the filter, and $σ_{θ}$ determines the angular bandwidth $B_{θ} = 2 \sqrt{2 \ln 2} \, σ_{θ}$ . For a specific pair of parameters $(f_{i}, θ_{i})$ , the Log-Gabor filter can be denoted as $L_{f_{i}, θ_{i}} (f, θ)$ . The image to be filtered is written as $I (f, θ)$ to emphasize the frequency and direction information, and the Log-Gabor filtering result $H_{f_{i}, θ_{i}} (m, n)$ can be expressed as

\begin{matrix} H_{f_{i}, θ_{i}} (m, n) = F^{- 1} \{F [I (f, θ)] \, L_{f_{i}, θ_{i}} (f, θ)\} . \end{matrix}

In the above formula, F and $F^{- 1}$ are direct and inverse two-dimensional Fourier transform, respectively. (m,n) is the spatial location of pixels.
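As an illustration of formulas (1) and the filtering step, the following NumPy sketch builds the Log-Gabor transfer function on an FFT frequency grid and applies it by multiplication in the frequency domain. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the use of `sigma_ratio` for the ratio $σ_{f} / f_{0}$, the wrapped angular distance, and returning the magnitude of the complex response are all choices made here for illustration.

```python
import numpy as np


def log_gabor_filter(shape, f0, theta0, sigma_ratio=0.6, sigma_theta=1.5):
    """Transfer function L(f, theta) = L(f) * L(theta) on an FFT grid.

    sigma_ratio stands in for sigma_f / f0 in the radial term (assumption).
    """
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]  # vertical frequency, cycles/pixel
    fx = np.fft.fftfreq(cols)[None, :]  # horizontal frequency, cycles/pixel
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1.0                  # placeholder to avoid log(0) at DC
    angle = np.arctan2(fy, fx)

    # Radial component: Gaussian on a logarithmic frequency axis.
    radial = np.exp(-(np.log(radius / f0) ** 2)
                    / (2.0 * np.log(sigma_ratio) ** 2))
    radial[0, 0] = 0.0                  # no direct current component

    # Angular component: Gaussian in the wrapped distance to theta0.
    dtheta = np.arctan2(np.sin(angle - theta0), np.cos(angle - theta0))
    angular = np.exp(-(dtheta ** 2) / (2.0 * sigma_theta ** 2))
    return radial * angular


def log_gabor_response(image, f0, theta0, **kwargs):
    """H(m, n) = F^{-1}{ F[I] * L }; the magnitude is returned (assumption)."""
    transfer = log_gabor_filter(image.shape, f0, theta0, **kwargs)
    return np.abs(np.fft.ifft2(np.fft.fft2(image) * transfer))
```

Because the transfer function is zero at the origin of the frequency plane, a constant image produces a zero response, which matches the "no direct current component" property described above.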

A Log-Gabor filterbank can be generated by varying the parameters over a series of frequencies and directions. To enhance the texture and edges along different directions as much as possible, it is necessary to improve the resolution of the directional filter. In other words, a finer sampling of the direction $θ$ should be used, because it corresponds to a smaller angular bandwidth $B_{θ}$ . On the other hand, the frequency f should cover the frequency domain as much as possible, since there are complex frequency components in cardiac MRI slices. In the frequency filter, the relationship between scale and frequency can be defined as $f_{0} = 1 / [λ_{0} s]$ , where $λ_{0}$ is the wavelength of the smallest scale filter and s is a scale coefficient. Considering this, it is desirable to incorporate more scales to capture the major frequency components. However, a huge number of directions and scales brings a heavy computational burden, and can even cause mixing in a digital implementation. So in practice we generate a bank of parameter pairs as $θ_{t} = π t / T (t = 0, 1, \dots, T - 1)$ and $f_{r} = 1 / [λ_{0} s_{0}^{r - 1}] (r = 1, 2, \dots, R)$ , where $s_{0}$ is the basic scale factor and r is the index of the current scale, and then select the optimal number of directions and scales from the sets $T \in \{4, 8\}$ and $R \in \{1, 2, 3, 4\}$ by experiments. To keep the resolutions of direction and frequency as well as the shape of the Log-Gabor filter, the directional resolution $σ_{θ}$ is set to 1.5, the frequency resolution $σ_{f}$ is 0.6, the minimum wavelength $λ_{0}$ is 3, and the basic scale factor $s_{0}$ is 2, giving a bandwidth of nearly two octaves.
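The parameter bank and the two bandwidth formulas can be written out directly. The sketch below is only illustrative: the function names are hypothetical, and `sigma_ratio` is assumed to denote the ratio $σ_{f} / f_{0}$, so that the value 0.6 is plugged into $|\ln (σ_{f} / f_{0})|$ as a ratio.

```python
import numpy as np


def log_gabor_bank(T=8, R=3, lambda0=3.0, s0=2.0):
    """Return the directions theta_t and center frequencies f_r of the bank."""
    thetas = [np.pi * t / T for t in range(T)]               # t = 0 .. T-1
    freqs = [1.0 / (lambda0 * s0 ** (r - 1)) for r in range(1, R + 1)]
    return thetas, freqs


def bandwidths(sigma_ratio, sigma_theta):
    """Radial bandwidth B_f and angular bandwidth B_theta."""
    b_f = 2.0 * np.sqrt(2.0 / np.log(2.0)) * abs(np.log(sigma_ratio))
    b_theta = 2.0 * np.sqrt(2.0 * np.log(2.0)) * sigma_theta
    return b_f, b_theta


thetas, freqs = log_gabor_bank(T=8, R=3)
b_f, b_theta = bandwidths(sigma_ratio=0.6, sigma_theta=1.5)
```

With $λ_{0} = 3$ and $s_{0} = 2$, the three center frequencies are 1/3, 1/6 and 1/12 cycles per pixel, and under the ratio assumption the radial bandwidth evaluates to roughly 1.74 octaves.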

With the transformation of scale, the filter can capture the low or middle frequency texture information of the original image and the high-frequency edge information. Since any vector can be decomposed into two vector directions that are not parallel to itself, only the information obtained by filtering in a certain direction is sufficient to characterize the original image. In our module, the maximum value of the filtering results in multiple directions is taken pixel by pixel to obtain the filtering result under one scale, as shown in the following formula:

\begin{matrix} H_{r} (m, n) = max \{{H_{r, t} (m, n)}_{t = 0}^{T - 1}\} ; r = 1, 2, \dots, R . \end{matrix}
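The pixel-wise maximum over directions and the direct superposition onto the original image amount to two array operations. The following sketch assumes the directional responses have already been computed and stacked into an array of shape (R, T, H, W); the function names and the array layout are assumptions made for illustration.

```python
import numpy as np


def max_over_directions(responses):
    """Collapse responses of shape (R, T, H, W) to (R, H, W).

    For each scale r, the maximum over the T directions is taken
    independently at every pixel (m, n).
    """
    return np.max(np.asarray(responses), axis=1)


def superimpose(image, scale_response):
    """Direct superposition of one scale's filtering result on the image."""
    return np.asarray(image) + np.asarray(scale_response)


# Toy example: R = 2 scales, T = 3 directions, 4 x 4 pixels.
responses = np.zeros((2, 3, 4, 4))
responses[0, 1] = 2.0          # direction 1 dominates everywhere at scale 0
responses[1, 2, 0, 0] = 5.0    # direction 2 dominates at one pixel, scale 1
per_scale = max_over_directions(responses)
enhanced = superimpose(np.ones((4, 4)), per_scale[0])
```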

In this way, a series of filtered images is obtained by emphasizing the components in different directions. Figure 2 shows an example of filtered images and superimposed results obtained under different T and R.

Fig. 2 — Filtering results under different scale, directions and direct superpositions

As can be seen from the above figure, when the filtering results of various scales are directly superimposed with the original image, the image presents different degrees of enhancement. If the scale is too small, the detailed texture information of each tissue and organ can be obtained, while a proper scale can emphasize the texture and edge information of each tissue and organ. At the same time, when $T = 8$ , the information obtained is better than when $T = 4$ . The contrast is stronger, which can better describe the texture and edge information of various tissues and organs in cardiac MRI. The filtering results in each direction are maximized pixel by pixel, which is beneficial to the aggregation of discriminative features. With such a comprehensive consideration, this paper selects filters with $R = 3$ and $T = 8$ to enhance the image, as shown in the penultimate enhanced image with red mark in Fig. 2.

When the filtering results of proper scale are superimposed, the edge and contour information of the enhanced image is more obvious. To make the hierarchical features of texture and edge information more obvious, the channel attention mechanism is then used to obtain the attention spectrum of each direction in the third scale filtering results, which is multiplied by the original filtering results pixel by pixel, and finally superimposed on the original image to obtain the enhanced image.

Network Structure

This paper adopts the U-Net++ network structure based on dense skip connections to perform coarse segmentation. The dense skip connections are suitable for capturing feature information at different levels, at the cost of redundancy in the learned features, so the network can hardly distinguish the importance of individual features and some strongly discriminative features do not receive enough attention. In addition, the neurons in each layer of the network share the same receptive field size, which is not conducive to extracting features of targets of different sizes. To handle this, this paper introduces two attention mechanisms as follows. In the down-sampling stage, the kernel attention mechanism uses different kernel sizes to extract information at different scales for adaptive feature fusion. In the skip connection stage, spatial and channel attention mechanisms are introduced to learn discriminative features and improve the ability of the network. More details about the two attention mechanisms are described below.

In the feature fusion stage, the spatial and channel attention mechanism is introduced to make the network adaptively focus on features with strong discriminative ability, while weakening less important features. As shown in Fig. 3, in the channel attention mechanism, the input feature maps are subjected to global maximum pooling and global average pooling respectively, and each pooled vector then goes through a shared multilayer perceptron (MLP). The features output by the MLP are summed element by element and passed through a sigmoid activation to generate the final channel attention feature. Finally, the input feature and the channel attention feature are multiplied element by element to generate the output feature map.

Fig. 3 — Channel and spatial feature adaptive attention mechanism

The feature generation of channel attention mechanism is given in the following formula:

\begin{matrix} M_{c} (F) & = σ (MLP (AvgPool (F)) + MLP (MaxPool (F))) \\ & = σ (W_{1} (W_{0} (F_{avg}^{c})) + W_{1} (W_{0} (F_{\max}^{c}))), \\ & W_{0} \in R^{C / β \times C} ; W_{1} \in R^{C \times C / β} . \end{matrix}

In the above formula, $σ$ is the sigmoid operation, C is the number of channels, $β$ is the reduction rate, and the ReLU activation function is required after $W_{0}$ .
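The channel attention formula can be traced with a few lines of NumPy. This is a sketch for a single feature map of shape (C, H, W) with the weight matrices passed in explicitly; in a real network $W_{0}$ and $W_{1}$ are learned, biases may be present, and the function names below are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def shared_mlp(v, W0, W1):
    """Two-layer MLP with ReLU after W0, shared by both pooled vectors."""
    return W1 @ np.maximum(W0 @ v, 0.0)


def channel_attention(F, W0, W1):
    """M_c(F) for F of shape (C, H, W); returns one weight per channel."""
    f_avg = F.mean(axis=(1, 2))    # global average pooling -> (C,)
    f_max = F.max(axis=(1, 2))     # global maximum pooling -> (C,)
    return sigmoid(shared_mlp(f_avg, W0, W1) + shared_mlp(f_max, W0, W1))


def apply_channel_attention(F, W0, W1):
    """Scale every channel of F by its attention weight."""
    return F * channel_attention(F, W0, W1)[:, None, None]


rng = np.random.default_rng(0)
C, beta = 8, 2                                 # beta is the reduction rate
F = rng.standard_normal((C, 5, 5))
W0 = rng.standard_normal((C // beta, C))       # shape (C / beta, C)
W1 = rng.standard_normal((C, C // beta))       # shape (C, C / beta)
weights = channel_attention(F, W0, W1)
refined = apply_channel_attention(F, W0, W1)
```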

In the spatial attention mechanism, the maximum pooling and average pooling of the features along the channel axis are concatenated, a two-dimensional convolution followed by a sigmoid is applied to generate a spatial attention map, and the input feature is multiplied by this map point by point in every channel to generate the output feature map. The feature generation of the spatial attention mechanism is given in the following equation:

\begin{matrix} M_{s} (F) & = σ (f^{7 \times 7} ([AvgPool (F) ; MaxPool (F)])) \\ & = σ (f^{7 \times 7} ([F_{avg}^{s} ; F_{\max}^{s}])) . \end{matrix}

In the above formula, $σ$ is the sigmoid operation. $f^{7 \times 7}$ is a convolution with a $7 \times 7$ kernel size.
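A NumPy sketch of the spatial attention map follows. It assumes zero padding so that the map keeps the input resolution, no bias term, and the cross-correlation form of convolution used by deep learning frameworks; the kernel is passed in explicitly, whereas in the network it is learned. The function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def spatial_attention(F, kernel):
    """M_s(F) for F of shape (C, H, W) and a kernel of shape (2, k, k)."""
    # Pool along the channel axis and stack: [F_avg^s ; F_max^s] -> (2, H, W).
    pooled = np.stack([F.mean(axis=0), F.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2                                   # "same" padding, k odd
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    # windows[c, h, w] is the k x k neighbourhood around pixel (h, w).
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))
    logits = np.einsum("chwij,cij->hw", windows, kernel)
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
F = rng.standard_normal((4, 10, 12))
kernel = rng.standard_normal((2, 7, 7)) * 0.1
attention = spatial_attention(F, kernel)
refined = F * attention[None, :, :]                # same map for each channel
```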

In the down-sampling stage, to adapt to the large difference in the size and shape of organs and tissues in cardiac MRI, this paper uses a variety of different receptive field convolution kernels for adaptive feature fusion, and introduces the kernel size adaptive fusion attention mechanism as shown in Fig. 4.

Fig. 4 — Core size feature adaptive fusion attention mechanism

The non-linear method is used to aggregate information from multiple convolution kernels by allowing each neuron to adaptively adjust the size of its receptive field according to the scale of the input information. The basic idea behind it is to use a gate mechanism to control the flow of information into different branches of the subsequent convolutional layer. The gate needs to fuse the information of the two branches, and first the pixel-level addition and fusion is performed as follows,

\begin{matrix} F = F_{1} + F_{2}, \end{matrix}

where $F_{1} \in R^{H \times W \times C}$ and $F_{2} \in R^{H \times W \times C}$ are the feature images extracted by two different convolution kernels.

Then global average pooling is used to encode global information and generate the channel-wise statistics $s \in R^{C}$ . The c-th element of s is obtained by shrinking $F_{c}$ over its spatial dimensions as follows,

\begin{matrix} s_{c} = AvgPool (F_{c}) = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} F_{c} (i, j), \end{matrix}

where H and W denote the height and width of the feature image. Then a compact feature $z \in R^{d \times 1}$ is generated through a fully connected layer to guide precise and adaptive selection, and dimensionality reduction is performed at this step, i.e.,

\begin{matrix} z = T_{fc} (s) = R (B (W s)) . \end{matrix}

In the above formula, R is the ReLU activation function, B is batch normalization, and $W \in R^{d \times C}$ . To study the role of d in W, a reduction rate $β$ over the C channels is introduced as follows,

\begin{matrix} d = max (C / β, L), \end{matrix}

where L represents the minimum value of d, and L is set to 32 in our experiments.

The soft attention mechanism between channels can select information at different sizes, guided by the compact feature information z, and a channel-wise softmax operation is performed, i.e.,

\begin{matrix} a_{c} = \frac{e^{A_{c} z}}{e^{A_{c} z} + e^{B_{c} z}}, b_{c} = \frac{e^{B_{c} z}}{e^{A_{c} z} + e^{B_{c} z}} . \end{matrix}

In the above formula, $A, B \in R^{C \times d}$ . a and b represent the soft attention vectors of $F_{1}$ and $F_{2}$ , respectively. $A_{c} \in R^{1 \times d}$ is the c-th row of A, $a_{c}$ is the c-th element of a, $B_{c} \in R^{1 \times d}$ is the c-th row of B, and $b_{c}$ is the c-th element of b. With only two branches in the module, the matrix B is redundant, because $a_{c} + b_{c} = 1$ . The channel feature obtained from the attention weights of the different kernels is as follows,

\begin{matrix} V_{c} = a_{c} \cdot F_{1 c} + (1 - a_{c}) \cdot F_{2 c} ; c = 1, 2, \dots, C . \end{matrix}

In the above formula, $F_{1 c}$ and $F_{2 c}$ are the c-th channels of $F_{1}$ and $F_{2}$ , respectively. Finally, the output feature is formulated as $V = [V_{1}, V_{2}, \dots, V_{C}], V_{c} \in R^{H \times W}$ .
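The fusion, squeeze, selection and weighting steps above can be put together in one function. The sketch below works on single feature maps of shape (C, H, W) and takes the matrices W, A and B as arguments; batch normalization is omitted for brevity, which is a simplification made here, and the function names are assumptions rather than the authors' code.

```python
import numpy as np


def reduced_dim(C, beta, L=32):
    """d = max(C / beta, L), using integer division for the channel count."""
    return max(C // beta, L)


def selective_kernel_fusion(F1, F2, W, A, B):
    """Fuse two branches F1, F2 of shape (C, H, W).

    W has shape (d, C); A and B have shape (C, d).
    Returns the fused feature V and the attention vector a (b = 1 - a).
    """
    F = F1 + F2                                  # element-wise fusion
    s = F.mean(axis=(1, 2))                      # global average pooling, (C,)
    z = np.maximum(W @ s, 0.0)                   # ReLU(W s); batch norm omitted
    logit_a, logit_b = A @ z, B @ z              # one logit per channel
    shift = np.maximum(logit_a, logit_b)         # for numerical stability
    exp_a, exp_b = np.exp(logit_a - shift), np.exp(logit_b - shift)
    a = exp_a / (exp_a + exp_b)                  # softmax across two branches
    V = a[:, None, None] * F1 + (1.0 - a)[:, None, None] * F2
    return V, a


rng = np.random.default_rng(0)
C, d = 4, 3
F1, F2 = rng.random((C, 6, 6)), rng.random((C, 6, 6))
W = rng.standard_normal((d, C))
A, B = rng.standard_normal((C, d)), rng.standard_normal((C, d))
V, a = selective_kernel_fusion(F1, F2, W, A, B)
```

When the two branches receive identical logits, for example when A equals B, every $a_{c}$ is 0.5 and V reduces to the plain average of the two branches.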

Fine Segmentation Stage

Due to the high similarity of some tissues and organs in cardiac MRI, and the low discrimination of some organs from the background, it is easy to cause mis-classification or adhesion when segmenting the edge of cardiac tissues. On the other hand, during the up-sampling and down-sampling process, the edge information of the target in the cardiac MRI will inevitably be lost, resulting in insufficient edge segmentation results, so it is necessary to perform a fine treatment for improving the segmentation accuracy and adherence. The overall block diagram of the proposed fine segmentation stage is shown in Fig. 5, where a distance metric learning approach is introduced to distinguish the pixels on the boundary. The details at this stage are described as follows.

Fig. 5 — Block diagram of the fine segmentation stage

Due to varying tissue and the partial volume effect, the mis-classified pixels are usually located near the adjacent regions among different tissues. A distance metric learning strategy can be incorporated to enlarge the difference in feature representation among these samples. ArcFace [37] is a representative metric loss function which has been proven to give good results in face detection and recognition tasks. It adds an angular margin to the target angle to enhance the compactness within each cluster and the difference between clusters. In the metric classification task, it can effectively improve the discriminative power of the embedded feature vectors. Furthermore, it does not require joint supervision with other loss functions, making it easy to converge on any training data set. The design of the metric loss function does not depend on the image category. In this research, the target edge fine segmentation is a two-class classification task in the target edge ring zone that determines whether the central pixel of an image patch belongs to the target area. It does not need to distinguish between broad categories (for example, whether the myocardium and the ventricle are of the same type), which reduces the difficulty of classification to a certain extent. In this paper, ResNet50 [38] is used as the backbone for CNN feature extraction, and the loss function is defined as follows,

\begin{matrix} L_{lmc} = - \frac{1}{N} \sum_{i = 1}^{N} \log \frac{e^{s \cos (θ_{y_{i}} + m)}}{e^{s \cos (θ_{y_{i}} + m)} + \sum_{j = 1, j \neq y_{i}}^{n} e^{s \cos θ_{j}}} \end{matrix}

\mathrm{s.t.} \quad W_{j} = \frac{W_{j}}{∥W_{j}∥} ; \quad x_{i} = \frac{x_{i}}{∥x_{i}∥} ; \quad \cos θ_{j} = W_{j}^{T} x_{i} .

In the above formula, N is the number of training samples, n is the number of classes, $x_{i}$ is the feature vector of the i-th sample, which belongs to class $y_{i}$ , $W_{j}$ is the weight vector of the j-th class, and $θ_{j}$ is the angle between $W_{j}$ and $x_{i}$ . m is an additive angular margin that separates $θ_{y_{i}}$ from $θ_{j}$ , and s is a scaling factor that increases the radius of the hypersphere. m should be chosen as a small value since it only aims to put a hard margin between two samples, while s is intended to amplify the differences between cosine values. In our experiments, several sample pairs close to the target boundaries were chosen and their cosine values were calculated and compared, from which a plausible range for s was determined. Then, according to a grid search over $s \in \{16, 32, 64, 128\}$ and $m \in \{0.5, 1, 1.5, 2\}$ that maximized the average pixel accuracy, s was empirically set to 64 and m to 1.
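The loss above can be sketched in a few lines of NumPy. This is only an illustrative forward pass under stated assumptions, not the authors' training code: the function name `arcface_loss` and its argument layout are hypothetical, and the defaults simply mirror the values of s and m reported in the text.

```python
import numpy as np

def arcface_loss(features, weights, labels, s=64.0, m=1.0):
    """Additive angular margin loss for one batch (forward pass only).

    features: (N, d) embeddings x_i
    weights:  (n_classes, d) class weight vectors W_j
    labels:   (N,) integer class indices y_i
    """
    # L2-normalise embeddings and class weights so that W_j^T x_i = cos(theta_j).
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(x @ w.T, -1.0, 1.0)            # shape (N, n_classes)
    theta = np.arccos(cos)

    # Add the margin m only to the angle of the ground-truth class.
    rows = np.arange(len(labels))
    logits = s * cos
    logits[rows, labels] = s * np.cos(theta[rows, labels] + m)

    # Numerically stable log-softmax, then the mean negative log-likelihood.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

With m = 0 the function reduces to an ordinary softmax cross-entropy on scaled cosine logits, which is a convenient sanity check when experimenting with the margin.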

In the training stage, morphological erosion and dilation were used to construct positive and negative samples at the edges of each segmented region, with an $n \times n$ all-ones structuring element (n is determined by subsequent experiments in the “Parameter Settings” section). At the edge of the mask, an annular zone D is formed by eroding inward and dilating outward. With the label area denoted $D_{L}$ , the positive sample area is $D_{p} = D \cap D_{L}$ and the negative sample area is $D_{n} = D - D_{p}$ . In both areas, an image patch of size $p \times p$ (p is determined by subsequent experiments in the “Parameter Settings” section) centered on each pixel was extracted to represent the feature of that central pixel and was sent to the network for training.
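The sample construction described above can be illustrated with a small NumPy sketch. It is a simplified reading of the text rather than the authors' implementation: the helper names (`dilate`, `erode`, `ring_zone_samples`, `extract_patch`) are hypothetical, the structuring element is assumed to be square with odd side length, and reflect padding at the image border is an assumption.

```python
import numpy as np

def _shifted_stack(mask, n):
    """Stack the n*n shifted copies of mask seen through an n x n window."""
    r = n // 2
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(n) for dx in range(n)])

def dilate(mask, n):
    # A pixel is set if ANY pixel under the all-ones n x n element is set.
    return _shifted_stack(mask, n).any(axis=0)

def erode(mask, n):
    # A pixel survives only if ALL pixels under the element are set.
    return _shifted_stack(mask, n).all(axis=0)

def ring_zone_samples(mask, label_mask, n=5):
    """Return the ring zone D and its positive/negative sample areas."""
    mask = np.asarray(mask, dtype=bool)
    label_mask = np.asarray(label_mask, dtype=bool)
    ring = dilate(mask, n) & ~erode(mask, n)   # D
    positives = ring & label_mask              # D_p = D ∩ D_L
    negatives = ring & ~positives              # D_n = D - D_p
    return ring, positives, negatives

def extract_patch(image, row, col, p=17):
    """Cut the p x p patch centred on (row, col), reflecting at the border."""
    r = p // 2
    padded = np.pad(image, r, mode="reflect")
    return padded[row:row + p, col:col + p]
```

For example, `np.argwhere(positives)` gives the centre coordinates from which positive patches would be cut with `extract_patch`.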

In the test stage, a ring zone was constructed at the edge of each segmented target area by the same erosion and dilation operations. The image patches extracted at the innermost edge of the ring zone formed the atlas library of that target. Image patches were also extracted throughout the ring zone and sent to the network to obtain their embedded vectors, and the cosine similarity between each of these vectors and each embedded vector in the atlas library was calculated one by one. Given an appropriate threshold $τ$ (the $τ$ value was determined by subsequent experiments), if the cosine similarities between the embedded vector of an image patch and those of most samples in the atlas library exceeded $τ$ , the central pixel was judged to belong to the target area; otherwise it was judged to be non-target. Where two ring zones overlap, the same image patch may be assigned to two different categories. The strategy adopted in this paper was to assign the central pixel to the category with the larger number of matches among the atlas library samples. Finally, to remove a small amount of noise, a pixel in the ring zone whose surrounding pixels all belong to the same type is set to that type.
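The matching rule of the test stage can be sketched as follows. This is a hedged illustration, not the published code: the function names are hypothetical, "most samples" is interpreted here as a strict majority of the atlas library, the label -1 is an arbitrary marker for "non-target", and the final neighbourhood-based noise removal is omitted.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def count_matches(ring_embeddings, atlas_embeddings, tau=0.68):
    """For each ring-zone patch, count atlas samples with similarity > tau."""
    sims = cosine_similarity_matrix(ring_embeddings, atlas_embeddings)
    return (sims > tau).sum(axis=1)

def classify_ring_pixels(ring_embeddings, atlases, tau=0.68):
    """atlases maps class id -> (M_c, d) array of atlas embeddings.

    A patch is accepted by a class when it matches a majority of that
    class's atlas samples; ties between overlapping ring zones go to the
    class with the larger number of matches. Returns -1 for non-target.
    """
    class_ids = np.array(list(atlases))
    counts = np.stack([count_matches(ring_embeddings, atlases[c], tau)
                       for c in class_ids], axis=1)        # (P, C)
    sizes = np.array([len(atlases[c]) for c in class_ids])
    accepted = counts > sizes / 2
    best = np.where(accepted, counts, -1).argmax(axis=1)
    return np.where(accepted.any(axis=1), class_ids[best], -1)
```

The default `tau=0.68` follows the threshold reported later in the parameter settings; any other value can be passed in.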

Experimental Results and Discussion

Dataset and Experimental Platform

Because cardiac MRI images can only be acquired with professional medical equipment, cardiac MRI data sets are usually small. The experiments were conducted on a small data set of 42 cardiac MR volumes acquired in a real clinical environment. The MRI scanners include Siemens (Avanto 1.5T, Espree 1.5T, Symphony 1.5T), Philips (Achieva 1.5T, 3.0T, Intera 1.5T) and GE (Signa 1.5T). Because of the different scanners, the image sequences differ considerably: the image size ranges from 160 $\times$ 288 to 512 $\times$ 512, and each volume is composed of 256 $\sim$ 512 slices along the z axis. Seven whole heart segmentation labels were manually annotated by medical experts, namely left ventricular myocardium, left ventricle, right ventricle, left atrium, right atrium, aorta, and pulmonary artery. Due to imperfections of the magnetic resonance equipment and the properties of the imaged object itself, MRI inevitably shows a certain degree of intensity inhomogeneity. In our experiments, this bias was corrected with the standard N4 algorithm in the preprocessing stage.

In the experiments, 34 subjects were randomly selected, yielding 4000 cardiac MR images sliced along the z axis, and these images were randomly divided into training, validation, and test sets at a ratio of 7:1:2. To further verify the generalization ability of the model, the remaining 8 whole volumes in the data set were tested, and their segmentation results were reconstructed into three-dimensional whole hearts for evaluation.
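As a small illustration of the 7:1:2 allocation, the sketch below shuffles slice indices and cuts them into three parts. It assumes the split is made at the slice level, which the text suggests but does not spell out; the function name `split_indices` and the seed are hypothetical.

```python
import numpy as np

def split_indices(num_items, ratios=(7, 1, 2), seed=0):
    """Randomly split range(num_items) into train/val/test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_items)
    total = sum(ratios)
    n_train = num_items * ratios[0] // total
    n_val = num_items * ratios[1] // total
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]      # remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(4000)
print(len(train_idx), len(val_idx), len(test_idx))  # 2800 400 800
```

For 4000 slices this gives 2800 training images, which matches the training set size quoted in the Discussion.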

The experimental platform was a 64-bit Linux PC with a 2.0 GHz Intel CPU, 48 GB RAM, and NVIDIA RTX 2080Ti GPU acceleration, and the programming environment included Anaconda 5.0.1 (Python 3.6), TensorFlow 1.4, PyTorch 0.4.1, and MXNet 1.5.

Evaluation Metrics

The experiments were evaluated on two-dimensional cardiac MR slice images and on the three-dimensional whole heart, respectively. For the two-dimensional segmentation results, four indicators commonly used in semantic segmentation were adopted: pixel accuracy (PA), mean pixel accuracy (MPA), mean intersection over union (MIoU), and frequency-weighted intersection over union (FwIoU). Assume that there are $k + 1$ categories (k target categories plus the background). $n_{ij}$ denotes the total number of pixels of category i that are predicted as category j, $n_{ii}$ the total number of pixels of category i that are predicted as category i, and $n_{ji}$ the total number of pixels of category j that are predicted as category i. The indicators are calculated as follows.

\begin{matrix} \mathrm{PA} = \frac{\sum_{i = 0}^{k} n_{ii}}{\sum_{i = 0}^{k} \sum_{j = 0}^{k} n_{ij}} \end{matrix}

\begin{matrix} \mathrm{MPA} = \frac{1}{k + 1} \sum_{i = 0}^{k} \frac{n_{ii}}{\sum_{j = 0}^{k} n_{ij}} \end{matrix}

\begin{matrix} \mathrm{MIoU} = \frac{1}{k + 1} \sum_{i = 0}^{k} \frac{n_{ii}}{\sum_{j = 0}^{k} n_{ij} + \sum_{j = 0}^{k} n_{ji} - n_{ii}} \end{matrix}

\begin{matrix} \mathrm{FwIoU} = \frac{1}{\sum_{i = 0}^{k} \sum_{j = 0}^{k} n_{ij}} \sum_{i = 0}^{k} \frac{n_{ii} \sum_{j = 0}^{k} n_{ij}}{\sum_{j = 0}^{k} n_{ij} + \sum_{j = 0}^{k} n_{ji} - n_{ii}} \end{matrix}
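The four indicators can all be read off a confusion matrix whose entry (i, j) is $n_{ij}$. The following NumPy sketch shows one way to do so; the function names are hypothetical, and it assumes every class occurs at least once in the ground truth so that no denominator is zero.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    flat = num_classes * y_true + y_pred
    return np.bincount(flat, minlength=num_classes ** 2).reshape(
        num_classes, num_classes)

def segmentation_metrics(conf):
    """Compute PA, MPA, MIoU and FwIoU from a (k+1) x (k+1) confusion matrix."""
    conf = np.asarray(conf, dtype=float)
    diag = np.diag(conf)             # n_ii
    row = conf.sum(axis=1)           # sum_j n_ij: pixels whose true class is i
    col = conf.sum(axis=0)           # sum_j n_ji: pixels predicted as class i
    total = conf.sum()
    iou = diag / (row + col - diag)  # per-class intersection over union
    return {
        "PA": diag.sum() / total,
        "MPA": np.mean(diag / row),
        "MIoU": np.mean(iou),
        "FwIoU": (row * iou).sum() / total,
    }
```

A typical use is `segmentation_metrics(confusion_matrix(label, prediction, 8))` for the seven cardiac classes plus background.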

In the evaluation of the three-dimensional whole heart segmentation results, two indicators commonly used in medical image segmentation, namely the Dice Index and Hausdorff $_{95}$ , are adopted. Let $V_{seg}$ denote the three-dimensional segmentation result and $V_{gt}$ the corresponding label. The Dice Index is defined as

\begin{matrix} \mathrm{Dice\ Index} = \frac{2 |V_{seg} \cap V_{gt}|}{|V_{seg}| + |V_{gt}|} . \end{matrix}

On the other hand, the calculation formula of Hausdorff distance is defined as

\begin{matrix} D_{H} (V_{seg}, V_{gt}) & = \max \{ d_{V_{seg} V_{gt}}, d_{V_{gt} V_{seg}} \} \\ & = \max \{ \max_{x \in V_{seg}} \min_{y \in V_{gt}} d (x, y), \max_{y \in V_{gt}} \min_{x \in V_{seg}} d (x, y) \} . \end{matrix}

In the above formula, d represents the distance, and x and y correspond to voxels in the segmentation result and the ground-truth mesh, respectively.
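Both definitions translate directly into NumPy. The sketch below is for illustration only: the function names are hypothetical, Euclidean distance is assumed for d, and the brute-force pairwise distance matrix is practical only for small point sets (real volumes would need a distance transform or a spatial index).

```python
import numpy as np

def dice_index(seg, gt):
    """Dice overlap between two binary volumes of the same shape."""
    seg = np.asarray(seg, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    intersection = np.logical_and(seg, gt).sum()
    return 2.0 * intersection / (seg.sum() + gt.sum())

def hausdorff_distance(points_seg, points_gt):
    """Symmetric Hausdorff distance between two point sets of shape (N, dim)."""
    a = np.asarray(points_seg, dtype=float)
    b = np.asarray(points_gt, dtype=float)
    # d[i, j] = Euclidean distance between a[i] and b[j].
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    seg_to_gt = d.min(axis=1).max()   # max over x of min over y
    gt_to_seg = d.min(axis=0).max()   # max over y of min over x
    return max(seg_to_gt, gt_to_seg)
```

The point sets can be obtained from binary volumes with `np.argwhere`, optionally scaled by the voxel spacing to express the distance in millimetres.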

It should be noted that the Dice Index is more sensitive to the interior filling of the segmentation result because it uses all the voxels, while the Hausdorff distance is more sensitive to the boundary of the segmentation result because it only involves the edge points. Therefore, to reduce the influence of a small number of isolated outlier subsets, this paper evaluates the 3D segmentation results by multiplying the Hausdorff distance by 0.95, i.e., Hausdorff $_{95}$ .

Parameter Settings

In the first stage, the numbers of orientations and scales of the Log-Gabor filter bank in the data enhancement module were set to 8 and 4, respectively. In the kernel size attention mechanism of the network, a 3 $\times$ 3 convolution kernel and a convolution with a dilation rate of 2 (with a receptive field approximately that of a 5 $\times$ 5 convolution kernel) were selected. Stochastic gradient descent was used for network training with a learning rate of 0.01, a batch size of 4, and 200 training epochs. The loss function was the Lovasz-Softmax loss [39], and deep supervision was also used during training.
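As a brief check of the "approximately 5 $\times$ 5" remark, the standard relation between kernel size k, dilation rate d, and effective kernel size is worked out here, assuming that the dilated branch uses a 3 $\times$ 3 kernel (the kernel size of that branch is not stated explicitly in the text):

\begin{matrix} k_{eff} = k + (k - 1) (d - 1) = 3 + (3 - 1) (2 - 1) = 5 \end{matrix}

Under that assumption, the dilated branch covers a 5 $\times$ 5 window while keeping only 9 weights.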

In the second stage, the learning rate was set to 0.1, the number of training was set to 200 epochs, the batch size of a single GPU was set to 96, 3 GPUs were used for training, the total batch size was 288, the embedded vector dimension was 512, and ResNet50 was chosen as a backbone. The training loss curve is shown in Fig. 6.

Fig. 6 — Training and validation loss curves of the compared methods

Subfigure (a) shows the training and validation loss curves of the first stage of the proposed method and of U-Net++, and subfigure (b) shows the loss curve of the second stage. From (a), it can be seen that both the training loss and the validation loss of the proposed method converge to lower values than those of U-Net++. At the same time, the training curve is closer to the validation curve, indicating that the proposed method fits the data with a smaller error, learns the discriminative characteristics of the data set, and generalizes better. It can be seen from subfigure (b) that the network gradually converges as the number of training epochs increases, indicating that it gradually strengthens its ability to distinguish different types of data, pulling image patches of the same type closer in the embedding space and pushing patches of different types farther apart.

In the second-stage fine segmentation, the image patch size has an important impact on the accuracy of the metric classification. A grid search was employed to obtain the optimal parameter: the image patch size was set to 11 $\times$ 11, 13 $\times$ 13, ..., 21 $\times$ 21 in turn, and the patches were fed to the subsequent modules to obtain the segmentation results. In this process, 10,000 image pairs were randomly generated from the validation data set, including 5,000 pairs with the same label and 5,000 pairs with different labels, and the relationship between the metric classification accuracy (the proportion of correctly classified pairs among all image pairs) and the image patch size was explored. The result is shown in Fig. 7. The classification accuracy increases with the image patch size until the patch size reaches 17 $\times$ 17, where a higher accuracy is achieved, and then decreases slightly.

Fig. 7 — Segmentation accuracy varies with the size of the image patch

In the metric classification, the CNN was used to extract the embedded vector of each image patch, and cosine similarity was taken as the metric. With this type of metric, an appropriate threshold must be selected to determine whether the central pixel of the current image patch belongs to the target area, so the choice of the threshold is particularly important. To find a suitable threshold, a 10-fold cross-validation strategy was adopted, with 10,000 image pairs randomly generated from the validation set. The threshold was varied over the interval [0, 1] with a step size of 0.1, and the accuracy of the metric classification was used for evaluation. The result is shown in Fig. 8, from which it can be seen that the classification accuracy first increases and then decreases as the threshold increases, reaching its maximum when the threshold is around 0.7. Therefore, the discrimination threshold in our experiments was set to 0.68: when the cosine similarity of an image pair is greater than 0.68, the two images are judged to be of the same category; otherwise, they are judged to be of different categories.
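The threshold search can be illustrated with a short sketch that sweeps candidate thresholds and scores each one by pair accuracy. It assumes the cosine similarities of the image pairs have already been computed; the function names and the toy values are hypothetical, and the cross-validation folds are left out for brevity.

```python
import numpy as np

def pair_accuracy(similarities, same_label, tau):
    """Fraction of pairs whose same/different decision at tau is correct."""
    similarities = np.asarray(similarities, dtype=float)
    same_label = np.asarray(same_label, dtype=bool)
    predicted_same = similarities > tau
    return float(np.mean(predicted_same == same_label))

def best_threshold(similarities, same_label, taus):
    """Return (tau, accuracy) for the first tau with the highest accuracy."""
    accuracies = [pair_accuracy(similarities, same_label, t) for t in taus]
    i = int(np.argmax(accuracies))
    return taus[i], accuracies[i]

# Toy example: three same-label pairs and three different-label pairs.
sims = [0.90, 0.80, 0.75, 0.30, 0.50, 0.65]
same = [True, True, True, False, False, False]
taus = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
print(best_threshold(sims, same, taus))     # (0.7, 1.0)
```

A finer grid around the best coarse value could then be searched in the same way.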

Fig. 8 — Measurement accuracy varies with the threshold

In the ring zone generation part of the second stage, the first-stage segmentation area is dilated and eroded with a p $\times$ p all-ones structuring element to generate the ring zone around the edge of the segmentation target. The width of the ring zone grows with p: the larger the p value, the wider the generated ring zone. The width of the ring zone also has a significant impact on the segmentation accuracy. If the ring zone is too narrow, it may not contain the true edge of the target, so the edge cannot be searched finely within it. If the ring zone is too wide, it produces too many image pairs to measure and a high time cost. Moreover, the image patches of pixels inside an overly wide ring zone are less similar to those of the target edge pixels, which increases the difficulty of the metric classification to a certain extent. To obtain a suitable value, the structuring element scale was set to p = 3, 5, ..., 9 in experiments on the validation set, and the MPA of the segmentation result was used to evaluate the influence of this parameter. The results are shown in Fig. 9. As can be seen from the figure, the segmentation accuracy is highest when $p = 5$ .

Fig. 9 — MPA segmentation result vs. size of the structure operator

Experimental Results and Analysis

To comprehensively evaluate the performance of the two-stage algorithm and verify the rationality of each module, an ablation experiment was designed in which modules of the proposed method were removed to obtain 3 variant methods. Variant1 uses the image enhancement module and both attention mechanisms, but does not use metric classification for fine segmentation. Variant2 uses the image enhancement module and the kernel size adaptive mechanism, but does not use metric classification for fine segmentation. Variant3 uses the image enhancement module and the spatial and channel attention mechanism, but does not use metric classification for fine segmentation.

In our experiments, the seven cardiac tissues in each image were manually annotated by experts and taken as the ground truth, and the experiments were carried out on the whole test data set. The common semantic segmentation indicators PA, MPA, MIoU, and FwIoU were used to measure the performance of the proposed method, its variants, and the compared algorithms. To investigate the characteristics of the proposed method, representative related methods such as DeepLabv3 [15], SegNet [5], U-Net [7], U-Net++ [8] and other semantic segmentation algorithms [10] were taken as the compared methods.

Table 1 reports the comparison results. The proposed method obtains better results than a series of classic segmentation methods on all four evaluation metrics. Comparing Variant1 with the proposed method on the MPA and MIoU indicators demonstrates the effectiveness of the second-stage fine edge optimization: the proposed method effectively reduces mis-classification and adhesion at the target edges and obtains a finer segmentation.

Table 1. Segmentation accuracy of compared methods

Method	PA	MPA	MIoU	FWIoU
Proposed	96.53	96.63	95.12	98.61
Variant1	96.17	95.47	93.28	97.93
Variant2	95.76	92.84	89.42	97.12
Variant3	95.62	91.94	91.66	96.68
U-Net	95.41	92.45	89.64	96.06
U-Net++	95.92	93.92	91.30	97.00
SegNet	94.42	84.47	68.82	92.83
DeepLabv3	94.94	84.37	71.18	94.69
PspNet	93.54	76.92	56.74	91.64

Bold values mean the best results

To compare the segmentation results of the proposed method and each variant more intuitively, some examples are shown in Fig. 10. It can be seen from the figure that the target edges segmented by the proposed method are more refined than those of the three variants. Because some organs and tissues in cardiac MR are poorly differentiated, and Variant2 and Variant3 each contain only one attention mechanism, these variants have difficulty attending to and selecting features appropriately, which leads to a certain degree of false or missing segmentation. Variant1 contains two attention mechanisms, and its segmentation results are more accurate. However, due to the complexity of medical MRI itself, its target edges are still not fine enough, and adhesion between the edges of different organs remains likely. The second stage of fine edge segmentation effectively addresses these problems of insufficient edge segmentation and heavy adhesion.

Fig. 10 — Examples of segmentation results by the proposed method and its variants

To further compare the proposed method with the classic segmentation algorithms, some example results are shown in Fig. 11. It can be seen from the figure that the proposed two-stage segmentation algorithm produces more refined target edges, alleviates the adhesion between some organ edges, and obtains competitive segmentation results. In the results of SegNet and PspNet, by contrast, the edges of different organs and tissues adhere more, the segmentation accuracy is poor, and the segmented areas are noisier.

Fig. 11 — Examples of segmentation results by compared methods

To further evaluate the generalization ability of the model, an experiment was conducted on the slices of the remaining 8 subjects, and the segmentation results were reconstructed into three-dimensional whole hearts. The Dice Index and Hausdorff distance were used for evaluation and for comparison with the classic segmentation methods. The comparison results are reported in Tables 2 and 3.

Table 2. Segmentation accuracy (Dice Index) of seven organs by compared methods

	Dice Index(%)
	Proposed	U-Net++	U-Net	SegNet	DeepLabv3	PspNet
Myo	92.841±0.023	83.327±0.071	85.348±0.122	62.726±0.112	77.059±0.272	52.724±0.080
LA	94.027±0.004	87.550±0.008	87.786±0.129	62.793±0.160	77.993±0.005	55.603±0.080
LV	96.663±0.013	93.995±0.021	95.102±0.017	83.173±0.156	90.376±0.174	79.102±0.185
RA	94.423±0.014	88.303±0.053	90.096±0.041	71.929±0.007	84.803±0.020	58.110±0.445
RV	96.351±0.017	90.860±0.174	91.055±0.069	80.867±0.231	86.767±0.457	73.230±0.964
AO	92.768±0.029	80.748±0.128	85.493±0.136	62.428±0.251	73.714±0.014	48.702±0.730
PA	90.227±0.045	78.497±0.119	82.186±0.065	66.124±0.220	67.351±0.316	42.559±0.645

Myo myocardium of LV, LA left atrium, LV left ventricle, RA right atrium, RV right ventricle, AO ascending aorta, PA pulmonary artery

Bold values mean the best results

Table 3. Hausdorff $_{95}$ indices of three-dimensional whole heart segmentation by compared methods

	Hausdorff $_{95}$ (mm)
	Proposed	U-Net++	U-Net	SegNet	DeepLabv3	PspNet
Myo	1.00±0.01	2.46±0.21	1.85±0.56	10.95±57.38	3.25±2.69	18.59±97.73
LA	0.75±0.19	28.83±68.34	29.57±48.83	51.21±73.31	2.96±0.24	50.78±46.49
LV	0.25±0.19	0.75±0.19	0.25±0.19	5.51±24.40	3.75±22.69	5.20±19.44
RA	0.50±0.25	2.00±0.50	1.75±0.69	11.94±13.62	2.31±0.17	10.98±7.85
RV	0.25±0.19	1.50±0.75	1.10±0.03	3.80±9.77	2.23±4.53	6.02±35.99
AO	3.60±20.29	7.57±77.46	5.20±9.45	28.93±199.09	4.83±0.66	29.45±84.17
PA	2.00±0.50	14.37±69.63	7.22±24.82	7.86±5.98	9.44±43.37	9.84±3.99

Myo myocardium of LV, LA left atrium, LV left ventricle, RA right atrium, RV right ventricle, AO ascending aorta, PA pulmonary artery

Bold values mean the best results

It can be seen from the above tables that the Dice Index of the three-dimensional whole heart segmentation of the proposed method is better than that of the other segmentation methods. Its average value over the 8 cases is higher, indicating that the interior of each segmented organ is filled more accurately, that each cardiac organ is segmented as a whole with less internal noise, and that the overlap with the label is higher. The variance of the index is also smaller, indicating that the proposed method has better stability and stronger generalization ability.

As can be seen from Table 3, the proposed method achieves competitive results on the Hausdorff $_{95}$ index in comparison with the other segmentation methods, which indicates that the segmented edges fit the labels closely and that there is less segmentation noise outside the organs.

To intuitively compare the 3D whole heart segmentation of the proposed method and the compared methods, one subject was randomly selected from the eight subjects as an example, and a 3D heat map visualization was generated using the Hausdorff distance as the measure. The result is shown in Fig. 12. The color bar in the figure encodes the Hausdorff distance between the segmentation result and the corresponding label: when the distance is small, the color tends toward red, and as the distance becomes larger, the color gradually changes to dark blue. Among the compared methods, the segmentation results of our method fit the label better, with higher agreement and less blue false-segmentation noise on the outside. Because the experiment in this paper is carried out on a relatively small data set, and cardiac MRI is complex with some tissues and organs that are poorly distinguishable, the SegNet and PspNet methods produce more noise, especially in symmetrically adjacent regions, such as mis-classification noise between LV and LA. By comparison, DeepLabv3 segments the various organs and tissues better, but its accuracy is lower than that of the proposed method and the U-Net series.

Fig. 12 — Hausdorff distance of segmentation results by compared methods

Discussion

From the comparison and analysis of the above experimental results, it can be seen that the proposed method achieves better segmentation performance in both two-dimensional whole heart segmentation and three-dimensional whole heart reconstruction. In two-dimensional whole heart segmentation, the proposed method clearly improves on the second-best method, U-Net++: the PA, MPA, MIoU and FwIoU indicators increase by 0.61%, 2.71%, 3.82% and 1.61%, respectively.
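The quoted gains are absolute differences (percentage points) between the corresponding rows of Table 1, which can be checked with a few lines of Python. The dictionary below simply copies the two rows from the table; the variable names are illustrative.

```python
# Rows copied from Table 1, in the order PA, MPA, MIoU, FwIoU (all in %).
table1 = {
    "Proposed": (96.53, 96.63, 95.12, 98.61),
    "U-Net++": (95.92, 93.92, 91.30, 97.00),
}

# Absolute gain of the proposed method over U-Net++ for each indicator.
gains = [round(p - u, 2)
         for p, u in zip(table1["Proposed"], table1["U-Net++"])]
print(gains)  # [0.61, 2.71, 3.82, 1.61]
```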

The characteristics of the proposed method are further supported by the variant methods, whose results are slightly lower than those of the proposed method but comparable to those of the related algorithms. Specifically, Variant1 scores higher than U-Net++: the PA, MPA, MIoU and FwIoU indicators increase by 0.25%, 1.55%, 1.98% and 0.93%, respectively. In this variant, the image enhancement attention mechanism enriches the detailed texture information and edge contour information of the original image. Two attention mechanisms are also introduced into the network to adaptively extract target features of different sizes and to learn the discriminative features for segmentation. Together they yield a certain improvement in performance. Variant2 and Variant3, however, each introduce only one attention mechanism, and their segmentation performance is lower than that of Variant1. Variant1 does not include the second-stage fine segmentation, so its segmented edges may not be fine enough and some of them tend to adhere, resulting in lower indicators than the proposed method. The fine metric classification in the proposed method reduces problems such as adhesion and further improves the segmentation accuracy. The evidence from the three variant methods therefore supports the rationality of each module of the proposed method.

In terms of three-dimensional whole heart segmentation, the Dice Index and Hausdorff $_{95}$ distance of the key tissues of the three-dimensional whole heart constructed by the proposed method are better than those of U-Net++ and the related segmentation methods. The highest Dice Index among the key cardiac tissues is 96.66% for the left ventricle, and the lowest is 90.23% for the pulmonary artery. The smallest Hausdorff $_{95}$ distance between the key cardiac tissues and the labels is 0.25 mm for both the left and right ventricles, and the largest is 3.6 mm for the aorta. This indicates that the outer contours of the key cardiac tissues constructed by the proposed method are close and highly similar to the labels, which reflects the effectiveness of the second-stage fine optimization. The variation of the Dice Index and Hausdorff $_{95}$ distance across the constructed three-dimensional heart tissues is also smaller, indicating that the proposed method has stronger generalization ability and better adaptability.

According to the overall results, the proposed method and the U-Net series perform better than the other segmentation methods, which are commonly used on natural images. For example, the best results among SegNet, DeepLabv3 and PspNet are lower than those of the proposed method by 1.59%, 12.16%, 17.94% and 3.92% in the PA, MPA, MIoU and FwIoU indicators, respectively. SegNet and PspNet are prone to mis-segmentation in relatively symmetrical, similar regions, indicating that these networks have difficulty adapting to the complex, small-sample cardiac MRI data set.

It should be noted that the seven organs are intrinsically three-dimensional structures, while our model was trained on two-dimensional images for computational efficiency. Transforming a three-dimensional volume into two-dimensional images is a common strategy in medical image analysis; however, it means that our model was trained on a rather small data set. Specifically, although there are 2800 images in the training data set, each subject contributes many slices, ranging from 105 to 168 in our data set. For each segment of the various cardiac tissues and organs in the two-dimensional images, there are even fewer training images, ranging from 17 to 27. The images corresponding to these segments are therefore very limited and insufficient for whole-image-based deep learning. In this context, our model jointly incorporates an image enhancement module, distance metric learning, and a CNN to improve the representation of images, and achieves competitive segmentation results in comparison with related pure CNN-based algorithms. This suggests that combining traditional image enhancement, machine learning and deep learning is a promising way to improve model capability for similar tasks.

The proposed method contains a coarse and a fine segmentation module, which incur different computational costs during training and testing. In the coarse segmentation module, the U-Net++-like network takes much more time than the image enhancement submodule, which adopts a specific Log-Gabor filterbank. The training time for this module was about 45 min, depending on the number of training images and the GPU acceleration. In fact, the training time could be reduced to nearly 20 min by starting from pre-trained parameters. In the fine segmentation module, the CNN submodule with the ResNet50 backbone took nearly 5 min, since there are relatively few samples in the boundary region. Its training can also be sped up by introducing pre-trained parameters. As a result, the whole training time of our model can be reduced to 30 min in our experiments. Nevertheless, there remains considerable room to optimize the network structure of our model by exploring pruning or scalable skip connection strategies, since the network intrinsically contains many skip connections.

As with most deep learning-based segmentation models, the inference speed of our method on the test data set is very fast. Generating the segmentation results of 100 images took about 200 ms in coarse segmentation and 65 ms in fine segmentation. The main reason is that the cardiac tissues and organs occupy a very small part of the whole image, ranging from 0.03% to 5.8% of the pixels.

The proposed method obtains accurate segmentation results in most cardiac MR images. However, at slice positions near organ boundaries, where the shape features are faint, the area is small, and the contrast with the background is low, partial false segmentation is inevitable. Examples of partial false segmentation are shown in Fig. 13. It can be seen from the figure that the proposed method produces mis-segmentation or missed segmentation when the shape features of some tissues and organs are not obvious and the contrast between the target and the background is low.

Fig. 13 — Examples of segmentation results with low quality by the proposed method

Conclusion and Future Works

A two-stage CNN segmentation model integrating a Log-Gabor filter attention mechanism and metric classification has been proposed in this paper. To enhance the discrimination of various tissues and organs, a channel attention mechanism is used to merge the detailed texture and edge information extracted by the Log-Gabor filterbank with the image, so that the original image is selectively enhanced. To make the network adapt to targets of varying size, the kernel size adaptive mechanism and the spatial and channel attention mechanism are introduced into the encoder-decoder CNN. To make the segmented edges more refined and solve the problem of adhesion among tissues and organs, a metric classification network is introduced to refine the target edges. The experimental results on a cardiac MRI data set with 7 types of labels have verified the rationality and effectiveness of the method. It achieves better performance than related models in both two-dimensional cardiac MRI segmentation and three-dimensional whole heart reconstruction.

Because the proposed method is not designed for pathological information in cardiac data sets, clinical analysis is not considered in this work. For example, the end-diastolic and end-systolic volumes of the left ventricle were not measured from the segmentation results. In future research, we will collect more clinical images with pathological labels and apply the segmentation model, with parameter adjustment or network modification, to improve accuracy.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments, suggestions, and enlightenment.

Author Contribution

X. Wang: conceptualization, methodology, software, investigation, validation, writing - original draft, writing - review and editing, funding acquisition. F. Wang: software, investigation, validation, formal analysis, writing - original draft, visualization. Y. Niu: investigation, data curation, formal analysis, validation, writing - review and editing.

Funding

This work was partially supported by Chinese National Science Foundation (NSFC No. 61971076).

Declarations

Conflict of Interest

The authors declared that they have no conflicts of interest to this work.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

