Abstract
Purpose:
Cardiac boundary segmentation of echocardiographic images is important for cardiac function assessment and disease diagnosis. However, it is challenging to segment cardiac ventricles due to the low contrast-to-noise ratio and speckle noise of the echocardiographic images. Manual segmentation is subject to interobserver variability and is too slow for real-time image-guided interventions. We aim to develop a deep learning-based method for automated multi-structure segmentation of echocardiographic images.
Methods:
We developed an anchor-free mask convolutional neural network (CNN), termed Cardiac-SegNet, which consists of three subnetworks, that is, a backbone, a fully convolutional one-stage object detector (FCOS) head, and a mask head. The backbone extracts multi-level and multi-scale features from the echocardiographic image. The FCOS head utilizes these features to detect and label the regions of interest (ROIs) of the segmentation targets. Unlike the traditional mask regional CNN (Mask R-CNN) method, the FCOS head is anchor-free and can model the spatial relationship of the targets. The mask head utilizes a spatial attention strategy, which allows the network to highlight salient features to perform segmentation on each detected ROI. For evaluation, we investigated 450 patient datasets by a five-fold cross-validation and a hold-out test. The left ventricle endocardium (LVEndo), left ventricle epicardium (LVEpi), and left atrium (LA) were segmented and compared with manual contours using the Dice similarity coefficient (DSC), Hausdorff distance (HD), mean absolute distance (MAD), and center-of-mass distance (CMD).
Results:
Compared to U-Net and Mask R-CNN, our method achieved higher segmentation accuracy and fewer erroneous speckles. When our method was evaluated on a separate hold-out dataset at the end diastole (ED) and end systole (ES) phases, the average DSCs at ED and ES were 0.952 and 0.939 for the LVEndo, 0.965 and 0.959 for the LVEpi, and 0.924 and 0.926 for the LA. For patients with a typical image size of 549 × 788 pixels, the proposed method can perform the segmentation within 0.5 s.
Conclusion:
We proposed a fast and accurate method to segment echocardiographic images using an anchor-free mask CNN.
Keywords: cardiac, CNN, deep learning, segmentation, ultrasound
1. INTRODUCTION
Imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography, single-photon emission computed tomography, and ultrasound imaging have been used in clinical cardiology for patient care tasks such as disease diagnosis, cardiac function assessment and therapy selection. Among these cardiac imaging modalities, cardiac ultrasound imaging, also known as echocardiography, stands out as a noninvasive, portable, and low-cost imaging modality with excellent spatial and temporal resolution. However, the ultrasound images have speckle noise and low contrast-to-noise ratios, which pose challenges to the interpretation and analysis of the images. Despite these challenges, numerous methods have been proposed to analyze echocardiographic images.1,2 Computer-assisted analysis of echocardiographic images enables the automatic assessment of cardiac morphology and function.3 For example, a cardiac function assessment requires estimation of the ejection fraction from the left ventricle (LV), which measures how much blood is pumped out of the LV with each contraction. The ejection fraction calculation requires the accurate segmentation of the LV endocardium at the phases of end systole (ES) and end diastole (ED). Current clinical practice is to perform manual delineation of the LV endocardium, which is prone to intra- and inter-observer variability and moreover is laborious and time-consuming.4
Traditional segmentation methods — based on active contours, level-sets, and active shape models — have achieved limited success for cardiac ultrasound image segmentation.5–7 The segmentation accuracy has been limited by various challenges to ultrasound imaging, for example, poor contrast, inhomogeneous brightness, low signal-to-noise ratio, varying speckle noise, edge dropout, and shadows cast by dense muscle and ribs.8 Furthermore, the variations in tissue echogenicity, cardiac shape, and motion across a patient population pose additional challenges. To overcome these challenges and achieve better segmentation accuracy, deep learning-based methods have been proposed.9 A deep belief network-based classification method was proposed by Carneiro et al. to predict the probability of LV presence within an echocardiographic image.10 Gao et al. used a convolutional neural network to perform voxel-wise image segmentation by classifying every voxel within the regions of interest (ROIs).11 This method, however, is relatively slow because the repetitive classifications need to be performed on each voxel. To reduce the computational time, end-to-end image segmentation methods such as U-Net have been proposed for medical image segmentation.12–14 Ali et al. developed a U-Net variation called a Res-U network to delineate the boundaries of the LV endocardium.15 The Res-U network utilized a modified ResNet-50 as the encoder in the U-Net-like network design to perform this task.
A potential limitation of recently published end-to-end fully convolutional networks (FCNs), such as the Res-U,15 is suboptimal segmentation because this framework does not explicitly model the spatial relationships among the multiple cardiac substructures. To model these spatial relationships, larger field of view (FOV) images of the heart are preferred as network inputs. However, the relatively large FOV presents other challenges; background regions contain noninformative data, which could degrade network performance at the segmentation task. In addition, the structures of interest have large variations in shapes and sizes, which could introduce class imbalance problems. Therefore, it is challenging to use a single FCN model for multi-structure segmentation.
In this study, we aim to solve this problem by developing a deep learning-based multi-structure segmentation method for echocardiographic images, called Cardiac-SegNet. The proposed deep learning model was trained to perform multi-structure segmentation of the left ventricle endocardium (LVEndo), left ventricle epicardium (LVEpi), and left atrium (LA). Two-dimensional (2D) echocardiography is widely used to measure the size and volume of the LV.16 In this work, we perform segmentation on 2D echocardiographic images. We first predict probability maps for the center-of-mass of each structure (this probability map is referred to as the center-ness map in this work) using the learned multi-level feature maps to detect ROIs for each structure. Since the multi-level feature maps are extracted from the whole image and the prediction of the center-ness map is performed in an end-to-end framework, the center-ness map can be derived by learning the spatial relationships of the multiple cardiac structures. The ROI detection step is important for removing the noninformative regions and potentially mitigating the class imbalance problem by better balancing the foreground and background. A binary mask is then created for each structure within a rescaled ROI. The ROI rescaling module assigns a uniform size to each ROI so that the same mask head module can be used to segment multiple structures. For evaluation, we performed segmentation on the echocardiographic images from 450 patient studies. Manual contours were used as the ground truth for evaluating multiple segmentation metrics.
2. MATERIALS AND METHODS
2.A. Overview
The manually segmented LVEndo, LVEpi, and LA were used as the learning targets of the proposed Cardiac-SegNet. To learn the spatial relationships of the structures, a larger FOV image of the heart is taken as Cardiac-SegNet’s input. The network architecture consists of three subnetworks: a backbone network to extract comprehensive features from the images, a fully convolutional one-stage object detector (FCOS) head network to detect the ROI of each structure, and a mask head network to perform segmentation within the detected ROIs. Unlike the Mask R-CNN,17 the proposed Cardiac-SegNet does not require pre-defined anchor boxes (candidate ROIs) and thus is anchor-free. The schematic flow diagram of the Cardiac-SegNet, called an anchor-free Mask CNN, is shown in Fig. 1. The training and inference stages follow a similar feedforward path as shown in Fig. 1, where the black arrows denote the feedforward path during both training and inference, and the blue arrows denote the feedforward path used only during the inference stage.
Fig. 1.
The workflow of the proposed Cardiac-SegNet method. The black arrow denotes the feedforward path in both the training and inference procedures. The blue arrow denotes the feedforward path for the inference procedure only. The red dashed arrow denotes the supervision. The orange regions are the subtraction of the LVEpi and the LVEndo (i.e., the myocardium). The major difference between the proposed method and Mask R-CNN is the FCOS head. The major differences between the proposed method and U-Net are the FCOS head and the mask head.
We first collected the training data consisting of 2D echocardiographic images and their respective manual contours. The model was trained to perform ROI detection and classification by using the manually labeled ROIs as the ground truth. The manually labeled ROI was obtained by setting a 2D bounding box that tightly covers the manual contour of each structure. Meanwhile, the network was trained to perform end-to-end segmentation within the detected ROI by using the manual segmentation as the ground truth. Data augmentation, including rotation (±5°, ±10°, ±15°), flipping (along the x-axis and along the y-axis), image resizing (scaling factors of 0.9 and 1.1), intensity scaling, adding random noise, and elastic deformation, was used to enlarge the variety of the training data. The random noise was added as multiplicative noise via I′ = I × (1 + n), where n is uniform noise with a specified mean and variance. In this work, the mean was set to 0, and the variance was set to a random value for each training image, with its range determined by the maximum and minimum intensity values of that image.
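As an illustration, the following minimal sketch shows one way to realize the multiplicative-noise augmentation in Python; the upper bound of 0.1 times the intensity range used for the variance is an assumed value chosen for illustration, not the setting used in this work.

```python
import numpy as np

def add_multiplicative_noise(image, rng=None):
    """Multiplicative uniform-noise augmentation (sketch).

    The noise mean is fixed to 0; the 0.1 * intensity-range upper bound for
    the variance is an illustrative assumption, not the value used in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    intensity_range = image.max() - image.min()
    variance = rng.uniform(0.0, 0.1 * intensity_range)   # assumed upper bound
    half_width = np.sqrt(3.0 * variance)                  # uniform noise on [-a, a] has variance a^2 / 3
    noise = rng.uniform(-half_width, half_width, size=image.shape)
    return image * (1.0 + noise)                          # multiplicative noise model
```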
The elastic deformation is performed via a Python library called “batchgenerators.”18 The elastic deformation depends on two parameters, a magnitude α and a scale σ. A small σ results in relatively local deformation, while a large σ yields global elastic deformation. In this work, since we do not expect large deformation of the heart (heart motion should be within a reasonable range), the deformation of the heart was performed globally. Thus, for each elastic deformation applied to the training data, the magnitude α was set by randomly selecting a small non-negative number uniformly distributed over the half-open interval [0, 10). The scale σ was drawn from a uniform distribution over the half-open interval [40, 50), which corresponds to a global deformation.
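The sketch below illustrates a generic Simard-style elastic deformation with a magnitude α and a smoothing scale σ drawn from the stated ranges; it is an illustrative stand-in rather than the exact batchgenerators call used in this work.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha_range=(0.0, 10.0), sigma_range=(40.0, 50.0), rng=None):
    """Elastic deformation of a 2D image via a smoothed random displacement field (sketch).

    alpha controls the deformation magnitude and sigma the smoothing scale; the
    [0, 10) and [40, 50) ranges follow the text, but this generic implementation
    is not the exact batchgenerators routine used in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.uniform(*alpha_range)
    sigma = rng.uniform(*sigma_range)

    # Random displacement fields, smoothed with a Gaussian of width sigma.
    dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma, mode="constant") * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma, mode="constant") * alpha

    y, x = np.meshgrid(np.arange(image.shape[0]), np.arange(image.shape[1]), indexing="ij")
    coords = np.array([y + dy, x + dx])
    return map_coordinates(image, coords, order=1, mode="reflect")
```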
The paired images and manual contours were fed into the backbone to extract comprehensive feature maps, as shown in Fig. 1, for the FCOS head and mask head subnetworks. The features were first used by the FCOS head to perform a one-stage object detection19 of the ROIs, which consist of the central position, the bounding box indices, and the class of each structure. Then the previously obtained feature map was cropped within the detected ROIs and re-scaled via ROI rescaling to the same size. The purpose of the ROI rescaling is to transform the feature map cropped inside each ROI to an equal size so that a single mask head can be used without changing the network design to account for the variable dimensions of the input feature maps for different structures. The mask head accepts the incoming feature maps to perform segmentation within the detected ROIs for each structure.
At the inference stage, a series of binary masks within the detected ROIs was obtained by feeding a new patient’s US image into the trained Cardiac-SegNet. The final segmentation was obtained by tracing these binary masks back to the original image coordinate system with each ROI’s respective label, followed by consolidation fusion. The consolidation fusion is required for the proposed method because the trained Cardiac-SegNet model may introduce overlapping segmentations of neighboring structures. If an overlap occurs between two structures, the classification of the overlap region was determined by the relative weights of the detected ROIs; namely, an overlapping pixel was assigned to the structure with the larger weight. The setting of the weights for consolidation fusion was introduced in our previous work.17
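A minimal sketch of the consolidation fusion step is shown below; the weight values are assumed inputs (their setting follows our previous work17), and the label coding is illustrative.

```python
import numpy as np

def consolidate(masks, weights):
    """Resolve overlaps between per-structure binary masks (sketch).

    masks:   dict mapping structure label -> binary mask (H x W), already traced
             back to the original image grid.
    weights: dict mapping structure label -> ROI weight; an overlapping pixel is
             assigned to the structure with the larger weight. The actual weight
             values are assumed inputs here.
    """
    labels = sorted(masks)                       # e.g. 1 = LVEndo, 2 = LVEpi, 3 = LA (illustrative coding)
    h, w = next(iter(masks.values())).shape
    label_map = np.zeros((h, w), dtype=np.int32)
    best_weight = np.full((h, w), -np.inf)

    for lab in labels:
        m = masks[lab].astype(bool)
        take = m & (weights[lab] > best_weight)  # keep the higher-weight structure in overlaps
        label_map[take] = lab
        best_weight[take] = weights[lab]
    return label_map
```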
2.B. Network architecture
2.B.1. Backbone
The network architecture of the backbone is shown in Fig. 2. A 2D ResNet was used as the backbone due to its ability to differentiate multiple classes of objects.20 The 2D ResNet has multiple pyramid feature levels, which were supervised by the manual contours of multiple structures. The purpose of the backbone is not to segment but to provide multi-level features to the two following subnetworks. For an input image, denoted as I, where W and H denote the width and height of the input image, five pyramid-level feature maps are extracted. These feature maps can be represented as

F_i ∈ R^{(W/2^i) × (H/2^i) × (2^(i-1)·C)}, i = 1, …, 5,   (1)

where C denotes the number of feature map channels in the first layer of the backbone. In this work, W and H were set to 640 (197.12 mm) and 960 (147.84 mm), respectively. To make the input image size uniform, central cropping and zero padding were used to crop and pad the incoming 2D US image. A segmentation loss, computed as the binary cross entropy between the segmented contours and the ground truth contours, was used to supervise the backbone; the backbone’s learnable parameters were optimized by minimizing this loss. However, we did not use the segmentation results from the backbone directly; instead, the multi-level feature maps extracted by the backbone were passed to the following subnetworks.
Fig. 2.
The network architecture of the backbone. The backbone outputs feature maps at five pyramid levels. I denotes the input image and S denotes the contour patch. N denotes the number of structures to be segmented. W and H denote the width and height of the input image, respectively. F_i denotes the feature map extracted from pyramid level i, and C denotes the number of feature map channels. The orange region is the subtraction of the LVEpi and the LVEndo.
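A minimal sketch of the central cropping and zero padding used to bring every incoming frame to the 640 × 960 grid is given below; the helper name and the (height, width) axis ordering are assumptions of this sketch.

```python
import numpy as np

def crop_or_pad_to(image, target_h=960, target_w=640):
    """Center-crop and zero-pad a 2D US frame to a fixed grid (sketch)."""
    out = np.zeros((target_h, target_w), dtype=image.dtype)
    h, w = image.shape

    # Source and destination offsets for each axis (crop if too large, pad if too small).
    src_h0 = max((h - target_h) // 2, 0); dst_h0 = max((target_h - h) // 2, 0)
    src_w0 = max((w - target_w) // 2, 0); dst_w0 = max((target_w - w) // 2, 0)
    ch, cw = min(h, target_h), min(w, target_w)

    out[dst_h0:dst_h0 + ch, dst_w0:dst_w0 + cw] = image[src_h0:src_h0 + ch, src_w0:src_w0 + cw]
    return out
```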
2.B.2. FCOS head
The FCOS head was used to detect the ROIs using the group of pyramid feature maps F_i. To reduce the computational cost, the input and output sizes of the FCOS head should be uniform. Thus, for each feature level, bicubic interpolation with a level-specific scaling factor was used as a pre-processing step to rescale that level’s feature map, in the x-, y-, and channel dimensions, to a common size. By using these scaling factors, the group of feature maps was rescaled and then concatenated across all levels into a single feature map F, as shown in Fig. 3. Due to memory constraints, in this work the width W_F of the feature map F was set to 320, the height H_F was set to 480, and the number of feature channels was set to 256.
Fig. 3.
The network architecture of the FCOS head. F_i denotes the feature maps of all extracted pyramid levels. F denotes the rescaled and concatenated feature map. M_c denotes the center-ness map, cls denotes the corresponding classification, and B denotes the corresponding bounding box.
To ease the segmentation, the FCOS head aims to detect an ROI for each structure from the feature map F. In this work, the detected ROI of each structure was represented by its center, class cls, and bounding box B. In contrast, the existing Mask R-CNN method uses multiple anchors (potential ROIs) centered at many possible locations and then regresses the bounding box offsets to classify these anchors.17,21 However, this approach requires substantial computational memory due to the large number of anchor candidates needed to cover highly variable organ locations and shapes. These ambiguous anchors would increase the difficulty of training and thus affect the detection performance. In contrast to the anchor-based Mask R-CNN, we directly predict an ROI at its most probable position. In other words, the FCOS head directly detects the locations and assigns each location a probability of being the center of the target structure, which is similar to the fully convolutional network for semantic segmentation.22 The derived probability map is called the “center-ness map” in our work, since it represents each location’s probability of being the center. Let the center-ness map be M_c, where each channel of M_c represents one structure’s center-ness map. M_c is obtained by first passing the feature map F through three convolution layers and an additional convolution layer for prediction. Then, M_c was fed into a soft-max layer to derive the classification map cls. Thus, for each channel, the largest element in M_c whose corresponding element in cls is close to 1 denotes the most probable central position of a structure. To derive the bounding box indices (upper-left and lower-right corners) of the ROI assigned to each structure’s center and class, M_c was multiplied with the feature map F and fed into two convolutional layers with a stride size of 2 to reduce the resolution, followed by a fully connected layer that performs bounding box index regression.
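A simplified sketch of how the most probable center can be read out from the predicted center-ness and classification maps is shown below; the 0.5 classification threshold is an illustrative choice, not a value specified in this work.

```python
import numpy as np

def most_probable_centers(centerness, classification, threshold=0.5):
    """Pick each structure's most probable center from the FCOS head outputs (sketch).

    centerness, classification: arrays of shape (H, W, num_structures) holding the
    predicted center-ness map and soft-max classification map. The 0.5 class
    threshold is an illustrative assumption.
    """
    centers = {}
    num_structures = centerness.shape[-1]
    for k in range(num_structures):
        # Ignore locations whose class probability is low, then take the center-ness argmax.
        score = np.where(classification[..., k] > threshold, centerness[..., k], -np.inf)
        idx = np.unravel_index(np.argmax(score), score.shape)
        centers[k] = idx  # (row, col) of the most probable center for structure k
    return centers
```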
To supervise the FCOS head, the center-of-mass of each structure was regarded as the ground truth center. The bounding box that tightly covers each structure was used as the ground truth box, and the corresponding label of that structure was used as the ground truth class. The loss function of the FCOS head consists of three parts: a sigmoid cross entropy loss between the predicted center-ness map and the ground truth center-ness, a sigmoid cross entropy loss between the predicted classification and the ground truth class, and a pixel-level intersection-over-union (IoU) loss between the predicted and ground truth bounding boxes. The first term supervises the accuracy of the center-ness map, namely the accuracy of the structure center-of-mass. The second term supervises the accuracy of the structure classification. The third term supervises the accuracy of the structure’s ROI detection.
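The following sketch illustrates the three supervision terms; the equal weighting of the terms and the (x1, y1, x2, y2) box parameterization are assumptions of this sketch.

```python
import numpy as np

def sigmoid_ce(pred_logits, target):
    """Element-wise sigmoid cross entropy, averaged over all elements."""
    p = 1.0 / (1.0 + np.exp(-pred_logits))
    eps = 1e-7
    return float(np.mean(-(target * np.log(p + eps) + (1 - target) * np.log(1 - p + eps))))

def iou_loss(pred_box, gt_box):
    """1 - IoU between two boxes given as (x1, y1, x2, y2)."""
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2]); y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return 1.0 - inter / max(union, 1e-7)

def fcos_head_loss(pred_ctr, gt_ctr, pred_cls, gt_cls, pred_box, gt_box):
    """Sum of the three supervision terms (sketch); equal weighting is an assumption."""
    return sigmoid_ce(pred_ctr, gt_ctr) + sigmoid_ce(pred_cls, gt_cls) + iou_loss(pred_box, gt_box)
```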
2.B.3. Mask head
The mask head was used to segment the binary mask of the contours within the detected ROIs. As shown in Fig. 4, the mask head was implemented as a deep attention U-Net-like structure.23 It was composed of a convolution layer with a stride size of 1, two convolution layers with a stride size of 2, two deconvolution layers to up-sample the feature map, two convolution layers to regress the input features, and a soft-max layer to perform the end-to-end segmentation. In addition, since noise and uninformative elements could be introduced by both the US images and the previous two subnetworks, a deep attention block, that is, an attention gate (AG),22 was integrated via the skip connections to support the mask head’s segmentation within the ROIs. The AG acts as a feature selection operator, learned automatically, to highlight informative features that well represent the organ boundary texture. The input feature map of the mask head can be denoted as F_ROI, and its output is a binary mask of size w_r × h_r, where w_r and h_r denote the cropped and uniformly rescaled width and height of the feature map. In this work, w_r was set to 128 and h_r was set to 256. A segmentation loss, which is a combination of the Dice loss and the binary cross entropy loss, was used to supervise the mask head.
Fig. 4.
The network architecture of the mask head. F_ROI denotes the cropped and rescaled feature map that covers the structure; the output is the segmented binary mask.
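A minimal sketch of the additive attention gate used in the mask head is given below, following the general AG formulation;22,23 the layer sizes, weight names, and example shapes are illustrative assumptions.

```python
import numpy as np

def attention_gate(x, g, w_x, w_g, psi):
    """Additive attention gate (sketch).

    x:   skip-connection feature map, shape (H, W, Cx)
    g:   gating feature map from the coarser level, shape (H, W, Cg),
         assumed already upsampled to the same H x W as x
    w_x, w_g: 1x1 "convolution" weights, shapes (Cx, Ci) and (Cg, Ci)
    psi: 1x1 weights mapping the intermediate features to one channel, shape (Ci, 1)
    The exact layer sizes used in Cardiac-SegNet are not specified here.
    """
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    q = np.maximum(np.einsum("hwc,cd->hwd", x, w_x) + np.einsum("hwc,cd->hwd", g, w_g), 0.0)
    att = 1.0 / (1.0 + np.exp(-np.einsum("hwc,cd->hwd", q, psi)))  # (H, W, 1) attention map
    return x * att  # highlight informative locations in the skip features

# Illustrative usage with random weights and feature maps.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16, 32)); g = rng.normal(size=(16, 16, 64))
out = attention_gate(x, g, rng.normal(size=(32, 16)), rng.normal(size=(64, 16)), rng.normal(size=(16, 1)))
```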
The relationships among W and H, W_F and H_F, and w_r and h_r are as follows. W and H denote the width and height of the input image, respectively; W_F and H_F denote the width and height of the rescaled pyramid feature maps, respectively; and w_r and h_r denote the width and height of the resampled cropped feature map, respectively. The multi-level pyramid feature maps were first extracted from an input image via the backbone, with sizes given by Eq. (1), where C denotes the number of feature map channels extracted from the first convolution layer of the backbone and i denotes the pyramid feature level. Then, for each level of the pyramid feature map, a rescaling operator was used to interpolate that level’s feature map to the same size, with width W_F and height H_F.
Then, the FCOS head was used to obtain the ROI coordinates by regressing the rescaled feature maps. Structures were segmented from the rescaled feature map cropped within each ROI. Because the cropped feature maps have variable sizes, ROI rescaling was used to resample each cropped feature map onto a standard grid of width w_r and height h_r.
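A minimal sketch of the ROI rescaling step is shown below; in the actual TensorFlow implementation this would typically be realized with a crop-and-resize (ROIAlign-like) operation, and the scipy-based bilinear resampling here is an illustrative stand-in.

```python
import numpy as np
from scipy.ndimage import zoom

def roi_rescale(feature_map, box, out_h=256, out_w=128):
    """Crop a detected ROI from a feature map and resample it to a fixed grid (sketch).

    feature_map: (H, W, C) rescaled feature map.
    box:         (row1, col1, row2, col2) detected bounding box in feature-map coordinates.
    The 128 x 256 output grid follows the text; the helper name is illustrative.
    """
    r1, c1, r2, c2 = [int(v) for v in box]
    crop = feature_map[r1:r2, c1:c2, :]
    scale = (out_h / crop.shape[0], out_w / crop.shape[1], 1.0)
    return zoom(crop, scale, order=1)  # bilinear resampling to (out_h, out_w, C)
```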
2.C. Patient data
To evaluate the proposed Cardiac-SegNet, we investigated a total of 450 patient studies from a public dataset called CAMUS (Cardiac Acquisitions for Multi-structure Ultrasound Segmentation).24 The CAMUS datasets were acquired using GE Vivid E95 ultrasound scanners with a GE M5S probe at the University Hospital of St Etienne (France). Each case includes a 2D apical four-chamber view sequence and a two-chamber view sequence. To allow manual annotation of the cardiac structures at end diastole (ED) and end systole (ES), each patient dataset contains at least one full cardiac cycle. The ED frame was defined as the frame after the mitral valve closes or when the LV has the largest volume. The ES frame was defined as the frame after the aortic valve closes or when the LV has the smallest volume. Each patient dataset therefore has two ED and two ES datapoints (one from each view). The CAMUS datasets include manually drawn expert segmentations of the LVEndo, LVEpi, and LA. The protocol for these three structures was set up by: (a) defining the LVEndo as the region enclosed by the LV wall, the mitral valve plane, the trabeculations, the papillary muscles, and the apex; (b) defining the LVEpi as the interface between the pericardium and the myocardium for the anterior, anterolateral, and inferior segments, and as the frontier between the right ventricle cavity and the septum for the inferoseptal segments; and (c) defining the LA as the tissue within the LA wall.24 The pixel size of each 2D image is 0.308 × 0.154 mm2.
2.D. Implementation and evaluation
The proposed Cardiac-SegNet was implemented on an NVIDIA Tesla V100 GPU with 32 GB of memory. The Adam gradient optimizer was used to optimize the learnable parameters of the network. The maximum number of epochs was set to 200. For each epoch, the number of training iterations was set to 20. The learning rate was set to 2e-4. For each iteration of one epoch, a batch of 40 datasets was used for training. The code was implemented using TensorFlow 1.8 in Python 3.6.9. The model training took about 2.5 h. After training, the proposed method can perform inference on a patient’s 2D images within 0.5 s.
We performed a fivefold cross-validation and a hold-out test for the evaluation of our method. For the fivefold cross-validation, 300 patient cases (600 ED data and 600 ES data, i.e., 1200 image sets in total) were used from the larger set of 450 patient cases. Fivefold cross-validation was used so that the same dataset was never used for both training and testing. Specifically, we randomly grouped the 300 patients into five equal groups of 60 patients. For each experiment, four groups (240 patients) were used for training and the remaining group (60 patients) was used for testing. The experiment was repeated five times so that each group was tested once. The remaining 150 patients (300 ED data and 300 ES data) from the 450 patients were used for a hold-out test, for which the previous 300 patients were used for training.
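The grouping used for the fivefold cross-validation can be sketched as follows; the random seed and helper name are illustrative.

```python
import numpy as np

def five_fold_groups(num_patients=300, num_folds=5, seed=0):
    """Randomly split patient indices into equal cross-validation groups (sketch)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_patients)
    return np.array_split(order, num_folds)   # five groups of 60 patient indices

folds = five_fold_groups()
for k in range(5):
    test_ids = folds[k]
    train_ids = np.concatenate([folds[j] for j in range(5) if j != k])
    # train on train_ids, evaluate on test_ids
```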
In the fivefold cross-validation test and the hold-out test, we compared the performance of the proposed Cardiac-SegNet with that of the state-of-the-art U-Net15 and Mask R-CNN25 methods. U-Net is an end-to-end FCN and is often used for ultrasound image segmentation. In contrast to the Cardiac-SegNet, U-Net has an encoding path and a decoding path with long skip connections that generate an equally sized multi-organ segmentation at the output of the decoding path. As an improvement, residual blocks were integrated into the U-Net architecture.15 Mask R-CNN is a type of R-CNN that has recently been used for medical image segmentation. Different from our proposed Cardiac-SegNet, Mask R-CNN needs to build a large number of anchors and relies on a regional proposal network to select anchors as detected ROIs.25
The automatic segmentation was thoroughly evaluated using the Dice similarity coefficient (DSC), Hausdorff distance (HD), and mean absolute distance (MAD), as recommended by the study of the CAMUS dataset.24 In addition, we calculated the center-of-mass distance (CMD) as an additional evaluation.26 Statistical tests were used to compare the segmentation accuracy of Cardiac-SegNet and the competing networks. Normality tests were performed before choosing between parametric and non-parametric statistical tests. The normality of the DSC, HD, MAD, and CMD values provided by each segmentation method was assessed using the D’Agostino-Pearson test implemented in SciPy.27,28 All normality-test p-values were less than 0.05, so the null hypothesis that the data are normally distributed was rejected and non-parametric statistical tests were used. In this work, the Mann-Whitney U test was used. Multiple comparisons were made using the same null hypothesis on the DSC, HD, MAD, and CMD; the Bonferroni correction therefore suggests a threshold significance level of 0.05/4 = 0.0125 to account for the family-wise type I error. Test p-values greater than 0.0125 are highlighted in bold font in Tables I and II. The ejection fraction (EF) was calculated for all patients using the area-length method described by Lang et al.16 The calculated EF was compared to the ground truth EF, which was calculated from the manual contours. A Bland-Altman plot was used to compare the EF estimates of the different methods, and also to evaluate the segmentation area accuracy of the LVEndo, LVEpi, and LA.
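The statistical analysis can be sketched as follows using SciPy's D'Agostino-Pearson normality test and Mann-Whitney U test; the helper name and return values are illustrative.

```python
from scipy import stats

def compare_methods(metric_proposed, metric_competing, alpha=0.05, n_metrics=4):
    """Normality check followed by a Mann-Whitney U test (sketch).

    metric_proposed / metric_competing: per-case values of one metric (e.g. DSC)
    for the proposed and a competing method. The Bonferroni-corrected threshold
    alpha / n_metrics = 0.0125 follows the text.
    """
    # D'Agostino-Pearson normality test; small p-values reject normality.
    _, p_norm_a = stats.normaltest(metric_proposed)
    _, p_norm_b = stats.normaltest(metric_competing)

    # Non-parametric comparison, used here because normality was rejected.
    _, p_value = stats.mannwhitneyu(metric_proposed, metric_competing, alternative="two-sided")
    significant = p_value < alpha / n_metrics
    return p_norm_a, p_norm_b, p_value, significant
```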
Table I.
Numerical results calculated between the ground truth contours and the segmented contours of U-Net, Mask R-CNN, and the proposed method using five-fold cross-validation. P-values were obtained via the Mann-Whitney test between the results of the proposed method and each competing method. Both ED and ES data were evaluated.
Organ | Test | Method | DSC | HD (mm) | MAD (mm) | CMD (mm) |
---|---|---|---|---|---|---|
LVEndo | ED | U-Net | 0.937 ± 0.030 | 4.830 ± 1.923 | 1.976 ± 1.679 | 1.471 ± 1.104
  |   | Mask R-CNN | 0.938 ± 0.027 | 3.975 ± 2.057 | 1.927 ± 1.655 | 1.421 ± 1.061
  |   | Proposed | 0.948 ± 0.024 | 2.288 ± 1.784 | 1.887 ± 1.530 | 1.265 ± 0.957
  |   | P-value: proposed vs U-Net | <0.001 | <0.001 | <0.001 | 0.002
  |   | P-value: proposed vs Mask R-CNN | <0.001 | 0.002 | <0.001 | 0.007
  | ES | U-Net | 0.917 ± 0.045 | 3.635 ± 2.362 | 1.916 ± 1.754 | 1.607 ± 1.406
  |   | Mask R-CNN | 0.919 ± 0.045 | 3.793 ± 2.620 | 1.919 ± 1.817 | 1.561 ± 1.414
  |   | Proposed | 0.927 ± 0.043 | 2.247 ± 2.274 | 1.893 ± 1.785 | 1.397 ± 1.275
  |   | P-value: proposed vs U-Net | <0.001 | <0.001 | <0.001 | 0.001
  |   | P-value: proposed vs Mask R-CNN | <0.001 | <0.001 | <0.001 | **0.013**
LVEpi | ED | U-Net | 0.955 ± 0.018 | 5.157 ± 2.069 | 3.154 ± 2.151 | 1.385 ± 0.918
  |   | Mask R-CNN | 0.955 ± 0.018 | 4.287 ± 2.253 | 2.992 ± 2.340 | 1.440 ± 1.055
  |   | Proposed | 0.960 ± 0.016 | 2.946 ± 2.125 | 2.369 ± 2.029 | 1.264 ± 0.905
  |   | P-value: proposed vs U-Net | <0.001 | 0.003 | <0.001 | 0.003
  |   | P-value: proposed vs Mask R-CNN | <0.001 | <0.001 | <0.001 | 0.004
  | ES | U-Net | 0.947 ± 0.025 | 4.884 ± 2.016 | 2.975 ± 2.150 | 1.374 ± 1.067
  |   | Mask R-CNN | 0.948 ± 0.023 | 3.863 ± 2.151 | 2.757 ± 2.327 | 1.423 ± 1.089
  |   | Proposed | 0.953 ± 0.022 | 2.755 ± 2.157 | 2.746 ± 2.329 | 1.254 ± 1.025
  |   | P-value: proposed vs U-Net | <0.001 | **0.019** | <0.001 | 0.004
  |   | P-value: proposed vs Mask R-CNN | <0.001 | **0.075** | <0.001 | 0.001
LA | ED | U-Net | 0.872 ± 0.113 | 4.893 ± 7.807 | 1.978 ± 1.961 | 2.064 ± 2.387
  |   | Mask R-CNN | 0.878 ± 0.101 | 3.585 ± 3.837 | 1.856 ± 1.228 | 1.999 ± 2.306
  |   | Proposed | 0.895 ± 0.085 | 2.214 ± 4.107 | 1.696 ± 1.750 | 1.645 ± 2.148
  |   | P-value: proposed vs U-Net | <0.001 | 0.006 | <0.001 | <0.001
  |   | P-value: proposed vs Mask R-CNN | <0.001 | 0.001 | <0.001 | <0.001
  | ES | U-Net | 0.910 ± 0.069 | 4.121 ± 3.774 | 1.917 ± 1.504 | 1.688 ± 1.943
  |   | Mask R-CNN | 0.917 ± 0.059 | 3.867 ± 3.477 | 1.872 ± 1.526 | 1.473 ± 1.812
  |   | Proposed | 0.922 ± 0.055 | 2.65 ± 3.453 | 1.703 ± 1.677 | 1.352 ± 1.639
  |   | P-value: proposed vs U-Net | <0.001 | 0.010 | <0.001 | <0.001
  |   | P-value: proposed vs Mask R-CNN | **0.020** | **0.047** | **0.015** | **0.027**
DSC: Dice similarity coefficient. HD: Hausdorff distance. MAD: mean absolute distance. CMD: center-of-mass distance.
Bold text denotes p-values greater than 0.0125.
Table II.
Numerical results calculated between the manual contours and the segmented contours of U-Net, Mask R-CNN, and the proposed method using the hold-out test. P-values were obtained via the Mann-Whitney test between the results of the proposed method and each competing method. Both ED and ES data were evaluated.
Organ | Test | Method | DSC | HD (mm) | MAD (mm) | CMD (mm) |
---|---|---|---|---|---|---|
LVEndo | ED | U-Net | 0.932 ± 0.032 | 3.995 ± 1.967 | 1.969 ± 1.754 | 1.504 ± 1.117
  |   | Mask R-CNN | 0.948 ± 0.024 | 3.561 ± 1.784 | 1.875 ± 1.709 | 1.162 ± 0.784
  |   | Proposed | 0.952 ± 0.019 | 2.100 ± 1.440 | 1.826 ± 1.396 | 1.089 ± 0.755
  |   | P-value: proposed vs U-Net | <0.001 | <0.001 | <0.001 | <0.001
  |   | P-value: proposed vs Mask R-CNN | **0.038** | 0.001 | 0.008 | **0.087**
  | ES | U-Net | 0.925 ± 0.042 | 3.323 ± 1.907 | 1.848 ± 1.776 | 1.466 ± 1.162
  |   | Mask R-CNN | 0.926 ± 0.041 | 3.495 ± 2.154 | 1.902 ± 1.821 | 1.459 ± 1.142
  |   | Proposed | 0.939 ± 0.038 | 2.105 ± 1.677 | 1.833 ± 1.667 | 1.379 ± 1.053
  |   | P-value: proposed vs U-Net | 0.001 | <0.001 | **0.202** | **0.279**
  |   | P-value: proposed vs Mask R-CNN | 0.001 | <0.001 | **0.169** | **0.275**
LVEpi | ED | U-Net | 0.951 ± 0.014 | 3.763 ± 1.423 | 1.928 ± 1.385 | 1.258 ± 0.778
  |   | Mask R-CNN | 0.961 ± 0.014 | 3.842 ± 1.479 | 2.061 ± 1.397 | 1.121 ± 0.662
  |   | Proposed | 0.965 ± 0.013 | 2.614 ± 1.569 | 1.924 ± 1.569 | 1.098 ± 0.673
  |   | P-value: proposed vs U-Net | <0.001 | **0.037** | 0.001 | 0.010
  |   | P-value: proposed vs Mask R-CNN | <0.001 | 0.006 | <0.001 | **0.289**
  | ES | U-Net | 0.953 ± 0.019 | 3.597 ± 1.464 | 1.976 ± 1.382 | 1.234 ± 0.843
  |   | Mask R-CNN | 0.955 ± 0.018 | 3.473 ± 1.422 | 1.974 ± 1.375 | 1.188 ± 0.796
  |   | Proposed | 0.959 ± 0.017 | 2.392 ± 1.504 | 1.860 ± 1.561 | 1.132 ± 0.776
  |   | P-value: proposed vs U-Net | <0.001 | **0.016** | 0.001 | 0.006
  |   | P-value: proposed vs Mask R-CNN | 0.002 | **0.138** | 0.006 | **0.186**
LA | ED | U-Net | 0.902 ± 0.065 | 3.673 ± 2.66 | 1.986 ± 1.807 | 1.473 ± 1.36
  |   | Mask R-CNN | 0.900 ± 0.061 | 3.854 ± 2.409 | 2.180 ± 1.519 | 1.511 ± 1.238
  |   | Proposed | 0.924 ± 0.071 | 2.741 ± 3.199 | 1.768 ± 1.998 | 1.493 ± 1.648
  |   | P-value: proposed vs U-Net | <0.001 | **0.338** | **0.213** | **0.128**
  |   | P-value: proposed vs Mask R-CNN | <0.001 | **0.018** | 0.009 | **0.039**
  | ES | U-Net | 0.919 ± 0.057 | 3.482 ± 2.446 | 1.984 ± 1.757 | 1.331 ± 1.381
  |   | Mask R-CNN | 0.922 ± 0.054 | 3.273 ± 2.090 | 1.947 ± 1.419 | 1.196 ± 1.267
  |   | Proposed | 0.926 ± 0.047 | 2.137 ± 2.112 | 1.708 ± 1.599 | 1.105 ± 1.088
  |   | P-value: proposed vs U-Net | 0.008 | 0.001 | 0.003 | 0.007
  |   | P-value: proposed vs Mask R-CNN | **0.069** | 0.003 | 0.007 | **0.065**
DSC: Dice similarity coefficient. HD: Hausdorff distance. MAD: mean absolute distance. CMD: center-of-mass distance.
Bold text denotes p-values greater than 0.0125.
3. RESULTS
We compared the segmentation results generated by the proposed Cardiac-SegNet with the corresponding manual contours. For illustration, Fig. 5 shows representative US images of high, intermediate, and poor image quality, the physician-drawn ground truth contours, and the automatic segmentation results of the competing methods, followed by the proposed Cardiac-SegNet segmentation. As shown in Fig. 5, all three methods gave reasonable results with similar shapes for the good image quality case (Fig. 5, row (1)), while the proposed Cardiac-SegNet outperformed the two competing methods and showed better agreement and smoother contours on the intermediate and poor image quality cases (Fig. 5, rows (2) and (3)). The U-Net failed to generate accurate LA shapes because of the ambiguous/blurry boundary contrast. Although Mask R-CNN yielded a more reasonable LA segmentation than U-Net, its segmented contour was much larger than the ground truth LA contour.
Fig. 5.
Visualization of the multi-structure segmentation for three patients’ cases from the hold-out test data. Row (1) shows an image with good quality. Row (2) shows an image with intermediate quality. Row (3) shows an image with poor quality. Columns (a)-(e) are the US images, ground truth contours, U-Net contours, Mask R-CNN contours, and the contours from the proposed method. The myocardium, the LVEndo, and the LA are shown in green, red, and blue, respectively.
We also show representative center-ness maps provided by the FCOS head and the estimated bounding boxes overlaid on the center-ness maps. Figure 6 shows a patient’s US image, the physician’s ground truth contours, and the center-ness maps of the three structures obtained by the proposed method. The detected ROIs of the structures are shown as red rectangles overlaid on the center-ness maps. As shown in Fig. 6, the center-ness maps generated by the proposed method have boundaries and shapes similar to the manual contours, and the bounding boxes (detected ROIs) can well enclose the structures for further segmentation.
Fig. 6.
Visualization of the center-ness maps and the estimated bounding boxes (detected ROIs) for three patients’ cases from the hold-out test data. The three rows show three patients’ cases. Columns (a)-(e) are the US images, ground truth contours, center-ness map of the LVEndo, center-ness map of the LVEpi, and center-ness map of the LA. Red rectangles overlaid on columns (c)-(e) show the bounding box (detected ROI) of each of the three structures.
Table I shows the quantitative evaluation results from the five-fold cross-validation of both the ED and ES data for the proposed Cardiac-SegNet and the competing methods. The Mann-Whitney test was performed to evaluate the differences between the results of the proposed Cardiac-SegNet and each competing method, and the corresponding p-values are listed in Table I. As shown in Table I, our proposed method significantly outperformed the competing methods on the DSC, HD, and CMD metrics across both the ED and ES datasets.
Table II shows the quantitative evaluation of the hold-out test results from both the ED and ES data for the proposed Cardiac-SegNet and the competing methods. The Mann-Whitney test was performed to evaluate the differences between the proposed Cardiac-SegNet and each competing method, and the p-values are listed in Table II. As shown in Table II, the proposed method significantly outperformed the competing methods on the DSC, HD, and MAD metrics for both the ED and ES datasets. The resulting p-values show that the proposed method is better than the other two competing methods with statistical significance.
Figure 7 shows the Bland-Altman plots of the algorithm vs manual EF measurements. As can be seen, the proposed method produced EF estimates that agree most closely with the manual measurements among the compared methods.
Fig. 7.
Bland-Altman plot of the EF difference. X-axis denotes the (algorithm EF + manual EF)/2. Y-axis denotes the (algorithm EF - manual EF). The first, second, and third columns show the results of U-Net, Mask R-CNN, and the proposed Cardiac-SegNet, respectively. The first row shows the results for the validation datasets. The second row shows the results for the separate hold-out datasets. The red lines denote the mean lines. The green dash and orange long dash lines denote the mean ± 2 × standard deviation lines.
To verify the segmentation area accuracy, Figs. S1–S4 in the appendix show the area error for the segmented structures using both the fivefold cross-validation and hold-out datasets. These figures suggest that our method outperformed the U-Net and Mask R-CNN.
4. DISCUSSION
In this study, we proposed Cardiac-SegNet, an anchor-free Mask CNN for echocardiographic multi-structure segmentation. The proposed method was tested using 450 patient datasets, totaling 1800 2D echocardiographic images. Leclerc et al. reported that their best performance reached a mean DSC of 0.939 and 0.916 for the LVEndo on the ED and ES datasets, respectively, and a mean DSC of 0.954 and 0.945 for the LVEpi on the ED and ES datasets, respectively.24 Our proposed method achieved a mean DSC of 0.92 for the LVEndo, 0.95 for the LVEpi, and 0.89 for the LA, which are comparable to Leclerc’s best performance. The superiority of the proposed method over the two competing deep learning-based methods was also demonstrated to be statistically significant in terms of DSC and HD for all the segmented structures. Autosegmentation of cardiac structures is a fundamental step for many cardiac applications. The proposed multi-structure segmentation method could automate the contouring process by segmenting the important cardiac structures with high accuracy and speed; it can generate the segmentation results for all the structures in <0.5 s per patient study.
It is worth noting that the segmentations of the LVEndo and LVEpi were worse at the ES phase than at the ED phase. These results suggest that both the LVEndo and LVEpi may be more challenging to contour at ES than at ED. However, the lowest DSC of 0.895 was observed for the LA at ED. The inconsistent performance between the ED and ES datasets may be caused by (a) inconsistency in the ground truth contours arising from the experts’ inconsistent daily clinical practice; and (b) the difficulty in contouring datasets acquired with non-standard views. The consistency of the training datasets could be improved by training the physicians to contour using familiar software that can interactively provide immediate three-dimensional (3D) volumetric rendering to reduce irregular contours. Since the DSC might be biased toward structures with large volumes, it is beneficial to evaluate the segmentation results using surface-based metrics as well. The distance-based metrics, such as HD and MAD, quantify the surface distance; the proposed method achieved a maximum surface error of 2.95 mm across all structures.
The superior segmentation results of the proposed method over the two competing methods can be attributed to a few factors. First, recent deep learning-based end-to-end semantic segmentation methods, such as U-Net, perform multi-structure segmentation in a whole image-based manner. The challenge of this kind of method is that different structures often have large variations in shapes and sizes, such as the LV and LA, which could introduce class imbalance during training and thus decrease the performance of the network. In addition, the noninformative image regions, such as the noisy image background, could introduce feature interference. In contrast, the proposed method first detects the ROI position of each structure. The ROI rescaling module was introduced to resize these ROIs to the same size. Then segmentation was performed within the detected ROIs. In this manner, the noninformative regions were removed from the network input. In addition, the use of ROIs may mitigate the class imbalance issue between different structures by better balancing the image foreground and background.
Although our method and the recent deep learning-based regional convolutional neural networks (R-CNNs), such as Mask R-CNN, both perform segmentation on detected ROIs, the methodologies are different. First, the R-CNN-based methods randomly build anchors (potential ROIs) of different sizes. Then, the R-CNNs perform classification on these anchors to determine whether a target is present, predict the class of the structures within these ROIs, and adjust the locations of the anchors and the offsets of the bounding boxes. Since the anchors are set randomly and have limited size, they may not provide enough global spatial information for the classification step. For example, if the ROI targeting the LV is misplaced or wrongly sized, it may lack important image features that are helpful in delineating the LV. The major difference between the proposed method and the R-CNNs is that the ROIs are not randomly placed for each structure but are obtained by predicting each structure’s center-ness map, in an end-to-end manner, from the whole input frame. The center-ness map prediction step utilizes the global spatial information to detect the most probable location of each structure, and then an ROI is assigned around it via a bounding box prediction step for further segmentation. In other words, our method first detects the ROI of each structure globally and then segments it within the ROI, whereas the R-CNN-based methods both detect and segment structures within ROIs, which is challenging due to the lack of global image features.

Several limitations of this work should be noted. Models are dependent on the quality of the input data, and the quality of the ground truth contours could be improved by having multiple observers reach a consensus. To increase the network’s generalizability, a greater number of datasets may be needed from other clinics, covering patients with diverse demographics and pathological abnormalities. Another limitation of this work is that 2D images were used. Compared to 3D images, 2D image segmentation results in a less accurate EF calculation; moreover, the structural and spatial information in the third dimension is missing, which may limit the performance of the model. We plan to extend our work to 3D image sets in the future. One other limitation of the study is that we have only tested our method using images acquired from a single US probe and scanner. To increase the network’s robustness and generalizability, we plan to collect more datasets acquired using a variety of US probes and scanners for network training in the future. As shown in Fig. 7, for some cases the EF error reached 30%. This may be caused by erroneous segmentation when a small part of the region was missed by the ROI; cropping of the true extent of the ventricle or atrium can occur when the bounding box is misestimated. A potential remedy is to enlarge the number of candidate bounding boxes and use weighted bounding boxes for segmentation consolidation, as recommended for Mask R-CNN17 and Mask scoring R-CNN.21 In contrast, we estimated only one detected ROI per structure. By considering more candidate bounding boxes, segmentation failures or misclassifications may be diminished via consolidation.
5. CONCLUSION
We have developed an anchor-free Mask CNN-based approach for accurate and fast contouring of the endocardium and epicardium of the left ventricle and the left atrium wall from 2D echocardiographic images.
Supplementary Material
Fig. S1 Bland-Altman plot of the area error on ED dataset for the 5-fold validation datasets. X-axis denotes the (algorithm area + manual area)/2(mm2). Y-axis denotes the (algorithm area - manual area). The first, second and third columns show the results of U-Net, Mask R-CNN and the proposed Cardiac-SegNet, respectively. The first row shows the results of LVEndo. The second row shows the results of LVEpi. The third row shows the results of LA. The red lines denote the mean lines. The green dash and orange long dash lines denote the mean ± 2× standard deviation lines.
Fig. S2 Bland-Altman plot of the area error on ES dataset for the 5-fold validation datasets. X-axis denotes the (algorithm area + manual area)/2(mm2). Y-axis denotes the (algorithm area - manual area). The first, second and third columns show the results of U-Net, Mask R-CNN and the proposed Cardiac-SegNet, respectively. The first row shows the results of LVEndo. The second row shows the results of LVEpi. The third row shows the results of LA. The red lines denote the mean lines. The green dash and orange long dash lines denote the mean ± 2× standard deviation lines.
Fig. S3 Bland-Altman plot of the area error on the ED dataset for the hold-out datasets. X-axis denotes the (algorithm area + manual area)/2 (mm2). Y-axis denotes the (algorithm area - manual area). The first, second, and third columns show the results of U-Net, Mask R-CNN, and the proposed Cardiac-SegNet, respectively. The first row shows the results of LVEndo. The second row shows the results of LVEpi. The third row shows the results of LA. The red lines denote the mean lines. The green dash and orange long dash lines denote the mean ± 2× standard deviation lines.
Fig. S4 Bland-Altman plot of the area error on the ES dataset for the hold-out datasets. X-axis denotes the (algorithm area + manual area)/2 (mm2). Y-axis denotes the (algorithm area - manual area). The first, second, and third columns show the results of U-Net, Mask R-CNN, and the proposed Cardiac-SegNet, respectively. The first row shows the results of LVEndo. The second row shows the results of LVEpi. The third row shows the results of LA. The red lines denote the mean lines. The green dash and orange long dash lines denote the mean ± 2× standard deviation lines.
ACKNOWLEDGMENTS
This research was supported in part by the National Cancer Institute of the National Institutes of Health under Award Number R01CA215718 and by an Emory Winship Cancer Institute pilot grant.
REFERENCES
- 1. Liu S, Wang Y, Yang X, Lei B, et al. Deep learning in medical ultrasound analysis: a review. Engineering. 2019;5:261-275.
- 2. Ghorbani A, Ouyang D, Abid A, et al. Deep learning interpretation of echocardiograms. NPJ Digit Med. 2020;3:10.
- 3. Angelini E, Homma S, Pearson G, Holmes J, Laine A. Segmentation of real-time three-dimensional ultrasound for quantification of ventricular function: a clinical study on right and left ventricles. Ultrasound Med Biol. 2005;31:1143-1158.
- 4. Armstrong AC, Ricketts EP, Cox C, et al. Quality control and reproducibility in M-mode, two-dimensional, and speckle tracking echocardiography acquisition and analysis: the CARDIA Study, year 25 examination experience. Echocardiography. 2015;32:1233-1240.
- 5. Leung K, Bosch J. Automated border detection in three-dimensional echocardiography: principles and promises. Eur J Echocardiogr. 2010;11:97-108.
- 6. Mazaheri S, Sulaiman PSB, Wirza R, et al. Echocardiography image segmentation: a survey. 2013 International Conference on Advanced Computer Science Applications and Technologies; 2013:327-332. doi:10.1109/ACSAT.2013.71
- 7. Noble J, Boukerroui D. Ultrasound image segmentation: a survey. IEEE Trans Med Imaging. 2006;25:987-1010.
- 8. Chen C, Qin C, Qiu H, et al. Deep learning for cardiac image segmentation: a review. Front Cardiovasc Med. 2020;7:25.
- 9. Wang Z, Zhang Z, Zheng J-Q, Huang B, Voiculescu I, Yang G. Deep learning in medical ultrasound image segmentation: a review. arXiv. 2020;abs/2002.07703. https://arxiv.org/abs/2002.07703v3
- 10. Carneiro G, Nascimento JC, Freitas A. The segmentation of the left ventricle of the heart from ultrasound data using deep learning architectures and derivative-based search methods. IEEE Trans Image Process. 2012;21:968-982.
- 11. Gao X, Li W, Loomes M, Wang L. A fused deep learning architecture for viewpoint classification of echocardiography. Inform Fusion. 2017;36:103-113.
- 12. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. Paper presented at: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015; 2015; Cham.
- 13. Liu Y, Lei Y, Fu Y, et al. CT-based multi-organ segmentation using a 3D self-attention U-net network for pancreatic radiotherapy. Med Phys. 2020;47:4316-4324.
- 14. Wang T, Lei Y, Tang H, et al. A learning-based automatic segmentation and quantification method on left ventricle in gated myocardial perfusion SPECT imaging: a feasibility study. J Nucl Cardiol. 2020;27:976-987.
- 15. Ali Y, Janabi-Sharifi F, Beheshti S. Echocardiographic image segmentation using deep Res-U network. Biomed Signal Process Control. 2021;64:102248.
- 16. Lang RM, Badano LP, Mor-Avi V, et al. Recommendations for cardiac chamber quantification by echocardiography in adults: an update from the American Society of Echocardiography and the European Association of Cardiovascular Imaging. J Am Soc Echocardiogr. 2015;28:1-39.e14.
- 17. Jeong J, Lei Y, Kahn S, et al. Brain tumor segmentation using 3D Mask R-CNN for dynamic susceptibility contrast enhanced perfusion imaging. Phys Med Biol. 2020;65:185009.
- 18. Isensee F, Jäger P, Wasserthal J, et al. batchgenerators - a Python framework for data augmentation; 2020. doi:10.5281/zenodo.3632567
- 19. Tian Z, Shen C, Chen H, He T. FCOS: fully convolutional one-stage object detection. Paper presented at: 2019 IEEE/CVF International Conference on Computer Vision (ICCV); October 27-November 2, 2019. doi:10.1109/ICCV.2019.00972
- 20. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Paper presented at: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); June 27-30, 2016. doi:10.1109/CVPR.2016.90
- 21. Lei Y, He X, Yao J, et al. Breast tumor segmentation in 3D automatic breast ultrasound using Mask scoring R-CNN. Med Phys. 2021;48:204-214.
- 22. Lei Y, Dong X, Tian Z, et al. CT prostate segmentation based on synthetic MRI-aided deep attention fully convolution network. Med Phys. 2020;47:530-540.
- 23. Dong X, Lei Y, Tian S, et al. Synthetic MRI-aided multi-organ segmentation on male pelvic CT using cycle consistent deep attention network. Radiother Oncol. 2019;141:192-199.
- 24. Leclerc S, Smistad E, Pedrosa J, et al. Deep learning for segmentation using an open large-scale dataset in 2D echocardiography. IEEE Trans Med Imaging. 2019;38:2198-2210.
- 25. Lei Y, Yao J, He X, et al. Deep learning-based breast tumor detection and segmentation in 3D ultrasound image. Paper presented at: Medical Imaging 2020: Ultrasonic Imaging and Tomography; 2020. doi:10.1117/12.2549157
- 26. Lei Y, Tian S, He X, et al. Ultrasound prostate segmentation based on multi-directional deeply supervised V-Net. Med Phys. 2019;46:3194-3206.
- 27. D’Agostino RB. An omnibus test of normality for moderate and large size samples. Biometrika. 1971;58:341-348.
- 28. D’Agostino R, Pearson ES. Tests for departure from normality. Biometrika. 1973;60:613-622.