Published in final edited form as: Stat Atlases Comput Models Heart. 2022 Jan 14;13131:123–131. doi: 10.1007/978-3-030-93722-5_14

Simultaneous Segmentation and Motion Estimation of Left Ventricular Myocardium in 3D Echocardiography Using Multi-task Learning

Kevinminh Ta 1, Shawn S Ahn 1, John C Stendahl 2, Jonathan Langdon 4, Albert J Sinusas 2,4, James S Duncan 1,3,4
PMCID: PMC9221412  NIHMSID: NIHMS1816353  PMID: 35759335

Abstract

Motion estimation and segmentation are both critical steps in identifying and assessing myocardial dysfunction, but are traditionally treated as unique tasks and solved as separate steps. However, many motion estimation techniques rely on accurate segmentations. It has been demonstrated in the computer vision and medical image analysis literature that both these tasks may be mutually beneficial when solved simultaneously. In this work, we propose a multi-task learning network that can concurrently predict volumetric segmentations of the left ventricle and estimate motion between 3D echocardiographic image pairs. The model exploits complementary latent features between the two tasks using a shared feature encoder with task-specific decoding branches. Anatomically inspired constraints are incorporated to enforce realistic motion patterns. We evaluate our proposed model on an in vivo 3D echocardiographic canine dataset. Results suggest that coupling these two tasks in a learning framework performs favorably when compared against single task learning and other alternative methods.

Keywords: Echocardiography, Motion estimation, Segmentation, Multi-task learning

1. Introduction

Reductions in myocardial blood flow due to coronary artery disease (CAD) can result in myocardial ischemia or infarction and subsequent regional myocardial dysfunction. Echocardiography provides a non-invasive and cost-efficient tool for clinicians to visually analyze left ventricle (LV) wall motion and assess regional dysfunction in the myocardium. However, such a qualitative method of analysis is subjective by nature and, as a result, is prone to high intra- and interobserver variability.

Motion estimation algorithms provide an objective method for characterizing myocardial contractile function, while segmentation algorithms assist in localizing regions of ischemic tissue. Traditional approaches treat these tasks as distinct steps and solve them separately, but recent efforts in the medical image analysis and computer vision fields suggest that these two tasks may be mutually beneficial when optimized simultaneously [1–3].

In this paper, we propose a multi-task deep learning network to simultaneously segment the LV and estimate its motion between time frames in a 3D echocardiographic (3DE) sequence. The main contributions of this work are as follows: 1) We introduce a novel multi-task learning architecture with residual blocks that solves both 3D motion estimation and volumetric segmentation using a weight-sharing feature encoder with task-specific decoding branches; 2) We incorporate anatomically inspired constraints to encourage realistic cardiac motion estimation; 3) We apply our proposed model to 3DE sequences, which typically pose additional challenges over Magnetic Resonance (MR) and Computed Tomography (CT) imaging due to lower signal-to-noise ratio, and over 2DE due to higher dimensionality and lower spatial and temporal resolution.

2. Related Works

Classic model-based segmentation methods such as active shape and appearance models usually require a priori knowledge or large amounts of feature engineering to achieve adequate results [4,5]. In recent years, data-driven deep learning approaches have shown promising results, but still face challenges due to inherent ultrasound image properties such as low signal-to-noise ratio and low image contrast [4,6]. This becomes increasingly problematic because cardiac motion estimation approaches often rely on accurate segmentations to act as anatomical guides for surface and shape tracking or for the placement of deformable grid points [7,8], so errors in segmentation predictions can propagate into motion estimates. While an expert clinician could conceivably segment manually or adjust algorithm predictions prior to the motion estimation step, this is a tedious and often infeasible workaround. Several deep learning approaches have been successful at estimating motion in the computer vision field, but the difficulty of obtaining true LV motion in clinical data makes supervised approaches challenging. Unsupervised approaches that maximize intensity similarity between image pairs have been successful in MR and CT, but applications in 3DE remain limited [9–12].

In recent years, efforts have been made to combine the tasks of motion estimation and segmentation. Qin et al. [2] proposed a Siamese-style joint encoder network using a VGG-16 based architecture that demonstrates promising results when applied to 2D MR cardiac images. The work in [13] adapts this idea to 2D echocardiography by adopting a feature bridging framework [1] with anatomically inspired constraints, and this is further expanded to 3DE in [3] through an iterative training approach where results from one task influence the training of the other [14]. In this work, we propose a novel alternative framework for combining motion estimation and segmentation in 3DE that uses a shared feature encoder to exploit complementary latent representations in the data, which the iterative scheme of [3] cannot do. In addition to estimating 3D motion and predicting volumetric LV segmentations, our model further differs from [2] in its use of a 3D U-Net-style architecture [15] with residual blocks [21] for the encoding and decoding branches, as opposed to VGG-16 and FCN architectures (Fig. 1).

Fig. 1. The proposed network and its components. (A) Motion estimation and segmentation tasks are coupled in a multi-task learning framework. (B) An overview of the residual block.

3. Methods

3.1. Motion Estimation Branch

Motion estimation algorithms aim to determine the voxel-wise displacement between two sequential images. Given a source image $I_{source}$ and a target image $I_{target}$, motion estimation algorithms can be described by their formulation of the mapping function $F$ such that $F(I_{source}, I_{target}) \rightarrow U_{x,y,z}$, where $U_{x,y,z}$ is the displacement along the x-y-z directions. Supervised deep learning formulations of $F$ seek to directly learn the regression between image pairs and ground-truth displacement fields. However, due to the scarcity of ground truth for cardiac motion, the motion branch of our network is designed and trained in an unsupervised manner similar to the framework presented in [10], which uses a spatial transformation to maximize a similarity metric between a warped source image and the target image.

Our proposed motion branch comprises a 3D U-Net inspired architecture with a downsampling analysis path followed by an upsampling synthesis path [15,16]. Skip connections combine features learned in the analysis path with features learned in the synthesis path, and residual blocks are used to improve model performance and training efficiency [21]. The downsampling analysis path serves as a feature encoder that shares its weights with the segmentation branch. The input to the motion branch is a pair of 3D images, $I_{source}$ and $I_{target}$; the branch outputs a displacement field $U_{x,y,z}$ describing the motion from $I_{source}$ to $I_{target}$. The displacement field is then used to morph $I_{source}$ to match $I_{target}$, as described in [10]. The objective of the network is to maximize the similarity between the morphed $I_{source}$ and $I_{target}$ by minimizing the mean squared error between each corresponding voxel $p$ in the two frames. This can be described as follows:

$$I_{morphed} = \mathcal{T}(I_{source},\, U_{x,y,z}) \tag{1}$$

$$\mathcal{L}_{sim} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left( I_{target}(p) - I_{morphed}(p) \right)^2 \tag{2}$$

where $\Omega \subset \mathbb{R}^3$ is the image domain and $\mathcal{T}$ denotes the spatial transformer operator that morphs $I_{source}$ using $U_{x,y,z}$.
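To make the warping step concrete, below is a minimal PyTorch sketch of Eqs. 1–2, assuming a VoxelMorph-style spatial transformer built on `grid_sample` [10]; the tensor layout and helper names are our own illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def warp(source, flow):
    """Warp a volume (B, C, D, H, W) by a displacement field (B, 3, D, H, W).

    Channel order of `flow` is assumed to be (x, y, z) displacements in voxels.
    """
    B, _, D, H, W = source.shape
    # Base sampling grid holding each voxel's own coordinates.
    zz, yy, xx = torch.meshgrid(
        torch.arange(D), torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xx, yy, zz), dim=0).float().to(source.device)  # (3, D, H, W)
    new_locs = grid.unsqueeze(0) + flow               # displaced coordinates
    for i, size in enumerate((W, H, D)):              # normalize to [-1, 1]
        new_locs[:, i] = 2.0 * new_locs[:, i] / (size - 1) - 1.0
    new_locs = new_locs.permute(0, 2, 3, 4, 1)        # (B, D, H, W, 3)
    return F.grid_sample(source, new_locs, align_corners=True)

def similarity_loss(target, morphed):
    """Eq. 2: mean squared error over all voxels."""
    return F.mse_loss(morphed, target)
```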

Anatomical Constraints.

In order to enforce realistic cardiac motion patterns, we incorporate anatomical constraints. Cardiac motion fields should generally be smooth; that is, there should be no discontinuities or jumps within the motion field. To discourage such discontinuities, we penalize the L2-norm of the spatial derivatives of the displacement field in a manner similar to [8,10]:

$$\mathcal{L}_{smooth} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left\| \nabla U_{x,y,z}(p) \right\|_2^2 \tag{3}$$
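Continuing the sketch above, the smoothness penalty of Eq. 3 can be approximated with forward finite differences of the displacement field; this is an assumed implementation, not the authors' exact code.

```python
def smoothness_loss(flow):
    """Eq. 3 via forward differences; flow is (B, 3, D, H, W)."""
    dz = flow[:, :, 1:, :, :] - flow[:, :, :-1, :, :]
    dy = flow[:, :, :, 1:, :] - flow[:, :, :, :-1, :]
    dx = flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1]
    return (dx ** 2).mean() + (dy ** 2).mean() + (dz ** 2).mean()
```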

Additionally, the LV myocardium is expected to preserve its general shape through time. To enforce this notion, a shape constraint is added that morphs the manual segmentation of $I_{source}$ using $U_{x,y,z}$ and compares it against the manual segmentation of $I_{target}$, in a manner similar to [11]:

$$\mathcal{L}_{shape} = 1 - \frac{2\,\left| S_{target} \cap \mathcal{T}(S_{source},\, U_{x,y,z}) \right|}{\left| S_{target} \right| + \left| \mathcal{T}(S_{source},\, U_{x,y,z}) \right|} \tag{4}$$

where $S_{source}$ and $S_{target}$ are the manual segmentations of $I_{source}$ and $I_{target}$, respectively. We can then define the full motion loss as:

$$\mathcal{L}_{motion} = \lambda_{sim}\mathcal{L}_{sim} + \lambda_{smooth}\mathcal{L}_{smooth} + \lambda_{shape}\mathcal{L}_{shape} \tag{5}$$
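The shape constraint of Eq. 4 is a soft Dice discrepancy between the warped source mask and the target mask. A sketch reusing `warp()` from above follows, with placeholder λ weights (the paper states its loss weights were selected empirically).

```python
def shape_loss(seg_source, seg_target, flow, eps=1e-6):
    """Eq. 4: soft Dice discrepancy between warped source and target masks."""
    warped = warp(seg_source, flow)
    intersection = (seg_target * warped).sum()
    return 1.0 - 2.0 * intersection / (seg_target.sum() + warped.sum() + eps)

def motion_loss(target, morphed, flow, seg_source, seg_target,
                lam_sim=1.0, lam_smooth=0.5, lam_shape=0.5):
    """Eq. 5: weighted sum of the three motion terms (weights are placeholders)."""
    return (lam_sim * similarity_loss(target, morphed)
            + lam_smooth * smoothness_loss(flow)
            + lam_shape * shape_loss(seg_source, seg_target, flow))
```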

3.2. Segmentation Branch

The objective of segmentation is to assign labels to voxels in order to delineate objects of interest from the background. The segmentation branch of our proposed model follows the same 3D U-Net architectural style as the motion branch [15]. The downsampling analysis path shares weights with the motion branch and then splits into a segmentation-specific upsampling synthesis path. The goal of this branch is to minimize a combined Dice and binary cross-entropy loss between predicted and manual segmentations as follows:

$$\mathcal{L}_{dice} = 1 - \frac{2\,|S \cap M|}{|S| + |M|} \tag{6}$$

$$\mathcal{L}_{bce} = -\left( y_i \log(P_i) + (1 - y_i)\log(1 - P_i) \right) \tag{7}$$

$$\mathcal{L}_{seg} = \lambda_{dice}\mathcal{L}_{dice} + \lambda_{bce}\mathcal{L}_{bce} \tag{8}$$

where $M$ is the predicted LV mask, $S$ is the manually traced ground-truth mask, $y_i$ is the binary ground-truth label of voxel $i$, and $P_i$ is the predicted probability that voxel $i$ belongs to the LV mask. A sketch of this combined objective follows.
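Below is a sketch of the combined segmentation objective of Eqs. 6–8, again with placeholder weights; `pred_prob` is assumed to be the voxel-wise LV probability map from the segmentation decoder.

```python
def seg_loss(pred_prob, mask, lam_dice=1.0, lam_bce=1.0, eps=1e-6):
    """Eqs. 6-8: soft Dice plus binary cross entropy (weights are placeholders).

    pred_prob: voxel-wise LV probabilities in [0, 1]; mask: float binary mask.
    """
    intersection = (pred_prob * mask).sum()
    dice = 1.0 - 2.0 * intersection / (pred_prob.sum() + mask.sum() + eps)
    bce = F.binary_cross_entropy(pred_prob, mask)
    return lam_dice * dice + lam_bce * bce
```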

3.3. Shared Feature Encoder

Multi-task learning is a popular method for combining closely related tasks in a unified framework. In this work, we adopt a soft parameter sharing approach [18]. Inspired by the success of [2] on 2D MR images, we employ a similar Siamese-style model (using a 3D U-Net style architecture with residual blocks) for feature encoding by sharing the weights of the downsampling analysis paths of the motion estimation and segmentation branches. The shared encoder features are then concatenated into each task-specific upsampling synthesis path (feature decoding), so that features learned in both tasks influence the final output. In this way, each branch can exploit the complementary latent representations of the other task during training. Both branches are trained simultaneously and optimized with a composite loss function weighted by α and β:

$$\mathcal{L}_{total} = \alpha\,\mathcal{L}_{motion} + \beta\,\mathcal{L}_{seg} \tag{9}$$
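Putting the pieces together, one training step under the composite loss of Eq. 9 might look like the following. The `model` here is hypothetical: it is assumed to return the displacement field and the segmentation probabilities from the shared analysis path with task-specific synthesis paths.

```python
def train_step(model, optimizer, img_src, img_tgt, seg_src, seg_tgt,
               alpha=1.0, beta=1.0):
    """One optimization step under Eq. 9 (alpha, beta are placeholders)."""
    optimizer.zero_grad()
    # Hypothetical forward pass: shared encoder, two task-specific decoders.
    flow, seg_prob = model(img_src, img_tgt)
    morphed = warp(img_src, flow)
    loss = (alpha * motion_loss(img_tgt, morphed, flow, seg_src, seg_tgt)
            + beta * seg_loss(seg_prob, seg_tgt))
    loss.backward()
    optimizer.step()
    return loss.item()
```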

4. Experiments and Results

4.1. Datasets and Evaluation

In vivo studies were conducted on 8 anesthetized open-chest canines, each imaged under 5 conditions: healthy baseline, mild LAD stenosis, mild LAD stenosis with low-dose dobutamine (5 μg/kg/min), moderate LAD stenosis, and moderate LAD stenosis with low-dose dobutamine [19]. Images were captured using a Philips iE33 scanner with an X7–2 probe. In total, we had 40 3D echocardiographic sequences, which we then sampled into image pairs as input to the network. Image pairs were sampled in a one-to-frame manner: for each sequence, the first time frame (which roughly corresponds to end-diastole) served as $I_{source}$ and each subsequent time frame as $I_{target}$ (see the sketch below). All experiments conducted in support of this work were approved under Institutional Animal Care and Use Committee policies.
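For illustration, the one-to-frame pairing reduces to a few lines; `sequence` is assumed to be an ordered list of 3D frames.

```python
def make_pairs(sequence):
    """Pair the first frame of a sequence with every later frame."""
    source = sequence[0]  # roughly end-diastole
    return [(source, target) for target in sequence[1:]]
```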

Due to the scarcity of ground-truth cardiac motion, quantitative evaluation of motion performance is often done by comparing propagated labels [25]. In this work, we employ a similar evaluation strategy: we warp the endocardial (endo) and epicardial (epi) contours of the source mask and evaluate the mean contour distance (mcd) from the manually traced target mask contours (a sketch of this metric follows below). We compare our model against a conventional motion estimation approach (denoted as Optical flow, as formulated in [20]) as well as a state-of-the-art deep learning model (denoted as Motion only, which resembles the VoxelMorph framework described in [10]). The results in Fig. 2 and Table 1a suggest that the proposed model performs favorably against the alternative methods; a Wilcoxon rank-sum test indicates a significant improvement (p < 0.05) for the proposed model [24].
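One plausible reading of the mean contour distance treats each contour as a point set and averages nearest-neighbor distances from the warped contour to the manual target contour; the paper's exact definition may differ (e.g., it may be symmetrized).

```python
import numpy as np

def mean_contour_distance(warped_pts, target_pts):
    """warped_pts: (N, 3) and target_pts: (M, 3) contour points in mm."""
    # Pairwise Euclidean distances, then nearest target point per warped point.
    d = np.linalg.norm(warped_pts[:, None, :] - target_pts[None, :, :], axis=-1)
    return d.min(axis=1).mean()
```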

Fig. 2. Motion estimates from end-diastole to end-systole for healthy baseline canines. From left to right for both (a) and (b): optical flow, motion only, proposed model.

Table 1. Quantitative evaluation. (a) Lower mcd indicates better performance. (b) Higher Jaccard (up to 1) and lower HD indicate better performance.

(a) Motion estimation: mean contour distance (mcd)

  Methods          Endo (mm)      Epi (mm)
  Optical flow     0.78 ± 0.41    1.89 ± 1.37
  Motion only      0.66 ± 0.31    1.62 ± 1.20
  Proposed model   0.49 ± 0.19    0.61 ± 0.32

(b) Segmentation: Jaccard index and Hausdorff distance (HD)

  Methods          Jaccard        HD (mm)
  Seg only         0.61 ± 0.06    4.00 ± 0.52
  Proposed model   0.71 ± 0.04    3.65 ± 0.29

To evaluate the segmentation predictions, we compare the Jaccard index and Hausdorff distance (HD) between model predictions and manually traced segmentations [22]. We evaluate the proposed model against a segmentation-only network without feature sharing (denoted as Seg only, which resembles the 3D U-Net architecture described in [15]). Additionally, since both the segmentation and motion branches of the proposed model produce segmentation predictions, either as their main task or in support of the shape constraint, we report the average values from these predictions. The results in Fig. 3 and Table 1b suggest that the proposed model performs favorably in predicting LV myocardium segmentations; a Wilcoxon rank-sum test indicates a significant improvement (p < 0.05) of the proposed model over the segmentation-only model for both Jaccard and HD [24].

Fig. 3. Predicted left ventricular masks. From left to right for both (a) and (b): segmentation only, proposed model.

4.2. Implementation Details

Of the 8 canine studies, we set aside 1 entire study, consisting of all 5 conditions, for testing. From the 7 remaining studies, we randomly divided the image pairs such that 90% were used for training and 10% for validation and parameter searching. The acquired images were resampled from their native ultrasound resolutions so that each voxel corresponded to 1 mm³. During training and testing, the images were further resized to 64³ voxels due to computational limitations, and were resized back to 1 mm³ resolution prior to evaluation. An Adam optimizer with a learning rate of 1e−4 was used, and the model was trained with a batch size of 1 for 50 epochs. Due to the small batch size, group normalization was used in place of standard batch normalization [23]. Hyperparameters and loss weights were selected empirically. The model was developed in PyTorch and trained on an NVIDIA GeForce RTX 2080 Ti; pre- and post-processing were done in MATLAB 2019b.
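As an illustration of the normalization choice: with a batch size of 1, batch statistics are unreliable, so GroupNorm [23] replaces BatchNorm3d. A minimal sketch follows; the channel and group counts are placeholders, not the paper's configuration.

```python
import torch.nn as nn

# A 3D convolutional block normalized with GroupNorm instead of BatchNorm3d.
conv_block = nn.Sequential(
    nn.Conv3d(32, 64, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=8, num_channels=64),  # independent of batch size
    nn.ReLU(inplace=True),
)
```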

5. Conclusion

In this paper, we proposed a novel multi-task learning architecture that simultaneously estimates 3D motion and predicts volumetric LV myocardial segmentations in 3D echocardiography. This is accomplished through a weight-sharing feature encoder that learns latent representations mutually beneficial to both tasks. Anatomical constraints are incorporated during training to encourage realistic cardiac motion patterns. Evaluations on an in vivo canine dataset suggest that our model performs favorably compared to single-task learning and other alternative methods. Future work includes further evaluation, such as cross-validation on our existing dataset or validation against a larger or different dataset, as well as exploring potential clinical applications of our model in estimating cardiac strain and detecting and localizing myocardial ischemia.

Acknowledgements.

This work was funded by the following grants: R01 HL121226, R01 HL137365, and HL 098069. Additionally, we are grateful to Drs. Nripesh Parajuli and Allen Lu and to the Fellows and staff of the Yale Translational Research Imaging Center, especially Drs. Nabil Boutagy and Imran Al Khalil, for their technical support and assistance with the in vivo canine imaging studies.

References

1. Cheng J, et al.: SegFlow: joint learning for video object segmentation and optical flow. In: IEEE International Conference on Computer Vision (2017)
2. Qin C, et al.: Joint learning of motion estimation and segmentation for cardiac MR image sequences. In: Frangi AF, Schnabel JA, Davatzikos C, Alberola-López C, Fichtinger G (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 472–480. Springer, Cham (2018). doi: 10.1007/978-3-030-00934-2_53
3. Ta K, Ahn SS, Stendahl JC, Sinusas AJ, Duncan JS: A semi-supervised joint network for simultaneous left ventricular motion tracking and segmentation in 4D echocardiography. In: Martel AL, et al. (eds.) MICCAI 2020. LNCS, vol. 12266, pp. 468–477. Springer, Cham (2020). doi: 10.1007/978-3-030-59725-2_45
4. Chen C, et al.: Deep learning for cardiac image segmentation: a review. Front. Cardiovasc. Med. 7, 25 (2020)
5. Huang X, et al.: Contour tracking in echocardiographic sequences via sparse representation and dictionary learning. Med. Image Anal. 18, 253–271 (2014)
6. Dong S, et al.: A combined fully convolutional networks and deformable model for automatic left ventricle segmentation based on 3D echocardiography. BioMed Res. Int. 2018, 1–16 (2018)
7. Papademetris X, et al.: Estimation of 3-D left ventricular deformation from medical images using biomechanical models. IEEE Trans. Med. Imaging 21, 786–800 (2002)
8. Parajuli N, et al.: Flow network tracking for spatiotemporal and periodic point matching: applied to cardiac motion analysis. Med. Image Anal. 55, 116–135 (2019)
9. Qiu H, Qin C, Le Folgoc L, Hou B, Schlemper J, Rueckert D: Deep learning for cardiac motion estimation: supervised vs. unsupervised training. In: Pop M, et al. (eds.) STACOM 2019. LNCS, vol. 12009, pp. 186–194. Springer, Cham (2020). doi: 10.1007/978-3-030-39074-7_20
10. Balakrishnan G, et al.: An unsupervised learning model for deformable medical image registration. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
11. Zhu W, et al.: NeurReg: neural registration and its application to image segmentation. In: Winter Conference on Applications of Computer Vision (2020)
12. Ahn SS, et al.: Unsupervised motion tracking of left ventricle in echocardiography. In: Medical Imaging 2020: Ultrasonic Imaging and Tomography. International Society for Optics and Photonics (2020)
13. Ta K, et al.: A semi-supervised joint learning approach to left ventricular segmentation and motion tracking in echocardiography. In: IEEE International Symposium on Biomedical Imaging (2020)
14. Tsai Y-H, et al.: Video segmentation via object flow. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
15. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin S, Joskowicz L, Sabuncu MR, Unal G, Wells W (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). doi: 10.1007/978-3-319-46723-8_49
16. Ronneberger O, Fischer P, Brox T: U-Net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). doi: 10.1007/978-3-319-24574-4_28
17. Lu A, et al.: Learning-based regularization for cardiac strain analysis with ability for domain adaptation. arXiv preprint arXiv:1807.04807 (2018)
18. Ruder S: An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017)
19. Stendahl JC, et al.: Regional myocardial strain analysis via 2D speckle tracking echocardiography: validation with sonomicrometry and correlation with regional blood flow in the presence of graded coronary stenoses and dobutamine stress. Cardiovasc. Ultrasound 18(1), 2 (2020). doi: 10.1186/s12947-019-0183-x
20. Le Besnerais G, et al.: Dense optical flow by iterative local window registration. In: IEEE International Conference on Image Processing (2005)
21. He K, et al.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
22. Yushkevich PA, et al.: User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. NeuroImage 31, 1116–1128 (2006)
23. Wu Y, He K: Group normalization. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds.) ECCV 2018. LNCS, vol. 11217, pp. 3–19. Springer, Cham (2018). doi: 10.1007/978-3-030-01261-8_1
24. Gibbons JD, et al.: Nonparametric Statistical Inference, 5th edn. Chapman & Hall/CRC Press, Boca Raton (2011)
25. Yu H, et al.: FOAL: fast online adaptive learning for cardiac motion estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
