Abstract
Accurate motion tracking of the left ventricle is critical for detecting wall motion abnormalities in the heart after an injury such as a myocardial infarction. We propose an unsupervised motion tracking framework with physiological constraints to learn dense displacement fields between sequential pairs of 2-D B-mode echocardiography images. Current deep-learning motion-tracking algorithms either require large amounts of ground-truth displacement data, which are difficult to obtain for in vivo datasets (such as patient data and animal studies), or fail to track motion between echocardiographic images due to inherent ultrasound properties (such as low signal-to-noise ratio and various image artifacts). We design a U-Net-inspired convolutional neural network that uses manually traced segmentations as a guide to learn displacement estimates between a source and a target image, without ground-truth displacement fields, by minimizing the difference between the transformed source frame and the original target frame. We then penalize divergence in the displacement field in order to enforce incompressibility within the left ventricle. We demonstrate the performance of our model on synthetic and in vivo canine 2-D echocardiography datasets by comparing it against a non-rigid registration algorithm and a shape-tracking algorithm. Our results show favorable performance of our model against both methods.
Keywords: Unsupervised motion tracking, Echocardiography, Deep learning
1. INTRODUCTION
Cardiovascular disease is the leading cause of mortality in the United States.1 To aid diagnosis, echocardiography, a non-invasive and cost-efficient ultrasound imaging technique, is widely used by physicians to study heart function in patients. Cardiologists evaluate the motion of the myocardium because it provides valuable insight into cardiac health: damaged tissue behaves differently from healthy tissue. It is therefore important to accurately track the motion of the left ventricle in order to localize myocardial damage after an injury, such as myocardial ischemia or infarction. One popular method for modeling motion is non-rigid registration, in which a lattice of control points is placed over a source image and transformed to match a target image.2–4 Another is shape tracking, in which points sampled along the myocardial surface in one time frame are matched to points in subsequent time frames based on trajectory-likelihood and similarity metrics.5,6 Efforts with deep learning models show promising motion estimation results using convolutional neural networks (CNNs).7 Fischer et al. presented two CNN architectures, a single-stream and a double-stream architecture, that learn to estimate optical flow between pairs of images.7 However, their work requires ground-truth displacement fields to train the network, which are difficult to obtain for medical images. To address this lack of ground-truth data, recent work has developed an unsupervised deformable registration framework.8 Balakrishnan et al. developed a learning-based framework to register medical images in an unsupervised manner, using a spatial transformer to morph a source image to match a target image.8 However, this work focused on MR images, which have a considerably higher signal-to-noise ratio (SNR) than ultrasound images.
To our knowledge, there has been no success in implementing an unsupervised deep learning framework in the ultrasound domain for motion tracking.
In the work presented in this paper, we propose an unsupervised motion tracking framework using a U-Net inspired CNN architecture that learns to estimate displacement between sequential pairs of two-dimensional B-mode echocardiography image sequences. We introduce physiological constraints as regularization terms in order to enforce realistic cardiac behavior and alleviate the issues caused by low SNR. More specifically, we choose to enforce the incompressibility of the myocardium by modeling displacement as a divergence-free vector field and we choose to enforce myocardial shape by guiding displacement estimation using anatomical landmarks from manually traced left-ventricular segmentations. We demonstrate the performance of our method against a non-rigid registration based algorithm2 and a shape-tracking based algorithm6 using two 2-D echocardiography image sequences: a synthetic dataset generated by Alessandrini et al.9 and an in vivo canine dataset.
2. METHODS
2.1. Network Architecture
Our goal is to estimate the motion of the left ventricle between two time frames (a source frame and a target frame) in an echocardiography sequence by finding, in an unsupervised manner, a displacement field that warps the source frame to the target frame. Inspired by the success of Balakrishnan et al. in applying an unsupervised learning model to track the deformation of images in MR sequences8 and of the Ronneberger et al. U-Net in analyzing medical images,10 we construct an end-to-end unsupervised network. Two time frames from an echocardiography sequence are input into a U-Net-like convolutional network, which outputs a displacement field; a spatial transformer then uses this field to transform one time frame to match the other (Figs. 1, 2). Convergence is achieved when the mean squared error between the transformed source frame and the target frame is minimized. We incorporate physiological constraints such as incompressibility and shape, in the form of divergence-free and segmentation-based regularization on the displacement field (Section 2.2), in order to introduce biomechanical properties of myocardial movement.
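As an illustration of the spatial-transformer step, the warping of one frame by a displacement field via bilinear interpolation can be sketched in NumPy. This is a minimal sketch (the function name, array layout, and border clipping are our own assumptions), not the network implementation:

```python
import numpy as np

def warp_bilinear(image, u):
    """Warp a 2-D image by a displacement field u of shape (2, H, W),
    where u[0] holds row (y) displacements and u[1] holds column (x)
    displacements, using bilinear interpolation as a spatial transformer would.
    Sample positions falling outside the image are clipped to the border."""
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling grid = identity grid + displacement
    sy = np.clip(ys + u[0], 0, h - 1)
    sx = np.clip(xs + u[1], 0, w - 1)
    y0 = np.floor(sy).astype(int)
    x0 = np.floor(sx).astype(int)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = sy - y0
    wx = sx - x0
    # Blend the four neighboring pixels
    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bot = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

Because the sampling is differentiable in u, a loss on the warped output can be backpropagated to the displacement field, which is what makes the end-to-end unsupervised setup possible.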
Figure 1. Overview of the model.
Figure 2. U-Net based CNN architecture for unsupervised motion tracking.
2.2. Loss Function
Let X = (x1, …, xn) be an echocardiography sequence comprised of n time frames and let Y = (y1, …, yn) be the corresponding left ventricular segmentations of each time frame. There exist $\binom{n}{2}$ possible pairs of time frames within this sequence. We define a model F that learns to calculate the displacement field Ut characterizing the motion of the left ventricle from an arbitrary time frame x to some subsequent time frame xt, such that F(x, Ut) transforms x to match xt using bilinear interpolation. The goal is to minimize the difference between the transformed x and the target xt (denoted Lreg) while penalizing divergence in the displacement field (denoted Ldiv) and imposing shape constraints using Y (denoted Lseg):

$$\mathcal{L} = \mathcal{L}_{reg} + \lambda_1 \mathcal{L}_{div} + \lambda_2 \mathcal{L}_{seg} \tag{1}$$
where λ1 and λ2 are weighting parameters for the regularization terms.
2.2.1. Deformation
The U-Net portion of the network seeks to find a displacement field Ut that minimizes the difference between xt and F(x, Ut). To train the network to find the most accurate displacement field, we calculate the mean squared error between the transformed time frame x and the target time frame xt over all N time frame pairs:

$$\mathcal{L}_{reg} = \frac{1}{N} \sum_{t} \left\| F(x, U_t) - x_t \right\|_2^2 \tag{2}$$
2.2.2. Divergence-Free
As in Parajuli et al.,11 we choose to penalize divergence as a method of enforcing incompressibility within the myocardium of the left ventricle. Penalizing divergence prevents sources and sinks in the displacement field, which are unrealistic motion patterns for the left ventricle, and thereby incorporates the biomechanical properties of left ventricular motion. The divergence of the displacement field U is:

$$\nabla \cdot U = \frac{\partial U_x}{\partial x} + \frac{\partial U_y}{\partial y} \tag{3}$$

where Ux and Uy are the displacement components in the x and y directions. In the presence of sinks and sources, the magnitude of the divergence becomes large, so we seek to minimize it. This regularization term can be described as follows:

$$\mathcal{L}_{div} = \frac{1}{N} \sum_{t} \left\| \nabla \cdot U_t \right\|_2^2 \tag{4}$$
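As a minimal sketch of this term, the divergence penalty for a single 2-D displacement field can be computed with finite differences in NumPy (the (2, H, W) array layout with (Uy, Ux) component ordering is an assumption):

```python
import numpy as np

def divergence_penalty(u):
    """Mean squared divergence of a 2-D displacement field.

    u: array of shape (2, H, W) holding the (U_y, U_x) components.
    Partial derivatives are approximated with finite differences, so a
    constant (pure-translation) field incurs zero penalty, while fields
    with sources or sinks are penalized."""
    duy_dy = np.gradient(u[0], axis=0)  # dU_y / dy
    dux_dx = np.gradient(u[1], axis=1)  # dU_x / dx
    return np.mean((dux_dx + duy_dy) ** 2)
```

Note that a rigid translation passes through unpenalized, while a radially expanding field (a source) is strongly penalized, which matches the incompressibility intuition above.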
2.2.3. Segmentation
In order to incorporate a shape-based regularization term, we enforce tracking along and within the epicardial and endocardial borders of the left ventricle using manually segmented masks Y, which correspond directly to the time frame pairs input into the network. The same learned displacement field that maps the time frames in X is used to transform y to match yt by minimizing the mean squared error between the transformed mask and the mask of the target time frame:

$$\mathcal{L}_{seg} = \frac{1}{N} \sum_{t} \left\| F(y, U_t) - y_t \right\|_2^2 \tag{5}$$
Thus, the complete loss function of the model is obtained by combining Equations 2, 4, and 5 into Equation 1:

$$\mathcal{L} = \frac{1}{N} \sum_{t} \left( \left\| F(x, U_t) - x_t \right\|_2^2 + \lambda_1 \left\| \nabla \cdot U_t \right\|_2^2 + \lambda_2 \left\| F(y, U_t) - y_t \right\|_2^2 \right) \tag{6}$$
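Putting the three terms together, the loss for one frame pair can be sketched as follows; the λ values here are placeholders, not the weights tuned for the paper:

```python
import numpy as np

def total_loss(x_warped, x_t, y_warped, y_t, u, lam1=0.1, lam2=0.5):
    """Combined loss (Eq. 6) for a single frame pair.

    x_warped: source frame after warping, i.e. F(x, U_t)
    x_t:      target frame
    y_warped: source segmentation mask warped by the same field
    y_t:      target segmentation mask
    u:        displacement field, shape (2, H, W) as (U_y, U_x)
    lam1, lam2: example weights; assumptions, not the paper's values."""
    l_reg = np.mean((x_warped - x_t) ** 2)            # image matching (Eq. 2)
    div = np.gradient(u[0], axis=0) + np.gradient(u[1], axis=1)
    l_div = np.mean(div ** 2)                          # incompressibility (Eq. 4)
    l_seg = np.mean((y_warped - y_t) ** 2)             # shape guidance (Eq. 5)
    return l_reg + lam1 * l_div + lam2 * l_seg
```

In training, this scalar would be averaged over the batch of frame pairs and backpropagated through the spatial transformer into the U-Net weights.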
3. EXPERIMENTS AND RESULTS
We tested our method on two different sets of data. First, we used a synthetic echocardiography dataset developed by Alessandrini et al.9 It contains eight volumetric cardiac sequences generated by combining an electro-mechanical model of the human heart with realistic ultrasound features that simulate the difficulty of tracking real ultrasound image sequences. The 8 sequences simulate different physiological conditions: 1 normal sequence; 4 ischemic sequences, with occlusions in the proximal (ladprox) and distal (laddist) left anterior descending artery, the left circumflex artery (lcx), and the right coronary artery (rca); and 3 sequences with dilated geometry, with 1 synchronous (sync) and 2 dyssynchronous (lbbb, lbbbsmall) activations of the left ventricle.
Second, we used in vivo canine datasets comprising 8 canine echocardiography studies. Each study contained three different physiological conditions: a baseline image of the animal, severe stenosis in the mid-left anterior descending artery, and a dobutamine-induced stress condition in the continued presence of severe stenosis. Each canine was implanted with 16 sonomicrometer crystals in the left ventricle, used to obtain information on the location and motion of the myocardium; this motion information served as the ground-truth displacement field for the in vivo dataset. All canine studies were acquired using a Philips iE33 scanner (Philips Medical Systems, Andover, MA) with an X7–2 probe and were conducted in compliance with Institutional Animal Care and Use Committee policies. Imaging frame rates ranged from 50 to 60 frames per second, typically producing around 15–30 3-D volumes for each 4-dimensional sequence.
For the 8 synthetic datasets, we arranged our input data as sample number × number of channels × width × length, where each sample is a pair of two frames from an image sequence. For these experiments, we take a mid-cavity slice of the heart at each time frame in order to observe the maximal movements during the cardiac cycle. We train our unsupervised framework on 6 synthetic datasets, validate on 1, and test on the last remaining set. The training set has dimensions 3672 × 2 × 128 × 128; the validation and testing sets both have dimensions 528 × 2 × 128 × 128.
For the 8 in vivo canine datasets with 3 sequences each, we have a total of 24 sequences. The setup was similar to the synthetic dataset: we take the mid-cavity slice to generate our 2-D images at each time frame. We trained our model on 18 sequences representing 6 animal studies, validated on 3 sequences (1 animal study), and tested on 3 sequences (1 animal study). In summary, the training set has dimensions 7058 × 2 × 128 × 128, the validation set 913 × 2 × 128 × 128, and the testing set 738 × 2 × 128 × 128.
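The construction of these sample × channel × width × length arrays from the frames of a sequence can be sketched as follows; pairing every frame with every subsequent frame is our assumption about how the pair counts above are reached:

```python
import numpy as np

def make_pairs(frames):
    """Stack every ordered pair (x_i, x_t), i < t, of 2-D frames into an
    array of shape (num_pairs, 2, H, W): one sample per pair, with
    channel 0 = source frame and channel 1 = target frame."""
    n = len(frames)
    pairs = [np.stack([frames[i], frames[t]])
             for i in range(n)
             for t in range(i + 1, n)]
    return np.stack(pairs)
```

For a sequence of n frames this yields n(n-1)/2 samples, consistent with the number of possible frame pairs described in Section 2.2.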
To evaluate our model, we quantitatively assess it by calculating the root mean squared error (RMSE) of our predicted displacement fields in the x-y directions against the ground truth (the sonomicrometer crystals for the in vivo data). We compared our model against a non-rigid registration based method2 and a shape-tracking based method.6 We further compare the effects of our regularization terms by evaluating three scenarios: no constraints, the segmentation regularizer (SR), and the segmentation regularizer with the divergence-free term (SR + DF). Figures 3 and 4 show the results for one of the testing sequences for the synthetic and in vivo datasets, respectively. We can visually see a gradual increase in accuracy in our unsupervised model, compared against the ground truth, as each regularization term is added. We list the complete RMSE results in Table 1. For both the in vivo and synthetic studies, the SR + DF model outperformed both the SR and no-constraints models. In the in vivo study, both the SR and SR + DF models outperformed the non-rigid registration and shape tracking methods. In the synthetic study, the SR and SR + DF models outperformed non-rigid registration, but not the shape tracking algorithm. Overall, our unsupervised model shows improvements in both x- and y-displacement fields as constraints on biomechanical properties and anatomical shape are incorporated into the loss term.
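The evaluation metric itself is straightforward; a sketch for a single displacement component follows (the function name and array shapes are assumptions):

```python
import numpy as np

def rmse(u_pred, u_true):
    """Root mean squared error between a predicted displacement
    component (e.g. Ux or Uy collected over a sequence) and its
    ground-truth counterpart."""
    return float(np.sqrt(np.mean((u_pred - u_true) ** 2)))
```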
Figure 3. Estimated displacement vector field for a normal (healthy) synthetic sequence: (a) Ground Truth, (b) Non-rigid Registration, (c) Shape Tracking, (d) Our Model (No Constraints), (e) Our Model (SR), (f) Our Model (SR + DF).
Figure 4. Estimated displacement vector field for an LAD stenosis in vivo canine sequence: (a) Ground Truth, (b) Non-rigid Registration, (c) Shape Tracking, (d) Our Model (No Constraints), (e) Our Model (SR), (f) Our Model (SR + DF).
Table 1.
Root Mean Squared Error (RMSE) for the x and y displacements (Ux, Uy) through a sequence
| Methods | Synthetic | In Vivo |
|---|---|---|
| Non-rigid Registration | 0.73±0.31 | 1.12±0.72 |
| Shape Tracking | 0.24±0.05 | 0.83±0.37 |
| Our Model (No Constraints) | 1.10±0.34 | 0.86±0.31 |
| Our Model (SR) | 0.80±0.31 | 0.52±0.18 |
| Our Model (SR + DF) | 0.69±0.19 | 0.51±0.20 |
4. CONCLUSION
In conclusion, our work demonstrates the feasibility of an end-to-end unsupervised learning-based framework for tracking displacement in echocardiography, specifically of the left ventricle. Our framework exploits the incompressible nature of the myocardium and the generally circular shape of the left ventricle in the short-axis (SAX) view, enforcing realistic cardiac motion by incorporating them as regularization terms on the displacement field. We compare our proposed model to non-rigid registration,2 which calculates a transformation between two images by minimizing an objective function, and shape tracking,6 which tracks the motion of the epicardium and endocardium of the left ventricle as opposed to the full myocardium. We also evaluate the effect of the regularization terms on overall performance. Our model with both the divergence-free incompressibility and segmentation regularization terms performs favorably against the competing methods. This shows that incorporating biomechanical and anatomical information is critical in developing unsupervised motion tracking for medical imaging.
It is important to note that while this work does not require any ground-truth displacement fields, it does require segmentations (obtained manually or otherwise) to help guide tracking, which may be a practical limitation. Future work involves reducing the reliance on densely segmented data as well as extending to 3-dimensional volumes over all time frames.
ACKNOWLEDGMENTS
This work was supported by National Institutes of Health (NIH) grant R01HL121226 and Medical Scientist Training Program grant T32GM007205. Additionally, we would like to acknowledge the technical assistance of the staff of the Yale Translational Research Imaging Center and Drs. Nabil Boutagy, Imran Alkhalil, Melissa Eberle, and Zhao Liu for assisting with the in vivo canine imaging studies.
REFERENCES
- [1] Mensah GA and Brown DW, "An overview of cardiovascular disease burden in the United States," Health Affairs 26(1), 38–48 (2007).
- [2] Rueckert D, Sonoda LI, Hayes C, Hill DL, Leach MO, and Hawkes DJ, "Nonrigid registration using free-form deformations: application to breast MR images," IEEE Transactions on Medical Imaging 18(8), 712–721 (1999).
- [3] Heyde B, Barbosa D, F. M. PC, and D'hooge J, "Three-dimensional cardiac motion estimation based on non-rigid image registration using a novel transformation model adapted to the heart," 7746 (2012).
- [4] Lin N and Duncan J, "Generalized robust point matching using an extended free-form deformation model: application to cardiac images," 1 (2004).
- [5] Shi P, Sinusas AJ, E. R. RC, and Duncan J, "Point-tracked quantitative analysis of left ventricular surface motion from 3-D image sequences," IEEE Transactions on Medical Imaging (2000).
- [6] Parajuli N, Lu A, Ta K, Stendahl J, Boutagy N, Alkhalil I, Eberle M, Jeng G-S, Zontak M, O'Donnell M, et al., "Flow network tracking for spatiotemporal and periodic point matching: applied to cardiac motion analysis," Medical Image Analysis 55, 116–135 (2019).
- [7] Fischer P et al., "FlowNet: Learning optical flow with convolutional networks," CoRR (2015).
- [8] Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, and Dalca AV, "An unsupervised learning model for deformable medical image registration," in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 9252–9260 (2018).
- [9] Alessandrini M, De Craene M, Bernard O, Giffard-Roisin S, Allain P, Waechter-Stehle I, Weese J, Saloux E, Delingette H, Sermesant M, et al., "A pipeline for the generation of realistic 3D synthetic echocardiographic sequences: methodology and open-access database," IEEE Transactions on Medical Imaging 34(7), 1436–1451 (2015).
- [10] Ronneberger O, Fischer P, and Brox T, "U-Net: Convolutional networks for biomedical image segmentation," in [International Conference on Medical Image Computing and Computer-Assisted Intervention], 234–241, Springer (2015).
- [11] Parajuli N, Compas CB, Lin BA, Sampath S, O'Donnell M, Sinusas AJ, and Duncan JS, "Sparsity and biomechanics inspired integration of shape and speckle tracking for cardiac deformation analysis," in [International Conference on Functional Imaging and Modeling of the Heart], 57–64, Springer (2015).
