Published in final edited form as: Med Image Anal. 2022 Dec 9;84:102711. doi: 10.1016/j.media.2022.102711

Co-Attention Spatial Transformer Network for Unsupervised Motion Tracking and Cardiac Strain Analysis in 3D Echocardiography

Shawn S Ahn a, Kevinminh Ta a, Stephanie L Thorn b, John A Onofrey a,c, Inga H Melvinsdottir b, Supum Lee b, Jonathan Langdon c, Albert J Sinusas a,b,c, James S Duncan a,c,d

Abstract

Myocardial ischemia/infarction causes wall-motion abnormalities in the left ventricle. Therefore, reliable motion estimation and strain analysis using 3D+time echocardiography for localization and characterization of myocardial injury is valuable for early detection and targeted interventions. Previous unsupervised cardiac motion tracking methods rely on heavily-weighted regularization functions to smooth out the noisy displacement fields in echocardiography. In this work, we present a Co-Attention Spatial Transformer Network (STN) for improved motion tracking and strain analysis in 3D echocardiography. Co-Attention STN aims to extract inter-frame dependent features between frames to improve motion tracking in otherwise noisy 3D echocardiography images. We also propose a novel temporal constraint to further regularize the motion field to produce smooth and realistic cardiac displacement paths over time, without prior assumptions on cardiac motion. Our experimental results on both synthetic and in vivo 3D echocardiography datasets demonstrate that our Co-Attention STN provides superior performance compared to existing methods. Strain analysis from the Co-Attention STN also corresponds well with the matched SPECT perfusion maps, demonstrating the clinical utility of using 3D echocardiography for infarct localization.

Keywords: Unsupervised motion tracking, Spatiotemporal attention, Echocardiography

1. Introduction

Cardiovascular disease (CVD) remains the leading cause of death and a major public health burden worldwide (Virani et al. (2021)). Among CVDs, ischemic heart diseases reduce blood flow in the coronary arteries, leading to ischemia and/or infarction in the myocardium. Echocardiography is the most common imaging modality used to assess patients with CVD. It provides a cost-effective tool to analyze the functional status of the heart, both through global measures such as ejection fraction and through evaluation of the regional mechanical changes in injured myocardium by analysis of regional myocardial strain. Strain analysis is particularly effective at quantifying the dynamic mechanical changes after myocardial infarction (MI), as infarcted tissue has reduced contraction during the cardiac cycle. Global Longitudinal Strain (GLS) is commonly assessed using 2D echocardiography from long-axis images (Reisner et al. (2004); Kalam et al. (2014)). However, 2D echocardiography cannot assess full volumetric deformation, and the focus on an average global value limits the utility of GLS. To calculate a dense 3D strain map in the left ventricle, a well-regularized motion tracking strategy in three dimensions is critical.

Many approaches have been explored within the medical imaging community, including speckle tracking (Mondillo et al. (2011)), optical flow (Horn and Schunck (1981)), and surface-based tracking (Parajuli et al. (2019)), to produce the displacement fields of the myocardium. More recently, data-driven deep learning approaches have become popular for their fast computation (Ahn et al. (2020); Ta et al. (2020)). However, unlike many tasks targeted with neural network methods, motion tracking in the heart is difficult due to the lack of ground-truth displacement labels. This limits supervised learning methods and makes the produced output difficult to evaluate. Accordingly, unsupervised learning methods have been used to learn dense displacement maps without any ground-truth labels (Balakrishnan et al. (2019); Ahn et al. (2020); Dalca et al. (2018)). Although these unsupervised approaches have performed well in imaging modalities like magnetic resonance imaging (MRI), performance in the ultrasound domain still needs improvement.

In this work, we propose a co-attention module to utilize the inter-frame correlations in echocardiography images to improve motion tracking based on the spatial transformer network (Jaderberg et al. (2015)). The proposed attention mechanism allows improved feature extraction using feature cross correlations inspired by speckle tracking. To our knowledge, this work is the first to implement a spatiotemporal co-attention module in 3D+time echocardiography for left ventricle motion tracking.

The main contributions of this paper are: (1) introduction of a co-attention spatial transformer network for improved unsupervised 3D left ventricle motion tracking between end-diastole (ED) and end-systole (ES) frames; (2) a model that generates attention maps indicating where in the 3D image the motion is being learned; (3) incorporation of a temporal consistency regularization term in our loss function; (4) extensive performance analyses on both a synthetic echocardiography dataset and an in vivo porcine 3D+time echocardiography dataset; and (5) evaluation of echocardiography-derived strain maps against single photon emission computed tomography (SPECT) perfusion maps.

The remainder of the paper is structured as follows: Section 2 summarizes related work on cardiac motion tracking in echocardiography and attention mechanisms. Section 3 introduces our proposed co-attention spatial transformer network for unsupervised motion tracking of the left ventricle in 3D+time echocardiography. Section 4 describes the implementation details and the training process. Section 5 summarizes the results of our experiments, including comparisons to state-of-the-art registration algorithms. Section 6 discusses various aspects of the co-attention module and its limitations. Section 7 presents the conclusions of this work.

2. Related work

2.1. Traditional Cardiac Motion Tracking Methods

Motion tracking of the heart has been widely studied. Block matching/speckle tracking is the most widely adopted algorithm in both research and clinical settings with echocardiography, taking advantage of the high temporal resolution inherent to ultrasound. The basic assumption is that a consistent intensity pattern across several consecutive time frames can be found by maximizing the similarity of the intensity patterns. The motion vectors are then computed from the distance between the maximally correlated intensities (Horn and Schunck (1981); Lucas and Kanade (1981); Lubinski et al. (1999); Chen et al. (2005); Jia et al. (2010)). Since each motion vector is estimated on the assumption that the intensity pattern remains consistent, the accuracy of block matching and/or speckle tracking often suffers when the frames are too far apart.

Other feature-based tracking methods rely on curvature (Papademetris et al. (2002)) or shapes/surfaces (Papademetris et al. (2002); Huang et al. (2014); Parajuli et al. (2016, 2017); Shi et al. (2000); Lin and Duncan (2004)), which require accurate segmentation of the boundaries of the left ventricle. The boundaries are tracked by finding the closest point in the subsequent frame with the most similar curvature or shape. The motion within the myocardium is then interpolated from the endocardial and epicardial boundary tracking. Although surface-based tracking has high accuracy along the boundaries of the heart, in settings such as subendocardial infarct or mid-myocardial scarring (e.g., cardiac sarcoidosis (Ichinose et al. (2008); Blankstein and Waller (2016))) where the entire myocardium is not fully infarcted, the motion across the myocardium may not be homogeneous. As a result, shape- or surface-based tracking methods have limitations stemming from the interpolation of the motion fields within the myocardium.

2.2. Deep Learning-based Cardiac Motion Tracking Methods

Deep learning-based motion tracking/registration methods have been widely implemented in research over the past few years. Starting with supervised motion tracking, Lu et al. (2021) used a synthetic echocardiography dataset to train a multi-layered perceptron (MLP) to regularize the noisy displacement fields produced by traditional motion tracking algorithms. Wu et al. (2018) used a deep Boltzmann machine (DBM) to learn global heart shape variations and used it to characterize the motion of the heart by delineating the heart contours on each frame. Rohé et al. (2017) used segmentations as labels to train a convolutional neural network to learn the deformations between the segmentations. Similarly, Parajuli et al. (2019) used a point cloud generated from a myocardium segmentation to track the boundaries using point matching. All these methods rely on either a simulated dataset with ground-truth deformation information or expert-labeled segmentations of the myocardium to learn the transformation between images.

However, due to the limited availability of established motion labels for the heart, unsupervised motion tracking has become popular in recent years. Jaderberg et al. (2015) proposed a spatial transformer network (STN) that uses a loss function to directly learn the transformation matrix between two images. The STN inspired many works in the medical image processing field, including the use of a similar spatial transformer to register magnetic resonance images (MRI) of the brain (Balakrishnan et al. (2019)). Cardiac motion tracking using similar U-Net (Ronneberger et al. (2015)) architectures for unsupervised motion tracking has also emerged over the years (Ahn et al. (2020); Ta et al. (2020); Dai et al. (2021); Yu et al. (2020a,b)). However, many of these approaches have been limited to cardiac MRI, and those that focus on echocardiography have often been limited to two-dimensional images. Although global longitudinal strain (GLS), which can be derived on clinical machines (Reisner et al. (2004)), can be calculated from a 2D apical view of the heart, GLS neglects important mechanical and functional changes in the radial and circumferential directions, which can be valuable in assessing patients with ischemic cardiac diseases.

2.3. Overview of Attention

Attention is a mechanism by which a neural network focuses on the pertinent features and weighs them by their importance to a task. This ultimately helps models "attend" to the relevant information during training and improves their performance.

There are many types of attention employed in the computer vision and medical image analysis communities. Spatial attention generates an attention map based on the inter-spatial relationship of the features in a given image; it focuses on "where" in the image the relevant object is located. Schlemper et al. (2019) used a form of spatial attention that places a gated module at each level of the U-Net in order to focus on the relevant area of the image at each field of view. Channel attention instead generates an attention map based on the inter-channel relationship of the features. A typical image input has one (grayscale) to three (R/G/B) channels, and each convolutional layer produces new channels carrying additional information. A traditional convolutional neural network weighs each channel equally when creating the output feature map. Channel attention, on the other hand, models how the channels relate to each other and to the task, and it places more weight on channels with high correspondence. Hu et al. (2018) is a classic example of channel attention, "squeezing" each channel to a single value and computing the weights from the channels' relations to one another.

Unlike spatial and channel attention, temporal attention exploits the temporal structure of video sequence datasets. Lu et al. (2019) proposed a co-attention mechanism to improve object segmentation in video by exploiting the fact that the primary object is distinguishable in all images of the time sequence and appears frequently throughout the video. Although many spatial and channel attention mechanisms have been explored in the medical imaging community, that work has focused on segmentation tasks (Schlemper et al. (2019); Ahn et al. (2021)). Therefore, in our study, we investigate the use of a spatiotemporal co-attention mechanism for unsupervised motion tracking of the left ventricle with echocardiography.

3. Method

3.1. Co-Attention Spatial Transformer Network

We propose to extract and utilize the spatiotemporal features between two time frames to improve feature extraction within the myocardial tissue of the left ventricle in 3D+time echocardiography for unsupervised motion tracking. The idea of a co-attention mechanism has previously been proposed for segmenting objects in 2D video sequences (Lu et al. (2019); Wu et al. (2020); Liu et al. (2021)). In this work, we adapt the idea of co-attention to a 3D+time dataset by employing a spatial transformer network for unsupervised learnable registration. The proposed co-attention spatial transformer architecture is illustrated in Fig. 1. Using the spatial relationship between a pair of images to guide motion tracking is similar to the traditional speckle tracking approach widely used in ultrasound. Inspired by this, we propose a co-attention module that guides and improves unsupervised motion tracking in 3D+time echocardiography images by learning inter-frame dependent features. Our model is composed of three phases: the attention feature extraction phase, the co-attention phase, and the motion tracking phase. The phases are sequential, and the output of the co-attention phase is closely tied to the input of the motion tracking phase.

Fig. 1.

The network architecture of proposed Co-Attention Spatial Transformer Network. The network is comprised of three phases: 1) Attention Feature Extraction Phase, 2) Co-Attention Phase, and 3) Motion Tracking Phase. Phase 1 is based on residual blocks with atrous spatial pyramid pooling to capture the entire image frame during the feature extraction. Phase 2 learns the inter-frame dependent features to generate attention maps. Phase 3 is a standard U-Net architecture as a spatial transformer network. The optional temporal constraint regularization allows an additional input in between ED and ES to guide the motion to pass through the mid-frame.

Given a pair of end-diastole (ED) and end-systole (ES) frames in a 3D+time echocardiography sequence, IED and IES, we aim to find the registration matrix, or in our case, the displacement map between the two time points in the x-y-z Cartesian coordinate system. In the typical clinical setting, cardiologists and sonographers focus mostly on the relationship between the ED and ES frames to calculate measurements like ejection fraction and to observe the most prominent wall-motion abnormalities. Therefore, we focus our work on the displacement fields between these two time points.

Each image volume, whether the ED or ES frame, goes through a standard encoder architecture with shared weights. We utilize a modified ResNet encoder with atrous spatial pyramid pooling (ASPP), similar to Chen et al. (2017). This encoder performs feature extraction with atrous convolutions, fusing feature maps at multiple fields of view to capture the whole image. At the end of the encoder phase, there are two feature maps, FED and FES. Each feature map corresponding to an input volume has dimension C × W × H × D, where C = channel, W = width, H = height, and D = depth.

The attention maps leverage the correlation between the features extracted in the two feature maps, FED and FES. First, we represent the correlation map as $\mathrm{Corr} = F_{ED}^{\top} \otimes W \otimes F_{ES}$, where $W$ is a diagonal weight matrix and $\otimes$ denotes matrix multiplication. To enable the matrix multiplication, FED and FES are reshaped to dimension C × HWD. Thus, the correlation map has dimension HWD × HWD:

$$\mathrm{Corr}(F_{ED}, F_{ES}) = F_{ED}^{\top} \otimes W \otimes F_{ES} \;\in\; \mathbb{R}^{HWD \times HWD} \tag{1}$$

The correlation map is then normalized with a softmax function to calculate the attention map, much as speckle tracking computes the maximum correlation using a cross-correlation function. Specifically, each entry (i, j) of the attention weight matrix represents the spatiotemporal attention feature between the i-th location of the target feature map, FES, and the j-th location of the reference feature map, FED.

$$A_{ED,ES} = \mathrm{softmax}(\mathrm{Corr}) \;\in\; \mathbb{R}^{HWD \times HWD} \tag{2}$$

Once the attention map, AED,ES, is calculated, it is matrix multiplied with the flattened individual feature maps to focus on the regions of the feature maps that show high spatiotemporal correlation with each other. The resulting post-attention feature maps, ZED and ZES, represent refined feature maps for each time frame. The post-attention feature maps are then interpolated to the original dimensions of the input image volume and multiplied, via a dot product, with the original image pair. A schematic of the co-attention mechanism is illustrated in Fig. 2.

$$Z_{ED} = F_{ED}^{\top} \otimes A_{ED,ES} \;\in\; \mathbb{R}^{HWD \times C} \tag{3}$$
$$Z_{ES} = F_{ES} \otimes A_{ED,ES} \;\in\; \mathbb{R}^{C \times HWD} \tag{4}$$
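As a concrete illustration, the co-attention computation of Eqs. (1)-(4) can be sketched in a few lines of PyTorch. This is a minimal sketch, not the authors' released code: the module name, the parameterization of the diagonal weight matrix W as a learnable vector, and the exact transpose conventions after reshaping are our assumptions.

```python
import torch
import torch.nn as nn

class CoAttention3D(nn.Module):
    """Minimal sketch of the co-attention step in Eqs. (1)-(4).

    Feature maps are assumed to be (B, C, H, W, D) tensors; the diagonal
    weight matrix W is parameterized by its C diagonal entries.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.w_diag = nn.Parameter(torch.ones(channels))  # diagonal of W

    def forward(self, f_ed: torch.Tensor, f_es: torch.Tensor):
        b, c, h, w, d = f_ed.shape
        n = h * w * d
        f_ed_flat = f_ed.reshape(b, c, n)                 # (B, C, HWD)
        f_es_flat = f_es.reshape(b, c, n)
        # Eq. (1): Corr = F_ED^T W F_ES -> (B, HWD, HWD).
        corr = torch.einsum('bcn,c,bcm->bnm',
                            f_ed_flat, self.w_diag, f_es_flat)
        # Eq. (2): softmax-normalize the correlation map.
        attn = torch.softmax(corr, dim=-1)
        # Eq. (3): Z_ED in R^{HWD x C} (composed as A^T F_ED^T so shapes match).
        z_ed = torch.einsum('bnm,bnc->bmc', attn, f_ed_flat.transpose(1, 2))
        # Eq. (4): Z_ES = F_ES A in R^{C x HWD}.
        z_es = torch.einsum('bcn,bnm->bcm', f_es_flat, attn)
        # Back to volumetric layout for the later upsampling / dot product.
        z_ed = z_ed.permute(0, 2, 1).reshape(b, c, h, w, d)
        z_es = z_es.reshape(b, c, h, w, d)
        return z_ed, z_es, attn
```

In the full model, the attention output is reduced to a single channel and trilinearly upsampled to the input resolution before the dot product with the image pair (see Section 4.2).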

Fig. 2.

Visual representation of the proposed co-attention mechanism. The blue block represents the extracted feature map of the end-diastole (ED) frame and the yellow block represents the extracted feature map of the end-systole (ES) frame. Each feature map is flattened to enable the matrix multiplication used to compute the attention map. The weight matrix is a diagonal matrix, represented by the white box with diagonal green square components. The green block represents the attention map correlating FED and FES. Note that the channel size is 1 in this figure for simplicity.

The interpolated attention maps are provided as one of the model outputs so that one can interpret where in the image the model focuses to learn the motion. The morphed post-attention feature map from the end-diastole frame is also compared against the post-attention feature map of the end-systole frame to further guide the learning process. The attention loss term is the mean squared error of the transformed ZED matched to ZES:

$$\mathcal{L}_{\text{attention}} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \big( Z_{ES}(p) - \mathcal{T}(Z_{ED}, U_{ED \to ES})(p) \big)^2 \tag{5}$$

Here, U is the displacement field in the x, y, and z directions, 𝓣 is the spatial transformation operation, which in our case is trilinear interpolation, and p is a voxel in the set of all voxels of a given image, denoted Ω. The co-attention mechanism is similar to traditional speckle tracking in the sense that maximum intensity correlations are found to determine the movement between two frames. However, unlike speckle tracking, our proposed co-attention mechanism computes the correlation at the extracted-feature level. The original spatial transformer network (Jaderberg et al. (2015)) and VoxelMorph (Balakrishnan et al. (2019)) rely on the network to learn the registration matrix without any guidance, resulting in a lack of interpretability and noisy displacements. In contrast, our method provides additional guidance in the unsupervised setting by taking advantage of the traditional speckle tracking idea. Instead of performing a block-wise search for speckle correlation in the raw image, our proposed co-attention mechanism attends to the inter-spatial relationship between the ED and ES features.

$$I_{ED}^{*} = I_{ED} \times Z_{ED} \;\in\; \mathbb{R}^{C \times H \times W \times D} \tag{6}$$
$$I_{ES}^{*} = I_{ES} \times Z_{ES} \;\in\; \mathbb{R}^{C \times H \times W \times D} \tag{7}$$

The refined image volumes, IED* and IES*, are computed by taking the dot product of the post-attention feature maps, ZED and ZES, with the original image volumes, IED and IES; × denotes the dot product in Eq. 6 and Eq. 7. The refined image volumes then go through a standard encoder-decoder architecture with skip connections, similar to a U-Net (Ronneberger et al. (2015)), as the base model for the spatial transformer. The three-channel output contains the displacement vectors in the x-, y-, and z-directions. Analogous to the post-attention feature comparison, the original inputs, IED and IES, are compared with one another by transforming IED with the learned displacements and matching it to IES.

$$\mathcal{L}_{\text{image}} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \big( I_{ES}(p) - \mathcal{T}(I_{ED}, U_{ED \to ES})(p) \big)^2 \tag{8}$$
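Both losses share the same structure: warp the ED-side quantity with the predicted displacement field and penalize the squared difference from the ES-side quantity. Below is a minimal sketch of that pattern, using PyTorch's grid_sample as the trilinear transformation operator 𝓣; the tensor layouts and grid conventions are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp3d(vol: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Trilinearly warp vol (B, C, D, H, W) by a displacement field disp
    (B, 3, D, H, W) given in voxels: the operator T of Eqs. (5) and (8).
    Axis ordering and grid conventions are assumptions."""
    b, _, dd, hh, ww = vol.shape
    # Identity grid in voxel coordinates, ordered (z, y, x).
    z, y, x = torch.meshgrid(torch.arange(dd), torch.arange(hh),
                             torch.arange(ww), indexing='ij')
    base = torch.stack((z, y, x)).float().to(vol.device)   # (3, D, H, W)
    coords = base.unsqueeze(0) + disp                       # sampling locations
    # Normalize each axis to [-1, 1] and order the last dim (x, y, z),
    # as grid_sample expects (align_corners=True convention).
    norm = [2.0 * coords[:, i] / (s - 1) - 1.0
            for i, s in enumerate((dd, hh, ww))]
    grid = torch.stack(norm[::-1], dim=-1)                  # (B, D, H, W, 3)
    return F.grid_sample(vol, grid, mode='bilinear', align_corners=True)

def warped_mse(target: torch.Tensor, source: torch.Tensor,
               disp: torch.Tensor) -> torch.Tensor:
    """Shared form of L_attention (Eq. 5, on the Z maps) and L_image
    (Eq. 8, on the raw image volumes)."""
    return torch.mean((target - warp3d(source, disp)) ** 2)
```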

3.2. Temporal Regularization

Many motion tracking algorithms focus on image-to-image registration, or in an echocardiography sequence, frame-to-frame (FtoF) registration. In cardiac motion tracking, FtoF registration can lead to "drift" errors when constructing the full cardiac-cycle motion from ED to ES and back to ED, because temporal interpolation errors propagate through the FtoF pairs. Previous work such as Lu et al. (2021) has examined 1-to-frame registration, where the beginning frame is always the ED frame. Others have used diffeomorphic constraints (De Craene et al. (2012); Ye et al. (2021)) to regularize the FtoF registration temporally. In our work, we propose a new temporal consistency regularization term, in which we input an additional frame between ED and ES. The mid-frame is used as a guide so that the motion vectors pass through the mid-frame during the spatial transformation, allowing the cardiac-cycle movement to be modeled more realistically.

The modification to the base model in Fig. 1 is trivial. Instead of a two-frame input, we give three images as input to our model, IED, Imid, and IES, where Imid is defined as a frame anywhere between ED and ES. To utilize the registration between the time frames with the most deformation, we select Imid to be the middle frame between ED and ES. Using the same feature extraction, each image yields a feature map: FED, Fmid, and FES. Because our co-attention idea comes from maximum intensity cross-correlations in speckle tracking, we obtain additional attention maps relating FED–Fmid, FES–Fmid, and FED–FES:

$$I_{ES}^{*} = I_{ES} \times Z_{ES} \;\in\; \mathbb{R}^{C \times H \times W \times D} \tag{9}$$
$$I_{mid}^{*} = I_{mid} \times Z_{mid} \;\in\; \mathbb{R}^{C \times H \times W \times D} \tag{10}$$
$$I_{ED}^{*} = I_{ED} \times Z_{ED} \;\in\; \mathbb{R}^{C \times H \times W \times D} \tag{11}$$

The main assumption is that the sum of the displacements IED → Imid and Imid → IES should equal the displacement IED → IES. Thus, our temporal regularization is defined as the mean squared error between the summed displacements of IED → Imid and Imid → IES and the displacement of IED → IES:

$$\mathcal{L}_{\text{temporal}} = \frac{1}{3}\,\frac{1}{|\Omega|} \sum_{x,y,z} \sum_{p \in \Omega} \big( U_{ED \to ES}(p) - \left( U_{ED \to mid}(p) + U_{mid \to ES}(p) \right) \big)^2 \tag{12}$$

Ultimately, our total loss is the sum of the three loss functions, with λ1 = 0.05 and λ2 = 0.02:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{image}} + \lambda_1 \mathcal{L}_{\text{attention}} + \lambda_2 \mathcal{L}_{\text{temporal}} \tag{13}$$
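As a sketch, the temporal term and the total objective reduce to a few lines; the 1/3 factor of Eq. (12) is absorbed by averaging over the three vector components. Variable names are illustrative.

```python
import torch

def temporal_loss(u_ed_es: torch.Tensor, u_ed_mid: torch.Tensor,
                  u_mid_es: torch.Tensor) -> torch.Tensor:
    """Eq. (12): the ED->ES displacement should equal the sum of the
    ED->mid and mid->ES displacements. Inputs are (B, 3, D, H, W); the
    mean over all elements absorbs both 1/|Omega| and the 1/3 average
    over the x, y, z components."""
    return torch.mean((u_ed_es - (u_ed_mid + u_mid_es)) ** 2)

def total_loss(l_image: torch.Tensor, l_attention: torch.Tensor,
               l_temporal: torch.Tensor,
               lam1: float = 0.05, lam2: float = 0.02) -> torch.Tensor:
    """Eq. (13) with the weights reported in the text."""
    return l_image + lam1 * l_attention + lam2 * l_temporal
```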

4. Experiments

4.1. Dataset

We use two different 3D+time echocardiography datasets to assess our proposed method. First, we use a synthetic ultrasound dataset (Alessandrini et al. (2015)) composed of 8 sequences, each representing a different physiological condition. The synthetic dataset has ground-truth motion maps derived from the left ventricle mesh movement used to model the heart motion. Second, we acquired 50 3D+time echocardiography images at multiple time points in chronic MI studies in 14 pigs. Each pig underwent a 90-minute balloon occlusion of the mid left anterior descending artery (LAD). The pigs were imaged at baseline pre-occlusion, at 3 days post-MI, and at 7 days post-MI using an X7-2 probe and a Philips iE33 ultrasound system. The different time points correspond to different physiological states, in which different degrees of wall-motion abnormality may be observed. Images were acquired in a standard 2-chamber apical view, allowing full volumetric capture of the left ventricle from the apex to the base. All studies were approved by the Yale University School of Medicine Institutional Animal Care and Use Committee and conducted according to the National Institutes of Health Guidelines for the Care and Use of Laboratory Animals. The endocardium and epicardium boundaries were labeled at the end-systole (ES) and end-diastole (ED) frames by an expert cardiac echosonographer technologist. The original image volume size was 224 × 208 × 208. All images were resampled to 64 × 64 × 64 as input to our model.

4.2. Implementation Details

In training our Co-Attention STN models, we divide each of our datasets into training/validation/testing sets. Since the synthetic dataset is limited to only 8 different 3DE sequences, we use a 6/1/1 split: 6 sequences for training, 1 for validation, and 1 for testing. Because the number of synthetic 3D echocardiography sequences is limited, we take additional frame pairs from within each sequence that are far apart in the cardiac cycle, where the Dice overlap is less than 0.7. This reflects the in vivo setting, where we only have end-diastole (ED) and end-systole (ES) frames. This ultimately results in 458 frame pairs for training, 88 for validation, and 79 for testing. The porcine in vivo 3D echocardiography frames are divided into 35 training, 5 validation, and 10 testing sets, where each sample is a pair of ED and ES frames. The model was trained with an SGD optimizer, a batch size of 1, and a learning rate of 0.0001 for a total of 100 epochs, minimizing the loss function defined in Eq. 13.
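A minimal training loop reflecting these settings might look like the sketch below; CoAttentionSTN and train_loader are hypothetical placeholders for the full model of Fig. 1 and a loader yielding one ED/ES pair per batch, and warped_mse / total_loss are the loss sketches given earlier. This is illustrative, not the authors' code.

```python
import torch

model = CoAttentionSTN().cuda()                           # hypothetical model class
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # reported settings

for epoch in range(100):                                  # 100 epochs, as reported
    for i_ed, i_es in train_loader:                       # batch size 1
        i_ed, i_es = i_ed.cuda(), i_es.cuda()
        z_ed, z_es, disp = model(i_ed, i_es)              # attention maps + U
        loss = total_loss(warped_mse(i_es, i_ed, disp),   # L_image, Eq. (8)
                          warped_mse(z_es, z_ed, disp),   # L_attention, Eq. (5)
                          torch.zeros((), device='cuda')) # no midframe in 2-frame mode
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```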

The model was implemented using PyTorch ver. 1.4.0, and the experiments were conducted on a GTX 1080 Ti GPU. The attention feature extraction branch was an encoder network adapted from Chen et al. (2020), modified to a cascade of 3 ResNet blocks starting with 64 filters and ending with 256 filters (64 → 128 → 256). Atrous spatial pyramid pooling (ASPP) was performed with one 1 × 1 × 1 convolution and three 3 × 3 × 3 convolutions with rates = (6, 12, 18). Thus, given a 1-channel input echocardiography image of size 64 × 64 × 64 × 1, the attention feature embeddings, FED and FES, had size (H = 16, W = 16, D = 16, C = 256). The co-attention mechanism produces AED,ES with size (H = 16, W = 16, D = 16, C = 1), which is upsampled to the original size, yielding ZED and ZES with size (H = 64, W = 64, D = 64, C = 1). The refined image volumes, IED* and IES*, were concatenated and given as input to the motion tracking branch. The basic U-Net architecture for the motion tracking branch is composed of 3 encoding layers and 3 decoding layers. The base number of filters was 128, doubled at each encoding layer and halved at each decoding layer.
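For concreteness, a 3D ASPP block with the stated rates might be written as below. This is a simplified sketch of the Chen et al. (2017) module adapted to 3D, not the authors' exact layer configuration.

```python
import torch
import torch.nn as nn

class ASPP3D(nn.Module):
    """Sketch of 3D atrous spatial pyramid pooling: one 1x1x1 convolution
    and three 3x3x3 atrous convolutions with dilation rates (6, 12, 18),
    fused by a final 1x1x1 convolution."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv3d(in_ch, out_ch, kernel_size=1)] +
            [nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
             for r in (6, 12, 18)])
        self.fuse = nn.Conv3d(4 * out_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch has a different receptive field; concatenate and fuse.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```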

4.3. Experimental Studies

The experimental studies on the synthetic dataset were evaluated using the provided ground-truth motion vectors, which allow us to calculate the median tracking error (MTE) and the cosine similarity. The cosine similarity measures the accuracy of the directionality of the motion vectors, where 1 represents perfect alignment and 0 represents a vector pointing in the opposite direction. The MTE and the cosine similarity combined allow a complete assessment of the magnitude and direction of the computed motion fields. For the porcine dataset, however, there are no ground-truth motion vectors, so our evaluation relied on segmentation-based metrics, namely Hausdorff Distance (HD), Dice Similarity Coefficient, and Jaccard Index, computed with the provided segmentation masks at the two time frames. The myocardium segmentation at the ED frame was transformed with the learned displacements and compared against the myocardium segmentation at the ES frame.
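These metrics are standard; for reference, minimal NumPy/SciPy versions under our assumed definitions (the exact implementations are not specified in the text) are sketched below.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def median_tracking_error(u_pred: np.ndarray, u_true: np.ndarray) -> float:
    """MTE: median L2 norm of the difference between predicted and
    ground-truth displacement vectors, arrays of shape (N, 3)."""
    return float(np.median(np.linalg.norm(u_pred - u_true, axis=1)))

def cosine_similarity(u_pred: np.ndarray, u_true: np.ndarray,
                      eps: float = 1e-8) -> float:
    """Mean cosine similarity between predicted and true motion vectors."""
    num = np.sum(u_pred * u_true, axis=1)
    den = (np.linalg.norm(u_pred, axis=1)
           * np.linalg.norm(u_true, axis=1) + eps)
    return float(np.mean(num / den))

def dice_and_jaccard(a: np.ndarray, b: np.ndarray):
    """Dice coefficient and Jaccard index between binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum()), inter / union

def hausdorff(pts_a: np.ndarray, pts_b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between boundary point sets (N, 3)."""
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])
```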

For quantitative comparison, we evaluated our proposed Co-Attention Spatial Transformer Network (Co-Attention STN) against other registration/motion tracking algorithms on each dataset. We compare our proposed method with other deep learning-based methods, namely VoxelMorph (Balakrishnan et al. (2019)) and VoxelMorph with shape regularization using a segmentation loss (SegLoss, (Ta et al., 2020)). Finally, we include an ablation study to measure the improvement from our proposed temporal regularization and the effect of varying numbers of intermediate frames. The intermediate frames are selected to be evenly distributed between the ED and ES frames: a single midframe is the time frame exactly in the middle of ED and ES, and two midframes are evenly spaced between ED and ES.

5. Results

Table 1 shows the quantitative results on the synthetic echocardiography dataset. Compared to VoxelMorph and VoxelMorph+SegLoss, our proposed Co-Attention STN performs better on all evaluation metrics. The median tracking error is comparable with and without SegLoss, showing that manual segmentation guidance is not necessarily needed when using the co-attention mechanism. Furthermore, our results show that adding further intermediate midframes yields only mild improvement over a single midframe.

Table 1.

Quantitative comparison of Co-Attention STN motion fields using synthetic echocardiography dataset (Alessandrini et al. (2015)) against other deep learning based methods.

| Methods | Median Tracking Error (mm) | HD (mm) | DICE | Jaccard Index | Cosine Similarity |
| --- | --- | --- | --- | --- | --- |
| VoxelMorph (Balakrishnan et al., 2019) | 1.33 ± 0.19 | 5.87 ± 0.44 | 0.76 ± 0.02 | 0.61 ± 0.03 | 0.38 ± 0.20 |
| VoxelMorph + SegLoss (Ta et al., 2020) | 1.24 ± 0.09 | 7.52 ± 3.04 | 0.74 ± 0.08 | 0.59 ± 0.09 | 0.49 ± 0.08 |
| VoxelMorph + 1 midframe | 1.19 ± 0.10 | 5.81 ± 0.32 | 0.76 ± 0.01 | 0.58 ± 0.08 | 0.45 ± 0.25 |
| Co-Attention STN | 1.05 ± 0.16 | 5.64 ± 0.17 | 0.76 ± 0.01 | 0.62 ± 0.01 | 0.60 ± 0.09 |
| Co-Attention STN + SegLoss | 1.04 ± 0.08 | 5.72 ± 0.31 | 0.72 ± 0.03 | 0.57 ± 0.03 | **0.71 ± 0.05** |
| Co-Attention STN + Temporal (1 midframe) | 1.11 ± 0.5 | **4.94 ± 0.18** | 0.80 ± 0.01 | **0.67 ± 0.02** | 0.50 ± 0.30 |
| Co-Attention STN + Temporal (2 midframes) | 1.07 ± 0.06 | 5.01 ± 0.21 | **0.81 ± 0.01** | 0.66 ± 0.01 | 0.56 ± 0.25 |
| Co-Attention STN + Temporal (3 midframes) | **0.99 ± 0.12** | 4.99 ± 0.15 | **0.81 ± 0.02** | 0.65 ± 0.03 | 0.51 ± 0.20 |

Best results per metric are marked in bold.

The displacement fields generated from the 3D echocardiography scans in porcine post-myocardial infarction studies are shown in Fig. 3. Visual inspection shows that our proposed Co-Attention STN gives a smoother displacement field with a well-defined contraction motion between end-diastole and end-systole, even without regularization functions. VoxelMorph (Balakrishnan et al. (2019)) and VoxelMorph+SegLoss (Ta et al. (2020)) produce motion fields that are not homogeneous in certain regions of the myocardium. Our proposed temporal constraint further regularizes the motion fields, improving accuracy over the baseline Co-Attention STN. Furthermore, the addition of intermediate frames consistently yields smooth displacement fields that are similar to one another, suggesting that the motion vectors learned with 1 midframe vs. 2 midframes vs. 3 midframes approach the expected true motion fields of the heart. Table 2 shows the quantitative evaluation metrics on the porcine dataset. Accuracy is improved with our proposed Co-Attention STN compared to VoxelMorph with and without the segmentation loss function. Similar to the results on the synthetic dataset, the addition of the SegLoss function did not give a significant improvement when using the co-attention mechanism. Additionally, we tested whether the improvement from our proposed method comes from the co-attention mechanism itself or from the addition of a midframe; to do so, we added the same midframe in the VoxelMorph framework when computing the displacement fields. Our results showed that even with the addition of a midframe (VoxelMorph + 1 midframe), VoxelMorph without co-attention did not outperform our proposed Co-Attention STN. The addition of our proposed temporal regularization gave the best accuracy across all three evaluation metrics. As in the synthetic dataset, the comparison among 1 midframe vs. 2 midframes vs. 3 midframes showed no significant difference in performance.

Fig. 3.

Visualization of displacement vectors from 3D echocardiography in porcine myocardial infarction models. * represents our proposed methods with varying regularization functions.

Table 2.

Quantitative comparison of Co-Attention STN motion fields using porcine dataset.

| Methods | DICE | HD (mm) | Jaccard Index |
| --- | --- | --- | --- |
| VoxelMorph (Balakrishnan et al., 2019) | 0.73 ± 0.05 | 6.08 ± 1.02 | 0.58 ± 0.06 |
| VoxelMorph + SegLoss (Ta et al., 2020) | 0.72 ± 0.05 | 6.37 ± 1.06 | 0.57 ± 0.06 |
| VoxelMorph + 1 midframe | 0.74 ± 0.05 | 5.95 ± 0.98 | 0.59 ± 0.06 |
| Co-Attention STN | 0.74 ± 0.04 | 5.71 ± 0.98 | 0.60 ± 0.05 |
| Co-Attention STN + SegLoss | 0.75 ± 0.03 | 5.74 ± 1.04 | 0.60 ± 0.04 |
| Co-Attention STN + Temporal (1 midframe) | **0.88 ± 0.01** | 3.85 ± 0.81 | **0.78 ± 0.02** |
| Co-Attention STN + Temporal (2 midframes) | 0.85 ± 0.02 | **3.78 ± 0.54** | 0.76 ± 0.01 |
| Co-Attention STN + Temporal (3 midframes) | 0.87 ± 0.01 | 3.80 ± 0.67 | 0.77 ± 0.02 |

Best results per metric are marked in bold.

To visualize where in the image the model focuses its attention, we show a heatmap from our attention map after Phase 1 of the training (Fig. 4). With the baseline Co-Attention STN, the heatmap starts to focus on the myocardial walls at the apex and the base of the left ventricle, but there is still a relative lack of focus in certain regions of the myocardium. When the model is trained with the manual segmentation loss, the attention spreads over the entire image, since the left ventricle is located in the center; however, it picks up regions such as the left ventricle cavity, where the blood pool may add noise to the motion tracking predictions. The Co-Attention STN with the temporal regularizations (1 midframe, 2 midframes, 3 midframes) gives the most interpretable attention map. The intermediate frame(s) between ED and ES help the model establish that the left ventricle myocardium is the main moving object in the image, rather than the turbulent and random blood-pool noise in the left ventricle cavity. Thus, the attention map focuses only on the left ventricle myocardium.

Fig. 4.

Visual comparison of the generated attention maps, represented as heatmaps overlaid on echocardiography images. Red ("hot") regions represent regions of the image the model is focusing its attention on, and blue regions represent less important regions. The model starts to learn the regions of the image where the myocardium is located with the use of Co-Attention STN. Co-Attention STN with temporal regularization best highlights only the myocardium.

The advantage of the proposed Co-Attention STN and the temporal regularization is most evident when calculating the Lagrangian strain in the cardiac coordinate system (Fig. 5). Since strain is a derivative of the motion fields, noisy motion vectors result in noisy strain maps. This is shown in Fig. 5, where all three strains (radial, circumferential, and longitudinal) are smoothed out as we apply the Co-Attention STN with the temporal regularization. The strain maps generated from VoxelMorph and VoxelMorph+SegLoss are noisy, especially in the circumferential and longitudinal directions. With our proposed co-attention mechanism, we start to see homogeneous regions representing the reduction in strain. The Co-Attention STN with temporal regularization shows the best infarct localization, as indicated by the red arrow, where there is an overall decrease in the absolute magnitudes of the strain.
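Because strain is computed from spatial derivatives of the displacement field, a small NumPy sketch makes the dependence on motion-field smoothness concrete. The Green-Lagrangian form and Cartesian coordinates below are our assumptions; the rotation into the cardiac (radial/circumferential/longitudinal) system used for Fig. 5 is omitted.

```python
import numpy as np

def lagrangian_strain(u: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Green-Lagrangian strain tensor from a dense displacement field u of
    shape (3, D, H, W): E = 0.5 (F^T F - I), with deformation gradient
    F = I + dU/dX. Noisy U yields noisy gradients and hence noisy E."""
    # dU_i/dX_j at every voxel -> shape (3, 3, D, H, W).
    grads = np.stack([np.stack(np.gradient(u[i], *spacing), axis=0)
                      for i in range(3)], axis=0)
    eye = np.eye(3)[:, :, None, None, None]
    F = grads + eye                                   # deformation gradient
    # E_ij = 0.5 * (sum_k F_ki F_kj - delta_ij), per voxel.
    return 0.5 * (np.einsum('kiabc,kjabc->ijabc', F, F) - eye)
```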

Fig. 5.

Cardiac strain analysis in the radial, circumferential, and longitudinal directions from the generated motion fields. Red represents positive strain values; blue represents negative strain values. Normal myocardial radial strain is positive, representing inward contraction; normal myocardial circumferential and longitudinal strains are negative. Strain in infarct regions shows decreased magnitude. The red arrow in the Co-Attention STN+Temporal method marks the localized reduced-strain regions in the anteroapical myocardium, corresponding to the left anterior descending artery infarct.

5.1. Comparison to SPECT

The ultimate goal of cardiac motion tracking is to evaluate regional strain changes in the left ventricle. Many unsupervised motion tracking algorithms are limited by the lack of ground-truth displacement fields with which to directly evaluate the accuracy of the predicted motion fields. Therefore, we performed serial 3D echocardiography and 99mTc-tetrofosmin SPECT imaging in a well-established porcine model of transmural anteroseptal myocardial infarction to evaluate the localization of ischemic/infarct regions using multi-modality imaging (Sinusas et al. (1994)). This well-controlled experimental model allows us to image the porcine heart with 3D echocardiography and SPECT at the same time points following injury. 99mTc-tetrofosmin is used extensively in the clinic as a myocardial SPECT perfusion agent due to its relatively short half-life (6 hrs) and higher energy level (140 keV), resulting in improved image quality and allowing localization of infarct based on decreased blood flow in dead tissue (Glover et al. (1997)). SPECT perfusion maps therefore provide an alternative way of defining the infarct regions, which can be used as a plausible reference standard against which to compare the infarct localization from our echocardiography-derived strain maps. Because there are no true ground-truth motion vectors to compare against in in vivo echocardiography scans, we rely on the localization of the infarct from SPECT to evaluate whether our echocardiography-derived strain maps localize to the same infarct area as defined by myocardial perfusion. The comparison is done using the American Heart Association (AHA) 17-segment model of the left ventricle myocardium, which standardizes specific regions to the corresponding coronary artery territories (Cerqueira et al. (2002)).

Fig. 6 shows the polar map distribution of the radial, circumferential, and longitudinal strains calculated from the Co-Attention STN with temporal regularization (1 midframe), with the corresponding SPECT perfusion map. Infarcted myocardium is expected to exhibit wall-motion abnormality, which in turn translates to an absolute reduction in strain magnitude in all three directions. The black arrows point to the same region of the myocardium, corresponding to the territory supplied by the left anterior descending artery (LAD). Our porcine studies involved occlusion of the LAD only, creating an anteroseptal infarct; we therefore expect the region supplied by the LAD to have lower absolute strain magnitudes in the radial, circumferential, and longitudinal directions. The regions localized by the strain maps correspond well with the region of hypoperfusion in the SPECT map, which defines the infarct location, showing that our proposed Co-Attention STN with temporal regularization gives accurate displacement fields that result in well-defined infarct localization when translated to strain analysis.

Fig. 6.

Radial, circumferential, and longitudinal strain maps generated from Co-Attention STN and the corresponding 99mTc-tetrofosmin SPECT perfusion map in a standard AHA 17-segment polar map. The black arrow points to the myocardial territory supplied by the left anterior descending artery, showing absolute reductions in magnitude of strains and hypoperfusion of SPECT map.

6. Discussion

In this work, we proposed a co-attention mechanism in a spatial transformer network to improve the unsupervised motion tracking of the left ventricle from 3D+time echocardiography. Specifically, we developed three key components in our Co-Attention Spatial Transformer Network to improve the full volumetric displacement predictions of the left ventricle. Similar co-attention mechanisms have been proposed before by Lu et al. (2019) to improve video object segmentation, but there has been no work on utilizing co-attention to localize features in the image that improve registration. Other attention mechanisms have also been explored previously, such as in Schlemper et al. (2019) and Hu et al. (2018). Most of these works have been limited to classification or segmentation tasks, where the input to the neural network is usually a single branch; there, it made sense to use attention gates (Schlemper et al. (2019)) and squeeze-and-excitation (Hu et al. (2018)) to boost segmentation performance. However, since our work focuses on the inter-frame relationship between inputs, a direct comparison of our proposed method to mechanisms like attention gates and squeeze-and-excitation is difficult. This is why we focused on a direct comparison to VoxelMorph only, since our co-attention mechanism builds on the idea of two-channel input registration, i.e., a two-input Siamese network for registration.

Although others have proposed diffeomorphic regularizations (Ye et al. (2021); De Craene et al. (2012)) to generate cycle-consistent displacement fields, these approaches are still limited by the frame-to-frame motion tracking framework. Therefore, we proposed adding an intermediate frame to standard frame-to-frame motion tracking to guide the motion fields to pass through the mid-frame on the way to the end-systole frame, an approach that extends naturally to multiple intermediate frames. By regularizing the motion fields based on the intermediate frame instead of predefined regularization functions, we bypass any assumptions about the left ventricle motion, which may vary with changes in pathological state, as seen in serial imaging post-myocardial infarction. In the future, whole-sequence classification to identify infarct regions, or even whole-sequence deformation analysis, will be critical to better understanding full cardiac-cycle strain analysis. However, as previously mentioned, given the high temporal resolution of echocardiography sequences, the input size becomes quite large for repurposing previously proposed whole-sequence analysis frameworks such as the Recurrent Convolutional Neural Network (RCNN) (Liang and Hu (2015)); this is especially difficult for 3D+time echocardiography sequences. This is largely why we approached the temporal analysis of echocardiography images by utilizing intermediate frames, or midframes, in the sequence. Interestingly, when we incorporated multiple intermediate frames, our findings showed negligible performance improvement beyond the addition of a single midframe, as seen in Tables 1 and 2. This is likely because frames too close to either ED or ES show so little change in deformation that the neural network does not pick up the deformation field. This agrees with our finding that our proposed method is especially useful for bulk motion fields rather than fine motion fields, such as those computed by RF speckle tracking (Jeng et al. (2018)).

Given the lack of ground-truth motion fields in 3D echocardiography, it is difficult to evaluate the accuracy of displacement fields directly. Therefore, we utilized the synthetic dataset developed by Alessandrini et al. (2015) and augmented the number of frame pairs by including combinations within the cardiac cycle beyond end-diastole and end-systole. Our proposed Co-Attention STN, with and without the temporal regularization, shows improvement over VoxelMorph with and without the manual segmentation regularization. However, we see less improvement from the temporal regularization in the median tracking error and cosine similarity. This is likely because the synthetic dataset has randomly generated noise in the background outside the myocardium; this random noise inhibits our temporal regularization from picking up consistent inter-frame dependent features, resulting in less improvement overall. Nevertheless, the motion fields produced by our proposed methods still consistently achieve the best performance despite this limitation of the synthetic dataset. The performance improvement is most evident on the real sequential echocardiography scans acquired in the porcine myocardial infarction models, as seen in Table 2.

There was significant improvement when comparing results from VoxelMorph and our proposed Co-Attention STN with temporal regularization on the in vivo porcine dataset (DICE: 0.73 vs. 0.88; HD: 6.08 vs. 3.85; Jaccard Index: 0.58 vs. 0.78). This is also visually noticeable in Fig. 5, which shows that VoxelMorph with and without segmentation loss produced noisy displacement fields, making it difficult to localize areas of wall-motion abnormality. Similar to others, we evaluated our motion fields with segmentation-based metrics like Hausdorff distance, Dice Similarity Coefficient, and Jaccard Index. However, segmentation-based metrics have limited ability to evaluate the motion fields within the myocardium, since the segmentation does not move beyond the boundaries. Therefore, our motion-derived strain maps were compared with matched SPECT perfusion maps that act as an independent "gold" standard for infarct localization. As seen in Fig. 6, the motion field-derived strain maps predicted by our Co-Attention STN with temporal regularization show a well-defined infarct region, represented by the absolute reductions in strain magnitude in all three directions. The localization from our echocardiography-derived strain maps also matched well with the hypoperfusion region defined by the 99mTc-tetrofosmin SPECT polar maps of perfusion/viability. Our echocardiography-derived strain values were also within the reference ranges of CT-based analysis (Midgett et al. (2022)). Compared to the 99mTc-tetrofosmin SPECT perfusion/viability maps, our strain maps yielded wider regions of myocardial dysfunction. These differences likely relate to tethering of normally perfused myocardium by adjacent infarcted tissue in the well-known dysfunctional border zone. Also, while SPECT perfusion maps serve as a plausible reference standard for this work, SPECT blood flow and echocardiography-derived mechanical strain provide complementary measures in the assessment of myocardial infarct. Thus, a full volumetric strain analysis may give additional insight into the mechanical changes in the left ventricle that is complementary to the SPECT perfusion maps, particularly when applied in the setting of pharmacological stress. An example of full cardiac-cycle strain analysis using our Co-Attention STN with temporal regularization (1 midframe) is shown in Fig. 7, tracking the strain changes in the infarct zone vs. border zone vs. remote zone. The full cardiac-cycle strain analysis was computed using multiple permutations of a single midframe between ED and ES and interpolating between the time points to reconstruct the cardiac cycle.

Fig. 7.

Radial, circumferential, and longitudinal strain changes across the full cardiac cycle in infarct region, border region, and remote region.

The presented work also has potential limitations. First, the quality of the simulated echocardiography images affects network performance, since the regions outside the myocardium in the synthetic dataset are randomly generated. Second, it is clinically infeasible to acquire multimodal cardiac imaging of patients at the same time point: although patients with a history of cardiovascular disease typically undergo multiple imaging studies, these are recommended at different time points. Therefore, we relied on porcine myocardial infarction studies to acquire multimodal images at controlled time points to evaluate the accuracy of our motion tracking algorithm in 3D echocardiography. In future studies, we will investigate our Co-Attention STN's performance on real patient images by acquiring a well-curated 3D echocardiography dataset from patients who undergo nuclear imaging within an approximately similar time frame.

7. Conclusion

We present a novel addition of a co-attention mechanism to a classical Spatial Transformer Network, called the Co-Attention Spatial Transformer Network (Co-Attention STN). The Co-Attention STN extracts inter-frame dependent features between end-diastole and end-systole frames to improve motion tracking in otherwise noisy 3D echocardiographic image sets. The addition of our novel temporal constraint further regularizes the motion field to produce smooth and realistic cardiac motion. Experimental results demonstrate that our Co-Attention STN provides superior performance compared to existing methods. Strain analysis from the Co-Attention STN also corresponds well with the matched SPECT perfusion/viability maps, demonstrating the clinical utility of using 3D echocardiography for infarct localization.

Highlights.

  • We present a novel co-attention spatial transformer network for unsupervised motion tracking of left ventricle in 3D echocardiography.

  • The co-attention mechanism enables better feature extraction, leading to smoother motion fields and improved interpretability.

  • We also present a temporal regularization term to further guide the motion of the left ventricle.

  • 3D cardiac strain analysis was performed using the motion field output and compared against 99mTc-tetrofosmin single photon emission computed tomography (SPECT) perfusion/viability maps.

  • Our 3D echocardiography-derived strain maps provide a reliable method to localize infarct and quantify regional changes in myocardial strain after ischemic injury.

Acknowledgments

The authors are immensely thankful to the past and present members of the Yale Translational Research Imaging Center (YTRIC) who were involved in the image acquisition process. This research was made possible through the utilization of the Yale Translational Research Imaging Center. This work was supported in part by the NIH grants R01HL121226, R01HL137365, T32HL098069, S10RR02555, 1S10OD028738-01A1, F30HL158154 and NIH Medical Scientist Training Program Grant T32GM007205.


Declaration of interests

James S. Duncan is one of the Editors-in-Chief for Medical Image Analysis.

Credit authorship contribution statement

Shawn S. Ahn: Conceptualization, Methodology, Data Acquisition, Software, Visualization, Validation, Formal analysis, Writing original draft. Kevinminh Ta: Conceptualization, Writing - review and editing. Stephanie L. Thorn: Data Acquisition, Writing - review and editing. John A. Onofrey: Conceptualization, Writing - review and editing. Inga H. Melvinsdottir: Data Acquisition, Writing - review and editing. Supum Lee: Data Acquisition, Writing - review and editing. Jonathan Langdon: Writing - review and editing. Albert J. Sinusas: Conceptualization, Writing - review and editing, Supervision. James S. Duncan: Conceptualization, Writing - review and editing, Supervision.

References

  1. Ahn SS, Ta K, Lu A, Stendahl JC, Sinusas AJ, Duncan JS, 2020. Unsupervised motion tracking of left ventricle in echocardiography, in: Medical Imaging 2020: Ultrasonic Imaging and Tomography, International Society for Optics and Photonics. p. 113190Z. [PMC free article] [PubMed] [Google Scholar]
  2. Ahn SS, Ta K, Thorn S, Langdon J, Sinusas AJ, Duncan JS, 2021. Multi-frame attention network for left ventricle segmentation in 3d echocardiography, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 348–357. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Alessandrini M, De Craene M, Bernard O, Giffard-Roisin S, Allain P, Waechter-Stehle I, Weese J, Saloux E, Delingette H, Sermesant M, et al. , 2015. A pipeline for the generation of realistic 3d synthetic echocardiographic sequences: Methodology and open-access database. IEEE transactions on medical imaging 34, 1436–1451. [DOI] [PubMed] [Google Scholar]
  4. Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV, 2019. Voxelmorph: a learning framework for deformable medical image registration. IEEE transactions on medical imaging 38, 1788–1800. [DOI] [PubMed] [Google Scholar]
  5. Blankstein R, Waller AH, 2016. Evaluation of known or suspected cardiac sarcoidosis. Circulation: Cardiovascular Imaging 9, e000867. [DOI] [PubMed] [Google Scholar]
  6. Chen C, et al. , 2020. Deep learning for cardiac image segmentation: A review. Frontiers in Cardiovascular Medicine 7, 25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Chen LC, Papandreou G, Schroff F, Adam H, 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 [Google Scholar]
  8. Chen X, Xie H, Erkamp R, Kim K, Jia C, Rubin J, O’Donnell M, 2005. 3-d correlation-based speckle tracking. Ultrasonic Imaging 27, 21–36. [DOI] [PubMed] [Google Scholar]
  9. Dai X, Lei Y, Roper J, Chen Y, Bradley JD, Curran WJ, Liu T, Yang X, 2021. Deep learning-based motion tracking using ultrasound images. Medical Physics 48, 7747–7756. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Dalca AV, Guttag J, Sabuncu MR, 2018. Anatomical priors in convolutional networks for unsupervised biomedical segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9290–9299. [Google Scholar]
  11. De Craene M, Piella G, Camara O, Duchateau N, Silva E, Doltra A, Dhooge J, Brugada J, Sitges M, Frangi AF, 2012. Temporal diffeomorphic free-form deformation: Application to motion and strain estimation from 3d echocardiography. Medical image analysis 16, 427–450. [DOI] [PubMed] [Google Scholar]
  12. Glover DK, Ruiz M, Yang JY, Smith WH, Watson DD, Beller GA, 1997. Myocardial 99mtc-tetrofosmin uptake during adenosine-induced vasodilatation with either a critical or mild coronary stenosis: comparison with 201tl and regional myocardial blood flow. Circulation 96, 2332–2338. [DOI] [PubMed] [Google Scholar]
  13. Horn BK, Schunck BG, 1981. Determining optical flow. Artificial intelligence 17, 185–203. [Google Scholar]
  14. Hu J, Shen L, Sun G, 2018. Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. [Google Scholar]
  15. Huang X, Dione DP, Compas CB, Papademetris X, Lin BA, Bregasi A, Sinusas AJ, Staib LH, Duncan JS, 2014. Contour tracking in echocardiographic sequences via sparse representation and dictionary learning. Medical image analysis 18, 253–271. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Ichinose A, Otani H, Oikawa M, Takase K, Saito H, Shimokawa H, Takahashi S, 2008. Mri of cardiac sarcoidosis: basal and subepicardial localization of myocardial lesions and their effect on left ventricular function. American Journal of Roentgenology 191, 862–869. [DOI] [PubMed] [Google Scholar]
  17. Jaderberg M, Simonyan K, Zisserman A, et al. , 2015. Spatial transformer networks. Advances in neural information processing systems 28, 2017–2025. [Google Scholar]
  18. Jeng GS, Zontak M, Parajuli N, Lu A, Ta K, Sinusas AJ, Duncan JS, ODonnell M, 2018. Efficient two-pass 3-d speckle tracking for ultrasound imaging. IEEE Access 6, 17415–17428. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Jia C, Yan P, Sinusas AJ, Dione DP, Lin BA, Wei Q, Thiele K, Kolias TJ, Rubin JM, Huang L, et al. , 2010. 3d elasticity imaging using principal stretches on an open-chest dog heart, in: Ultrasonics Symposium (IUS), 2010 IEEE, IEEE. pp. 583–586. [Google Scholar]
  20. Kalam K, Otahal P, Marwick TH, 2014. Prognostic implications of global lv dysfunction: a systematic review and meta-analysis of global longitudinal strain and ejection fraction. Heart 100, 1673–1680. [DOI] [PubMed] [Google Scholar]
  21. Liang M, Hu X, 2015. Recurrent convolutional neural network for object recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3367–3375. [Google Scholar]
  22. Lin N, Duncan JS, 2004. Generalized robust point matching using an extended free-form deformation model: application to cardiac images, in: Biomedical Imaging: Nano to Macro, 2004. IEEE International Symposium on, IEEE. pp. 320–323. [Google Scholar]
  23. Liu F, Wang K, Liu D, Yang X, Tian J, 2021. Deep pyramid local attention neural network for cardiac structure segmentation in two-dimensional echocardiography. Medical Image Analysis 67, 101873. [DOI] [PubMed] [Google Scholar]
  24. Lu A, Ahn SS, Ta K, Parajuli N, Stendahl JC, Liu Z, Boutagy NE, Jeng GS, Staib LH, O'Donnell M, et al., 2021. Learning-based regularization for cardiac strain analysis via domain adaptation. IEEE Transactions on Medical Imaging.
  25. Lu X, Wang W, Ma C, Shen J, Shao L, Porikli F, 2019. See more, know more: Unsupervised video object segmentation with co-attention siamese networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3623–3632.
  26. Lubinski MA, Emelianov SY, O'Donnell M, 1999. Speckle tracking methods for ultrasonic elasticity imaging using short-time correlation. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 46, 82–96.
  27. Lucas BD, Kanade T, 1981. An iterative image registration technique with an application to stereo vision, in: Proceedings of the 7th International Joint Conference on Artificial Intelligence, pp. 674–679.
  28. Midgett D, Thorn S, Ahn S, Uman S, Avendano R, Melvinsdottir I, Lysyy T, Kim J, Duncan J, Humphrey J, et al., 2022. CineCT platform for in vivo and ex vivo measurement of 3D high resolution Lagrangian strains in the left ventricle following myocardial infarction and intramyocardial delivery of theranostic hydrogel. Journal of Molecular and Cellular Cardiology 166, 74–90.
  29. Mondillo S, Galderisi M, Mele D, Cameli M, Lomoriello VS, Zacà V, Ballo P, D'Andrea A, Muraru D, Losi M, et al., 2011. Speckle-tracking echocardiography: a new technique for assessing myocardial function. Journal of Ultrasound in Medicine 30, 71–83.
  30. American Heart Association Writing Group on Myocardial Segmentation and Registration for Cardiac Imaging: Cerqueira MD, Weissman NJ, Dilsizian V, Jacobs AK, Kaul S, Laskey WK, Pennell DJ, Rumberger JA, Ryan T, et al., 2002. Standardized myocardial segmentation and nomenclature for tomographic imaging of the heart: a statement for healthcare professionals from the Cardiac Imaging Committee of the Council on Clinical Cardiology of the American Heart Association. Circulation 105, 539–542.
  31. Papademetris X, Sinusas AJ, Dione DP, Constable RT, Duncan JS, 2002. Estimation of 3-D left ventricular deformation from medical images using biomechanical models. IEEE Transactions on Medical Imaging 21, 786–800.
  32. Parajuli N, Lu A, Stendahl JC, Zontak M, Boutagy N, Alkhalil I, Eberle M, Lin BA, O'Donnell M, Sinusas AJ, et al., 2017. Flow network based cardiac motion tracking leveraging learned feature matching, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 279–286.
  33. Parajuli N, Lu A, Stendahl JC, Zontak M, Boutagy N, Eberle M, Alkhalil I, O'Donnell M, Sinusas AJ, Duncan JS, 2016. Integrated dynamic shape tracking and RF speckle tracking for cardiac motion analysis, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 431–438.
  34. Parajuli N, Lu A, Ta K, Stendahl J, Boutagy N, Alkhalil I, Eberle M, Jeng GS, Zontak M, O'Donnell M, et al., 2019. Flow network tracking for spatiotemporal and periodic point matching: Applied to cardiac motion analysis. Medical Image Analysis 55, 116–135.
  35. Reisner SA, Lysyansky P, Agmon Y, Mutlak D, Lessick J, Friedman Z, 2004. Global longitudinal strain: a novel index of left ventricular systolic function. Journal of the American Society of Echocardiography 17, 630–633.
  36. Rohé MM, Datar M, Heimann T, Sermesant M, Pennec X, 2017. SVF-Net: Learning deformable image registration using shape matching, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 266–274.
  37. Ronneberger O, Fischer P, Brox T, 2015. U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 234–241.
  38. Schlemper J, Oktay O, Schaap M, Heinrich M, Kainz B, Glocker B, Rueckert D, 2019. Attention gated networks: Learning to leverage salient regions in medical images. Medical Image Analysis 53, 197–207.
  39. Shi P, Sinusas AJ, Constable RT, Ritman E, Duncan JS, 2000. Point-tracked quantitative analysis of left ventricular surface motion from 3-D image sequences. IEEE Transactions on Medical Imaging 19, 36–50.
  40. Sinusas AJ, Shi Q, Saltzberg MT, Vitols P, Jain D, Wackers FJT, Zaret BL, et al., 1994. Technetium-99m-tetrofosmin to assess myocardial blood flow: experimental validation in an intact canine model of ischemia. Journal of Nuclear Medicine 35, 664–671.
  41. Ta K, Ahn SS, Stendahl JC, Sinusas AJ, Duncan JS, 2020. A semi-supervised joint network for simultaneous left ventricular motion tracking and segmentation in 4D echocardiography, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 468–477.
  42. Virani SS, Alonso A, Aparicio HJ, Benjamin EJ, Bittencourt MS, Callaway CW, Carson AP, Chamberlain AM, Cheng S, Delling FN, et al., 2021. Heart disease and stroke statistics–2021 update: a report from the American Heart Association. Circulation 143, e254–e743.
  43. Wu J, Mazur TR, Ruan S, Lian C, Daniel N, Lashmett H, Ochoa L, Zoberi I, Anastasio MA, Gach HM, et al., 2018. A deep Boltzmann machine-driven level set method for heart motion tracking using cine MRI images. Medical Image Analysis 47, 68–80.
  44. Wu L, et al., 2020. Deep coattention-based comparator for relative representation learning in person re-identification. IEEE Transactions on Neural Networks and Learning Systems.
  45. Ye M, Kanski M, Yang D, Chang Q, Yan Z, Huang Q, Axel L, Metaxas D, 2021. DeepTag: An unsupervised deep learning method for motion tracking on cardiac tagging magnetic resonance images, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7261–7271.
  46. Yu H, Chen X, Shi H, Chen T, Huang TS, Sun S, 2020a. Motion pyramid networks for accurate and efficient cardiac motion estimation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 436–446.
  47. Yu H, Sun S, Yu H, Chen X, Shi H, Huang TS, Chen T, 2020b. FOAL: Fast online adaptive learning for cardiac motion estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4313–4323.
