Fetal Pose Estimation in Volumetric MRI using a 3D Convolution Neural Network

Junshen Xu; Molin Zhang; Esra Abaci Turk; Larry Zhang; Ellen Grant; Kui Ying; Polina Golland; Elfar Adalsteinsson

doi:10.1007/978-3-030-32251-9_44

Author manuscript; available in PMC: 2020 Oct 1.

Published in final edited form as: Med Image Comput Comput Assist Interv. 2019 Oct 10;11767:403–410. doi: 10.1007/978-3-030-32251-9_44

PMCID: PMC7267040 NIHMSID: NIHMS1586283 PMID: 32494782

Abstract

The performance and diagnostic utility of magnetic resonance imaging (MRI) in pregnancy are fundamentally constrained by fetal motion. Motion of the fetus, which is unpredictable and rapid on the scale of conventional imaging times, limits the set of viable acquisition techniques to single-shot imaging with severe compromises in signal-to-noise ratio and diagnostic contrast, and frequently results in unacceptable image quality. Surprisingly little is known about the characteristics of fetal motion during MRI. Here we propose and demonstrate methods that exploit a growing repository of MRI observations of the gravid abdomen, acquired at low spatial resolution but relatively high temporal resolution and over long durations (10–30 minutes). We estimate fetal pose per frame in MRI volumes of the pregnant abdomen via deep learning algorithms that detect key fetal landmarks. Evaluation shows that our framework achieves an average error of 4.47 mm and 96.4% accuracy (with error less than 10 mm). Fetal pose estimation in MRI time series yields novel means of quantifying fetal movements in health and disease, and enables the learning of kinematic models that may enhance prospective mitigation of fetal motion artifacts during MRI acquisition.

Keywords: Pose estimation, Fetal magnetic resonance imaging (MRI), Deep learning, Convolutional neural network (CNN)

1. Introduction

Estimation of fetal pose from volumetric MRI in pregnancy has applications that include motion tracking and prospective artifact mitigation during diagnostic imaging, retrospective analysis and evaluation of movement by the fetus, as well as the establishment of kinematic models of fetal movement during MRI. Prior work in fetal motion includes methods that rely on simple indices for fetal motion analysis and quantification, such as the angle of the fetal body axes with respect to the maternal body [1] and maternal perception of fetal movements [2].

Although pose estimation for the human (adult) body is an established domain in computer vision [3], to the best of our knowledge, no work has demonstrated fetal pose estimation over time in medical images by MRI. In contrast to human pose estimation from 2D photography, in fetal pose estimation we need to predict 3D pose from dense volumetric data, which increases the computational burden. Further complicating the task is the variable orientation of the fetus within the mother, rapid growth and change in fetal features over gestational age, and poor-quality observations of ground truth pose.

In pose estimation, methods built on handcrafted features, such as graphical models and tree-based methods, typically suffer from low accuracy and low processing speed, while recent developments in deep learning have demonstrated great success in computer vision, aided by GPU acceleration and the capability to learn high-level features from data. Consequently, deep convolutional neural networks have also found their way into human pose estimation and achieved state-of-the-art results.

In an ongoing study of placental function by EPI BOLD imaging time series (see Figure 1 (a)), we have built an archive of over 70 subjects, each with 200–500 time frames of EPI volumes, imaged continuously over 10–30 minute observation intervals and resulting in over 18,000 EPI volumes. By visual inspection, the fetal pose can be inferred from these data, but manual labeling of keypoints for pose estimation (see Figure 1 (b)) across all of these volumes is prohibitive. Here we propose a method based on deep neural networks to identify fetal keypoints.

Fig. 1. — (a) A representative slice from one MRI volume used in this study, and (b) an example of the associated 15 keypoints that characterize fetal pose in three dimensions at a single 3.5-second time frame extracted from a 30-minute observation of the fetus by MRI.

We propose, demonstrate, and characterize the performance of a two-stage framework for fetal pose estimation in 3D MRI using deep learning, where we first generate heatmaps for each fetal keypoint using a convolutional network and then infer fetal pose from the heatmaps using a Markov Random Field (MRF) that exploits anatomically plausible information about connections between keypoints. Evaluation of performance shows that the proposed method achieves a mean error of 4.47 mm and a percentage of correct detection of 96.4%. Further, the computation time of our pipeline is less than 1 s/volume, which potentially enables low-latency tracking of fetal pose during diagnostic MRI in pregnancy.

2. Methods

2.1. Pose Estimation Framework

Building on the idea of heatmap prediction in human pose estimation [3], we propose a two-stage framework for fetal pose estimation in 3D MRI using deep learning (see Fig. 2). In the first stage, a CNN generates heatmaps from the input MR volume, which give per-voxel likelihoods for keypoints on the fetal skeleton. However, a generated heatmap may have multiple local maxima, and simply using the max activating location as the prediction may lead to low accuracy.

Fig. 2. — The framework of fetal pose estimation in 3D MRI which consists of two stages. Stage 1: generate 3D heatmaps of each keypoint from the input MR volume. Stage 2: estimate keypoint locations from heatmaps.

To address this problem, a second stage infers keypoint locations from the estimated heatmaps, exploiting the constraints of fetal pose to refine the results. We model the fetal pose as an MRF, where each keypoint of the fetus is represented by a node in the graph and the states of a node are the plausible locations of that keypoint. The final prediction is generated by performing inference on this MRF.

The following subsections describe the proposed framework in detail.

2.2. Heatmap Prediction using CNN

Inspired by the successful application of hourglass networks in human pose estimation [3], we propose a 3D hourglass network for heatmap prediction of fetal keypoints. The overall architecture of the proposed network is shown in Fig. 3. The network is based on an encoder-decoder structure, motivated by the idea of capturing multi-scale information. In pose estimation, local evidence, e.g., local contrast, is important for identifying a keypoint, while global information, such as the orientation of the fetus and the relative position of other joints or body parts, can help resolve ambiguity. At each scale of the network, resblocks with 3D convolution layers are used to extract features. To recover high-resolution information lost in the downscale-upscale structure, skip connections with element-wise addition connect symmetric scales.

Fig. 3. — Left: architecture of 3D hourglass network for heatmap prediction. Right: structure of resblock.

The CNN learns a mapping from MR images to target heatmaps, which are generated by placing a Gaussian distribution with σ = 2 at each ground-truth position and stacking the results. The output heatmaps therefore have the same spatial dimensions as the input but J channels, where J is the number of keypoints to predict. The loss function used for training is the mean-squared error (MSE) between the predicted and target heatmaps. Instead of the whole volume, 3D patches of size 64×64×64 are used as input for training. This strategy reduces GPU memory usage, enabling mini-batch training. Since the network is fully convolutional, at inference time whole 3D MR volumes are fed into the network to generate full-size heatmaps.
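As an illustration, the construction of the training targets and the loss can be sketched in plain Python. The sketch assumes that σ is expressed in voxels and that each Gaussian is unnormalized with a peak value of 1; neither detail is stated in the text, and the function names are hypothetical.

```python
import math

def gaussian_heatmap(shape, center, sigma=2.0):
    """Nested-list 3D heatmap with an unnormalized Gaussian (peak 1) at `center`."""
    size_x, size_y, size_z = shape
    cx, cy, cz = center
    return [[[math.exp(-((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
                       / (2.0 * sigma ** 2))
              for z in range(size_z)]
             for y in range(size_y)]
            for x in range(size_x)]

def target_heatmaps(shape, keypoints, sigma=2.0):
    """Stack one heatmap per keypoint; the result is indexed [j][x][y][z]."""
    return [gaussian_heatmap(shape, kp, sigma) for kp in keypoints]

def mse(pred, target):
    """Mean-squared error between two heatmap stacks of identical shape."""
    total, count = 0.0, 0
    for heat_p, heat_t in zip(pred, target):
        for plane_p, plane_t in zip(heat_p, heat_t):
            for row_p, row_t in zip(plane_p, plane_t):
                for p, t in zip(row_p, row_t):
                    total += (p - t) ** 2
                    count += 1
    return total / count
```

A practical implementation would use array operations on 64×64×64 patches rather than nested lists; the loops are kept here only to make each step explicit.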

2.3. Location Estimation from Heatmap

Given the output heatmaps from the CNN, the second stage of the pose estimation framework estimates the location of each keypoint. Let x_i and H_i be the location and heatmap of the i-th keypoint, i = 1, …, J, and let x = (x₁, …, x_J). One simple way to infer keypoint positions from heatmaps is to take the max activating location of each heatmap. However, this method handles each keypoint independently and does not make use of the connections between keypoints, e.g., the distance between two joints should be constant if they are connected by a bone. To exploit these connections, we model the fetal pose as an MRF, where each keypoint corresponds to a node in the graph and connections between keypoints are represented as edges. The states $S_{i} = \{x_{i}^{(1)}, \dots, x_{i}^{(L)}\}$ of node i are the top-L local maxima in heatmap i. Our prediction of the fetal pose is a particular configuration of the MRF, i.e., $\hat{x} \in S_{1} \times \dots \times S_{J}$. Each configuration is assigned an energy, E(x), defined as

$$E(x) = \sum_{i=1}^{J} \varphi_{i}(x_{i}) + \sum_{(i,j) \in B} \phi_{i,j}(x_{i}, x_{j}),$$

where B is the set of connections. A low energy of a configuration implies high probability. Therefore, the inference is equivalent to finding the configuration with lowest energy.
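To make the inference problem concrete, the following sketch evaluates E(x) for a configuration and finds the minimizer by exhaustive enumeration of S₁ × … × S_J. It is a reference implementation for small problems only, not the solver used in this work: the number of configurations is L^J, about 1.4 × 10⁷ for L = 3 and J = 15. The callables `unary_cost` and `pairwise_cost` are hypothetical placeholders for the two energy terms.

```python
from itertools import product

def energy(config, unary_cost, pairwise_cost, edges):
    """E(x): sum of unary costs over nodes plus pairwise costs over edges in B."""
    total = sum(unary_cost(i, x_i) for i, x_i in enumerate(config))
    total += sum(pairwise_cost(i, j, config[i], config[j]) for i, j in edges)
    return total

def argmin_energy(state_sets, unary_cost, pairwise_cost, edges):
    """Enumerate every configuration and return the one with the lowest energy."""
    return min(product(*state_sets),
               key=lambda cfg: energy(cfg, unary_cost, pairwise_cost, edges))
```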

Since the heatmap can be considered a surrogate for the probability distribution of the corresponding keypoint, the unary term in the energy function E can be modeled as

$$\varphi_{i}(x_{i}) = -\log H_{i}(x_{i}).$$
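The candidate states of a node and their unary energies could be obtained from a heatmap roughly as follows. This is a minimal sketch under stated assumptions: a local maximum is taken to be a voxel strictly larger than all of its (up to 26) neighbors, and a small `eps` guards against taking the logarithm of a non-positive value. Both choices, and the function name, are illustrative rather than taken from the text.

```python
import math
from itertools import product

def candidate_states(heatmap, num_candidates=3, eps=1e-12):
    """Return up to `num_candidates` (location, unary_energy) pairs, strongest first.

    heatmap is a nested list indexed [x][y][z]; unary_energy is -log of the
    heatmap value at the candidate location.
    """
    size_x, size_y, size_z = len(heatmap), len(heatmap[0]), len(heatmap[0][0])
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    peaks = []
    for x, y, z in product(range(size_x), range(size_y), range(size_z)):
        value = heatmap[x][y][z]
        is_peak = True
        for ox, oy, oz in offsets:
            nx, ny, nz = x + ox, y + oy, z + oz
            inside = 0 <= nx < size_x and 0 <= ny < size_y and 0 <= nz < size_z
            if inside and heatmap[nx][ny][nz] >= value:
                is_peak = False  # a neighbor is at least as large
                break
        if is_peak:
            peaks.append((value, (x, y, z)))
    peaks.sort(key=lambda peak: peak[0], reverse=True)
    return [(loc, -math.log(max(value, eps))) for value, loc in peaks[:num_candidates]]
```

With the strict comparison, flat regions (for example an all-zero background) contribute no candidates, so fewer than L states may be returned for a very peaked heatmap.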

As for the pairwise term, we define $\phi_{i,j}$ as a quadratic function of $\| x_{i} - x_{j} \|_{2}$, the distance between keypoints i and j:

$$\phi_{i,j}(x_{i}, x_{j}) = \frac{\alpha \left( \| x_{i} - x_{j} \|_{2} / r_{t} - \mu_{ij} \right)^{2}}{\sigma_{ij}^{2}},$$

where r_t is the mean bone length at gestational age t, so that $\| x_{i} - x_{j} \|_{2} / r_{t}$ can be regarded as the distance between the two keypoints normalized by gestational age, $\mu_{ij}$ and $\sigma_{ij}^{2}$ are the mean and variance of the normalized distance, estimated from training data, and α is the regularization weight. The pairwise term is non-negative and grows as the normalized length of a connection deviates from its mean, so anatomically implausible configurations receive higher energy. The optimization problem is solved by a belief propagation algorithm [4].
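The inference step can be illustrated with a small min-sum message-passing routine. This sketch is not the solver from [4]. It assumes that the connection set B forms a tree, in which case min-sum belief propagation is exact, and it takes the pairwise term as a non-negative quadratic penalty so that deviations from the mean normalized distance raise the energy. All function and argument names are hypothetical.

```python
import math

def pairwise(x_i, x_j, r_t, mu, var, alpha=1.0):
    """Quadratic penalty on the age-normalized distance between two keypoints."""
    normalized = math.dist(x_i, x_j) / r_t
    return alpha * (normalized - mu) ** 2 / var

def min_sum_tree(candidates, unary_costs, edges, stats, r_t, alpha=1.0, root=0):
    """Minimize E(x) on a tree by passing min-sum messages from the leaves to the root.

    candidates[i]:  candidate locations of keypoint i
    unary_costs[i]: unary energy of each candidate of keypoint i
    edges:          (i, j) pairs forming a tree over the keypoints
    stats[(i, j)]:  (mu_ij, sigma_ij^2) of the normalized distance
    Returns the selected locations and the minimum energy.
    """
    neighbors = {i: [] for i in range(len(candidates))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def edge_cost(i, a, j, b):
        mu, var = stats[(i, j)] if (i, j) in stats else stats[(j, i)]
        return pairwise(candidates[i][a], candidates[j][b], r_t, mu, var, alpha)

    best_child_state = {}  # (child, parent state) -> best state of the child

    def collect(node, parent):
        """cost[a]: minimum energy of the subtree below `node` given its state a."""
        cost = list(unary_costs[node])
        for child in neighbors[node]:
            if child == parent:
                continue
            child_cost = collect(child, node)
            for a in range(len(cost)):
                options = [child_cost[b] + edge_cost(node, a, child, b)
                           for b in range(len(child_cost))]
                b_best = min(range(len(options)), key=options.__getitem__)
                best_child_state[(child, a)] = b_best
                cost[a] += options[b_best]
        return cost

    root_cost = collect(root, None)
    states = {root: min(range(len(root_cost)), key=root_cost.__getitem__)}

    def backtrack(node, parent):
        for child in neighbors[node]:
            if child != parent:
                states[child] = best_child_state[(child, states[node])]
                backtrack(child, node)

    backtrack(root, None)
    pose = [candidates[i][states[i]] for i in range(len(candidates))]
    return pose, root_cost[states[root]]
```

In a three-node chain where the middle keypoint has a strong but distant candidate and a weaker candidate at the expected bone length, this routine selects the weaker candidate, which is the kind of correction the MRF stage is meant to provide.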

3. Experiments and Results

3.1. Dataset

The data for this study consist of volumetric MRI time series from imaging of 70 mothers pregnant with singletons at gestational ages ranging from 25 to 35 weeks. MRIs were acquired on a 3T Skyra scanner (Siemens Healthcare, Erlangen, Germany). A multislice, single-shot, gradient-echo EPI sequence was used for acquisition, with in-plane resolution of 3 × 3 mm², slice thickness of 3 mm, mean matrix size of 120 × 120 × 80, TR = 5–8 s, TE = 32–38 ms, and FA = 90°. Each subject was scanned for 10 to 30 min.

Similar to adult human pose estimation, we model the pose of a fetus with a set of keypoints. We chose fifteen keypoints (ankles, knees, hips, bladder, shoulders, elbows, wrists and eyes) to capture pose and labeled them manually; a representative example is shown in Fig. 1(b). These fifteen landmarks were selected because they capture gross fetal anatomy that is critical in subsequent motion analysis and because they present with adequate image contrast to be observed relatively robustly in the MR volumes, which mitigates error and noise in labeling. In total, 1705 MR volumes were labeled: 1028 (~60%) for training, 240 (~15%) for validation and 437 (~25%) for testing, where the testing set consists of subjects different from those in the training and validation sets.

In order to improve the generalization capacity and avoid overfitting, several data augmentation techniques were used, including intensity scaling, 3D rotation and flipping.
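Two of these augmentations can be sketched as below; 3D rotation is omitted for brevity. The point of the flipping example is that keypoint coordinates must be transformed together with the volume. Exchanging left and right labels after mirroring is an assumption made here (a mirrored fetus has its anatomical sides swapped) and is not described in the text; the function names are hypothetical.

```python
def flip_first_axis(volume, keypoints, swap_pairs=()):
    """Mirror a nested-list volume[x][y][z] along x and move the keypoints with it.

    swap_pairs lists (left, right) keypoint indices whose labels are exchanged
    after mirroring; leave it empty to keep the labels unchanged.
    """
    size_x = len(volume)
    flipped = volume[::-1]  # reverses the order of the x-planes
    moved = [(size_x - 1 - x, y, z) for x, y, z in keypoints]
    for left, right in swap_pairs:
        moved[left], moved[right] = moved[right], moved[left]
    return flipped, moved

def scale_intensity(volume, factor):
    """Multiply every voxel by a constant factor; keypoints are unaffected."""
    return [[[value * factor for value in row] for row in plane] for plane in volume]
```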

3.2. Experimental Setup

All experiments were performed on a server with an Intel Xeon E5-1650 CPU, 128 GB RAM and an NVIDIA TITAN X GPU. Neural networks were implemented with TensorFlow. For optimization we used Adam with an initial learning rate of 5×10⁻³, weight decay of 1×10⁻⁴ and the restart strategy [5]. The networks were trained for 200 epochs. For the second stage, we set L = 3 and α = 1.
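A learning-rate schedule with restarts, of the cosine-annealing kind used in [5], might look as follows. Only the initial learning rate of 5×10⁻³ comes from the setup above; the minimum rate, the length of the first period and the period multiplier are hypothetical values chosen for illustration, since the text does not report them.

```python
import math

def lr_with_restarts(epoch, lr_max=5e-3, lr_min=0.0, first_period=10, period_mult=2):
    """Cosine-annealed learning rate that resets to lr_max at the start of each period."""
    period, start = first_period, 0
    while epoch >= start + period:  # locate the period that contains this epoch
        start += period
        period *= period_mult
    progress = (epoch - start) / period
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

With these illustrative settings the rate restarts at epochs 0, 10, 30, 70 and 150 within a 200-epoch run.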

3.3. Results

In this section, we evaluate the proposed pipeline for fetal pose estimation. First, we evaluate the proposed 3D hourglass network (HG) with the max activating location of each heatmap as the final prediction. For comparison, we also use 3D UNet [6], which has previously been used for heatmap regression [7]. Finally, we examine the whole pipeline by combining CNN-based heatmap regression with the MRF; these models are denoted UNet-M and HG-M, respectively.

Several metrics are used for evaluation: a) Percentage of Correct Keypoints (PCK), where a detected keypoint is considered correct if the distance between the predicted and the true keypoint is within a certain threshold; b) mean error (in mm), i.e., the mean distance between the predicted and the ground-truth keypoints; and c) median error.
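These metrics can be computed as in the sketch below. It assumes that keypoints are given in voxel coordinates and converted to millimeters with the isotropic 3 mm voxel size of the acquisition, and that "within a threshold" means a strict inequality, as in the "error less than 10 mm" criterion; the function names are hypothetical.

```python
import math
from statistics import mean, median

def keypoint_errors_mm(predicted, truth, voxel_size_mm=3.0):
    """Euclidean distance in mm between each predicted and true keypoint."""
    return [voxel_size_mm * math.dist(p, t) for p, t in zip(predicted, truth)]

def pck(errors_mm, threshold_mm):
    """Percentage of keypoints whose error is below the threshold."""
    return 100.0 * sum(e < threshold_mm for e in errors_mm) / len(errors_mm)

def summarize(errors_mm):
    """Collect the three reported metrics in one dictionary."""
    return {
        "pck_5mm": pck(errors_mm, 5.0),
        "pck_10mm": pck(errors_mm, 10.0),
        "mean_mm": mean(errors_mm),
        "median_mm": median(errors_mm),
    }
```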

Fig. 4 shows the PCK at two thresholds, 5 mm (1.67 pixels) and 10 mm (3.33 pixels), while the mean and median errors of the different models are listed in Table 1. With the proposed pipeline, 96.4% of the keypoints are located correctly (error < 10 mm) and the mean distance between predicted and ground-truth keypoints is 4.47 mm (about 1.5 pixels). On average, the proposed 3D hourglass network performs similarly to 3D UNet. However, as shown in Table 2, UNet has about 6 times as many parameters as the hourglass network (22M versus 3.5M), indicating that the proposed network is more compact and efficient. The main reason is that the hourglass network uses element-wise summation instead of concatenation in its skip connections and keeps the number of channels fixed across scales.

The second-stage MRF refinement improves upon CNN heatmap regression in terms of both PCK and mean error. As illustrated in Fig. 5(b), fetal pose estimation based on the max activating location of each heatmap may produce anatomically implausible predictions. The MRF refinement corrects such errors by trading off prior information about keypoint connections against the heatmaps generated by the CNN.

As for computation time, the proposed 3D hourglass network runs at 225 ms/volume on a GPU, and solving the optimization problem that infers keypoint locations from the heatmaps takes 290 ms/volume on a CPU. The end-to-end processing time of the whole pipeline is therefore less than 1 s/volume, shorter than the temporal resolution of the current fetal MR protocol, which potentially enables low-latency tracking of fetal pose in fetal MR imaging.
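The effect of the skip-connection type and channel widths on model size can be illustrated with a simple parameter count. The channel widths below are hypothetical and are not the configurations of the two networks compared here, so the totals do not reproduce the 22M and 3.5M figures in Table 2; the sketch only shows the direction and rough scale of the effect.

```python
def conv3d_params(c_in, c_out, kernel=3):
    """Weights plus biases of a 3D convolution with a cubic kernel."""
    return kernel ** 3 * c_in * c_out + c_out

def merge_conv_params(c_skip, c_up, c_out, merge):
    """Parameters of the convolution that follows a skip connection.

    merge="add" sums the two feature maps (equal channel counts required);
    merge="concat" stacks them along the channel axis.
    """
    if merge == "add":
        if c_skip != c_up:
            raise ValueError("element-wise addition needs equal channel counts")
        c_in = c_skip
    elif merge == "concat":
        c_in = c_skip + c_up
    else:
        raise ValueError(f"unknown merge mode: {merge}")
    return conv3d_params(c_in, c_out)

# Hypothetical channel widths, chosen only for illustration.
fixed_width = [64, 64, 64, 64]       # same width at every scale, additive skips
doubling_width = [32, 64, 128, 256]  # width doubles per scale, concatenated skips

add_total = sum(merge_conv_params(c, c, c, "add") for c in fixed_width)
concat_total = sum(merge_conv_params(c, c, c, "concat") for c in doubling_width)
```

Concatenation doubles the input channels of the convolution after each skip connection, and doubling the width at each coarser scale makes the deepest layers dominate the parameter count.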

Table 1.

Mean and median error of the different models.

metric	method	wrist	elbow	shoulder	eye	bladder	hip	knee	ankle	all
median (mm)	UNet	3.84	3.43	2.87	2.74	3.20	3.12	4.00	4.42	3.47
	UNet-M	3.84	3.43	2.87	2.73	3.19	3.12	3.99	4.36	3.46
	HG	3.82	3.42	2.83	2.72	3.37	3.16	3.87	4.15	3.42
	HG-M	3.82	3.41	2.83	2.72	3.36	3.16	3.86	4.15	3.42

mean (mm)	UNet	7.34	4.06	4.27	3.96	4.48	3.33	5.19	10.2	5.41
	UNet-M	5.64	3.81	3.75	3.29	3.52	3.23	4.84	8.18	4.60
	HG	7.48	4.81	3.24	3.35	4.69	3.58	4.39	7.49	4.89
	HG-M	6.37	4.11	3.10	3.28	4.12	3.33	4.19	7.07	4.47

Table 2.

Computation time and number of parameters of different networks.

network	computation time (ms/volume)	number of parameters
UNet	271	22M
HG	225	3.5M

Fig. 5. — (a) An example of fetal pose successfully predicted by the max activating location of heatmaps, where solid lines are the ground-truth pose and dashed lines are the predicted pose. (b) A failed case of fetal pose estimation with max activation (left), and the corresponding successful result after processed by MRF (right).

4. Conclusions

In this work, we proposed a two-stage deep learning framework for fetal pose estimation in 3D MRI. The proposed method achieves a mean error of 4.47 mm (~1.5 pixels) and a percentage of correct detection of 96.4%, which indicates that deep neural networks are able to identify key features for fetal pose estimation from time frames in low-resolution, volumetric EPI data from pregnant mothers. Further, the total processing time of the proposed framework is less than 1 s per volume, potentially enabling low-latency tracking of fetal pose in fetal MR imaging. One limitation of the current method is that the pipeline was trained only on singleton pregnancies. Another is that pose detection was performed on each time frame in isolation, without using temporal correlations in the MR series. In future work, the proposed framework could be extended to multiple pregnancies and to exploit temporal correlations across volumes in a time sequence.

Overall, the proposed pipeline could be deployed for fetal motion estimation during MR scanning of pregnant mothers, with applications to fetal health and disease, the establishment of kinematic models of fetal motion, and prospective motion correction with slice-prescription updates for more robust diagnostic fetal and maternal MRI.

Acknowledgements

This research was supported by NIH U01HD087211, NIH R01EB01733, NIH NIBIB NAC P41EB015902 and NIH NICHD U01HD087211.

References

1. Biglari H, Sameni R: Fetal motion estimation from noninvasive cardiac signal recordings. Physiological Measurement 37(11), 2003 (2016)
2. Heazell AP, Frøen J: Methods of fetal movement counting and the detection of fetal compromise. Journal of Obstetrics and Gynaecology 28(2), 147–154 (2008)
3. Newell A, Yang K, Deng J: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision, pp. 483–499. Springer (2016)
4. Schmidt M: UGM: Matlab code for undirected graphical models. URL http://www.di.ens.fr/mschmidt/Software/UGM.html (2012)
5. Loshchilov I, Hutter F: Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101 (2017)
6. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 424–432. Springer (2016)
7. Payer C, Štern D, Bischof H, Urschler M: Regressing heatmaps for multiple landmark localization using CNNs. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 230–238. Springer (2016)
