Author manuscript; available in PMC: 2023 Sep 17.
Published in final edited form as: Med Image Comput Comput Assist Interv. 2022 Sep 17;13437:94–103. doi: 10.1007/978-3-031-16449-1_10

Multimodal-GuideNet: Gaze-Probe Bidirectional Guidance in Obstetric Ultrasound Scanning

Qianhui Men 1, Clare Teng 1, Lior Drukker 2,3, Aris T Papageorghiou 2, J Alison Noble 1
PMCID: PMC7614062  EMSID: EMS159396  PMID: 36649382

Abstract

Eye trackers can provide visual guidance to sonographers during ultrasound (US) scanning. Such guidance is potentially valuable for less experienced operators to improve their skill in manipulating the probe to achieve the desired plane. In this paper, a multimodal guidance approach (Multimodal-GuideNet) is proposed to capture the stepwise dependency between a real-world US video signal, synchronized gaze, and probe motion within a unified framework. To understand the causal relationship between gaze movement and probe motion, our model exploits multitask learning to jointly learn two related tasks: predicting the gaze movements and probe signals that an experienced sonographer would perform in routine obstetric scanning. The two tasks are associated by a modality-aware spatial graph that detects the co-occurrence among the multi-modality inputs and shares useful cross-modal information. Instead of a deterministic scanning path, Multimodal-GuideNet allows for scanning diversity by estimating the probability distribution of real scans. Experiments performed with three typical obstetric scanning examinations show that the new approach outperforms single-task learning for both probe motion guidance and gaze movement prediction. Multimodal-GuideNet also provides a visual guidance signal with an error of less than 10 pixels for a 224 × 288 US image.

Keywords: Probe Guidance, Multimodal Representation Learning, Ultrasound Navigation, Multitask Learning

1. Introduction

Obstetric ultrasound (US) scanning is a highly skilled medical examination that requires refined hand-eye coordination: the sonographer must look at a screen while simultaneously manipulating a handheld probe. Computer-assisted scanning with probe motion guidance, which has attracted increasing interest from researchers and clinicians [10,14,18], could improve the training process through which non-specialists develop their scanning skills. Within the robotics field, work has focused on guiding operators to scan simple structures such as the liver [13] and the lumbar vertebrae [11]. Such solutions are not feasible for obstetric scans because of the variety of fetal anatomy to be measured and unpredictable fetal movement.

Previous studies in obstetric scanning guidance have proposed positioning the probe based on a behavioral cloning system [6] or landmark-based image retrieval [22]. In [6], different strategies are modeled for operators to either follow the next-action instruction or directly approach the anatomical Standard Plane (SP) [1]. Other work [18,19] deployed probe guidance signals to a robotic arm that is not practically applicable in a clinical environment. A common practice in these models is to treat probe guidance as an image-guided navigation problem. However, as multiple fetal anatomies can appear in a single US image, the gaze of the operator can provide instructive context about the likely next probe movement. Using gaze information to inform probe motion guidance has not been researched before now, and we explore this as the first aim of this work.

In addition to probe navigation, gaze information is also used as a guiding signal, usually in the form of gaze-point or saliency map (eye-tracking heat map) prediction on US images or video. Cai et al. [2,3] leveraged visual saliency as auxiliary information to aid abdominal circumference plane (ACP) detection, and Droste et al. [5] extended it to saliency prediction with diverse anatomical structures. Teng et al. [17] characterized visual scanning patterns from normalized time-series scanpaths. Here, under the assumption that a sonographer reacts to the next image implied by their hand movement on the probe, the second aim of this work is to explore whether probe motion is useful in guiding gaze.

In this work, we investigate how experienced sonographers coordinate their visual attention and hand movement during fetal SP acquisition. We propose the first model to provide guidance as synchronized probe and gaze signals for achieving the desired anatomical plane. The model, termed Multimodal-GuideNet, learns scanning patterns from a large number of real-world probe motion signals, gaze trajectories, and US videos collected during routine obstetric scanning (data acquisition in Fig. 1). Multimodal-GuideNet employs multitask learning (MTL) for the two highly related US guidance tasks of probe motion prediction and gaze trajectory prediction, and identifies commonalities and differences across these tasks. The performance boost over single-task learning models suggests that jointly learning gaze and probe motion leads to more objective guidance during US scanning. Moreover, the model generates real-time probabilistic predictions [8] that provide unbiased guidance for the two signals to aid operators.

Fig. 1. Data acquisition and the correspondence between captured signals. (a) Overview of the multi-modality data acquisition in clinical obstetric ultrasound scanning. (b) The unrolled US guiding process between the acquired US image, the probe motion signal, and the gaze signal.

2. Methods

Figure 1 outlines the principles of the approach. The probe orientation is recorded as 4D quaternions by an inertial measurement unit (IMU) motion sensor attached to the US probe, and the 2D gaze-point signal is captured by an eye-tracking sensor mounted at the bottom of the US screen. Given a US image starting at a random plane, the change in gaze between neighbouring time steps, and the corresponding probe rotation, our multitask model Multimodal-GuideNet estimates the instructive next-step movements of both the gaze and the probe for SP acquisition. The two tasks, probe motion prediction and gaze shift prediction, complement each other for more accurate US scanning guidance. The problem definition and network architecture are as follows.

2.1. Problem Formulation

Unlike previous US guidance models that only predict a fixed action, we regard the gaze and probe movements as random variables to account for inter- and intra-sonographer variation. For a more continuous prediction, relative features are computed from neighbouring frames of these two modalities. Let $s_t = g_t - g_{t-1}$ be the shift of the gaze point $g = (x, y)$ at time $t$, and $r_t = q_{t-1}^{*}\, q_t$ be the rotation from the probe orientation $q = (q_w, q_x, q_y, q_z)$, where $q^{*}$ is the quaternion conjugate. We assume that the gaze shift $s_t$ follows a bi-variate Gaussian distribution, i.e., $s_t \sim \mathcal{N}(\mu_t^s, \sigma_t^s, \rho_t^s)$, where $\mu_t^s$ and $\sigma_t^s$ denote the mean and standard deviation in 2D, and $\rho_t^s$ is the correlation coefficient between $x$ and $y$. Therefore, at every step the model outputs a 5D vector for gaze estimation. Similarly, we obtain a 14D vector for the probe rotation $r_t$, which follows a multi-variate Gaussian distribution $r_t \sim \mathcal{N}(\mu_t^r, \sigma_t^r, \rho_t^r)$. The multitask objective for model training is to jointly minimize the negative log-likelihoods of the two learning tasks

$$\mathcal{L} = \sum_{t=t_0}^{T} \Big( -\lambda_s \log P\big(s_t \mid \mu_t^s, \sigma_t^s, \rho_t^s\big) - \lambda_r \log P\big(r_t \mid \mu_t^r, \sigma_t^r, \rho_t^r\big) + \eta \big(1 - \lVert \mu_t^r \rVert_2\big)^2 \Big), \tag{1}$$

where $t_0$ and $T$ are the start and end indices for prediction. $\lambda_s$ and $\lambda_r$ control the training ratio of the two tasks and are both set to 1. $\eta$ is the weight of the quaternion prior that normalizes $\mu_t^r$, and is set to 50.
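For concreteness, the sketch below (PyTorch, not the authors' code) illustrates how the relative features $s_t$ and $r_t$ and the gaze term of Eq. (1) could be computed under the definitions above; the probe term is analogous with a higher-dimensional Gaussian, and all function names are illustrative.

```python
import math
import torch

def quat_conjugate(q):
    # q: (..., 4) quaternion (w, x, y, z); the conjugate negates the vector part
    return torch.cat([q[..., :1], -q[..., 1:]], dim=-1)

def quat_multiply(p, q):
    # Hamilton product of two quaternions given as (w, x, y, z)
    pw, px, py, pz = p.unbind(-1)
    qw, qx, qy, qz = q.unbind(-1)
    return torch.stack([
        pw*qw - px*qx - py*qy - pz*qz,
        pw*qx + px*qw + py*qz - pz*qy,
        pw*qy - px*qz + py*qw + pz*qx,
        pw*qz + px*qy - py*qx + pz*qw,
    ], dim=-1)

def relative_features(gaze, quat):
    # gaze: (T, 2) gaze points, quat: (T, 4) probe orientations
    s = gaze[1:] - gaze[:-1]                                  # s_t = g_t - g_{t-1}
    r = quat_multiply(quat_conjugate(quat[:-1]), quat[1:])    # r_t = q*_{t-1} q_t
    return s, r

def gaze_nll(s, mu, sigma, rho):
    # Negative log-likelihood of s under a bivariate Gaussian N(mu, sigma, rho);
    # s, mu, sigma: (T, 2), rho: (T,)
    d = (s - mu) / sigma
    one_minus_rho2 = (1 - rho**2).clamp_min(1e-6)
    z = d[:, 0]**2 + d[:, 1]**2 - 2 * rho * d[:, 0] * d[:, 1]
    log_p = (-torch.log(2 * math.pi * sigma[:, 0] * sigma[:, 1] * one_minus_rho2.sqrt())
             - z / (2 * one_minus_rho2))
    return -log_p.sum()

def quaternion_prior(mu_r, eta=50.0):
    # Penalty encouraging the predicted mean rotation to remain a unit quaternion
    return eta * (1 - mu_r.norm(dim=-1))**2
```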

2.2. Multimodal-GuideNet

To facilitate multitask learning, Multimodal-GuideNet constructs a lightweight graph shared among the three modalities. The network backbone is formed by a graph-convolutional Gated Recurrent Unit (GCGRU) [12] that automatically allocates useful dependencies within the graph at each time step. The lightweight spatial graph is also computationally efficient for online inference. Temporally, the gaze and probe dynamics complement each other through a bidirectional pathway. The entire multitask framework is presented in Fig. 2. To facilitate interactive learning within the graph structure, the input of each modality is separately embedded into an equal-sized 128-channel vector through a linear transformation block $f_I$, $f_s$, or $f_r$, each of which contains a fully-connected (FC) layer, a batch normalization (BN) layer, and a ReLU activation. Before $f_I$, the grayscale US image is first mapped to a flattened image representation $I$ [6] with MobileNetV2 (MNetV2) [16].
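As a rough illustration (not the authors' implementation), each embedding block can be sketched as an FC + BN + ReLU module producing 128 channels; the input dimensions below, in particular the flattened MobileNetV2 feature size, are assumptions.

```python
import torch.nn as nn

class ModalityEmbed(nn.Module):
    """Linear transformation block (FC + BN + ReLU) mapping one modality to 128 channels."""
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):      # x: (batch, in_dim)
        return self.block(x)

# One block per modality; the image feature dimension (1280) is an assumption
f_I = ModalityEmbed(in_dim=1280)  # flattened MobileNetV2 image representation I
f_s = ModalityEmbed(in_dim=2)     # 2D gaze shift s_t
f_r = ModalityEmbed(in_dim=4)     # 4D quaternion rotation r_t
```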

Fig. 2. Flowchart of Multimodal-GuideNet for a single time step. The two tasks share a modality-aware spatial graph over the three modalities.

Modality-aware Graph Representation Sharing

To model spatial proximity at time $t$, we propose a common graph structure $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$ shared among the three modalities, where $\mathcal{V}_t = \{f_I(I_t), f_s(s_t), f_r(r_t)\}$ is the vertex set with 3 nodes and $\mathcal{E}_t$ is the edge set, specified by a $3 \times 3$ adaptive adjacency matrix $\mathbf{A}_t + \mathbf{M}_t$, where the first term captures the spatial relationships within $\mathcal{V}_t$ and the second term is a trainable adjacency mask [20] that increases graph generalization. Inspired by [21], the edge weight between any two nodes in $\mathbf{A}_t$ is formed by the affinity between the corresponding modality features in the embedded space

$$\mathbf{A}_t(j,k) = \mathrm{softmax}\big(\theta(f_j(j_t))^{T}\, \phi(f_k(k_t))\big), \quad j,k \in \{I, s, r\}, \tag{2}$$

where $\theta$ and $\phi$ are $1 \times 1$ convolutions with $\theta(x), \phi(x) \in \mathbb{R}^{256}$, and the softmax operation normalizes each row of $\mathbf{A}_t$ to sum to one. The message passed to $s$ and $r$ is therefore aggregated by one layer of spatial graph convolution

$$\sum_{k \in \{I, s, r\}} \mathrm{sigmoid}\big(\mathbf{A}_t(j,k)\, f_k(k_t)\, W_j\big), \quad j \in \{s, r\}, \tag{3}$$

where $W_j$ is the input feature kernel, specified separately for each gate in the GRU cell.
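The following is a minimal sketch of Eqs. (2)-(3), assuming PyTorch: linear layers stand in for the 1 × 1 convolutions $\theta$ and $\phi$, a single kernel W is shown in place of the per-gate kernels $W_j$, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityGraph(nn.Module):
    """Sketch of the modality-aware adjacency (Eq. 2) plus one spatial graph
    convolution step (Eq. 3) over the three modality nodes (image, gaze, probe)."""
    def __init__(self, feat_dim=128, embed_dim=256):
        super().__init__()
        self.theta = nn.Linear(feat_dim, embed_dim)   # stands in for the 1x1 convolution theta
        self.phi = nn.Linear(feat_dim, embed_dim)     # stands in for the 1x1 convolution phi
        self.mask = nn.Parameter(torch.zeros(3, 3))   # trainable adjacency mask M_t
        self.W = nn.Linear(feat_dim, feat_dim, bias=False)  # shared stand-in for the per-gate W_j

    def forward(self, nodes):
        # nodes: (batch, 3, feat_dim) embedded features f_I(I_t), f_s(s_t), f_r(r_t)
        affinity = self.theta(nodes) @ self.phi(nodes).transpose(1, 2)   # (batch, 3, 3)
        A = F.softmax(affinity, dim=-1) + self.mask                      # row-normalised + mask
        # Eq. 3: message for node j is a sum over k of gated contributions A(j,k) * f_k W
        msg = torch.sigmoid(A.unsqueeze(-1) * self.W(nodes).unsqueeze(1))  # (batch, 3, 3, feat_dim)
        out = msg.sum(dim=2)                                               # aggregate over k
        return out[:, 1:, :]   # only the gaze and probe rows (j in {s, r}) are used downstream
```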

Gaze-Probe Bidirectional Adaptive Learning

During US scanning, the gaze and probe movements of sonographers are generally heterogeneous, i.e., they do not move at the same pace. Upon approaching the SP, the gaze is more prone to rapid eye movements between anatomical structures while the probe remains steady. We account for this effect by enclosing a bidirectional inverse adaptive pathway between the hidden states of $s$ and $r$ in the time domain. Let $h_t^s, h_t^r$, $\tilde{h}_t^s, \tilde{h}_t^r$, and $z_t^s, z_t^r$ denote the hidden states, candidate activations, and update gates of $s$ and $r$ in the GRU at time $t$, respectively. We replace the original hidden state update $(1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ with:

$$
\begin{aligned}
h_t^s &= \underbrace{\alpha \odot (1 - z_t^s) \odot h_{t-1}^s + \alpha \odot z_t^s \odot \tilde{h}_t^s}_{\text{update from gaze}} + \underbrace{(1 - \alpha) \odot z_t^r \odot h_{t-1}^s + (1 - \alpha) \odot (1 - z_t^r) \odot \tilde{h}_t^s}_{\text{inverse update from probe}} \\
h_t^r &= \underbrace{\beta \odot (1 - z_t^r) \odot h_{t-1}^r + \beta \odot z_t^r \odot \tilde{h}_t^r}_{\text{update from probe}} + \underbrace{(1 - \beta) \odot z_t^s \odot h_{t-1}^r + (1 - \beta) \odot (1 - z_t^s) \odot \tilde{h}_t^r}_{\text{inverse update from gaze}}
\end{aligned} \tag{4}
$$

where $\alpha$ and $\beta$ are adaptive channel-wise weights for $z_t^s$ and $z_t^r$, respectively, and $\odot$ is the element-wise product. The number of hidden channels is set to 128, matching the dimensionality of $\alpha$ and $\beta$. With the proposed bidirectional pathway, the gaze and probe signals adapt domain-specific representations from each other to generate a more accurate scanning path. Apart from the input operation for all gates (Eq. 3) and the adaptive hidden state (Eq. 4) for the output, we follow the operations of a standard GRU [4] to transfer temporal information.
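A direct transcription of Eq. (4) as a sketch (assuming PyTorch tensors of shape (batch, 128); how $\alpha$ and $\beta$ are parameterized is not specified here and is an assumption left to the caller):

```python
import torch

def bidirectional_update(h_prev_s: torch.Tensor, h_prev_r: torch.Tensor,
                         h_cand_s: torch.Tensor, h_cand_r: torch.Tensor,
                         z_s: torch.Tensor, z_r: torch.Tensor,
                         alpha: torch.Tensor, beta: torch.Tensor):
    """Adaptive hidden-state update of Eq. (4); alpha and beta are channel-wise weights."""
    h_s = (alpha * (1 - z_s) * h_prev_s + alpha * z_s * h_cand_s                  # update from gaze
           + (1 - alpha) * z_r * h_prev_s + (1 - alpha) * (1 - z_r) * h_cand_s)   # inverse update from probe
    h_r = (beta * (1 - z_r) * h_prev_r + beta * z_r * h_cand_r                    # update from probe
           + (1 - beta) * z_s * h_prev_r + (1 - beta) * (1 - z_s) * h_cand_r)     # inverse update from gaze
    return h_s, h_r
```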

3. Experiments

3.1. Data

The data used in this study were acquired from the PULSE (Perception Ultrasound by Learning Sonographic Experience) project [7]. The clinical fetal ultrasound scans were conducted on a GE Voluson E8 scanner (General Electric, USA), and the video signal was collected losslessly at 30 Hz. The corresponding gaze-tracking data were recorded simultaneously as (x, y) coordinates at 90 Hz with a Tobii Eye Tracker 4C (Tobii, Sweden). The probe motion was recorded with an IMU (x-io Technologies Ltd., UK) attached to the probe cable outlet, as shown in Fig. 1(a). Approval from the UK Research Ethics Committee was obtained for this study, and written informed consent was given by all participating sonographers and pregnant women. In total, 551 second- and third-trimester scans were carried out by 17 qualified sonographers. Data from all three modalities were downsampled to 6 Hz to reduce time complexity.

3.2. Experimental Settings

The video frames were cropped to 224 × 288 and irrelevant graphical user interface information was discarded. To facilitate image representation learning, we pre-train MNetV2 on a large number of cropped US frames under the 14-class SonoNet standard plane classification task [1], following the processing step of [6]. The clinical SP type is recognised automatically by Optical Character Recognition (OCR), yielding a total of 2121 eligible labelled acquisitions. For each acquisition, a multimodal data sample is selected from the 10 s before the SP, which is the period of probe refinement. The raw gaze point is scaled to (−0.5, 0.5) with the image centre kept invariant, and the predicted $\mu_t^s$ is normalized to the same range by a sigmoid activation and a shift of 0.5 before minimizing the multitask objective. The train:test ratio is 4:1. In the training stage, we randomly select 32 continuous frames from each sample. The model is evaluated on three biometry SPs: the trans-ventricular plane (TVP), the abdominal circumference plane (ACP), and the femur standard plane (FSP) [15]. The AdamW optimizer is adopted with an initial learning rate of 1e-3, decayed by 1e-2 every 8 epochs. The whole network is first trained on all 14 classes of SPs for 20 epochs and then separately fine-tuned for TVP, ACP, and FSP for 16 epochs.
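A small sketch of the gaze normalization described above; per-axis scaling by the crop size and the axis convention are assumptions, not taken from the paper.

```python
import torch

def normalise_gaze(gaze_px, width=288, height=224):
    """Scale raw pixel gaze points to (-0.5, 0.5) with the image centre at the origin."""
    x = gaze_px[..., 0] / width - 0.5
    y = gaze_px[..., 1] / height - 0.5
    return torch.stack([x, y], dim=-1)

def squash_gaze_mean(raw_mu):
    """Map the raw network output for mu_t^s to (-0.5, 0.5) via a sigmoid and a 0.5 shift."""
    return torch.sigmoid(raw_mu) - 0.5
```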

3.3. Metrics and Baselines

We evaluate two probe scenarios: Coarse Adjustment, where the probe rotation angle to the SP is ≥ 10°, and Fine Adjustment, where it is ≤ 10°. The ratio of the two stages may vary from sample to sample, so prediction performance is averaged over all frames in the same stage. For our method, we randomly sample 100 trajectories from the predicted distribution and average them into final predictions $\hat{r}$ and $\hat{s}$. The two tasks are evaluated separately with different metrics: a probe movement is considered correctly predicted if it rotates towards the next target plane, i.e., $\angle(q_{t-1}\hat{r}_t, q_t) \le \angle(q_{t-1}, q_t)$; the predicted gaze point $\hat{g}_t = g_{t-1} + \hat{s}_t$ is evaluated by its pixel $\ell_2$-norm error. We compare our multitask model with two baselines and two single-task architectures: Baseline (r), continuing the previous probe rotation at the current time step; Baseline (g), using the previous gaze point at the current time step; US-GuideNet [6], a single-task learning approach for probe guidance, where only probe motion is modeled and predicted from US video; and Gaze-GuideNet, a single-task learning approach for gaze prediction, where only gaze information is modeled and predicted from US video, obtained by discarding the probe stream from Multimodal-GuideNet.
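The two metrics can be sketched as follows (assuming unit quaternions and reusing the Hamilton product quat_multiply from the earlier sketch; this is not the authors' evaluation code):

```python
import torch

def quat_angle(q1, q2):
    # Angular distance (radians) between two unit quaternions
    dot = (q1 * q2).sum(dim=-1).abs().clamp(max=1.0)
    return 2 * torch.acos(dot)

def probe_step_correct(q_prev, q_target, r_hat):
    # Correct if applying the predicted rotation moves the probe closer to the target plane:
    # angle(q_{t-1} r_hat_t, q_t) <= angle(q_{t-1}, q_t)
    q_pred = quat_multiply(q_prev, r_hat)       # quat_multiply defined in the earlier sketch
    return quat_angle(q_pred, q_target) <= quat_angle(q_prev, q_target)

def gaze_pixel_error(g_prev, s_hat, g_true):
    # Pixel L2 error of the predicted gaze point g_hat_t = g_{t-1} + s_hat_t
    return torch.linalg.norm(g_prev + s_hat - g_true, dim=-1)
```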

4. Results and Discussion

4.1. Probe Motion Guidance

A detailed performance comparison for the probe guidance task is presented in Fig. 3. Multimodal-GuideNet achieves a consistent overall improvement over the single-task US-GuideNet [6] for the two adjustment stages, which indicates that simultaneously learning the gaze patterns benefits probe motion planning. The probe rotation for the femur (FSP) is difficult to predict as it approaches the SP (at 0°). In contrast to the steady probe movement towards TVP and ACP, probe manipulation close to FSP requires complicated twisting actions [15]. This also explains why incorporating gaze contributes more in the coarse adjustment (e.g., at 30°) to locate the femur than in the fine stage (e.g., at 10°). Moreover, the flexible movements of fetal limbs increase the diversity in FSP detection, which explains the slightly higher standard deviation observed for this plane. In the 5th subplot (w/ vs. w/o bi-path), the improvements indicate that the pathway between gaze and probe stabilizes the probe movement and yields a more accurate prediction, especially in fine adjustment.

Fig. 3. Probe rotation accuracy (the higher the better) on the 3 evaluated standard planes and the overall prediction with ablations. The shaded area indicates the standard deviation of our model across all 100 samplings. "bi-path" denotes the bidirectional pathway.

4.2. Gaze Trajectory Prediction

Figure 4 shows the prediction results for the gaze task. A common observation across all three planes is that, for gaze prediction, the error in fine adjustment is generally larger than in coarse adjustment. This is because, in contrast to the fine-grained probe motion, eye gaze movement during that period is quite rapid, flitting between the observed anatomical structures. Comparing the three planes, the error range is lowest for ACP and highest for FSP, especially in fine adjustment. Since the key anatomical structures in ACP are relatively close to each other, the sonographer requires smaller gaze changes. For FSP, sonographers switch focus between the two femur ends, which increases the uncertainty of the gaze state at the next time step. Comparing methods, Multimodal-GuideNet reduces the error of Gaze-GuideNet in all cases, which demonstrates the effectiveness of multitask learning over single-task learning for gaze prediction. The bidirectional pathway also slightly improves gaze prediction, as shown in the 5th subplot. As is common in evaluating sampling-based generative models [9], we also report the performance of our best gaze point prediction among all samplings as Gaze-GuideNet* and Multimodal-GuideNet*. Their errors are within 10 pixels, which shows that the learned distribution can generate a plausible gaze trajectory. Practically, Multimodal-GuideNet* could be useful when precise gaze is needed, such as when the sonographer focuses on a small region of the underlying anatomy, and its improvement over Gaze-GuideNet* indicates that probe guidance could help locate such a fixation point.

Fig. 4. Gaze prediction error (the lower the better) on the 3 evaluated standard planes and the overall prediction with ablations. The error of the best generated gaze point, i.e., the one closest to the ground truth, is reported as Gaze-GuideNet* and Multimodal-GuideNet*, respectively.

Figure 5 shows an example of the predicted visual saliency and gaze point deduced from the generated gaze shift distribution. The predictions are highly accurate at all timestamps except for a significant gaze shift at frame $t_f$; however, the predicted saliency map at $t_f$ correctly estimates the direction of the gaze shift. Saliency-map-based numerical metrics are also evaluated in the supplementary material, where the multitask model generally outperforms the single-task one. In general, modeling the gaze information as a bi-variate distribution is technically advantageous over a saliency-map-based predictor, as the problem complexity is reduced from optimizing a large feature map to estimating only a few parameters of a probability density. The flexibility of gaze sampling also preserves the variety of gaze movements.

Fig. 5. Visualization of the predicted saliency map (top row), predicted gaze point (bottom row, red star), and corresponding ground-truth gaze point (green star) for an ACP searching sequence. $t_c$ and $t_f$ are timestamps for coarse and fine adjustment, respectively.

5. Conclusion

We have presented a novel multimodal framework for bidirectional guidance between probe motion and eye-tracking data in routine US scanning. We explored multitask learning by jointly predicting the probe rotation and gaze trajectory from US video via a shared modality-aware graph structure. The performance gains over single-task predictions suggest that the two modality signals complement each other in reaching the scanning target, and that ignoring either leads to biased guidance. The learned guidance signals, expressed as probability distributions, also allow for diversity between individual scans in a practical environment.

Supplementary Material


Acknowledgements

We acknowledge the ERC (ERC-ADG-2015 694581, project PULSE), the EPSRC (EP/MO13774/1, EP/R013853/1), and the NIHR Oxford Biomedical Research Centre.

References

  • 1. Baumgartner CF, Kamnitsas K, Matthew J, Fletcher TP, Smith S, Koch LM, Kainz B, Rueckert D. SonoNet: real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE Trans Med Imaging. 2017;36(11):2204–2215. doi: 10.1109/TMI.2017.2712367.
  • 2. Cai Y, Sharma H, Chatelain P, Noble JA. Multi-task SonoEyeNet: detection of fetal standardized planes assisted by generated sonographer attention maps. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI); 2018. pp. 871–879.
  • 3. Cai Y, Sharma H, Chatelain P, Noble JA. SonoEyeNet: standardized fetal ultrasound plane detection informed by eye tracking. In: IEEE International Symposium on Biomedical Imaging (ISBI); 2018. pp. 1475–1478.
  • 4. Cho K, van Merrienboer B, Gulcehre C, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014.
  • 5. Droste R, Cai Y, Sharma H, Chatelain P, Drukker L, Papageorghiou AT, Noble JA. Ultrasound image representation learning by modeling sonographer visual attention. In: International Conference on Information Processing in Medical Imaging; 2019. pp. 592–604.
  • 6. Droste R, Drukker L, Papageorghiou AT, Noble JA. Automatic probe movement guidance for freehand obstetric ultrasound. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI); 2020. pp. 583–592.
  • 7. Drukker L, Sharma H, Droste R, Alsharid M, Chatelain P, Noble JA, Papageorghiou AT. Transforming obstetric ultrasound into data science using eye tracking, voice recording, transducer motion and ultrasound video. Sci Rep. 2021;11(1):1–12. doi: 10.1038/s41598-021-92829-1.
  • 8. Graves A. Generating sequences with recurrent neural networks. 2013. arXiv:1308.0850.
  • 9. Gupta A, Johnson J, Fei-Fei L, Savarese S, Alahi A. Social GAN: socially acceptable trajectories with generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2018. pp. 2255–2264.
  • 10. Housden RJ, Treece GM, Gee AH, Prager RW. Calibration of an orientation sensor for freehand 3D ultrasound and its use in a hybrid acquisition system. Biomed Eng Online. 2008;7(1):1–13. doi: 10.1186/1475-925X-7-5.
  • 11. Li K, Wang J, Xu Y, Qin H, Liu D, Liu L, Meng MQH. Autonomous navigation of an ultrasound probe towards standard scan planes with deep reinforcement learning. In: IEEE International Conference on Robotics and Automation (ICRA); 2021. pp. 8302–8308.
  • 12. Li Y, Zemel R, Brockschmidt M, Tarlow D. Gated graph sequence neural networks. In: International Conference on Learning Representations (ICLR); 2016.
  • 13. Mustafa ASB, Ishii T, Matsunaga Y, Nakadate R, Ishii H, Ogawa K, Saito A, Sugawara M, Niki K, Takanishi A. Development of robotic system for autonomous liver screening using ultrasound scanning device. In: IEEE International Conference on Robotics and Biomimetics (ROBIO); 2013. pp. 804–809.
  • 14. Prevost R, Salehi M, Jagoda S, Kumar N, Sprung J, Ladikos A, Bauer R, Zettinig O, Wein W. 3D freehand ultrasound without external tracking using deep learning. Med Image Anal. 2018;48:187–202. doi: 10.1016/j.media.2018.06.003.
  • 15. Salomon LJ, Alfirevic Z, Berghella V, Bilardo C, Hernandez-Andrade E, Johnsen S, Kalache K, Leung KY, Malinger G, Munoz H, et al. Practice guidelines for performance of the routine mid-trimester fetal ultrasound scan. Ultrasound Obstet Gynecol. 2011;37(1):116–126. doi: 10.1002/uog.8831.
  • 16. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC. MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2018. pp. 4510–4520.
  • 17. Teng C, Sharma H, Drukker L, Papageorghiou AT, Noble JA. Towards scale and position invariant task classification using normalised visual scanpaths in clinical fetal ultrasound. In: International Workshop on Advances in Simplifying Medical Ultrasound; 2021. pp. 129–138.
  • 18. Toporek G, Wang H, Balicki M, Xie H. Autonomous image-based ultrasound probe positioning via deep learning. In: Hamlyn Symposium on Medical Robotics; 2018.
  • 19. Wang S, Housden J, Noh Y, Singh D, Singh A, Skelton E, Matthew J, Tan C, Back J, Lindenroth L, et al. Robotic-assisted ultrasound for fetal imaging: evolution from single-arm to dual-arm system. In: Annual Conference Towards Autonomous Robotic Systems; 2019. pp. 27–38.
  • 20. Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence; 2018.
  • 21. Zhang P, Lan C, Zeng W, Xing J, Xue J, Zheng N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2020. pp. 1112–1121.
  • 22. Zhao C, Droste R, Drukker L, Papageorghiou AT, Noble JA. Visual-assisted probe movement guidance for obstetric ultrasound scanning using landmark retrieval. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI); 2021. pp. 670–679.
