
Towards Scale and Position Invariant Task Classification using Normalised Visual Scanpaths in Clinical Fetal Ultrasound

Clare Teng 1, Harshita Sharma 1, Lior Drukker 2, Aris T Papageorghiou 2, J Alison Noble 1

Abstract

We present a method for classifying tasks in fetal ultrasound scans using the eye-tracking data of sonographers. The visual attention of a sonographer captured by eye-tracking data over time is defined by a scanpath. In routine fetal ultrasound, the captured standard imaging planes are visually inconsistent due to fetal position, movements, and sonographer scanning experience. To address this challenge, we propose a scale and position invariant task classification method using normalised visual scanpaths. We describe a normalisation method that uses bounding boxes to provide the gaze with a reference to the position and scale of the imaging plane, and use the normalised scanpath sequences to train machine learning models for discriminating between ultrasound tasks. We compare the proposed method to existing work that uses raw eye-tracking data. The best performing model achieves an F1-score of 84% and outperforms existing models.

Keywords: Eye-tracking, fetal ultrasound, time-series classification, visual scanpath

1. Introduction

During routine fetal ultrasound scans, sonographers are required to capture and store standard imaging planes of fetal anatomy [15]. These imaging planes are referred to as anatomy planes. Each anatomy plane is considered a separate task, for example brain and heart. To distinguish between the tasks, we use eye-tracking data of the sonographers recorded while they performed the scan. The eye-tracking data contains gaze information that allows us to analyse the sonographer’s visual attention and scanpath during different parts of the scan, where a scanpath is the path taken by the observer when observing a scene.

Using eye-tracking data for fetal ultrasound task classification is challenging for several reasons. The dynamic movement of a fetus means that there are numerous ways to find and capture an anatomy plane. As sonographers gain more experience, they typically capture anatomy planes more quickly and efficiently, resulting in fast transitions between planes. The size of the anatomy on the screen also depends on the scale at which the sonographer views the image. Due to these changes in scale and position (Fig. 1), the scanpaths associated with different tasks are not easily separable using simple discriminatory methods. Knowing that anatomy planes have unique anatomical landmarks [5], we are motivated to understand whether we can distinguish between the visual scanpaths of different scanning tasks. When considering skill assessment of full-length scans, being able to classify the scanning task being performed at a given time is important. The aim of our work is to understand whether eye-tracking data is sufficient for identifying the fetal ultrasound task being performed.

Fig. 1. Example of differences in position and scale of an abdomen plane scan, where the images in Fig. 1a and Fig. 1b differ in both scale and position.


Related Work

Current works using scanpaths to classify tasks rely on static representations of eye-tracking data, for example the number of fixations and fixation duration [9]. Other works either analyse tasks that use a single image, such as reading [7], or generate a static representation by superimposing the overall task-specific scanpath onto an image. Studies using scanpaths consider only a handful of entry points to reach their target [1] or use saccadic information for classification [8]. Such works are less suitable for our application because of the numerous ways a sonographer can capture an anatomy plane image (Fig. 1) and the uncertainty in identifying saccadic movement accurately. Other studies [2, 4, 18] which use scanpaths in videos rely on images as a data source. However, it is expensive to train models on image data, and our work considers only eye-tracking data, which is more computationally efficient.

Contribution

Our main contributions are the following. We propose a feature engineering method using eye-tracking data that is able to account for the change in scale and position of anatomy during the scan. We compare different time-series classification models and use the best-performing model based on Gated Recurrent Units (GRU) to perform task classification using visual scanpaths for fetal ultrasound tasks.

2. Method

We present our proposed method for normalising scanpaths with respect to scale and position, and our proposed time-series classification model for differentiating the scanpaths of sonographers searching for different anatomical structures.

Scanpath Normalisation for Scale and Position Invariance

Raw gaze points recorded by the eye-tracker along the x and y axes, with respect to the screen dimensions of 1980×1080 pixels, are defined as Gx, Gy. Gaze points normalised by the screen dimensions are defined as Gxs, Gys, calculated as $G_x^s = G_x/1980$ and $G_y^s = G_y/1080$.

To provide the model with information about the gaze point position relative to the image, we normalise the gaze points with respect to the circumference of the anatomy. We manually draw bounding boxes using [14] around the circumference of the anatomy plane on a cropped (1008×784 pixels) image (shown as the red box in Fig. 2). We exclude all text and clipboard images to view the circumference clearly. Then, we normalise the gaze points with respect to the corner positions of the bounding box along the x and y axes, XL, XR, YT, YB, where L, R, T and B represent left, right, top and bottom, and the X and Y offsets introduced by cropping (shown as Xoffset, Yoffset in Fig. 2), which are 427 and 66 pixels, respectively. An example of this normalisation process is shown in Fig. 2. Raw gaze points normalised by the co-ordinates of a hand-drawn bounding box on the image are given as GxBB, GyBB (Eq. 1). Examples of drawn bounding boxes for the abdomen, brain and heart anatomical structures using the cropped image are shown in Fig. 3.

$$G_x^{BB} = \frac{G_x - X_L - X_{offset}}{X_R - X_L} \quad \text{and} \quad G_y^{BB} = \frac{G_y - Y_B - Y_{offset}}{Y_T - Y_B} \tag{1}$$
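As a concrete illustration of Eq. 1, the short Python sketch below normalises a raw gaze point with respect to a hand-drawn bounding box; the function name, argument layout, and the example values in the call are illustrative rather than part of the original implementation.

```python
def normalise_gaze_to_bbox(gx, gy, x_l, x_r, y_t, y_b,
                           x_offset=427, y_offset=66):
    """Apply Eq. 1: map a raw screen-space gaze point (gx, gy) into
    bounding-box coordinates. The box corners (x_l, x_r, y_t, y_b) are given
    on the cropped image, so the crop offsets are subtracted first."""
    gx_bb = (gx - x_l - x_offset) / (x_r - x_l)
    gy_bb = (gy - y_b - y_offset) / (y_t - y_b)
    return gx_bb, gy_bb

# Hypothetical bounding box and gaze point, for illustration only.
print(normalise_gaze_to_bbox(gx=960, gy=540, x_l=200, x_r=800, y_t=150, y_b=650))
```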

Fig. 2. An example showing how a raw gaze point (green) with co-ordinates Gx, Gy is normalised with respect to the hand-drawn bounding box (yellow). The origin of the bounding box is its bottom-left corner.

Fig. 3. Example of manually drawn bounding boxes (in yellow) using [14] for cropped abdomen, brain and heart plane images.


We use the bounding box to capture the difference in scale when viewing the image by calculating the ratio of the screen that the anatomy occupies; a ratio of 1 means the anatomy image occupies the entire screen. The area occupied by the bounding box (yellow box, Fig. 2) divided by the area of the cropped image (red box, Fig. 2) is denoted A. Our generated features are GxBB, GyBB, and A.
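A minimal sketch of how the full per-clip feature sequence [GxBB, GyBB, A] might be assembled is given below, assuming the bounding box is stored as its four corner coordinates on the cropped image; the function name and array layout are illustrative choices.

```python
import numpy as np

def scanpath_features(gaze_xy, bbox, crop_wh=(1008, 784), offsets=(427, 66)):
    """Build the per-frame feature sequence [Gx_BB, Gy_BB, A] for one clip.
    gaze_xy: (T, 2) raw gaze points in screen pixels.
    bbox: (x_l, x_r, y_t, y_b) corners of the hand-drawn box on the cropped image.
    A: bounding-box area divided by the cropped-image area (constant per clip)."""
    x_l, x_r, y_t, y_b = bbox
    x_off, y_off = offsets
    gx_bb = (gaze_xy[:, 0] - x_l - x_off) / (x_r - x_l)
    gy_bb = (gaze_xy[:, 1] - y_b - y_off) / (y_t - y_b)
    area = abs((x_r - x_l) * (y_b - y_t)) / (crop_wh[0] * crop_wh[1])
    return np.stack([gx_bb, gy_bb, np.full_like(gx_bb, area)], axis=1)  # (T, 3)
```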

2.1. Time-Series Classification of Scanpaths

Time-series classification can be generalised into several categories. We focus on generative model-based methods as a baseline, based on [1], which consider the joint distribution of the data, such as a hidden Markov model (HMM). We also consider whole-series comparison as another baseline, where each time-series is compared to another using a chosen distance metric, for example a k-nearest neighbours time series classifier (k-NN TSC) [10]. We propose the use of a standard deep learning GRU model [3]. GRUs are a class of recurrent neural network (RNN) that retain time dependencies between sequences. Long short-term memory (LSTM) models are also a class of RNN and are similar to GRUs; however, GRUs have been shown to return comparable performance to LSTMs while requiring fewer parameters [20].
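The sketch below illustrates the kind of GRU classifier described here, written in PyTorch. The hidden size, number of recurrent layers and dropout follow the values reported under Parameter Selection below, while the rest of the architecture (a single linear head on the final hidden state) is an illustrative choice rather than the exact implementation used for the reported results.

```python
import torch.nn as nn

class ScanpathGRU(nn.Module):
    """GRU classifier over scanpath sequences of shape (batch, 100, n_features)."""
    def __init__(self, n_features=3, hidden_size=32, n_layers=2,
                 n_classes=3, dropout=0.55):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden_size, num_layers=n_layers,
                          batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        _, h = self.gru(x)     # h: (n_layers, batch, hidden_size)
        return self.fc(h[-1])  # class logits from the final hidden state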

Baseline Comparisons

We compare our method to [1], as they classify video clips of surgical tasks using eye-tracking data. They use a k-means clustering algorithm to convert raw eye-tracking data into a discrete sequence of cluster membership numbers. An HMM is trained for each task, each test sequence is scored against the task-specific HMMs, and the predicted class is the model that returns the highest log-likelihood. We refer to this model as HMM.
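One possible realisation of this baseline is sketched below using scikit-learn and hmmlearn; the library choice, the number of hidden states and the helper names are illustrative assumptions, and only the overall recipe (k-means discretisation, one Gaussian HMM per task, prediction by highest log-likelihood) follows the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from hmmlearn import hmm

def fit_task_hmms(train_seqs_by_task, n_clusters=5, n_states=3):
    """Fit k-means on all gaze points, then one Gaussian HMM per task on the
    resulting cluster-membership sequences. n_states is an assumed value."""
    all_points = np.concatenate(
        [s for seqs in train_seqs_by_task.values() for s in seqs])
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(all_points)
    models = {}
    for task, seqs in train_seqs_by_task.items():
        discretised = [kmeans.predict(s).reshape(-1, 1).astype(float) for s in seqs]
        X, lengths = np.concatenate(discretised), [len(d) for d in discretised]
        models[task] = hmm.GaussianHMM(n_components=n_states,
                                       covariance_type="full").fit(X, lengths)
    return kmeans, models

def predict_task(seq, kmeans, models):
    """Predict the task whose HMM assigns the highest log-likelihood."""
    obs = kmeans.predict(seq).reshape(-1, 1).astype(float)
    return max(models, key=lambda task: models[task].score(obs))
```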

We also compare against a k-nearest neighbours time series classifier (k-NN TSC) [10] to investigate whether raw gaze points are better for task classification than the coarse representation used in the HMM. k-NN TSC calculates the distance between each pair of time-series and classifies a time-series based on the class most common amongst its k nearest neighbours.
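One way to realise this baseline is with tslearn's k-NN time-series classifier, as in the sketch below; the toy arrays stand in for the real scanpath clips and are purely illustrative.

```python
import numpy as np
from tslearn.neighbors import KNeighborsTimeSeriesClassifier

# Toy data standing in for scanpath clips: (n_clips, 100 frames, 3 features).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((30, 100, 3)), rng.integers(0, 3, size=30)
X_test = rng.random((6, 100, 3))

# k = 2 with dynamic time warping, following the parameter selection below.
knn = KNeighborsTimeSeriesClassifier(n_neighbors=2, metric="dtw")
knn.fit(X_train, y_train)
print(knn.predict(X_test))
```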

Parameter Selection

For the HMM, we use the elbow method [19] to determine the optimal number of clusters for k-means (k = 5), and a Gaussian HMM with a full covariance matrix. For k-NN TSC, we use dynamic time warping as the distance metric, based on previous work [2], with k = 2. For our GRU model, we use Ray Tune's [12] asynchronous successive halving algorithm (ASHA) [11] to tune the hyperparameters. The selected hyperparameters are: a hidden size of 32, 2 recurrent layers, 250 epochs, a batch size of 4, and dropout of 0.55, using an Adam optimiser with a learning rate of 0.003 and a cross-entropy loss function.
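The following sketch shows how such an ASHA search could be set up with Ray Tune; the search space, the number of samples, and the stand-in train_gru trainable are hypothetical and only illustrate the tuning mechanism, not the exact configuration used for the reported results.

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_gru(config):
    # Stand-in trainable: a real implementation would build and train the GRU
    # with `config` and report a validation metric each epoch.
    for epoch in range(10):
        tune.report(val_f1=0.5 + 0.01 * epoch)

search_space = {
    "hidden_size": tune.choice([16, 32, 64]),
    "num_layers": tune.choice([1, 2, 3]),
    "dropout": tune.uniform(0.1, 0.6),
    "lr": tune.loguniform(1e-4, 1e-2),
    "batch_size": tune.choice([4, 8, 16]),
}
analysis = tune.run(
    train_gru,
    config=search_space,
    num_samples=20,
    scheduler=ASHAScheduler(metric="val_f1", mode="max", max_t=250),
)
print(analysis.get_best_config(metric="val_f1", mode="max"))
```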

3. Data

The dataset consists of manually labelled second-trimester abdomen, brain, and heart planes, as described in detail in [6]. The labelled dataset contains the last 100 frames just before the sonographer freezes the video to take measurements of the captured anatomy plane [6]. A clip is defined as a single instance where the anatomy plane was searched for during a scan. We use sonographer eye-tracking gaze data (Tobii Eye Tracker 4C) sampled at 90 Hz.

We choose the abdomen, brain, and heart plane scans because sonographers spent the majority of their time viewing these planes [16, 17], and the abdomen and brain planes are considered easier to search for than the heart due to differences in anatomy size. Hence, we would expect the scanning characteristics of these anatomies to be distinct from each other. In total, there are 84, 160, and 122 abdomen, brain, and heart plane clips, respectively. Our dataset consisted of 10 fully qualified sonographers carrying out the ultrasound scans, and 76 unique pregnant women as participants.

Clips shorter than 100 frames, which occur when the gap between the previous frozen segment and the next is less than 100 frames, were zero-padded to create equal-length time-series of 100 gaze points. Any missing gaze points due to sampling discrepancies or tracking errors were interpolated.
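A small sketch of this preprocessing step is shown below; whether padding is applied at the start or the end of a clip is not specified above, so padding at the start is an assumption here, and the use of pandas for interpolation is simply one convenient choice.

```python
import numpy as np
import pandas as pd

def prepare_clip(gaze_xy, target_len=100):
    """Interpolate missing gaze samples and zero-pad short clips to 100 frames.
    gaze_xy: (T, 2) array with NaNs where the eye-tracker lost the gaze."""
    filled = pd.DataFrame(gaze_xy).interpolate(limit_direction="both").to_numpy()
    if len(filled) < target_len:  # zero-pad at the start (assumed convention)
        pad = np.zeros((target_len - len(filled), filled.shape[1]))
        filled = np.vstack([pad, filled])
    return filled[-target_len:]   # keep the last 100 frames
```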

4. Results

To increase the size of our dataset, we augmented the images by flipping them about the horizontal axis, the vertical axis, and both axes. We also randomly downsampled the data with respect to the minority class, to prevent bias towards the majority classes. Our final dataset has 336 clips each for the abdomen, brain and heart planes. For training and testing we performed a 3-fold stratified cross-validation, using 80% of the data for cross-validation and 20% for tuning model parameters.
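The augmentation flips the images themselves; for gaze features already normalised to the bounding box, an equivalent operation is to reflect the normalised coordinates, which is how the sketch below implements it (assuming coordinates in [0, 1]; the area feature A is left unchanged). This is an illustrative stand-in rather than the exact augmentation pipeline.

```python
import numpy as np

def flip_augment(features):
    """Augment one clip of normalised features (T, 3) = [Gx_BB, Gy_BB, A] by
    mirroring the gaze about the horizontal axis, the vertical axis, and both."""
    gx, gy, a = features[:, 0], features[:, 1], features[:, 2]
    h = np.stack([gx, 1 - gy, a], axis=1)       # flip about the horizontal axis
    v = np.stack([1 - gx, gy, a], axis=1)       # flip about the vertical axis
    hv = np.stack([1 - gx, 1 - gy, a], axis=1)  # flip about both axes
    return [features, h, v, hv]
```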

We used different sets of features (shown in Tab. 1) to demonstrate that our proposed bounding-box normalisation and our proposed model perform better than current baseline models. Our classification results are shown in Tab. 1. For ease of reference, the 'Affix' column of Tab. 1 refers to the corresponding features used: raw for raw gaze points, scr for gaze points normalised by screen dimensions, and bb for gaze points normalised by the bounding box.

Table 1.

Comparison of weighted F1 scores and accuracies calculated using HMM [1], k-NN TSC, GRU to classify abdomen, heart and brain plane clips. Affix refers to the abbreviation for features used.

Model | Affix | Features | Weighted-F1 | Accuracy
HMM | raw | Gx, Gy | 0.38±0.20 | 0.49±0.19
HMM | scr | Gxs, Gys | 0.38±0.07 | 0.45±0.09
k-NN TSC | raw | Gx, Gy | 0.57±0.05 | 0.57±0.06
k-NN TSC | scr | Gxs, Gys | 0.55±0.04 | 0.55±0.04
k-NN TSC | scr+A | Gxs, Gys, A | 0.52±0.05 | 0.54±0.04
k-NN TSC | bb+A | GxBB, GyBB, A | 0.63±0.03 | 0.64±0.02
GRU | raw | Gx, Gy | 0.56±0.05 | 0.57±0.05
GRU | scr | Gxs, Gys | 0.68±0.04 | 0.67±0.04
GRU | scr+A | Gxs, Gys, A | 0.72±0.05 | 0.72±0.05
GRU | bb+A | GxBB, GyBB, A (ours) | 0.84±0.01 | 0.83±0.01

Table 1 shows that our proposed feature-engineering method for normalising raw eye-tracking data, combined with our model, performs better than previous work [1] and several baselines, returning a weighted F1 score of 0.84.

Table 1 also shows that the approach of [1] is unable to classify fetal ultrasound tasks: HMM(raw) and HMM(scr) return score metrics between 38% and 49%. Using k-NN TSC and GRU models instead improves the task classifier's performance by at least 20% (HMM(raw) and HMM(scr) versus k-NN TSC(raw) and k-NN TSC(scr), and GRU(raw) and GRU(scr), respectively).

Normalisation using the bounding box shows an improvement of at least 10% (Tab. 1, GRU(scr+A) versus GRU(bb+A), and k-NN TSC(scr+A) versus k-NN TSC(bb+A)), returning a final F1 score of 84%. There is a slight decrease (1%-3%) in performance when including the size of the anatomy relative to the screen for the k-NN TSC model, but a slight increase (4%) for the GRU when comparing scr and scr+A; the GRU is better able to use the anatomy size information than k-NN TSC. Overall, normalising gaze points with respect to the anatomy circumference is more indicative of the task type than how much space the anatomy occupies on the screen.

The confusion matrix for our GRU(bb+A) model is shown in Fig. 4. Figure 4 shows that abdomen scanpaths are the most likely to be confused with brain and heart scanpaths, where 13% and 20% of abdomen scanpaths are misclassified as heart and brain scanpaths, respectively. Brain scanpaths are also likely to be confused with abdomen scanpaths, where 12% are predicted as abdomen scanpaths. Heart scanpaths are the most distinct, where only 3% are misclassified.
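For reference, a row-normalised confusion matrix of this kind can be computed with scikit-learn as in the snippet below; the label arrays here are placeholders rather than the actual test-set predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder labels: 0 = abdomen, 1 = brain, 2 = heart.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 0, 1, 0, 2, 2, 2])

# normalize="true" divides each row by the number of true clips of that class,
# matching the per-anatomy normalisation used in Fig. 4.
print(np.round(confusion_matrix(y_true, y_pred, normalize="true"), 2))
```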

Fig. 4. Confusion matrix of our GRU model normalised with respect to total number of clips per anatomy plane in the test set (106 clips).


Class Imbalance Models

Since our initial dataset was imbalanced, with the most clips available for brain scanpaths and the fewest for abdomen scanpaths, we ran the GRU(bb+A) model using the focal loss [13] function, which accounts for class imbalance when training the model, and compared the results with our original cross-entropy loss, which does not. We also compared the effect of augmenting our dataset.
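For completeness, a common multi-class formulation of the focal loss [13] is sketched below in PyTorch; the focusing parameter gamma = 2 is the default from the original focal loss paper, and the exact variant and class weighting used for the results in Tab. 2 are not specified here, so this is an illustrative sketch rather than the precise implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: down-weights well-classified examples so that
    training focuses on harder (often minority-class) clips."""
    log_p = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_p, targets, reduction="none")  # per-sample cross entropy
    p_t = torch.exp(-ce)                               # probability of the true class
    return ((1 - p_t) ** gamma * ce).mean()
```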

Our results in Tab. 2 show that using an augmented dataset does not affect performance when comparing the balanced (iii) and imbalanced (iv) datasets. The effect of downsampling is seen when considering a smaller dataset (ii), with a drop of 2-3%. Using focal loss returns more consistent results than cross entropy, where the original dataset (i) returns a lower standard deviation across folds compared to the downsampled dataset (ii). Overall, for our application, using an augmented, downsampled dataset did not negatively affect the performance of our model ((iii) and (iv)), while increasing the size of our dataset through augmentation improved performance by 4-5%.

Table 2.

Weighted F1 scores and accuracies using our proposed GRU model, comparing the original, downsampled (DS) and augmented (Aug) datasets for classification of abdomen, brain and heart planes: (i) original; (ii) original, downsampled; (iii) original, augmented and downsampled (our proposed model in Tab. 1); and (iv) original, augmented.

Dataset | Loss function | Weighted-F1 | Accuracy
(i) Original | Focal loss | 0.81±0.01 | 0.81±0.02
(ii) Original + DS | Cross entropy | 0.79±0.04 | 0.78±0.05
(iii) Original + Aug + DS (proposed) | Cross entropy | 0.84±0.01 | 0.83±0.01
(iv) Original + Aug | Focal loss | 0.83±0.01 | 0.81±0.02

5. Discussion

A qualitative investigation was performed to understand why brain and heart scanpaths are more likely to be confused with abdomen scanpaths, and why abdomen and brain scanpaths are misclassified as each other more often than as the heart. We show the qualitative investigation plot for abdomen scanpaths (Fig. 5) since they show the highest percentage of misclassification. Figure 5 shows a contour density plot (left) of training and correctly classified abdomen plane gaze points GxBB, GyBB with cumulative density masses at 4 equally spaced levels (0.2, 0.4, 0.6, 0.8), where 0.2 is the outermost contour and 0.8 is the innermost contour. The bivariate distribution was calculated by superimposing a Gaussian kernel on each gaze point and returning a normalised cumulative sum.
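A density plot of this kind can be produced with a Gaussian kernel density estimate, as in the sketch below; for simplicity, the sketch draws contours at normalised density levels rather than exact cumulative probability masses, and the grid resolution is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_gaze_density(gx_bb, gy_bb, levels=(0.2, 0.4, 0.6, 0.8)):
    """Contour plot of the bivariate gaze density over normalised gaze points."""
    kde = gaussian_kde(np.vstack([gx_bb, gy_bb]))     # Gaussian kernel per point
    xs, ys = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))
    zz = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
    plt.contour(xs, ys, zz / zz.max(), levels=levels)  # relative density levels
    plt.xlabel("Gx_BB")
    plt.ylabel("Gy_BB")
    plt.show()
```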

Fig. 5. Contour density plot of GxBB, GyBB (left), and abdomen scanpaths which were incorrectly classified as brain (orange, i and ii) or heart (blue, iii and iv) scanpaths (right).


Figure 5 also shows abdomen scanpaths (right) which were incorrectly classified as a brain (orange) or heart (blue) scanpath.

Figure 5 shows that for abdomen planes, sonographer scanpaths are concentrated within the central area of the anatomy. However, for abdomen scanpaths predicted as heart, the sonographer focused on a single area (Fig. 5, iii and iv) similar to how sonographers visually search for the heart. For scanpaths predicted as brain, the sonographer moved the probe, causing their gaze to shift accordingly with the image (ii), or had moved their gaze across the screen (i) similar to how sonographers search for the brain.

For misclassified brain scanpaths, the image was small and occupied <50% of the screen, and the sonographer did not focus along the midline horizontally but diagonally across the plane. Misclassified heart scanpaths showed that the image itself was moving, indicating that the probe was moving, causing the sonographer to shift their gaze accordingly or the sonographer was looking around the walls of the heart cavity.

6. Conclusion

In this paper, we have presented a method for normalising eye-tracking data with respect to the circumference of the anatomy, which accounts for changes in the position and scale of the anatomy image during the scan. With our method, we have improved the score metrics of task classification using eye-tracking data by at least 15% compared to other methods. We also present a GRU model which performed better than other classification methods, such as the k-means and HMM approach of [1] or k-NN TSC, showing an improvement of at least 20% in accuracy.

Acknowledgements

We acknowledge the ERC (Project PULSE: ERC-ADG-2015 694581) and the NIHR Oxford Biomedical Research Centre.

References

  • 1.Ahmidi N, Hager GD, Ishii L, Fichtinger G, Gallia GL, Ishii M. Surgical task and skill classification from eye tracking and tool motion in minimally invasive surgery. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), LNCS PART 3. 2010;6363:295–302. doi: 10.1007/978-3-642-15711-0_37. [DOI] [PubMed] [Google Scholar]
  • 2.Cai Y, Droste R, Sharma H, Chatelain P, Drukker L, Papageorghiou AT, Noble JA. Spatio-temporal visual attention modelling of standard biometry plane-finding navigation. Medical Image Analysis. 2020;65 doi: 10.1016/j.media.2020.101762. [DOI] [PubMed] [Google Scholar]
  • 3.Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation; EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference; 2014. pp. 1724–1734. [DOI] [Google Scholar]
  • 4.Droste R, Cai Y, Sharma H, Chatelain P, Papageorghiou A, Noble J. Towards capturing sonographic experience: cognition-inspired ultrasound video saliency prediction; MIUA: Annual Conference on Medical Image Understanding and Analysis; Springer Verlag; pp. 174–186. [Google Scholar]
  • 5.Droste R, Chatelain P, Drukker L, Sharma H, Papageorghiou AT, Noble JA. Discovering Salient Anatomical Landmarks by Predicting Human Gaze; Proceedings -International Symposium on Biomedical Imaging; 2020. pp. 1711–1714. 2020-April. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Drukker L, Sharma H, Droste R, Alsharid M, Chatelain P, Noble JA, Papageorghiou AT. Transforming obstetric ultrasound into data science using eye tracking, voice recording, transducer motion and ultrasound video. Scientific Reports. 2021;11(1):14109. doi: 10.1038/s41598-021-92829-1. doi: 10.1038/s41598-021-92829-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Ebeid IA, Bhattacharya N, Gwizdka J, Sarkar A. Analyzing gaze transition behavior using Bayesian mixed effects markov models; Eye Tracking Research and Applications Symposium (ETRA); 2019. [DOI] [Google Scholar]
  • 8.Fuhl W, Castner N, Kübler T, Lotz A, Rosenstiel W, Kasneci E. Ferns for area of interest free scanpath classification; Eye Tracking Research and Applications Symposium (ETRA); 2019. [DOI] [Google Scholar]
  • 9.Hild J, Kühnle C, Voit M, Beyerer J. Predicting observer’s task from eye movement patterns during motion image analysis; Eye Tracking Research and Applications Symposium (ETRA); 2018. [DOI] [Google Scholar]
  • 10.Lee YH, Wei CP, Cheng TH, Yang CT. Nearest-neighbor-based approach to time-series classification. Decision Support Systems. 2012;53(1):207–217. doi: 10.1016/j.dss.2011.12.014. https://www.sciencedirect.com/science/article/pii/S0167923612000097 . [DOI] [Google Scholar]
  • 11.Li L, Jamieson KG, Rostamizadeh A, Gonina E, Hardt M, Recht B, Talwalkar A. Massively Parallel Hyperparameter Tuning. CoRR. 2018;abs/1810.05934. http://arxiv.org/abs/1810.05934 . [Google Scholar]
  • 12.Liaw R, Liang E, Nishihara R, Moritz P, Gonzalez JE, Stoica I. Tune: A Research Platform for Distributed Model Selection and Training. 2018 [Google Scholar]
  • 13.Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal Loss for Dense Object Detection. 2018 doi: 10.1109/TPAMI.2018.2858826. [DOI] [PubMed] [Google Scholar]
  • 14.Openvinotoolkit: openvinotoolkit/cvat. https://github.com/openvinotoolkit/cvat .
  • 15.Public Health England (PHE) NHS Fetal Anomaly Screening Programme Handbook (August) 2018. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachmentdata/file/749742/NHSfetalanomalyscreeningprogrammehandbookFINAL1.218.10.18.pdf .
  • 16.Sharma H, Droste R, Chatelain P, Drukker L, Papageorghiou AT, Noble JA. Spatio-Temporal Partitioning And Description Of Full-Length Routine Fetal Anomaly Ultrasound Scans; 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019); 2019. pp. 987–990. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Sharma H, Drukker L, Chatelain P, Droste R, Papageorghiou AT, Noble JA. Knowledge Representation and Learning of Operator Clinical Workflow from Full-length Routine Fetal Ultrasound Scan Videos. Medical Image Analysis. 2021;69:101973. doi: 10.1016/j.media.2021.101973. http://www.sciencedirect.com/science/article/pii/S1361841521000190 . [DOI] [PubMed] [Google Scholar]
  • 18.Sharma H, Drukker L, Papageorghiou AT, Noble JA. Multi-Modal Learning from Video, Eye Tracking, and Pupillometry for Operator Skill Characterization in Clinical Fetal Ultrasound; 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI); 2021. pp. 1646–1649. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Thorndike RL. Who belongs in the family? Psychometrika. 1953 Dec;18(4):267–276. doi: 10.1007/bf02289263. doi: 10.1007/bf02289263. [DOI] [Google Scholar]
  • 20.Yamak PT, Yujian L, Gadosey PK. A comparison between ARIMA, LSTM, and GRU for time series forecasting; ACM International Conference Proceeding Series; 2019. pp. 49–55. [DOI] [Google Scholar]
