Abstract
This paper presents a novel multi-modal learning approach for automated skill characterization of obstetric ultrasound operators using heterogeneous spatio-temporal sensory cues, namely, scan video, eye-tracking data, and pupillometric data, acquired in the clinical environment. We address pertinent challenges, such as combining heterogeneous, small-scale, and variable-length sequential datasets, to train deep convolutional neural networks in real-world scenarios. We propose spatial encoding for multi-modal analysis using sonography standard plane images, spatial gaze maps, gaze trajectory images, and pupillary response images. We present and compare five multi-modal learning network architectures using late, intermediate, hybrid, and tensor fusion. We build models for the Heart and the Brain scanning tasks, and performance evaluation suggests that multi-modal learning networks outperform uni-modal networks, with the best-performing model achieving accuracies of 82.4% (Brain task) and 76.4% (Heart task) for the operator skill classification problem.
Index Terms: Multi-modal learning, ultrasound, convolutional neural networks, eye tracking, pupillometry
1. Introduction
Obstetric ultrasound scanning is recognized as a highly skilled task, requiring years to master well. Ultrasound operator skill assessment and characterization can form part of initial training on simulators, but has not been studied extensively in the clinic using objective computer-aided methods. In the emerging parallel field of surgical data science [1], such methods have been proposed for surgical skill assessment and evaluation. Our paper explores a similar perspective under sonography data science, where novel multi-modal deep learning networks are designed to automatically classify operator skills using heterogeneous spatio-temporal sensory cues acquired in routine settings, namely, the scan video recording, eye-tracking data, and pupillometric data, in the context of second-trimester fetal ultrasound scanning. The acquired video appearance provides knowledge of ‘what’ and ‘how well’ the operator captures information; the eye tracking enables the understanding of ‘where’ the operator looks, and pupillometry determines ‘how hard’ the operator concentrates during scanning. Specifically, the quality of the captured standard plane (SP) in the scan video, determined by the appearance of certain mandatory anatomical landmarks, is a potential indicator of operator skill. Gaze has been shown to be informative for differentiating the visual expertise and behavior of experts, trainees, and novices in radiology [2], and can be indicative of ultrasound operator skill as well. Pupillometry, the study of eye pupil diameter changes, correlates with cognitive workload [3], and medical professionals with varying skill levels have been shown to exhibit distinct pupillary responses, for example, in emergency medicine [4] and ultrasonography [5]. Thus, we hypothesize that these novel sensory cues can help discriminate between newly qualified and experienced ultrasound imaging operators.
In the literature, the most widely explored modalities for multi-modal learning include images, audio, video, and text [6]. The analysis of heterogeneous multi-modal data acquired in real-world settings, such as that used in our work, presents two unique challenges. Firstly, it is well known that deep learning models require large-scale data for successful training. However, clinical multi-modal data acquired in specialized acquisition setups is often small-scale. Further, considering gaze and pupillary response as examples of sequential data, pre-trained models in similar domains do not exist, which means deep temporal models (e.g., LSTM, 1D CNN) need to be built from scratch. This may lead to overfitting, or even training failure (as empirically observed in this work). The problem is compounded by the naturally variable length of the sequences, also found in our data, which requires length adjustment such as zero-padding. To address this challenge, we propose spatial encoding of the limited variable-length sequential data into fixed-size images, followed by transfer learning on pre-trained image-based CNN models for operator skill characterization. The second challenge is how to combine the sensory cues for end-to-end multi-modal learning. For instance, late fusion has shown success on benchmark datasets (e.g., Kinetics, mini-Sports), but is also associated with overfitting [7]. Late fusion models learn intra-modal dynamics; however, inter-modal interactions allow information to be exchanged between multi-modal CNN layers. In this work, we implement late fusion to explore intra-modal learning, intermediate fusion to capture inter-modal interactions, and hybrid fusion to combine both benefits. Tensor fusion was introduced to model inter-modal dynamics by explicitly aggregating uni-modal, bi-modal, and tri-modal interactions between three modalities [8]. Although tensor fusion is computationally more expensive for the four modalities in this work, we investigate two efficient CNN architectures that use tensor fusion.
The main contributions of the paper are: 1) We propose a novel and comprehensive multi-modal analysis pipeline using heterogeneous spatio-temporal sensory cues, namely, scan video, eye-tracking data, and pupillometric data, acquired from routine obstetrics ultrasound, for automatic operator skill characterization. 2) We propose methods to encode limited-size and variable-length sequential datasets to enable transfer learning and alleviate problems associated with complex training from scratch. 3) We perform an ablation study of the uni-modal CNN models, and compare five end-to-end multi-modal CNN architectures.
2. Methods
2.1. Multi-modal Data Acquisition
The multi-modal data came from the Perception Ultrasound by Learning Sonographic Experience (PULSE) study.¹ Routine full-length second-trimester ultrasound scan videos were recorded along with synchronized eye tracking of the operators [9]. The full-length scan videos were temporally partitioned into variable-length video clips, each spanning from the frames preceding the first automatically detected freeze to the last consecutive freeze frame. The gaze and pupillary response sequences corresponding to the extracted video clips were obtained from the spatial gaze points (relative x and y coordinates) and pupil diameters with matching timestamps output by the eye tracker [10].
We selected 370 scans undertaken by 12 operators for this study. Operators were identified as newly qualified (NQ, 3 operators, ≤ 2 years’ experience, 225 scans) or experienced (XP, 9 operators, > 2 years’ experience, 145 scans). We selected the ‘Brain’ and the ‘Heart’ scanning tasks in the ultrasound scans for further analysis [11], because these tasks require the assessment of the fetal brain using two SPs and the fetal heart using five SPs, respectively, which can prove challenging for ultrasound operators, thereby allowing differentiation of their skills. These were found to be the most commonly occurring tasks [12, 13], indicating that the operators spent most time on them. A total of 2,309 video clips and sequences (732 Brain, 1,577 Heart) were extracted from the full-length scan videos and eye-tracking data, respectively.
2.2. Spatial Data Encoding
The raw multi-modal data consist of N video clips, with corresponding gaze sequences and pupillary response sequences. The nth video clip with K video frames is $V_n = \{v_n^k\}_{k=1}^{K}$, where the kth frame is $v_n^k \in \mathbb{R}^{H \times W \times C}$ ($H = 224$, $W = 288$, $C = 3$) after cropping and scaling the relevant imaging area from the ultrasound screen RGB image. From the multi-modal sensory cues, we generate four types of data inputs to train the multi-modal skill characterization models, which are described and spatially encoded as follows.
- Standard Plane (SP) Image: The nth SP image is $S_n \in \mathbb{R}^{H \times W \times C}$, a frame randomly extracted from the freeze part of $V_n$, as the appearance of the captured image shows negligible variance during the interpretation of the frozen segment. The quality of appearance of the captured SP image is indicative of operator skill.
- Spatial Gaze Map: The nth gaze point sequence is $P_n = \{p_n^k\}_{k=1}^{K}$, where $p_n^k = (x_n^k, y_n^k)$ is the kth gaze point coordinate. Gaze points in $P_n$ are temporally accumulated to obtain a spatial point map $A_n$ (Eqn. 1) and the spatial gaze map $G_n$ as defined in Eqn. 2:

$$A_n(x, y) = \sum_{k=1}^{K} \mathbb{1}\left[(x, y) = p_n^k\right], \tag{1}$$

$$G_n = A_n * K(\sigma), \tag{2}$$

where $K(\sigma)$ is a 2D Gaussian kernel with $\sigma$ = 1.5° visual angle corresponding to the foveal spread of the human eye [10]. $G_n$ encodes spatial aspects of operator gaze in the cumulative dwell time, such as locations of visual attention, number of regions of interest, and spatial dispersion of gaze (a code sketch follows this list).

- Gaze Trajectory Image: From $P_n$, a weighted line graph with edges $L_n = \{l_n^k\}_{k=1}^{K-1}$ is generated, where $l_n^k$ represents an edge between successive gaze points $p_n^k$ and $p_n^{k+1}$. Edge $l_n^k$ is weighted by $w_n^k$, which increases with $k$. The corresponding gaze trajectory image $T_n$ consists of all edges in $L_n$ drawn with weighted gray levels. The weights give higher importance to gaze points traversed relatively later in time, when the operator is interpreting frozen frames. The image encodes spatio-temporal gaze information such as the scan-path pattern and the fraction of cumulative gaze that was fixated (ratio of close to dispersed edges).
- Pupillary Response Image: The nth pupil diameter sequence is $D_n = \{d_n^k\}_{k=1}^{K}$, where $d_n^k$ is the kth mean pupil diameter of the left and right eyes. The sequences are first pre-processed according to bespoke guidelines [14] to filter noisy and artifact samples. Then, a task-evoked pupillary response (TEPR) sequence $E_n = \{e_n^k\}_{k=1}^{K}$ is computed from the processed $D_n$ [3]:

$$e_n^k = d_n^k - d_r, \tag{3}$$

where $d_r$ represents a rest pupil diameter, which is the minimum pupil diameter for a given scan. The TEPR sequence $E_n$ is encoded into the image $R_n$ using the summation and difference Gramian Angular Fields (GAF) and the Markov Transition Field (MTF) [15] as the RGB channels, where the GAF encodes static information, while the MTF depicts the dynamics embedded in the raw time series (see the sketch after this list).
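As referenced above, the following is a minimal sketch of the spatial gaze map and gaze trajectory encodings in Python (NumPy, SciPy, OpenCV). The function names, the pixel value used for the 1.5° kernel width, and the linear edge weighting are our assumptions for illustration; they are not the authors' released code.

```python
import numpy as np
import cv2
from scipy.ndimage import gaussian_filter

H, W = 224, 288  # cropped ultrasound image size (Section 2.2)

def spatial_gaze_map(gaze_xy, sigma_px=25.0):
    """Eqns. 1-2: accumulate relative gaze coordinates into a point map and
    smooth with a 2D Gaussian approximating the 1.5-degree foveal spread.
    gaze_xy: (K, 2) array of relative (x, y) gaze coordinates in [0, 1]."""
    point_map = np.zeros((H, W), dtype=np.float32)
    cols = np.clip((gaze_xy[:, 0] * (W - 1)).astype(int), 0, W - 1)
    rows = np.clip((gaze_xy[:, 1] * (H - 1)).astype(int), 0, H - 1)
    np.add.at(point_map, (rows, cols), 1.0)      # cumulative dwell counts
    gaze_map = gaussian_filter(point_map, sigma=sigma_px)
    return gaze_map / (gaze_map.max() + 1e-8)    # normalize to [0, 1]

def gaze_trajectory_image(gaze_xy):
    """Draw edges between successive gaze points; later edges are drawn with
    higher gray levels (weight increasing with the temporal index k)."""
    img = np.zeros((H, W), dtype=np.uint8)
    pts = np.stack([gaze_xy[:, 0] * (W - 1), gaze_xy[:, 1] * (H - 1)], axis=1)
    pts = pts.astype(int).tolist()
    K = len(pts)
    for k in range(K - 1):
        gray = int(255 * (k + 1) / max(K - 1, 1))  # assumed linear weighting
        cv2.line(img, tuple(pts[k]), tuple(pts[k + 1]), color=gray, thickness=1)
    return img
```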
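Similarly, a sketch of the pupillary response encoding is given below. It assumes the pyts library for the Gramian Angular Field and Markov Transition Field transforms of [15] (the paper does not name a toolkit), and it computes the TEPR of Eqn. 3 as a baseline-subtracted sequence with the rest diameter taken over the input sequence rather than the whole scan, for brevity.

```python
import numpy as np
from pyts.image import GramianAngularField, MarkovTransitionField

def pupillary_response_image(pupil_diam, image_size=224):
    """pupil_diam: 1D array of pre-processed mean pupil diameters.
    Returns an image_size x image_size x 3 image (GASF, GADF, MTF channels).
    Note: image_size must not exceed the sequence length for pyts."""
    d_rest = pupil_diam.min()            # rest diameter (paper: minimum over the scan)
    tepr = pupil_diam - d_rest           # task-evoked pupillary response (Eqn. 3)
    x = tepr[np.newaxis, :]              # pyts expects (n_samples, n_timestamps)

    gasf = GramianAngularField(image_size=image_size, method='summation')
    gadf = GramianAngularField(image_size=image_size, method='difference')
    mtf = MarkovTransitionField(image_size=image_size, n_bins=8)
    channels = [t.fit_transform(x)[0] for t in (gasf, gadf, mtf)]
    return np.stack(channels, axis=-1)   # "RGB" pupillary response image R_n
```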
Hence, the resulting nth input data instance is a 4-tuple $I_n = \{S_n, G_n, T_n, R_n\}$. A ResNet-18 CNN architecture [16] was selected for transfer learning via fine-tuning of pre-trained models to perform operator skill characterization, as it provides a good balance between computational complexity and classification accuracy. The resulting ensemble of uni-modal CNNs for the $J$ data sources is $\{M_j\}_{j=1}^{J}$, with $J = 4$.
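As a concrete example of the transfer-learning setup, the sketch below builds one uni-modal branch by fine-tuning an ImageNet pre-trained ResNet-18 for the two-class (NQ vs. XP) problem in PyTorch; the optimizer and learning rate are illustrative, not the paper's exact training settings.

```python
import torch
import torch.nn as nn
from torchvision import models

def make_unimodal_cnn(num_classes=2):
    """One uni-modal CNN M_j: pre-trained ResNet-18 with a new 2-class head."""
    model = models.resnet18(weights="IMAGENET1K_V1")          # ImageNet pre-trained backbone
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # replace the classifier head
    return model

model = make_unimodal_cnn()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # fine-tune all layers
criterion = nn.CrossEntropyLoss()                                       # softmax + NLL
```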
2.3. Multi-modal Learning
Five multi-modal fusion CNN architectures are explored to learn skill characterization models. Consider a feature extractor $f_X^j(\cdot)$ corresponding to layer $X$ of a uni-modal CNN $M_j$. The five multi-modal CNN architectures are defined as follows.
Late fusion CNN (LF-CNN): Here, $X$ is the ‘pool5’ (average pool) layer of the ResNet-18 CNN. The LF-CNN model consists of $J$ uni-modal feature extractors followed by a fusion layer $F_{LF} = c(f_X^1, \ldots, f_X^J)$, where $c(\cdot)$ represents the concatenation operation. The resulting feature vector is input to dropout, fully connected, and softmax layers. This CNN architecture provides intra-modal interactions before a late fusion during training.
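A minimal PyTorch sketch of this late-fusion design follows; the layer sizes match the description above (four 512-d ‘pool5’ features concatenated), while the dropout rate and variable names are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class LateFusionCNN(nn.Module):
    """LF-CNN: concatenate 'pool5' features of J uni-modal ResNet-18 branches."""
    def __init__(self, num_modalities=4, num_classes=2, p_drop=0.5):
        super().__init__()
        self.branches = nn.ModuleList()
        for _ in range(num_modalities):
            backbone = models.resnet18(weights="IMAGENET1K_V1")
            # keep everything up to and including the average pool ('pool5')
            self.branches.append(nn.Sequential(*list(backbone.children())[:-1]))
        self.classifier = nn.Sequential(
            nn.Dropout(p_drop),
            nn.Linear(512 * num_modalities, num_classes),  # softmax is applied in the loss
        )

    def forward(self, inputs):  # inputs: list of 4 image batches, each (B, 3, H, W)
        feats = [branch(x).flatten(1) for branch, x in zip(self.branches, inputs)]
        return self.classifier(torch.cat(feats, dim=1))    # late fusion by concatenation
```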
Intermediate fusion CNN (IF-CNN): Here, $X$ is the ‘res3b-relu’ layer (last layer of the third convolutional block) of the ResNet-18 CNN. The IF-CNN model consists of $J$ uni-modal feature extractors followed by a fusion layer $F_{IF} = c_D(f_X^1, \ldots, f_X^J)$, where $c_D(\cdot)$ is a depth-concatenation operation. The resulting intermediate feature maps are input to lightweight, randomly initialized CNN layers comprising a convolution layer (7 × 7 × 512), batch normalization, and ReLU, followed by global average pooling (GAP), dropout, fully connected, and softmax layers. This CNN architecture offers inter-modal interactions during training due to the intermediate, depth-concatenated feature fusion.
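The sketch below mirrors this intermediate-fusion design in PyTorch. Mapping ‘res3b-relu’ to the output of torchvision's layer3 (256 channels per branch) is our assumption, as are the padding and dropout choices.

```python
import torch
import torch.nn as nn
from torchvision import models

class IntermediateFusionCNN(nn.Module):
    """IF-CNN: depth-concatenate third-block features, then a lightweight head."""
    def __init__(self, num_modalities=4, num_classes=2, p_drop=0.5):
        super().__init__()
        self.branches = nn.ModuleList()
        for _ in range(num_modalities):
            backbone = models.resnet18(weights="IMAGENET1K_V1")
            # conv1 .. layer3 (third convolutional block); drop layer4, avgpool, fc
            self.branches.append(nn.Sequential(*list(backbone.children())[:-3]))
        self.head = nn.Sequential(
            nn.Conv2d(256 * num_modalities, 512, kernel_size=7, padding=3),  # 7x7x512 conv
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),   # global average pooling (GAP)
            nn.Flatten(),
            nn.Dropout(p_drop),
            nn.Linear(512, num_classes),
        )

    def forward(self, inputs):
        feats = [branch(x) for branch, x in zip(self.branches, inputs)]
        return self.head(torch.cat(feats, dim=1))  # depth-concatenation fusion
```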
Hybrid fusion CNN (HF-CNN): In this CNN, fusion layers are first obtained for late and intermediate fusion as $F_{X_1}$ and $F_{X_2}$, where $X_1$ and $X_2$ represent the CNN layers ‘pool5’ and ‘GAP’ of the two fusion CNN models described above. The subsequent fusion layer is $F_{HF} = c(F_{X_1}, F_{X_2})$. The resulting hybrid feature vector is input to dropout, fully connected, and softmax layers. This CNN architecture combines the benefits of intra-modal and inter-modal interactions during training.
4-way Tensor fusion CNN (TF-4M-CNN): Here, $X$ is the ‘pool5’ layer of the ResNet-18 CNN. The TF-4M-CNN model consists of $J$ uni-modal feature extractors, each followed by a randomly initialized uni-modal fully connected layer of 32 neurons, and a 4-input fusion layer $F_{4T} = c_{4T}(f^1, f^2, f^3, f^4)$, where $c_{4T}(\cdot)$ represents a 4-way tensor fusion operation. The tensor fusion layer computes the outer product between the individual representations [8], and captures uni-modal, bi-modal, tri-modal, and quadri-modal dynamics. The resulting 4-D feature tensor is input to GAP, dropout, fully connected, and softmax layers.
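The fusion step can be written compactly as an outer product. The sketch below augments each 32-d embedding with a constant 1 before taking the 4-way outer product, following [8]; the einsum-based implementation and the dummy batch are our own choices, not the paper's code.

```python
import torch

def tensor_fusion_4way(f1, f2, f3, f4):
    """f1..f4: (B, 32) uni-modal embeddings -> (B, 33, 33, 33, 33) fused tensor
    capturing uni-, bi-, tri- and quadri-modal interactions."""
    ones = torch.ones(f1.size(0), 1, device=f1.device)
    z1, z2, z3, z4 = (torch.cat([f, ones], dim=1) for f in (f1, f2, f3, f4))
    return torch.einsum('bi,bj,bk,bl->bijkl', z1, z2, z3, z4)

# Example: four dummy 32-d embeddings for a batch of 8 clips.
fused = tensor_fusion_4way(*[torch.randn(8, 32) for _ in range(4)])
print(fused.shape)  # torch.Size([8, 33, 33, 33, 33]); GAP/dropout/FC follow per Section 2.3
```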
3-way Tensor fusion CNN (TF-3M-CNN): Here, $X$ is the ‘pool5’ layer of the ResNet-18 CNN. The TF-3M-CNN model consists of $J$ uni-modal feature extractors, each followed by a randomly initialized uni-modal fully connected layer of 16 neurons, and four 3-input tensor fusion layers, each combining three uni-modal inputs and given by $F_{3T}^{m} = c_{3T}(f^{i}, f^{j}, f^{k})$, where $c_{3T}(\cdot)$ represents a 3-way tensor fusion operation. This setting leads to four 3-D tensor cubes, similar to the interactions in [8], each representing uni-modal, bi-modal, and tri-modal interactions between three modalities. The resulting four 3-D feature tensors are each input to a fully connected layer (16 neurons), followed by concatenation, dropout, fully connected, and softmax layers.
3. Experiments And Results
The proposed models were evaluated through five-fold cross-validation experiments for the Brain and the Heart tasks. A scan-wise holdout was implemented in each round of cross-validation. The reported metrics are the mean and standard deviation of the sensitivity (reference: NQ group), specificity, and accuracy for binary classification of the operator experience group, computed over the cross-validation rounds. An ablation study was performed to test the uni-modal CNN models. The experimental results for the Brain and Heart tasks are presented in Table 1 and Table 2, respectively.
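The scan-wise holdout can be realized with grouped cross-validation, so that clips from the same scan never appear in both splits. The sketch below uses scikit-learn's GroupKFold with placeholder labels and group assignments; the paper does not specify its tooling, so this is only one way to implement the split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

n_clips, n_scans = 2309, 370
labels = np.random.randint(0, 2, n_clips)          # placeholder NQ/XP clip labels
scan_ids = np.random.randint(0, n_scans, n_clips)  # scan that each clip came from
X = np.zeros((n_clips, 1))                         # stand-in for the encoded inputs

for fold, (train_idx, test_idx) in enumerate(
        GroupKFold(n_splits=5).split(X, labels, groups=scan_ids)):
    # no scan contributes clips to both the training and test split
    assert set(scan_ids[train_idx]).isdisjoint(scan_ids[test_idx])
    # ... fine-tune the CNNs on train_idx clips, evaluate on test_idx clips ...
```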
Table 1. Experimental results for the Brain task.
| Uni-modal CNNs | Parameters | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| Image-CNN | 11.18 M | 0.81 ± 0.05 | 0.52 ± 0.10 | 0.71 ± 0.02 |
| Gaze-CNN | 11.18 M | 0.78 ± 0.05 | 0.54 ± 0.06 | 0.69 ± 0.02 |
| Trajectory-CNN | 11.18 M | 0.62 ± 0.06 | 0.78 ± 0.03 | 0.68 ± 0.03 |
| Pupillary-CNN | 11.18 M | 0.80 ± 0.06 | 0.64 ± 0.04 | 0.74 ± 0.04 |
| Multi-modal CNNs | Parameters | Sensitivity | Specificity | Accuracy |
| LF-CNN | 44.72 M | 0.82 ± 0.04 | 0.79 ± 0.04 | 0.81 ± 0.02 |
| IF-CNN | 15.58 M | 0.82 ± 0.03 | 0.83 ± 0.10 | 0.82 ± 0.02 |
| HF-CNN | 57.57 M | 0.83 ± 0.06 | 0.72 ± 0.07 | 0.79 ± 0.03 |
| TF-4M-CNN | 44.79 M | 0.66 ± 0.16 | 0.69 ± 0.07 | 0.67 ± 0.08 |
| TF-3M-CNN | 45.07 M | 0.79 ± 0.04 | 0.44 ± 0.09 | 0.67 ± 0.08 |
Table 2. Experimental results for the Heart task.
| Uni-modal CNNs | Sensitivity | Specificity | Accuracy |
|---|---|---|---|
| Image-CNN | 0.73 ± 0.04 | 0.61 ± 0.04 | 0.69 ± 0.03 |
| Gaze-CNN | 0.85 ± 0.08 | 0.46 ± 0.11 | 0.71 ± 0.02 |
| Trajectory-CNN | 0.72 ± 0.07 | 0.71 ± 0.04 | 0.71 ± 0.04 |
| Pupillary-CNN | 0.77 ± 0.07 | 0.61 ± 0.03 | 0.71 ± 0.05 |
| Multi-modal CNNs | Sensitivity | Specificity | Accuracy |
| LF-CNN | 0.74 ± 0.06 | 0.74 ± 0.07 | 0.73 ± 0.06 |
| IF-CNN | 0.81 ± 0.06 | 0.69 ± 0.06 | 0.76 ± 0.04 |
| HF-CNN | 0.72 ± 0.03 | 0.78 ± 0.07 | 0.74 ± 0.03 |
| TF-4M-CNN | 0.62 ± 0.12 | 0.64 ± 0.20 | 0.63 ± 0.06 |
| TF-3M-CNN | 0.84 ± 0.05 | 0.31 ± 0.06 | 0.65 ± 0.05 |
Firstly, we observe that the overall cross-validation performance for the Heart task is slightly lower than for the Brain task for most models. This suggests that discriminating operator skill from inspection of the heart is harder, possibly because the heart is a smaller structure with a higher number of standard planes to find, leading to more complex search and interpretation. Further, for both tasks, the uni-modal CNNs achieve good results, suggesting that each single modality carries value for classifying operator skill. There is no clear winner across all metrics among the uni-modal CNNs: the pupillary-CNN and trajectory-CNN are overall more accurate with higher specificities, whereas the image-CNN and gaze-CNN show higher sensitivities in the two tasks, respectively. A higher sensitivity can be considered clinically more valuable, as misclassifying a newly qualified operator as an expert can have more serious consequences than vice versa. We observe that most multi-modal fusion CNNs outperform the uni-modal CNNs. The intermediate fusion CNN, followed by the late fusion CNN, shows the most promising results for the Brain task, and the hybrid fusion CNN shows a balanced performance for the Heart task. Lastly, for our classification problem, the two tensor fusion CNNs are not as accurate as the other fusion schemes. Among the tensor fusions, the 4-way fusion CNN yields a more balanced sensitivity and specificity than the 3-way fusion CNN.
4. Conclusion
The paper describes a novel multi-modal learning framework to automatically predict operator expertise using heterogeneous spatio-temporal sensory cues, namely, acquired video, eye-tracking data, and pupillary response, in clinical obstetric ultrasound. Our preliminary findings, including an ablation study, confirm that the fusion models are more accurate compared to models built with single modalities, and we achieve reasonable performance using intermediate, late and hybrid fusion methods, modelling intra- and inter-modal dynamics.
Fig. 1. Overview of the proposed multi-modal learning method for operator skill characterization.
Acknowledgements
We acknowledge the ERC (ERC-ADG-2015 694581 project PULSE), the EPSRC (EP/MO13774/1), and the NIHR Oxford Biomedical Research Centre (BRC). We thank Pierre Chatelain and Richard Droste for their enabling contributions in developing the acquisition system and parameter extraction tools for video and raw eye-tracker data, respectively. We are not aware of any financial conflicts of interest to be disclosed.
Footnotes
Project PULSE, funded by the European Research Council (grant ERC ADG-2015 694581) https://www.eng.ox.ac.uk/pulse
Compliance With Ethical Standards
This study was approved by the UK Research Ethics Committee (Reference 18/WS/0051) and the ERC ethics committee.
References
- [1] Maier-Hein L, Vedula SS, Speidel S, Navab N, et al. Surgical data science for next-generation interventions. Nat Biomed Eng. 2017;1(9):691–696. doi: 10.1038/s41551-017-0132-7.
- [2] Drew T, Evans K, Vo ML-H, Jacobson FL, et al. Informatics in radiology: what can you see in a single glance and how might this guide visual search in medical images? Radiographics. 2013;33(1):263–274. doi: 10.1148/rg.331125023.
- [3] van der Wel P, van Steenbergen H. Pupil dilation as an index of effort in cognitive control tasks: A review. Psychon Bull Rev. 2018;25(6):2005–2015. doi: 10.3758/s13423-018-1432-y.
- [4] Szulewski A, Kelton D, Howes D. Pupillometry as a tool to study expertise in medicine. Frontline Learning Research. 2017;5(3):55–65.
- [5] Sharma H, Drukker L, Droste R, Chatelain P, et al. OC10.02: Task-evoked pupillary response as an index of cognitive workload of sonologists undertaking fetal ultrasound. Ultrasound Obstet Gynecol. 2020;56(S1):28–28.
- [6] Baltrusaitis T, Ahuja C, Morency L-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans Pattern Anal Mach Intell. 2019;41(2):423–443. doi: 10.1109/TPAMI.2018.2798607.
- [7] Wang W, Tran D, Feiszli M. What makes training multi-modal classification networks hard? Proc IEEE/CVF CVPR 2020; 2020. pp. 12692–12702.
- [8] Zadeh A, Chen M, Poria S, Cambria E, et al. Tensor fusion network for multimodal sentiment analysis. Proc EMNLP 2017; 2017. pp. 1103–1114.
- [9] Chatelain P, Sharma H, Drukker L, Papageorghiou AT, et al. Evaluation of gaze tracking calibration for longitudinal biomedical imaging studies. IEEE Trans Cybern. 2020;50(1):153–163. doi: 10.1109/TCYB.2018.2866274.
- [10] Cai Y, Droste R, Sharma H, Chatelain P, et al. Spatio-temporal visual attention modelling of standard biometry plane-finding navigation. Med Image Anal. 2020;65:101762. doi: 10.1016/j.media.2020.101762.
- [11] Wang Y, Droste R, Jiao J, Sharma H, et al. Differentiating operator skill during routine fetal ultrasound scanning using probe motion tracking. In: Medical Ultrasound, and Preterm, Perinatal and Paediatric Image Analysis. Springer; 2020. pp. 180–188.
- [12] Sharma H, Droste R, Chatelain P, Drukker L, et al. Spatio-temporal partitioning and description of full-length routine fetal anomaly ultrasound scans. Proc IEEE ISBI 2019; 2019. pp. 987–990.
- [13] Sharma H, Drukker L, Chatelain P, Droste R, et al. Knowledge representation and learning of operator clinical workflow from full-length routine fetal ultrasound scan videos. Med Image Anal. 2021:101973. doi: 10.1016/j.media.2021.101973.
- [14] Kret ME, Sjak-Shie EE. Preprocessing pupil size data: Guidelines and code. Behav Res Methods. 2019;51(3):1336–1342. doi: 10.3758/s13428-018-1075-y.
- [15] Wang Z, Oates T. Imaging time-series to improve classification and imputation. Proc IJCAI 2015; 2015. pp. 3939–3945.
- [16] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proc IEEE/CVF CVPR 2016; 2016. pp. 770–778.

