Author manuscript; available in PMC: 2023 Sep 17.
Published in final edited form as: Med Image Comput Comput Assist Interv. 2022 Sep 17;2022:104–114. doi: 10.1007/978-3-031-16449-1_11

USPoint: Self-Supervised Interest Point Detection and Description for Ultrasound-Probe Motion Estimation During Fine-Adjustment Standard Fetal Plane Finding

Cheng Zhao 1,, Richard Droste 1, Lior Drukker 2, Aris T Papageorghiou 2, J Alison Noble 1
PMCID: PMC7614558  EMSID: EMS159397  PMID: 37223131

Abstract

Ultrasound (US)-probe motion estimation is a fundamental problem in automated standard plane locating during obstetric US diagnosis. Most recent works employ deep neural networks (DNNs) to regress the probe motion. However, these deep regression-based methods leverage the DNN to overfit on the specific training data, and therefore naturally lack generalization ability for clinical application. In this paper, we return to generalized US feature learning rather than deep parameter regression. We propose a self-supervised learned local detector and descriptor, named USPoint, for US-probe motion estimation during the fine-adjustment phase of fetal plane acquisition. Specifically, a hybrid neural architecture is designed to simultaneously extract local features and further estimate the probe motion. By embedding a differentiable USPoint-based motion estimation inside the proposed network architecture, USPoint learns the keypoint detector, scores and descriptors from motion error alone, which does not require expensive human annotation of local features. The two tasks, local feature learning and motion estimation, are jointly learned in a unified framework to enable collaborative learning for mutual benefit. To the best of our knowledge, it is the first learned local detector and descriptor tailored for the US image. Experimental evaluation on real clinical data demonstrates the resultant performance improvement on feature matching and motion estimation for potential clinical value. A video demo can be found online: https://youtu.be/JGzHuTQVlBs.

Keywords: Obstetric US, Probe motion, Local detector and descriptor

1. Introduction

Motivation

Ultrasound (US) scanning is an indispensable diagnostic tool in obstetric care because of its safety, real-time nature and relatively low cost. However, obstetric US scanning is highly dependent on operator experience. Automatic probe-movement prediction for acquiring standard imaging planes might assist less-experienced users to perform scanning confidently. In obstetric US scanning, it is typically not difficult for an operator to find an approximate position for a standard imaging plane (searching); however, adjusting the probe to accurately locate a diagnostically acceptable standard plane (fine-tuning) is not easy. Searching probe motion involves large translations and rotations, whereas fine-tuning probe motion is predominantly rotation with very small translation. The proposed method focuses on the latter problem.

During sonography training, a high-quality standard plane captured by an expert sonographer can be stored in advance. When the trainee is close to the target plane of interest, the task translates to US image-based probe-motion prediction between the current and target US image pair. The predicted motion guides the trainee to reach the saved standard/target plane more efficiently and thereby improve their practical scanning ability. When no pre-captured high-quality standard plane is available, the learned local feature can be inserted as a feature-extraction frontend to a time-series model, e.g. an LSTM or Transformer, for a sequence of motion predictions. The predicted motion sequence could then guide the non-expert to choose the optimal action to approach the standard/target plane.

Related Work

Existing methods for US-probe motion estimation can be grouped into heuristic methods and data-driven deep regression methods. The former infer the probe motion from elaborate hand-designed heuristics, such as speckle decorrelation [2,7,14], extracted from the US image. The latter [6,8,12] take advantage of the powerful non-linearity of DNNs to regress the probe motion from large-scale US-probe data. These learning-based regression methods leverage the DNN to overfit on the training data of specific users; as discussed in [17], their main limitation is naturally a lack of generalization ability in real clinical applications. Our proposed method therefore returns to feature extraction and description. Due to the peculiar data structure, it is difficult to extract local features directly from the US image. Our goal is to learn a general US local feature from large-scale US-probe data recorded by experienced sonographers. Classic local features from computer vision, such as SIFT [10] and ORB [15], are not tailored for US images and yield inferior interest point detection and description on US images due to the domain gap between natural grey-scale images and US images. Recently, deep-learned local features [3,5,16] have significantly improved local feature extraction and matching on natural images. However, these approaches are difficult to apply to US images due to the lack of geometric information (feature correspondences, known pose and depth) for network training.

Contribution

Instead of considering local feature extraction and probe motion estimation separately, we present a new way of thinking about these two problems jointly on US images. We present a self-supervised local feature, named USPoint, that predicts keypoint locations, scores and descriptors for US-probe motion estimation during fine-adjustment standard plane finding. This is achieved by embedding a differentiable USPoint-based motion estimation inside the proposed network architecture, supervised only by automatically obtained ground-truth (GT) probe motion information. We call this self-supervised learning because, similar to stereo images or audio signals, the position information is captured naturally by the motion sensor during clinical data collection. As Fig. 1 shows, US-probe motion estimation between the current and the target US images is achieved by learned local feature extraction followed by deep motion estimation. Compared to conventional hand-designed heuristics, our method learns priors over geometric transformations and regularities directly from data. Compared to conventional deep parameter regression, our method returns to generalized US feature learning. In summary, our proposed network includes the following new features: I) a local detector and descriptor tailored for US images, learned in a self-supervised manner without expensive human annotation of local features; II) probe motion estimation achieved by USPoint-based regression from the learned local features with attentional graph matching; III) the two tasks jointly learned in a unified framework to enable collaborative learning for mutual benefit.

Fig. 1. The pipeline of the proposed approach.


2. Methodology

Overview

As Fig. 2 shows, the proposed network architecture consists of four stages: feature encoder, local detector and descriptor, attentional graph matching, and motion estimation. Stage one, the feature encoder, is composed of a series of convolution stacks that transform the US image into a dense high-dimensional feature representation. Stage two, the local detector and descriptor, is a three-stream architecture designed to jointly learn the pixel-wise position and feature vector of interest points; it extracts a sparse feature representation of the detected points from the dense feature representation of stage one. Stage three, attentional graph matching, aggregates the feature representation of each interest point from the other salient points and then finds correspondences according to the descriptors. Stage four estimates the motion via a differentiable singular value decomposition (SVD) from the matched interest point pairs. During training, a motion sensor (IMU) attached to the probe provides the motion signal as supervision; it is not required during deployment. The complete network is trained end-to-end without expensive human annotation.

Fig. 2. The four-stage network architecture: feature encoder, local detector and descriptor, attentional graph matching and motion estimation. The IMU is only used to provide GT for training; it is not required during inference.

Feature Encoder

The feature encoder is a VGG-style [18] sub-network, which takes a US image as input and generates a dense, spatially reduced, high-dimensional feature representation. Each pixel in the feature map corresponds to an 8×8 pixel patch in the original US image. This dense feature representation is then fed to the local detector and descriptor to extract the sparse feature representation of each interest point. The specific architecture and parameters are provided in the Supplementary Material.
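To make the patch geometry concrete, a minimal PyTorch sketch of such a VGG-style encoder is shown below. The layer counts and channel widths are our assumptions (the paper defers the exact architecture to its Supplementary Material), but three stride-2 poolings reproduce the stated 8×8 patch-to-pixel correspondence.

```python
# Hypothetical sketch of a VGG-style feature encoder; channel widths are
# assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class FeatureEncoder(nn.Module):
    """Maps a 1-channel US image (H, W) to a dense (H/8, W/8) feature map."""
    def __init__(self):
        super().__init__()
        self.stages = nn.Sequential(
            conv_block(1, 64),   nn.MaxPool2d(2),   # stride-2 pooling x1
            conv_block(64, 64),  nn.MaxPool2d(2),   # stride-2 pooling x2
            conv_block(64, 128), nn.MaxPool2d(2),   # x3 -> each cell covers an 8x8 patch
            conv_block(128, 256),
        )

    def forward(self, image):                       # image: (B, 1, H, W)
        return self.stages(image)                   # (B, 256, H/8, W/8)

feats = FeatureEncoder()(torch.randn(2, 1, 256, 320))
print(feats.shape)  # torch.Size([2, 256, 32, 40])
```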

Local Detector and Descriptor

As shown in Fig. 2, the local detector and descriptor jointly learn the position and feature vector of each interest point, and are composed of three head branches for score (middle), detection (top) and description (bottom). The score branch (middle) includes 2 convolutional layers with 256 and 1 channels followed by a sigmoid activation, which bounds the score predictions S in the interval (0, 1). It regresses the probability of an interest point in each 8×8 pixel patch, so that the top N interest points can be selected from the original US image according to their scores.

The detection branch (top) includes 2 convolutional layers with 256 and 2 channels followed by a sigmoid activation. Following [3], it regresses the relative position $P_{relative}(x, y)$ of each interest point within the corresponding 8×8 patch, where x ∈ (0, 1) and y ∈ (0, 1) after the sigmoid operation. The position/index of each 8×8 patch is defined as $P_{patch}(w, h)$, where (w, h) is the pixel coordinate in the feature map. The interest point position $P_{point}(X, Y)$ of each patch in the original US image is calculated from $P_{relative}(x, y)$ and $P_{patch}(w, h)$,

$P_{point}(X, Y) = \big((w + x)\,s^{n},\ (h + y)\,s^{n}\big), \quad (1)$

where s = 2 is the stride of the pooling layers and n = 3 is the number of pooling layers in the feature encoder. The descriptor branch (bottom) includes 2 convolutional layers with 256 and 256 channels respectively, which generate a semi-dense grid of descriptors, one per 8×8 patch.
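As a worked example of Eq. (1), with s = 2 and n = 3 each patch index is simply rescaled by s^n = 8; a hypothetical helper (function and argument names are ours) reads:

```python
def patch_to_image_coords(w, h, x, y, s=2, n=3):
    """Eq. (1): map patch index (w, h) plus relative offset (x, y) in (0, 1)
    back to pixel coordinates (X, Y) in the original US image."""
    scale = s ** n            # 2**3 = 8, the side length of each patch
    return (w + x) * scale, (h + y) * scale

# A point predicted at offset (0.25, 0.75) inside patch (10, 5) lands at (82.0, 46.0).
print(patch_to_image_coords(10, 5, 0.25, 0.75))
```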

To achieve real-time performance and reduce memory consumption, up-sampling back to the full resolution through deconvolution is abandoned. Instead, bi-cubic interpolation is used to interpolate descriptors at the positions of the interest points. The top N interest points are selected according to their detection scores S from the score branch, and their positions (X, Y) and associated descriptors $D \in \mathbb{R}^{256}$ are then extracted by the detection and descriptor branches respectively. This sub-network transforms the dense feature representation into a sparse feature representation for the next stage.
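A minimal sketch of this sampling step is given below, assuming PyTorch's `F.grid_sample` with `mode='bicubic'` (available in recent PyTorch releases) as the interpolation routine; tensor names, shapes and the channel ordering are our assumptions, not the paper's.

```python
# Illustrative sketch: top-N keypoint selection and bicubic descriptor sampling.
import torch
import torch.nn.functional as F

def sample_keypoints(scores, positions, dense_desc, top_n=256):
    """scores: (B, 1, Hc, Wc); positions: (B, 2, Hc, Wc) with channel 0 = X,
    channel 1 = Y in image pixels; dense_desc: (B, 256, Hc, Wc).
    All names/shapes are assumptions for illustration."""
    B, _, Hc, Wc = scores.shape
    H, W = Hc * 8, Wc * 8                                    # original image size
    flat_scores = scores.flatten(2)                          # (B, 1, Hc*Wc)
    top = flat_scores.topk(top_n, dim=-1).indices            # (B, 1, top_n)
    kpts = positions.flatten(2).gather(2, top.expand(B, 2, top_n))  # (B, 2, N)

    # Normalise pixel coordinates to [-1, 1] for grid_sample.
    grid = kpts.permute(0, 2, 1).clone()                     # (B, N, 2) as (X, Y)
    grid[..., 0] = grid[..., 0] / (W - 1) * 2 - 1
    grid[..., 1] = grid[..., 1] / (H - 1) * 2 - 1
    desc = F.grid_sample(dense_desc, grid[:, None],          # (B, 256, 1, N)
                         mode='bicubic', align_corners=True)
    desc = F.normalize(desc[:, :, 0], dim=1)                 # (B, 256, N), unit length
    return kpts, desc
```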

Attentional Graph Matching

Because the local feature description of each interest point is extracted from a small patch of the original US image, the associated feature representation suffers from ambiguities due to a lack of contextual cues. Inspired by [16], an Attentional Graph Neural Network (AGNN) [20] is employed to aggregate contextual cues from the other interest points to increase distinctiveness. To integrate both position and appearance information of an interest point $i$, $i \in \{1, \dots, N\}$, into the AGNN, the interest point position information $(X_i, Y_i, S_i)$ is embedded into a high-dimensional feature representation via a Multilayer Perceptron (MLP). This feature representation is further combined with the initial visual descriptor $D_i$ to define a new feature vector $F_i$ as,

$F_i = D_i + \mathrm{MLP}(X_i, Y_i, S_i). \quad (2)$
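A minimal sketch of the keypoint encoder in Eq. (2) could look as follows; the MLP hidden width is an assumption, since the paper does not specify it here.

```python
# Hypothetical keypoint encoder implementing Eq. (2); hidden size is assumed.
import torch
import torch.nn as nn

class KeypointEncoder(nn.Module):
    """Lift (X, Y, S) to the descriptor dimension and add it to D."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(inplace=True),
            nn.Linear(64, feat_dim),
        )

    def forward(self, desc, kpts, scores):
        # desc: (B, N, 256), kpts: (B, N, 2), scores: (B, N, 1)
        return desc + self.mlp(torch.cat([kpts, scores], dim=-1))
```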

The interest points in both the source and the target images are considered as nodes of a complete graph. This graph includes two types of undirected edges: intra-image edges and extra-image edges. Intra-image edges connect the current interest point i with all the other interest points j in the same image, while extra-image edges connect the current interest point i with all the interest points j in the other image. A message passing formulation [1] M is used for information propagation along both the intra-image and extra-image edges. The high-dimensional feature vector of each graph node is updated as,

$F_i = F_i + \mathrm{MLP}\big(F_i \oplus M_{i,j}^{intra}\big) + \mathrm{MLP}\big(F_i \oplus M_{i,j}^{extra}\big), \quad (3)$

where ⊕ refers to concatenation. The intra-image message $M_{i,j}^{intra}$ and extra-image message $M_{i,j}^{extra}$ are calculated via the self-attention mechanism [19],

$M_{i,j}^{intra} = \sum_{j \in intra} \mathrm{softmax}\!\left(\frac{F_i F_j^{T}}{\sqrt{d}}\right) F_j, \qquad M_{i,j}^{extra} = \sum_{j \in extra} \mathrm{softmax}\!\left(\frac{F_i F_j^{T}}{\sqrt{d}}\right) F_j. \quad (4)$

Here d denotes the dimension of the feature vector. The AGNN can automatically build connections between a current interest point and nearby or similar salient interest points, because both position and appearance information are encoded into the feature representation of each graph node.
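The sketch below illustrates Eqs. (3)-(4) with single-head dot-product attention; the update MLP widths and the single-head simplification are our assumptions (the AGNN in [20] and SuperGlue [16] use multi-head attention), and only the source-image nodes are updated here for brevity (the target nodes are updated symmetrically).

```python
# Hedged sketch of the intra-/extra-image message passing; widths are assumed.
import torch
import torch.nn as nn

def attention_messages(F_q, F_kv):
    """Eq. (4): dot-product attention messages from nodes F_kv to nodes F_q.
    F_q: (B, N, d), F_kv: (B, M, d); single head for clarity."""
    d = F_q.shape[-1]
    attn = torch.softmax(F_q @ F_kv.transpose(1, 2) / d ** 0.5, dim=-1)  # (B, N, M)
    return attn @ F_kv                                                    # (B, N, d)

class GraphUpdate(nn.Module):
    """Eq. (3): update each source node from intra- and extra-image messages."""
    def __init__(self, d=256):
        super().__init__()
        self.intra_mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.extra_mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, F_src, F_tgt):
        m_intra = attention_messages(F_src, F_src)   # messages within the same image
        m_extra = attention_messages(F_src, F_tgt)   # messages from the other image
        return (F_src
                + self.intra_mlp(torch.cat([F_src, m_intra], dim=-1))
                + self.extra_mlp(torch.cat([F_src, m_extra], dim=-1)))
```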

Further, the pairwise scores $\gamma_{i,j}$ are calculated as the similarity between the descriptors $F_i^{source}$ and $F_j^{target}$ of interest points i and j in the source and target images,

$\gamma_{i,j} = \left\langle F_i^{source}, F_j^{target} \right\rangle, \quad \forall (i, j) \in source \times target, \quad (5)$

where ⟨·, ·⟩ denotes the inner product. The pair matching is optimized by the Sinkhorn algorithm [4] to select the matched interest point pairs.
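For illustration, a minimal log-domain Sinkhorn normalisation of the score matrix might look like the following; this simplified sketch omits the unmatched "dustbin" bin used in [16] and assumes equal numbers of points in the two images.

```python
# Simplified Sinkhorn normalisation of pairwise scores (no dustbin row/column).
import torch

def sinkhorn(scores, n_iters=20):
    """scores: (B, N, N) pairwise similarity; returns an approximately
    doubly-stochastic soft-assignment matrix of matching weights."""
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=2, keepdim=True)  # row normalisation
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # column normalisation
    return log_p.exp()
```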

Motion Estimation

Given the matched correspondence pairs $P_i^{source}, P_j^{target}$ with pairwise scores $\gamma_{i,j}$, the 3D relative motion between the source and the target US images is regressed via a differentiable SVD. Firstly, the matched pixel positions P are fed into a Transform Net [13] to obtain a matrix $T \in \mathbb{R}^{3\times3}$. Then the pixel coordinate values $P_i^{source}, P_j^{target}$ are transformed into three-dimensional values $p_i^{source}, p_j^{target}$ through $p = T \times P$. T is a dynamic and deterministic matrix, which is directly regressed from the data by the Transform Net. The centroid values $\bar{p}^{source}, \bar{p}^{target}$ are then calculated as,

$\bar{p}^{source} = \frac{1}{N}\sum_{i=1}^{N} p_i^{source}, \qquad \bar{p}^{target} = \frac{1}{N}\sum_{j=1}^{N} p_j^{target}, \quad (6)$

where N is the number of matched pairs. Then the cross-covariance matrix Σ is given as,

$\Sigma = \sum_{i,j=1}^{N} \left(p_i^{source} - \bar{p}^{source}\right) \gamma_{i,j} \left(p_j^{target} - \bar{p}^{target}\right)^{T}. \quad (7)$

Σ is further decomposed by SVD, and then the relative motion between source and target US images is obtained as,

$U, \Delta, V = \mathrm{SVD}(\Sigma), \qquad R = V U^{T}, \qquad t = \bar{p}^{target} - R\,\bar{p}^{source}. \quad (8)$

$U, V \in SO(3)$ and Δ is diagonal but potentially signed. $R \in SO(3)$ is the rotation matrix, which is further transformed into a quaternion Q, and $t \in \mathbb{R}^{3}$ is the translation vector.
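A minimal sketch of this weighted, SVD-based alignment (Eqs. (6)-(8)) is shown below. For brevity it assumes the Transform-Net lifting to 3-D has already been applied, uses one confidence weight per matched pair rather than the full $\gamma_{i,j}$ matrix, and adds the standard reflection correction to keep R in SO(3) (the text notes that Δ may be signed).

```python
# Hedged sketch of weighted Kabsch/SVD motion estimation from matched 3-D points.
import torch

def weighted_kabsch(p_src, p_tgt, weights):
    """p_src, p_tgt: (N, 3) lifted source/target points; weights: (N,) confidences."""
    w = weights / weights.sum()
    c_src = (w[:, None] * p_src).sum(0)                          # weighted centroid, Eq. (6)
    c_tgt = (w[:, None] * p_tgt).sum(0)
    cov = (p_src - c_src).T @ torch.diag(w) @ (p_tgt - c_tgt)    # cross-covariance, Eq. (7)
    U, S, Vt = torch.linalg.svd(cov)                             # Eq. (8)
    V = Vt.T
    # Reflection correction keeps the recovered rotation in SO(3).
    d = torch.sign(torch.linalg.det(V @ U.T))
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    R = V @ D @ U.T
    t = c_tgt - R @ c_src
    return R, t
```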

Crucially, the motion estimation is differentiable, which backpropagates the motion error to the local detector, descriptor and matching sub-networks. This guarantees that the whole network is trainable in an end-to-end fashion. Most importantly, it builds a relationship between the local feature learning and the supervision signal from the IMU sensor, so that the local detector and descriptor are learned in a self-supervised way without expensive human annotation. Moreover, the learned pairwise scores allow the matched interest point pairs to contribute differently according to their confidences in the differentiable SVD. Thus, jointly learning the feature and motion enables collaborative learning between the two sub-networks during backpropagation to minimise the loss function.

Loss Function

The US image pairs, i.e., source and target US images, are input to the network, and their relative 3D motion, i.e., translation t and quaternion Q, is predicted. The corresponding GT, i.e., translation $t_{GT}$ and quaternion $Q_{GT}$ measured by the IMU sensor, is provided as the supervision for the network training. The loss function $\mathcal{L}$ is defined as,

$\mathcal{L}(I^{source}, I^{target}) = \left\lVert t - t_{GT} \right\rVert_{2} + \lambda \left(1 - \left| Q \cdot Q_{GT} \right|\right), \quad (9)$

where λ is the scale factor.
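A minimal sketch of Eq. (9) as a training loss could be written as follows; the explicit quaternion normalisation is our addition, to guard against non-unit predictions, and the value of the scale factor is left as a hyper-parameter.

```python
# Hedged sketch of the translation + quaternion loss in Eq. (9).
import torch
import torch.nn.functional as F

def motion_loss(t_pred, q_pred, t_gt, q_gt, lam=1.0):
    """t_*: (B, 3) translations, q_*: (B, 4) quaternions, lam: scale factor lambda."""
    q_pred = F.normalize(q_pred, dim=-1)
    q_gt = F.normalize(q_gt, dim=-1)
    trans_err = torch.norm(t_pred - t_gt, dim=-1)                 # ||t - t_GT||_2
    rot_err = 1.0 - torch.abs((q_pred * q_gt).sum(dim=-1))        # 1 - |<Q, Q_GT>|
    return (trans_err + lam * rot_err).mean()
```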

3. Experiments

Data Acquisition and Processing

The data used in this paper came from a large-scale multi-modal fetal US dataset Anonymous, including synchronized US video and IMU signals. We consider the standard biometry planes of the 2nd and 3rd trimester, i.e. the head circumference/trans-ventricular (HC/TV) plane, the abdominal circumference (AC) plane and the femur length (FL) plane. 47,589 US image pairs extracted from the video clips are used for training and testing of the proposed method, divided into five-fold cross-validation. Please find more details on data processing and network training in the Supplementary Material.

Performance Evaluation

Typical learned local detections, descriptors and matches on US image pairs are shown in Fig. 3. The colourized dots and lines refer to the detected interest points and matched correspondences. The Jet colourmap is used to represent their scores, i.e. red refers to a high score while blue refers to a low score. We provide qualitative results on different anatomical planes such as the abdomen, heart, head, arm, foot, lip-nose and spine. Most of the matched correspondences maintain a good degree of global consistency, with only a few crossing (wrong) lines. The proposed method consistently predicts a high number of correct matches even in the presence of strong repeated texture. To demonstrate the robustness of the learned local feature, results for US image pairs under different settings such as scale, crop and zoom are also provided.

We use four standard evaluation metrics, Matching Precision (MP), Matching Score (MS), Positional Drift Percentage (PDP) and Angular Offset (AO), as used in the most popular benchmark datasets, to analyse the experimental results. According to these definitions, MP, defined as (correct matches)/(predicted matches), is used to evaluate the learned descriptor, while MS, defined as (correct matches)/(detected interest points), is used to evaluate the learned detector. As GT for the local feature evaluation, we manually annotated the correct matched correspondences on 300 US image pairs. Because it is very difficult to annotate interest points with matches directly in US images, our strategy is to manually select the correct matches from those predicted by the proposed and baseline methods; in this case, manually annotating interest points in the US images is not required.

For the probe translation evaluation, we use the positional drift percentage, defined as $\lVert \hat{t} - t_{GT} \rVert_{2} / \lVert t_{GT} \rVert_{2}$, instead of the absolute positional error, because the translation is usually very small during US-probe fine-tuning and a small absolute positional error does not necessarily indicate good performance. For the probe rotation evaluation, we use the quaternion rather than Euler angles to represent 3D rotation because Euler angles suffer from the well-known gimbal lock problem. Following PoseNet [9], the AO is defined as $2\arccos(\hat{Q} \cdot Q_{GT})$. The probe motion GT is obtained directly from the motion sensor (IMU) or US simulator. Note that the IMU and US simulator are only used to provide GT for network training and performance evaluation; they are not required during practical inference.

Fig. 3. Typical learned local detector, descriptor and matches for US image pairs. The colour indicates the matching score (Jet colormap, blue = low, red = high). (Color figure online)
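For reference, the metrics defined above can be computed with a few lines of NumPy; the absolute value inside the AO guards against the q/−q sign ambiguity of quaternions and is our addition, not part of the paper's stated definition.

```python
# Hedged sketch of the evaluation metrics as defined in the text.
import numpy as np

def positional_drift_percentage(t_pred, t_gt):
    """PDP = ||t_pred - t_GT||_2 / ||t_GT||_2, reported as a percentage."""
    return 100.0 * np.linalg.norm(t_pred - t_gt) / np.linalg.norm(t_gt)

def angular_offset(q_pred, q_gt):
    """AO = 2 * arccos(<q_pred, q_GT>) between unit quaternions, in degrees."""
    dot = np.abs(np.dot(q_pred / np.linalg.norm(q_pred), q_gt / np.linalg.norm(q_gt)))
    return np.degrees(2.0 * np.arccos(np.clip(dot, 0.0, 1.0)))

def matching_precision(correct_matches, predicted_matches):
    """MP = correct matches / predicted matches."""
    return correct_matches / predicted_matches

def matching_score(correct_matches, detected_points):
    """MS = correct matches / detected interest points."""
    return correct_matches / detected_points
```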

As baseline methods, we chose geometry-style SIFT [10] with FLANN [11], and learning-style SuperPoint [5] with SuperGlue [16], for the local feature matching comparison. We chose the deep regression method [12] for the motion estimation comparison between two US images. The only difference is that, for a fair comparison, we only use the US image for motion estimation, because the proposed method only uses the US image during inference; the original method [12] uses both the US image and the IMU signal to predict the motion. From Table 1, we can see that the MP and MS of the proposed method outperform both the geometry-based and learning-based methods. The main reason is that our local detector and descriptor are tailored for US images, while the baselines are designed for natural images. When comparing the MP and MS of image pairs from different anatomical standard planes, we did not find significant differences. From Table 2, it can be seen that the average PDP and AO of the proposed method outperform the baseline across different frame intervals. For deep motion regression, elaborately matched local features provide more geometric cues than the original US images.

Table 1. Performance comparison between the proposed method and baselines for local feature matching evaluation.

Method MP (%) MS (%)
SIFT [10] + FLANN [11] 45.06 6.58
SuperPoint [5] + SuperGlue [16] 66.57 33.16
Proposed Method 74.72 53.98

MP and MS refer to Matching Precision and Matching Score respectively, reported as percentages (%).

Table 2. Performance comparison between the proposed method and baseline for probe motion evaluation.

Method 50-frames 100-frames 150-frames 250-frames
PDP(%) AO(°) PDP(%) AO(°) PDP(%) AO(°) PDP(%) AO(°)
Deep regression [12] 8.14 2.54 7.92 4.23 9.21 5.46 10.79 7.98
Proposed method 4.76 1.99 4.83 2.91 5.46 3.23 6.27 5.45

N-frames refers to the interval, i.e., the number of frames between the US image pair. The average PDP and AO are reported in percent (%) and degrees (°), respectively.

4. Conclusions

In this paper, we present a novel approach for joint local feature extraction and motion estimation during the fine-adjustment stage of standard fetal plane finding in obstetric US scanning. A novel local feature, USPoint, comprising an interest point detector, score and descriptor tailored for US images, is learned in a self-supervised manner. The probe motion is further estimated based on the learned local feature detection, description and matching. The collaborative learning between local feature learning and motion estimation benefits both sub-networks. As a next step, we will further verify the generalization ability of USPoint in other US-based tasks. Future work will combine US landmark retrieval [21] with this USPoint-based motion estimation to achieve coarse-to-fine visually-assisted intervention [22] to guide US-probe movement.

Supplementary Material

Supplementary Materials

Acknowledgments

We acknowledge the ERC (ERC-ADG-2015 694581, project PULSE), the EPSRC (EP/MO13774/1, EP/R013853/1), and the NIHR Biomedical Research Centre funding scheme.

References

  • 1. Battaglia PW, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint. 2018:arXiv:1806.01261
  • 2. Chen JF, Fowlkes JB, Carson PL, Rubin JM. Determination of scan-plane motion using speckle decorrelation: theoretical considerations and initial test. Int J Imaging Syst Technol. 1997;8(1):38–44.
  • 3. Christiansen PH, Kragh MF, Brodskiy Y, Karstoft H. UnsuperPoint: end-to-end unsupervised interest point detector and descriptor. arXiv preprint. 2019:arXiv:1907.04011
  • 4. Cuturi M. Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems. 2013:2292–2300.
  • 5. DeTone D, Malisiewicz T, Rabinovich A. SuperPoint: self-supervised interest point detection and description; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops; 2018.
  • 6. Droste R, Drukker L, Papageorghiou AT, Noble JA. Automatic probe movement guidance for freehand obstetric ultrasound. In: Martel AL, et al., editors. MICCAI 2020, LNCS, vol. 12263. Springer, Cham; 2020. pp. 583–592.
  • 7. Gee AH, Housden RJ, Hassenpflug P, Treece GM, Prager RW. Sensorless freehand 3D ultrasound in real tissue: speckle decorrelation without fully developed speckle. Med Image Anal. 2006;10(2):137–149. doi: 10.1016/j.media.2005.08.001
  • 8. Guo H, Xu S, Wood B, Yan P. Sensorless freehand 3D ultrasound reconstruction via deep contextual learning. In: Martel AL, et al., editors. MICCAI 2020, LNCS, vol. 12263. Springer, Cham; 2020. pp. 463–472.
  • 9. Kendall A, Grimes M, Cipolla R. PoseNet: a convolutional network for real-time 6-DOF camera relocalization; Proceedings of the IEEE International Conference on Computer Vision; 2015. pp. 2938–2946.
  • 10. Lowe DG. Object recognition from local scale-invariant features; Proceedings of the Seventh IEEE International Conference on Computer Vision; 1999. pp. 1150–1157.
  • 11. Muja M, Lowe D. FLANN - fast library for approximate nearest neighbors user manual. Computer Science Department, University of British Columbia, Vancouver, BC, Canada. 2009;5
  • 12. Prevost R, et al. 3D freehand ultrasound without external tracking using deep learning. Med Image Anal. 2018;48:187–202. doi: 10.1016/j.media.2018.06.003
  • 13. Qi CR, Su H, Mo K, Guibas LJ. PointNet: deep learning on point sets for 3D classification and segmentation; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017.
  • 14. Rivaz H, Zellars R, Hager G, Fichtinger G, Boctor E. 9C-1 beam steering approach for speckle characterization and out-of-plane motion estimation in real tissue; 2007 IEEE Ultrasonics Symposium Proceedings; 2007. pp. 781–784.
  • 15. Rublee E, Rabaud V, Konolige K, Bradski G. ORB: an efficient alternative to SIFT or SURF; 2011 International Conference on Computer Vision; 2011. pp. 2564–2571.
  • 16. Sarlin PE, DeTone D, Malisiewicz T, Rabinovich A. SuperGlue: learning feature matching with graph neural networks; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2020.
  • 17. Sattler T, Zhou Q, Pollefeys M, Leal-Taixe L. Understanding the limitations of CNN-based absolute camera pose regression; CVPR; 2019. pp. 3302–3312.
  • 18. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint. 2014:arXiv:1409.1556
  • 19. Vaswani A, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017:5998–6008.
  • 20. Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y. Graph attention networks. arXiv preprint. 2017:arXiv:1710.10903
  • 21. Zhao C, Droste R, Drukker L, Papageorghiou AT, Noble JA. Visual-assisted probe movement guidance for obstetric ultrasound scanning using landmark retrieval. In: de Bruijne M, et al., editors. MICCAI 2021, LNCS, vol. 12908. Springer, Cham; 2021. pp. 670–679.
  • 22. Zhao C, Shen M, Sun L, Yang GZ. Generative localization with uncertainty estimation through video-CT data for bronchoscopic biopsy. IEEE Robot Autom Lett. 2019;5(1):258–265.
