Abstract
Transrectal ultrasound (TRUS) is widely used for guiding prostate biopsy due to its real-time imaging capabilities. However, ultrasound (US) lacks sensitivity for detecting prostate cancer, necessitating the integration of preoperative magnetic resonance imaging (MRI), which offers superior soft tissue contrast. To enable MRI-ultrasound fusion during interventions, an accurate 3D reconstruction of freehand TRUS is essential. Existing reconstruction methods typically rely on sequentially estimating interframe transformations, offering little explainability and accumulating errors that drift over time. In this paper, we present a framework that leverages preoperative MRI and supervised contrastive learning to reconstruct 3D ultrasound volumes directly from 2D frames. By aligning ultrasound images with corresponding MRI slices based on anatomical similarity, our method bypasses sequential estimation, avoids drift, and improves tracking accuracy. The approach was trained and validated on a large clinical data set of over 500 prostate biopsy cases and demonstrated over 50% reduction in drift error. By enhancing both precision and interpretability, our algorithm supports more reliable MRI-ultrasound fusion and holds the potential for improving the diagnostic accuracy of prostate cancer interventions.
Keywords: transrectal ultrasound, prostate biopsy, preoperative MRI, convolutional neural network, electromagnetic tracking
Introduction
Ultrasound (US) is widely used for guiding prostate biopsies and ablations since both the prostate gland and the procedural needles can be visualized in real time. However, B-mode ultrasound is not sensitive to prostate cancer. Preoperative magnetic resonance imaging (MRI) provides much better soft tissue detail, but it is cost-prohibitive to perform in real time and imposes significant equipment constraints due to the strong magnetic field in the room. To bring MRI information into the procedure suite, software-based MRI-US fusion is highly desirable, and 3D ultrasound reconstruction is often used as the first step to establish a baseline MRI-to-ultrasound registration.
The reconstruction of an ultrasound volume maps 2D ultrasound frames into 3D space, where frame-tracking information provides the position of each frame. Tracking information can come from either hardware or algorithms. Hardware solutions for tracking 2D transrectal ultrasound (TRUS) fall into two categories: mechanical and electromagnetic (EM) tracking. The former relies on a robotic arm to operate the ultrasound probe and the biopsy gun, severely limiting the urologist's freedom of movement. The EM-tracking solution places a signal transmitter in the room and a magnetic sensor on the ultrasound transducer to record the physical position of each frame. The extra devices make the scan more cumbersome, and the overall system is extremely sensitive to ferromagnetic materials.
Therefore, algorithmic solutions have emerged to enable true freehand ultrasound by deriving the spatial relationship between 2D frames purely from image content. Unlike image registration, which relies on similarity metrics, early works proposed quantifying the decorrelation between image content. Intuitively, the lower the correlation between frames, the larger the interframe motion. However, this assumes a consistent relationship between image content and physical distance, which does not hold for out-of-plane movements: both anatomical differences and physical displacements can decorrelate frames. These two entangled factors, along with random noise and attenuation artifacts, render the decorrelation approach an oversimplification of the task.
In the past few years, deep learning (DL) approaches have been proposed as a possible solution. With these methods, tracked transrectal ultrasound (TRUS) scans serve as training data to optimize models that predict interframe motion from image inputs. Prevost et al. first proposed a convolutional neural network (CNN) that takes two neighboring frames as input and regresses the 6 transformation parameters describing the rigid movement between them. Guo et al. refined the CNN with self-attention modules and used a margin ranking loss to guide training. Luo et al. proposed a long short-term memory (LSTM) network with a contextual loss function to capture both temporal and spatial features. These deep learning methods quantitatively outperform previous frame-tracking algorithms but are far from clinically applicable due to their lack of explainability and accuracy. It is unclear which features of ultrasound frames are indicative of in-plane and out-of-plane movements. Deep neural networks are renowned for their fitting capability; however, for tasks that are challenging even for humans, such as frame displacement estimation, it is hard to fathom what kind of imaging feature the networks leverage for their estimates. For example, a network could merely be capturing the statistical patterns of a urologist's scan habits.
Moreover, these algorithms are devised with a general computational task in mind but with little application-specific clinical consideration. For many interventional applications, high-quality preoperative imaging is performed, but none of the previous methods made use of such data. Ultrasound scans for different applications also tend to follow certain scan trajectory patterns, prior knowledge that is likewise rarely utilized. These DL approaches also struggle with accuracy due to accumulated errors: the neural networks estimate tracking at each frame gap and stitch the predictions together sequentially for reconstruction. As random and systematic errors accumulate at each frame gap, the trajectory of the reconstructed ultrasound volume drifts far from the ground truth.
To resolve the aforementioned issues with DL-based ultrasound frame-tracking methods, we propose an anatomy-based algorithm that provides explainable, accurate, and stable estimations. In MRI-US fusion-guided prostate biopsy, a preoperative MRI is performed. We first train a set of feature encoders to match the anatomical structures between 2D ultrasound frames and MRI slices. The proposed framework then leverages the 3D anatomical information from the MRI to guide the tracking, and thus the reconstruction, of transrectal ultrasound. By matching ultrasound frames to MRI slices, we disentangle the anatomical differences and physical displacements between slices and provide a self-explainable ultrasound tracking framework. Our contributions in this paper are threefold: (1) a preoperative image-based ultrasound frame-tracking framework that is self-explainable and free of accumulated error; (2) a contrastive training method that encodes corresponding anatomical features in different imaging modalities; and (3) benchmarking against existing DL ultrasound frame-tracking methods, demonstrating superior performance.
Methods
In this work, we propose to directly locate each ultrasound frame in the preoperative MRI to estimate the interframe transformation. Mapping cross-modality images is difficult due to differences in image contrast and texture. To address this, we designed a DL-based US-MRI feature encoder that finds the best-matching MRI slice for each ultrasound frame. Figure 1 shows the overall framework of the proposed method. We train two separate 2D feature encoders to extract modality-independent features from ultrasound frames and MRI slices. The training is guided by the proposed supervised contrastive loss, which maximizes the similarity between MRI and TRUS feature vectors corresponding to similar anatomical structures. Details of feature embedding and contrastive learning are explained in the Supervised Contrastive Learning of Multi-Modal Anatomy section. At inference time, random MRI slices are sampled from the volume as candidate slices. The similarity between the candidate slices and the ultrasound frame determines the location of the frame. The sampling and matching process is explained in the Inference-Time Candidate Trajectory Retrieval and Local Perturbation for Frame Matching sections.
1.

Overview of the proposed framework.
Supervised Contrastive Learning of Multi-Modal Anatomy
Medical images from different modalities can have drastically different intensities, textures, and levels of detail. Conventional image similarity measures, such as the sum of squared differences (SSD), cross-correlation (CC), and mutual information (MI), perform poorly at describing the similarity between images from very different modalities, e.g., MRI and ultrasound. Song et al. and Haskins et al. showed that, for describing the similarity between MRI and TRUS, DL-based methods outperform MI and handcrafted features. Radford et al. demonstrated that the combination of DL-based feature vectors and contrastive learning can bridge modality gaps as large as that between text and images. The idea is very similar to our application, in which we aim to match corresponding ultrasound frames and MRI slices via deep learning-based embeddings. In CLIP, matching text-image pairs are considered positive pairs, while mismatched pairs are considered negative pairs. Similarly, we treat an ultrasound frame and its physically corresponding MRI slice as a positive pair, and frame-slice pairs that are far apart as negative pairs.
We use two separate standard ResNet-18s to encode any given MRI slice and ultrasound frame into feature vectors z and u of length 256, respectively. The feature vectors predicted by the MRI and US encoders for a given set of slices and frames are denoted Z_MR = {z_i | i = 1, ···, N} and U_US = {u_i | i = 1, ···, N}. During training, we sample N evenly spaced frames from the ultrasound video and their corresponding MRI slices, based on the ground truth ultrasound tracking information and the predefined MRI-TRUS registration. The set is encoded into Z_MR and U_US of dimensions N × 256. After normalizing each feature vector, we obtain the similarity matrix S = U_US · Z_MR^T = {u_i · z_j | i, j ∈ {1, ···, N}} of shape N × N. This training process is illustrated in the Cross-Modality Alignment portion of Figure 1.
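As an illustrative sketch of this step (not the authors' implementation; random NumPy arrays stand in for the encoder outputs), the similarity matrix can be built from row-normalized feature matrices:

```python
import numpy as np

def similarity_matrix(u_feats, z_feats):
    """Build the N x N cosine-similarity matrix S between ultrasound
    feature vectors (rows of u_feats) and MRI feature vectors (rows of
    z_feats), each of shape (N, 256)."""
    u = u_feats / np.linalg.norm(u_feats, axis=1, keepdims=True)
    z = z_feats / np.linalg.norm(z_feats, axis=1, keepdims=True)
    return u @ z.T  # S[i, j] = u_i . z_j

rng = np.random.default_rng(0)
U = rng.normal(size=(30, 256))  # 30 sampled ultrasound frame embeddings
Z = rng.normal(size=(30, 256))  # 30 corresponding MRI slice embeddings
S = similarity_matrix(U, Z)
print(S.shape)  # (30, 30)
```

Because every feature vector is normalized, each entry of S lies in [−1, 1], and S_ii is the similarity of a positive pair.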
| 1 | \mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{US} + \mathcal{L}_{MR}\right)

where

| 2 | \mathcal{L}_{US} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(S_{ii}/\tau)}{\sum_{j=1}^{N} \exp(S_{ij}/\tau)}

and

| 3 | \mathcal{L}_{MR} = -\frac{1}{N} \sum_{j=1}^{N} \log \frac{\exp(S_{jj}/\tau)}{\sum_{i=1}^{N} \exp(S_{ij}/\tau)}
Our goal is to maximize the similarity of feature embeddings between corresponding US frames and MRI slices. Supervising feature embeddings falls into the category of feature representation learning, for which the most acclaimed method is contrastive learning. The contrastive loss aims to maximize the similarity between positive sample pairs and minimize the similarity between negative sample pairs in the embedding space. Since the corresponding ultrasound frames and MRI slices are aligned in the same order, the diagonal elements of S represent the similarities of positive image pairs, and the rest are negative. The contrastive loss function is formalized as eq 1. Specifically, for each row and column, we compute the negative log-likelihood of the elements after applying softmax. In Figure 2, we show examples of the similarity matrix S at different training stages. As training proceeds, we observe the matrix diagonal being enhanced and the negative samples being suppressed.
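The row- and column-wise negative log-likelihood described above can be sketched as follows; the fixed temperature value and the NumPy implementation are illustrative stand-ins for the training code, which uses a learnable τ:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def symmetric_contrastive_loss(S, tau=0.07):
    """CLIP-style loss on similarity matrix S: diagonal entries are
    positive US/MRI pairs, everything else is negative."""
    n = S.shape[0]
    p_rows = softmax(S / tau, axis=1)  # each US frame over all MRI slices
    p_cols = softmax(S / tau, axis=0)  # each MRI slice over all US frames
    nll_rows = -np.log(p_rows[np.arange(n), np.arange(n)]).mean()
    nll_cols = -np.log(p_cols[np.arange(n), np.arange(n)]).mean()
    return 0.5 * (nll_rows + nll_cols)

aligned = np.eye(8)         # idealized: positives perfectly matched
shuffled = np.eye(8)[::-1]  # positives entirely mismatched
print(symmetric_contrastive_loss(aligned) < symmetric_contrastive_loss(shuffled))  # True
```

The loss is near zero when the diagonal dominates and grows as positive pairs lose similarity, which is exactly the behavior visible in the similarity matrices across training epochs.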
2.
Examples of similarity matrix S at different stages of training.
Although in this work the TRUS is mapped to the MRI with a predefined registration, the registration can never be perfect nor correct for all deformations and misalignments. It is also unreasonable for mismatched adjacent slices/frames to be penalized as heavily as slices/frames that are physically far apart. Therefore, the supervision provided by eq 1 is too sharp for our task. To soften the constraint of the loss function, gradually guide the training process, and reduce overfitting on the training set, we apply nonoverlapping average pooling to the similarity matrix S and perform contrastive learning at different scales of tolerance. Since adjacent rows and columns also represent adjacent frames and slices, performing contrastive learning on the pooled similarity matrix adds a distance constraint to the training process. This process is described in eq 4, where k stands for both the kernel size and the stride of the pooling layers.
| 4 | \mathcal{L}_{total} = \sum_{k \in K} \mathcal{L}\big(\mathrm{AvgPool}_{k}(S)\big)
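The pooling step can be illustrated with a minimal sketch (NumPy, hypothetical values): a frame matched to an adjacent slice is penalized at full resolution, but its similarity mass falls on the diagonal block after pooling with k = 2:

```python
import numpy as np

def avg_pool2d(S, k):
    """Nonoverlapping average pooling with kernel size = stride = k,
    as applied to the similarity matrix before the contrastive loss."""
    n = S.shape[0] // k
    return S[:n * k, :n * k].reshape(n, k, n, k).mean(axis=(1, 3))

# A frame matched to the *adjacent* slice: off-diagonal at full resolution...
S = np.zeros((4, 4))
S[0, 1] = 1.0         # frame 0 looks most similar to slice 1
P = avg_pool2d(S, 2)  # ...but the mass lands on the diagonal block after pooling
print(P)
```

After pooling, the mismatch by one slice is no longer penalized, which softens the supervision exactly as intended.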
Inference-Time Candidate Trajectory Retrieval
Searching for 2D slices in 3D space is a problem with an effectively infinite search space. Therefore, we initialize the search with a similar MR-ultrasound sample from a predefined support set. The support set consists of ground truth MRI-US pairs available at test time; each case includes an MRI volume, a segmented prostate mask, and an associated ultrasound scan trajectory.
To retrieve the most similar scanning trajectory, we compare the segmented prostate volumes across cases. Specifically, we evaluate similarity based on the shape, dimensions, and bounding box of the prostate segmentation mask in MRI space. This identifies a support case whose anatomical context closely resembles that of the test case. We then perform rigid MRI-to-MRI registration between the support case and the inference case, estimating a transformation that aligns the MRI volume (and corresponding prostate mask) of the support case with that of the test case. Applying this transformation to each frame in the ultrasound trajectory of the selected support case maps the entire trajectory into the coordinate space of the current test subject. This process is illustrated in the inference portion of Figure 1.
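The exact similarity measure over the masks is not specified here; the sketch below uses the L2 distance between bounding-box side lengths as an illustrative stand-in for comparing mask dimensions:

```python
import numpy as np

def bbox_dims(mask):
    """Side lengths (in voxels) of the axis-aligned bounding box
    of a binary prostate mask."""
    idx = np.argwhere(mask)
    return idx.max(axis=0) - idx.min(axis=0) + 1

def pick_support_case(test_mask, support_masks):
    """Illustrative retrieval: choose the support case whose prostate
    bounding-box dimensions are closest (L2) to the test case's."""
    d_test = bbox_dims(test_mask)
    dists = [np.linalg.norm(bbox_dims(m) - d_test) for m in support_masks]
    return int(np.argmin(dists))
```

In practice, shape descriptors beyond the bounding box could be combined in the same nearest-neighbor scheme.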
The rigid MRI-to-MRI registration is initialized by aligning the spatial centers of the two volumes. A mutual information-based similarity metric (implemented with 50 histogram bins and evaluated on a 1% random sample of voxels) guides the optimization. We use a gradient descent optimizer with a fixed learning rate of 1.0 and 100 iterations, employing physical-shift-based parameter scaling and halting when the parameter updates fall below 1 × 10^−6 for 10 consecutive iterations. Linear interpolation ensures smooth intensity resampling, and the resulting rigid transformation is applied to resample the moving image into the fixed image space.
Local Perturbation for Frame Matching
In our data set, the z-axis represents the superior-inferior direction, and the y-axis represents the anterior-posterior direction. The rigid transformation between ultrasound frames can be described with six parameters: Δt_x, Δt_y, Δt_z, Δα_x, Δα_y, and Δα_z. The mean and standard deviation of all interframe transformations in the data set are {Δt_x, Δt_y, Δt_z, Δα_x, Δα_y, Δα_z} = {−0.24 ± 1.2, 0.50 ± 1.3, −5.59 ± 4.3, 0.59 ± 0.40, −0.02 ± 0.22, −0.017 ± 0.13}, where the first three values are translations in millimeters and the last three are rotations in degrees about the x, y, and z axes. These statistics show that most of the variance in the transrectal ultrasound scan trajectory comes from z-axis translation and x-axis rotation. To ensure robustness and account for possible MR-US registration error, for each location in the candidate trajectory we sample 25 additional locally perturbed candidates for each ultrasound frame, within a range of 2× the standard deviation of each transformation parameter.
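A minimal sketch of the perturbation sampling follows; the uniform distribution over the ±2σ range is an assumption, since the text states only the range:

```python
import numpy as np

# Standard deviations of the six interframe parameters from the data set:
# translations (t_x, t_y, t_z) in mm, rotations (a_x, a_y, a_z) in degrees.
SIGMA = np.array([1.2, 1.3, 4.3, 0.40, 0.22, 0.13])

def sample_perturbations(rng, n=25, scale=2.0):
    """Draw n locally perturbed candidates around one trajectory
    location, each parameter uniform within +/- scale * sigma."""
    return rng.uniform(-scale * SIGMA, scale * SIGMA, size=(n, 6))

rng = np.random.default_rng(0)
deltas = sample_perturbations(rng)
print(deltas.shape)  # (25, 6)
```

Each row is one candidate offset applied to the six rigid parameters of a frame in the candidate trajectory.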
Finally, we compute the feature cosine similarity between each query frame and all candidate MRI slices and find the candidates with the highest similarity. The physical location of the query frame is assigned as the similarity-weighted average position of the top 3 candidate slices.
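This matching step can be sketched as follows (illustrative NumPy code; the function and variable names are hypothetical):

```python
import numpy as np

def localize_frame(query_feat, cand_feats, cand_positions, top_k=3):
    """Assign the query frame the similarity-weighted average position
    of the top_k most similar candidate MRI slices (cosine similarity
    of normalized feature vectors)."""
    q = query_feat / np.linalg.norm(query_feat)
    c = cand_feats / np.linalg.norm(cand_feats, axis=1, keepdims=True)
    sim = c @ q
    top = np.argsort(sim)[-top_k:]  # indices of the top_k candidates
    w = sim[top] / sim[top].sum()   # similarity weights
    return (w[:, None] * cand_positions[top]).sum(axis=0)

# A candidate identical to the query dominates the weighted average.
q = np.array([1.0, 0.0, 0.0])
cands = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
pos = np.array([[0.0, 0.0, 10.0], [0.0, 0.0, 20.0], [0.0, 0.0, 30.0]])
print(localize_frame(q, cands, pos))
```

The weighting makes the assignment a soft interpolation between nearby candidates rather than a hard nearest-slice choice.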
Experiments and Results
Data Set
This study was performed with an MRI-TRUS data set of 535 cases in total, partitioned into 425 training cases, 56 validation cases, and 54 test cases. The data were collected from clinical-trial MRI-TRUS fusion-guided biopsy procedures. Each case contains a T2-weighted MRI volume, a one-pass transrectal ultrasound video, and EM-tracking data for each ultrasound video frame. As part of the clinical procedure, manual registration between the TRUS and MRI volumes was also performed.
Implementation Details and Evaluation Metrics
The feature encoders of the proposed method are trained with a batch size of 16 on a single GPU with 32 GB of VRAM. The learning rate starts at 1 × 10^−4 and decays by a factor of 0.7 every 10 epochs. For each training case, we sample 30 frames from the entire ultrasound video. Following CLIP, we use a learnable temperature τ in the contrastive loss function to avoid hyperparameter tuning.
We used three metrics to evaluate the accuracy of ultrasound frame-tracking results: the distance error, final drift, and drift rate. All three metrics use the corner points of the predicted and ground truth ultrasound frames. The distance error measures the mean Euclidean distance between all predicted and ground truth corner points in a single case. The final drift compares only the corner points of the last frame of the ultrasound video to evaluate the accumulated error of the entire reconstructed volume. The drift rate is the final drift divided by the ground truth length of the video to account for the size difference between the scans.
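The three metrics can be sketched as follows; defining the scan length as the path length of the ground truth frame centers is an assumption, since the text says only "ground truth length of the video":

```python
import numpy as np

def tracking_metrics(pred_corners, gt_corners):
    """pred_corners, gt_corners: (n_frames, 4, 3) corner points of each
    ultrasound frame in 3D space (mm).  Returns (distance error,
    final drift, drift rate in %)."""
    per_point = np.linalg.norm(pred_corners - gt_corners, axis=-1)
    distance_error = per_point.mean()   # all frames, all corners
    final_drift = per_point[-1].mean()  # last frame only
    centers = gt_corners.mean(axis=1)   # frame centers along the scan
    length = np.linalg.norm(np.diff(centers, axis=0), axis=1).sum()
    drift_rate = 100.0 * final_drift / length
    return distance_error, final_drift, drift_rate
```

For example, a prediction offset from the ground truth by a constant 3 mm over a 90 mm scan yields a distance error and final drift of 3 mm and a drift rate of about 3.3%.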
Benchmarking Results
We benchmark the performance of our method against that of Prevost et al., Guo et al., and Luo et al. The quantitative results are listed in Table 1. For all evaluation metrics, the proposed method significantly outperforms the benchmarked methods (p < 0.001, paired t-test). The final drift measures the accumulated error at the end of the predicted scan trajectory: the longer and larger the scan, the higher the accumulated error. Since the proposed method is free from error accumulation, final drift is the metric in which it surpasses conventional methods the most. The distance error evaluates the accuracy of all frames in the trajectory. Conventional methods fare less poorly on distance error, since their errors tend to be smaller at the beginning of each scan. Nevertheless, the proposed method delivers superior performance that is consistent from the first frame to the last.
1. Performance Comparison between the Proposed and Other Freehand Ultrasound Reconstruction Methods.
| method | distance error (mm) | final drift (mm) | drift rate (%) |
|---|---|---|---|
| Prevost et al. | 24.36 ± 9.1 | 44.12 ± 22.6 | 20.99 ± 9.9 |
| Guo et al. | 17.2 ± 9.5 | 30.7 ± 26.3 | 14.1 ± 9.5 |
| Luo et al. | 20.66 ± 11.2 | 36.85 ± 26.3 | 17.22 ± 10.5 |
| proposed | 5.76 ± 2.2 | 11.2 ± 5.1 | 5.29 ± 2.3 |
In Figure 3, we show two examples of the inference-time matching between ultrasound query frames and candidate MRI slices. Although in the experiments we sample 50 slices for each query frame, only 10 are shown here to conserve space. Here, s denotes the similarity between the query frame and the corresponding candidate slice after softmax. The figure shows that the similarity peaks when the candidate slice is closest to the target slice and fades as the candidate moves physically further from the target. The differences between the similarity values are small even after softmax, due to the normalization of the feature vectors.
3.

Examples of query ultrasound frames and their similarity s with 10 candidate MRI slices at inference time. Subfigures (a) and (b) show two separate cases.
The reconstructed ultrasound scan trajectories are depicted in Figure 4. The dark red wireframe represents the prediction of our method, and the black wireframe represents the ground truth. While the estimated trajectories of most other methods deviate from the ground truth, the red trajectory stays close to it, resulting in low drift error. The full reconstructed volumes are visualized in Figure 5. In general, conventional methods tend to produce overly smooth trajectories even when the ground truth scan contains some turbulence, as shown in Figure 5a,b. On the other hand, in Figure 5c, where the ground truth volume is indeed relatively smooth, the proposed method overestimates a small shake. In Figure 5d, the ground truth volume follows a smooth arc, but our method predicts a straighter volume of similar size, while the other methods reconstruct volumes with larger drifts. In contrast, in Figure 5e, some methods predict significantly shorter volumes, while our method remains stable in size.
4.

Visualizations of ground truth vs predicted trajectories of ultrasound scans. Subfigures (a) and (b) show two separate cases.
5.

Visualization of example reconstructed ultrasound volumes. Rows (a–e) are separate cases; each column shows the result of a different method.
Ablation Studies
While the features encoded by our trained networks reliably pair ultrasound frames with corresponding MRI slices, the process of retrieving the MRI slices is also important. Table 2 shows the results of two ablation studies. First, we replace the candidate trajectory retrieval with a random selection from the support set during inference. Compared with the full pipeline, we observe a roughly 50% increase in error. This shows that it is important to start inference from a seed trajectory that fits the test case well, which our trajectory selection method provides. However, this observation also indicates that if the inference case is anatomically unique or follows an unusual scan pattern, for example, with a rare abnormality or a deformation unseen in the support set, the reconstruction may be prone to larger errors. Second, we test removing the local perturbation step and observe a slight decrease in accuracy, likely caused by the limited slice choices available to the matching algorithm. The impact is not as significant as that of trajectory retrieval, but it still shows that the pipeline benefits from the extra slice choices.
2. Performance Comparison for Ablation Studies.
| method | distance error (mm) | final drift (mm) | drift rate (%) |
|---|---|---|---|
| random retrieval | 8.31 ± 4.12 | 16.19 ± 9.0 | 10.56 ± 4.4 |
| no local perturb | 6.59 ± 2.6 | 12.70 ± 6.2 | 6.79 ± 3.0 |
| proposed | 5.76 ± 2.2 | 11.2 ± 5.1 | 5.29 ± 2.3 |
Discussion and Future Direction
The authors acknowledge that the current implementation of candidate trajectory retrieval relies on the assumption that, when performing one-pass TRUS scans, clinicians usually adhere to a smooth scan whose degrees of freedom are limited by the anatomy itself. If the ground truth trajectory of an inference case is an outlier with respect to the support set, the current method is prone to larger errors.
Another future direction is to address the lack of smoothness in the predicted trajectories. Such turbulence in the predictions is expected, since the prediction for each frame is independent of all other frames. In future work, we could apply forecasting methods such as the Kalman filter to narrow the search range, thereby smoothing the predicted trajectory.
In addition, future work may also explore the generalizability of the proposed method. Addressing domain shifts by combining images from diverse sources and demonstrating robustness across varying conditions remains a critical challenge for real-world deployment.
Conclusions
This paper introduces a freehand ultrasound volume reconstruction framework that leverages preoperative MRI for anatomical guidance. The proposed method rethinks the existing freehand ultrasound reconstruction paradigm, providing a solution that rids the reconstruction of accumulated errors. We bridge the modality difference between MRI and ultrasound through contrastive representation learning, thereby performing interframe motion estimation in an accurate and self-explanatory manner. Designed with the clinical workflow in mind, the algorithm makes maximal use of clinically available resources at the time of operation and performs both ultrasound reconstruction and US-MRI registration in one step.
Acknowledgments
Philips and NIH have a patent licensing agreement under which NIH receives royalties, a portion of which is then given to B.J.W. This work was supported in part by the NIH Center for Interventional Oncology and the Intramural Research Program of the National Institutes of Health, via intramural NIH Grants Z1A CL040015, 1ZIDBC011242. NIH may have intellectual property in the field. B.J.W. is Principal Investigator on the following Cooperative Research and Development Agreements (CRADAs) between NIH and industry: Philips, Celsion Corp/Imunon, Siemens Healthineers/Varian Interventional Systems, NVIDIA, and ProMaxo.
The views, information or content, and conclusions presented do not necessarily represent the official position or policy of, nor should any official endorsement be inferred on the part of, the Clinical Center, the National Institutes of Health, or the Department of Health and Human Services.
The authors declare the following competing financial interest(s): Philips and NIH have a patent licensing agreement under which NIH receives royalties, a portion of which are then given to BW. This work was supported in part by the NIH Center for Interventional Oncology and the Intramural Research Program of the National Institutes of Health, via intramural NIH Grants Z1A CL040015, 1ZIDBC011242. NIH may have intellectual property in the field. BW is Principal Investigator on the following Cooperative Research and Development Agreements (CRADAs) between NIH and industry: Philips, Celsion Corp/Imunon, Siemens Healthineers/Varian Interventional Systems, NVIDIA, ProMaxo.
Published as part of Chemical & Biomedical Imaging special issue “AI for Chemical and Biomedical Imaging”.
References
- Xu S., Kruecker J., Turkbey B., Glossop N., Singh A. K., Choyke P., Pinto P., Wood B. J. Real-time MRI-TRUS fusion for guidance of targeted prostate biopsies. Comput. Aided Surg. 2008;13:255–264. doi: 10.3109/10929080802364645.
- Guo H., Chao H., Xu S., Wood B. J., Wang J., Yan P. Ultrasound Volume Reconstruction From Freehand Scans Without Tracking. IEEE Trans. Biomed. Eng. 2022;70(3):970–979. doi: 10.1109/TBME.2022.3206596.
- Logan J. K., Rais-Bahrami S., Turkbey B., Gomella A., Amalou H., Choyke P. L., Wood B. J., Pinto P. A. Current status of magnetic resonance imaging (MRI) and ultrasonography fusion software platforms for guidance of prostate biopsies. BJU Int. 2014;114:641–652. doi: 10.1111/bju.12593.
- Siddiqui M. M., Rais-Bahrami S., Turkbey B., George A. K., Rothwax J., Shakir N., Okoro C., Raskolnikov D., Parnes H. L., Linehan W. M., et al. Comparison of MR/ultrasound fusion-guided biopsy with ultrasound-guided biopsy for the diagnosis of prostate cancer. JAMA. 2015;313:390–397. doi: 10.1001/jama.2014.17942.
- Chen J.-F., Fowlkes J. B., Carson P. L., Rubin J. M. Determination of scan-plane motion using speckle decorrelation: Theoretical considerations and initial test. Int. J. Imaging Syst. Technol. 1997;8:38–44. doi: 10.1002/(SICI)1098-1098(1997)8:1<38::AID-IMA5>3.0.CO;2-U.
- Tuthill T. A., Krücker J., Fowlkes J. B., Carson P. L. Automated three-dimensional US frame positioning computed from elevational speckle decorrelation. Radiology. 1998;209:575–582. doi: 10.1148/radiology.209.2.9807593.
- Chang R.-F., Wu W.-J., Chen D.-R., Chen W.-M., Shu W., Lee J.-H., Jeng L.-B. 3-D US frame positioning using speckle decorrelation and image registration. Ultrasound Med. Biol. 2003;29:801–812. doi: 10.1016/S0301-5629(03)00036-X.
- Prevost R., Salehi M., Jagoda S., Kumar N., Sprung J., Ladikos A., Bauer R., Zettinig O., Wein W. 3D freehand ultrasound without external tracking using deep learning. Med. Image Anal. 2018;48:187–202. doi: 10.1016/j.media.2018.06.003.
- Luo M., Yang X., Huang X., Huang Y., Zou Y., Hu X., Ravikumar N., Frangi A. F., Ni D. Self Context and Shape Prior for Sensorless Freehand 3D Ultrasound Reconstruction. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VI; 2021; pp 201–210.
- Luo M., Yang X., Wang H., Du L., Ni D. Deep Motion Network for Freehand 3D Ultrasound Reconstruction. In International Conference on Medical Image Computing and Computer-Assisted Intervention; 2022; pp 290–299.
- Wei W., Haishan X., Alpers J., Rak M., Hansen C. A deep learning approach for 2D ultrasound and 3D CT/MR image registration in liver tumor ablation. Comput. Methods Programs Biomed. 2021;206:106117. doi: 10.1016/j.cmpb.2021.106117.
- Sonn G. A., Chang E., Natarajan S., Margolis D. J., Macairan M., Lieu P., Huang J., Dorey F. J., Reiter R. E., Marks L. S. Value of targeted prostate biopsy using magnetic resonance-ultrasound fusion in men with prior negative biopsy and elevated prostate-specific antigen. Eur. Urol. 2014;65:809–815. doi: 10.1016/j.eururo.2013.03.025.
- Scheltema M. J., Tay K., Postema A., De Bruin D., Feller J., Futterer J., George A., Gupta R., Kahmann F., Kastner C., et al. Utilization of multiparametric prostate magnetic resonance imaging in clinical practice and focal therapy: report from a Delphi consensus project. World J. Urol. 2017;35:695–701. doi: 10.1007/s00345-016-1932-1.
- Song X., Chao H., Xu X., Guo H., Xu S., Turkbey B., Wood B. J., Sanford T., Wang G., Yan P. Cross-modal attention for multi-modal image registration. Med. Image Anal. 2022;82:102612. doi: 10.1016/j.media.2022.102612.
- Haskins G., Kruecker J., Kruger U., Xu S., Pinto P. A., Wood B. J., Yan P. Learning deep similarity metric for 3D MR-TRUS image registration. Int. J. Comput. Assist. Radiol. Surg. 2019;14:417–425. doi: 10.1007/s11548-018-1875-7.
- Heinrich M. P., Jenkinson M., Bhushan M., Matin T., Gleeson F. V., Brady S. M., Schnabel J. A. MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration. Med. Image Anal. 2012;16:1423–1435. doi: 10.1016/j.media.2012.05.008.
- Radford A., Kim J. W., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A., Mishkin P., Clark J., et al. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning; 2021; pp 8748–8763.
- Liu X., Zhang F., Hou Z., Mian L., Wang Z., Zhang J., Tang J. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. 2021;35:857–876. doi: 10.1109/TKDE.2021.3090866.
- He K., Zhang X., Ren S., Sun J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016; pp 770–778.
- Wu Z., Xiong Y., Yu S. X., Lin D. Unsupervised Feature Learning via Non-Parametric Instance Discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018; pp 3733–3742.
- Khosla P., Teterwak P., Wang C., Sarna A., Tian Y., Isola P., Maschinot A., Liu C., Krishnan D. Supervised Contrastive Learning. In Advances in Neural Information Processing Systems 33; 2020; pp 18661–18673.
- Chen T., Kornblith S., Norouzi M., Hinton G. A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning; 2020; pp 1597–1607.



