Author manuscript; available in PMC: 2025 Mar 1.
Published in final edited form as: IEEE Trans Biomed Eng. 2024 Feb 26;71(3):1084–1091. doi: 10.1109/TBME.2023.3326799

Reliable Prostate Cancer Risk Mapping from MRI Using Targeted and Systematic Core Needle Biopsy Histopathology

Tal Zeevi 1, Michael S Leapman 2, Preston C Sprenkle 3, Rajesh Venkataraman 4, Lawrence H Staib 5, John A Onofrey 6
PMCID: PMC10901528  NIHMSID: NIHMS1949922  PMID: 37874731

Abstract

Objective:

To compute a dense prostate cancer risk map for the individual patient post-biopsy from magnetic resonance imaging (MRI), and to provide a more reliable evaluation of its fitness in prostate regions that were not identified as suspicious for cancer by a human reader in pre- and intra-biopsy imaging analysis.

Methods:

Low-level pre-biopsy MRI biomarkers from targeted and non-targeted biopsy locations were extracted and statistically tested for representativeness against biomarkers from non-biopsied prostate regions. A probabilistic machine learning classifier was optimized to map biomarkers to their core-level pathology, followed by extrapolation of pathology scores to non-biopsied prostate regions. Goodness-of-fit was assessed at targeted and non-targeted biopsy locations for the post-biopsy individual patient.

Results:

Our experiments showed high predictability of imaging biomarkers in differentiating histopathology scores in thousands of non-targeted core-biopsy locations (ROC-AUCs: 0.85-0.88), but also high variability between patients (Median ROC-AUC [IQR]: 0.81-0.89 [0.29-0.40]).

Conclusion:

The sparseness of prostate biopsy data makes the validation of a whole-gland risk mapping a non-trivial task. Previous studies (i) focused on targeted-biopsy locations, although biopsy specimens drawn from systematically scattered locations across the prostate constitute a more representative sample of non-biopsied regions; and (ii) estimated predictive power across predicted instances (e.g., biopsy specimens) with no patient distinction, which may lead to unreliable estimation of model fitness for the individual patient due to variation between patients in instance count, imaging characteristics, and pathologies.

Significance:

This study proposes a personalized whole-gland prostate cancer risk mapping post-biopsy to allow clinicians to better stage and personalize focal therapy treatment plans.

Keywords: Prostate Cancer, Core Needle Biopsy, Cancer Risk Mapping, Magnetic Resonance Imaging, Focal Therapy, Personalized Medicine, Machine Learning Reliability

I. Introduction

PROSTATE biopsy is commonly performed under the guidance of transrectal ultrasound (TRUS). During this procedure, a thin hollow needle is used to draw a sample of tubular tissue cores from locations: (i) suspected to be malignant as identified on a pre-biopsy prostate magnetic resonance imaging (MRI) scan; (ii) suspected to be malignant as identified in real-time on the ultrasound scan; and (iii) systematically scattered across the prostate gland [1], [2]. Beyond being an invasive and unpleasant medical procedure, biopsy allows only a sparse mapping of prostate cells with the sampled tissues making up less than 0.5% of the prostate volume [3]. This limited tissue sampling leads to an imperfect sensitivity of histopathology in the detection of prostate cancer (PCa) [4].

Image-guided biopsy software systems use image registration to provide a two-way information exchange between the TRUS and pre-biopsy MRI. Regions suspected of being cancerous as segmented by a radiologist on the pre-biopsy MRI are localized within the ultrasound space – allowing the physician to navigate and target these specific locations during the procedure. Biopsy sampling sites as recorded on the ultrasound scan are projected back onto the MRI space. Later, tissue-core pathologies are assigned to their corresponding sampling sites – generating a potential information route that could bypass biopsy and allow further study of the relationship between imaging and pathology [5], [6].

Multiple attempts have been made to learn the mapping function from pre- and intra-biopsy imaging voxels to post-biopsy pathology using machine learning (ML) [7]. Although these studies shared a similar goal, their data and the manner in which they were processed differed in a number of key aspects: (i) ground truth annotation method (visual analysis provided by a radiologist, histopathology of core-biopsy specimens, histopathology of prostatectomy sections); (ii) annotated volume of interest (VOI) (gland region, lesion, tissue-core); (iii) resolution and scale of the predicted variable (binary, e.g. benign/cancerous, ordinal, e.g. Gleason scores); (iv) pathology-imaging registration method (human-based, software-based); and (v) imaging modalities and sequences (bi-/multi-parametric MRI (bp/mpMRI) and US). Table I classifies selected studies according to these five aspects.

TABLE I.

ML Studies Correlating MRI and Prostate Cancer Pathology

Imaging | Ground Truth: Visual Inspection | Ground Truth: Histopathology (Biopsy Core) | Ground Truth: Histopathology (Prostatectomy)
T2W | Alkadi et al. [8] (b,l,h) | Khosravi et al. [9] (o,l,h) | -
bpMRI | Tsehay et al. [10] (b,l,h); Kohl et al. [11] (b,l,h); Algohary et al. [12] (b,l,h) | De Vente et al. [13] (o,l,h); Kwak et al. [14] (b,l,s); Schelb et al. [16] (b,r,s) | Cao et al. [15] (o,l,s)
mpMRI | Kiraly et al. [17] (b,l,h); Liu et al. [18] (b,l,h); Antonelli et al. [19] (b,l,h); Mehralivand et al. [20] (o,c,h); Mehrtash et al. [21] (b,c,s); Litjens et al. [22] (b,c,s) | Orczyk et al. [23] (b,r,h) | -

Studies are categorized by: predicted variable scale (Binary, Ordinal), volume of interest (Region, Lesion, Core), and registration method (Human-based, Software-based). E.g. [#]O,L,H indicates a study in which an ordinal level pathology score was assigned by a radiologist to a voxel cluster comprising a lesion.

To the best of our knowledge, none of the MRI studies that used core tissues as their predicted VOI included non-targeted core biopsies in their datasets. Similarly, among the studies that based their ground truth on core-biopsy pathology, only Orczyk et al. [23] (n=20 test subjects), Schelb et al. [16] (n=62 test subjects), and Kwak et al. [14] (n=136 test subjects) considered systematic cores. However, in Schelb et al. [16] and Orczyk et al. [23], histopathology scores of systematic and targeted cores were aggregated together within pre-defined gland regions, which were then used as the target VOI, and in Kwak et al. [14], only 117 systematically sampled lesions were included - all of which were benign. In all cases, the ability to assess predictive power in regions unsuspected for PCa was very limited. Furthermore, the use of few VOIs per patient (such as gland regions, lesions, and targeted core biopsies) in previous studies limited their ability to provide a predictive assessment at the individual patient level. Instead, predictive power was assessed at the VOI level across the entire cohort with no patient distinction, which may be heavily biased due to differences in pathology distribution, imaging and targeting inaccuracies, and sample size between patients. Therefore, the question of how applicable these models are to individual patients remains open.

To address the limitations mentioned above, we sought to: (i) assess the predictive power of ML in mapping pre-biopsy imaging to histopathology in prostate regions that were not identified as suspicious for cancer by a human-reader in pre- and intra-biopsy imaging analysis; (ii) evaluate ML performance for the individual post-biopsy patient in the pre-treatment decision step. To achieve these aims, we explored a large public dataset of tracked biopsy sessions that includes pre-biopsy MRI scans annotated with core-level histopathology scores for both targeted and systematic biopsy samples. To the best of our knowledge, this is the first time this dataset has been used for such a purpose. We employed a two-step machine learning process consisting of unsupervised feature extraction followed by a classification step for mapping voxel clusters on pre-biopsy MRI to core-biopsy histopathology for each biopsy session in our cohort. We evaluated the performance of the models in predicting core-level pathology for each biopsy session individually and for the entire cohort.

II. Methods

A. The Data

A publicly available dataset of tracked prostate biopsy sessions performed at the University of California, Los Angeles, during the years 2004-2011 [24]-[26] was used. The following data objects were included: (i) a pre-biopsy prostate gland axial T2-weighted (T2W) MRI acquired on a 3T Trio, Verio, or Skyra scanner (Siemens, Erlangen, Germany); (ii) binary segmentation masks of the prostate gland and of lesions identified on pre-biopsy mpMRI using ProFuseCAD software (Eigen Health, Grass Valley, CA); (iii) two spatial coordinates representing the base and the tip of the biopsy needle for every biopsy sampling site, recorded from the Artemis biopsy device in both the T2W and TRUS image spaces; (iv) the sampling approach (systematic/targeted) and Gleason PCa grade group (GG) (benign, GG=1, GG≥2) for each sampling site; and (v) patient- and lesion-level data such as prostate-specific antigen (PSA) levels, prostate and lesion volumes, and MRI acquisition and biopsy session dates.

B. Data Preparation

Pre-biopsy T2W MRI scans and prostate segmentation masks were resampled to a common voxel spacing, and the intensities of each MRI scan were then individually standardized. Adjacent non-overlapping rectangular cuboid patches were extracted from the MRI's prostate gland region. Patches whose centers were located on or near the prostate border were allowed to include surrounding areas outside the gland. Needle base and tip coordinates were projected onto the new resampled voxel space, and Bresenham's line algorithm was used to interpolate the voxels forming the straight-line path of the biopsy needle between the two points. Each patch was labeled with the highest-GG core-tissue pathology across all needles that passed through it. Patches representing regions that were not biopsied were excluded from the dataset. If a patient underwent multiple biopsy sessions associated with a single MRI scan, only the closest biopsy session following the MRI acquisition time was considered.
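The needle-path interpolation and maximum-grade patch labeling described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: dense parametric sampling stands in for Bresenham's line algorithm, and grades use a hypothetical integer encoding (benign=0, GG=1 → 1, GG≥2 → 2, with -1 marking non-biopsied patches).

```python
import numpy as np

def needle_path_voxels(base, tip):
    """Approximate the voxels crossed by the straight needle path between
    base and tip (voxel coordinates). Dense parametric sampling followed
    by rounding stands in for Bresenham's line algorithm."""
    base, tip = np.asarray(base, float), np.asarray(tip, float)
    n = int(np.ceil(np.abs(tip - base).max())) + 1
    pts = np.linspace(base, tip, num=max(n, 2))
    return np.unique(np.round(pts).astype(int), axis=0)

def label_patches(patch_grid_shape, patch_size, needles):
    """Assign each patch the highest Gleason grade group among all needles
    passing through it; -1 marks non-biopsied patches. `needles` is a
    list of (base, tip, gg) tuples in voxel coordinates."""
    labels = np.full(patch_grid_shape, -1, dtype=int)
    size = np.asarray(patch_size)
    for base, tip, gg in needles:
        for v in needle_path_voxels(base, tip):
            idx = tuple(v // size)  # which patch this voxel falls in
            if all(0 <= i < s for i, s in zip(idx, patch_grid_shape)):
                labels[idx] = max(labels[idx], gg)
    return labels
```

For example, a needle running along the x-axis through three 9 × 9 × 3 patches labels exactly those three patches with its core's grade.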

Imaging biomarkers summarizing low-level voxel patterns in each patch were extracted from the latent space of two independently optimized neural networks: (i) an autoencoder network that we optimized to reconstruct pre-biopsy T2W MRI patches (a detailed description of the optimization pipeline is given in the Supplement); and (ii) a publicly available residual convolutional neural network (ResNet-50) that was pre-optimized on a very large diagnostic image-set to classify cell types in single channel microscopic scans of tissue sections of the human kidney cortex [27].
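The paper's extractors are a trained convolutional autoencoder and a pre-trained ResNet-50; as a lightweight stand-in for the same idea - optimize a network to reconstruct its input, then read its bottleneck activations as imaging biomarkers - a shallow scikit-learn autoencoder can be sketched. The 72-dimensional latent size follows the paper; the random patch data are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9 * 9 * 3))  # stand-in for flattened T2W patches

# Shallow stand-in for the paper's convolutional autoencoder: fit the
# network to reconstruct its own input, then treat the hidden-layer
# activations as an m=72 imaging-biomarker vector per patch.
ae = MLPRegressor(hidden_layer_sizes=(72,), activation="tanh",
                  max_iter=200, random_state=0)
ae.fit(X, X)

def encode(ae, X):
    """Hidden-layer (latent) activations of the single-hidden-layer AE."""
    return np.tanh(X @ ae.coefs_[0] + ae.intercepts_[0])

Z = encode(ae, X)  # Z.shape == (200, 72)
```

The same pattern applies to the pre-trained ResNet-50 route: run the patch through the frozen network and keep the penultimate-layer activations as the (m=2048) feature vector.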

C. Assessing Biopsy Sample Representativeness

To better appreciate the importance of evaluating predictability beyond targeted biopsy locations when assessing the quality of a whole-gland cancer risk map, a random sample of imaging biomarkers from patches in non-biopsied, systematic, and targeted biopsy locations was projected into two dimensions using the Uniform Manifold Approximation and Projection (UMAP) approach [28]. The projected two-dimensional data in all three location types for an individual biopsy session were compared to each other to assess how representative the labeled data samples at targeted and systematic biopsy locations are of the unlabeled data in non-biopsied regions. The multivariate nonparametric Cramér test for the two-sample problem [29] was used to test three hypotheses per biopsy session: (i) targeted-biopsy and non-biopsied biomarker samples are differently distributed; (ii) systematic-biopsy and non-biopsied biomarker samples are differently distributed; and (iii) biopsied and non-biopsied biomarker samples are differently distributed. Bonferroni correction was used to correct for multiple comparisons.
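The paper uses the R cramer package for this test; a Python sketch of the closely related permutation test on the energy statistic (likewise built from Euclidean interpoint distances) conveys the idea under that substitution.

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_stat(x, y):
    """Two-sample energy statistic (closely related to the Cramér test):
    large values indicate differently distributed samples."""
    n, m = len(x), len(y)
    dxy = cdist(x, y).mean()
    dxx = cdist(x, x).mean()
    dyy = cdist(y, y).mean()
    return n * m / (n + m) * (2.0 * dxy - dxx - dyy)

def permutation_pvalue(x, y, n_perm=500, seed=0):
    """Permutation p-value: shuffle pooled samples and count how often
    the permuted statistic reaches the observed one."""
    rng = np.random.default_rng(seed)
    obs = energy_stat(x, y)
    pooled = np.vstack([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        xs, ys = pooled[perm[:len(x)]], pooled[perm[len(x):]]
        count += energy_stat(xs, ys) >= obs
    return (count + 1) / (n_perm + 1)
```

Applied per session to the UMAP-reduced biomarkers of biopsied vs. non-biopsied patches, a small p-value indicates a non-representative biopsy sample.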

D. Whole Gland Prostate Cancer Risk Mapping

Imaging biomarkers from each of the neural networks and their corresponding labels were then used independently to optimize a probabilistic classifier to predict the PCa histopathology scores assigned to each patch.

1). Model Training and Validation

Stratified k-fold cross-validation (CV) was used to sample each biopsy session's data into k pairs of independent train and test sets while preserving a similar label distribution in both sets. To allow representation of all pathology categories in each of the k folds, biopsy sessions with fewer than k patches per category were excluded from the analysis. In each CV iteration, the Pearson correlation coefficient was calculated for feature pairs based on their training-set values, and one feature of every pair exceeding a predefined threshold was randomly excluded from the feature set. The reduced training set was used to optimize the parameters of a probabilistic classifier mapping patch-level feature instances to one of the discrete pathology categories.
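A minimal sketch of one such pipeline, using synthetic stand-in features and the settings the paper reports in its experimental evaluation (|r| > 0.7 correlation threshold, 5 folds, a Random Forest of 300 trees with depth 5):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def drop_correlated(X, threshold=0.7, seed=0):
    """Randomly drop one feature of every pair whose absolute Pearson
    correlation on the training set exceeds the threshold; return the
    indices of the kept features."""
    rng = np.random.default_rng(seed)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dropped = set()
    d = X.shape[1]
    for i in range(d):
        for j in range(i + 1, d):
            if i not in dropped and j not in dropped and corr[i, j] > threshold:
                dropped.add(int(rng.choice([i, j])))
    return [k for k in range(d) if k not in dropped]

# Synthetic stand-in for patch-level imaging biomarkers and labels.
X, y = make_classification(n_samples=150, n_features=20, n_informative=5,
                           random_state=0)
X = np.hstack([X, X[:, :1]])  # inject a perfectly correlated duplicate

aucs = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True,
                              random_state=0).split(X, y):
    keep = drop_correlated(X[tr])  # filter on the training fold only
    clf = RandomForestClassifier(n_estimators=300, max_depth=5,
                                 random_state=0)
    clf.fit(X[tr][:, keep], y[tr])
    aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te][:, keep])[:, 1]))
```

Note that the correlation filter is fitted inside each fold on the training portion only, so no test-set information leaks into feature selection.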

2). Core-level Predictions

Core-level predictions were estimated using an aggregation function over patch-level predictions. To increase the statistical power in our experiments while reducing the adverse impact of outlier patch-level predictions, core-level predictions were estimated using the 90th percentile of their associated patch-level predictions.
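A one-line sketch of this aggregation:

```python
import numpy as np

def core_prediction(patch_probs, q=90):
    """Aggregate patch-level cancer probabilities along a core's needle
    path into one core-level score: the 90th percentile damps outlier
    patch predictions while remaining sensitive to focal disease."""
    return float(np.percentile(patch_probs, q))
```

With NumPy's default linear interpolation, `core_prediction([0.1, 0.15, 0.2, 0.2, 0.9])` evaluates to 0.62, sitting below the single outlier patch at 0.9 but well above the median.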

3). Predictive Performance Evaluation

The area under the receiver operating characteristic curve (AUC) was calculated for core-level predictions across the entire cohort without patient distinction (cohort-level), as done in previous studies, as well as for each individual biopsy session (session-level). DeLong's method was used to compute 95% confidence intervals for the cohort-level AUCs, and the median and interquartile range (IQR) were used to summarize the distribution of the individual session-level AUCs. AUCs were stratified by biopsy type (systematic/targeted). Since the AUC cannot be estimated from a single classification category, biopsy sessions in which, after stratification, all cores were labeled with the same pathology - for example, a session in which all systematic cores were benign - were excluded from the analysis for that stratification.
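The two evaluation levels can be sketched as follows (the session identifiers and input arrays are hypothetical; single-label sessions are skipped as described above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cohort_and_session_auc(y, p, session_ids):
    """Cohort-level AUC pools every core with no patient distinction;
    session-level AUCs are computed per biopsy session and summarized
    by their median and IQR. Sessions whose cores all share one label
    are skipped because the AUC is undefined for them."""
    y, p, s = map(np.asarray, (y, p, session_ids))
    cohort = roc_auc_score(y, p)
    per_session = [roc_auc_score(y[s == sid], p[s == sid])
                   for sid in np.unique(s)
                   if len(np.unique(y[s == sid])) == 2]
    q1, med, q3 = np.percentile(per_session, [25, 50, 75])
    return cohort, med, q3 - q1
```

A toy cohort of two scorable sessions (one perfectly ranked, one perfectly mis-ranked) already shows how a mid-range cohort AUC can hide large session-to-session spread.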

4). Spatial Proximity-based Prediction Benchmark

To better assess the contribution of core-level imaging biomarkers, we applied a spatial distance-based nearest-neighbors approach to core-tissue classification. Under this approach, each core is assigned a classification probability proportional to the frequency of pathology scores among its k nearest neighbors (KNN), i.e., its closest cores, weighted by their distance from it. In our experiments, the distance between two cores was defined as the shortest Euclidean distance between any two points on the cores' needle paths within the prostate region. In sessions with fewer than k cores, all cores were considered in the calculation. Appendix A defines the spatial proximity metric and classification approach.
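A sketch of this benchmark, assuming each core's needle path is given as an n × 3 array of interpolated points and using the 1/(1 + d) proximity weighting defined in Appendix A:

```python
import numpy as np
from scipy.spatial.distance import cdist

def core_distance(path_a, path_b):
    """Shortest Euclidean distance between any two interpolated points
    on two needle paths (each an n x 3 array of coordinates)."""
    return cdist(path_a, path_b).min()

def knn_core_probs(paths, labels, n_classes, k=3):
    """Leave-one-out proximity benchmark: each core's class probabilities
    are the distance-weighted label frequencies of its k nearest cores,
    with weight 1 / (1 + distance)."""
    n = len(paths)
    D = np.array([[core_distance(paths[i], paths[j]) for j in range(n)]
                  for i in range(n)])
    probs = np.zeros((n, n_classes))
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: D[i, j])[:min(k, len(others))]
        for j in nearest:
            probs[i, labels[j]] += 1.0 / (1.0 + D[i, j])
        probs[i] /= probs[i].sum()
    return probs
```

On three collinear single-point "paths" with labels [benign, cancer, cancer] and k=1, each core simply inherits its closest neighbor's label, illustrating the benchmark's dependence on sampling geometry.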

5). Combined Imaging- and Proximity-based Prediction

To take advantage of the complementary information embedded in the imaging and proximity feature categories, a combined feature set (imaging + proximity) was created. This feature set includes both the imaging biomarkers and a weighted average of the pathology scores of the K nearest cores, weighted by their spatial Euclidean distances, and was used to optimize a probabilistic classifier to predict core pathology.
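A sketch of assembling the combined feature set; the inverse-distance weighting is one interpretation of the weighted average described above, and the function names are illustrative:

```python
import numpy as np

def proximity_feature(dists, neighbor_labels, n_classes, k=3):
    """Weighted average of the one-hot pathology labels of the k nearest
    cores, with weight 1 / (1 + distance) - an interpretation of the
    distance-weighted pathology average described in the text."""
    order = np.argsort(dists)[:k]
    w = 1.0 / (1.0 + np.asarray(dists, float)[order])
    Y = np.eye(n_classes)[np.asarray(neighbor_labels)[order]]
    return (w[:, None] * Y).sum(axis=0) / w.sum()

def combined_features(imaging_feats, prox_feats):
    """Concatenate imaging biomarkers with the proximity features to
    form the imaging + proximity input of the classifier."""
    return np.hstack([imaging_feats, prox_feats])
```

The resulting vector is what the probabilistic classifier (the Random Forest in the experiments) is trained on in the combined approach.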

III. Experimental Evaluation

The complete dataset included 962 biopsy sessions and 14,197 core tissues (min=2, median=16, max=23, IQR=3 per session) obtained from 795 patients. The median voxel spacing of the pre-biopsy T2W MRI scans was 0.664 mm (min=0.224, max=0.859, IQR=0) in the axial (XY) plane and 1.0 mm (min=1.0, max=4.6, IQR=0) in the z direction. MRI volumes were resampled to a common spacing of 0.664 mm × 0.664 mm × 1.000 mm using bi-linear interpolation, and binary masks were resampled to the new T2W space using nearest-neighbor interpolation. Patches (9 × 9 × 3 voxels) were extracted from the T2W scans (see Fig. 1) and resized to 28 × 28 × 3 voxels using area interpolation to fit the input dimensions of the pre-trained ResNet-50. For each patch, imaging biomarkers were extracted from the latent spaces of the pre-optimized autoencoder (m=72) and ResNet-50 (m=2048). A Random Forest (RF) classifier of 300 decision trees, each with a maximum depth of 5, was used to classify imaging biomarkers into histopathology categories. A 5-fold cross-validation policy was applied, and an absolute Pearson coefficient of 0.7 was used to threshold correlated features. Three versions of the spatial proximity-based KNN with k=1, 3, 5 were considered as benchmarks.
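The resampling and patch-tiling steps can be sketched as below; `scipy.ndimage.zoom` stands in for whatever resampling implementation was actually used, and the function names are illustrative.

```python
import numpy as np
from scipy.ndimage import zoom

def resample(volume, spacing, new_spacing=(0.664, 0.664, 1.0), order=1):
    """Resample a volume to the common voxel spacing used in the paper:
    order=1 (linear) for the T2W image, order=0 (nearest) for masks."""
    factors = np.asarray(spacing) / np.asarray(new_spacing)
    return zoom(volume, factors, order=order)

def extract_patches(volume, size=(9, 9, 3)):
    """Tile the volume into adjacent non-overlapping patches, discarding
    any remainder at the borders; returns an (n, 9, 9, 3) array."""
    sx, sy, sz = size
    nx, ny, nz = (d // s for d, s in zip(volume.shape, size))
    v = volume[:nx * sx, :ny * sy, :nz * sz]
    p = v.reshape(nx, sx, ny, sy, nz, sz).transpose(0, 2, 4, 1, 3, 5)
    return p.reshape(-1, sx, sy, sz)
```

In the full pipeline these patches would then be area-interpolated to 28 × 28 × 3 before being fed to the pre-trained ResNet-50.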

Fig. 1.

Fig. 1.

Patch Extraction and Labeling: (a) Grid of patch centers (shown in gray square markers) laid out on a single slice of a pre-biopsy T2W prostate MRI with gland contours (shown in light blue). (b) Extracted patches colored by histopathology score: Benign (blue), GG=1 (yellow), GG≥2 (red), and non-biopsied (dark gray).

A. Mapping Imaging to PCa Pathology

A multiclass classification task differentiating three categories of core-level pathology was considered: benign vs. GG=1 vs. GG≥2. AUCs were evaluated for three derived binary classification tasks: (i) benign vs. rest; (ii) GG=1 vs. rest; and (iii) GG≥2 vs. rest. The numbers of biopsy sessions, sampled core tissues, and patches included in each task are given in Table II. Goodness of fit of the best-performing models under the three approaches (imaging, proximity, imaging + proximity), stratified by biopsy sampling type, is given in Table III and presented in Fig. 2. A complete performance assessment of all models tested under each approach is given in Tables IV and V in the Supplement. Optimized models were then used to generate a whole-gland risk assessment by extrapolating predictions to every voxel of the T2W scan; see the example in Fig. 3.

TABLE II.

Instance quantities (% label distribution)

Task | #Session | #Core: Systematic | #Core: Targeted | #Patch
Benign vs. rest | 650 | 5,832 (82% / 18%) | 3,687 (46% / 54%) | 42,268
GG=1 vs. rest | 472 | 4,198 (16% / 84%) | 2,698 (32% / 68%) | 30,931
GG≥2 vs. rest | 391 | 3,540 (10% / 90%) | 2,239 (49% / 51%) | 25,623

TABLE III.

Best AUCs in each modeling approach stratified by biopsy type: systematic, targeted

Evaluation | Biopsy Type | Modeling Approach | Benign vs. rest | GG=1 vs. rest | GG≥2 vs. rest
Cohort-level AUC (95% CI) across cores | Systematic | Imaging-based | 0.79 (0.77, 0.80) | 0.75 (0.73, 0.77) | 0.81 (0.79, 0.84)
 | | Proximity-based | 0.73 (0.71, 0.74) | 0.63 (0.61, 0.65) | 0.76 (0.74, 0.79)
 | | Imaging + Proximity | 0.85 (0.84, 0.86)* | 0.81 (0.80, 0.83)* | 0.88 (0.86, 0.90)*
 | Targeted | Imaging-based | 0.74 (0.73, 0.76) | 0.73 (0.71, 0.75) | 0.73 (0.71, 0.75)
 | | Proximity-based | 0.76 (0.75, 0.78) | 0.69 (0.67, 0.71) | 0.73 (0.71, 0.75)
 | | Imaging + Proximity | 0.81 (0.80, 0.83)* | 0.79 (0.78, 0.81)* | 0.79 (0.77, 0.81)*
Session-level Median AUC (IQR) across individual biopsy sessions | Systematic | Imaging-based | 0.67 (0.43) | 0.70 (0.44) | 0.71 (0.39)
 | | Proximity-based | 0.62 (0.43) | 0.46 (0.21) | 0.65 (0.45)
 | | Imaging + Proximity | 0.86 (0.37) | 0.81 (0.40) | 0.89 (0.29)
 | Targeted | Imaging-based | 0.60 (0.52) | 0.56 (0.48) | 0.67 (0.54)
 | | Proximity-based | 0.50 (0.35) | 0.48 (0.25) | 0.50 (0.34)
 | | Imaging + Proximity | 0.75 (0.50) | 0.71 (0.50) | 0.75 (0.56)
* p < 0.05: DeLong test for a difference in area under the receiver operating characteristic curve (AUC-ROC). Comparisons are made between approaches at the cohort-level evaluation for each biopsy type. Bonferroni correction was used to correct for multiple comparisons.

Fig 2.

Fig 2.

Best classification performance in predicting core-tissue histopathology scores based on core-biopsy data (imaging, proximity), stratified by biopsy type: systematic (left subplot), targeted (right subplot). The horizontal axis represents the cohort-level AUC across core tissues, and the vertical axis represents the median AUC across individual biopsy sessions. Horizontal error bars represent the 95% confidence interval for the cohort-level AUC, and vertical error bars represent the 1st and 3rd quartiles (IQR) of the individual biopsy sessions' AUCs. The 45-degree line represents equilibrium between cohort-level and individual-session performance; markers below this line indicate an over-optimistic cohort-level assessment for the individual biopsy session, and vice versa.

Fig. 3.

Fig. 3.

Extrapolated PCa Mapping: ML-generated whole prostate risk assessment for identifying PCa in 3D (top subfigure) and three 2D orientations (bottom subfigures). Blue and red shading indicates low and high risk respectively. Also illustrated are the interpolated biopsy needle paths colored by pathology score: benign (blue), GG≥2 (red), and pre-biopsy identified lesions (dark gray). Bounded by a light blue circle on the left-bottom subplot is a high-risk region with no previous clinical indication.

B. Assessment of Mapping at Non-Targeted Biopsy Locations

In all three tasks, the best imaging-based approach yielded more accurate cohort-level predictions for systematic cores than the proximity-based classification (p<0.05). No significant difference at the cohort level was found between the two approaches for targeted cores. The combined imaging + proximity model performed significantly better (p<0.05) than either single model in all tasks for both systematic and targeted cores. In 50% of the individual biopsy sessions, the AUC-ROC of the imaging-based approach on systematic cores was smaller than 0.67, 0.70, and 0.71 for tasks (i), (ii), and (iii), respectively. The interquartile range of the AUC-ROC across individual biopsy sessions ranged between 0.35-0.56 in all experiments. For all approaches, the cohort-level evaluation provided an over-optimistic estimate of the median performance across individual sessions, except for the combined approach, for which the cohort- and session-level performances on systematic cores were the same.

C. Biomarkers’ Representativeness Hypothesis Testing

Comparison of biomarker distributions in biopsied and non-biopsied locations was performed for sessions with at least five systematic and targeted cores (N=562). The null hypothesis that the distributions are the same was rejected (Bonferroni-adjusted p<0.05) in: N=50 sessions for both systematic and targeted locations, N=79 sessions for systematic locations only, and N=151 sessions for targeted locations only. In N=282 sessions, the null hypothesis could not be rejected for either systematic or targeted samples. In N=435 sessions, the combined sample of targeted and systematic biopsies could not be determined to be differently distributed from the non-biopsied sample. Visual examples of the different scenarios are given in Fig. 4.

Fig. 4.

Fig. 4.

Assessment of Biopsied Tissues Data Sample Representativeness to Data in Non-biopsied Regions. Summary of the multivariate nonparametric Cramér-test for difference in imaging-biomarker distributions in biopsied locations (targeted, non-targeted/systematic) and non-biopsied locations. Exemplary scatter plots show the spread of biomarkers in the reduced UMAP dimensions stratified by their location (targeted, systematic, non-biopsied) of four different biopsy-sessions under different scenarios: representative targeted- and systematic biopsy samples (upper-left), non-representative targeted-biopsy sample and representative systematic biopsy sample (upper-right), representative targeted-biopsy sample and non-representative systematic biopsy sample (lower-left), non-representative targeted- and systematic biopsy samples (lower-right).

IV. Discussion

A. Clinical Significance

While prostate biopsy is the most accurate tool currently available for mapping the prostate for cancer, it is not without limitations. The limited tissue sample collected during biopsy provides only a coarse localization of cancer pathology in the prostate. Thus, at times the procedure results in missed [4] or over-diagnosed [30] cancer, which affects post-biopsy treatment decisions. A denser, more comprehensive mapping of the prostate may provide a more descriptive picture of the distribution of cancer, improving the localization of PCa while enhancing the planning of post-biopsy focal therapy treatments such as high-intensity focused ultrasound, transurethral ultrasound ablation, and radiation therapy.

B. Biomarkers at Systematic Locations May Constitute a More Representative Validation Sample of Non-biopsied Regions

The sparseness of biopsy sampling makes the validation of a whole-gland cancer-risk estimation map a non-trivial task, as the tissue specimens do not necessarily constitute a representative sample of the non-biopsied prostate regions. Our investigations indicated that imaging biomarkers extracted at targeted-biopsy locations tend to be less representative of those extracted in non-biopsied regions than biomarkers from systematic-biopsy locations (Fig. 4). Although a representative validation set is critical for accurately estimating model generalization bias, especially in tasks where labeled data are sparse, this aspect has hardly been discussed in previous studies, where most attention was given to predicting pathologies at targeted locations.

C. Spatial Proximity of Core-tissues is Informative yet Unstable as a Stand-Alone Indicator

Naturally, spatial proximity-based classification approaches are greatly influenced by the clinician's choices regarding biopsy sampling locations, the number of samples, and the distribution of pathology across samples. This limitation was reflected in the poor performance these approaches demonstrated in predicting pathologies of targeted cores at the individual biopsy-session level. This performance can be explained by the small number of targeted cores sampled in an individual biopsy session, which made their predicted labels dependent on the labels of non-targeted cores with a different pathology distribution. The cohort-level performance assessment for these approaches, and for targeted cores in particular, was overly optimistic (Fig. 2). The ROC obtained at the cohort level therefore reflects a sensitivity-specificity tradeoff that does not necessarily apply to the individual patient, which strengthens the understanding that such cohort-level evaluation approaches are impractical for tasks of this kind.

D. Previous Studies’ Performance Evaluation Approach was Prone to Bias

The multiplicity of prediction instances per biopsy session allowed us to assess model performance at the individual patient level post-biopsy. This assessment is of practical importance, as it provides clinicians with an indication of the fitness of the model to the individual patient at the pre-decision phase. Our results show a substantial decline in performance when the calculations are made for each biopsy session individually rather than across core tissues for the entire cohort as a whole. This difference may be explained by the variation in the amount and distribution of histopathology in the biopsy samples among sessions. This session-level evaluation contrasts with previous work that adopts a cohort-level analysis, which can obscure individual performance, and highlights the importance of choosing suitable performance metrics for tasks of this type, in which several sparse points of interest are predicted for individual patients.

E. Limitations

A major source of uncertainty lies in the software-based localization of the sampled core-tissues. Needle paths denoted by the line-segment connecting the needle base and tip points provide a wide bound to the true sampling site. Thus, patches which were labeled with their corresponding core-tissue pathology may represent a different pathology than the one assigned to them. This uncertainty introduced noise which we mitigated by using the 90th percentile over patch predictions to construct the core-level prediction.

The generation of a patient-specific imaging-pathology mapping function based only on the patient's sparsely sampled biopsy data may not provide complete coverage of all possible disease outcomes. That is, if the biopsy sample includes only benign specimens, the mapping function will be unable to identify anything other than benign tissue. However, if at least one biopsy specimen was identified as significant disease, the model has the potential to identify such disease elsewhere in the prostate that may otherwise have been missed.

The lack of common pre-biopsy MRI sequences, such as DWI and DCE, limited our ability to further exploit the information embedded in pre-biopsy imaging. Nonetheless, good performance in predicting the histopathology of non-targeted core tissues was obtained from the T2W scan alone (Table III). This result may indicate good predictability for other unsampled gland regions and may raise clinicians' confidence in ML-based whole-gland risk assessment, thus increasing the chance of clinical adoption of such a solution.

In the future, we aim to utilize complementary mpMRI as input and to further explore approaches for assessing model and data uncertainty to provide clinicians with a quantitative assessment of the suitability of the model for their patient.

V. Conclusion

This study presents an MRI-biomarker-driven ML approach to whole-gland prostate cancer risk map generation using core-biopsy histopathology scores in regions that were not recognized as malignant on pre- and intra biopsy imaging. We utilized a unique, publicly available dataset of tracked biopsy sessions with thousands of targeted and non-targeted core tissues annotated and localized on a pre-biopsy T2W MRI. Our experiments focused on three aspects of model reliability which have been overlooked in previous studies: (i) representativeness of the biopsy validation data; (ii) quantification of model fitness at the individual patient level; and (iii) whole-gland risk extrapolation based on the patient’s post-biopsy data.

Data sparsity is a major challenge in predictive-modeling validation. Biopsy data are collected sparsely and partly in a targeted manner. Nonetheless, unlike many other biopsy procedures, prostate biopsy includes exploratory tissue sampling. These samples are more indicative of the non-biopsied regions and are thus of great importance in validating risk assessment at locations that were not identified as cancer on imaging.

Typically, classification tasks include a single region of interest (ROI) per patient for which the prediction is made. Prostate cancer patients, however, may differ in the number and distribution of ROIs (lesion, biopsy specimen, etc.). Thus, evaluating model predictions solely across the ROIs with no patient distinction may produce a biased estimate for the individual patient.

Whole-gland risk assessment has clinical significance in both the pre- and post-biopsy phases. A focus on the pre-biopsy phase requires inference across patients' MRIs, which may suffer from instabilities due to the unstandardized nature of MRI. Post-biopsy extrapolation, in contrast, leverages the individual patient's biopsy data, reducing the chance of covariate bias and providing clinicians with a valuable continuous localization of cancer across the prostate gland to further personalize the post-biopsy treatment plan.

Supplementary Material

Supplement

Acknowledgments

This work was supported by National Institute of Health (NIH) National Cancer Institute (NCI) R42 CA224888.

Appendix

A. Spatial Proximity Core Classification

For each patient, we represent a biopsy session $\mathcal{S}$ as a collection of $n_s$ needle biopsy cores, $\mathcal{S} = \{(C_i, y_i)\}_{i=1,\dots,n_s}$, where each biopsy core $C_i$ is defined as a set of $n_c$ points, $C_i = \{x_{i,j} \mid x_{i,j} \in \mathbb{R}^3,\ j = 1,\dots,n_c\}$, representing the sampling-site coordinates localized on the pre-biopsy MRI, and $y_i$ is a pathology-grade variable assigned from a finite set of discrete values $\mathcal{G} = \{g_1, g_2, \dots, g_{n_g}\}$. Note that $n_s$ can differ across patients and $n_c$ can differ for each needle core.

The classification probability vector of any location $x \in \mathbb{R}^3$ in the prostate is estimated by:

$$P(y \mid \mathcal{S}, x) \propto f(W)^{\top} Y$$

where $W \in \mathbb{R}^{n_s}$ is a vector of weights representing the spatial proximity of $x$ to each $C_i \in \mathcal{S}$, whose entries are given by:

$$w_i = \frac{1}{1 + \min_{x_{i,j} \in C_i} d(x, x_{i,j})}$$

$d(x, x')$ is a distance function, defined here as the Euclidean distance $\|x - x'\|$; $f(\cdot)$ is a nonlinear $k$-nearest-neighbors function that thresholds $W$ to keep its $k$ largest weights $w_i \in W$ and zero out the others:

$$f(W)_i = \begin{cases} w_i, & w_i \geq k\text{th largest element in } W \\ 0, & \text{otherwise} \end{cases}$$

and $Y \in \{0,1\}^{n_s \times n_g}$ is a matrix of one-hot encoded core labels $y_i \in \mathcal{S}$.
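The procedure above can be sketched in NumPy. This is a minimal illustration of the spatial-proximity classifier, not the authors' released implementation (see the Source Code section for that); the function name, argument layout, and the final normalization to a probability vector are our assumptions:

```python
import numpy as np

def core_classification_probs(x, cores, labels, n_classes, k=3):
    """Estimate P(y | S, x) for a query location x (sketch of Appendix A).

    x         : (3,) query location in pre-biopsy MRI coordinates.
    cores     : list of (n_c_i, 3) arrays of sampling-site coordinates C_i.
    labels    : list of integer pathology grades y_i in {0, ..., n_classes-1}.
    n_classes : number of pathology grades n_g.
    k         : number of nearest cores kept by the thresholding function f.
    """
    # w_i = 1 / (1 + min_j d(x, x_ij)): inverse Euclidean distance from x
    # to the closest sampling site of each core.
    w = np.array([1.0 / (1.0 + np.linalg.norm(C - x, axis=1).min())
                  for C in cores])

    # f(W): keep the k largest weights, zero out the rest.
    fw = np.zeros_like(w)
    keep = np.argsort(w)[-k:]
    fw[keep] = w[keep]

    # Y: one-hot encoded core labels, shape (n_s, n_g).
    Y = np.eye(n_classes)[np.asarray(labels)]

    # P(y | S, x) proportional to f(W)^T Y; normalize to sum to 1.
    p = fw @ Y
    return p / p.sum()
```

A query point near two cores of different grades receives a probability vector blending those grades in proportion to the cores' proximity weights, while distant cores are excluded by the $k$-nearest-neighbors threshold.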

B. System Configuration & Data Modules

Prostate-gland segmentation Standard Triangle Language (STL) volumes were converted to binary masks using 3D Slicer (version 5.0.3). Data analysis and modeling were implemented in Python (version 3.7.4) and R (version 3.3.3). Deep-learning optimization and feature extraction were performed using the Medical Open Network for Artificial Intelligence (MONAI) framework (version 0.8.1), PyTorch Lightning (version 1.6.2), and Torch (version 1.11.0). Scikit-learn (version 1.0.2) was used to implement the Random-Forest classifier and cross-validation schemes. Confidence intervals were calculated using the pROC R package, and the Cramér test was implemented using the cramer R package. The workstation ran 64-bit Ubuntu 18.04.6 LTS with two Intel Xeon Gold 5218 2.30 GHz central processing units and three Nvidia Quadro RTX 8000 graphics processing units.

C. Source Code

The source code for the different modeling methods described in the paper can be found at: https://github.com/talze/PCa-Risk-Map

Contributor Information

Tal Zeevi, Department of Biomedical Engineering, Yale University, New Haven, CT 06520 USA.

Michael S. Leapman, Department of Urology, Yale University, New Haven, CT 06520 USA.

Preston C. Sprenkle, Department of Urology, Yale University, New Haven, CT 06520 USA.

Rajesh Venkataraman, Eigen Health, Grass Valley, CA 95945 USA.

Lawrence H. Staib, Departments of Biomedical Engineering, Radiology and Biomedical Imaging, and Electrical Engineering, Yale University, New Haven, CT 06520 USA.

John A. Onofrey, Departments of Biomedical Engineering, Radiology and Biomedical Imaging, and Urology, Yale University, New Haven, CT 06520 USA.

