Abstract
Purpose
This study aimed to extract radiomics features from prostate magnetic resonance imaging (MRI), evaluate their reproducibility, and determine whether machine learning (ML) models built on reproducible features can noninvasively diagnose prostate cancer (PCa).
Methods
We analyzed prostate MRI from 82 subjects (41 PCa and 41 controls) in the public PROSTATEx dataset. From T2-weighted imaging (T2WI) and apparent diffusion coefficient (ADC) maps, 215 features were extracted per sequence (430 in total). Reproducibility within each sequence was quantified across repeated segmentations using the intraclass correlation coefficient (ICC) with a 2-way random-effects, absolute-agreement model. Only features with ICC≥0.75 in both T2WI and ADC were retained. Selected features were normalized and combined via early fusion into a single input vector, and redundant features were eliminated by Pearson correlation analysis (|r|>0.9).
Results
Reproducible radiomics features (ICC≥0.75) were key contributors to model performance. Using these features, support vector machine, neural network, and logistic regression models achieved accuracies of 80%–84% and a maximum area under the receiver operating characteristic curve of 0.85 under 5-fold cross-validation. Principal component analysis yielded the most consistent results, whereas several nonlinear dimensionality reduction methods produced variable outcomes across classifiers.
Conclusions
Combining reproducible MRI radiomics features with dimensionality reduction and ML offers a robust noninvasive approach for PCa diagnosis. Emphasizing reproducibility enhances model performance and reliability, supporting potential clinical translation.
Keywords: Prostatic neoplasms, Magnetic resonance imaging, Machine learning, Reproducibility of results, Radiomics
INTRODUCTION
Prostate cancer (PCa) is among the most common malignancies in men worldwide and the second leading cause of cancer-related death. Conventional screening and diagnostic methods, including prostate-specific antigen testing, digital rectal examination, and transrectal ultrasound-guided biopsy, are either invasive or limited in accuracy. Multiparametric magnetic resonance imaging (mpMRI) has been introduced as a noninvasive tool for evaluating PCa with improved diagnostic performance. However, interpretation of lesion aggressiveness on mpMRI remains dependent on reader expertise and is subject to interobserver variability.
The integration of artificial intelligence and machine learning (ML) into medical imaging has accelerated advances in mpMRI-based PCa assessment. Computer-aided diagnosis systems can standardize interpretation, and deep learning approaches further reduce reading time while enhancing accuracy. Concurrently, radiomics (i.e., the high-throughput extraction of quantitative imaging features) has emerged as a cornerstone of precision oncology, as this approach captures subvisual tumor characteristics that may correlate with disease trajectory or treatment response.
Despite its promise, radiomics is highly sensitive to variability in image acquisition, preprocessing, and segmentation, which can compromise reproducibility and hinder clinical implementation [1-3]. The intraclass correlation coefficient (ICC) is a widely recognized metric used to quantify feature reliability across repeated measurements or varying conditions [4]. Selecting features with sufficiently high ICC values can improve model stability and strengthen interpretability.
Using the public PROSTATEx dataset, we systematically evaluated the reproducibility of prostate MRI radiomics features and assessed their influence on ML-based PCa classification. We first validated feature reliability with ICC, then trained multiple classifiers exclusively on reproducible features, while comparing several dimensionality reduction strategies. Our objective was to enhance diagnostic accuracy and consistency and to evaluate the feasibility of a reproducibility-centered pipeline for clinical translation.
MATERIALS AND METHODS
Study Cohort and Image Acquisition
This retrospective study used the public PROSTATEx dataset, which includes 82 subjects (41 PCa and 41 non-cancer controls) [5-7]. All subjects underwent T2WI and ADC imaging acquired on either 1.5T or 3.0T MRI scanners. T2WI provides detailed anatomic delineation of the prostate and surrounding soft tissues, whereas ADC reflects water diffusion, indirectly indicating tumor cellularity and infiltration. Preprocessing involved Gaussian denoising, intensity range standardization, and resizing to 256×256 pixels to ensure inter-case consistency. An overview of the workflow is presented in Fig. 1.
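The exact preprocessing parameters (Gaussian kernel width, intensity standardization scheme) are not detailed further; the sketch below illustrates one way this step could be implemented, assuming 2D slices stored as NumPy arrays and illustrative parameter values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.transform import resize

def preprocess_slice(img, sigma=0.5, out_shape=(256, 256)):
    """Denoise, standardize the intensity range to [0, 1], and resize a 2D slice.

    The sigma value and min-max standardization are illustrative assumptions;
    the study does not report its exact preprocessing parameters.
    """
    img = gaussian_filter(img.astype(np.float32), sigma=sigma)   # Gaussian denoising
    lo, hi = float(img.min()), float(img.max())
    img = (img - lo) / (hi - lo + 1e-8)                          # intensity range standardization
    return resize(img, out_shape, preserve_range=True, anti_aliasing=True)  # 256x256 for inter-case consistency
```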
Fig. 1.
Workflow of the radiomics-based prostate cancer classification pipeline. Steps include SERA feature extraction, intraclass correlation coefficient (ICC≥0.75) filtering, Pearson correlation pruning (|r|>0.9), dimensionality reduction, classification, and nested 5-fold cross-validation (CV) with inner and outer loops. SERA, standardized environment for radiomics analysis; MRI, magnetic resonance imaging; T2WI, T2-weighted imaging; ADC, apparent diffusion coefficient; VOI/ROI, volume of interest/region of interest; PCA, principal component analysis; KPCA, kernel PCA; LDA, linear discriminant analysis; LLE, locally linear embedding; AE, autoencoder; SVM, support vector machine; LR, logistic regression; NN, neural network; RF, random forest; k-NN, k-nearest neighbors; XGB, XGBoost; LGBM, LightGBM; CatB, CatBoost; AdaB, AdaBoost; DT, decision tree.
Radiomics Feature Extraction and Processing
Radiomics features were extracted from both T2WI and ADC using the standardized environment for radiomics analysis (SERA). From tumor regions of interest, we computed 215 features per sequence (430 total), encompassing first-order statistics, morphology, and multiple texture families (e.g., GLCM [gray level co-occurrence matrix], GLRLM [gray level run length matrix], GLSZM [gray level size zone matrix], NGTDM [neighborhood gray tone difference matrix], GLDM [gray level dependence matrix]), along with filtered features derived from wavelet and Laplacian-of-Gaussian transforms. The feature categories and counts are summarized in Table 1. Features from T2WI and ADC were normalized and concatenated via early fusion; a minimal sketch of this step follows Table 1.
Table 1.
Radiomics feature composition (215 per sequence; total 430)
| Feature category | No. of features |
|---|---|
| First-order | 19 |
| Shape-based | 14 |
| GLCM | 24 |
| GLRLM | 16 |
| GLSZM | 22 |
| NGTDM | 5 |
| GLDM | 14 |
| Filtered (Wavelet/LoG-derived) | 101 |
GLCM, gray level co-occurrence matrix; GLRLM, gray level run length matrix; GLSZM, gray level size zone matrix; NGTDM, neighborhood gray tone difference matrix; GLDM, gray level dependence matrix; LoG, Laplacian of Gaussian.
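SERA itself is a standalone extraction toolkit; the subsequent normalization and early fusion can be illustrated with a minimal sketch, assuming each sequence's 215 features sit in a pandas DataFrame indexed by subject. The z-scoring and column prefixes are illustrative assumptions, and in the full pipeline scaling is fitted on training folds only (see Classifier Development and Evaluation).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def early_fusion(t2wi_feats: pd.DataFrame, adc_feats: pd.DataFrame) -> pd.DataFrame:
    """Normalize each sequence's features, then concatenate them into one vector per subject."""
    t2 = pd.DataFrame(StandardScaler().fit_transform(t2wi_feats),
                      index=t2wi_feats.index,
                      columns=[f"T2WI_{c}" for c in t2wi_feats.columns])
    adc = pd.DataFrame(StandardScaler().fit_transform(adc_feats),
                       index=adc_feats.index,
                       columns=[f"ADC_{c}" for c in adc_feats.columns])
    # Early fusion: a single 430-dimensional input vector per subject (before filtering)
    return pd.concat([t2, adc], axis=1)
```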
Reproducibility Assessment and Feature Selection
Within each sequence (T2WI and ADC), we performed repeated segmentation and calculated ICC using a 2-way random-effects, absolute-agreement model [4]. ICC values were classified as poor (<0.50), moderate (0.50–0.75), good (0.75–0.90), and excellent (≥0.90) [4], and features with ICC≥0.75 in both sequences were retained as the core reproducible feature set. Fig. 2 shows the prefiltering ICC distributions by feature category, and Table 2 summarizes the corresponding counts and percentages. To reduce redundancy, we excluded one feature from any pair with |r|>0.9 based on Pearson correlation.
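A minimal sketch of these two filtering steps follows, using the pingouin package, whose ICC2 estimate corresponds to the 2-way random-effects, absolute-agreement, single-measurement model; the long-format column names and the pruning helper are illustrative assumptions rather than the exact implementation used here.

```python
import numpy as np
import pandas as pd
import pingouin as pg

def icc2_per_feature(long_df: pd.DataFrame) -> pd.Series:
    """ICC(2,1) per feature from repeated segmentations.

    `long_df` is assumed to be long-format with columns
    ['feature', 'subject', 'segmentation', 'value'].
    """
    iccs = {}
    for feat, grp in long_df.groupby("feature"):
        res = pg.intraclass_corr(data=grp, targets="subject",
                                 raters="segmentation", ratings="value")
        # ICC2: two-way random effects, absolute agreement, single measurement
        iccs[feat] = res.loc[res["Type"] == "ICC2", "ICC"].iloc[0]
    return pd.Series(iccs, name="ICC")

def prune_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute Pearson correlation exceeds the threshold."""
    corr = features.corr(method="pearson").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

# Retain only features that are at least 'good' (ICC >= 0.75) in BOTH sequences, e.g.:
# keep = icc_t2wi.index[(icc_t2wi >= 0.75) & (icc_adc >= 0.75)]
```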
Fig. 2.
Prefiltering ICC distributions of radiomics features. (A) T2-weighted imaging (T2WI) and (B) apparent diffusion coefficient (ADC) maps, displayed as boxplots grouped by feature category. ICC, intraclass correlation coefficient; GLCM, gray level co-occurrence matrix; GLRLM, gray level run length matrix; GLSZM, gray level size zone matrix; NGTDM, neighborhood gray tone difference matrix; GLDM, gray level dependence matrix.
Table 2.
ICC grade distribution (prefiltering) per sequence (n=215 each)
| ICC grade | T2WI features, n | T2WI proportion (%) | ADC features, n | ADC proportion (%) | Total, n |
|---|---|---|---|---|---|
| Poor (< 0.50) | 66 | 30.7 | 67 | 31.2 | 133 |
| Moderate (0.50–0.75) | 34 | 15.8 | 67 | 31.2 | 101 |
| Good (0.75–0.90) | 63 | 29.3 | 55 | 25.6 | 118 |
| Excellent (≥ 0.90) | 52 | 24.2 | 26 | 12.1 | 78 |
ICC, intraclass correlation coefficient; T2WI, T2-weighted imaging; ADC, apparent diffusion coefficient.
Classifier Development and Evaluation
Using the selected feature set, we evaluated 7 dimensionality reduction techniques: principal component analysis (PCA), kernel PCA, linear discriminant analysis, Isomap, locally linear embedding (LLE), Laplacian eigenmaps, and an autoencoder. The reduced features were then used to train 10 classifiers: support vector machine (SVM), random forest, k-nearest neighbors, decision tree, logistic regression, neural network, XGBoost, LightGBM, CatBoost, and AdaBoost.
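For illustration, these combinations could be instantiated roughly as follows with scikit-learn and the gradient-boosting libraries. The hyperparameters and shared component count are assumptions; Laplacian eigenmaps corresponds to scikit-learn's SpectralEmbedding (which offers no out-of-sample transform), and the autoencoder would be implemented separately (e.g., in PyTorch) and is not shown.

```python
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

N_COMPONENTS = 10  # illustrative; the retained dimensionality is not reported in the text

reducers = {
    "PCA": PCA(n_components=N_COMPONENTS),
    "KPCA": KernelPCA(n_components=N_COMPONENTS, kernel="rbf"),
    "LDA": LinearDiscriminantAnalysis(),   # supervised; limited to n_classes - 1 components
    "Isomap": Isomap(n_components=N_COMPONENTS),
    "LLE": LocallyLinearEmbedding(n_components=N_COMPONENTS),
    "LapEig": SpectralEmbedding(n_components=N_COMPONENTS),  # embeds fitted data only
}

classifiers = {
    "SVM": SVC(probability=True),
    "RF": RandomForestClassifier(),
    "k-NN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "NN": MLPClassifier(max_iter=1000),
    "XGB": XGBClassifier(eval_metric="logloss"),
    "LGBM": LGBMClassifier(),
    "CatB": CatBoostClassifier(verbose=0),
    "AdaB": AdaBoostClassifier(),
}
```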
To avoid data leakage, all preprocessing steps, including correlation filtering, normalization, dimensionality reduction fitting, and hyperparameter tuning, were performed solely within the training folds, with transformations subsequently applied to validation folds. Model performance was estimated using nested cross-validation (inner and outer 5-fold) to minimize selection bias. The primary evaluation metrics were accuracy and area under the receiver operating characteristic curve (AUC), averaged over the outer folds. Comparative results are presented in Figs. 3 and 4 and summarized in Table 3.
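A leakage-safe nested evaluation of this kind can be sketched for one representative combination (PCA followed by an SVM). The hyperparameter grid, component count, and synthetic placeholder data are assumptions, and the within-fold correlation filter would be wrapped as an additional pipeline step (not shown); because all fitting happens inside the pipeline, the inner search never sees outer-fold validation data.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Placeholder data standing in for the reproducible radiomics feature matrix and labels
X, y = make_classification(n_samples=82, n_features=100, n_informative=15, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),        # fitted on the training folds only
    ("reduce", PCA(n_components=10)),   # dimensionality reduction fitted inside each fold
    ("clf", SVC(probability=True)),
])
param_grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01]}

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # hyperparameter tuning
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # performance estimation

search = GridSearchCV(pipe, param_grid, cv=inner, scoring="roc_auc")
scores = cross_validate(search, X, y, cv=outer, scoring=["accuracy", "roc_auc"])
print(f"Accuracy: {scores['test_accuracy'].mean():.3f} ± {scores['test_accuracy'].std():.3f}")
print(f"AUC: {scores['test_roc_auc'].mean():.3f} ± {scores['test_roc_auc'].std():.3f}")
```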
Fig. 3.
Mean±standard deviation of accuracy across dimensionality reduction methods, including principal component analysis (PCA), linear discriminant analysis (LDA), and kernel PCA. SVM, support vector machine.
Fig. 4.
Mean±standard deviation of the area under the curve by classifier (maximum 0.85): support vector machine (SVM), logistic regression, neural network, random forest, and CatBoost. AUC, area under the receiver operating characteristic curve.
Table 3.
Nested 5-fold CV performance (top combinations; reproducible features only)
| Model | Best dimensionality reduction | Accuracy (%) | AUC |
|---|---|---|---|
| SVM | PCA | 84 | 0.85 |
| Logistic regression | PCA | 82.1 | 0.83 |
| Neural network | PCA | 83.4 | 0.84 |
| Random forest | Kernel PCA | 80.5 | 0.82 |
| CatBoost | LDA | 81.2 | 0.83 |
CV, cross-validation; AUC, area under the receiver operating characteristic curve; SVM, support vector machine; PCA, principal component analysis; LDA, linear discriminant analysis.
RESULTS
Reproducibility of Radiomics Features
Of the 215 T2WI features, 66 (30.7%) demonstrated poor reproducibility (ICC<0.5), 34 (15.8%) were moderate, 63 (29.3%) were good, and 52 (24.2%) were excellent. Among the 215 ADC features, 67 (31.2%) were poor, 67 (31.2%) were moderate, 55 (25.6%) were good, and 26 (12.1%) were excellent. Only features with ICC≥0.75 in both T2WI and ADC were used in downstream analysis, resulting in 115 good/excellent T2WI features and 81 good/excellent ADC features. Fig. 2 presents the prefiltering ICC distributions (category boxplots), and Table 2 summarizes the corresponding counts and percentages.
Classification Performance
Across the dimensionality reduction methods, SVM, neural network (NN), and logistic regression exhibited relatively stable performance, achieving accuracies of 80%–84% (Table 3; Figs. 3 and 4). PCA provided the most consistent gains across classifiers, while kernel PCA and Laplacian eigenmaps also performed favorably. In contrast, Isomap and LLE occasionally degraded performance in certain ensemble models (e.g., LightGBM, AdaBoost), where accuracy dropped to 65%–75%. The k-NN classifier was particularly sensitive when trained with autoencoder-derived features.
AUC patterns closely mirrored accuracy trends. SVM maintained an AUC of approximately 0.83 across most reductions and achieved a maximum AUC of 0.85 in its best configuration (Fig. 4). Logistic regression and CatBoost also reached a peak AUC of 0.85, which is notable given the dataset size. NN achieved an AUC of 0.85 with PCA but fell toward 0.5 with LLE, indicating sensitivity to manifold representation. LightGBM occasionally produced near-random performance (AUC≈0.5) under certain nonlinear reductions.
DISCUSSION
Using the PROSTATEx dataset, we developed ML models for PCa diagnosis based on T2WI and ADC radiomics while rigorously quantifying feature reproducibility. Restricting model training to ICC-validated features yielded moderate-to-high diagnostic performance (accuracy 80%–84%; AUC up to 0.85) despite the modest cohort size, supporting the effectiveness of a reproducibility-first approach over aggressive numerical optimization in small datasets.
Li et al. [8] integrated biparametric radiomics with clinical variables in a larger cohort using logistic regression as the primary classifier. In contrast, our balanced 41/41 cohort systematically evaluated a broader range of dimensionality reduction and classifier combinations, identifying PCA as a consistently robust option. Liu et al. [9] examined 1,576 mpMRI features using SelectKBest and LASSO (least absolute shrinkage and selection operator) to distinguish indolent from aggressive disease. While their work emphasized risk stratification, our study focused on binary lesion detection at the diagnostic stage and, crucially, isolated the direct effect of reproducibility filtering. Prior studies have consistently reported that consensus segmentation and the exclusion of unstable features enhance reliability [3,4]; our findings extend this evidence, showing that emphasizing feature stability alone can substantially influence performance, even without multimodal or clinical data integration.
The study has several limitations. The sample size was relatively small (n=82), and the analysis was limited to T2WI and ADC sequences, excluding modalities such as dynamic contrast-enhanced T1-weighted imaging. Furthermore, external validation beyond nested cross-validation was not conducted. Future research will integrate automated segmentation, clinical and molecular covariates, and deep learning-based representations to improve generalizability. Validation across multi-institutional cohorts under a standardized, reproducibility-oriented pipeline will also be critical for translation into clinical practice.
Footnotes
Grant/Fund Support
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Research Ethics
This study used publicly available deidentified datasets and did not involve human participants or animals; therefore, ethical approval was not required.
Conflict of Interest
The authors declare that they have no conflict of interest.
ACKNOWLEDGEMENTS
The authors thank the PROSTATEx organizers and contributors for curating and maintaining the dataset used in this study.
AUTHOR CONTRIBUTION STATEMENT
· Conceptualization: SJ
· Data curation: SJ
· Formal analysis: SJ
· Methodology: SJ, JSK
· Project administration: JSK
· Visualization: SJ
· Writing – original draft: SJ
· Writing – review & editing: SJ, JSK
REFERENCES
- 1. Rodrigues A, Rodrigues N, Santinha J, Lisitskaya MV, Uysal A, Matos C, et al. Value of handcrafted and deep radiomic features towards training robust machine learning classifiers for prediction of prostate cancer disease aggressiveness. Sci Rep. 2023;13:6206. doi: 10.1038/s41598-023-33339-0.
- 2. Pasini G, Russo G, Mantarro C, Bini F, Richiusa S, Morgante L, et al. A critical analysis of the robustness of radiomics to variations in segmentation methods in 18F-PSMA-1007 PET images of patients affected by prostate cancer. Diagnostics (Basel). 2023;13:3640. doi: 10.3390/diagnostics13243640.
- 3. Namdar K, Wagner MW, Ertl-Wagner BB, Khalvati F. Open-radiomics: a collection of standardized datasets and a technical protocol for reproducible radiomics machine learning pipelines. BMC Med Imaging. 2025;25:312. doi: 10.1186/s12880-025-01855-2.
- 4. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15:155–63. doi: 10.1016/j.jcm.2016.02.012.
- 5. Armato SG 3rd, Huisman H, Drukker K, Hadjiiski L, Kirby JS, Petrick N, et al. PROSTATEx challenges for computerized classification of prostate lesions from multiparametric magnetic resonance images. J Med Imaging (Bellingham). 2018;5:044501. doi: 10.1117/1.JMI.5.4.044501.
- 6. PROSTATEx-SEG-ZONES [Internet]. Frederick (MD): The Cancer Imaging Archive; 2020 [cited 2025 Nov 2]. Available from: https://www.cancerimagingarchive.net/analysis-result/prostatex-segzones/
- 7. Armato SG 3rd, Huisman H, Drukker K, Hadjiiski L, Kirby JS, Petrick N, et al. PROSTATEx challenges for computerized classification of prostate lesions from multiparametric magnetic resonance images. J Med Imaging (Bellingham). 2018;5:044501. doi: 10.1117/1.JMI.5.4.044501.
- 8. Li M, Chen T, Zhao W, Wei C, Li X, Duan S, et al. Radiomics prediction model for the improved diagnosis of clinically significant prostate cancer on biparametric MRI. Quant Imaging Med Surg. 2020;10:368–79. doi: 10.21037/qims.2019.12.06.
- 9. Liu Y, Wu J, Ni X, Zheng Q, Wang J, Shen H, et al. Machine learning based on automated 3D radiomics features to classify prostate cancer in patients with prostate-specific antigen levels of 4-10 ng/mL. Transl Androl Urol. 2025;14:1025–35. doi: 10.21037/tau-2024-731.