Abstract
Background
This study aimed to evaluate the performance of large language models (LLMs) in classifying prostate MRI reports according to the Prostate Imaging–Reporting and Data System (PI-RADS) version 2.1, and to validate their use in supporting clinical decisions in prostate cancer treatment.
Methods
This retrospective study included 146 patients. Four LLMs — GPT-4o, GPT-o1, Google Gemini 1.5 Pro and Google Gemini 2.0 Experimental Advanced — were tested on standardised, structured prostate MRI reports. A two-radiologist consensus reference standard was used to compare model performance. Agreement was measured using weighted Cohen’s kappa, and accuracy and F1 scores were calculated for three PI-RADS risk groups: low (1–2), intermediate (3) and high (4–5).
Results
Performance varied by model. GPT-o1 achieved the highest level of agreement with radiologists (κ = 0.867), followed by GPT-4o (κ = 0.743), Gemini 1.5 Pro (κ = 0.728) and Gemini 2.0 Experimental Advanced (κ = 0.664). GPT-o1 achieved the highest F1 scores for the low-risk (0.93) and high-risk (1.00) groups, with moderate performance for the PI-RADS 3 group (0.75). All models performed worst for PI-RADS 3 (F1 range: 0.54–0.75). Importantly, none of the models produced invalid results outside the valid PI-RADS 1–5 range.
Conclusion
LLMs show potential for automating PI-RADS classification from MRI reports, with GPT-o1 demonstrating the best overall performance. However, their weaker performance for PI-RADS 3 lesions indicates that multicentre validation, larger datasets and multimodality integration are needed before they can be used clinically for prostate cancer diagnosis and urological decision-making.
Trial registration
Not applicable. This retrospective study did not involve a clinical trial.
Keywords: Artificial intelligence, Large language model, Prostate imaging, Magnetic resonance imaging, Prostate cancer, Urology
Introduction
Prostate cancer is one of the leading causes of cancer-related morbidity and mortality among men worldwide, necessitating accurate diagnostic methods for appropriate risk stratification [1, 2]. The Prostate Imaging–Reporting and Data System (PI-RADS) version 2.1 is a standardized system for reporting prostate magnetic resonance imaging (MRI) that aids in the detection and risk stratification of clinically significant prostate cancer and directly influences urological decision-making [3]. Beyond radiology, PI-RADS has become an important decision-support tool in urology, directly shaping biopsy recommendations, treatment plans and follow-up strategies. PI-RADS 3 represents the most challenging category, as its equivocal nature may lead to unnecessary biopsies or delayed diagnosis, with important clinical implications.
Recent advances in artificial intelligence (AI) and large language models (LLMs), such as OpenAI’s Generative Pre-trained Transformer (GPT) and Google’s Gemini, have sparked significant optimism regarding their potential to automate and standardise radiological reporting procedures [4–7].
Several recent studies have examined the use of LLMs for assigning categories in other RADS-based systems, such as BI-RADS in breast imaging, Lung-RADS in lung cancer screening and LI-RADS in liver imaging [8–11].
Identifying PI-RADS categories directly from MRI reports through generative AI could potentially streamline clinical workflows, ensure reporting consistency, and reduce interobserver variability. Although Lee et al. demonstrated the feasibility of LLM-based PI-RADS classification, their study was limited by a small sample size and earlier-generation models [12]. In contrast, our work involved newer generation LLMs from a larger patient population. Furthermore, our study was conducted at a different institution, providing external validation and enhancing the generalisability of the results to various clinical settings.
This study aimed to evaluate the performance of state-of-the-art LLMs in assigning PI-RADS v2.1 categories from structured prostate MRI reports, using expert radiologist consensus as the reference standard. By examining diagnostic accuracy, error patterns, and model-driven biases, we intend to enhance our comprehension of the clinical application of these tools and their potential influence on urological management pathways.
Methods
Study design and acquisition of data
This retrospective analysis was conducted in accordance with ethical standards and approved by the institutional review board, which waived the requirement for informed consent because the data were anonymized. Prostate MRI reports were collected at a single tertiary care center between October 2023 and October 2024. Biparametric MRI examinations and MRIs of post-treatment follow-up patients were excluded, leaving 146 reports for inclusion (Fig. 1).
Fig. 1.

Study design. DCE: dynamic contrast-enhanced images; PI-RADS: Prostate Imaging–Reporting and Data System; PSA: prostate-specific antigen; PSA density: prostate-specific antigen density
A radiologist with 5 years of experience in prostate MRI and a uroradiologist with 10 years of experience independently reviewed prostate MRI reports, which were structured in a standardized format and authored by different radiologists in the clinic. They assigned PI-RADS categories by consensus.
Only reports written during this period were used, to maintain consistency in imaging protocols and adherence to the latest PI-RADS v2.1 guidelines [3]. Clinical history and PI-RADS categories were removed from the reports to ensure a blinded and standardized assessment process. All reports were structured similarly, containing detailed descriptions of T2-weighted imaging (T2WI), diffusion-weighted imaging (DWI) with ADC values, dynamic contrast-enhanced (DCE) appearances, prostate volume, PSA density and lesion size (Fig. 1). Structured MRI reports included numerical PSA values (ng/mL) and PSA density values (ng/mL/cm³) in a dedicated clinical information section. Descriptive qualifiers such as ‘high PSA’ were not used. These parameters were provided as contextual clinical data and were not formally incorporated into PI-RADS scoring by the radiologists.
Large language models and assessment protocol
The LLMs evaluated in this study included GPT-4o, GPT-o1, Google Gemini 1.5 Pro, and Google Gemini 2.0 Experimental Advanced.
In December 2024, all models were given the same standardised prompt to ensure consistency in task interpretation: “You are an expert diagnostic radiologist. Evaluate the following prostate MRI report in strict accordance with PI-RADS v2.1 guidelines, assigning a definitive PI-RADS category (1–5). Avoid speculation and provide a clear justification based solely on the given information. Do not assume medico-legal responsibility.” All GPT-based models were accessed in their default automatic response mode, without extended reasoning or step-by-step explanation features enabled. Model outputs reflected a single-pass response to the standardised prompt, without iterative refinement or user feedback.
MRI reports were sequentially input to each model in randomized order to eliminate potential bias. The order was neither categorized by PI-RADS category nor influenced by clinical variables. Findings were recorded separately. MRI reports were uploaded as plain text files (.txt) to ensure uniformity across the models. To ensure reproducibility, all prompts and interactions were logged, and results were independently verified by two researchers.
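As a concrete illustration of the verification step, the sketch below shows how a model’s free-text answer could be parsed and checked against the valid PI-RADS 1–5 range. The function name and regular expression are hypothetical and are not the verification procedure actually used in the study.

```python
import re

VALID_CATEGORIES = {1, 2, 3, 4, 5}

def extract_pirads(response_text: str):
    """Extract the first PI-RADS category mentioned in a model response.

    Returns None if no category is found, or if the extracted value
    falls outside the valid PI-RADS 1-5 range (e.g. a hallucinated
    'PI-RADS 6'). Pattern and helper are illustrative assumptions.
    """
    match = re.search(r"PI-?RADS\s*(?:category\s*)?(\d)",
                      response_text, re.IGNORECASE)
    if match is None:
        return None
    category = int(match.group(1))
    return category if category in VALID_CATEGORIES else None
```

A validator of this kind would flag out-of-range outputs such as the PI-RADS 6 category reported for Bard by Lee et al.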
Statistical analysis
Statistical analyses were performed with SPSS 26.0 (IBM Corp., Armonk, NY, USA). Inter-rater agreement between the LLM-assigned PI-RADS categories and the gold standard was quantified using weighted Cohen’s kappa, which accounts for the ordinal nature of the PI-RADS scoring system. The strength of agreement was interpreted as follows: ≤0.20 (poor), 0.21–0.40 (fair), 0.41–0.60 (moderate), 0.61–0.80 (substantial) and >0.80 (almost perfect) [13].
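For readers who wish to reproduce the agreement statistic, a minimal weighted Cohen’s kappa can be computed from paired ratings as below. This is the standard textbook formulation (with linear weights by default), not the exact SPSS routine used in the study.

```python
from itertools import product

def weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5),
                   weighting="linear"):
    """Weighted Cohen's kappa for ordinal ratings such as PI-RADS 1-5."""
    n = len(rater_a)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    # Disagreement weight: normalised distance between category indices
    def w(i, j):
        d = abs(i - j) / (k - 1)
        return d if weighting == "linear" else d ** 2

    # Observed joint distribution and the two marginal distributions
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[idx[a]][idx[b]] += 1 / n
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(col) for col in zip(*observed)]

    # Weighted observed vs chance-expected disagreement
    d_obs = sum(w(i, j) * observed[i][j]
                for i, j in product(range(k), range(k)))
    d_exp = sum(w(i, j) * marg_a[i] * marg_b[j]
                for i, j in product(range(k), range(k)))
    return 1.0 if d_exp == 0 else 1 - d_obs / d_exp
```

Perfect agreement yields κ = 1, and agreement no better than chance yields κ ≈ 0, matching the interpretation thresholds cited from McHugh [13].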
Accuracy and F1 scores were derived for three stratified PI-RADS risk groups: low-risk (1–2), intermediate-risk (3) and high-risk (4–5).
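A sketch of how the three risk strata and their one-vs-rest metrics might be derived is shown below. The within-group “accuracy” is computed here as recall within the reference group; this definition is an assumption, since the paper does not spell out its metric formulas.

```python
def to_risk_group(category: int) -> str:
    """Map a PI-RADS category to the three risk strata used in this study."""
    if category <= 2:
        return "low"
    if category == 3:
        return "intermediate"
    return "high"

def group_metrics(reference, predicted, group):
    """One-vs-rest metrics for a single risk group.

    Returns (within-group accuracy, F1). Within-group accuracy is taken
    as recall over cases whose reference label is in `group` -- an
    assumed reading of the paper's per-group accuracy.
    """
    tp = fp = fn = 0
    for r, p in zip(reference, predicted):
        ref_in = to_risk_group(r) == group
        pred_in = to_risk_group(p) == group
        if ref_in and pred_in:
            tp += 1
        elif pred_in:
            fp += 1
        elif ref_in:
            fn += 1
    accuracy = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, f1
```

For example, a prediction that upgrades a reference PI-RADS 3 lesion to PI-RADS 4 counts as a false negative for the intermediate group and a false positive for the high group.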
Results
The data included MRI reports labelled according to PI-RADS, with 22, 53, 28, 25 and 18 reports for categories 1, 2, 3, 4 and 5, respectively. Concordance between the categories assigned by the models and the gold standard varied among the LLMs tested. As shown in Table 1, GPT-o1 achieved the highest overall agreement, with a weighted Cohen’s kappa of 0.867 (p < 0.001), followed by GPT-4o (κ = 0.743, p < 0.001), Google Gemini 1.5 Pro (κ = 0.728, p < 0.001) and Google Gemini 2.0 Experimental Advanced (κ = 0.664, p < 0.001). GPT-o1 also achieved the best accuracy and F1 scores within each PI-RADS risk category.
Table 1.
Performance metrics of large language models in assigning PI-RADS categories
| Model | PI-RADS 1–2 Accuracy (%) | PI-RADS 3 Accuracy (%) | PI-RADS 4–5 Accuracy (%) | PI-RADS 1–2 F1 Score | PI-RADS 3 F1 Score | PI-RADS 4–5 F1 Score | Weighted Kappa |
|---|---|---|---|---|---|---|---|
| GPT-o1 | 93.3 | 75.0 | 100 | 0.93 | 0.75 | 1.00 | 0.867 |
| GPT-4o | 92.0 | 53.6 | 90.7 | 0.92 | 0.54 | 0.91 | 0.743 |
| Google Gemini 1.5 Pro | 93.3 | 53.6 | 86.0 | 0.93 | 0.54 | 0.86 | 0.728 |
| Google Gemini 2.0 Exp. Advanced | 81.3 | 57.1 | 88.4 | 0.81 | 0.57 | 0.88 | 0.664 |
For low-risk cases (PI-RADS 1–2), GPT-o1 reached 93.3% accuracy (95% CI: 81.7%–98.6%) and an F1 score of 0.93, followed by Google Gemini 1.5 Pro (accuracy: 93.3%, 95% CI: 81.7%–98.6%, F1: 0.93), GPT-4o (accuracy: 92.0%, 95% CI: 79.6%–97.6%, F1: 0.92), and Gemini 2.0 (accuracy: 81.3%, 95% CI: 67.4%–90.3%, F1: 0.81) (Fig. 2). GPT-4o misclassified 2.7% of these cases as high-risk (PI-RADS 4–5), while other models had a lower misclassification rate of 1.3%.
Fig. 2.
Comparison of accuracy (%) among large language models (GPT-4o, Google Gemini 1.5 Pro, GPT-o1, and Google Gemini 2.0 Experimental Advanced) across different PI-RADS v2.1 categories (PI-RADS 1–2: low-risk, PI-RADS 3: intermediate-risk, PI-RADS 4–5: high-risk)
Performance dropped notably across all models in intermediate-risk cases (PI-RADS 3). GPT-o1 achieved the highest accuracy (accuracy: 75.0%, 95% CI: 55.1%–89.3%) and an F1 score of 0.75, whereas Google Gemini 2.0 Experimental Advanced reached 57.1% accuracy (95% CI: 37.2%–75.5%, F1: 0.57), and both GPT-4o and Gemini 1.5 Pro recorded 53.6% accuracy (95% CI: 33.9%–72.5%) and an F1 score of 0.54.
In the high-risk group (PI-RADS 4–5), GPT-o1 performed best once more, achieving 100% accuracy (95% CI: 86.3%–100%) and an F1 score of 1.00. Next was GPT-4o with 90.7% accuracy (95% CI: 73.8%–97.5%, F1 score: 0.91), followed by Gemini 2.0 with 88.4% accuracy (95% CI: 70.0%–96.4%, F1 score: 0.88), and then Gemini 1.5 Pro with 86.0% accuracy (95% CI: 67.0%–95.5%, F1 score: 0.86). Only Gemini 1.5 Pro misclassified 4.7% of high-risk reports as low-risk (PI-RADS 1–2), an error with potential clinical significance.
In addition to accuracy, sensitivity and specificity metrics with 95% CI were calculated for each model across the three PI-RADS risk categories (1–2, 3, and 4–5). GPT-o1 showed the most balanced overall performance:
Low-risk (PI-RADS 1–2): Sensitivity 93.3% (95% CI: 81.7–98.6%), Specificity 100% (95% CI: 94.6–100%).
Intermediate-risk (PI-RADS 3): Sensitivity 75.0% (95% CI: 55.1–89.3%), Specificity 84.0% (95% CI: 73.3–91.8%).
High-risk (PI-RADS 4–5): Sensitivity 100% (95% CI: 86.3–100%), Specificity 84.0% (95% CI: 68.9–93.0%).
GPT-4o followed with:
Low-risk: Sensitivity 92.0% (95% CI: 79.6–98.4%), Specificity 95.8% (95% CI: 89.6–98.9%).
Intermediate-risk: Sensitivity 53.6% (95% CI: 33.9–72.5%), Specificity 65.2% (95% CI: 52.4–76.5%).
High-risk: Sensitivity 90.5% (95% CI: 77.4–97.3%), Specificity 76.0% (95% CI: 60.6–87.9%).
Google Gemini 1.5 Pro demonstrated:
Low-risk: Sensitivity 93.3% (95% CI: 81.7–98.6%), Specificity 93.8% (95% CI: 86.0–98.0%).
Intermediate-risk: Sensitivity 53.6% (95% CI: 33.9–72.5%), Specificity 65.2% (95% CI: 52.4–76.5%).
High-risk: Sensitivity 86.0% (95% CI: 71.3–94.2%), Specificity 72.0% (95% CI: 56.3–84.7%).
Google Gemini 2.0 Experimental Advanced showed the lowest overall performance:
Low-risk: Sensitivity 81.3% (95% CI: 67.4–91.1%), Specificity 85.4% (95% CI: 75.7–92.3%).
Intermediate-risk: Sensitivity 57.1% (95% CI: 36.5–75.5%), Specificity 76.8% (95% CI: 65.4–85.8%).
High-risk: Sensitivity 88.4% (95% CI: 74.9–96.1%), Specificity 68.0% (95% CI: 52.4–81.4%).
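The paper does not state which interval method was used for the 95% CIs above. As one common choice, the Wilson score interval for a binomial proportion can be computed as below; the reported intervals may instead be exact Clopper–Pearson bounds, so the numbers here are not expected to match exactly.

```python
import math

def wilson_ci(successes: int, total: int, z: float = 1.96):
    """Wilson score confidence interval for a binomial proportion.

    z = 1.96 corresponds to a 95% interval. Returns (lower, upper)
    clamped to [0, 1].
    """
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / total + z ** 2 / (4 * total ** 2))
    return (max(0.0, centre - half), min(1.0, centre + half))
```

Unlike the normal-approximation (Wald) interval, the Wilson interval behaves sensibly at the extremes, which matters here because several observed sensitivities are 100%.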
A notable observation was the tendency of Google Gemini 2.0 Experimental Advanced to upgrade PI-RADS 1–2 cases to PI-RADS 3, particularly when prostate-specific antigen (PSA) levels were elevated and moderate diffusion restriction was present. Nevertheless, none of the models produced invalid outputs outside the defined PI-RADS 1–5 range.
Discussion
The findings of this study highlight the variable diagnostic accuracy of LLMs in assigning PI-RADS categories from text-based prostate MRI reports. Among the models evaluated, GPT-o1 showed the highest diagnostic agreement with radiologists, with a weighted Cohen’s kappa of 0.867 and the highest F1 scores in all PI-RADS categories: 0.93 (PI-RADS 1–2), 0.75 (PI-RADS 3) and 1.00 (PI-RADS 4–5). These results suggest that GPT-o1 not only achieved high overall concordance with radiologists but also maintained consistent precision and recall across risk strata. By contrast, GPT-4o and Google Gemini 1.5 Pro, while strong in the low-risk category (F1 = 0.92 and 0.93, respectively), showed limited effectiveness for intermediate-risk lesions (both F1 = 0.54). Such misclassifications can have significant clinical repercussions. For instance, upgrading a PI-RADS 1–2 case to 4–5 because of a false-positive result can lead to unnecessary biopsies, increased healthcare expenditure and patient anxiety [14, 15]. Accurate PI-RADS classification is crucial to the decision of whether to perform a biopsy: misclassifying low-risk lesions as high-risk can lead to unnecessary procedures, while misclassifying high-risk lesions as low-risk can delay the diagnosis of clinically significant prostate cancer. GPT-4o and Gemini 1.5 Pro had false-positive rates of 2.7% and 4.7%, respectively. These overclassifications are probably due to the models’ inability to distinguish subtle imaging characteristics, such as benign focal diffusion restriction and mild T2 hypointensity, which are common in benign prostatic hyperplasia.
Notably, Google Gemini 2.0 Experimental Advanced had the weakest overall agreement (κ = 0.664) and the lowest F1 score for low-risk lesions (F1 = 0.81). The model showed a clear tendency to upgrade PI-RADS 1–2 cases to PI-RADS 3, particularly in cases with high PSA density and moderate diffusion restriction. Although PSA is not an official component of the PI-RADS v2.1 scoring system, its incorporation into structured reports is recommended for clinical context [3]. While PSA is a key marker for prostate cancer screening, its use in PI-RADS classification is controversial because of the risk of overdiagnosis and overtreatment of indolent lesions [16–19]. Conversely, Gemini 1.5 Pro occasionally downgraded high-risk (PI-RADS 4–5) lesions to the low-risk category. These biases most likely reflect elements of the models’ training data or prompt interpretation.
Overall, all models underperformed for PI-RADS 3 lesions, with F1 scores ranging from 0.54 to 0.75. Underperformance in this category is to be expected given its inherent ambiguity, which is consistently associated with the lowest inter-reader agreement among radiologists [20]. Typical misclassified PI-RADS 3 reports included statements such as mild diffusion restriction without correspondingly low ADC values, or focal T2 hypointensity with equivocal enhancement on DCE. In these cases, the models frequently upgraded lesions to PI-RADS 4, particularly when PSA density values were elevated. Qualitative analysis of misclassified instances revealed recurring failure modes. The models tended to place undue weight on PSA values or to misinterpret descriptive phrases such as ‘equivocal findings’ or ‘borderline restriction’. In some cases, PI-RADS 3 reports were upgraded to higher categories when PSA levels were elevated. Conversely, equivocal diffusion findings were sometimes downgraded, which could delay diagnosis. These results suggest that LLMs rely too heavily on contextual clinical information and linguistic cues rather than strictly adhering to imaging-based language. This issue could be addressed through advanced prompt engineering, explicit instructions to ignore PSA when assigning categories, or fine-tuning on domain-specific corpora [12, 21].
Despite these limitations, LLMs are set to be integrated into radiology workflows. Applications include summarising key findings [22, 23], answering image-related questions [24], converting free text into structured reports [25, 26] and offering differential diagnoses [27, 28]. However, LLMs are known to generate false or misleading information through internal reasoning, a phenomenon commonly referred to as ‘hallucination’ [4, 29–31]. Lee et al. reported that Bard produced invalid categories, such as PI-RADS 6 [12]. By contrast, none of the models evaluated here produced outputs outside the valid PI-RADS 1–5 range, indicating consistent adherence to the diagnostic system, which is clinically relevant.
This study has several limitations. First, the single-centre retrospective design may restrict generalisability: variation in terminology, reporting style and imaging protocols across institutions may affect LLM performance, so future multicentre studies are needed to establish robustness across clinical settings and reduce institutional bias. Second, the modest sample size (n = 146) and the unbalanced distribution of categories, particularly for PI-RADS 5, may limit statistical power; larger datasets and stratified sampling would strengthen future analyses. Third, the study evaluated only text-based MRI reports, not the imaging data itself. Multimodal AI approaches that integrate imaging features with text-based data are likely to improve performance, particularly for indeterminate PI-RADS 3 cases.
In conclusion, while LLMs are not yet ready to replace radiologists, they could serve as helpful decision aids in prostate MRI interpretation. Addressing their current limitations through multicentre validation, larger and more representative datasets, and multimodality integration is key to ensuring their safe, accurate and clinically effective application in prostate cancer diagnosis.
Acknowledgments
Informed consent
A waiver of informed consent was granted by the institution's Review Board, as the study met the criteria for such a waiver according to the Declaration of Helsinki.
Abbreviations
- AI
Artificial Intelligence
- ADC
Apparent Diffusion Coefficient
- BI-RADS
Breast Imaging Reporting and Data System
- DCE
Dynamic Contrast-Enhanced
- DWI
Diffusion-Weighted Imaging
- F1
F1 Score
- GPT
Generative Pre-trained Transformer
- Kappa (κ)
Cohen’s Kappa Coefficient
- LI-RADS
Liver Imaging Reporting and Data System
- LLM
Large Language Model
- Lung-RADS
Lung Imaging Reporting and Data System
- MRI
Magnetic Resonance Imaging
- PI-RADS
Prostate Imaging Reporting and Data System
- PSA
Prostate-Specific Antigen
- PSAD
Prostate-Specific Antigen Density
Authors’ contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Betül Akdal Dölek, and Muhammed Said Besler. The first draft of the manuscript was written by Betül Akdal Dölek and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding
The authors have no financial or proprietary interests in any material discussed in this article.
Data availability
The data that support the findings of this study are available on request from the corresponding author.
Declarations
Ethics approval
This study protocol was reviewed and approved by the Ankara Bilkent City Hospital Scientific Research Evaluation Board (Approval No: TABED 1-25-1085).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Siegel RL, Miller KD, Wagle NS, Jemal A. Cancer statistics, 2023. CA Cancer J Clin. 2023;73:17–48. [DOI] [PubMed] [Google Scholar]
- 2.Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71:209–49. [DOI] [PubMed] [Google Scholar]
- 3.Turkbey B, Rosenkrantz AB, Haider MA, Padhani AR, Villeirs G, Macura KJ, Tempany CM, Choyke PL, Cornud F, Margolis DJ, Thoeny HC, Verma S, Barentsz J, Weinreb JC. Prostate imaging reporting and data system version 2.1: 2019 update of prostate imaging reporting and data system version 2. Eur Urol. 2019;76:340–51. [DOI] [PubMed] [Google Scholar]
- 4.Akinci D, Stanzione A, Bluethgen C, Vernuccio F, Ugga L, Klontzas ME, et al. Large language models in radiology: fundamentals, applications, ethical considerations, risks, and future directions. Diagn Interv Radiol. 2024;30:80–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Kim S, Lee CK, Kim SS. Large language models: a guide for radiologists. Korean J Radiol. 2024;25:126–33. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Busch F, Hoffmann L, Dos Santos DP, Makowski MR, Saba L, Prucker P, Hadamitzky M, Navab N, Kather JN, Truhn D, Cuocolo R, Adams LC, Bressem KK. Large Language models for structured reporting in radiology: past, present, and future. Eur Radiol. 2025;35(5):2589–2602. [DOI] [PMC free article] [PubMed]
- 7.Nakaura T, Ito R, Ueda D, Nozaki T, Fushimi Y, Matsui Y, et al. The impact of large language models on radiology: a guide for radiologists on the latest innovations in AI. Jpn J Radiol. 2024;42:685–96. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Cozzi A, Pinker K, Hidber A, Zhang T, Bonomo L, Lo Gullo R, et al. BI-RADS category assignments by GPT-3.5, GPT-4, and Google Bard: a multilanguage study. Radiology. 2024;311:e232133. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Miaojiao S, Xia L, Xian Tao Z, Zhi Liang H, Sheng C, Songsong W. Using a large language model for breast imaging reporting and data system classification and malignancy prediction to enhance breast ultrasound diagnosis: retrospective study. JMIR Med Inform. 2025;13(11):e70924. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Singh R, Hamouda M, Chamberlin JH, Tóth A, Munford J, Silbergleit M, et al. ChatGPT vs. Gemini: comparative accuracy and efficiency in Lung-RADS score assignment from radiology reports. Clin Imaging. 2025;121:110455. [DOI] [PubMed] [Google Scholar]
- 11.Matute-González M, Darnell A, Comas-Cufí M, Pazó J, Soler A, Saborido B, Mauro E, Turnes J, Forner A, Reig M, Rimola J. Utilizing a domain-specific large Language model for LI-RADS v2018 categorization of free-text MRI reports: a feasibility study. Insights Imaging. 2024;22(1):280. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Lee KL, Kessler DA, Caglic I, Kuo YH, Shaida N, Barrett T. Assessing the performance of ChatGPT and Bard/Gemini against radiologists for PI-RADS classification based on prostate multiparametric MRI text reports. Br J Radiol. 2025;98:368–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.McHugh ML. Interrater reliability: the kappa statistic. Biochem Med. 2012;22:276–82. [PMC free article] [PubMed] [Google Scholar]
- 14.Loeb S, Bjurlin MA, Nicholson J, Tammela TL, Penson DF, Carter HB, Carroll P, Etzioni R. Overdiagnosis and overtreatment of prostate cancer. Eur Urol. 2014;65:1046–55. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Haj-Mirzaian A, Burk KS, Lacson R, Glazer DI, Saini S, Kibel AS, et al. Magnetic resonance imaging, clinical, and biopsy findings in suspected prostate cancer: a systematic review and meta-analysis. JAMA Netw Open. 2024;7(3):e244258. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Schröder FH, Hugosson J, Roobol MJ, Tammela TL, Ciatto S, Nelen V, Kwiatkowski M, Lujan M, Lilja H, Zappa M, Denis LJ, Recker F, Berenguer A, Määttänen L, Bangma CH, Aus G, Villers A, Rebillard X, van der Kwast T, Blijenberg BG, Moss SM, de Koning HJ, Auvinen A, ERSPC Investigators. Screening and prostate-cancer mortality in a randomized European study. N Engl J Med. 2009;360:1320–1328. [DOI] [PubMed]
- 17.Sidaway P. MRI-based stratification reduces the risk of overdiagnosis of prostate cancer. Nat Rev Clin Oncol. 2024;21:838. [DOI] [PubMed]
- 18.Wang S, Kozarek J, Russell R, Drescher M, Khan A, Kundra V, et al. Diagnostic performance of prostate-specific antigen density for detecting clinically significant prostate cancer in the era of magnetic resonance imaging: a systematic review and meta-analysis. Eur Urol Oncol. 2024;7(2):189–203. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Pellegrino F, Stabile A, Sorce G, Quarta L, Robesti D, Cannoletta D, et al. Added value of prostate-specific antigen density in selecting prostate biopsy candidates among men with elevated prostate-specific antigen and PI-RADS ≥ 3 lesions on multiparametric magnetic resonance imaging of the prostate: a systematic assessment by PI-RADS score. Eur Urol Focus. 2024;10(4):634–40. [DOI] [PubMed] [Google Scholar]
- 20.Wen J, Ji Y, Han J, Shen X, Qiu Y. Inter-reader agreement of the prostate imaging reporting and data system version v2.1 for detection of prostate cancer: a systematic review and meta-analysis. Front Oncol. 2022;12:1013941. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Aggarwal R, Sounderajah V, Martin G, Ting DS, Karthikesalingam A, King D, et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit Med. 2021;4:65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Sun Z, Ong H, Kennedy P, Tang L, Chen S, Elias J, et al. Evaluating GPT4 on impressions generation in radiology reports. Radiology. 2023;307:e231259. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Gu K, Lee JH, Shin J, Hwang JA, Min JH, Jeong WK, Lee MW, Song KD, Bae SH. Using GPT-4 for LI-RADS feature extraction and categorization with multilingual free-text reports. Liver Int. 2024;44:1578–87. [DOI] [PubMed] [Google Scholar]
- 24.Gordon EB, Towbin AJ, Wingrove P, Shafique U, Haas B, Kitts AB, Feldman J, Furlan A. Enhancing patient communication with Chat-GPT in radiology. J Am Coll Radiol. 2024;21:353–9. [DOI] [PubMed] [Google Scholar]
- 25.Adams LC, Truhn D, Busch F, Avan K, Stefan MN, Marcus RM, et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting. Radiology. 2023;307:e230725. [DOI] [PubMed] [Google Scholar]
- 26.Jiang H, Xia S, Yang Y, Xu J, Hua Q, Mei Z, et al. Transforming free-text radiology reports into structured reports using ChatGPT: a study on thyroid ultrasonography. Eur J Radiol. 2024;175:111458. [DOI] [PubMed] [Google Scholar]
- 27.Kottlors J, Bratke G, Rauen P, Kabbasch C, Persigehl T, Schlamann M, et al. Feasibility of differential diagnosis based on imaging patterns using a large language model. Radiology. 2023;308:e231167. [DOI] [PubMed] [Google Scholar]
- 28.Park SH, Han K. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology. 2018;286:800–9. [DOI] [PubMed] [Google Scholar]
- 29.Bhayana R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology. 2024;310:e232756. [DOI] [PubMed] [Google Scholar]
- 30.Farquhar S, Kossen J, Kuhn L, Gal Y. Detecting hallucinations in large language models using semantic entropy. Nature. 2024;630(8017):625–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Roustan D, Bastardot F. The clinicians’ guide to large language models: a general perspective with a focus on hallucinations. Interact J Med Res. 2025;14:e59823. [DOI] [PMC free article] [PubMed] [Google Scholar]