Abstract
Background
The performance of publicly available large language models (LLMs) remains unclear for complex clinical tasks.
Purpose
To evaluate the agreement between human readers and LLMs for Breast Imaging Reporting and Data System (BI-RADS) categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management.
Materials and Methods
This retrospective study included reports for women who underwent MRI, mammography, and/or US for breast cancer screening or diagnostic purposes at three referral centers. Reports with findings categorized as BI-RADS 1–5 and written in Italian, English, or Dutch were collected between January 2000 and October 2023. Board-certified breast radiologists and the LLMs GPT-3.5 and GPT-4 (OpenAI) and Bard, now called Gemini (Google), assigned BI-RADS categories using only the findings described by the original radiologists. Agreement between human readers and LLMs for BI-RADS categories was assessed using the Gwet agreement coefficient (AC1 value). Frequencies were calculated for changes in BI-RADS category assignments that would affect clinical management (ie, BI-RADS 0 vs BI-RADS 1 or 2 vs BI-RADS 3 vs BI-RADS 4 or 5) and compared using the McNemar test.
Results
Across 2400 reports, agreement between the original and reviewing radiologists was almost perfect (AC1 = 0.91), while agreement between the original radiologists and GPT-4, GPT-3.5, and Bard was moderate (AC1 = 0.52, 0.48, and 0.42, respectively). Across human readers and LLMs, differences were observed in the frequency of BI-RADS category upgrades or downgrades that would result in changed clinical management (118 of 2400 [4.9%] for human readers, 611 of 2400 [25.5%] for Bard, 573 of 2400 [23.9%] for GPT-3.5, and 435 of 2400 [18.1%] for GPT-4; P < .001) and that would negatively impact clinical management (37 of 2400 [1.5%] for human readers, 435 of 2400 [18.1%] for Bard, 344 of 2400 [14.3%] for GPT-3.5, and 255 of 2400 [10.6%] for GPT-4; P < .001).
Conclusion
LLMs achieved moderate agreement with human reader–assigned BI-RADS categories across reports written in three languages but also yielded a high percentage of discordant BI-RADS categories that would negatively impact clinical management.
© RSNA, 2024
Summary
Based on breast imaging reports written in three languages, GPT-3.5, GPT-4, and Bard achieved moderate agreement with human reader–assigned Breast Imaging Reporting and Data System categories but also yielded a high percentage of discordant category assignments that would negatively impact clinical management.
Key Results
■ In this exploratory retrospective multicenter study of 2400 breast imaging reports written in English, Italian, or Dutch, agreement for Breast Imaging Reporting and Data System (BI-RADS) category assignments between human readers was almost perfect (Gwet AC1 = 0.91), while agreement between human readers and GPT-3.5, GPT-4, and Bard was moderate (AC1 = 0.48, 0.52, and 0.42, respectively).
■ Differences in the frequency of BI-RADS category upgrades or downgrades that would result in a negative change in clinical management were observed across human readers and large language models (human readers, 1.5%; GPT-4, 10.6%; GPT-3.5, 14.3%; Bard, 18.1%; P < .001).
Introduction
Since the public release of large language models (LLMs), their potential applications in health care have been the focus of intense scrutiny (1). In addition to the heated debates on the use of LLMs in scientific writing (2) and medical education (3), their performance in clinical tasks has become a common area of investigation (4,5).
In radiology, LLMs have already been tested across a broad spectrum of clinical tasks, from processing radiologic request forms (6) to providing imaging recommendations and differential diagnoses (7–9). In some of these tasks, such as the transformation of free-text reports into structured reports (10,11), publicly available generic LLMs (eg, GPT-3.5 and GPT-4 by OpenAI and Bard [now Gemini] by Google) have shown promising results (10). Conversely, a less encouraging scenario has been observed for more complex tasks requiring a higher level of analogical reasoning and deeper clinical knowledge, such as providing imaging recommendations, and when LLMs are asked to operate in languages other than English (9,12–14). Although LLMs and natural language processing tools trained for specific clinical tasks perform well (12,15), evaluating the abilities of generic LLMs remains important, as these tools are the most readily available (16) and may be used by both patients and nonradiologist physicians seeking a second opinion.
Because of concerns about low or moderate interreader agreement for Breast Imaging Reporting and Data System (BI-RADS) category assignment (17–21), this task has been among those for which natural language processing tools have been most intensely evaluated (22–24). Studies have shown that, after multiple phases of training and validation on large, specifically assembled unannotated and annotated data sets of breast imaging reports, natural language processing tools can accurately extract BI-RADS features, infer BI-RADS assignments, and ultimately predict pathologic outcomes (ie, a subsequent cancer diagnosis at biopsy) (22–24). However, studies investigating the ability of generically trained LLMs to assign BI-RADS categories based on radiologic reports in different languages, and their agreement with human readers, were lacking. Thus, the purpose of this study was to evaluate the agreement between human readers and LLMs for BI-RADS categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management.
Materials and Methods
This exploratory retrospective multicenter study was approved by the ethics committee of the canton of Ticino (protocol 2023–01032), where the coordinating center (center 1; Ente Ospedaliero Cantonale, Lugano, Switzerland) is located, and the need for specific informed consent was waived. Reports from center 2 (Memorial Sloan Kettering Cancer Center, New York, NY, United States) were collected after approval by the local institutional review board (protocol IRB19–093). Reports from center 3 (Netherlands Cancer Institute, Amsterdam, the Netherlands) were collected in the framework of a previously published study (23) approved by the local institutional review board (protocol IRBd21–058).
Study Design and Sample
This study included radiologic reports for women who underwent MRI, mammography, and/or US in a breast cancer screening or diagnostic setting at three tertiary referral hospitals. All breast imaging reports were produced by board-certified breast-dedicated radiologists. Reports were written in English (the dominant language of the training data of most publicly available LLMs), Italian (51 million to 100 million speakers worldwide), or Dutch (10 million to 50 million speakers worldwide).
To avoid the imbalance toward BI-RADS 1 and 2 categories routinely observed for first-line breast imaging modalities (mammography and US), each center created four sets of reports according to imaging modality: MRI, mammography alone, US alone, and mammography plus US. Each set included an equal number of reports for each of the BI-RADS categories from 1 to 5, as categorized by the original reporting radiologists. For all centers, reports categorized as BI-RADS 0 or BI-RADS 6 were excluded, as these assignments are influenced by factors that may not be included in the “findings” section of the radiologic reports assessed in the current study (eg, for BI-RADS 6, an already-performed breast biopsy).
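For illustration only, the following sketch shows one way such a modality- and category-balanced set could be assembled programmatically. It assumes a hypothetical pandas data frame named reports with columns modality and birads, which are not part of the study materials; note that, as described below, reports at centers 1 and 2 were retrieved consecutively rather than randomly sampled, so this is a simplification.

```python
import pandas as pd

def build_balanced_set(reports: pd.DataFrame, n_per_cell: int,
                       seed: int = 0) -> pd.DataFrame:
    """Hypothetical sketch: draw an equal number of reports per imaging
    modality and per BI-RADS category (1-5), mirroring the balanced
    design described above. Column names are assumptions."""
    # Exclude BI-RADS 0 and 6, as in the study design.
    eligible = reports[reports["birads"].between(1, 5)]
    return (
        eligible
        .groupby(["modality", "birads"], group_keys=False)
        .apply(lambda g: g.sample(n=n_per_cell, random_state=seed))
        .reset_index(drop=True)
    )

# Example: 40 reports per BI-RADS category within each modality set
# balanced = build_balanced_set(reports, n_per_cell=40)
```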
For center 1 and center 2, reports were consecutively retrieved from May 2020 and January 2021, respectively, until October 2023. The following inclusion criteria were applied: (a) patient aged at least 18 years and (b) report containing a complete description of imaging findings according to BI-RADS descriptors, impression, and BI-RADS assignment by the reporting radiologists. Reports in languages other than Italian (center 1) and English (center 2) were excluded.
The set of reports for center 3 was randomly sampled from reports (all in Dutch) included in the independent test set of the study by Zhang et al (23), collected between January 2000 and December 2020.
Report Processing and Evaluation
Included reports were exported into Excel spreadsheets (Microsoft). Seven readers (center 1: A.C., A.H., L.B., and M.C.; center 2: K.P. and R.L.G.; center 3: T.Z.) reviewed reports from their center to remove patient-identifying information and to separate the text of the four sections into which breast imaging reports are routinely organized at centers 1 and 2: (a) clinical statement and examination technique, (b) findings, (c) impression, and (d) BI-RADS category. Except for the correction of typographical errors, the original text was left unaltered.
At each center, the “findings” section of each report was then independently reviewed and assigned a BI-RADS category by a board-certified breast-dedicated radiologist (S.S. at center 1, K.P. at center 2, and R.M.M. at center 3, with 6, 14, and 11 years of experience, respectively) blinded to the original BI-RADS assignment, all other sections of the report, and any patient-related data.
BI-RADS Category Assignment by LLMs
Four authors (A.C., A.H., L.B., S.S.) entered the text extracted from the findings section of each report, along with a prompt requesting a BI-RADS category assignment, into GPT-3.5 (25) and GPT-4 (26) (the LLMs available via ChatGPT; OpenAI) and Bard (since renamed Gemini; Google) (27). Because Bard would not assign a BI-RADS classification if the prompt was in a language other than English, standardized prompts in English (Table S1) were used for all LLMs. LLMs did not have access to other sections of the radiologic report or to the human-assigned BI-RADS categories. Chat sessions were restarted after each report was entered and a response received. This process was performed between October 4 and October 30, 2023, for GPT-3.5 and GPT-4 and between October 4 and October 15, 2023, for Bard. To assess repeatability, 20 reports from each imaging modality category (four for each BI-RADS category) in each language were randomly selected and reentered into the LLMs after 7 days.
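Although the study used the public chat interfaces of the LLMs, the workflow can be illustrated with a short programmatic sketch. The snippet below is a minimal, hypothetical example using the OpenAI Python client; the prompt text is a placeholder and is not the standardized prompt given in Table S1, and an analogous call would be needed for Bard (Gemini).

```python
from openai import OpenAI  # assumes the OpenAI Python client (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical placeholder prompt: the study's standardized English prompt
# is given in Table S1 and is not reproduced here.
PROMPT = (
    "You are given the findings section of a breast imaging report. "
    "Assign a single BI-RADS category (1-5) and reply with the number only."
)

def assign_birads(findings_text: str, model: str = "gpt-4") -> str:
    # One request per report, mirroring the study's practice of starting
    # a fresh chat session for every report.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": findings_text},
        ],
    )
    return response.choices[0].message.content.strip()
```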
Statistical Analysis
Because the BI-RADS multicategory structure amplifies the paradoxes of marginal probabilities affecting the Cohen κ coefficient, agreement between radiologists and LLMs in the assignment of BI-RADS categories was assessed by calculating Gwet agreement coefficients (AC1 values) and their 95% CIs (28,29), interpreted according to the Landis and Koch scale (30): 0–0.20, slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; 0.81–1.00, almost perfect agreement.
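As a worked illustration of the agreement statistic, the following sketch computes the unweighted two-rater Gwet AC1 from two lists of category labels. It follows the definition in Gwet (28) but is not the analysis code used in the study, which relied on the Stata kappaetc module (29).

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Unweighted two-rater Gwet AC1 for nominal categories."""
    assert len(ratings_a) == len(ratings_b) and len(ratings_a) > 0
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    k = len(categories)

    # Observed agreement: proportion of subjects with identical categories.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Average marginal proportion of each category across the two raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = {c: counts[c] / (2 * n) for c in categories}

    # Chance agreement as defined by Gwet (2008).
    p_e = sum(p * (1 - p) for p in pi.values()) / (k - 1)

    return (p_a - p_e) / (1 - p_e)

# Example: AC1 between two raters' BI-RADS assignments
# ac1 = gwet_ac1(original_birads, reviewer_birads)
```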
Agreement in BI-RADS category assignment between the original human reader and the reader who reviewed the radiologic report as well as between the original human reader and each of the three LLMs was assessed for all reports, reports stratified according to language, and reports stratified by imaging modality within languages. AC1 values were compared as described by Gwet (31).
Finally, assigned BI-RADS categories were grouped according to different clinical management pathways: BI-RADS 1 or 2 (normal or benign; no action), BI-RADS 3 (probably benign; short-term follow-up), and BI-RADS 4 or 5 (suspicious or highly suggestive of malignancy; referral for biopsy). Overall and language-specific AC1 values for human-human and human-LLM agreement were calculated, and the number of discordant assessments was compared with the McNemar test. To ascertain whether downgrades or upgrades by the second human reader or by an LLM would result in a positive or detrimental change in clinical management, we considered the following. For reports originally categorized as BI-RADS 4 or 5, biopsy results were retrieved from institutional databases, and downgrading was considered detrimental if cancer was found at biopsy and positive if cancer was not found. For upgraded BI-RADS 1, 2, and 3 assignments, a 2% cancer yield was conservatively assumed (17), as follow-up data were not available for this study.
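To illustrate the management-level comparison, the sketch below groups BI-RADS categories into the management pathways described above and compares paired discordance counts with the McNemar test (using statsmodels). The construction of the 2 × 2 table is one plausible reading of the analysis, not the authors' exact code.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Map each BI-RADS category to its clinical management group
# (BI-RADS 0 could be assigned by a reviewer or LLM even though
# reports originally categorized as BI-RADS 0 were excluded).
MANAGEMENT = {0: "additional imaging", 1: "no action", 2: "no action",
              3: "follow-up", 4: "biopsy", 5: "biopsy"}

def management_discordant(original, other):
    """Boolean array: would the second assignment change management?"""
    return np.array([MANAGEMENT[a] != MANAGEMENT[b]
                     for a, b in zip(original, other)])

def compare_discordance(original, human2, llm):
    # For each report, cross-tabulate whether the human reviewer and an LLM
    # each disagreed with the original reader on management, then apply
    # the McNemar test to the paired discordance indicators.
    d_human = management_discordant(original, human2)
    d_llm = management_discordant(original, llm)
    table = np.array([
        [np.sum(~d_human & ~d_llm), np.sum(~d_human & d_llm)],
        [np.sum(d_human & ~d_llm),  np.sum(d_human & d_llm)],
    ])
    return mcnemar(table, exact=False, correction=True)
```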
Analyses were performed by one author (A.C.) using STATA (version MP 18.1; StataCorp), and the kappaetc module (29) was used for agreement evaluations. As this exploratory study did not aim to confirm any predefined hypothesis, no correction for multiple testing was implemented, and P < .05 was considered to indicate a statistically significant difference.
Results
Study Sample
As shown in the study flowchart (Fig 1), 800 reports in each of the three languages were included (40 in each of the five BI-RADS categories assigned using the four imaging modalities), for a total of 2400 reports. Among the patients with reports originally categorized as BI-RADS 4 (n = 480) or BI-RADS 5 (n = 480), cancer was found at biopsy in 254 (52.9%) and 468 (97.5%) patients, respectively.
Figure 1:
Study flowchart. Reports from center 3 were obtained from the study by Zhang et al (23). BI-RADS = Breast Imaging Reporting and Data System, CE-MRI = contrast-enhanced MRI, MG = mammography.
Human-Human Agreement
The agreement in BI-RADS category assignment between the original reader and the reviewing reader was almost perfect for all reports (AC1 = 0.91; Table 1), for reports stratified by imaging modality (AC1 range, 0.90–0.92; Table 2), for reports stratified by language (AC1 = 0.94 for English, 0.88 for Italian, and 0.89 for Dutch; Table 3), and for reports grouped by clinical management category (BI-RADS 0, BI-RADS 1 or 2, BI-RADS 3, and BI-RADS 4 or 5) (AC1 = 0.94; Table 4).
Table 1:
Agreement for BI-RADS Category 1–5 Assignments between Human Readers and between Human Readers and LLMs (n = 2400 reports)

Table 2:
Agreement in BI-RADS Category 1–5 Assignments between Human Readers and between Human Readers and Large Language Models for Different Imaging Modalities
Table 3:
Language-specific Agreement for BI-RADS Category 1–5 Assignments between Human Readers and between Human Readers and Large Language Models
Table 4:
Agreement and Changes in Clinical Management Categories between Human Readers and between Human Readers and Large Language Models
Overall Human-LLM and LLM-LLM Agreement
Among the 2400 included reports, all LLMs showed moderate agreement with the original human readers (Table 1), though within that level, agreement between the human readers and the three LLMs differed, with human–GPT-4 agreement observed to be higher than human–GPT-3.5 and human-Bard agreement (AC1 = 0.52 vs 0.48 vs 0.42; P < .001).
GPT-3.5–GPT-4 agreement was substantial and was higher than the moderate agreement between GPT-4 and Bard and between GPT-3.5 and Bard (AC1 = 0.65 vs 0.57 vs 0.54; P < .001). No evidence of a difference in human-LLM agreement was observed across the four imaging modalities (P value range, .09–.20; Table 2), nor was there a difference in human-LLM agreement between the different human readers (P value range, .16–.33; Table S2).
Human-LLM and LLM-LLM Agreement across Languages
All human-LLM and LLM-LLM agreement values were observed to be higher among the 800 reports written in English (moderate agreement for all human-LLM pairs) than among the 800 reports written in Italian (moderate human–GPT-3.5 and human–GPT-4 agreement and fair human-Bard agreement) and the 800 reports written in Dutch (moderate human–GPT-4 agreement and fair human–GPT-3.5 and human-Bard agreement) (Table 3).
GPT-3.5–GPT-4 agreement was moderate to substantial across all languages (AC1 = 0.69, 0.60, and 0.61 for English, Italian, and Dutch, respectively; P = .02). GPT-3.5–Bard agreement was moderate across all languages (AC1 = 0.60, 0.53, and 0.52 for English, Italian, and Dutch, respectively; P = .004), and GPT-4–Bard agreement was substantial for English and moderate for Italian and Dutch (AC1 = 0.62, 0.56, and 0.51; P < .001).
For reports stratified by language and imaging modality, no evidence of a difference in human-human and human–LLM agreement in the assignment of BI-RADS categories was found (P value range, .10–.68; Table S3).
Human-Human and Human-LLM Agreement according to Clinical Management
How BI-RADS assignments by the reviewing readers or the LLMs would potentially change clinical management across all reports is shown in Figure 2 and Table 4; for reports stratified by language, see Tables S4–S6.
Figure 2:
Sankey plots showing changes in Breast Imaging Reporting and Data System (BI-RADS) clinical management categories between human readers and between human readers and large language models (LLMs). Human-human agreement was assessed between the original radiologists who wrote the breast imaging reports (Human 1) and the radiologists who reviewed the findings section of the reports (Human 2). Human-LLM agreement was assessed between the original breast imaging reports and the outputs from three LLMs (Google Bard [27], GPT-3.5 [25], and GPT-4 [26]) provided with the findings section of the report. The proportion of disagreements between the original reporting radiologists and the radiologists who reviewed the findings section of the reports was observed to be lower (P < .001) than the proportions of disagreements between the original reporting radiologists and the LLMs.
When BI-RADS categories were grouped based on clinical management categories, an overall difference was found across human-human and human-LLM agreements, with the AC1 value for human-human agreement (0.94) observed to be higher (P < .001) than those for agreement between human readers and GPT-4 (0.80), GPT-3.5 (0.73), and Bard (0.71). Further, human-human disagreements on BI-RADS categories would have led to a change in clinical management for 118 of 2400 reports (4.9%), fewer than for disagreements between the original reader and GPT-4, GPT-3.5, and Bard (435 of 2400 [18.1%], 573 of 2400 [23.9%], and 611 of 2400 [25.5%], respectively; P < .001).
Among reports originally classified incorrectly as BI-RADS 4 or 5 by human readers (ie, no cancer was observed at biopsy), no evidence of a difference in the proportion of reports correctly downgraded to BI-RADS 1, 2, or 3 was observed across second human readers and LLMs (52.6% [10 of 19] for human readers, 53.5% [46 of 86] for GPT-4, 52.6% [71 of 135] for GPT-3.5, and 49.3% [73 of 148] for Bard; P = .92).
Conversely, the proportion of reports upgraded to a BI-RADS category resulting in more aggressive management (BI-RADS 1 or 2 reports upgraded to a category requiring 6-month follow-up or BI-RADS 1, 2, or 3 reports upgraded to a category requiring biopsy) was lower for human review than for GPT-4, GPT-3.5, and Bard (1.2% [29 of 2400], 9.1% [219 of 2400], 11.9% [286 of 2400], and 15.3% [367 of 2400], respectively; P < .001). Among these reports, most were upgraded from follow-up to biopsy (24 of 29 [82.8%] for human review, 164 of 219 [74.9%] for GPT-4, 221 of 286 [77.3%] for GPT-3.5, and 257 of 367 [70.0%] for Bard).
The proportion of reports assigned a different BI-RADS category that would result in a detrimental change in clinical management appeared to be lower for human review (1.5% [37 of 2400]) than for GPT-4, GPT-3.5, and Bard (10.6% [255 of 2400], 14.3% [344 of 2400], and 18.1% [435 of 2400], respectively; P < .001). For all LLMs, the proportion of reports assigned a different BI-RADS category that would result in a positive change in clinical management (GPT-4, 50 of 2400 [2.1%]; GPT-3.5, 77 of 2400 [3.2%]; Bard, 80 of 2400 [3.3%]) appeared lower (P < .001) than the proportion that would result in a detrimental change.
Repeatability of LLM Assignments
Within each language, there was no evidence of a difference in the level of agreement between initial BI-RADS assignments and BI-RADS assignments 7 days later across the three LLMs. Levels of agreement (Table S7) were almost perfect for reports in English (0.82 for GPT-3.5, 0.83 for Bard, and 0.88 for GPT-4; P = .57) and substantial to almost perfect for reports in Italian (0.78 for GPT-3.5, 0.81 for Bard, and 0.83 for GPT-4; P = .76) and Dutch (0.80 for GPT-3.5, 0.80 for Bard, and 0.84 for GPT-4; P = .84).
Discussion
While large language models (LLMs) have shown promising results for simple tasks (eg, the processing of radiologic request forms), their performance in more complex tasks remains unclear. The aim of this exploratory retrospective study was to evaluate the agreement between human readers and LLMs for Breast Imaging Reporting and Data System (BI-RADS) categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management. For 2400 breast imaging reports, we found almost perfect agreement between human readers (Gwet AC1 = 0.91) and moderate agreement between human readers and LLMs (Gwet AC1 = 0.52 for GPT-4, 0.48 for GPT-3.5, and 0.42 for Bard; P < .001). The frequency of disagreements in BI-RADS category assignments that would result in a negative change in clinical management also differed, with such disagreements appearing to be more frequent between human readers and LLMs than between human readers (14.3% for human–GPT-3.5, 10.6% for human–GPT-4, 18.1% for human-Bard, and 1.5% for human-human; P < .001).
The moderate agreement between human readers and LLMs for BI-RADS category assignments and the relatively high percentage of reports with discordant BI-RADS categories resulting in negative changes in clinical management highlight how LLMs are currently unable to deal with tasks requiring complex medical reasoning. This has been shown in previous studies in which LLMs were tested in making decisions or providing recommendations relating to liver and lung imaging (13,32,33). Arguably, a point with crucial consequences is that the LLM chatbots ChatGPT and Bard (now called Gemini) are publicly and freely available, whereas the vast majority of validated diagnostic and prognostic artificial intelligence models are commercial products exclusively accessible in regulated contexts (34). LLM chatbots do not currently meet the standards (eg, transparency and bias control) required of artificial intelligence tools certified for health care–related tasks, and thus professional, legal, and ethical issues can arise from their use for such purposes (1,4,35). Calls for specific approval of LLMs as medical devices have already been issued, covering most of the tasks in which LLMs have been tested to date (35,36). However, such regulations have yet to be implemented, adding to the “double-edged sword” profile of LLMs (37).
The levels of agreement between human readers and LLMs for BI-RADS category assignment were observed to be lower than the level of agreement between human readers. However, the human-LLM agreement observed in our study was close to the human-human agreement observed in prior studies evaluating a task with many more confounding factors—specifically, agreement in BI-RADS assignments when full access to images and clinical data is provided (18–21). This suggests that the use of the highly standardized BI-RADS lexicon in referral centers allows a reader who has access to only the “findings” section of a report to reliably assign the same BI-RADS category, and that generically trained LLMs are not currently able to achieve these standards. While context training with external data has been proposed to substantially improve the performance of generically trained LLMs in similar tasks (12), at this time the potential consequences of unrestricted use of LLMs as unwarranted providers of second opinions by patients and physicians alike could be disruptive. While disagreements between radiologists are usually resolved through internal consensus or peer-to-peer communication, the global and immediate availability of LLMs makes it possible for anyone with internet access to ask for an opinion on a radiologic report. Especially considering the current shortage of medical professionals that is hindering quick and reliable patient-physician and physician-physician communication (38), the already heavy workload of breast imaging professionals could be worsened by the additional time needed to rediscuss LLM-generated BI-RADS category assignments resulting in different management strategies (up to 25.5% for Bard) with patients and colleagues.
This study has three main limitations. First, only the BI-RADS descriptors reported by the radiologists were entered into the LLMs; the LLMs did not have access to images or clinical data that could have improved performance. Second, the evaluation of LLM assignment repeatability was conducted at a single time point. Third, we limited our investigation to three languages, two of which (Italian and Dutch) are likely less represented in the training sets of the evaluated LLMs than languages such as Mandarin Chinese, Spanish, or French, which may therefore be more easily processed by the LLMs.
In conclusion, although GPT-4, GPT-3.5, and Bard achieved moderate agreement with human-assigned Breast Imaging Reporting and Data System categories based on imaging reports in three languages, a high percentage of discordant category assignments that would result in negative changes in clinical management was observed. This raises several concerns about the consequences of unwarranted use of large language models (LLMs) by patients and health care professionals, highlighting the need for regulation of publicly available generically trained LLMs and for the rapid development of context-trained extensions of these tools.
A.C. and K.P. contributed equally to this work.
R.M.M. and S.S. are co-senior authors.
K.P. is supported in part by a National Cancer Institute Cancer Center Support Grant (P30 CA008748).
Disclosures of conflicts of interest: A.C. No relevant relationships. K.P. Grants or contracts from the Research and Innovation Framework Programme, FET Open, Anniversary Fund of the Oesterreichische Nationalbank, Vienna Science and Technology Fund, Memorial Sloan Kettering Cancer Center, and Breast Cancer Research Foundation; unpaid consultant for Genentech; consulting fees from Merantix, AURA Health Technologies, and Guerbet; payment or honoraria for lectures, presentations, speakers bureaus, manuscript writing, or educational events from the European Society of Breast Imaging, Bayer, Siemens Healthineers, International Diagnostic Course Davos, Olea Medical, and Roche; support for attending meetings and/or travel from the European Society of Breast Imaging; participation on a data and safety monitoring board or advisory board for Bayer and Guerbet; and institution (Memorial Sloan Kettering Cancer Center) has institutional financial interests relative to Grail. A.H. No relevant relationships. T.Z. No relevant relationships. L.B. No relevant relationships. R.L.G. No relevant relationships. B.C. No relevant relationships. M.C. No relevant relationships. S.R. No relevant relationships. F.D.G. Institution (Imaging Institute of Southern Switzerland) is a Siemens Healthineers reference center for research. R.M.M. Grants or contracts from the Dutch Cancer Society, Europees Fonds voor Regionale Ontwikkeling Programma Oost-Nederland, Horizon Europe, European Research Council, Dutch Research Council, Health Holland, Siemens Healthineers, Bayer, ScreenPoint Medical, Beckton Dickinson, PA Imaging, Lunit, and Koning Health; royalties or licenses from Elsevier; consulting fees from Siemens Healthineers, Bayer, ScreenPoint Medical, Beckton Dickinson, PA Imaging, Lunit, Koning Health, and Guerbet; participation on a data and safety monitoring board or advisory board for the SMALL trial; member of the European Society of Breast Imaging executive board; member of the European Society of Radiology Research Committee; member of the editorial board for European Journal of Radiology; member of the Dutch Breast Cancer Research Group; and associate editor for Radiology. S.S. Consulting fees from Arterys; payment or honoraria for lectures, presentations, speakers bureaus, manuscript writing, or educational events from GE HealthCare; and support for attending meetings and/or travel from Bracco.
Abbreviations:
- BI-RADS
- Breast Imaging Reporting and Data System
- LLM
- large language model
References
- 1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med 2023;29(8):1930–1940.
- 2. Kitamura FC. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology 2023;307(2):e230171.
- 3. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board–style examination: insights into current strengths and limitations. Radiology 2023;307(5):e230582.
- 4. Haupt CE, Marks M. AI-generated medical advice—GPT and beyond. JAMA 2023;329(16):1349–1350.
- 5. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med 2023;388(13):1233–1239.
- 6. Gertz RJ, Bunck AC, Lennartz S, et al. GPT-4 for automated determination of radiological study and protocol based on radiology request forms: a feasibility study. Radiology 2023;307(5):e230877.
- 7. Kottlors J, Bratke G, Rauen P, et al. Feasibility of differential diagnosis based on imaging patterns using a large language model. Radiology 2023;308(1):e231167.
- 8. Rao A, Kim J, Kamineni M, et al. Evaluating GPT as an adjunct for radiologic decision making: GPT-4 versus GPT-3.5 in a breast imaging pilot. J Am Coll Radiol 2023;20(10):990–997.
- 9. Barat M, Soyer P, Dohan A. Appropriateness of recommendations provided by ChatGPT to interventional radiologists. Can Assoc Radiol J 2023;74(4):758–763.
- 10. Adams LC, Truhn D, Busch F, et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology 2023;307(4):e230725.
- 11. Mallio CA, Sertorio AC, Bernetti C, Beomonte Zobel B. Large language models for structured reporting in radiology: performance of GPT-4, ChatGPT-3.5, Perplexity and Bing. Radiol Med (Torino) 2023;128(7):808–812.
- 12. Rau A, Rau S, Zoeller D, et al. A context-based chatbot surpasses radiologists and generic ChatGPT in following the ACR appropriateness guidelines. Radiology 2023;308(1):e230970.
- 13. Patil NS, Huang RS, van der Pol CB, Larocque N. Using artificial intelligence chatbots as a radiologic decision-making tool for liver imaging: do ChatGPT and Bard communicate information consistent with the ACR appropriateness criteria? J Am Coll Radiol 2023;20(10):1010–1013.
- 14. Huang H, Tang T, Zhang D, et al. Not all languages are created equal in LLMs: improving multilingual capability by cross-lingual-thought prompting. arXiv 2305.07004 [preprint]. https://arxiv.org/abs/2305.07004. Posted May 11, 2023. Updated October 22, 2023. Accessed November 12, 2023.
- 15. Tan RSYC, Lin Q, Low GH, et al. Inferring cancer disease response from radiology reports using large language models with data augmentation and prompting. J Am Med Inform Assoc 2023;30(10):1657–1664.
- 16. Wornow M, Xu Y, Thapa R, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med 2023;6(1):135.
- 17. D’Orsi CJ, Sickles EA, Mendelson EB, Morris EA. ACR BI-RADS atlas. 5th ed. Reston, Va: American College of Radiology, 2013.
- 18. Grimm LJ, Anderson AL, Baker JA, et al. Interobserver variability between breast imagers using the fifth edition of the BI-RADS MRI lexicon. AJR Am J Roentgenol 2015;204(5):1120–1124.
- 19. El Khoury M, Lalonde L, David J, Labelle M, Mesurolle B, Trop I. Breast imaging reporting and data system (BI-RADS) lexicon for breast MRI: interobserver variability in the description and assignment of BI-RADS category. Eur J Radiol 2015;84(1):71–76.
- 20. Salazar AJ, Romero JA, Bernal OA, Moreno AP, Velasco SC. Reliability of the BI-RADS final assessment categories and management recommendations in a telemammography context. J Am Coll Radiol 2017;14(5):686–692.e2.
- 21. de Margerie-Mellon C, Debry JB, Dupont A, et al. Nonpalpable breast lesions: impact of a second-opinion review at a breast unit on BI-RADS classification. Eur Radiol 2021;31(8):5913–5923.
- 22. Banerjee I, Bozkurt S, Alkim E, Sagreiya H, Kurian AW, Rubin DL. Automatic inference of BI-RADS final assessment categories from narrative mammography report findings. J Biomed Inform 2019;92:103137.
- 23. Zhang T, Tan T, Wang X, et al. RadioLOGIC, a healthcare model for processing electronic health records and decision-making in breast disease. Cell Rep Med 2023;4(8):101131.
- 24. Kuling G, Curpen B, Martel AL. BI-RADS BERT and using section segmentation to understand radiology reports. J Imaging 2022;8(5):131.
- 25. GPT-3.5. OpenAI. https://platform.openai.com/docs/models/gpt-3-5. Accessed October 4–30, 2023.
- 26. GPT-4. OpenAI. https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo. Accessed October 4–30, 2023.
- 27. Bard. Google. https://gemini.google.com/. Accessed October 4–15, 2023.
- 28. Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol 2008;61(Pt 1):29–48.
- 29. Klein D. Implementing a general framework for assessing interrater agreement in Stata. Stata J 2018;18(4):871–901.
- 30. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33(1):159–174.
- 31. Gwet KL. Testing the difference of correlated agreement coefficients for statistical significance. Educ Psychol Meas 2016;76(4):609–637.
- 32. Cao JJ, Kwon DH, Ghaziani TT, et al. Accuracy of information provided by ChatGPT regarding liver cancer surveillance and diagnosis. AJR Am J Roentgenol 2023;221(4):556–559.
- 33. Rahsepar AA, Tavakoli N, Kim GHJ, Hassani C, Abtin F, Bedayat A. How AI responds to common lung cancer questions: ChatGPT vs Google Bard. Radiology 2023;307(5):e230922.
- 34. Rajpurkar P, Lungren MP. The current and future state of AI interpretation of medical images. N Engl J Med 2023;388(21):1981–1990.
- 35. Gilbert S, Harvey H, Melvin T, Vollebregt E, Wicks P. Large language model AI chatbots require approval as medical devices. Nat Med 2023;29(10):2396–2398.
- 36. Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med 2023;6(1):120.
- 37. Shen Y, Heacock L, Elias J, et al. ChatGPT and other large language models are double-edged swords. Radiology 2023;307(2):e230163.
- 38. Aminololama-Shakeri S, Soo MS, Grimm LJ, et al. Radiologist-patient communication: current practices and barriers to communication in breast imaging. J Am Coll Radiol 2019;16(5):709–716.