Abstract
Patient access to radiology reports has heightened the need for patient-friendly communication. Automated generation of patient-centered summaries using large language models (LLMs) is a promising solution. However, their use on real-life reports is limited by privacy concerns. Here, we evaluate the safety and effectiveness of on-premise, privacy-preserving LLMs for generating lay summaries of real French brain MRI reports for emergency presentations of headache. In this retrospective study, we sampled 105 brain MRI reports (January–December 2022) for radiologist evaluation and a subset of 30 reports for non-physician evaluation. Three open-weights models (Llama 3.3 70B, Athene V2, Mistral Small) generated French lay summaries via a single standardized prompt. Radiologists’ mean ratings across models were high for exactness (4.10, 95% CI: 4.04–4.16), exhaustiveness (4.34, 95% CI: 4.29–4.39), didacticness (3.83, 95% CI: 3.79–3.88), and readiness for clinical use (3.84, 95% CI: 3.79–3.89). Non-physicians reported higher perceived understanding with summaries, from 2.85 (95% CI: 2.67–3.04) to 4.27 (95% CI: 4.15–4.38, p < 0.001). The correct identification rate for reports increased from 75.2% to 83.6% (p < 0.001). The ability to identify causal findings also improved, from 80.6% to 84.8% (p < 0.001) overall. Overall error rate in LLM-generated lay summaries was 19.7% (62/315), warranting expert oversight.
Keywords: LLM, Large language model, Lay summary, Understanding, Patient-centered reports, Artificial intelligence
Subject terms: Diseases, Health care, Medical research, Neurology
Introduction
Effective communication of radiology findings to patients has become increasingly important due to widespread access to personal health information via online patient portals1. Radiology reports, typically written for healthcare professionals, contain specialized medical terminology that can be difficult for patients to interpret2. Misunderstanding these reports can lead to anxiety, stress, unnecessary follow-up visits, and increased healthcare utilization3,4.
This issue is particularly significant in emergency settings, where clear communication of imaging findings is essential for managing patient anxiety and ensuring proper follow-up5,6. Among those exams, brain MRIs for headaches are increasingly frequent, and can yield high rates of both causal and incidental findings7–9. The list of potential findings being virtually limitless, flexible, patient-centered methods are needed to better communicate the clinical significance of results9. Among methods used to enhance patient communication, lay summaries are particularly interesting as they have been shown to be efficient while maintaining the integrity of the original report10. However, with the increase in radiology workload, generating these summaries is not feasible without automation10,11.
Large Language Models (LLMs) have shown promising abilities in radiology12,13, including for simplifying imaging reports14,15. Their limitless adaptability to specific contexts and their clinical knowledge has been leveraged to lower the reading difficulty of reports, while maintaining medical correctness15,16. However, most studies so far have employed commercial LLMs unsuitable for use in clinical settings, or used English reports either fictitious or sourced from publicly available datasets14–16. Privacy-preserving models, used on-premise, offer an interesting alternative as they can be deployed on site without communication to commercial servers17. Their ability to generate lay summaries from real reports has not been tested yet.
We hypothesized that privacy-preserving models could be used to generate safe lay summaries of reports to enhance non-physician understanding of brain MRI key results. To test this hypothesis, we studied three privacy-preserving on-premise LLMs, LLaMA 3.3, Athene V2, and Mistral Small in generating accurate patient-friendly summaries of emergency brain MRI reports for patients presenting with headaches with both radiologists and non-physicians rating.
Methods
This retrospective study aimed to evaluate the security, quality and ability to enhance comprehension by non-physicians of LLM-generated lay summaries of French free-text radiology reports with open-weights LLMs. The data warehouse from which the reports were extracted was approved by the French data protection authority (reference no. 2019–103). Use of the data for this specific study was approved by the Lille University Hospital institutional review board in June 2023 (EDS2307251350).
This study was conducted in accordance with the Declaration of Helsinki, and all human participants provided informed consent.
This study is reported in accordance with the TRIPOD-LLM statement.
Data
Free-text reports were obtained from the health data warehouse of our institution. They were pseudonymized by detecting and removing the place of residence of the patient, their name, and the name of the prescribing physician using eHOP software (Université de Rennes). Eligible reports were brain MRI scans from patients in the emergency department performed from January 2022 to December 2022. Reports were initially written in French as free text by 22 trainees and 21 board-certified radiologists, who were unaware of this study at the time of reporting. Out of the 595 reports, a random sample of 105 reports was drawn for the radiologists rating and 30 random reports were further drawn from the sample for the non-physician rating.
Radiologists rating study
Three board-certified radiologists specialized in neuro-imaging participated in this study (B.L.G., 5 years of experience, Y.G., 7 years of experience, and Q.V.M. 14 years of experience). Radiologists were presented with the original report along with the three unaltered LLM-generated lay summaries in random order, blinded from the models. They were tasked to evaluate each lay summary according to four aspects on a five-point Likert scale: exactness (“Are the information contained in this report medically correct?”), exhaustiveness (“Are all information from the original report found in the lay summary?”), didacticness (“Is this lay summary providing supplementary information to better understand the content of the original report?”), and readiness for clinical use (“Would you agree to include this lay summary as is along with the original report?”). Errors were then further classified by B.L.G. in three categories. “Medical errors” were defined as errors regarding anatomical or physiological descriptions, or the clinical misinterpretation of medical terminology and abbreviations (e.g., misidentifying an acronym). We categorized acronym misinterpretations (such as RVCS) as medical errors rather than linguistic ones because, while the error is lexical in nature, the resulting summary conveys incorrect clinical information that could impact patient safety. “Linguistic errors” defined as summaries containing issues with fluency, including non-idiomatic phrasing, un-translated source text (English/Chinese terms in a French output), or neologisms. “Hallucinations”, defined as the introduction of observations in the summary that were entirely absent from the source report.
Non-physicians rating
Eleven non-physicians working in the medical informatics lab participated in this study. All participants were not exposed to radiology reports as part of their job. None of them held a degree in medicine. Their characteristics are presented in Table 1. They were selected on the basis of their clearance for accessing medical reports from our data warehouse for research purposes. Non-physicians were presented with a random set of altered and original reports in a random order (15 original reports and 15 altered, five per model tested). Participants rated their subjective understanding of the reports (“How much do you understand the content of this report?”), perceived ability to discuss the content of the report with friends or family (“How much do you feel able to discuss the results with friends or family?”), and anxiety (“If you were reading this report for yourself or as a patient surrogate, how much anxiety would you feel?”). Objective report understanding was assessed by asking the participants to assess the existence of a positive finding (“Is there any positive finding in the results section of this report?”) and the existence of a finding that could potentially explain the patient’s headache (“Is there any finding that could potentially explain the patient’s headache?”). The reference standard was obtained from a previous study, using the rating of 4 independent experts with consensus for discrepant cases, as described in a previous study17.
Table 1.
Characteristics of the non-physician raters.
| Age (median, range) | 27 (22–47) |
| Sex (n male (%)) | 6 (55) |
| Higher education degree (n (%)) | |
| Bachelor | 2 (18) |
| Masters degree | 6 (55) |
| PhD | 3 (27) |
Sample size calculation
To determine the required sample size, we conducted a power analysis using Monte Carlo simulations tailored to our mixed-effects logistic regression model. This approach allowed us to account for the random effects of both participants (raters) and reports. Our goal was to achieve a statistical power of 0.8 for detecting an increase in objective understanding from 75% to 85% (adapted from18 given the high education rate of our raters), at a significance level of 0.05. To account for a possible clustering effect, the intra-participant correlation was set at 0.10. The simulations indicated that with 11 participants, approximately 30 reports would be sufficient to achieve the desired power.
Models
The three top-scoring open-weights models at 70B or less parameters (constrained by the calculation power of our institution) in the “French” sub-section of chatbot arena19 on March 1st 2025 were selected. These models were Athene V2, fine-tuned from Qwen, Llama 3.3 70b, and Mistral small. Models were deployed using Ollama in python. The models were run on three Quadro RTX 6000 graphics processing units (NVIDIA).
Prompting
As multiple models were tested on several raters, a single prompt was selected for all analyses. Prompt used was qualitatively engineered on a subset of 10 brain MRI reports not included in the study, by B.L.G and A.H. The models were tasked to create a 4–6 sentences easily understandable summary intended for the patient (lay summary). Emphasis was put on explicating difficult medical terminology and relating the findings to the symptoms of the patient. No specific instructions were given to handle ambiguous findings. The final prompt was “Write a 4–6 sentences patient summary in French for the following radiology report. It should contain all key information from the initial report, and explicit difficult medical terminology. It should be easy to read for patients with no medical knowledge. Be sure to include a sentence on how the findings relate to the patient’s symptoms. Start your answer with ‘Résumé patient:’
Here is the radiology report: [report].”
All models were downloaded on our local server and accessed using Ollama. A temperature of 0 was used for all models to ensure deterministic output.
Statistical analyses
For radiologist ratings on the five Likert-scale dimensions (exactness, exhaustiveness, didacticness, and readiness for clinical use), responses were summarized descriptively as proportions for each response category. For non-physician ratings, the primary endpoints were the correctness of responses to the binary questions: (i) “Does this report contain any abnormal finding?”; and (ii) “Does this report contain a finding that could plausibly explain the patient’s headache?”. A crossed random-effects logistic regression was applied, with report type (original vs. AI-augmented lay summary) included as a fixed effect, and random effects for both participants and reports20. The main analysis focused on adjusted marginal probabilities of a correct response in each condition, expressed as percentages with 95% confidence intervals (CIs). Odds ratios with 95% CIs were also reported as synthetic effect measures. Subgroup analyses were additionally performed according to whether the report contained a positive finding or not, by including an interaction term between report type and report content in the model. Secondary endpoints for non-physicians included perceived understanding, perceived ability to discuss results, and projected anxiety, rated on ordinal Likert scales. These were analyzed using mixed-effects linear logistic regression with the same crossed random-effects structure. Results were expressed as adjusted means of Likert scores with 95% CIs. No multiplicity correction was applied, as secondary analyses were considered exploratory. P < 0.05 was considered to indicate statistical significance. All analyses were performed by a statistician (R.B.) not involved in the rating of the reports using the lme4 and emmeans packages in the R software (version 4.3.1; R Foundation).
Results
Reports
From the 105 original reports included in the sample, 64.8% (68/105) were classified as containing at least one abnormal finding and 27.6% (29/105) as containing a positive finding that could explain the patient’s headache. In the sample of 30 reports for non-physician ratings, there were 70% (21/30) abnormal reports and 30.0% (9/30) with a positive finding that could explain the symptoms.
Radiologists ratings
Mean rating for exactness was 4.10 overall (95% CI 4.04–4.16), with the only significant difference being between Athene and Mistral (p = 0.005) (Fig. 1). Mean exhaustivity rating was 4.34 overall (95% CI 4.29–4.39), with only Llama and Athene rated significantly higher than Mistral (both p < 0.001). Mean did actiness was 3.83 overall (95% CI 3.79–3.88), only significantly higher for Llama and Athene as compared to Mistral (both p < 0.001). Mean rating for readiness for clinical use was 3.84 overall (95% CI 3.79–3.89), with the only significant difference between Athene and Mistral (p = 0.002).
Fig. 1.
Radiologists rating of lay summaries generated by LLMs (overall), Athene, Mistral and Llama. Only statistically significant differences are shown.
Error analysis
The overall rate of summary containing errors was 19.7% (62 errors across the 105 lay summaries generated by each of 3 models), with no difference in error rates between models. Medical errors occurred in 9.2% (29/315) of summaries, with no difference between models. Typical examples included incorrect acronym explanation (“RVCS” described as “central venous thrombosis”) and incorrect anatomical descriptions (“cerebellum” described as “left part of the brain”). Linguistic errors occurred in 8.3% (26/315) of summaries overall, and were more frequent for Athene (p = 0.02) (Table 2). Typical examples included unnatural phrasings derived from the English wording and occurrences of English or Chinese words in the French summary. More examples are provided in Table 3.
Table 2.
Error analysis of athene, Llama and mistral.
| Error type | Athene | Llama | Mistral | Overall | p-value |
|---|---|---|---|---|---|
| Medical errors | 4.8% (5/105) | 11.4% (12/105) | 11.4% (12/105) | 9.2% (29/315) | 0.16 |
| Linguistic errors | 14.3% (15/105) | 4.8% (5/105) | 5.7% (6/105) | 8.3% (26/315) | 0.02* |
| Hallucinations | 1.9% (2/105) | 1.0% (1/105) | 3.8% (4/105) | 2.2% (7/315) | 0.36 |
| Total errors | 21.0% (22/105) | 17.1% (18/105) | 21.0% (22/105) | 19.7% (62/315) | 0.73 |
*Statistically significant (p < 0.05) using a Chi-squared test.
Table 3.
Error examples.
| Original report /// translation | Lay summary /// translation | Explanation |
|---|---|---|
|
Persistence d’une collection sous le volet mesurée jusqu’à 5 mm d’épaisseur. /// Persistent sub-flap collection measuring up to 5 mm in thickness. |
Athene: - **Collection sous le volet: ** Il y a encore un petit积聚(最多5毫米厚)在手术区域下方。 /// Sub-flap collection: there is still a small积聚(最多5毫米厚)在手术区域下方。 |
End of the sentence in chinese |
|
- absence d’œdème périlésionnel ni d’effet de masse sur le parenchyme adjacent. /// No perilesional edema or mass effect on the adjacent brain tissue. |
Athene: - Il n’y a pas d’enflure ou de compression des tissus alentours /// There is no swelling or compression of the surrounding tissues. |
Edema is simplified in enflure (edema) which can also be used as an insult in French and would not be used in a radiology report. |
|
Formation kystique intrasellaire en projection de la posthypophyse de 5 mm de grand axe, non spécifique. /// Intrasellar cystic formation projecting from the posterior pituitary, 5 mm in its largest dimension, nonspecific. |
Llama: … * Formation kystique intrasellaire : une petite poche remplie de liquide near l’hypophyse. /// Intrasellar cystic formation: a small fluid-filled pocket near the pituitary gland. |
An English word (near) appears mid-sentence where its French translation (proche) should have been used. |
| No mention of AVM |
Llama: Termes techniques: * MAV: Malformation Artério-Veineuse /// Technical glossary: AVM: Arteriovenous Malformation. |
No mention of AVM in the original report. |
|
Réévaluation de lésions de PRES et recherche d’argument pour un SVCR. /// Reassessment of PRES lesions and investigation for possible RVCS. |
Mistral: L’IRM a été réalisée pour réévaluer les anomalies du syndrome de Posterior Reversible Encephalopathy Syndrome (PRES) et rechercher des signes d’une thrombose veineuse cérébrale (SVCR). /// MRI was performed to reassess abnormalities of Posterior Reversible Encephalopathy Syndrome (PRES) and to look for signs of cerebral venous thrombosis (RCVS). |
Incorrect disambiguation of RVCS (Reversible vasoconstriction syndrome) into cerebral venous thrombosis (CVT) |
|
Pas de signe de dissection ou de thrombose veineuse cérébrale. /// No sign of vessel dissection or cerebral venous thrombosis. |
Mistral: L’examen n’a montré aucun signe d’AVC, de dislocation (dissection) des vaisseaux du cou ni de caillot dans le cerveau. /// The exam showed no signs of stroke, vessel dislocation (dissection) in the neck, or blood clot in the brain. |
Dislocation is used as an incorrect synonym of dissection. |
Non-physician ratings
Lay summaries led to a substantial improvement in perceived understanding, from 2.85 (95% CI: 2.67, 3.04) for original reports to 4.27 (95% CI: 4.15, 4.38, p < 0.001, Figs. 2 and 3). All three models consistently demonstrated a positive effect on perceived understanding, with mean ratings of 4.44 (95% CI: 4.28, 4.60) for Athene, 4.13 (95% CI: 3.91, 4.34) for Llama, and 4.24 (95% CI: 4.03, 4.45) for Mistral (all p < 0.001 vs. original). The benefit was observed for both abnormal reports (2.58 to 4.20, p < 0.001) and normal reports (3.47 to 4.44, p < 0.001).
Fig. 2.
Non-physician ratings of reports, either original or with lay summaries generated by LLMs (overall), Athene, Mistral and Llama.
Fig. 3.
Perceived comprehension rating of individual reports in their original format and after generation of a lay summary by LLMs.
The effect was also positive on the perceived ability to discuss results with relatives, with mean ratings increasing from 2.92 (95% CI: 2.72, 3.12) for original reports to 4.00 (95% CI: 3.86, 4.14) for reports with a lay summary (p < 0.001).
The overall effect of lay summaries on anxiety was limited but significant, with mean ratings decreasing from 2.55 (95% CI: 2.34, 2.75) for original reports to 2.34 (95% CI: 2.15, 2.53) for reports with a LLM-generated lay summary (p = 0.007).
Effect on normal vs. abnormal report classification
Lay summaries significantly improved the ability of non-physician participants to classify reports as normal or abnormal (Fig. 4). The correct identification rate increased from 75.2% (95% CI: 71.4%, 79.0%) for original reports to 83.6% (95% CI: 80.3%, 86.9%, p < 0.001). The observed baseline identification rate of 75.2% aligned closely with the 75% baseline assumed in our a priori power calculation. For individual models, the correct identification rates were 80.0% (95% CI: 73.9%, 86.1%) for Athene, 87.3% (95% CI: 82.2%, 92.4%) for Llama, and 83.6% (95% CI: 78.0%, 89.3%) for Mistral. This improvement was only observed for abnormal reports, with correct identification rising from 64.0% (95% CI: 58.9%, 69.1%) to 76.9% (95% CI: 72.5%, 81.3%, p < 0.001). No effect was shown on normal reports, with both original and lay-summarized versions achieving 100% correct identification (95% CI: 100.0%, 100.0%).
Fig. 4.
Non-physicians objective understanding of reports, either original or with lay summaries generated by Athene, Mistral and Llama. (A) For detecting abnormal findings in the reports and (B) for identifying findings causing headaches.
Effect on causal vs. incidental findings classification
Lay summaries significantly improved objective understanding of the relationship between the findings from the report and the symptoms (Fig. 4). The overall correct identification rate increased from 80.6% (95% CI: 77.1%, 84.1%) for original reports to 84.8% (95% CI: 81.6%, 88.0%, p < 0.001) with lay summaries. For individual models, the overall correct identification rates were 83.6% (95% CI: 78.0%, 89.3%) for Athene, 83.6% (95% CI: 78.0%, 89.3%) for Llama, and 87.3% (95% CI: 82.2%, 92.4%) for Mistral. This improvement was only observed for abnormal reports, with correct identification rising from 71.9% (95% CI: 67.1%, 76.7%) to 78.6% (95% CI: 74.3%, 82.9%, p < 0.001). Conversely, for reports without anomalies, both original and lay-summarized versions achieved 100% correct identification (95% CI: 100.0%, 100.0%).
Discussion
Our results show that privacy-preserving models have potential for creating lay summaries of real radiology reports to enhance non-physicians’ understanding of key results while maintaining reports integrity. We showed that all tested models displayed a similar ability to create secure patient summaries, even though all of them also displayed limitations, including unnatural phrasings and difficulties for disambiguating abbreviations. We showed a limited effect of lay summaries on the ability of non-physician participants to answer questions about MRI reports, contrasting with a large benefit of LLM-generated lay summaries on their perceived understanding.
Our results are in line with previous literature that found that LLMs can be used to simplify radiology reports. In a pioneer study using ChatGPT, Bard, and Bing, Amin et al. showed that commercial LLMs improved the readability of reports from the publicly available MIMIC dataset while maintaining relevant information15. However, this work assessed readability only through automated readability scores and did not evaluate comprehension by human readers. Similarly, Can et al. recently conducted a study on fictitious reports, testing both commercial and privacy-preserving models, and found that all models improved readability metrics based on textual difficulty measures rather than reader assessment14. To better estimate the impact of LLM-generated simplification on actual understanding, Tripathi et al. recruited four non-physician raters, showing a marked improvement in perceived understandability of reports after alteration by GPT-416, with 97% of the LLM-generated summaries rated as fully understandable, with no measure of objective understanding.
Our data highlight that the benefits of lay summaries are uneven across report types and comprehension measures. For normal MRI reports, lay summaries added little measurable value in terms of objective understanding, as non-physician readers already achieved near-perfect accuracy. In contrast, abnormal reports showed the largest objective gains. Interestingly, perceived comprehension improved markedly for both normal and abnormal reports, suggesting that participants felt more confident in their understanding even when their objective performance did not improve. This discrepancy may partly reflect the high education level of our rater sample, which could have led to a ceiling effect in objective comprehension. Nevertheless, the strong effect on perceived understanding remains clinically meaningful, as prior work has shown that patients’ perceived clarity of reports can influence the rate of contacts to healthcare professionals in case of normal results10,21,22.
Across 315 summaries, the composite error rate was 19.7%, mainly due to medical (9.2%) and linguistic (8.3%) shortcomings. Medical errors were typically represented by misexpanded abbreviations. In previous studies, LLMs were already shown to specifically struggle with medical acronyms in non-English languages23. This phenomenon may be a direct reflection of linguistic imbalance in their training dataset, with predominance of English data24. To minimize errors, use of language- and domain-specific acronym dictionaries could be needed. In addition, linguistic flaws, particularly Anglicisms and foreign-language insertions, such as Chinese for Athene V2 (fine-tuned from the Qwen architecture, developed in China), were noted. This suggests that despite deterministic settings (T = 0), the models’ internal representations remain biased toward their primary training and reinforcement learning through human feedback (RLHF) corpora, particularly when navigating dense technical/medical domains in a low-resource target language25. These limitations suggest a human-in-the-loop model for clinical implementation, especially in non-English speaking countries, where a LLM generates a baseline summary, which is then rapidly edited by a radiologist to ensure clinical safety.
This study has limitations. First, the study was conducted in a single center, on French reports. As such, future studies to test the applicability to other settings and languages are needed. Second, our non-physician sample consisted of highly-educated participants. This may induce a selection bias overestimating comprehension of the reports, skewing the results towards higher understanding of both original and altered reports. Future studies will be needed to assess the impact of such methods on diverse readers and patients. Finally, understanding a radiology report cannot be simply measured by assessing the correct classification into normal/abnormal reports and causal/incidental findings10. Future studies assessing transfer knowledge (testing the ability of laypersons to recite and contextualize radiology findings) and actionability of original and LLM-enhanced radiology reports. Specifically, our study did not assess report actionability. As recently emphasized10, future work will be needed to assess the impact of reports alteration on the action and decision behaviors, including the effect on medical seeking behaviors.
To summarise, we showed that privacy-preserving LLM-generated lay summaries improved perceived understanding and objective comprehension in this cohort of highly educated non-physicians. Medical errors and linguistic issues were observed, warranting the necessity of a ‘human-in-the-loop’ framework to ensure clinical safety before these summaries are deployed in patient portals.
Author contributions
BLG wrote the main manuscript text and prepared figures and tables. RB and AH provided statistical support and reviewed the manuscript. LS and RL provided support for the non-physician rater study and reviewed the manuscript. YG and QVM provided support for the physician rater study and reviewed the manuscript. GK provided support for study conceptualization and reviewed the manuscript.
Funding
This study was supported by the Health Data Hub through the 2024 DatAE research program (project reference: CEPHALALG.IA). The DatAE program is financed by the French Ministry of Health, through the Directorate General for Healthcare Provision (Direction Générale de l’Offre de Soins, DGOS).
Data availability
The datasets generated and/or analysed during the current study consist of sensitive patient health information (brain MRI reports). In accordance with the Institutional Review Board (IRB) approval for this study, which prohibits data sharing to protect patient confidentiality, the data are not publicly available.
Declarations
Competing interests
The authors declare no competing interests.
IRB statement
Use of the data for this specific study was approved by the Lille University Hospital institutional review board in June 2023 (EDS2307251350).
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Lee, C. I., Langlotz, C. P. & Elmore, J. G. Implications of direct patient online access to radiology reports through patient web portals. J. Am. Coll. Radiol.13, 1608–1614 (2016). [DOI] [PubMed] [Google Scholar]
- 2.Pillemer, F. et al. Direct release of test results to patients increases patient engagement and utilization of care. PLOS ONE. 11, e0154743 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Rosenkrantz, A. B. Differences in perceptions among Radiologists, referring Physicians, and patients regarding Language for incidental findings reporting. AJR Am. J. Roentgenol.208, 140–143 (2017). [DOI] [PubMed] [Google Scholar]
- 4.Palen, T. E., Ross, C., Powers, J. D. & Xu, S. Association of online patient access to clinicians and medical records with use of clinical services. JAMA308, 2012–2019 (2012). [DOI] [PubMed] [Google Scholar]
- 5.Marty, H., Bogenstätter, Y., Franc, G., Tschan, F. & Zimmermann, H. How well informed are patients when leaving the emergency department? Comparing information provided and information retained. Emerg. Med. J.30, 53–57 (2013). [DOI] [PubMed] [Google Scholar]
- 6.Sheikh, H., Brezar, A., Dzwonek, A., Yau, L. & Calder, L. A. Patient Understanding of discharge instructions in the emergency department: do different patients need different approaches? Int. J. Emerg. Med.11, 5 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.ElHabr, A. et al. Increasing utilization of emergency department neuroimaging from 2007 through 2017. AJR Am. J. Roentgenol.218, 165–173 (2022). [DOI] [PubMed] [Google Scholar]
- 8.Trofimova, A. V., Duszak, R. Jr., Kadom, N. & Sadigh, G. Increasing and disparate use of neuroimaging for adults and children with non-traumatic headaches in the US emergency departments: opportunities for improvement. Headache: J. Head Face Pain. 61, 179–189 (2021). [DOI] [PubMed] [Google Scholar]
- 9.Evans, R. W. Incidental findings and normal anatomical variants on MRI of the brain in adults for primary headaches. Headache: J. Head Face Pain. 57, 780–791 (2017). [DOI] [PubMed] [Google Scholar]
- 10.van der Mee, F. A. M. et al. The impact of different radiology report formats on patient information processing: a systematic review. Eur. Radiol.35, 2644–2657 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Alexander, R. et al. Mandating limits on Workload, Duty, and speed in radiology. Radiology304, 274–282 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Le Guellec, B. et al. Comparison between multimodal foundation models and radiologists for the diagnosis of challenging neuroradiology cases with text and images. Diagn. Interv. Imaging. 10.1016/j.diii.2025.04.006 (2025). [DOI] [PubMed] [Google Scholar]
- 13.Lecler, A., Duron, L. & Soyer, P. Revolutionizing radiology with GPT-based models: current applications, future possibilities and limitations of ChatGPT. Diagn. Interv. Imaging. 10.1016/j.diii.2023.02.003 (2023). [DOI] [PubMed] [Google Scholar]
- 14.Can, E. et al. Large Language models for simplified interventional radiology reports: A comparative analysis. Acad. Radiol.32, 888–898 (2025). [DOI] [PubMed] [Google Scholar]
- 15.Amin, K. S. et al. Accuracy of ChatGPT, Google Bard, and Microsoft Bing for simplifying radiology reports. Radiology309, e232561 (2023). [DOI] [PubMed] [Google Scholar]
- 16.Tripathi, S. et al. PRECISE framework: enhanced radiology reporting with GPT for improved readability, reliability, and patient-centered care. Eur. J. Radiol.187, 112124 (2025). [DOI] [PubMed] [Google Scholar]
- 17.Le Guellec, B. et al. Performance of an Open-Source large Language model in extracting information from Free-Text radiology reports. Radiology: Artif. Intell.6, e230364 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Perlis, N. et al. Exploring the value of using patient-oriented MRI reports in clinical practice - a pilot study. Support Care Cancer. 30, 6857–6876 (2022). [DOI] [PubMed] [Google Scholar]
- 19.Overview Leaderboard | LMArena. https://lmarena.ai/leaderboard.
- 20.O’Shaughnessy, E., Detrinidad, E., Soyer, P. & Lecler, A. An introductory guide to statistics for the radiologist. Diagn. Interv. Imaging. 106, 49–52 (2025). [DOI] [PubMed] [Google Scholar]
- 21.Dabrowiecki, A., Sadigh, G. & Duszak, R. Chest radiograph reporting: public preferences and perceptions. J. Am. Coll. Radiol.17, 1259–1268 (2020). [DOI] [PubMed] [Google Scholar]
- 22.Kadom, N. et al. Info-RADS: adding a message for patients in radiology reports. J. Am. Coll. Radiol.18, 128–132 (2021). [DOI] [PubMed] [Google Scholar]
- 23.Kugic, A., Schulz, S. & Kreuzthaler, M. Disambiguation of acronyms in clinical narratives with large Language models. J. Am. Med. Inf. Assoc.31, 2040–2046 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Johnson, R. L. et al. The Ghost in the Machine has an American accent: value conflict in GPT-3. Preprint at (2022). 10.48550/arXiv.2203.07785.
- 25.Wendler, C., Veselovsky, V., Monea, G. & West, R. Do Llamas Work in English? On the Latent Language of Multilingual Transformers. Preprint at (2024). 10.48550/arXiv.2402.10588.
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The datasets generated and/or analysed during the current study consist of sensitive patient health information (brain MRI reports). In accordance with the Institutional Review Board (IRB) approval for this study, which prohibits data sharing to protect patient confidentiality, the data are not publicly available.




