See also the article by Le Guellec et al in this issue.

Tugba Akinci D’Antonoli, MD, is currently a radiology resident at Cantonal Hospital Baselland and a researcher at the University of Basel, Switzerland. Her research interests include deep learning and radiomics applications in cardiothoracic radiology and neuroradiology. She is a member of the 2023–2025 trainee editorial board of Radiology: Artificial Intelligence, a member of the Young Club Committee of the European Society of Medical Imaging Informatics, and a scientific editorial board member of European Radiology.

Christian Bluethgen, MD, MSc, is currently an attending radiologist and clinician scientist at the Institute for Diagnostic and Interventional Radiology at the University Hospital Zurich and the University of Zurich, Switzerland, and was previously a visiting postdoctoral researcher at the Stanford Center for Artificial Intelligence in Medicine and Imaging (AIMI), USA. His research focuses on thoracic imaging and the design and application of multimodal deep learning models for radiological applications.
It has been 128 years since one of the earliest known radiology reports was written by William James Morton in New York (1). Since then, radiologists have been producing radiology reports day and night. Pioneers recognized the potential of this growing body of radiology reports: They acknowledged the benefits of structured reporting to better organize radiology reports and make their content easier to retrieve and ultimately use in research, teaching, and quality improvement (2). Over the years, ontologies and nomenclatures for clinical radiology have been introduced to standardize the reporting process, and various methods such as natural language processing (NLP) have been developed to leverage this growing library of radiology reports (2). However, structured reporting never gained widespread popularity among radiologists; reports often remained freestyle and the information they contained was difficult to mine (3).
With the advent of large language models (LLMs), a class of powerful models came into focus that can process text sequences with high performance and an unprecedented degree of flexibility. Earlier deep learning models, such as bidirectional encoder representations from transformers (or BERT), had already incorporated contextual information and outperformed traditional machine learning models used for NLP, such as support vector machines or recurrent neural networks, but still required domain adaptation before they could be used on specific downstream tasks (4). In contrast, LLMs can be instructed to perform specific tasks without any prior examples or with only a few examples (5).
Publicly available LLMs such as ChatGPT (OpenAI) have revolutionized how we handle text-based data. ChatGPT’s straightforward, easy-to-use interface, combined with its flexibility, allows for novel text analysis applications without the steep learning curve associated with earlier technologies. These factors have spurred numerous early studies assessing the capabilities of LLMs for medical text analysis (6). However, the use of hosted, closed-source LLMs such as ChatGPT also presents several operational challenges. Limitations in version control, lack of transparency in model workings, and concerns about data accessibility and protection—especially with models hosted by third parties—can impede reproducibility, reliability, and traceability in research and clinical settings. These issues necessitate a careful approach when selecting and deploying these models, particularly in sensitive fields such as medical text analysis.
Open-source LLMs offer various opportunities and represent a much-needed alternative. They can be run locally, in which case no information about the data, prompts, or queries that the model receives or processes is shared with third parties; in this sense, they are privacy-preserving models (7). Moreover, because developers and researchers can access the underlying components of these LLMs, they are empowered to modify, improve, and tailor the models and supporting frameworks to specific tasks. In this way, control over the deployment of LLMs for medical purposes can remain in the hands of health care professionals and organizations, ensuring adherence to established guardrails for patient safety and privacy protection.
In this issue of Radiology: Artificial Intelligence, Le Guellec et al (8) investigated the feasibility of automated information extraction from free-text radiology reports. They used the Vicuna 13-B model (version 1.3; LMSYS Org), an open-source chatbot based on Meta’s LLaMA model and fine-tuned on user-shared conversations collected from ShareGPT (9). The authors interacted with the model through FastChat, an open platform for training, deployment, and evaluation of LLMs, and also developed a Python script to automate the interactions, which they made publicly available (https://github.com/BastienLeGuellec/RadioVicuna).
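To illustrate what automating such interactions can look like, the following is a minimal sketch and not the authors' script. It assumes the model is reachable through some function that maps a prompt to a text answer; `label_reports`, `parse_yes_no`, and `stub_generate` are hypothetical names, and the stub is a keyword rule that merely stands in for a locally hosted model so the example can run on its own.

```python
from typing import Callable, Iterable, List, Optional


def parse_yes_no(answer: str) -> Optional[bool]:
    """Map a free-text model answer to True/False, or None if it is unclear."""
    words = answer.strip().lower().split()
    first = words[0].strip(".,:;!'\"") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


def label_reports(
    reports: Iterable[str],
    generate: Callable[[str], str],
    question: str,
) -> List[Optional[bool]]:
    """Ask the same question about every report and collect the parsed labels."""
    labels = []
    for report in reports:
        prompt = f"{question}\n\nReport: {report}\nAnswer:"
        labels.append(parse_yes_no(generate(prompt)))
    return labels


def stub_generate(prompt: str) -> str:
    """Hypothetical stand-in for a local model call: a keyword rule, NOT an LLM."""
    report_part = prompt.split("Report:", 1)[1]
    return "Yes." if "contrast" in report_part.lower() else "No."


QUESTION = "Was contrast medium injected? Answer yes or no."

# Invented report snippets, used only to exercise the loop.
REPORTS = [
    "Brain MRI with gadolinium contrast injection.",
    "Brain MRI, no injection.",
]

labels = label_reports(REPORTS, stub_generate, QUESTION)
print(labels)
```

In practice, `stub_generate` would be replaced by a call to the locally served model. Returning `None` for answers that are neither yes nor no keeps ambiguous outputs visible for manual review instead of silently forcing them into a class.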
In the study by Le Guellec et al (8), the performance of Vicuna was tested on 2398 free-text emergency brain MRI reports from their institution, which are not publicly available. The reports were written by 22 different trainees and 21 different board-certified radiologists. The reporting language was French and was not translated into English for this study.
The authors designed four different tasks: (1) to identify the presence or absence of headache as a symptom based on the clinical indication, (2) to label the presence or absence of contrast medium injection based on the protocol, (3) to classify the report as either normal or abnormal based on the conclusion, and (4) to draw a causal inference between the findings and the symptom of headache. In addition, they designed four different prompts in English for each task and used few-shot in-context learning, in which they provided the model with an increasing number of fake contextual examples until the diagnostic performance was optimized. These contextual examples were selected to cover both positive and negative cases as well as the different formulations used by radiologists. They also repeated the same tasks with prompts translated into French and compared the results with those in English.
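The few-shot setup described above can be sketched as follows. This is an illustration under stated assumptions, not the prompts used in the study: the instruction wording, the two invented examples, and the names `Example` and `build_few_shot_prompt` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Example:
    """One invented (report excerpt, label) pair used as in-context example."""
    text: str
    label: str


def build_few_shot_prompt(instruction: str, examples: List[Example], report: str) -> str:
    """Concatenate the task instruction, the labeled examples, and the unlabeled report."""
    parts = [instruction.strip(), ""]
    for ex in examples:
        parts.append(f"Report: {ex.text}")
        parts.append(f"Answer: {ex.label}")
        parts.append("")
    # The report to classify comes last, with the answer left open for the model.
    parts.append(f"Report: {report}")
    parts.append("Answer:")
    return "\n".join(parts)


# Invented examples covering one positive and one negative case.
EXAMPLES = [
    Example("Indication: sudden severe headache since this morning.", "yes"),
    Example("Indication: left arm weakness, no pain reported.", "no"),
]

INSTRUCTION = (
    "Read the clinical indication and answer 'yes' if headache is "
    "mentioned as a symptom, otherwise answer 'no'."
)

prompt = build_few_shot_prompt(INSTRUCTION, EXAMPLES, "Indication: vertigo and nausea.")
print(prompt)
```

Adding or removing entries in `EXAMPLES` changes the number of shots without touching the rest of the pipeline, which is what makes it practical to increase the example count step by step while monitoring performance.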
For each task, Vicuna demonstrated high performance with prompting in English and in French. In task 1, Vicuna correctly identified headache as a symptom in 583 of 595 reports with sensitivity of 98.0% and specificity of 99.3%. In task 2, it accurately identified the presence of contrast medium injection in 514 of 517 reports, achieving sensitivity of 99.4% and specificity of 98.6%. In task 3, Vicuna correctly identified 219 of 227 reports that are classified as abnormal, with sensitivity of 96.0% and specificity of 98.9%. In task 4, it successfully identified a causal relationship between findings and headache as a manifesting symptom in 120 of 136 reports, with sensitivity of 88.2% and specificity of 72.5%. Vicuna’s performance remained consistent when these experiments were repeated with the prompts in French.
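As a reminder of how such figures are derived, sensitivity is the share of truly positive reports that the model flags, and specificity is the share of truly negative reports that it clears. The sketch below recomputes sensitivity for three of the tasks from the counts quoted above. The function names are illustrative, and because the true-negative and false-positive counts are not given in this commentary, specificity is defined but not recomputed.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of positive reports that were correctly identified."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of negative reports that were correctly identified.

    Not applied below: the counts it needs are not stated in the text.
    """
    return true_negatives / (true_negatives + false_positives)


# (correctly identified, total positive reports), as quoted in the text.
reported = {
    "task 1 (headache)": (583, 595),
    "task 2 (contrast injection)": (514, 517),
    "task 4 (causal inference)": (120, 136),
}

results = {}
for name, (identified, positives) in reported.items():
    missed = positives - identified  # false negatives
    results[name] = round(100 * sensitivity(identified, missed), 1)
    print(f"{name}: sensitivity {results[name]}%")
```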
Le Guellec et al (8) also reported an error analysis and provided examples along with self-explanations from Vicuna. For example, Vicuna incorrectly identified aneurysms as the cause of headache in four reports and did not identify any of the cytotoxic corpus callosum lesions as the cause of headache in two reports. The study had several limitations. First, it was a single-center study, and the authors did not test for variability in writing styles and languages across centers. Second, the available clinical information was limited to the text that could be extracted from the report itself. Third, the reference standard for task 4, namely the causal inference between the findings and headache, was subjective and based on the experience of radiologists. Finally, Vicuna was based on LLaMA 2 and could be outperformed by models based on the recently released LLaMA 3.
The study by Le Guellec et al (8) is an elegant example of using an LLM as a text processing engine rather than a knowledge database (10). The method they propose explores the capability to quickly analyze large amounts of textual data, which proves extremely beneficial for tasks that would require hours to complete manually. By facilitating the extraction of relevant information without spending valuable time or extensive programming, this approach opens new opportunities for health care professionals and researchers to use radiology reports in innovative ways.
More than a century after the first radiology report was written, we are now making significant progress in developing tools to leverage large collections of radiology reports, to extract relevant information, and to identify potential connections between imaging findings and symptoms. Open-source LLMs currently appear to be the best way to deploy such systems in health care while safeguarding patient privacy and data security, even though, like the rest of the LLM family, they remain prone to confabulation.
Footnotes
Disclosures of conflicts of interest: T.A.D. Support for attending meetings and/or travel from Cantonal Hospital Baselland and European Society of Medical Imaging Informatics; trainee editorial board member of Radiology: Artificial Intelligence. C.B. Research support from Promedica Foundation, Chur, CH.
References
- 1. Langlotz CP. The Radiology Report: A Guide to Thoughtful Communication for Radiologists and Other Medical Professionals. 1st ed. CreateSpace Independent Publishing Platform; 2015.
- 2. Kahn CE Jr, Langlotz CP, Burnside ES, et al. Toward best practices in radiology reporting. Radiology 2009;252(3):852–856.
- 3. Pinto Dos Santos D, Cuocolo R, Huisman M. O structured reporting, where art thou? Eur Radiol 2023. 10.1007/s00330-023-10465-x. Published online November 27, 2023.
- 4. Tejani AS, Ng YS, Xi Y, Fielding JR, Browning TG, Rayan JC. Performance of multiple pretrained BERT models to automate and accelerate data annotation for large datasets. Radiol Artif Intell 2022;4(4):e220007.
- 5. Akinci D’Antonoli T, Stanzione A, Bluethgen C, et al. Large language models in radiology: fundamentals, applications, ethical considerations, risks, and future directions. Diagn Interv Radiol 2024;30(2):80–90.
- 6. Keshavarz P, Bagherieh S, Nabipoorashrafi SA, et al. ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024. 10.1016/j.diii.2024.04.003. Published online April 27, 2024.
- 7. Raeini M. Privacy-preserving large language models (PPLLMs). SSRN 2023. https://ssrn.com/abstract=4512071. Posted July 24, 2023. Accessed April 28, 2024.
- 8. Le Guellec B, Lefèvre A, Geay C, et al. Performance of an open-source large language model in extracting information from free-text radiology reports. Radiol Artif Intell 2024;6(4):e230364.
- 9. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. The Vicuna Team. https://lmsys.org/blog/2023-03-30-vicuna/. Published March 30, 2023. Accessed April 28, 2024.
- 10. Truhn D, Reis-Filho JS, Kather JN. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat Med 2023;29(12):2983–2984.
