See also the article by Bhayana et al in this issue.

Dr Reza Forghani is a neuro- and head and neck radiologist and artificial intelligence (AI) expert. He is professor of radiology and AI and vice chair of AI at the Department of Radiology, University of Florida College of Medicine, and founder and director of the Radiomics & Augmented Intelligence Laboratory (RAIL). His area of scientific interest is translational multimodal AI for clinically impactful AI development and translation into clinical practice.
We have witnessed tremendous advances in artificial intelligence (AI) applications over the past decade, with potential for transformation of the health care of the future. Although some of the potential has been exaggerated and many claims made prematurely, there is no question that the enormous leap forward in technology seen with OpenAI's public release of ChatGPT, the first widely accessible large language model (LLM) chatbot, far exceeded expectations. For many, the performance of LLMs and other emerging foundation models would not have been considered possible just a few years earlier. LLMs are a type of foundation model trained on vast amounts of data that can understand and generate human language (1). They can be fine-tuned for various specific applications involving language. Foundation models are a broader category of models that include LLMs and other models trained on diverse data types—potentially text, image, waveforms, audio, and so on—and can serve as the basis for a wide range of complex downstream tasks. This includes multimodal models that can understand and combine different types of data, such as images and text (1).
The study by Bhayana et al (2) in this issue of Radiology evaluates the performance of an LLM optimized with retrieval-augmented generation (RAG) on the task of answering 150 radiology board–style multiple-choice questions. The approach used, Perplexity (Perplexity AI) with GPT-4 Turbo, was compared with the performance of ChatGPT-4 with GPT-4 Turbo. The results were also compared with those reported for the same question set using ChatGPT based on GPT-3.5 and GPT-4 in earlier investigations (3,4). Previously, GPT-4 had been shown to answer 81% of questions correctly, exceeding the passing threshold of 70% and outperforming GPT-3.5, which answered 69% of questions correctly (4). In the study by Bhayana et al (2), Perplexity with GPT-4 Turbo substantially outperformed ChatGPT-4 with GPT-4 Turbo, answering 90% of questions correctly. The latter answered 79% of questions correctly, with no notable improvement over that previously reported for ChatGPT-4. The study highlights essential attributes, as well as challenges, of this transformative technology; these have implications far beyond the specific use case evaluated, and radiology practitioners from all backgrounds and practice settings should be aware of them.
First, the study once again highlights the potential of this technology: a 90% score on a radiology board–style multiple-choice examination is an impressive performance and demonstration of machine intelligence. The potential applications of LLMs in health care and radiology are manifold and include automated report generation (or next-generation reporting), database queries and summarization, education, workflow automation, and patient communication, to name some common examples. The applications expand even further with emerging multimodal models that can analyze and/or generate different data combinations, such as image and text, as discussed later. However, while very powerful, this technology has limitations that users must understand to ensure proper use and avoid misleading or outright incorrect information.
One important pitfall of LLMs is the potential for hallucinations. LLM “hallucination” refers to the phenomenon where the model generates information or responses that are not based on real-world data or facts. These responses can sound plausible and convincing but are incorrect or entirely fabricated. This occurs because standard LLMs predict the next word in a sequence based on patterns learned during training, without using an established ground truth to verify the factual accuracy of the content generated. This has substantial implications and, unless effectively addressed, represents a major barrier to the safe and effective use of LLMs in radiology and health care.
One approach for improving the fidelity and reliability of LLMs is RAG (5). RAG is a technique that improves LLM reliability by combining language models with real-time information retrieval. Instead of relying solely on pretrained knowledge, RAG allows the model to retrieve relevant data from external authoritative sources during the generation process. This grounds the response in up-to-date information and reduces the likelihood of hallucinations. The article by Bhayana et al (2) provides a clear example of substantial improvement in the performance of an LLM (ChatGPT) achieved by incorporating RAG. More broadly, the article highlights both a major challenge for the use of LLMs, namely hallucinations, and an example of an approach for optimizing LLM performance. In any field, but particularly in health care and radiology, having the appropriate guardrails to ensure that this powerful technology provides accurate information is essential for it to achieve its full potential. Whether the reader's interest is an educational application or any of the myriad other potential radiology applications of LLMs, the points highlighted are essential knowledge for both those developing LLMs and those using them.
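To make the retrieve-then-generate mechanism concrete, the loop at the heart of RAG can be sketched in a few lines of Python. Everything here is illustrative: the word-overlap retriever is a toy stand-in for a real embedding-based search over authoritative sources, and `llm_generate` is a hypothetical placeholder for an actual LLM API call, not any specific product.

```python
# Minimal RAG sketch: retrieve relevant passages, then prepend them to the
# prompt so the model answers from supplied context rather than memory alone.

def score(query: str, doc: str) -> int:
    """Toy relevance score: count query words that appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the greatest word overlap with the query."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def llm_generate(prompt: str) -> str:
    """Hypothetical stub standing in for a real LLM call; here it simply
    echoes the first retrieved context line."""
    return prompt.splitlines()[1]

def rag_answer(query: str, corpus: list[str]) -> str:
    """Augment the prompt with retrieved passages before generation."""
    context = "\n".join(retrieve(query, corpus))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)
```

In a production system the retriever would search a curated, current knowledge base (for example, peer-reviewed radiology references), which is precisely what constrains the model's output and reduces hallucinations.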
These are still relatively early days in the development and use of foundation models such as LLMs. Rapid development continues in this area, and the potential is vast. One of the most exciting developments is foundation models trained on multiple data types. In medicine, such multimodal AI models have the potential to analyze and combine different types of medical data, including medical images and text such as that in a patient's electronic health record (1). These models provide exciting and unparalleled opportunities for health care application development, from image-to-text conversion to broader tasks. Multimodal models have the potential to eventually revolutionize medicine by enabling a comprehensive understanding of the patient based on the full spectrum of text, laboratory results, medical imaging, and other diagnostic information available in the electronic health record (1). They could also provide horizontal platforms that facilitate and accelerate medical device development in an unprecedented manner.
An example of early work in multimodal foundation models using text and medical images is image retrieval (6). One recent study used multimodal embeddings of two-dimensional sections and three-dimensional volumes to explore search and retrieval of diagnostically similar images (6). Such early studies build on and expand the robust capabilities of LLMs into multimodal models with broad potential applications in diagnostic clinical decision support, medical education, and other medical tasks. Imagine a future in which a radiologist encounters a challenging case and, with the click of a button, can retrieve a similar image of a rare disease, assisting the expert in making the challenging diagnosis. Expanding further on these capabilities, imagine encountering a brain or liver with multiple metastases and having the machine count the lesions and measure each one, or the largest. These are not merely aspirational statements; this is already becoming possible from a purely technological standpoint. However, although beyond the scope of this editorial, for these technologies to be effectively implemented in the clinical practice of the future, we will likely need to move away from legacy information technology approaches and address other infrastructure and regulatory barriers to AI implementation (7,8).
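For readers curious about the mechanics, embedding-based image retrieval of the kind explored in such studies reduces, at its core, to nearest-neighbor search over vectors: each indexed image is stored as an embedding, and a query image is matched to the case whose embedding is most similar. The sketch below is purely illustrative; the three-dimensional vectors and case identifiers are made up, whereas a real system would use high-dimensional embeddings produced by a trained multimodal model and an approximate-nearest-neighbor index.

```python
# Toy sketch of similar-image retrieval via cosine similarity over embeddings.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(query_emb: list[float], library: dict[str, list[float]]) -> str:
    """Return the case ID whose stored embedding is closest to the query."""
    return max(library, key=lambda case_id: cosine(query_emb, library[case_id]))

# Hypothetical precomputed embeddings for two indexed cases (placeholders).
library = {
    "case_rare_glioma": [0.9, 0.1, 0.0],
    "case_meningioma": [0.1, 0.8, 0.2],
}
```

A query embedding close to the first vector would retrieve "case_rare_glioma"; scaling this brute-force search to millions of cases is what dedicated vector indexes are designed for.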
In conclusion, LLMs and emerging multimodal foundation models are likely to have a transformative impact on innovation and practice in radiology. Whether in an academic setting or private practice, radiologists will be well served to be familiar with the strengths and limitations of these models so they can leverage them to enhance their clinical practice and patient care in a safe and effective manner.
Footnotes
Disclosures of conflicts of interest: R.F. Institutional research grants from Nuance Communications–Microsoft, Canon Medical Systems, GE HealthCare, National Science Foundation, and National Institutes of Health (STTR grant subaward); book royalties from Thieme Medical Publishers; lecture payment and/or travel support from Canon Medical Systems; payment for radiologist expert testimony from McCarthy Tétrault LLP (Toronto, Ontario); patents planned, issued, or pending for a method and system for performing medical treatment outcome assessment or medical condition diagnostic (US11257210B2); advisory board for Neuropacs.
References
- 1. Rajpurkar P, Lungren MP. The Current and Future State of AI Interpretation of Medical Images. N Engl J Med 2023;388(21):1981–1990.
- 2. Bhayana R, Fawzi A, Deng Y, Bleakney RR, Krishna S. Retrieval-Augmented Generation for Large Language Models in Radiology: Another Leap Forward in Board Examination Performance. Radiology 2024;313(1):e241489.
- 3. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations. Radiology 2023;307(5):e230582.
- 4. Bhayana R, Bleakney RR, Krishna S. GPT-4 in Radiology: Improvements in Advanced Reasoning. Radiology 2023;307(5):e230987.
- 5. Bhayana R. Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications. Radiology 2024;310(1):e232756.
- 6. Abacha AB, Santamaria-Pang A, Lee HH, et al. 3D-MIR: A Benchmark and Empirical Study on 3D Medical Image Retrieval in Radiology. arXiv 2311.13752 [preprint]. https://arxiv.org/abs/2311.13752. Posted November 23, 2023. Accessed August 20, 2024.
- 7. Bryan RN, Forghani R. Artificial Intelligence in Radiology: Not If, But How and When. Radiology 2024;311(3):e241222.
- 8. Forghani R. Artificial Intelligence Transformation of Healthcare AI of the Future: Revisiting Approach to Healthcare IT and Use of Third-party Platforms. American Hospital & Healthcare Management. https://issuu.com/verticaltalk/docs/americanhhm-issue-04. Published 2024. Accessed August 19, 2024.
