Introduction
In the past 2 years, humankind has experienced unparalleled development in artificial intelligence (AI), particularly with the advent of advanced large language models (LLMs). These models have started revolutionizing various fields in medicine, including radiology (1).
However, their rapid evolution has brought both opportunities and challenges. This editorial, a follow-up to the article by Shen et al (2), provides continuing insight into the rapidly evolving field of LLMs, focusing on clinical and academic use cases. We will explore the pearls and pitfalls of LLMs and offer a future outlook while touching on clinical decision support (CDS), society guidelines and best practices, accuracy monitoring, academic administrative support, open-source and commercial LLMs, and agentic workflows.
Augmenting Decision Support
CDS systems can improve the appropriateness of clinicians’ imaging orders by applying best practice guidelines and can help limit unnecessary imaging. Since this topic was first reviewed (2), interest in using LLMs to augment CDS has grown. LLMs have shown moderate accuracy in providing imaging recommendations for breast cancer screening and breast pain evaluations (3). LLM response accuracy improves when relevant guidelines, such as the American College of Radiology Appropriateness Criteria, are included in the prompt (4,5); a minimal sketch of this approach follows. LLM use for optimizing CDS has yet to be studied for most imaging modalities, indications, and guidelines.
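As a concrete illustration, the following minimal Python sketch grounds an appropriateness query in guideline text by placing an excerpt directly in the prompt. The model name, client configuration, and guideline placeholder are assumptions for illustration; this is a sketch, not a validated clinical tool.

```python
# Minimal sketch of guideline-grounded prompting for imaging CDS.
# Assumes the openai Python package (v1+) with an API key in the environment;
# the model name and guideline excerpt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

GUIDELINE_EXCERPT = "<relevant ACR Appropriateness Criteria text pasted here>"

def recommend_imaging(clinical_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; any capable chat model could be used
        messages=[
            {"role": "system",
             "content": ("You are a clinical decision support aid. Base every "
                         "recommendation strictly on the guideline excerpt provided.")},
            {"role": "user",
             "content": (f"Guideline excerpt:\n{GUIDELINE_EXCERPT}\n\n"
                         f"Clinical question: {clinical_question}")},
        ],
    )
    return response.choices[0].message.content

print(recommend_imaging(
    "45-year-old woman with new focal breast pain; which imaging is appropriate?"))
```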
LLMs have the potential to be more user-friendly to build and prompt than conventional CDS systems. For ordering providers, the conversational nature of LLMs aids learnability and guideline adoption. LLMs could provide CDS in locations where protocoling or formal CDS systems are not implemented.
However, LLM use in medical imaging CDS is still in its infancy. Few studies have investigated the use of electronic health record–based patient information directly by an LLM to guide CDS. Point-of-care implementation of LLMs for imaging CDS has been neither standardized nor systematically tested. Bias and hallucinations in outputs remain a major concern. The “black box” nature of LLMs creates gaps in understanding their decision-making processes. Recently, AI systems have been used to explain gaps in CDS-based alerts (6) and organize CDS user feedback (7), but keeping LLM-based imaging CDS systems current with new guidelines and shifting knowledge bases is challenging. Additionally, the cost of LLM-based CDS may be prohibitive at scale.
Overall, LLMs show promise in augmenting CDS at the point of care. Further research is needed to understand appropriateness accuracy for most guidelines and standardize testing and implementation of LLM-based CDS in clinical practice.
Society Guidelines and Best Practices
A few decades ago, the proliferation of health-related information on the internet presented a new challenge for physicians: patients would come to their appointments with questions after using “Dr Google.” In response, many health care organizations and professional societies created their own resources, like RadiologyInfo.org (http://www.radiologyinfo.org), to which physicians could refer patients for accurate information.
Now, the sudden widespread availability of LLMs has created a new version of this challenge. When patients ask health care–related questions of a public LLM, they cannot verify the accuracy of the information returned. Asking LLMs for references is speculative at best, since LLMs are known to fabricate citations (8). Currently, no guidelines or best practices from professional societies exist for using LLMs in health care–related tasks. Notably, Ayers et al (9) showed that responses from an LLM were rated as more empathetic and of higher quality than those from health care professionals. Tripathi et al (10) presented a set of prompting guidelines for patients to use when asking LLMs health-related questions, but these have not been endorsed or vetted by organized radiology. Even if such guidelines were issued, there is no guarantee that patients would follow them, nor can health care providers be held liable for potential harm from inaccurate LLM responses.
Accuracy Monitoring
Due to the stochastic nature of LLMs, output can fluctuate over time even with identical prompts. Additionally, rapid development and model updates can change output once these models are clinically implemented. Data drift may also pose an issue, such as when new guidelines introduce terminology absent from the training data. When LLM-based tools are used as summarizing or reporting aids in radiology, consistent and robust performance is essential, just as it is for detection algorithms. Extra vigilance is necessary, as these models still tend to amplify stereotypes and generate misinformation, potentially disadvantaging certain groups (11). Thus, continuous monitoring will be crucial to safeguard ethical and responsible use.

Establishing and standardizing methods to evaluate baseline LLM responses is the first step (12). Measuring LLM performance is nontrivial, as conventional n-gram–based metrics (which compare n-item sequences with reference texts), such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU), fail to capture semantic meaning effectively (13); the sketch below illustrates this limitation. Domain-agnostic and health care–specific metrics have been proposed, especially for medical chatbots (13). Some radiology-specific metrics, such as MRScore (14) and the RadCliQ score (15), have also been proposed. Other groups have proposed metrics evaluating factuality, crucial in combating hallucinations, such as the Search-Augmented Factuality Evaluator (16) and FActScore (17). While no consensus exists yet on which metric to use, a set of metrics will likely be needed to characterize performance fully.
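To make the limitation of n-gram metrics concrete, the following self-contained Python sketch computes unigram (ROUGE-1) recall by hand; the example sentences are invented. A semantically equivalent paraphrase scores poorly because few surface words overlap.

```python
# Hand-rolled ROUGE-1 recall: fraction of reference unigrams that also appear
# in the candidate. Illustrative only; validated packages exist for real use.
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(n, cand_counts[w]) for w, n in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

reference = "no evidence of acute intracranial hemorrhage"
paraphrase = "the examination shows no intracranial bleed"  # same meaning
print(rouge1_recall(reference, paraphrase))  # ~0.33 despite semantic equivalence
```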
A possible solution is periodic or continuous automated monitoring using these metrics, with safety thresholds that trigger human attention or temporary system deactivation; a minimal sketch follows. Another solution is integrated user feedback, as is common in LLM-based chat services such as OpenAI’s ChatGPT. This human feedback loop can facilitate model refinement and adjustment, but it may suffer from issues like automation bias and requires an initial time investment. Ultimately, combining quantitative metrics with qualitative user feedback may yield a robust monitoring system.
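The sketch below shows how such threshold-based batch monitoring might be wired up. The threshold value, scoring function, and alerting hook are all assumptions for illustration; a deployed system would require validated metrics and formal governance.

```python
# Hedged sketch of batch monitoring with a safety threshold. SAFETY_THRESHOLD,
# score_fn, and alert_fn are assumptions for illustration.
from statistics import mean

SAFETY_THRESHOLD = 0.80  # hypothetical minimum acceptable mean score

def monitor_batch(outputs, references, score_fn, alert_fn) -> float:
    scores = [score_fn(ref, out) for ref, out in zip(references, outputs)]
    batch_mean = mean(scores)
    if batch_mean < SAFETY_THRESHOLD:
        # could also deactivate the system pending human review
        alert_fn(f"Mean score {batch_mean:.2f} below threshold; flagging for review.")
    return batch_mean

# Example wiring with a toy word-overlap metric and console alerting:
monitor_batch(
    outputs=["no acute hemorrhage"],
    references=["no evidence of acute intracranial hemorrhage"],
    score_fn=lambda ref, out: len(set(ref.split()) & set(out.split())) / len(set(ref.split())),
    alert_fn=print,
)
```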
Academic Administrative Support
While the creative aspect of academic work remains with humans, LLM-based tools have emerged to enhance efficiency, perform repetitive tasks, and assist with coding and writing. Notably, even GPT-4o tends to generate lengthy, generic, and at times blatantly incorrect text that may be recognized by some as artificial. When used appropriately, with proper context given to the model and manual refinement of its output, LLMs can be highly valuable tools. Proficiency in prompt engineering will become a key skill, much as information retrieval and memorization were for previous generations. Opponents argue that overreliance on LLMs might lead to reduced critical thinking and analytical skills.
Recently, editors and grant boards have noted a surge in submissions, prompting many journals to establish guidelines on LLM usage. Unfortunately, misuse has been noted, such as anatomically incorrect fabricated figures in peer-reviewed articles (18). Another significant concern is the confidentiality breach associated with uploading sensitive information, such as manuscripts under review, to text-retaining third-party services like ChatGPT. For reviewers and editors, LLMs can identify and summarize relevant literature and aid with language editing. Authors and reviewers should clearly disclose how they used LLMs (19). However, reviewers must remain cautious of fictitious references and should always manually verify and adjust the output text.
Open-Source Versus Commercial LLMs
Since the initial release of ChatGPT in November 2022, other LLMs have been trying to catch up with OpenAI’s commercial LLM product, including other commercial LLMs (eg, Gemini from Google, Claude from Anthropic) as well as “open-source” LLMs, including those from Meta (Llama, Llama 2, Llama 3) and Mistral AI (Mistral, Mixtral).
Popular open-source LLMs make their model architecture and weights available (although not their training data or pretraining code), so they can be run locally or modified with fine-tuning techniques. These open-source LLMs have generally lagged in performance at release compared with the latest commercial counterparts, although modified versions (ie, with fine-tuning and/or retrieval-augmented generation) can surpass commercial LLMs for specific use cases; a minimal sketch of retrieval-augmented generation follows.
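The retrieval step at the heart of retrieval-augmented generation can be sketched in a few lines of Python: embed the question, find the most similar guideline chunk, and prepend it to the prompt. The toy two-dimensional vectors below are stand-ins for real sentence embeddings.

```python
# Minimal retrieval-augmented generation (RAG) sketch. Vectors are toy
# two-dimensional stand-ins for real sentence embeddings.
import math

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_prompt(question: str, question_vec, chunks) -> str:
    # chunks: list of (guideline text, embedding vector) pairs
    context = max(chunks, key=lambda c: cosine(question_vec, c[1]))[0]
    return (f"Guideline context:\n{context}\n\n"
            f"Question: {question}\nAnswer using only the context above.")

chunks = [
    ("Mammography is the first-line screening examination ...", [0.9, 0.1]),
    ("CT is preferred when ...", [0.2, 0.8]),
]
print(build_prompt("Which test for breast cancer screening?", [0.8, 0.2], chunks))
```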
Commercial LLMs and their application programming interfaces (APIs) allow companies and researchers to develop applications relatively easily and inexpensively. However, these benefits diminish with increased use and cost, and relatively frequent commercial model updates and deprecations can break applications and pose challenges to reproducible research. The potential dominance of commercial LLMs could make radiology applications, and the entire health care industry, reliant on a few giant tech companies, which have historically been unpredictable in their support for health care ventures.
Open-source models offer additional benefits, including local deployment, even in fully offline environments, for sensitive clinical information technology infrastructures worldwide, which often have policies prohibiting the transmission of personal health information outside hospital firewalls; a brief sketch of local deployment follows. Running local LLMs also allows for more customization, which could ultimately provide additional benefits to patient care.
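A minimal local-deployment sketch with the Hugging Face transformers library is shown below. The model name is an illustrative assumption; weights must be downloaded once, after which inference can run entirely offline behind the hospital firewall.

```python
# Sketch of running an open-source LLM locally with Hugging Face transformers.
# The model name is an assumed example; any locally stored model can be used.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative; requires local weights
)

prompt = "Rewrite for a patient: 5-mm noncalcified pulmonary nodule, low risk."
print(generator(prompt, max_new_tokens=120)[0]["generated_text"])
```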
The potential risk of open-source LLMs lagging behind commercial LLMs in performance cannot be overstated for clinical applications, where both doctors and patients seek the latest technology. In some cases, doctors cannot use the latest commercial LLMs due to hospital policies prohibiting cloud-based applications, while patients can access state-of-the-art commercial LLMs. This discrepancy may lead to scenarios where doctors are less effective, potentially causing friction in doctor-patient relationships, especially as these LLMs become more proficient in medical contexts. Therefore, additional research and development by the radiology and health care community to further improve open-source health care–specific LLMs is urgently needed to avoid these scenarios.
Agentic Workflows
Agentic workflows are systems where the model actively performs tasks and makes decisions autonomously, similar to how a human agent would work, rather than just providing passive responses based on user prompts. They represent a shift in how we interact with and enhance the capabilities of LLMs. Andrew Ng proposed categorizing agents into four types. Each type defines a specific interaction pattern: reflective agents assess and improve their own outputs; tool-using agents use external tools to complete tasks; planning agents strategize the steps required to achieve a goal; and collaborative agents work together to solve problems, often involving multiple agents or LLMs working in tandem (20).
Recent evidence suggests that agentic workflows enhance the performance of LLMs beyond traditional prompt engineering. For example, when LLMs like GPT-3.5 were embedded in agentic workflows, their performance in coding tasks improved, even surpassing newer models like GPT-4 in certain respects. This improvement is largely attributed to the iterative and reflective capabilities of agentic workflows, which allow models to refine their outputs progressively rather than producing a single static response (20); the sketch below outlines this reflective loop.
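A minimal sketch of the reflective pattern follows: one call drafts an answer, a second critiques it, and a third revises. The llm argument stands in for any prompt-to-completion call and is an assumption, not a specific vendor API.

```python
# Hedged sketch of a reflective agent loop: draft, critique, revise.
# llm is any callable mapping a prompt string to a completion string.
def reflective_answer(task: str, llm, rounds: int = 2) -> str:
    draft = llm(f"Complete this task:\n{task}")
    for _ in range(rounds):
        critique = llm(f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
                       "List concrete flaws in this draft.")
        draft = llm(f"Task:\n{task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
                    "Rewrite the draft, fixing every flaw.")
    return draft
```

The same loop generalizes to the collaborative pattern by routing the critique and revision steps to different, possibly specialized, models.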
However, the complexity of designing and managing multiple interacting agents can be a significant challenge, especially when scaling up. The iterative nature of these workflows often requires more computational resources and time, as multiple interactions or iterations are necessary to refine outputs satisfactorily. Furthermore, there is the risk of inefficiency or redundancy when multiple agents are not perfectly coordinated.
In radiology, agentic workflows could substantially enhance the application of LLMs. For instance, an LLM could initially generate a report based on imaging data and then reflect on its own analysis or consult other specialized LLMs to refine its findings. Such workflows could potentially improve the detection of errors in reports beyond the performance reported for single-prompt error correction (21).
The Future of Radiology
Nearly a decade ago, computer scientist Geoffrey Hinton predicted the demise of radiologists due to AI within 5 years, stating, “If you work as a radiologist, you are like Wile E. Coyote in the cartoon. You are already over the edge of the cliff, but you have not looked down yet” (22). Despite this prediction, the demand for radiologists has never been higher, leading to workforce shortages in both academic and nonacademic settings (23). This shortage has resulted in increased radiologist burnout, affecting up to 88% of radiologists globally (24).
LLMs, due to their widespread availability, have the potential to catalyze the AI transformation of radiology, although in a more positive way than predicted by Hinton (25). LLMs have already begun to allow radiologists to spend less time on repetitive tasks and focus more on cognitive aspects and direct patient care. LLMs will synthesize the vast amount of information in medical records, improving diagnostic capabilities and expanding into predictive medicine. They will also impact the content of reports, making them more accessible through translation to both lay language and patients’ preferred languages. As leaders in bioinformatics and medical data science, radiologists are well-positioned to realize the potential of this disruptive technology.
Still, a sea change such as this will necessitate updates in training and continuing education for the radiology workforce. Increased AI knowledge has been shown to correlate with adoption and inversely correlate with fear (26). Organized radiology has begun adapting to the need for this type of training.
If radiology is indeed heading off a cliff, LLMs should be embraced as a parachute to safely reach the next phase of health care. As radiology and radiologists continue to harness the potential of LLMs, the value provided to patients and the demand for services will only increase.
Footnotes
Disclosures of conflicts of interest: M.H. Consulting fees from Capvision; speakers’ honoraria to institution from Canon; support for attending meetings from the European Society of Medical Imaging Informatics (EuSoMII) and European Society of Radiology (ESR); stakeholder board member for AI-POD, EuSoMII board member, ESR eHealth & Informatics Subcommittee member, ECR Imaging Informatics/Artificial Intelligence and Machine Learning Chairperson (2025), committee member with FMS (Dutch), and Radiology: Artificial Intelligence associate editor and trainee editorial board advisory panel member. F.K. Consulting fees from Bunkerhill Health, GE HealthCare, and MD.ai; speaker fees from Sharing Progress In Cancer Care; early career consultant to the Editor of Radiology, associate editor of Radiology: Artificial Intelligence, vice-chair of the Society for Imaging Informatics in Medicine (SIIM) Machine Learning Committee, member of the RSNA AI Committee, and member of the RSNA Radiology Informatics Council. T.S.C. Grants to institution from the National Institutes of Health, Independence Blue Cross, American College of Radiology, and the RSNA; consulting fees from Quattro; honoraria from Icahn School of Medicine, Massachusetts General Hospital, ISMIE, and the British Journal of Radiology; reimbursement for travel from SIIM and PARAD; board member for SIIM, PARAD, Association of Academic Radiology, and PRS (2023). K.D.H. Executive committee member of the New York State Radiological Society and board member of the New York State Radiological Educational Foundation. J.E. No relevant relationships. G.S. Member of the SIIM Board of Directors. L.M. Grants from Siemens, Gordon and Betty Moore Foundation, Mary Kay Foundation, and Google; consulting fees from Lunit, iCAD, and Guerbet; payment for lectures from iCAD and Guerbet; board member for the International Society for Magnetic Resonance in Medicine and Society of Breast Imaging; stock or stock options in Lunit; editor of Radiology.
References
- 1. Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 2024;30(4):1134–1142.
- 2. Shen Y, Heacock L, Elias J, et al. ChatGPT and other large language models are double-edged swords. Radiology 2023;307(2):e230163.
- 3. Rao A, Kim J, Kamineni M, Pang M, Lie W, Succi MD. Evaluating ChatGPT as an adjunct for radiologic decision-making. medRxiv 2023.02.02.23285399 [preprint]. Posted February 7, 2023. Accessed July 2024.
- 4. Nguyen D, Swanson D, Newbury A, Kim YH. Evaluation of ChatGPT and Google Bard using prompt engineering in cancer screening algorithms. Acad Radiol 2024;31(5):1799–1804.
- 5. Scheschenja M, Bastian MB, Wessendorf J, et al. ChatGPT: evaluating answers on contrast media related questions and finetuning by providing the model with the ESUR guideline on contrast agents. Curr Probl Diagn Radiol 2024;53(4):488–493.
- 6. Liu S, McCoy AB, Peterson JF, et al. Leveraging explainable artificial intelligence to optimize clinical decision support. J Am Med Inform Assoc 2024;31(4):968–974.
- 7. Liu S, McCoy AB, Wright AP, et al. Why do users override alerts? Utilizing large language model to summarize comments and optimize clinical decision support. J Am Med Inform Assoc 2024;31(6):1388–1396.
- 8. Gravel J, D’Amours-Gravel M, Osmanlliu E. Learning to fake it: limited responses and fabricated references provided by ChatGPT for medical questions. Mayo Clin Proc Digit Health 2023;1(3):226–234.
- 9. Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 2023;183(6):589–596.
- 10. Tripathi S, Sukumaran R, Dheer S, Cook T. Promptwise: prompt engineering paradigm for enhanced patient-large language model interactions towards medical education. SSRN J 2024.
- 11. Zack T, Lehman E, Suzgun M, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health 2024;6(1):e12–e22. [Published correction appears in Lancet Digit Health 2024;6(7):e445.]
- 12. Park YJ, Pillai A, Deng J, et al. Assessing the research landscape and clinical utility of large language models: a scoping review. BMC Med Inform Decis Mak 2024;24(1):72.
- 13. Abbasian M, Khatibi E, Azimi I, et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digit Med 2024;7(1):82.
- 14. Liu Y, Wang Z, Li Y, et al. MRScore: evaluating radiology report generation with LLM-based reward system. arXiv 2404.17778 [preprint]. https://arxiv.org/abs/2404.17778. Posted April 27, 2024. Accessed July 22, 2024.
- 15. Yu F, Endo M, Krishnan R, et al. Evaluating progress in automatic chest x-ray radiology report generation. Patterns (N Y) 2023;4(9):100802.
- 16. Wei J, Yang C, Song X, et al. Long-form factuality in large language models. arXiv 2403.18802 [preprint]. https://arxiv.org/abs/2403.18802. Posted March 27, 2024. Updated April 3, 2024. Accessed July 2024.
- 17. Min S, Krishna K, Lyu X, et al. FActScore: fine-grained atomic evaluation of factual precision in long form text generation. arXiv 2305.14251 [preprint]. https://arxiv.org/abs/2305.14251. Posted May 23, 2023. Updated October 11, 2023. Accessed July 2024.
- 18. Wu Y, Pang S, Guo J, Yang J, Ou R. Assessment of the efficacy of alkaline water in conjunction with conventional medication for the treatment of chronic gouty arthritis: a randomized controlled study. Medicine (Baltimore) 2024;103(14):e37589. [Published retraction appears in Medicine (Baltimore) 2024;103(28):e38913.]
- 19. Moy L. Guidelines for use of large language models by authors, reviewers, and editors: considerations for imaging journals. Radiology 2023;309(1):e239024.
- 20. Ng A. Four AI agent strategies that improve GPT-4 and GPT-3.5 performance. https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/. Accessed July 24, 2024.
- 21. Gertz RJ, Dratsch T, Bunck AC, et al. Potential of GPT-4 for detecting errors in radiology reports: implications for reporting accuracy. Radiology 2024;311(1):e232714.
- 22. Mukherjee S. A.I. versus M.D.: what happens when diagnosis is automated? The New Yorker. https://www.newyorker.com/magazine/2017/04/03/ai-versus-md. Published March 27, 2017. Accessed July 2024.
- 23. Rawson JV, Smetherman D, Rubin E. Short-term strategies for augmenting the national radiologist workforce. AJR Am J Roentgenol 2024;222(6):e2430920.
- 24. Fawzy NA, Tahir MJ, Saeed A, et al. Incidence and factors associated with burnout in radiologists: a systematic review. Eur J Radiol Open 2023;11:100530.
- 25. Bhayana R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology 2024;310(1):e232756.
- 26. Huisman M, Ranschaert E, Parker W, et al. An international survey on AI in radiology in 1,041 radiologists and radiology residents part 1: fear of replacement, knowledge, and attitude. Eur Radiol 2021;31(9):7058–7066.
