Abstract
Large language models (LLMs) are increasingly prevalent in medical research; however, fundamental limitations of their architecture create inherent reliability challenges, particularly in specialized medical contexts. These limitations stem from autoregressive prediction mechanisms and computational constraints related to undecidability, which preclude perfect accuracy. Current mitigation strategies include advanced prompting techniques such as Chain-of-Thought reasoning and Retrieval-Augmented Generation (RAG) frameworks, although these approaches cannot eliminate the core reliability issues. A meta-analysis of human-artificial intelligence collaboration experiments revealed that, although LLMs can augment individual human capabilities, they are most effective in specific contexts that allow human verification. Successful integration of LLMs into medical research requires careful tool selection aligned with task requirements and appropriate verification mechanisms. The evolution of the field points toward a balanced approach combining technological innovation with established expertise, emphasizing human oversight, particularly for complex biological systems. This review highlights the importance of understanding the technical limitations of LLMs while maximizing their potential through thoughtful application and rigorous verification processes, thereby ensuring high standards of scientific integrity in medical research.
Keywords: Medical writing, Scholarly communication, Reliability, Natural language processing, Artificial intelligence
Highlights
· Large language models (LLMs) inherently generate inaccuracies due to autoregressive architecture and computational limits, posing challenges in medical research.
· Advanced prompting and Retrieval-Augmented Generation (RAG) can partially mitigate hallucinations but require human oversight.
· Effective LLM use in medical research demands task-specific tool selection paired with rigorous human validation for complex tasks.
Introduction
Large language models (LLMs) are increasingly being adopted in academic writing, potentially changing the patterns of scholarly communication. A recent analysis of vocabulary patterns found that at least 10% of PubMed abstracts published during the first half of 2024 showed signs of LLM usage [1]. Telltale examples include "meticulously delving" and "intricate interplay," flowery phrases that historically appeared only rarely in scientific abstracts. This high prevalence raises important questions regarding the reliability and appropriate use of LLMs in medical research contexts.
The medical community's response to LLMs has been cautious. Although clinicians generally accept LLMs for supportive research roles such as data processing and content drafting, they express significant reservations about their use in high-stakes tasks such as original manuscript writing [2]. These concerns are not unfounded, as several studies have identified limitations in the ability of LLMs to handle specialized medical content. For example, a comprehensive evaluation in ophthalmology research revealed that LLMs struggle with accurate reference generation and cannot reliably handle specific medical information, failing, for instance, to distinguish between oral and intravenous corticosteroid dosing in optic neuritis treatment [3].
This review addresses the gap between LLM adoption and reliability concerns through three key objectives: analyzing the technical limitations of LLMs in medical research contexts, evaluating emerging reliability-enhancement strategies, and providing evidence-based recommendations for integrating LLMs into research workflows. The aim is to provide medical researchers with practical knowledge for leveraging LLMs while maintaining scientific integrity.
Technical understanding of LLM limitations
The reliability limitations of LLMs are rooted in their architectural design. These models generate text through autoregressive prediction, where each token is iteratively predicted and appended based on preceding tokens, creating an inherent constraint on their ability to maintain long-range coherence and accuracy. This sequential generation process is particularly challenging in specialized domains where high-quality training data may be scarce, leading to decreased prediction accuracy in technical fields [4].
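To make this token-by-token process concrete, the following minimal sketch illustrates greedy autoregressive decoding with an open causal language model. The choice of the small GPT-2 checkpoint and the example prompt are illustrative assumptions, not components of any system evaluated in the cited studies; any causal LLM generates text in essentially the same loop.

```python
# A minimal sketch of autoregressive (token-by-token) generation.
# Assumes the Hugging Face "transformers" library and the small "gpt2"
# checkpoint purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Optic neuritis is typically treated with"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                       # generate 20 tokens, one at a time
        logits = model(input_ids).logits      # scores for every vocabulary token
        next_id = logits[0, -1].argmax()      # greedy choice: most probable next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
# Each token is chosen solely from learned co-occurrence statistics; nothing in
# this loop checks whether the continuation is factually correct.
```

As the final comment indicates, the loop contains no mechanism for verifying factual accuracy, which is precisely the architectural constraint discussed above.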
Recent theoretical work in computational logic indicates that "hallucinations," the generation of false or unsupported information, are an intrinsic characteristic of LLM architecture rather than a temporary limitation. Banerjee et al. [5] demonstrated that fundamental computational constraints, particularly those associated with undecidability as established in results such as Gödel's first incompleteness theorem, preclude any guarantee of perfect accuracy in LLM outputs. These structural hallucinations manifest in four distinct forms: factual incorrectness, misinterpretation, partial incorrectness, and fabrication, each carrying a non-zero probability of occurrence regardless of architectural sophistication or training data quality.
In addition to these technical limitations, the fundamental nature of LLM-generated inaccuracies can be understood through a philosophical lens. Although these models excel at producing linguistically coherent responses, they fundamentally lack mechanisms for truth verification. Hicks et al. [6] characterize this behavior as "bullshit" in the philosophical sense defined by Harry Frankfurt, distinguishing it from mere inaccuracies or lies. This characterization emphasizes that LLMs are not attempting to represent or misrepresent truth but are inherently indifferent to it, producing what the authors term "soft bullshit": content generated without any intentional stance toward truth or deception. Furthermore, when these pattern-matching systems are presented to users as reliable sources of factual information, they may also produce "hard bullshit," actively misleading users about their capacity for truth-telling.
Current approaches to enhancing reliability
Various strategies have been proposed to mitigate the reliability issues of LLMs. One key approach involves advanced prompting techniques that guide the model toward more accurate and reliable outputs [7]. For example, prompting the model to articulate its reasoning steps through techniques such as Chain-of-Thought can reduce hallucinations by encouraging sequential processing of information. Generating multiple responses and selecting the most consistent one, known as Self-Consistency, can also improve reliability. Breaking down complex questions into simpler subquestions (Decomposition) allows the model to focus on each component individually, reducing the chance of logical leaps or fabricated information. Adding verification steps throughout the response generation process (Chain-of-Verification) forces the model to assess its assertions against internal information, increasing the reliability of its outputs (Table 1; a minimal code sketch combining two of these techniques follows the table). For details and examples of these prompting techniques, readers are encouraged to refer to the comprehensive survey by Schulhoff et al. [7], which provides extensive documentation of effective prompting strategies applicable to various contexts.
Table 1.
Prompting techniques to enhance large language model reliability
| Technique | Description |
|---|---|
| Chain-of-Thought | Prompts the model to reason step by step, processing information sequentially. |
| Self-Consistency | Generates multiple responses and selects the most consistent one. |
| Decomposition | Breaks down complex questions into simpler subquestions, allowing focused analysis of each component. |
| Chain-of-Verification | Incorporates verification at each step of reasoning, ensuring that each step aligns with known information. |
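The sketch below illustrates how two of these techniques, Chain-of-Thought and Self-Consistency, might be combined in practice. The `ask_llm` helper, prompt template, and sampling settings are hypothetical placeholders for whatever chat API a researcher uses; they are not drawn from the cited survey.

```python
# A minimal sketch of Chain-of-Thought prompting combined with Self-Consistency.
# `ask_llm` is a hypothetical placeholder for any chat-completion call that
# accepts a prompt and a sampling temperature and returns a text response.
from collections import Counter

def ask_llm(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("Wrap your preferred LLM API here.")

COT_PROMPT = (
    "Question: {question}\n"
    "Think step by step, then give the final answer on a line starting with 'ANSWER:'."
)

def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    """Sample several chain-of-thought responses and return the most frequent answer."""
    answers = []
    for _ in range(n_samples):
        response = ask_llm(COT_PROMPT.format(question=question), temperature=0.7)
        for line in response.splitlines():
            if line.strip().upper().startswith("ANSWER:"):
                answers.append(line.split(":", 1)[1].strip())
                break
    # Majority vote across samples; an empty string signals that no answer was parsed.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```

The majority vote does not guarantee correctness; it only favors answers the model produces consistently, which is why human review of the final output remains necessary.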
Another approach is Retrieval-Augmented Generation (RAG), which combines a language model with an information retrieval component [8]. This allows the model to pull relevant information from trusted sources or databases, grounding its responses in factual data (Fig. 1). However, even RAG systems are not immune to errors. For instance, a Google artificial intelligence (AI) search feature powered by RAG erroneously suggested using "nontoxic glue" to prevent pizza toppings from falling off, highlighting the persistent challenge of hallucinations even in retrieval-augmented systems [9].
Fig. 1.
Comparison of standard and Retrieval-Augmented Generation (RAG) frameworks in large language models (LLMs). (A) In standard LLM generation, the model generates an answer from the user query and system instruction alone. (B) In RAG, a retriever supplements the prompt with relevant information from a database, aiming to ground responses in factual data and reduce hallucinations.
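As a conceptual illustration of the RAG pattern shown in Fig. 1B, the following sketch retrieves the passages most similar to a query from a small trusted corpus and prepends them to the prompt before generation. The embedding model, the two-sentence corpus, and the `ask_llm` helper are illustrative assumptions rather than components of any cited system.

```python
# A minimal sketch of Retrieval-Augmented Generation (RAG), following Fig. 1B.
# Assumes the "sentence-transformers" library for embeddings; the corpus and
# ask_llm() helper (any chat-completion wrapper) are illustrative stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your preferred LLM API here.")

corpus = [
    "High-dose intravenous methylprednisolone is the standard initial therapy "
    "for acute optic neuritis.",
    "Retrieval-augmented generation grounds model outputs in retrieved documents.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_vecs = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k corpus passages most similar to the query (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vecs @ q
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using ONLY the context below; say 'not found' otherwise.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```

Grounding the prompt in retrieved text narrows, but does not eliminate, the space for hallucination, as the pizza-glue example above demonstrates.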
Optimizing LLM use in medical research
To effectively integrate LLMs into medical research, a nuanced understanding of task-specific performance is essential. A meta-analysis of 370 human-AI collaboration experiments revealed a complex interaction pattern: human-AI combinations generally augmented individual human capabilities but often failed to achieve synergistic benefits over AI systems alone, particularly in decision-making tasks [10]. However, the analysis identified specific contexts in which human-AI collaboration showed superior performance, especially content-creation tasks and tasks that humans can already perform adequately without AI assistance. These findings suggest that medical researchers should strategically employ LLMs in roles well suited to human verification and validation, such as literature summarization, hypothesis generation, and data analysis, while maintaining careful consideration of task allocation and oversight mechanisms.
The practical implementation of these principles can be observed in several key research applications. In literature synthesis, systems like PaperQA2 have demonstrated impressive capabilities through an "agentic" system equipped with sophisticated ranking and contextual summarization tools, sometimes exceeding PhD-level expertise in specific tasks [11]. These results show the potential of well-designed augmentation systems while emphasizing the continued importance of human expertise because human feedback helps refine benchmarks and guide the development of the system to better handle nuanced and complex scientific questions. Similarly, Unlu et al. [12] demonstrated that non-experts using a RAG-enabled LLM system achieved performance comparable to expert screeners in clinical trial screening tasks. However, the authors emphasized critical considerations, including the need for human expert verification and monitoring of potential biases, reinforcing the importance of human oversight.
The selection of appropriate tools and implementation strategies is crucial for successful LLM integration. Although general-purpose LLMs such as ChatGPT or Claude are useful for many low-stakes tasks, they may produce high error rates in rigorous research applications. For research-related information searches, specialized tools with robust search and verification capabilities are essential [13]. The choice of tools should align with task requirements and incorporate appropriate verification mechanisms, particularly for tasks involving medical knowledge where accuracy is paramount. This careful tool selection ensures that LLMs enhance rather than compromise research quality.
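As one example of a lightweight verification mechanism, the sketch below checks whether DOIs in LLM-suggested references actually resolve in the public Crossref registry and whether the registered title roughly matches the claimed one. The endpoint is the standard Crossref REST API; the example reference, the title-overlap threshold, and the workflow itself are illustrative assumptions, and any flagged entry would still require manual review.

```python
# A minimal sketch of a verification step for LLM-suggested references:
# confirm each DOI exists in the public Crossref registry and that the
# registered title roughly matches the title the model produced.
import requests

def verify_doi(doi: str, claimed_title: str) -> bool:
    """Return True if the DOI resolves in Crossref and the titles roughly agree."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return False                           # DOI not registered: likely fabricated
    titles = resp.json()["message"].get("title") or [""]
    registered = titles[0].lower()
    overlap = set(claimed_title.lower().split()) & set(registered.split())
    return len(overlap) >= 3                   # crude word-overlap heuristic

suggested = [  # hypothetical LLM output to be screened before human review
    ("10.1001/jamaophthalmol.2023.3119",
     "Evaluation and comparison of ophthalmic scientific abstracts and references"),
]
for doi, title in suggested:
    print(doi, "OK" if verify_doi(doi, title) else "NEEDS MANUAL REVIEW")
```

Such automated checks catch only the most obvious fabrications; they complement, rather than replace, expert verification of content.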
As the field continues to evolve, emerging technologies indicate potential future directions for LLM integration in medical research. Lu et al. [14] demonstrated an 'AI Scientist' framework capable of autonomously generating research ideas, conducting in silico experiments, and writing papers in machine learning domains. However, the success of this system in controlled machine learning environments highlights an important contrast with medical research, where human oversight remains crucial due to ethical implications and the complexity of biological systems. This reinforces the importance of viewing LLMs as augmentation tools rather than replacements for human expertise in medical contexts.
Conclusions
This review highlights the importance of understanding the reliability limitations of LLMs in medical research. In particular, understanding the technical basis of LLMs' inherent tendency to generate inaccurate information is essential for using them responsibly.
The successful integration of LLMs into research may require a balanced approach that combines technological innovation with established expertise. According to Kellogg et al. [15], it may be a misconception to assume that issues related to AI utilization will resolve over time, or that they can be fully addressed by junior professionals who are more technologically adept. Experienced senior staff excel in assessing risks, foreseeing potential disruptions, and evaluating how new technologies can align with or challenge organizational goals. Collaborative efforts that balance both the adaptability of newer researchers and the experience of established experts are necessary to effectively address the challenges associated with LLM reliability.
The findings of this review suggest that optimal LLM integration in medical research depends on matching tool capabilities with task requirements: utilizing general-purpose LLMs for preliminary drafting and data organization, employing specialized retrieval-augmented systems for literature synthesis and hypothesis generation, and maintaining direct human oversight for complex biological interpretations and high-stakes decisions. This task-specific approach combined with systematic verification offers a practical framework for leveraging LLM capabilities while preserving research quality.
Footnotes
Conflicts of interest
No potential conflict of interest relevant to this article was reported.
Funding
This research was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT; Grant No. 2018R1A5A2021242).
Author contribution
This article is a single-author article.
References
1. Kobak D, González-Márquez R, Horvát EÁ, Lause J. Delving into ChatGPT usage in academic writing through excess vocabulary. arXiv:2406.07016 [Preprint]. 2024 [cited 2024 Nov 8]. doi: 10.48550/arXiv.2406.07016.
2. Spotnitz M, Idnay B, Gordon ER, Shyu R, Zhang G, Liu C, et al. A survey of clinicians' views of the utility of large language models. Appl Clin Inform. 2024;15:306–12. doi: 10.1055/a-2281-7092.
3. Hua HU, Kaakour AH, Rachitskaya A, Srivastava S, Sharma S, Mammo DA. Evaluation and comparison of ophthalmic scientific abstracts and references by current artificial intelligence chatbots. JAMA Ophthalmol. 2023;141:819–24. doi: 10.1001/jamaophthalmol.2023.3119.
4. Lin CC, Jaech A, Li X, Gormley MR, Eisner J. Limitations of autoregressive models and their alternatives. In: Toutanova K, Rumshisky A, Zettlemoyer L, Hakkani-Tur D, Beltagy I, Bethard S, et al., editors. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics; 2021. p. 5147–73.
5. Banerjee S, Agarwal A, Singla S. LLMs will always hallucinate, and we need to live with this. arXiv:2409.05746 [Preprint]. 2024 [cited 2024 Sep 12]. doi: 10.48550/arXiv.2409.05746.
6. Hicks MT, Humphries J, Slater J. ChatGPT is bullshit. Ethics Inf Technol. 2024;26:38.
7. Schulhoff S, Ilie M, Balepur N, Kahadze K, Liu A, Si C, et al. The prompt report: a systematic survey of prompting techniques. arXiv:2406.06608 [Preprint]. 2024 [cited 2024 Sep 3]. doi: 10.48550/arXiv.2406.06608.
8. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv:2005.11401 [Preprint]. 2021 [cited 2024 Nov 7]. doi: 10.48550/arXiv.2005.11401.
9. Google AI search tells users to glue pizza and eat rocks [Internet]. BBC News; 2024 [cited 2024 Nov 7]. Available from: https://www.bbc.com/news/articles/cd11gzejgz4o.
10. Vaccaro M, Almaatouq A, Malone T. When combinations of humans and AI are useful: a systematic review and meta-analysis. Nat Hum Behav. 2024;8:2293–303. doi: 10.1038/s41562-024-02024-1.
11. Skarlinski MD, Cox S, Laurent JM, Braza JD, Hinks M, et al. Language agents achieve superhuman synthesis of scientific knowledge. arXiv:2409.13740 [Preprint]. 2024 [cited 2024 Sep 12]. doi: 10.48550/arXiv.2409.13740.
12. Unlu O, Shin J, Mailly CJ, Oates MF, Tucci MR, Varugheese M, et al. Retrieval-augmented generation–enabled GPT-4 for clinical trial screening. NEJM AI. 2024;1:AIoa2400181.
13. Ahn S. The transformative impact of large language models on medical writing and publishing: current applications, challenges and future directions. Korean J Physiol Pharmacol. 2024;28:393–401. doi: 10.4196/kjpp.2024.28.5.393.
14. Lu C, Lu C, Lange RT, Foerster J, Clune J, Ha D. The AI scientist: towards fully automated open-ended scientific discovery. arXiv:2408.06292 [Preprint]. 2024 [cited 2024 Aug 13]. doi: 10.48550/arXiv.2408.06292.
15. Kellogg K, Lifshitz-Assaf H, Randazzo S, Mollick ER, Dell'Acqua F, McFowland E III, et al. Don't expect juniors to teach senior professionals to use generative AI: emerging technology risks and novice AI risk mitigation tactics. Harvard Business School Technology & Operations Mgt. Unit Working Paper No. 24-074; 2024. Available from: https://ssrn.com/abstract=4857373 or http://dx.doi.org/10.2139/ssrn.4857373.