Digital Health. 2025 Apr 21;11:20552076251337177. doi: 10.1177/20552076251337177

Enhancing medical AI with retrieval-augmented generation: A mini narrative review

Omid Kohandel Gargari 1, Gholamreza Habibi 1
PMCID: PMC12059965  PMID: 40343063

Abstract

Retrieval-augmented generation (RAG) is a powerful technique in artificial intelligence (AI) and machine learning that enhances the capabilities of large language models (LLMs) by integrating external data sources, allowing for more accurate, contextually relevant responses. In medical applications, RAG has the potential to improve diagnostic accuracy, clinical decision support, and patient care. This narrative review explores the application of RAG across various medical domains, including guideline interpretation, diagnostic assistance, clinical trial eligibility screening, clinical information retrieval, and information extraction from scientific literature. Studies highlight the benefits of RAG in providing accurate, up-to-date information, improving clinical outcomes, and streamlining processes. Notable applications include GPT-4 models enhanced with RAG to interpret hepatologic guidelines, assist in differential diagnosis, and aid in clinical trial screening. Furthermore, RAG-based systems have demonstrated superior performance over traditional methods in tasks such as patient diagnosis, clinical decision-making, and medical information extraction. Despite its advantages, challenges remain, particularly in model evaluation, cost-efficiency, and reducing AI hallucinations. This review emphasizes the potential of RAG in advancing medical AI applications and advocates for further optimization of retrieval mechanisms, embedding models, and collaboration between AI researchers and healthcare professionals to maximize RAG's impact on medical practice.

Keywords: Retrieval-augmented generation (RAG), large language models (LLMs), artificial intelligence, medical applications, clinical decision support, diagnostic assistance, guideline interpretation, clinical trial eligibility screening

Introduction

Recent developments in artificial intelligence (AI) have opened up considerable possibilities for improving healthcare efficiency and quality. As of 2023, the U.S. Food and Drug Administration had authorized nearly 700 AI-enabled devices across diverse medical specialties such as radiology, ophthalmology, and hematology. 1 However, despite these advancements, the adoption of AI in practical healthcare settings remains limited. A key challenge lies in the safety concerns voiced by patients, healthcare professionals, and the general public. According to a recent survey conducted in the United States, 60% of respondents expressed discomfort with the use of AI by medical providers. 2

Large language models (LLMs) are sophisticated AI systems capable of comprehending and generating text with a high degree of fluency, often resembling human communication. They can perform a range of tasks, including summarizing medical literature, assisting in diagnosis, generating clinical reports, and supporting medical decision-making. 3 These models are trained on vast collections of text data. Although they exhibit remarkable proficiency and hold great potential in various medical applications, public skepticism toward medical AI tends to grow when individuals become aware that these systems can, in some instances, outperform human experts—an unprecedented development. Some of these concerns are well founded: these models are prone to hallucinations—generating false or misleading information with high confidence—which can be critical, particularly in medical contexts. 4

In AI and machine learning, RAG is a technique that enhances the capabilities of LLMs by integrating external data sources. This allows the models to provide more accurate and contextually relevant responses. The process involves retrieving relevant information from a knowledge base and incorporating it into the model's output generation. This approach improves the accuracy and reliability of AI-generated responses by grounding them in up-to-date, authoritative information. 5
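Conceptually, this pipeline reduces to three steps: embed the query, retrieve the most similar passages from the knowledge base, and prepend them to the prompt the LLM answers. The Python sketch below is purely illustrative; the toy hashing embedding stands in for whatever neural embedding model and vector store a production system would use.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing bag-of-words embedding; a real system would use a neural embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def retrieve(question: str, passages: list[str], top_k: int = 3) -> list[str]:
    # Rank knowledge-base passages by cosine similarity to the question.
    doc_vecs = np.array([embed(p) for p in passages])
    q_vec = embed(question)
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    return [passages[i] for i in np.argsort(sims)[::-1][:top_k]]

def build_prompt(question: str, passages: list[str]) -> str:
    # Ground the model's answer in the retrieved context.
    context = "\n\n".join(retrieve(question, passages))
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# The resulting prompt would then be sent to an LLM (e.g. GPT-4) to generate the grounded answer.
```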

The field of medicine constantly seeks advancements to improve diagnostic accuracy, treatment efficacy, and overall patient care. One such advancement is the integration of RAG in medical applications. RAG combines the strengths of information retrieval systems and generative models to enhance the capabilities of AI in various medical tasks. This manuscript aims to explore the application of RAG in various medical contexts. We conducted a comprehensive search across PubMed, Scopus, EMBASE, and Web of Science from inception to May 2024 to identify studies utilizing RAG for medical tasks, followed by a narrative review to determine the specific areas of application. The identified applications included clinical decision support, assisting with guideline interpretation for evidence-based care; diagnostic assistance, supporting clinicians in diagnosis through enhanced information retrieval; clinical trial eligibility screening; clinical information retrieval and chatbots, improving access to medical information; and information extraction from scientific literature, enabling efficient data extraction from large volumes of publications. This review highlights the diverse applications of RAG in enhancing various aspects of medical practice and research.

Guideline interpretation and clinical decision support

In this section, we explore studies focused on improving guideline interpretation and providing clinical decision support through the use of RAG in medical settings.

One notable study aimed to improve the interpretation of hepatologic disease guidelines using RAG-enhanced GPT-4. The researchers employed two evaluation methods: human evaluation and text similarity scores, which compared the generated answers to expert responses. The customized LLM framework significantly outperformed GPT-4 Turbo across scenarios, achieving an overall accuracy of 99.0% compared to GPT-4 Turbo's 43.0% (p < 0.001). Incremental changes such as consistent text formatting and re-formatting tables as text-based lists raised accuracy to 90.0% (p < 0.001), and custom prompt engineering ultimately brought it to 99.0% (p < 0.001).

For text-based questions, the framework achieved 100% accuracy, surpassing GPT-4 Turbo's 62.0% (p < 0.001), with improvements from in-context guidelines (86.0%; p = 0.01) and consistent formatting (90.0%; p = 0.002). For table-based questions, the accuracy reached 96.0%, significantly better than GPT-4 Turbo's 28.0% (p < 0.001), with incremental improvements through guidelines (44.0%; p = 0.38), text cleaning and .csv conversion (60.0%; p = 0.046), and structured formatting (96.0%; p < 0.001). In clinical scenarios, the framework achieved 100% accuracy compared to GPT-4 Turbo's 20.0% (p < 0.001), with enhancements from guidelines (52.0%; p = 0.039), text cleaning and .csv conversion (72.0%; p < 0.001), and structured formatting (84.0%; p < 0.001). Hallucination analysis revealed 90.3% fact-conflicting and 9.7% input-conflicting hallucinations, with no contextual-conflicting hallucinations. BLEU score, ROUGE-LCS F1, METEOR Score F1, and the custom OpenAI Score were used to compare text similarity, showing improvement with RAG. 6
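For orientation, these surface-level text similarity metrics can be computed with common open-source packages. The snippet below is an illustrative sketch using nltk and rouge-score rather than the authors' evaluation code; the example sentences are hypothetical, and the custom OpenAI score is not reproduced.

```python
# pip install nltk rouge-score
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)   # corpora required by METEOR
nltk.download("omw-1.4", quiet=True)

reference = "Start antiviral therapy when ALT exceeds twice the upper limit of normal."
candidate = "Antiviral therapy should be started once ALT is above twice the upper limit of normal."

ref_tokens, cand_tokens = reference.split(), candidate.split()

bleu = sentence_bleu([ref_tokens], cand_tokens,
                     smoothing_function=SmoothingFunction().method1)
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
meteor = meteor_score([ref_tokens], cand_tokens)

print(f"BLEU={bleu:.3f}  ROUGE-L F1={rouge_l:.3f}  METEOR={meteor:.3f}")
```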

Another study developed LiVersa, a liver disease-specific LLM chat interface, within the University of California, San Francisco's protected health information (PHI)-compliant implementation of Azure OpenAI GPT models. The authors used RAG to convert guidelines from the American Association for the Study of Liver Diseases (AASLD) into text embeddings with the Ada text embedding model available through Microsoft Azure, storing them in a searchable database. During interactions, user prompts were matched against this database to retrieve relevant information, which the LLM then used to generate responses.
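The indexing step described here (chunking guideline text, embedding each chunk, and storing the vectors for later similarity search) can be sketched as follows. This is a hypothetical illustration using the Azure OpenAI Python client and an in-memory array rather than the authors' actual pipeline; the endpoint, key, deployment name, and chunking parameters are assumptions.

```python
# pip install openai numpy
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder endpoint
    api_key="<key>",                                            # placeholder key
    api_version="2024-02-01",
)

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size character chunking with overlap."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def build_index(guideline_text: str, deployment: str = "text-embedding-ada-002"):
    # Embed every chunk once; in practice the vectors would go into a vector database.
    chunks = chunk(guideline_text)
    resp = client.embeddings.create(model=deployment, input=chunks)
    vectors = np.array([d.embedding for d in resp.data])
    return chunks, vectors

def search(query: str, chunks, vectors, deployment: str = "text-embedding-ada-002", top_k: int = 5):
    # Match a user prompt against the stored guideline embeddings.
    q = np.array(client.embeddings.create(model=deployment, input=[query]).data[0].embedding)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(sims)[::-1][:top_k]]
```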

Evaluation involved pre-set sample questions and comparisons with medical trainees’ responses to case-vignette-based knowledge assessments. LiVersa accurately answered all 10 forced “yes” or “no” questions but provided incomplete rationales for 3 out of 10 clinical scenarios. This study lacked a comparison to the base model, highlighting a gap in the evaluation methodology. 7

Diagnostic assistance

This section examines the use of RAG in assisting with diagnostic processes, particularly focusing on studies that compare its performance to traditional models and methods. A study focusing on differential diagnosis for gastrointestinal cases utilized GPT-4 to provide diagnoses based on imaging findings. Researchers deployed RAG using a top 10 reading list for gastrointestinal imaging, including 96 peer-reviewed documents. Performance was measured by comparing the accuracy of diagnoses to the ground truth established by experienced radiologists.

Results demonstrated that the RAG-enhanced model achieved 78% accuracy in identifying the main diagnosis, compared to 54% for the base GPT-4 model. The enhanced model also provided at least one correct differential diagnosis in 98% of cases versus 92% for the base model. Despite higher costs and longer response times, the RAG-enhanced model's integration of imaging findings marked a significant improvement over the base model, which relied more on clinical information alone. 8

Another investigation explored the potential of GPT-4 in predicting patient admissions from emergency department (ED) visits, comparing its performance against traditional machine learning (ML) models using real-world data. The study utilized electronic health records from seven NYC hospitals and trained Bio-Clinical-BERT and XGBoost models on unstructured and structured data, respectively, creating an ensemble ML model.

Various scenarios were tested for GPT-4, including Zero-shot, Few-shot with and without RAG, and with ML numerical probabilities. The ensemble ML model achieved high performance (AUC 0.88, AUPRC 0.72, accuracy 82.9%). Initially, GPT-4 showed lower performance (AUC 0.79, AUPRC 0.48, accuracy 77.5%), but incorporating RAG and ML probabilities significantly enhanced its accuracy (AUC 0.87, AUPRC 0.71, accuracy 83.1%). RAG alone notably improved GPT-4's performance (AUC 0.82, AUPRC 0.56, accuracy 81.3%), highlighting GPT-4's potential when augmented with real-world data. 9
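As an illustration of how such a scenario can be assembled, the sketch below places a few retrieved, similar past ED visits and the ensemble model's numerical probability into the prompt before asking the LLM for a prediction. The wording, retrieved cases, and probability are hypothetical and do not reproduce the study's actual prompts.

```python
def build_admission_prompt(visit_note: str,
                           similar_cases: list[tuple[str, str]],
                           ml_probability: float) -> str:
    """similar_cases: (note excerpt, outcome) pairs retrieved from historical ED visits."""
    examples = "\n\n".join(
        f"Example visit:\n{note}\nOutcome: {outcome}"
        for note, outcome in similar_cases
    )
    return (
        "You are predicting whether an emergency department patient will be admitted.\n\n"
        f"Similar past visits (retrieved):\n{examples}\n\n"
        f"An ensemble ML model estimates the admission probability at {ml_probability:.2f}.\n\n"
        f"Current visit:\n{visit_note}\n\n"
        "Answer 'admit' or 'discharge' and briefly justify your answer."
    )

# Usage with hypothetical retrieved cases and an ML probability of 0.74:
prompt = build_admission_prompt(
    "72-year-old with dyspnea, elevated BNP, bilateral crackles...",
    [("68-year-old with dyspnea and elevated BNP...", "admitted"),
     ("45-year-old with anxiety-related chest tightness...", "discharged")],
    0.74,
)
```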

Clinical trial eligibility screening

In this section, we review the use of RAG in enhancing the efficiency and accuracy of clinical trial eligibility screening, showcasing its potential to streamline processes.

Researchers developed the RECTIFIER system to screen patients for the COPILOT-HF trial, comparing its performance to human study staff. RECTIFIER utilized a RAG architecture to filter relevant clinical notes, preventing the system from exceeding GPT-4's token limit. Researchers optimized prompts and chunk sizes to enhance accuracy across development, validation, and testing phases.
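The note-filtering step described above hinges on splitting long clinical notes into token-bounded chunks so that only the most relevant pieces reach GPT-4. A minimal sketch using the tiktoken tokenizer follows; the chunk size and overlap are illustrative assumptions, not the values used in RECTIFIER.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4

def chunk_by_tokens(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split a clinical note into overlapping chunks of at most max_tokens tokens."""
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks

# Only the chunks most relevant to each eligibility criterion would then be
# retrieved and placed into the GPT-4 prompt, keeping it within the token limit.
```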

RECTIFIER demonstrated superior performance, achieving an accuracy of 93.6% compared to the study staff's 85.9%. The AI system also showed higher sensitivity (90.5% vs. 82.1%) and specificity (95.2% vs. 88.7%) across most criteria, with significant improvement in identifying symptomatic heart failure patients (sensitivity of 92.8% compared to 84.3% for the staff). RECTIFIER provided high consistency, reducing variability in patient screening outcomes. The cost-efficiency of RECTIFIER was also highlighted, indicating its potential to significantly reduce the time and resources required for patient screening in clinical trials. 10

Clinical information retrieval and chatbots

This section delves into the development and evaluation of RAG-based clinical information retrieval systems and disease-specific chatbots, emphasizing their accuracy and reliability.

One study introduced Almanac, a framework designed to enhance factuality, completeness, and safety in clinical queries. Unlike traditional point-of-care tools, Almanac integrates external tools such as search engines and medical databases to ensure accurate and well-cited responses.

The study evaluated Almanac against ChatGPT using a dataset of open-ended clinical scenarios assessed by board-certified clinicians and resident physicians. Results showed that Almanac significantly outperformed ChatGPT in factuality, with an average increase of 18 percentage points across specialties, particularly in cardiology (91% vs. 69%). Almanac correctly handled all clinical calculation scenarios, whereas ChatGPT failed all of them. In terms of completeness, Almanac showed a marginal, statistically non-significant improvement over ChatGPT. Regarding safety, Almanac was highly resilient to adversarial prompts (95% vs. 0% for ChatGPT). Despite these advancements, physicians preferred ChatGPT's answers 57% of the time, indicating potential areas for further refinement in user experience. 11

Another study developed a comprehensive knowledge base for a RAG chatbot dedicated to multiple myeloma research. Researchers targeted articles published between 1964 and 2022, using broad keywords to generate queries and retrieve PubMed IDs (PMIDs). Custom functions were developed to fetch complete records from the Entrez API, including article title, abstract, authors, journal, and publication date. Data were saved in JSON format for efficient storage and retrieval.
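A minimal sketch of this PubMed retrieval and JSON storage step, using Biopython's Entrez module, is shown below. The contact email, search term granularity, and batch size are assumptions; the study used its own custom functions against the Entrez API.

```python
# pip install biopython
import json
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact email

# 1. Search PubMed for multiple myeloma articles and collect PMIDs.
handle = Entrez.esearch(db="pubmed", term="multiple myeloma",
                        datetype="pdat", mindate="1964", maxdate="2022", retmax=10000)
pmids = Entrez.read(handle)["IdList"]
handle.close()

# 2. Fetch full records for a batch of PMIDs and keep the fields of interest.
handle = Entrez.efetch(db="pubmed", id=",".join(pmids[:200]), retmode="xml")
records = Entrez.read(handle)
handle.close()

articles = []
for art in records["PubmedArticle"]:
    citation = art["MedlineCitation"]["Article"]
    articles.append({
        "pmid": str(art["MedlineCitation"]["PMID"]),
        "title": str(citation.get("ArticleTitle", "")),
        "abstract": " ".join(str(t) for t in citation.get("Abstract", {}).get("AbstractText", [])),
        "journal": str(citation["Journal"]["Title"]),
    })

# 3. Save to JSON for the chatbot's knowledge base.
with open("myeloma_articles.json", "w") as f:
    json.dump(articles, f, indent=2)
```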

Exploratory data analysis revealed increasing publication trends and prominent journals in multiple myeloma research. Semantic search and clustering analysis were employed to enhance the chatbot's ability to provide accurate and relevant information. The study evaluated the performance of their RAG model against two state-of-the-art language models, GPT-3.5-turbo-16k and GPT-4-32k, using a benchmark dataset of challenging multiple myeloma-related questions curated by expert oncologists. They developed an interactive dashboard to facilitate comparison, focusing on accuracy, relevance, and domain-specific knowledge. The RAG model demonstrated similar performance in accuracy and relevance but at a significantly lower cost, leveraging a compact embedding model (BAAI/bge-small-en-v1.5) and a Mistral Instruct 7B language model. One key advantage of the RAG model was its ability to mitigate hallucinations, providing truthful responses when relevant information was not found. The retrieval mechanism selected the top-k most relevant PubMed papers based on document chunk embeddings, ensuring transparency by providing users access to the source material. In terms of computational efficiency, the RAG model maintained competitive performance while reducing computational cost compared to larger models. 12
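For illustration, the top-k semantic retrieval step with the same compact embedding model can be sketched with the sentence-transformers library; the corpus snippets and query below are hypothetical.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

chunks = [
    "Daratumumab plus lenalidomide and dexamethasone improved PFS in newly diagnosed myeloma...",
    "CAR-T therapy targeting BCMA showed durable responses in relapsed/refractory disease...",
    "Bisphosphonates reduce skeletal-related events in patients with myeloma bone disease...",
]

# Embed the document chunks and the query, then rank by cosine similarity.
corpus_emb = model.encode(chunks, normalize_embeddings=True)
query_emb = model.encode("What is the evidence for anti-BCMA CAR-T in relapsed myeloma?",
                         normalize_embeddings=True)

hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {chunks[hit['corpus_id']][:60]}...")
```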

Another study introduced AtlasGPT, an LLM tailored for neurosurgery. AtlasGPT employs RAG techniques to ensure precise answers by integrating external data sources into its responses. The model's architecture allows for context-sensitive responses, catering to users with varying levels of medical knowledge. AtlasGPT's implementation is scalable and easily integrable, focusing on providing accurate and reliable outputs grounded in neurosurgical literature. 13

Information extraction from scientific literature

This section reviews the comparison of RAG-based methods with traditional human extraction techniques for extracting relevant information from scientific literature.

One study compared OpenAI's GPT-3.5 Turbo with conventional human extraction in terms of the concordance of extracted information and the time taken to retrieve relevant information from scientific articles on diabetic retinopathy (DR). Researchers randomly selected twenty papers on DR from PubMed and extracted information such as the country of study, significant risk factors of DR, inclusion and exclusion criteria, and odds ratio (OR) and 95% confidence interval (CI). The first researcher extracted the information, which was then checked by a second researcher. Discrepancies were resolved through discussions with a third researcher.

Using the OpenAI API, they invoked a question and answer (QA) model using GPT-3.5 Turbo to process the twenty papers as an entire batch of PDF files. Instructional prompts were used to query the same information from all articles. Concordance for each information extraction was calculated as the number of articles with accurate information extracted by GPT-3.5 Turbo divided by the total number of articles. The time taken for extraction was also assessed. GPT-3.5 Turbo was unable to extract information from three (15%) articles not in PDF format. For the remaining 17 (85%) papers, GPT-3.5 Turbo took 5 minutes compared to 1310 minutes by the researcher. Concordance between GPT-3.5 Turbo and manual extraction by the researcher was highest for the extraction of the country of study at 100%, 64.7% for significant risk factors of DR, 47.1% for inclusion and exclusion criteria, and 41.2% for OR and 95% CI. The concordance levels indicate the complexity associated with each prompt and the potential for improvement in AI-based information extraction methods (Figure 1). 14
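The concordance metric reported here is simply the per-field proportion of articles whose GPT-extracted value agreed with the manual extraction. A minimal sketch is given below; the exact-match rule is an assumption for illustration, whereas the study judged agreement by researcher review.

```python
def concordance(gpt_values: list[str], manual_values: list[str]) -> float:
    """Fraction of articles where the GPT extraction matches the manual extraction."""
    assert len(gpt_values) == len(manual_values)
    matches = sum(g.strip().lower() == m.strip().lower()
                  for g, m in zip(gpt_values, manual_values))
    return matches / len(manual_values)

# Illustrative only: with 17 readable papers, the reported figures correspond to
# 17/17 for country of study, roughly 11/17 for risk factors, 8/17 for eligibility
# criteria, and 7/17 for OR and 95% CI.
```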

Figure 1. Use of retrieval-augmented generation in medicine.

Challenges in using RAG

Several challenges were identified across the reviewed studies in applying RAG:

Knowledge deficiencies and outdated information

RAG models rely on the external knowledge sources provided to them. Limitations in the number, variety, and currency of these sources can lead to knowledge deficiencies. 7 For example, a liver disease-specific RAG model, LiVersa, provided incorrect justifications because the specific version of the guidelines it needed was not included in its dataset. 7 Similarly, the general training data of large language models might lack the latest research, which is a critical challenge in the rapidly evolving medical field. 8

Hallucinations and factual inaccuracies

Despite using external knowledge, RAG models can still generate incorrect or fabricated information, known as hallucinations.6,7 This is a significant concern in medical applications where accuracy is paramount and can potentially lead to patient harm.6,11

Complexity and heterogeneity of medical knowledge

Medical domains and the formats of clinical guidelines are often highly complex and heterogeneous.6,12 Guidelines can have varying structures, and crucial information might be located in different formats like text, tables, or flowcharts, making it difficult for RAG systems to interpret and retrieve relevant information accurately. 6

Ensuring relevance and contextual understanding

RAG models need to not only retrieve information but also understand its relevance to the specific query and the broader context. A generic LLM might not fully consider critical information such as imaging findings, might over-rely on clinical symptoms, or might provide superficial explanations lacking precise diagnoses. Ensuring that the retrieved documents are the most relevant and that the generator effectively uses this context is a continuous challenge.8,10

Data quality and bias

The quality of the data used for retrieval is crucial. Issues such as errors, missing information, or inherent biases in the data (e.g. reflecting disparities in healthcare for certain populations) can negatively impact the performance and fairness of RAG models.7,8,10

Evaluation and benchmarking

Evaluating the performance of RAG models in medicine is challenging. Standard NLP evaluation metrics might not fully capture the nuances of medical relevance, completeness, and contextual correctness. 6 The development of domain-specific benchmark datasets, like the one created for multiple myeloma questions or clinical scenarios, is essential but ongoing. Furthermore, expert physician oversight is often necessary for accurate evaluation.11,12

Transparency and explainability

The decision-making process of LLMs, even with RAG, can lack transparency, making it difficult to understand why a particular answer was generated. In medical contexts, being able to trace the reasoning back to specific evidence is crucial for trust and accountability. 8

Integration into clinical workflows

Successfully integrating RAG models into real-world clinical practice presents operational challenges. These include ensuring the system is user-friendly, does not increase cognitive load for clinicians, and has appropriate safeguards against system downtime.9,10,12

Ethical, legal, and regulatory considerations

Using AI, including RAG, in healthcare raises significant ethical, legal, and data privacy concerns. Compliance with regulations like HIPAA (in the context of PHI) and ensuring patient safety are paramount and require careful consideration in the development and deployment of these systems.7,8,11

Computational cost and efficiency

Utilizing large language models can be computationally expensive. Optimizing RAG architectures for efficiency and cost-effectiveness is important for widespread adoption in healthcare settings.9,10,12

Retrieval collapse

In some cases, the retrieval component of RAG models might “collapse” and consistently retrieve the same documents regardless of the input, leading the generator to potentially ignore the retrieved content. 5 While this was observed in tasks like story generation, the principle highlights a potential failure mode for retrieval in knowledge-intensive tasks as well (Figure 2).
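A simple diagnostic for this failure mode is to measure how concentrated the retrieved results are across many different queries; if one document dominates regardless of the input, the retriever has effectively collapsed. The sketch below is an illustrative check, not a method proposed in the cited work.

```python
from collections import Counter

def retrieval_collapse_score(retrieved_ids_per_query: list[list[str]]) -> float:
    """Fraction of all retrieval slots occupied by the single most frequent document.

    Values near 1.0 suggest the retriever returns the same document regardless of query.
    """
    counts = Counter(doc_id for ids in retrieved_ids_per_query for doc_id in ids)
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0

# Example: three queries, top-2 retrieval each; the document "guideline_3" dominates.
score = retrieval_collapse_score([
    ["guideline_3", "guideline_7"],
    ["guideline_3", "guideline_3"],
    ["guideline_3", "guideline_1"],
])
print(f"{score:.2f}")  # 0.67, a warning sign of collapse
```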

Figure 2. Challenges of using retrieval-augmented generation in medicine.

Beyond the areas explored in this study, RAG could be highly beneficial in several other aspects of medicine. These include personalized treatment planning, where RAG can integrate patient-specific data with the latest medical research to recommend tailored treatment options. In medical education, RAG can provide up-to-date learning resources and simulate interactive case studies, enhancing the training of healthcare professionals. Additionally, RAG can improve pharmacovigilance by monitoring and analyzing vast amounts of drug safety data to identify adverse effects more effectively. The potential of RAG extends to remote patient monitoring and telemedicine, where it can support real-time decision-making by accessing and synthesizing relevant clinical information.

We recommend further integration and exploration of RAG techniques in medical applications. Future research should focus on optimizing the retrieval mechanisms and embedding models to enhance the performance and efficiency of RAG-based systems. Developing domain-specific LLMs should be prioritized to cater to specialized medical needs. Collaboration between AI researchers and medical professionals is crucial to refine these models and ensure their clinical relevance. Comprehensive evaluation frameworks should be established to compare RAG-enhanced models with traditional methods, ensuring rigorous validation and continuous improvement.

The integration of RAG in medical applications represents a significant advancement in AI, offering enhanced capabilities for diagnostic accuracy, clinical decision support, and information retrieval. By leveraging external data sources, RAG-enhanced models provide more accurate and contextually relevant responses compared to traditional language models.

Conclusion

In conclusion, the integration of RAG with large language models represents a significant advancement in the application of AI in healthcare. This review highlights the promising utility of RAG across a range of medical tasks, including clinical decision support, diagnostic assistance, clinical trial eligibility screening, clinical information retrieval, chatbot development, and information extraction from scientific literature. These applications demonstrate clear improvements in accuracy, reliability, and efficiency when compared to base models and even human performance in certain contexts. However, despite these advantages, challenges remain—notably, the dependency of RAG systems on the quality, comprehensiveness, and recency of external knowledge sources. Inaccurate or outdated references can compromise the quality of generated responses, which is particularly concerning in high-stakes medical settings. Addressing these limitations will be critical to ensuring safe and effective implementation of RAG-powered tools in clinical practice. As the field evolves, future research should focus on establishing standardized evaluation frameworks, improving data integration methods, and ensuring transparency to foster greater trust among healthcare professionals and patients alike.

ORCID iDs: Omid Kohandel Gargari https://orcid.org/0000-0002-8182-0582

Gholamreza Habibi https://orcid.org/0009-0007-1693-9142

Statements and declarations

Author contributions/CRediT: Omid Kohandel Gargari: conceptualization, methodology, investigation, writing—original draft, and visualization. Gholamreza Habibi: conceptualization, writing—review and editing.

Funding: The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Joshi G, Jain A, Araveeti SR, et al. FDA-approved artificial intelligence and machine learning (AI/ML)-enabled medical devices: an updated landscape. Electronics (Basel) 2024; 13: 98.
2. Tyson A, Pasquini G, Spencer A, et al. 60% of Americans would be uncomfortable with provider relying on AI in their own health care. Pew Research Center, 2023.
3. Van Veen D, Van Uden C, Blankemeier L, et al. Clinical text summarization: adapting large language models can outperform human experts. Res Sq 2023; 3: rs-3483777.
4. Huang L, Yu W, Ma W, et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans Inf Syst 2025; 43: 1–55.
5. Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst 2020; 33: 9459–9474.
6. Kresevic S, Giuffrè M, Ajcevic M, et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. NPJ Digit Med 2024; 7: 102.
7. Ge J, Sun S, Owens J, et al. Development of a liver disease-specific large language model chat interface using retrieval augmented generation. Hepatology 2024; 80: 1158–1168.
8. S R, A R, J N, A F, F B, M R, et al. A retrieval-augmented chatbot based on GPT-4 provides appropriate differential diagnosis in gastrointestinal radiology: a proof of concept study. Eur Radiol Exp 2024; 8: 60.
9. Glicksberg BS, Timsina P, Patel D, et al. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J Am Med Inform Assoc 2024; 31: 1921–1928.
10. O U, J S, CJ M, MF O, MR T, M V, et al. Retrieval augmented generation enabled Generative Pre-trained Transformer 4 (GPT-4) performance for clinical trial screening. United States 2024; 2024: 2–8.
11. Zakka C, Shad R, Chaurasia A, et al. Almanac: retrieval-augmented language models for clinical medicine. NEJM AI 2024; 1. doi: 10.1056/aioa2300068.
12. Quidwai MA, Lagana A. A RAG chatbot for precision medicine of multiple myeloma. medRxiv 2024: 2024.03.14.24304293.
13. Hopkins BS, Carter B, Lord J, et al. Editorial. AtlasGPT: dawn of a new era in neurosurgery for intelligent care augmentation, operative planning, and performance. J Neurosurg 2024; 140: 1211–1214.
14. CCY G, NDA R, W R-C, R A, P R, J A, et al. Evaluating the OpenAI's GPT-3.5 Turbo's performance in extracting information from scientific articles on diabetic retinopathy. Syst Rev 2024; 13: 35.
