Abstract
Large language models (LLMs) excel at natural language processing (NLP) but struggle with the domain-specific complexities of electronic health records (EHRs). We demonstrate that retrieval-augmented generation (RAG) enhances LLMs for dietary supplement (DS) information extraction. Testing models such as Llama-3 with diverse retrievers on tasks including entity recognition and usage classification, we find that task-aligned retrieval outperforms reliance on model size or specialization. Smaller general-purpose models paired with optimized retrievers match or exceed specialized counterparts: structured retrieval aids complex tasks (e.g., triple extraction), while semantic retrieval improves classification. These results challenge the assumption that larger or domain-specific models are superior, emphasizing dynamic knowledge integration over brute-force scaling. This approach offers practical strategies for clinical NLP, enabling efficient EHR analysis without massive resources. Prioritizing retrieval strategies over model size advances tools for evidence-based healthcare, highlighting adaptability and cost-effectiveness in real-world medical applications.
Introduction
The rapid advancement of large language models (LLMs)1,2 has significantly reshaped natural language processing (NLP), driving progress across various domains.3,4 Models such as GPT-4,5 Llama-3,6 and Mistral7 have demonstrated remarkable capabilities in text comprehension,8 reasoning,9 and generation,10 setting new benchmarks in general-domain NLP. This progress has also propelled advancements in medical NLP,11–13 where LLMs are increasingly applied to clinical text processing,14,15 biomedical literature analysis,16 and knowledge discovery.17 One of the most critical applications in this domain is information extraction (IE),18,19 which enables the transformation of unstructured electronic health records (EHRs) into structured, actionable knowledge. Given the vast volume of clinical data generated daily, accurate IE from EHRs is essential for clinical decision support,20 drug safety monitoring,21 and biomedical research.22,23
Despite their strong performance on public NLP benchmarks, LLMs still exhibit significant limitations in processing EHRs.24,25 One major challenge arises from the private nature of EHRs,26 as they are rarely publicly available due to strict patient privacy regulations. This prevents LLMs from being trained directly on large-scale clinical datasets, limiting their exposure to real-world medical narratives. Additionally, EHRs are primarily authored by medical professionals—such as doctors and nurses—who employ highly specialized terminology, abbreviations, and complex sentence structures that differ from general-domain text.27 Moreover, medical knowledge is constantly being discovered and revised, and LLMs cannot always keep pace with the latest findings. As a result, even powerful LLMs often fail to extract meaningful insights from EHRs, necessitating additional strategies to enhance their adaptability.
The emergence of retrieval-augmented generation (RAG)28,29 has introduced a promising solution to this challenge by allowing LLMs to retrieve external knowledge dynamically during inference. Rather than relying solely on their pre-trained representations, RAG enables models to access relevant domain-specific information, example cases, or contextual cues that can significantly improve task performance. This paradigm shift—from direct problem-solving to knowledge-augmented reasoning30—effectively reduces the burden on LLMs and enhances their ability to generalize to unseen medical data.31 In the context of EHR processing, where training data is scarce and medical text is highly specialized, RAG presents a powerful technique to supplement LLMs with real-time access to task-relevant information.
Recent advancements combining RAG with LLMs have shown promise in EHR analysis, particularly for clinical document understanding,32 medical question answering,33 and information extraction.34,35 However, limited attention has been given to dietary supplement (DS)-related EHR processing. While our prior work RAMIE36 built a multi-task37 framework for DS-related EHR analysis, critical performance limitations persist in real-world clinical settings. Notably, surveys by the Council for Responsible Nutrition (CRN) indicate that 75% of Americans use dietary supplements,38,39 underscoring their widespread impact on health behaviors, drug interactions, and therapeutic outcomes. Despite this significance, effective LLM-based methods for extracting DS-related information from EHRs remain underdeveloped. This study bridges that gap by systematically evaluating RAG-enhanced LLMs for DS knowledge extraction, advancing foundational techniques for clinical text analysis.
To this end, we evaluate eight state-of-the-art LLMs on four dietary supplement datasets, utilizing retrieval-based augmentation techniques to enhance model performance. Our study systematically assesses how RAG influences LLMs’ ability to process DS-related EHRs, focusing on key information extraction tasks that are essential for structuring clinical knowledge. By analyzing the effectiveness of retrieval-augmented fine-tuning, we provide valuable insights into how LLMs can be optimized for EHR-based information extraction, offering guidance for future applications in medical NLP.
Methods
Tasks and datasets
This study focuses on the extraction of structured knowledge from clinical text. Specifically, we consider four fundamental tasks: named entity recognition (NER), relation extraction (RE), triple extraction (TE), and usage classification (UC).
Among these tasks, NER identifies dietary supplement mentions and their associated events, such as symptoms or diseases, within unstructured clinical narratives. RE identifies and categorizes the relationships between dietary supplements and events. TE combines the NER and RE tasks, extracting structured (supplement, relation, event) triples and thereby offering a more comprehensive evaluation of end-to-end information extraction ability. Finally, UC classifies dietary supplement usage status into four categories: start, continue, discontinue, and uncertain. Together, these tasks structure dietary supplement-related information within EHRs, enhancing its utility for downstream applications such as automated clinical documentation and evidence-based decision support. A worked example of all four tasks is sketched below.
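To make these definitions concrete, the sketch below shows the expected output format of each task for a single constructed sentence. The sentence, entity labels, and relation name are illustrative placeholders, not examples drawn from the datasets.

```python
# Hypothetical clinical sentence (constructed for illustration only).
sentence = "Patient started fish oil but reports it upsets her stomach."

# NER: dietary supplement and event mentions.
ner_output = [("fish oil", "Supplement"), ("upsets her stomach", "Event")]

# RE: the relation type for a given supplement-event pair.
re_output = [("fish oil", "upsets her stomach", "adverse_event")]

# TE: end-to-end (supplement, relation, event) triples.
te_output = [("fish oil", "adverse_event", "upsets her stomach")]

# UC: the supplement usage status (start / continue / discontinue / uncertain).
uc_output = "start"
```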
To ensure a practical evaluation, we utilize four dietary supplement-related datasets derived from real-world EHR corpora, each curated to support a specific information extraction task. All datasets underwent rigorous quality control, including inter-annotator agreement validation, and were de-identified to comply with patient privacy regulations. The details of data curation, along with examples, are provided in previous work.40,41 Institutional Review Board (IRB) approval was obtained for accessing these datasets. The dataset statistics are shown in Table 1.
Table 1.
Statistics of the four datasets.
| Dataset | Train | Validation | Test |
|---|---|---|---|
| Named entity recognition | 2,365 | 292 | 292 |
| Relation extraction | 3,964 | 464 | 464 |
| Triple extraction | 2,365 | 292 | 292 |
| Usage classification | 2,000 | 230 | 230 |
Retrieval-augmented generation integration
As shown in Fig. 1, the retrieval mechanism constructs a unified representation by concatenating each candidate sentence with its corresponding response, then scores candidates by the cosine similarity between this representation and the embedding of the input sentence. During the training phase, retrievers frequently return the input sentence itself, since its embedding similarity to the query is trivially high. This redundancy can lead the model to merely replicate the response from the example rather than engage in substantive analytical reasoning. To prevent this undesirable reliance on direct copying, we prohibit the retriever from selecting the input sentence itself during training; instead, it retrieves the most semantically similar sentence from the remaining dataset, ensuring exposure to diverse yet contextually relevant examples. During inference, by contrast, the retriever may select the most relevant example from the entire training corpus, as there is no risk of overlap between the input and the retrieved samples. The retrieved examples are then integrated into a structured prompt, which serves as additional context during instruction fine-tuning of the LLM (see the sketches below). To maximize retrieval efficacy, we employ three state-of-the-art retrieval models: MedCPT,42 Contriever,43 and BMRetriever.44 Their details are summarized in Table 2.
Figure 1.
Retrieval-augmented generation integrated with large language models.
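The retrieval step can be summarized in a minimal sketch, assuming example embeddings have already been computed by one of the retrievers below; the function and variable names are illustrative rather than the study's released code.

```python
import numpy as np

def retrieve_example(query_vec, corpus_vecs, corpus, exclude_idx=None):
    """Return the training example most similar to the query by cosine similarity.

    During training, exclude_idx masks out the input sentence itself so the
    model cannot simply copy the retrieved response; at inference it is None.
    """
    # Normalize embeddings so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    if exclude_idx is not None:  # training phase: forbid self-retrieval
        sims[exclude_idx] = -np.inf
    best = int(np.argmax(sims))
    return corpus[best], float(sims[best])
```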
Table 2.
An overview of selected retrievers.
| Retriever | Parameters | Pre-training method | Pre-training corpus |
|---|---|---|---|
| MedCPT | 108M | Contrastive pre-training | PubMed |
| Contriever | 108M | Unsupervised contrastive learning | Wikipedia and CCNet |
| BMRetriever | 410M | Unsupervised pre-training | PubMed, arXiv, MedRxiv, BioRxiv, etc. |
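The retrieved example is then placed into an instruction-style prompt for fine-tuning. The template below is an assumed format for illustration; the exact prompt wording used in our experiments is not reproduced here.

```python
def build_prompt(instruction: str, example_input: str, example_output: str, query: str) -> str:
    """Assemble a retrieval-augmented prompt from one retrieved example."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Retrieved example input:\n{example_input}\n"
        f"### Retrieved example output:\n{example_output}\n\n"
        f"### Input:\n{query}\n"
        f"### Output:\n"
    )
```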
Large language models
To establish a robust foundation for our study, we systematically evaluate eight state-of-the-art LLMs: Mistral-7B,7 Llama-2-7B,45 Llama-2-13B,45 Llama-3-8B,46 BioMistral-7B,47 PMC-Llama-13B,48 MedAlpaca-7B,49 and MedAlpaca-13B.49 The selected models encompass both general-domain and biomedical-domain LLMs, enabling a comprehensive assessment of their performance in dietary supplement-related information extraction.
Metrics
To evaluate the performance of the models across the dietary supplement-related information extraction tasks, we employ precision, recall, and F1-score as the primary evaluation metrics. Precision measures the proportion of correctly predicted instances among all predicted positive instances, reflecting the model’s ability to minimize false positives. Recall quantifies the proportion of correctly identified instances relative to the total number of ground truth instances, capturing the model’s effectiveness in identifying relevant information. F1-score, the harmonic mean of precision and recall, provides a balanced assessment of the model’s overall performance.
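For reference, the snippet below computes micro-averaged precision, recall, and F1 over per-sentence sets of extracted items; treating predictions and gold annotations as exact-match sets is a simplifying assumption for illustration.

```python
def micro_prf1(gold_sets, pred_sets):
    """Micro-averaged precision, recall, and F1 over per-sentence item sets."""
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))  # true positives
    fp = sum(len(p - g) for g, p in zip(gold_sets, pred_sets))  # false positives
    fn = sum(len(g - p) for g, p in zip(gold_sets, pred_sets))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# One sentence with two gold entities, one of which is predicted correctly:
print(micro_prf1([{("fish oil", "Supplement"), ("nausea", "Event")}],
                 [{("fish oil", "Supplement")}]))  # (1.0, 0.5, 0.667)
```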
Experimental setting
All experiments were conducted on NVIDIA A100 GPUs with 80GB of memory, ensuring sufficient computational resources for large-scale model fine-tuning and inference. For fine-tuning, we adopted LoRA (Low-Rank Adaptation),50 a parameter-efficient adaptation technique that enables effective specialization of large language models while significantly reducing memory overhead. The rank parameter was set to 64, with an alpha value of 32, and a dropout rate of 0.1, ensuring robust adaptation without overfitting. Fine-tuning was conducted using the AdamW optimizer,51 with a learning rate of 1e-5.
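A minimal sketch of this configuration using the Hugging Face peft and transformers libraries is shown below; the base checkpoint and optimizer wiring are assumptions for illustration, as we report only the rank, alpha, dropout, optimizer, and learning rate above.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# The base checkpoint is an assumption; any of the eight evaluated LLMs applies.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=64,              # LoRA rank, as reported above
    lora_alpha=32,     # scaling factor alpha
    lora_dropout=0.1,  # dropout on the LoRA layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable

# Fine-tuning uses AdamW with a learning rate of 1e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```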
Results
The systematic evaluation of retrieval-augmented language models across four biomedical information extraction tasks reveals nuanced performance patterns tied to task complexity, model architecture, and retrieval methodology, as detailed in Tables 3-6.
Table 3.
Performance of retrieval-augmented LLMs on dietary supplements for the named entity recognition task.
| Model | Zero-shot |  |  | Random |  |  | MedCPT |  |  | Contriever |  |  | BMRetriever |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 |
| BioMistral-7B | 86.01 | 85.89 | 85.95 | 84.75 | 83.45 | 84.09 | 88.25 | 83.59 | 85.86 | 85.93 | 80.67 | 83.21 | 87.39 | 84.49 | 85.91 |
| Llama-2-7B | 85.55 | 81.66 | 83.56 | 85.46 | 79.78 | 82.52 | 85.17 | 81.16 | 83.12 | 83.86 | 80.95 | 82.38 | 84.99 | 83.45 | 84.21 |
| Llama-2-13B | 86.40 | 82.34 | 84.32 | 84.70 | 81.30 | 82.97 | 89.81 | 84.14 | 86.88 | 88.28 | 84.30 | 86.24 | 88.21 | 82.29 | 85.14 |
| Llama-3-8B | 86.18 | 84.38 | 85.27 | 86.64 | 85.67 | 86.15 | 87.37 | 83.73 | 85.51 | 87.68 | 84.14 | 85.87 | 88.09 | 83.78 | 85.88 |
| MedAlpaca-7B | 91.51 | 85.81 | 88.57 | 84.22 | 81.64 | 82.91 | 89.83 | 84.66 | 87.02 | 87.43 | 82.20 | 84.73 | 87.69 | 81.22 | 84.33 |
| MedAlpaca-13B | 85.01 | 82.90 | 83.94 | 82.68 | 78.83 | 80.71 | 82.60 | 81.34 | 81.96 | 78.55 | 78.32 | 78.93 | 82.39 | 81.90 | 82.14 |
| Mistral-7B | 84.01 | 84.83 | 84.42 | 85.63 | 85.51 | 85.57 | 86.63 | 84.35 | 85.47 | 85.73 | 85.73 | 85.73 | 86.34 | 85.26 | 85.79 |
| PMC-Llama-13B | 85.37 | 82.90 | 84.11 | 83.61 | 81.99 | 82.80 | 85.43 | 82.80 | 84.09 | 85.06 | 84.70 | 84.88 | 83.08 | 84.24 | 83.66 |
Table 3 summarizes NER performance across models and retrieval strategies. MedAlpaca-7B achieved the highest zero-shot F1 score (88.57%), outperforming larger models such as Llama-2-13B (84.32%) and PMC-Llama-13B (84.11%). Retrieval augmentation yielded mixed results: BMRetriever improved Mistral-7B (85.79% F1) and Llama-3-8B (85.88%), and MedCPT lifted Llama-2-13B to the best retrieval-augmented score (86.88%), but the same retriever degraded MedAlpaca-13B (81.96%). Notably, MedAlpaca-13B exhibited performance drops under every retrieval-augmented setting, suggesting potential overfitting in domain-adapted models.
As shown in Table 4, BioMistral-7B with Contriever achieved the highest F1 score (93.95%) on the RE task, surpassing its zero-shot baseline (93.09%). General-purpose models such as Mistral-7B and Llama-3-8B demonstrated robustness, achieving 93.30% and 91.79% F1, respectively, with BMRetriever. Conversely, the domain-adapted MedAlpaca models generally underperformed, with MedAlpaca-7B peaking at 89.85% F1 across retrievers. MedCPT retrieval paradoxically degraded MedAlpaca-7B's performance (87.26% vs. a zero-shot baseline of 89.03%), highlighting misalignment between biomedical retrievers and specialized models.
Table 4.
Performance of retrieval-augmented LLMs on dietary supplements for the relation extraction task.
| Model | Zero-shot |  |  | Random |  |  | MedCPT |  |  | Contriever |  |  | BMRetriever |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 |
| BioMistral-7B | 93.09 | 93.09 | 93.09 | 93.30 | 93.30 | 93.30 | 92.44 | 92.44 | 92.44 | 93.95 | 93.95 | 93.95 | 91.36 | 91.36 | 91.36 |
| Llama-2-7B | 83.37 | 83.37 | 83.37 | 88.12 | 88.12 | 88.12 | 88.34 | 88.34 | 88.34 | 91.58 | 91.58 | 91.58 | 89.63 | 89.63 | 89.63 |
| Llama-2-13B | 92.66 | 92.66 | 92.66 | 93.09 | 93.09 | 93.09 | 91.79 | 91.79 | 91.79 | 92.87 | 92.87 | 92.87 | 92.44 | 92.44 | 92.44 |
| Llama-3-8B | 93.61 | 93.61 | 93.61 | 92.44 | 92.44 | 92.44 | 90.06 | 90.06 | 90.06 | 93.52 | 93.52 | 93.52 | 91.79 | 91.79 | 91.79 |
| MedAlpaca-7B | 89.03 | 89.03 | 89.03 | 88.77 | 88.77 | 88.77 | 87.26 | 87.26 | 87.26 | 89.85 | 89.85 | 89.85 | 86.39 | 86.39 | 86.39 |
| MedAlpaca-13B | 89.42 | 89.42 | 89.42 | 85.31 | 85.31 | 85.31 | 86.44 | 86.44 | 86.44 | 91.24 | 91.24 | 91.24 | 87.07 | 87.07 | 87.07 |
| Mistral-7B | 92.66 | 92.66 | 92.66 | 92.44 | 92.44 | 92.44 | 91.58 | 91.58 | 91.58 | 93.52 | 93.52 | 93.52 | 93.30 | 93.30 | 93.30 |
| PMC-Llama-13B | 90.50 | 90.50 | 90.50 | 91.36 | 91.36 | 91.36 | 90.28 | 90.28 | 90.28 | 92.44 | 92.44 | 92.44 | 91.36 | 91.36 | 91.36 |
Table 5 reveals that TE, the most complex task, exhibited the lowest F1 scores overall. Mistral-7B with BMRetriever achieved the highest F1 (77.12%), outperforming the domain-adapted BioMistral-7B (75.74%). Model scale did not guarantee superiority: Llama-3-8B with BMRetriever (75.80%) edged out Llama-2-13B (75.47%). MedAlpaca-13B showed severe degradation under MedCPT (64.52% F1), while random retrieval caused performance drops for every model (e.g., MedAlpaca-7B: 67.17% vs. a zero-shot baseline of 75.36%). These results underscore TE's sensitivity to compositional reasoning and retrieval relevance.
Table 5.
Performance of retrieval-augmented LLMs on dietary supplements for the triple extraction task.
| Model | Zero-shot |  |  | Random |  |  | MedCPT |  |  | Contriever |  |  | BMRetriever |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 |
| BioMistral-7B | 75.00 | 68.48 | 71.59 | 74.26 | 65.87 | 69.82 | 70.43 | 67.67 | 69.02 | 72.77 | 71.26 | 72.01 | 75.06 | 76.44 | 75.74 |
| Llama-2-7B | 72.09 | 64.57 | 68.12 | 69.66 | 63.36 | 66.36 | 69.71 | 63.46 | 66.44 | 70.07 | 62.88 | 66.28 | 71.90 | 70.90 | 71.39 |
| Llama-2-13B | 73.01 | 69.66 | 71.29 | 73.83 | 65.43 | 69.37 | 76.16 | 68.34 | 72.04 | 74.82 | 70.34 | 72.51 | 76.55 | 74.42 | 75.47 |
| Llama-3-8B | 78.29 | 70.39 | 74.13 | 73.60 | 66.06 | 69.63 | 74.10 | 67.91 | 70.87 | 72.95 | 69.78 | 71.33 | 77.75 | 73.95 | 75.80 |
| MedAlpaca-7B | 81.25 | 70.27 | 75.36 | 70.56 | 65.94 | 67.17 | 74.12 | 68.78 | 71.35 | 66.28 | 65.82 | 66.05 | 68.89 | 71.26 | 70.06 |
| MedAlpaca-13B | 76.05 | 67.78 | 71.68 | 68.42 | 67.82 | 68.12 | 65.42 | 63.63 | 64.52 | 69.97 | 62.87 | 66.23 | 65.37 | 64.48 | 64.92 |
| Mistral-7B | 74.88 | 73.33 | 74.09 | 75.13 | 67.13 | 70.90 | 75.12 | 67.03 | 70.85 | 73.70 | 69.07 | 71.31 | 77.30 | 76.94 | 77.12 |
| PMC-Llama-13B | 70.46 | 63.82 | 66.97 | 67.18 | 65.43 | 66.30 | 70.63 | 70.47 | 70.55 | 70.30 | 70.14 | 70.22 | 69.23 | 71.16 | 70.18 |
UC performance (Table 6) varied widely, with Llama-2-13B achieving the best F1 (94.76%) using Contriever. Random retrieval severely impacted MedAlpaca-13B (69.43% vs. a zero-shot baseline of 90.83%), while BMRetriever stabilized general-purpose models such as Mistral-7B (90.83%). Domain adaptation again proved inconsistent: MedAlpaca-7B excelled in the zero-shot setting (92.57%) but faltered with most retrievers (e.g., 89.95% with BMRetriever), whereas BioMistral-7B remained robust (91.27% F1 with BMRetriever). The task's relative simplicity enabled high baselines, but well-chosen retrieval strategies still provided gains.
Table 6.
Performance of retrieval-augmented LLMs on dietary supplements for the usage classification task.
| Model | Zero-shot |  |  | Random |  |  | MedCPT |  |  | Contriever |  |  | BMRetriever |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 |
| BioMistral-7B | 90.79 | 90.79 | 90.79 | 89.96 | 89.96 | 89.96 | 89.52 | 89.52 | 89.52 | 89.08 | 89.08 | 89.08 | 91.27 | 91.27 | 91.27 |
| Llama-2-7B | 89.96 | 89.96 | 89.96 | 89.08 | 89.08 | 89.08 | 89.08 | 89.08 | 89.08 | 90.39 | 90.39 | 90.39 | 88.64 | 88.64 | 88.64 |
| Llama-2-13B | 91.70 | 91.70 | 91.70 | 89.96 | 89.96 | 89.96 | 90.39 | 90.39 | 90.39 | 94.76 | 94.76 | 94.76 | 92.14 | 92.14 | 92.14 |
| Llama-3-8B | 91.01 | 91.01 | 91.01 | 90.39 | 90.39 | 90.39 | 89.52 | 89.52 | 89.52 | 93.89 | 93.89 | 93.89 | 88.64 | 88.64 | 88.64 |
| MedAlpaca-7B | 92.57 | 92.57 | 92.57 | 87.77 | 87.77 | 87.77 | 89.96 | 89.96 | 89.96 | 93.45 | 93.45 | 93.45 | 89.95 | 89.95 | 89.95 |
| MedAlpaca-13B | 90.83 | 90.83 | 90.83 | 69.43 | 69.43 | 69.43 | 67.11 | 67.11 | 67.11 | 81.17 | 81.17 | 81.17 | 72.05 | 72.05 | 72.05 |
| Mistral-7B | 89.08 | 89.08 | 89.08 | 89.08 | 89.08 | 89.08 | 88.64 | 88.64 | 88.64 | 92.14 | 92.14 | 92.14 | 90.83 | 90.83 | 90.83 |
| PMC-Llama-13B | 89.08 | 89.08 | 89.08 | 89.52 | 89.52 | 89.52 | 91.70 | 91.70 | 91.70 | 92.14 | 92.14 | 92.14 | 90.39 | 90.39 | 90.39 |
Retrieval augmentation strategies yielded divergent impacts across tasks. BMRetriever emerged as the most consistently beneficial approach, particularly for TE, where it improved Mistral-7B's performance from 74.09% (zero-shot) to 77.12% (+3.03 points). Contriever demonstrated task-specific superiority, achieving the best UC result (94.76% F1 for Llama-2-13B) while underperforming in TE for domain-adapted models such as MedAlpaca-7B (66.05% vs. a zero-shot 75.36%). Notably, MedCPT retrieval paradoxically degraded performance in certain configurations, reducing MedAlpaca-13B's UC score by 23.72 percentage points relative to its zero-shot baseline.
Architectural analysis revealed a scale-performance decoupling: 13B-parameter models failed to systematically outperform their 7B counterparts, with MedAlpaca-7B matching or exceeding MedAlpaca-13B's TE performance in most retrieval conditions. Medical domain adaptation produced mixed outcomes: while MedAlpaca-7B achieved the highest zero-shot NER F1 (88.57%), its 13B variant suffered severe degradation under retrieval-augmented conditions (64.52% TE F1 with MedCPT). Strikingly, the general-purpose Mistral-7B with BMRetriever outperformed all domain-adapted models in TE (77.12% vs. BioMistral-7B's 75.74%), suggesting limitations in current medical adaptation methodologies for compositional tasks.
Retrieval effectiveness scaled with task complexity: TE showed the largest average gains (+4.23% F1 across methods), whereas RE improved minimally (+0.91%). This pattern held across model architectures, though with notable exceptions: Contriever enhanced UC performance by up to 3.06 percentage points (PMC-Llama-13B) while degrading MedAlpaca-13B's TE score by 5.45 points. Random example injection universally underperformed engineered retrieval, with particularly severe impacts in some cases, most notably a 21.40-point UC F1 drop for MedAlpaca-13B.
Discussion
The experimental findings provide critical insights into the interplay between RAG, model architecture, and task complexity in dietary supplement (DS)-related information extraction. Below, we contextualize these results within broader research trends, discuss their implications, and outline limitations and future directions.
Our results demonstrate that RAG’s effectiveness varies significantly across tasks, as shown in Fig. 2. While BMRetriever consistently improved performance for compositional tasks like triple extraction, Contriever excelled in usage classification, likely due to its ability to capture fine-grained semantic patterns. This indicates that retrieval mechanisms must align with task requirements—compositional reasoning benefits from structured context (e.g., BMRetriever’s example-based prompts), while classification tasks thrive on semantic similarity (e.g., Contriever’s dense embeddings). Notably, MedCPT underperformed in certain scenarios, particularly for domain-adapted models like MedAlpaca-13B, suggesting that biomedical-specific retrievers may inadvertently introduce noise when combined with already specialized models.
Figure 2.
Comparison of different retrievers for each model. The stars indicate the highest F1 score for each task within one model.
Contrary to expectations, medical domain adaptation did not universally enhance performance. While MedAlpaca-7B achieved superior zero-shot NER accuracy (88.57% F1), its 13B variant suffered degradation under retrieval-augmented settings. This highlights potential overfitting risks in domain-specific models when exposed to external knowledge. Strikingly, general-purpose models like Mistral-7B outperformed domain-adapted counterparts in TE when paired with BMRetriever (77.12% vs. 75.74%), indicating that current medical adaptation strategies may inadequately address compositional reasoning. Future work should explore hybrid approaches that balance domain specialization with flexible retrieval mechanisms.
The lack of systematic superiority of larger models (e.g., Llama-2-13B vs. Llama-2-7B) challenges the conventional assumption that model scale guarantees performance gains. This decoupling is most evident in TE, where task complexity (combining NER and RE) amplified variability across architectures. The positive correlation between task complexity and retrieval efficacy (a +4.23% F1 gain for TE vs. +0.91% for RE) underscores the need for task-aware retrieval strategies. For instance, UC's high performance ceiling (94.76% F1) suggests that simpler tasks saturate quickly, leaving limited room for improvement, while compositional tasks demand more sophisticated augmentation.
Finally, as shown in Fig. 3, the heat map of average F1 scores across the four tasks indicates that Contriever and BMRetriever are better suited to dietary supplement information extraction than the other methods. Among the eight LLMs evaluated, Llama-3-8B and Mistral-7B are the best-performing models for extracting dietary supplement information.
Figure 3.
Heat map of average F1 scores over the four tasks for different retrievers.
Conclusion
This study highlights the efficacy of RAG in improving LLMs for extracting dietary supplement-related information from clinical narratives. The results reveal that the success of RAG hinges on aligning retrieval strategies with task requirements: certain retrievers excel in complex reasoning tasks, while others are better suited for classification-oriented objectives. Notably, smaller general-purpose models, when paired with tailored retrievers, often outperform larger or domain-specialized counterparts, challenging the assumption that model scale or domain adaptation alone guarantees superior performance. These findings underscore the importance of strategic retrieval integration over reliance on model size or pre-existing domain knowledge, offering a pathway to enhance structured clinical data analysis and support evidence-based healthcare decision-making.
Acknowledgements
This work was supported by the National Institutes of Health’s National Center for Complementary and Integrative Health under grant numbers R01AT009457 and U01AT012871, the National Institute on Aging under grant number R01AG078154, the National Cancer Institute under grant number R01CA287413, the National Institute of Diabetes and Digestive and Kidney Diseases under grant number R01DK115629, and the National Institute on Minority Health and Health Disparities under grant number 1R21MD019134-01.
Competing Interests
The authors declare no competing interests.
References
- 1. Zhao WX, Zhou K, Li J, et al. A survey of large language models. arXiv preprint arXiv:2303.18223. 2023.
- 2. Chang Y, Wang X, Wang J, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology. 2024;15:1–45.
- 3. Carl N, Schramm F, Haggenmüller S, et al. Large language model use in clinical oncology. NPJ Precision Oncology. 2024;8:240. doi: 10.1038/s41698-024-00733-4.
- 4. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nature Medicine. 2023;29:1930–40.
- 5. Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023.
- 6. Grattafiori A, Dubey A, Jauhri A, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. 2024.
- 7. Jiang AQ, Sablayrolles A, Mensch A, et al. Mistral 7B. arXiv preprint arXiv:2310.06825. 2023.
- 8. Cheng D, Huang S, Wei F. Adapting large language models to domains via reading comprehension. arXiv preprint arXiv:2309.09530. 2023.
- 9. Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems. 2022;35:24824–37.
- 10. Liu J, Xia CS, Wang Y, Zhang L. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems. 2023;36:21558–72.
- 11. Zhan Z, Zhou S, Zhou H, et al. An evaluation of DeepSeek models in biomedical natural language processing. arXiv preprint arXiv:2503.00624. 2025.
- 12. Zhou S, Xu Z, Zhang M, et al. Large language models for disease diagnosis: a scoping review. arXiv preprint arXiv:2409.00097. 2024.
- 13. Li M, Zhan Z, Yang H, Xiao Y, Huang J, Zhang R. Benchmarking retrieval-augmented large language models in biomedical NLP: application, robustness, and self-awareness. arXiv preprint arXiv:2405.08151. 2024.
- 14. Yang X, Chen A, PourNejatian N, et al. A large language model for electronic health records. NPJ Digital Medicine. 2022;5:194. doi: 10.1038/s41746-022-00742-2.
- 15. Zhan Z, Zhou S, Zhou H, Liu Z, Zhang R. EPEE: towards efficient and effective foundation models in biomedicine. arXiv preprint arXiv:2503.02053. 2025.
- 16. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620:172–80. doi: 10.1038/s41586-023-06291-2.
- 17. Thapa S, Adhikari S. ChatGPT, Bard, and large language models for biomedical research: opportunities and pitfalls. Annals of Biomedical Engineering. 2023;51:2647–51. doi: 10.1007/s10439-023-03284-0.
- 18. Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. In: Goldberg Y, Kozareva Z, Zhang Y, editors. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics; 2022. pp. 1998–2022. doi: 10.18653/v1/2022.emnlp-main.130. Available from: https://aclanthology.org/2022.emnlp-main.130/
- 19. Tran H, Yang Z, Yao Z, Yu H. BioInstruct: instruction tuning of large language models for biomedical natural language processing. Journal of the American Medical Informatics Association. 2024;31:1821–32. doi: 10.1093/jamia/ocae122.
- 20. Ye J, Woods D, Jordan N, Starren J. The role of artificial intelligence for the application of integrating electronic health records and patient-generated data in clinical decision support. AMIA Summits on Translational Science Proceedings. 2024;2024:459.
- 21. Davis SE, Zabotka L, Desai RJ, et al. Use of electronic health record data for drug safety signal identification: a scoping review. Drug Safety. 2023;46:725–42. doi: 10.1007/s40264-023-01325-0.
- 22. Tang AS, Woldemariam SR, Miramontes S, Norgeot B, Oskotsky TT, Sirota M. Harnessing EHR data for health research. Nature Medicine. 2024;30:1847–55.
- 23. Oliver D, Arribas M, Perry BI, et al. Using electronic health records to facilitate precision psychiatry. Biological Psychiatry. 2024.
- 24. Li L, Zhou J, Gao Z, et al. A scoping review of using large language models (LLMs) to investigate electronic health records (EHRs). arXiv preprint arXiv:2405.03066. 2024.
- 25. Mess SA, Mackey AJ, Yarowsky DE. Artificial intelligence scribe and large language model technology in healthcare documentation: advantages, limitations, and recommendations. Plastic and Reconstructive Surgery–Global Open. 2025;13:e6450. doi: 10.1097/GOX.0000000000006450.
- 26. Tertulino R, Antunes N, Morais H. Privacy in electronic health records: a systematic mapping study. Journal of Public Health. 2024;32:435–54.
- 27. Evans RS. Electronic health records: then, now, and in the future. Yearbook of Medical Informatics. 2016;25:S48–S61.
- 28. Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems. 2020;33:9459–74.
- 29. Chen J, Lin H, Han X, Sun L. Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence. 2024;38(16):17754–62.
- 30. Mondal D, Modi S, Panda S, Singh R, Rao GS. KAM-CoT: knowledge augmented multimodal chain-of-thoughts reasoning. Proceedings of the AAAI Conference on Artificial Intelligence. 2024;38(17):18798–806.
- 31. Kim K, Cho K, Jang R, et al. Updated primer on generative artificial intelligence and large language models in medical imaging for medical professionals. Korean Journal of Radiology. 2024;25:224. doi: 10.3348/kjr.2023.0818.
- 32. Zhan Z, Wang J, Zhou S, Deng J, Zhang R. MMRAG: multi-mode retrieval-augmented generation with large language models for biomedical in-context learning. arXiv preprint arXiv:2502.15954. 2025.
- 33. Hou Y, Zhang R. Enhancing dietary supplement question answer via retrieval-augmented generation (RAG) with LLM. medRxiv. 2024.
- 34. Zhu Y, Ren C, Wang Z, et al. EMERGE: integrating RAG for improved multimodal EHR predictive modeling. arXiv preprint arXiv:2406.00036. 2024.
- 35. Ng KKY, Matsuba I, Zhang PC. RAG in health care: a novel framework for improving communication and decision-making by addressing LLM limitations. NEJM AI. 2025;2:AIra2400380.
- 36. Zhan Z, Zhou S, Li M, Zhang R. RAMIE: retrieval-augmented multi-task information extraction with large language models on dietary supplements. Journal of the American Medical Informatics Association. 2025:ocaf002.
- 37. Zhan Z, Zhang R. Towards better multi-task learning: a framework for optimizing dataset combinations in large language models. arXiv preprint arXiv:2412.11455. 2024.
- 38. Council for Responsible Nutrition. 2023 consumer survey on dietary supplements. 2023. Accessed March 12, 2025. Available from: https://www.crnusa.org/2023survey.
- 39. Council for Responsible Nutrition. CRN survey shows consistent supplement usage with increase of specialty product use over time. 2024. Accessed March 12, 2025. Available from: https://crnusa.org/newsroom/crnsurvey-shows-consistent-supplement-usage-increase-specialty-product-use-overtime.
- 40. Fan Y, Zhang R. Using natural language processing methods to classify use status of dietary supplements in clinical notes. BMC Medical Informatics and Decision Making. 2018;18:15–22. doi: 10.1186/s12911-018-0596-8.
- 41. Fan Y, Zhou S, Li Y, Zhang R. Deep learning approaches for extracting adverse events and indications of dietary supplements from clinical text. Journal of the American Medical Informatics Association. 2021;28:569–77. doi: 10.1093/jamia/ocaa218.
- 42. Jin Q, Kim W, Chen Q, et al. MedCPT: contrastive pre-trained transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics. 2023;39:btad651. doi: 10.1093/bioinformatics/btad651.
- 43. Izacard G, Caron M, Hosseini L, et al. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118. 2021.
- 44. Xu R, Shi W, Yu Y, et al. BMRetriever: tuning large language models as better biomedical text retrievers. arXiv preprint arXiv:2404.18443. 2024.
- 45. Touvron H, Martin L, Stone K, et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023.
- 46. Meta-Llama 3 8B on Hugging Face. 2024. Available from: https://huggingface.co/meta-llama/Meta-Llama-3-8B
- 47. Labrak Y, Bazoge A, Morin E, Gourraud PA, Rouvier M, Dufour R. BioMistral: a collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373. 2024.
- 48. Wu C, Lin W, Zhang X, Zhang Y, Xie W, Wang Y. PMC-LLaMA: toward building open-source language models for medicine. Journal of the American Medical Informatics Association. 2024:ocae045.
- 49. Han T, Adams LC, Papaioannou JM, et al. MedAlpaca–an open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:2304.08247. 2023.
- 50. Hu EJ, Shen Y, Wallis P, et al. LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations. 2022. Available from: https://openreview.net/forum?id=nZeVKeeFYf9.
- 51. Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. 2017.