GigaScience. 2026 Jan 19;15:giag007. doi: 10.1093/gigascience/giag007

Federated knowledge retrieval elevates large language model performance on biomedical benchmarks

Janet Joy 1, Andrew I Su 2
PMCID: PMC12888809  PMID: 41553353

Abstract

Background

Large language models (LLMs) have significantly advanced natural language processing in biomedical research; however, their reliance on implicit, statistical representations often results in factual inaccuracies or hallucinations, a serious concern in high-stakes biomedical contexts.

Results

To overcome these limitations, we developed BioThings Explorer-Retrieval-Augmented Generation (BTE-RAG), a retrieval-augmented generation framework that integrates the reasoning capabilities of advanced language models with explicit mechanistic evidence sourced from BTE, an API federation of more than sixty authoritative biomedical knowledge sources. We systematically evaluated BTE-RAG against traditional LLM-only methods across three benchmark datasets that we created from DrugMechDB. These datasets specifically targeted gene-centric mechanisms (798 questions), metabolite effects (201 questions), and drug–biological process relationships (842 questions). On the gene-centric task, BTE-RAG increased accuracy from 51.0% to 75.8% for GPT-4o mini and from 69.8% to 78.6% for GPT-4o. On metabolite-focused questions, the proportion of responses with cosine similarity scores of at least 0.90 rose by 82% for GPT-4o mini and 77% for GPT-4o. While overall accuracy was consistent on the drug–biological process benchmark, the retrieval method enhanced response concordance, producing a greater than 10% increase in high-agreement answers (from 129 to 144) with GPT-4o. We additionally evaluated BTE-RAG alongside GeneGPT-based models on the GeneTuring gene–disease association benchmark and on our mechanistic gene benchmark, demonstrating that the BTE-RAG layer consistently improves accuracy relative to alternative approaches.

Conclusion

Federated knowledge retrieval provides transparent improvements in accuracy for LLMs, establishing BTE-RAG as a valuable and practical tool for mechanistic exploration and translational biomedical research.

Keywords: Retrieval-augmented generation (RAG), large language models (LLMs), biomedical knowledge graphs, federated knowledge retrieval, mechanistic reasoning, DrugMechDB benchmarking, BioThings Explorer, hallucination mitigation

Introduction

Large language models (LLMs) have rapidly advanced the state of natural-language processing, reaching or surpassing expert performance across a wide range of biomedical tasks, including cell type annotation, protein-structure prediction, and automated synthesis of clinical-trial results [1–6]. However, the underlying generative methodology of these models, which sequentially predict tokens based on statistical patterns learned from massive text corpora, renders them susceptible to hallucinations, defined as outputs that are syntactically fluent yet factually incorrect [7, 8]. Such inaccuracies pose significant risks in biomedicine, where even minor errors can misdirect research efforts, delay critical therapeutic discoveries, or compromise patient safety [7, 9–11]. Indeed, recent assessments underscore that hallucination rates remain too high for safe and effective deployment in clinical and research-intensive environments [12, 13].

Efforts to mitigate these hallucinations through domain-specific pre-training and prompt engineering have yielded only incremental improvements, as these approaches continue to embed knowledge implicitly within opaque model parameters and fail to reliably surface evidence provenance [14–16]. Retrieval-augmented generation (RAG) has emerged as a promising solution, explicitly grounding model-generated responses by dynamically incorporating external, verifiable evidence into prompts [17–19]. Within biomedical question-answering contexts, RAG approaches consistently reduce hallucinations and elevate factual accuracy compared to parameter-only models. Nonetheless, the efficacy of RAG hinges critically on the precision, comprehensiveness, and currency of the retrieved contextual evidence [20–22].

Knowledge graphs (KGs) are particularly compelling resources for RAG because they explicitly represent biological entities and their relationships, support multi-hop mechanistic reasoning, and maintain persistent identifiers that simplify provenance tracking [23–26]. Yet most biomedical KGs are tuned to a narrow slice of biology (for example, protein–protein interactions) or require extensive curation to remain current, limiting their utility for cross-domain mechanistic reasoning. To address these challenges, BioThings Explorer (BTE) integrates and federates 61 authoritative biomedical APIs into a continuously updated meta-knowledge graph that encompasses genes, pathways, drugs, diseases, phenotypes, and more [27]. The API-centric framework of BTE returns structured JSON triples annotated with semantic types and evidence citations from reputable biomedical databases such as Gene Ontology, DrugBank, and PubMed Central using the Translator Reasoner API (TRAPI) specification [28–30].

Here, we introduce BTE-RAG (BioThings Explorer-Retrieval-Augmented Generation), a novel framework that integrates the conversational fluency and reasoning capabilities of advanced LLMs with the explicit, multi-domain mechanistic knowledge captured by BTE. BTE-RAG dynamically executes targeted, query-focused graph traversals to retrieve concise, mechanistically pertinent evidence, formulates this evidence into declarative context statements, and augments model prompts accordingly.

To rigorously assess the performance of BTE-RAG in biomedical question answering, we systematically created three specialized benchmark datasets from DrugMechDB, a curated knowledge base containing 5,666 expert-annotated mechanistic pathways with literature validation [31]. These datasets consist of gene-centric (n = 798), metabolite-centric (n = 201), and drug-centric (n = 842) question–answer pairs, each explicitly reflecting the causal flow from drug through intermediate biological nodes to disease outcomes. Across all three DrugMechDB-derived benchmarks, BTE-RAG robustly improves factual grounding and accelerates convergence to correct responses across diverse biomedical entities relative to an LLM-only baseline.

Collectively, these findings establish BTE-RAG as a powerful, practical tool for reducing hallucination risks and enhancing mechanistic clarity, significantly advancing the transparency, reliability, and utility of language model-driven biomedical discovery and clinical decision-making.

Materials and methods

BTE-RAG framework and baseline comparison

The BTE-RAG framework systematically compares two distinct inference routes to evaluate the impact of structured, mechanistic context on LLM outputs (Fig. 1A). The first inference route, labeled “LLM-only,” directly submits user-generated questions to the language model without external context augmentation. The second route, labeled “BTE-RAG,” integrates structured mechanistic evidence retrieved from BTE prior to submitting an enriched, evidence-supported prompt to the same language model. This dual-path design allows rigorous evaluation of how explicitly retrieved context influences both answer accuracy and the factual grounding of model-generated responses. The BTE-RAG architecture comprises three key phases: entity recognition, knowledge-graph-based retrieval via BTE, and generative inference utilizing context-augmented LLM prompting.

Figure 1.

Figure 1

Retrieval-augmented generation workflow and derivation of mechanistic evaluation benchmarks. (A) Schematic of the BTE-RAG pipeline, which augments LLM responses with context retrieved from the BioThings Explorer knowledge graph. In the LLM-only pathway, the model generates a response using only the input question. In contrast, BTE-RAG operates in two phases: a retrieval phase, where relevant entities are extracted from the question and queried against BTE to collect mechanistically relevant subject–predicate–object triples, and a generative phase, where this curated context is appended to the input question and passed to the LLM. The resulting outputs (LLM-only or BTE-RAG) can be directly compared to assess the impact of knowledge-augmented generation. (B) Construction of benchmark datasets from DrugMechDB, a curated biomedical knowledge graph of drug–disease mechanisms. Directed paths connecting a drug to a disease were mined and transformed into structured questions targeting different mechanistic facets: (i) gene nodes (mechanistic gene benchmark), (ii) biochemical entities or metabolites (metabolite benchmark), and (iii) drug–biological process–disease paths (drug benchmark). Each benchmark provides paired questions and gold-standard labels for rigorous, domain-specific evaluation of retrieval-augmented generation.

Entity recognition

The retrieval phase begins with precise identification of biomedical entities mentioned within each input question. For the current benchmarks, entities such as drugs, diseases, metabolites, and biological processes were pre-annotated and standardized to established knowledge graph identifiers, enabling automated recognition at runtime. Additionally, the framework includes a zero-shot entity extraction module that leverages a specialized task-oriented prompting approach. This module is currently optimized for retrieving drugs, diseases, and biological processes from queries, with potential to extend extraction capabilities to other biomedical entities as needed. To validate the feasibility of automated entity extraction for real-world deployment, we evaluated this zero-shot GPT-3.5-based extraction module on all benchmark questions. The extractor achieved precision and recall >0.90 for all entity types, with joint accuracy (correctly extracting all required entities per question) ranging from 0.89 to 0.99 across the three benchmarks (Supplementary Table S1). These results show that automated entity extraction works reliably, though integrating it into the complete system and testing on real-world unannotated questions remain important future work.

KG retrieval

Identified biomedical entities are translated into structured queries interfacing directly with BTE. BTE integrates 61 authoritative biomedical databases under a unified knowledge graph schema, accessible via the programmatic API endpoint (/v1/query). Each query to BTE specifies an input entity (e.g., disease, drug, or biological process) along with desired output entity categories, following the TRAPI query format. In response, BTE returns structured JSON data that includes a detailed knowledge graph containing two key components: “nodes,” which describe biomedical entities along with their semantic categories and standardized names; and “edges,” which specify the explicit relationships (predicates) between pairs of entities, supplemented by provenance details indicating the primary knowledge sources.
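The one-hop query shape described above can be sketched as a minimal TRAPI payload. This is a sketch, not the BTE-RAG code: the helper name is invented, and the CURIE and Biolink categories are illustrative placeholders.

```python
import json

def build_trapi_query(input_curie, input_category, output_category):
    """Build a minimal one-hop TRAPI query graph: a pinned input node,
    an open output node, and one edge connecting them."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [input_curie], "categories": [input_category]},
                    "n1": {"categories": [output_category]},
                },
                "edges": {"e01": {"subject": "n0", "object": "n1"}},
            }
        }
    }

# Example: ask which genes are linked to type 2 diabetes mellitus.
query = build_trapi_query("MONDO:0005148", "biolink:Disease", "biolink:Gene")
print(json.dumps(query, indent=2))
```

A payload of this shape would be POSTed as JSON to BTE's /v1/query endpoint, and BTE's response fills in the matching "nodes" and "edges" described next.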

For each benchmark dataset, targeted queries were structured to retrieve mechanistically relevant context. Specifically, in the gene-centric benchmark, queries separately utilized disease and drug entities to retrieve directly linked gene and protein nodes. In the metabolite-centric benchmark, disease and chemical (drug) entities were queried independently to identify connected biochemical entities. For the drug–biological process benchmark, separate queries using disease entities and biological process entities were conducted to retrieve associated chemical entities (drugs). Upon receiving the structured knowledge graph responses from BTE, both node and edge information were systematically processed. Nodes were extracted along with their semantic categories and descriptive names, while edges were parsed to identify subject-object pairs, predicates, and associated primary knowledge sources. Nodes and edges were subsequently merged to construct coherent statements that succinctly describe each mechanistic relationship (e.g., “drug X inhibits gene Y”). These concise, natural-language context statements collectively formed the mechanistic evidence provided to the language models during the generative inference phase, significantly enhancing the transparency, interpretability, and accuracy of the generated outputs. Supplementary Fig. S1 provides a detailed schematic illustrating the complete BTE-RAG pipeline workflow, demonstrating a representative query and the subsequent processing and integration steps.
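The merge of nodes and edges into declarative context statements can be illustrated with a small sketch. The record shape follows the TRAPI-style response described above, but the entity names, identifiers, and predicate are toy values, and the predicate-to-text rule is an assumption:

```python
def triples_to_sentences(knowledge_graph):
    """Convert a TRAPI-style knowledge graph into natural-language
    context statements such as 'drug X inhibits gene Y'."""
    nodes = knowledge_graph["nodes"]
    sentences = []
    for edge in knowledge_graph["edges"].values():
        subj = nodes[edge["subject"]]["name"]
        obj = nodes[edge["object"]]["name"]
        # Predicates look like 'biolink:interacts_with': strip the
        # prefix and replace underscores with spaces.
        predicate = edge["predicate"].split(":")[-1].replace("_", " ")
        sentences.append(f"{subj} {predicate} {obj}.")
    return sentences

# Toy knowledge-graph response (illustrative names and identifiers).
kg = {
    "nodes": {
        "CHEBI:1": {"name": "metformin", "categories": ["biolink:SmallMolecule"]},
        "NCBIGene:1": {"name": "PRKAA1", "categories": ["biolink:Gene"]},
    },
    "edges": {
        "e0": {"subject": "CHEBI:1", "object": "NCBIGene:1",
               "predicate": "biolink:affects"},
    },
}
print(triples_to_sentences(kg))  # ['metformin affects PRKAA1.']
```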

Context selection

Two distinct evidence-inclusion strategies were systematically assessed for each question. The first strategy incorporates the entire set of sentences retrieved by BTE, leveraging the extensive 128,000-token context window of GPT-4o [32]. The second strategy employs sentence-level cosine similarity filtering using “S-PubMedBert-MS-MARCO” embeddings, retaining only sentences whose similarity scores with the query exceed a predefined percentile threshold [33]. Running these two strategies concurrently enables a direct evaluation of the impact of comprehensive versus selectively pruned contextual evidence under identical experimental conditions.
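The percentile-based filtering in the second strategy can be sketched in plain Python. The 2-D vectors below stand in for S-PubMedBert-MS-MARCO embeddings, and the 10% drop fraction is one example of a predefined percentile threshold:

```python
import math

def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prune_context(query_vec, sentence_vecs, sentences, drop_fraction=0.10):
    """Rank context sentences by similarity to the query and drop the
    bottom `drop_fraction` (a percentile-style threshold)."""
    scores = [cosine(query_vec, v) for v in sentence_vecs]
    n_drop = int(len(sentences) * drop_fraction)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i])
    kept = set(ranked[n_drop:])
    return [s for i, s in enumerate(sentences) if i in kept]

# Toy example: 10 statements; the one orthogonal to the query is pruned.
query = [1.0, 0.0]
vectors = [[1.0, 0.1 * i] for i in range(9)] + [[0.0, 1.0]]
statements = [f"statement {i}" for i in range(10)]
print(len(prune_context(query, vectors, statements)))  # 9
```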

Generative inference

For the generative phase, selected context sentences and the original query were concatenated to form an enriched prompt submitted to both GPT-4o and GPT-4o-mini models. Models were configured deterministically (temperature set to 0) to produce reproducible outputs. Parallel runs of the LLM-only baseline used identical questions without the BTE-derived context. To streamline downstream analyses and ensure objective comparisons, language models were instructed explicitly to output structured JSON responses devoid of extraneous explanatory text. Detailed system prompts are shown in Supplementary Table S7.
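Prompt assembly for the two routes can be sketched as follows. The helper name and the "Context:"/"Question:" formatting are assumptions of this sketch, not the study's exact prompt text (the full system prompts are in Supplementary Table S7); the assembled messages would then be sent deterministically via the OpenAI client, as noted in the comment.

```python
def build_messages(system_prompt, question, context_sentences=None):
    """Assemble the chat payload for either route: BTE-RAG prepends the
    retrieved context statements, while the LLM-only baseline sends only
    the question."""
    if context_sentences:
        user_content = (
            "Context:\n" + "\n".join(context_sentences)
            + "\n\nQuestion: " + question
        )
    else:
        user_content = "Question: " + question
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

messages = build_messages(
    "Answer in strict JSON with no extra text.",
    "Which gene mediates how Drug X treats Disease Y?",
    ["Drug X decreases activity of GENE1.",
     "GENE1 is associated with Disease Y."],
)
# The enriched prompt would then be submitted with temperature 0, e.g.:
# client.chat.completions.create(model="gpt-4o", messages=messages,
#                                temperature=0,
#                                response_format={"type": "json_object"})
print(messages[1]["content"].splitlines()[0])  # Context:
```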

Evaluation

Generated outputs were benchmarked against curated ground-truth annotations from the constructed mechanistic datasets. For entity-specific tasks (e.g., gene identification), correctness was evaluated via exact, case-insensitive string matching. For semantically nuanced responses (e.g., metabolites and drugs), BioBERT-based embeddings (“BioBERT-mnli-snli-scinli-scitail-mednli-stsb”) quantified the semantic similarity between model-generated outputs and reference answers [34]. Answers surpassing a predetermined similarity threshold were classified as accurate. Collectively, these standardized evaluation methodologies ensure scalable, objective, and reproducible assessment of the fidelity and biological coherence of model predictions, rigorously testing the utility and impact of knowledge graph-enhanced prompting in biomedical reasoning contexts.
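The two scoring rules can be sketched as small predicates. The 0.90 default below is an illustrative stand-in for the predetermined threshold, and the similarity score itself would come from the BioBERT-based embedding model named above:

```python
def is_correct_entity(predicted, gold):
    """Entity-specific tasks (e.g., gene identification):
    exact, case-insensitive string match."""
    return predicted.strip().lower() == gold.strip().lower()

def is_correct_semantic(similarity, threshold=0.90):
    """Semantically nuanced tasks (metabolites, drugs): the answer counts
    as accurate when the cosine similarity between the embeddings of the
    model output and the reference clears the threshold. The threshold
    value here is illustrative, not the study's stated cutoff."""
    return similarity >= threshold

print(is_correct_entity(" TP53 ", "tp53"))  # True
print(is_correct_semantic(0.93), is_correct_semantic(0.42))  # True False
```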

Embedding and scoring models

We use pritamdeka/S-PubMedBert-MS-MARCO, a PubMedBERT model fine-tuned on the MS MARCO passage-ranking task [35]. This model is optimized for high-recall biomedical information retrieval, having been trained on large-scale query-passage pairs to identify relevant content across lexically diverse API outputs. Its passage-ranking specialization makes it well-suited for the initial filtering stage, where we prioritize recall over precision to avoid prematurely discarding potentially relevant knowledge graph paths.

Similarity scoring: For semantic similarity evaluation, we use pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb, which extends BioBERT with additional fine-tuning on natural language inference (SNLI, MNLI, SciNLI, MedNLI) and semantic textual similarity (STS-B, SciTail) datasets [36]. This multi-task training enhances its ability to capture nuanced semantic relationships, including synonymy, paraphrasing, and entailment, critical for accurately assessing whether generated answers match gold-standard mechanistic explanations despite surface-level differences in phrasing.

Using distinct models for retrieval and evaluation aligns each component with its functional requirements: S-PubMedBERT-MS-MARCO prioritizes finding all potentially relevant passages (high recall), while the BioBERT-based model provides precise semantic equivalence judgments (high precision).

External baseline methods and benchmarks

In addition to evaluating BTE-RAG, we compared against established biomedical reasoning frameworks and benchmarks commonly used to assess LLMs in genomics.

GeneGPT models

We evaluated GeneGPT, a biomedical reasoning framework that enables LLMs to interact with structured biomedical knowledge through NCBI (National Center for Biotechnology Information) Web APIs using in-context learning [37]. GeneGPT couples curated API documentations and usage demonstrations with an inference procedure that integrates API calls directly into the model’s decoding process, allowing external knowledge to be retrieved and incorporated during answer generation.

In this study, we evaluated two GeneGPT configurations as described in the original GeneGPT framework. The GeneGPT-Full configuration (denoted as 111111) includes all available API documentations (Dc.1–Dc.2) and usage demonstrations (Dm.1–Dm.4) within the in-context prompt. The GeneGPT-Slim configuration (denoted as 001001) uses a reduced prompt consisting only of Demonstrations Dm.1 and Dm.4, thereby limiting contextual information while retaining core API usage examples.

To ensure a fair comparison with other methods evaluated using GPT-4o family models, all GeneGPT variants were adapted to use GPT-4o-mini or GPT-4o as the underlying language model, and the context window was expanded to 100,000 tokens, exceeding the original Codex-based implementation (18k tokens).

GeneTuring gene–disease association benchmark

To benchmark performance on a standardized genomics task, we used the GeneTuring benchmark, a curated question-answering dataset designed to evaluate LLMs on genomics knowledge [38]. From GeneTuring, we selected the functional analysis category, specifically focusing on the gene–disease association module, which assesses whether a model can correctly identify genes associated with a given disease.

Each query was reformatted into a standardized question template of the form: “What is the gene related to [disease]?” Model outputs were evaluated by comparing the predicted gene against the corresponding ground-truth gene annotations provided by GeneTuring. Predictions were scored as correct if the model-generated gene matched any of the accepted ground-truth answers for a given question, and incorrect otherwise, yielding a binary outcome suitable for accuracy-based evaluation.
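The templating and binary scoring rule can be sketched as below; the case-insensitive comparison is an assumption of this sketch (alkaptonuria/HGD is simply an illustrative gene–disease pair):

```python
def format_geneturing_question(disease):
    """Standardized question template described in the text."""
    return f"What is the gene related to {disease}?"

def is_correct(predicted_gene, accepted_genes):
    """Binary scoring: correct if the prediction matches any accepted
    ground-truth gene for the question."""
    pred = predicted_gene.strip().upper()
    return any(pred == g.strip().upper() for g in accepted_genes)

print(format_geneturing_question("alkaptonuria"))
print(is_correct(" hgd ", ["HGD"]))  # True
```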

Datasets from the DrugMechDB database

Construction of mechanistic question–answer benchmarks from DrugMechDB

DrugMechDB is a rigorously curated biomedical knowledge graph designed to represent therapeutic mechanisms through explicit stepwise paths. These pathways originate from drug nodes, traverse biologically meaningful intermediate entities, and culminate at disease nodes, collectively delineating mechanisms underlying drug–disease interactions [31]. The current version of DrugMechDB contains 5,666 curated mechanistic pathways, providing comprehensive coverage for 4,583 distinct drug–disease indications. Each node within DrugMechDB is systematically mapped to a standardized Biolink category and anchored to stable identifiers, while each relationship (edge) is annotated with a controlled predicate [39]. This structured, granular, and provenance-rich resource enables robust benchmarking of computational models focused on mechanistic inference rather than simple associative or co-occurrence patterns.

To comprehensively evaluate the BTE-RAG framework across multiple levels of biological resolution, DrugMechDB was systematically transformed into three complementary mechanistic question–answer (QA) benchmarks, each highlighting a distinct biological focus: genes, metabolites, and drugs (Fig. 1B).

Gene-Centric Benchmark: Mechanistic pathways were initially filtered to retain those containing exactly one internal node annotated as a gene entity. Gene identifiers were resolved into standardized HGNC symbols using MyGene.info services; pathways containing deprecated or ambiguous identifiers were systematically excluded [40]. Each remaining mechanistic pathway was converted into a structured question of the form: “Which gene plays the most significant mechanistic role in how Drug “X” treats or impacts Disease “Y”?” The corresponding HGNC gene symbol served as the definitive ground truth. Following deduplication across different indications, this dataset comprised 798 unique QA pairs.

Metabolite-Centric Benchmark: To capture downstream biochemical effects, pathways exclusively containing taxonomic relationships (such as “subclass” predicates) were removed to ensure mechanistic specificity. Selected pathways included exactly one metabolite node, identified specifically by filtering node identifiers prefixed with “CHEBI:” to denote biochemical entities. Records containing multiple mechanistic pathways were excluded to maintain dataset simplicity and clarity. Each qualifying pathway was formulated into the structured question: “Which biochemical entity is affected by Drug “X” via its mechanism of action in treating Disease “Y”?” The metabolite node identified via CHEBI identifiers served as the ground truth answer, yielding a final dataset of 201 unique QA pairs.

Drug-Centric Benchmark: A third benchmark dataset was developed to evaluate the ability of computational models to infer therapeutic agents when provided with a disease and a mediating biological process. Pathways were selected specifically if they included exactly one BiologicalProcess node, and drugs lacking resolvable identifiers from DrugBank or MESH databases were excluded to ensure accurate and standardized identification. Each qualifying path was structured into the question: “Which drug can be used in the treatment of Disease “Y” by targeting Biological Process “P”?” The corresponding drug node served as the ground truth. After thorough harmonization and stringent quality control measures, this benchmark comprised 842 unique QA pairs.
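The filter-and-template step shared by these benchmarks can be sketched for the gene-centric case. The flat path records (with `nodes`, `drug`, and `disease` fields) are a hypothetical simplification of DrugMechDB's actual schema, and identifier resolution via MyGene.info is omitted:

```python
def paths_to_gene_qa(paths):
    """Convert mechanistic paths into QA pairs, keeping only paths with
    exactly one gene node, per the gene-centric benchmark construction."""
    qa_pairs = []
    for path in paths:
        genes = [n for n in path["nodes"] if n["category"] == "Gene"]
        if len(genes) != 1:  # require exactly one internal gene node
            continue
        question = (
            f'Which gene plays the most significant mechanistic role in how '
            f'Drug "{path["drug"]}" treats or impacts Disease "{path["disease"]}"?'
        )
        qa_pairs.append({"question": question, "answer": genes[0]["symbol"]})
    return qa_pairs

# Toy paths: the first qualifies; the second has two gene nodes.
toy_paths = [
    {"drug": "DrugA", "disease": "DiseaseX",
     "nodes": [{"category": "Gene", "symbol": "GENE1"}]},
    {"drug": "DrugB", "disease": "DiseaseY",
     "nodes": [{"category": "Gene", "symbol": "GENE2"},
               {"category": "Gene", "symbol": "GENE3"}]},
]
print(len(paths_to_gene_qa(toy_paths)))  # 1
```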

The resulting benchmarks thus offer a robust, multiscale evaluation platform specifically designed to probe the mechanistic inference capabilities of knowledge-graph-augmented language models comprehensively and rigorously.

Use of LLMs

All natural-language processing steps were carried out with two OpenAI models, GPT-4o-mini (snapshot 2024-07-18) and GPT-4o (snapshot 2024-08-06) [32, 41]. Both models were invoked through the OpenAI API. The temperature parameter was fixed at 0.0 for every request, thereby forcing deterministic decoding and facilitating reproducible evaluation. Each model accepts up to 128,000 input tokens and can return a maximum of 16,384 completion tokens. Although GPT-4o-mini is substantially smaller in parameter count than GPT-4o, both models share the same context window size, permitting a controlled comparison of model capacity while holding prompt length constant [32, 41]. At the time the experiments were executed, GPT-4o-mini was priced at 0.15 USD per million input tokens and 0.60 USD per million output tokens [42]. The corresponding prices for GPT-4o were 2.50 USD and 10.00 USD, respectively [43]. Model versions were pinned by explicit snapshot identifiers to eliminate the possibility of version drift during the study period. ChatGPT was used to assist with grammar correction and to improve conciseness in the manuscript.

Prompt engineering

Each request began with a concise system prompt defining the model’s role [44–46]. Two distinct system prompts were prepared per dataset: one for the standalone LLM baseline, and one tailored for the retrieval-augmented BTE-RAG workflow. Queries were provided directly to the model without additional contextual examples, employing a zero-shot prompting approach. To facilitate efficient and accurate downstream processing, the model was instructed to produce responses strictly in a predefined JSON format, omitting supplementary explanatory text.

Results

We developed BTE-RAG, a retrieval-augmented generation framework designed to enhance LLMs by integrating mechanistic evidence from BTE, a federated biomedical knowledge graph. BTE-RAG embeds structured, graph-derived context into prompts to improve mechanistic accuracy, ensure explicit provenance, and facilitate higher-order reasoning. We benchmarked the performance of BTE-RAG across four distinct biomedical reasoning tasks: (1) GeneTuring's gene–disease task, (2) gene identification from drug–disease mechanisms, (3) drug–metabolite–disease interactions, and (4) drug–biological process–disease relationships. For gene-focused tasks (benchmarks 1 and 2), we compared BTE-RAG against three baseline approaches: LLM-only prompting, GeneGPT-Full, and GeneGPT-Slim. For the metabolite and drug–biological process benchmarks, we compared BTE-RAG against LLM-only prompting.

GeneTuring gene-disease association benchmark

To establish baseline performance on a standardized biomedical association task, we evaluated BTE-RAG on the GeneTuring gene-disease association dataset alongside existing biomedical RAG and prompting-based approaches [38]. This benchmark focuses on identifying associations between genes and diseases, providing a complementary evaluation to our mechanistic reasoning benchmarks. We compared four approaches: (1) LLM-only prompting, (2) BTE-RAG, (3) GeneGPT-Full, and (4) GeneGPT-Slim.

Across both model scales, BTE-RAG achieved the highest accuracy, substantially outperforming all baselines (Fig. 2). On GPT-4o-mini, BTE-RAG increased accuracy from 33% to 77%, while on GPT-4o performance rose from 56% to 80%, despite the stronger baseline of the larger model. In both cases, GeneGPT-Full and GeneGPT-Slim achieved intermediate accuracy, indicating that API-level retrieval alone provides only partial gains.

Figure 2.

Figure 2

Method comparison on the GeneTuring gene–disease association benchmark. Accuracy is shown for four approaches: LLM-only prompting, BTE-RAG, GeneGPT-Full, and GeneGPT-Slim, evaluated on the GeneTuring gene–disease association benchmark. Results are reported separately for the compact gpt-4o-mini and the larger gpt-4o models. For both models, incorporating the BTE-RAG retrieval layer substantially increased accuracy relative to the LLM-only baseline. Bars indicate percentage accuracy. Statistical significance was assessed using paired McNemar's tests, comparing BTE-RAG against each baseline within the same model. Significance levels are indicated as *P < 0.05, **P < 0.01, ***P < 0.001, and ****P < 1 × 10−4. (Exact statistics and flip counts are reported in Supplementary Table S2 and Fig. S2.)

All pairwise comparisons between BTE-RAG and baseline methods were statistically significant (Supplementary Table S2). With GPT-4o-mini, BTE-RAG’s advantage over LLM-only was 44 percentage points, over GeneGPT-Full was 48 points, and over GeneGPT-Slim was 45 points. With GPT-4o, the corresponding advantages were 24 points, 21 points, and 16 points.

Contingency-table analysis further revealed that these improvements were driven by asymmetric error correction (Supplementary Fig. S2). For GPT-4o-mini, BTE-RAG corrected 45–49 gene–disease associations that baseline methods failed, while introducing only 1–2 errors on questions answered correctly by the baselines. For GPT-4o, BTE-RAG corrected 21–27 additional associations with only 3–5 reversals. These results indicate that federated knowledge graph retrieval provides substantially more relevant and accurate biomedical context than either unstructured prompting or API-based retrieval alone for gene–disease association tasks.
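McNemar's test on these paired outcomes depends only on the discordant counts (questions one method gets right and the other wrong). As a sketch, one common exact form is a binomial tail test on the discordant pairs; whether the study used the exact or chi-square variant is not stated, and the counts below are illustrative:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on the discordant pair counts:
    b = baseline-only correct, c = BTE-RAG-only correct. Under the null,
    flips are Binomial(n = b + c, p = 0.5); the p-value is twice the
    smaller tail probability, capped at 1."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(2 * tail, 1.0)

# Illustrative counts (not the study's): 25 corrections vs. 3 reversals.
print(mcnemar_exact(3, 25) < 0.05)  # True
```

The asymmetry itself (c much larger than b) is what drives significance: balanced flips yield p = 1 regardless of how many there are.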

Mechanistic gene prediction

We next assessed the impact of retrieval-augmented and structured knowledge-based inference on gene-level mechanistic reasoning, using 798 curated mechanistic drug–disease associations from DrugMechDB. Each query was formulated as: “Which gene plays the most significant mechanistic role in how Drug X treats or impacts Disease Y?” Both models, GPT-4o-mini and GPT-4o, were evaluated across four approaches.

Under the LLM-only condition, GPT-4o-mini achieved an accuracy of 51.0% on the mechanistic gene benchmark (Fig. 3). Incorporating BTE-RAG substantially increased accuracy to 75.8%, corresponding to an absolute improvement of 24.8 percentage points, which was highly significant by paired analysis (McNemar P < 0.0001; Supplementary Table S3). In contrast, GeneGPT-Full and GeneGPT-Slim achieved lower accuracies of 37.2 and 45.2%, respectively. Direct pairwise comparisons confirmed that BTE-RAG significantly outperformed both GeneGPT variants across all tests (Supplementary Table S3).

Figure 3.

Figure 3

Retrieval-augmented generation with BTE-RAG markedly improves factual accuracy on the gene-centric benchmark using GPT-4o models. Shown is the accuracy of four inference strategies applied to the mechanistic gene benchmark using gpt-4o-mini and gpt-4o. Across both model scales, BTE-RAG outperforms LLM-only prompting and GeneGPT-based approaches, demonstrating the advantage of query-conditioned retrieval for gene-level mechanistic reasoning. Bars indicate percentage accuracy. Statistical significance was assessed using paired McNemar's tests, comparing BTE-RAG against each baseline within the same model. Significance levels are indicated as *P < 0.05, **P < 0.01, ***P < 0.001, and ****P < 1 × 10−4 (Supplementary Table S3 and Fig. S3).

A similar pattern was observed for the larger GPT-4o model. LLM-only prompting achieved an accuracy of 69.8%, which increased to 78.6% with BTE-RAG, yielding an absolute gain of 8.8 percentage points (McNemar P < 0.0001; Fig. 3, Supplementary Table S3). As with GPT-4o-mini, GeneGPT-Full and GeneGPT-Slim exhibited substantially lower performance (42.7 and 44.0%, respectively), and BTE-RAG significantly outperformed both structured GeneGPT configurations in all pairwise comparisons (Supplementary Table S3).

Contingency-table analyses revealed that these gains were driven by strongly asymmetric error corrections. For GPT-4o-mini, BTE-RAG converted 245 previously incorrect predictions to correct, with only 47 reversals from correct to incorrect, while for GPT-4o, 119 incorrect predictions were corrected with 49 reversals (Supplementary Fig. S3). Similar asymmetries were observed when comparing BTE-RAG against both GeneGPT-Full and GeneGPT-Slim, accounting for the highly significant McNemar statistics across all pairwise comparisons.

Because knowledge-graph queries can return superfluous triples, we evaluated a simple similarity-based pruning strategy. Specifically, both the user queries and the context statements were embedded using the sentence embedding model “S-PubMedBert-MS-MARCO” [33]. Context statements were then ranked by cosine similarity to the embedded query, and statements falling within the lowest 10% of similarity scores were removed, retaining only the most relevant context lines. This lightweight filtering strategy preserved and, in some cases, slightly enhanced performance across all evaluated accuracy metrics (Supplementary Figs S4A, S5A), suggesting that excluding the least relevant context statements can benefit the accuracy of gene-level reasoning tasks.

Together, these results demonstrate that retrieval-augmented mechanistic context provided by BTE-RAG consistently and substantially improves gene-level reasoning, outperforming both LLM-only prompting and structured GeneGPT-based approaches across base model scales. These gains are particularly pronounced for smaller models such as GPT-4o-mini, where retrieval markedly amplifies mechanistic inference capabilities, but remain evident even for the larger GPT-4o model. The observed improvements indicate that state-of-the-art language models retain latent mechanistic knowledge gaps that can be effectively bridged through the integration of curated biomedical knowledge graphs and the selective inclusion of relevant contextual evidence.

Prediction of drug–metabolite relationships

To gauge whether retrieval augments the mechanistic fidelity of metabolite-level reasoning, we posed 201 queries of the form “Which biochemical entity is affected by Drug X via its mechanism of action in treating Disease Y?” using the DrugMechDB-derived Drug → Metabolite → Disease paths. Because metabolite names are much less standardized than gene names, we scored answer quality by computing a semantic concordance between each model answer and the gold-standard metabolite. Semantic concordance was based on cosine similarity of text embeddings from the BioBERT-STSB text embedding model, a metric that rewards graded lexical and semantic overlap rather than exact string identity [34].

Rank-ordered similarity curves in Fig. 4A immediately reveal the effect of augmentation: for both GPT-4o-mini (orange) and GPT-4o (blue), the BTE-RAG trace (solid line) departs from the prompt-only baseline (dashed line) after ~130 ranked questions (cosine ≈ 0.70) and widens steadily, nearly doubling the number of answers that reach the high-fidelity zone (cosine ≥ 0.90). Paired per-item similarity analyses demonstrated a consistent improvement with retrieval augmentation across both model scales. BTE-RAG increased mean similarity by 0.066 (95% CI: 0.031–0.099) for GPT-4o-mini and by 0.063 (95% CI: 0.030–0.095) for GPT-4o. These improvements were statistically significant by Wilcoxon signed-rank tests (P < 0.001 for both models). Effect-size estimation using Cliff’s delta (δ) indicated a small effect for GPT-4o-mini (δ = 0.239, 95% CI: 0.139–0.338) and a negligible effect for GPT-4o (δ = 0.065, 95% CI: −0.040 to 0.169), indicating that the smaller model benefited more substantially from retrieval augmentation.
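Cliff's delta compares every score in one sample against every score in the other and reports the normalized difference between the counts of dominating pairs. A minimal sketch (the function name, the O(n·m) pairwise loop, and the toy scores are our own; the paper's analysis code may differ):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta between two samples.

    Returns a value in [-1, 1]: +1 means every x exceeds every y;
    0 means the two samples are stochastically indistinguishable.
    """
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Toy cosine-similarity scores: retrieval-augmented vs. prompt-only answers
rag_scores = [0.92, 0.88, 0.75, 0.61]
baseline_scores = [0.70, 0.55, 0.30, 0.25]
delta = cliffs_delta(rag_scores, baseline_scores)  # 14/16 = 0.875
```

By the common convention, |δ| below roughly 0.147 is read as negligible and below roughly 0.33 as small, which matches the interpretations given above for GPT-4o and GPT-4o-mini, respectively.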

Figure 4.


Retrieval-augmented context increases semantic concordance with ground-truth metabolites. (A) Cosine-similarity scores between each generated answer and the corresponding reference metabolite (sentence-transformer embeddings; see the Materials and methods section) are plotted for all 201 questions in the Metabolite Benchmark, ordered from lowest to highest similarity. Dashed traces represent the LLM-only baseline, whereas solid traces include BTE retrieval-augmented context. For both model sizes, BTE-RAG systematically shifts the similarity distribution upward, indicating improved semantic alignment with the curated biochemical ground truth. (B) Score distribution: GPT-4o, LLM-only. Histogram of cosine-similarity scores for GPT-4o answers generated without external context. Bar heights and numeric labels denote the number of questions (n = 201) falling in each bin; the overlaid KDE line summarizes the distribution. (C) Score distribution: GPT-4o + BTE-RAG. Same format as panel B but for GPT-4o answers generated with BTE-RAG’s context. The right-shifted, more peaked distribution highlights the improvement in semantic alignment achieved by RAG.

Histograms for the prompt-only condition (Fig. 4B, GPT-4o; Supplementary Fig. S6, GPT-4o-mini) reveal a pronounced skew toward low similarity scores: both GPT-4o-mini and GPT-4o peak in the 0.15–0.30 similarity bins, with medians below 0.30. Only 15% of answers fall in the high-similarity regime (≥0.90), indicating that the LLMs frequently produce metabolite answers that are semantically distant from the curated ground truth.

To validate this threshold choice, we examined sensitivity across nearby cutoffs (0.85, 0.90, and 0.95) (Supplementary Table S4). For GPT-4o-mini, BTE-RAG increased high-fidelity answers from 15.4 to 26.4% at the 0.85 threshold, from 13.9 to 25.4% at 0.90, and from 11.9 to 23.4% at 0.95. GPT-4o showed similar patterns: 16.4 to 27.4% at 0.85, 14.9 to 26.4% at 0.90, and 11.4 to 25.9% at 0.95. The consistent gains across all thresholds confirm that retrieval augmentation robustly enhances high-fidelity predictions regardless of the specific cutoff selected.
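This sensitivity check amounts to counting, at each cutoff, the fraction of answers whose similarity score meets or exceeds it. A small illustrative helper (not the paper's code; the scores are invented for the example):

```python
def high_fidelity_fractions(scores, thresholds=(0.85, 0.90, 0.95)):
    """Fraction of answers at or above each similarity cutoff."""
    n = len(scores)
    return {t: sum(s >= t for s in scores) / n for t in thresholds}

# Toy example: cosine-similarity scores for 10 answers
scores = [0.96, 0.93, 0.91, 0.88, 0.86, 0.72, 0.55, 0.41, 0.30, 0.22]
fracs = high_fidelity_fractions(scores)
# e.g., fracs[0.90] -> 0.3 (3 of 10 answers at or above 0.90)
```

Reporting the same fractions at 0.85, 0.90, and 0.95, as in Supplementary Table S4, shows whether a claimed gain is an artifact of one particular cutoff.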

Appending BTE evidence shifts the distributions rightward across similarity bins (Fig. 4C, GPT-4o; Supplementary Figs S7, S8). For GPT-4o-mini, applying a stringent context-similarity threshold (>80th percentile) increased the number of high-fidelity answers (cosine similarity 0.90–1.00) from 28 to 51 (+82%). Similarly, GPT-4o exhibited an increase from 30 to 53 (+77%) under the same conditions. Simultaneously, counts in the mid-similarity interval (0.40–0.70) contract (Supplementary Figs S7, S8), confirming that retrieval largely converts borderline predictions into highly concordant hits rather than merely redistributing low-score failures.

Because voluminous context can inflate token budgets, we assessed performance when progressively discarding lower-ranked context lines (10th to 90th percentile cutoffs). Rank-ordered similarity traces (Supplementary Fig. S9) show that the BTE-RAG curves remain above or coincide with the prompt-only baseline throughout the distribution even when 90% of the context is withheld. Histograms (Supplementary Figs S7, S8) reinforce this observation: the ≥0.90 similarity bin consistently retains ≥40 hits for both models across all pruning levels, demonstrating that a concise subset of top-ranked evidence lines is sufficient to drive the bulk of the performance gains.

Drug–biological process reasoning

We next asked 842 DrugMechDB questions of the form “Which drug can be used in the treatment of Disease Y by targeting Biological Process P?” Answer fidelity was again scored with BioBERT-STSB cosine similarity [34].

In rank-ordered plots (Fig. 5A), the prompt-only (dashed) and BTE-RAG (solid) curves for both GPT-4o-mini (orange) and GPT-4o (blue) are nearly superimposable through the first ≈600 ranked queries (cosine < 0.70). Beyond this inflection point, the BTE-augmented traces bend upward more steeply, yielding a clear margin in the high-fidelity zone (cosine ≥ 0.80). Thus, retrieval does not alter overall parity but selectively boosts the most mechanistically demanding subset of questions.

Figure 5.


RAG maintains overall parity yet excels in the high-fidelity regime of drug-centric mechanistic answers. (A) Cosine-similarity scores (sentence-transformer embeddings; see the Materials and methods section) between each generated answer and the reference drug→biological-process pathway are plotted for all 842 questions in the drug benchmark, ordered from lowest to highest similarity. Dashed traces (LLM-only) and solid traces (BTE-RAG) follow nearly overlapping trajectories across most of the distribution, indicating broadly comparable performance between the two inference modes. However, above a cosine-similarity threshold of ≈0.7, both GPT-4o-mini and GPT-4o curves generated with BTE context surge ahead of their prompt-only counterparts, revealing a marked advantage in producing highly concordant mechanistic explanations. (B) Score distribution: GPT-4o, LLM-only. Histogram of cosine-similarity scores for GPT-4o answers generated without external context. The hatched bar at 0.90–1.00 marks the high-fidelity zone, capturing 129 near-perfect matches produced by the baseline model. (C) Score distribution: GPT-4o + BTE-RAG. Same format as panel B but for GPT-4o answers produced with BTE-RAG’s context. The distribution is right-shifted, and the solid bar in the 0.90–1.00 high-fidelity zone now contains 144 answers, highlighting the enrichment of top-tier mechanistic concordance achieved through RAG.

Prompt-only histograms (Fig. 5B; Supplementary Fig. S10, GPT-4o-mini) peak in the 0.20–0.35 range, with ~15% of answers falling in the ≥0.90 bin. Appending the full BTE context nudges the entire distribution rightward (Fig. 5C; Supplementary Fig. S11, top-left panels). The ≥0.90 bin increases by ≈5–10% for both model sizes. These shifts, though smaller than those seen for the gene and metabolite tasks, account for the late-stage separation observed in Fig. 5A.

Unlike the previous tasks, performance here depends on retaining a broad evidentiary window. When the lowest-ranked 10–20% of context lines are removed, the uplift in the ≥0.90 bin attenuates, and the rank-ordered curves progressively converge toward the baseline (Supplementary Figs S11–S13). Deeper cuts (>40%) essentially erase the retrieval advantage. This suggests that pathway-level questions draw on a more diffuse set of graph triples than gene or metabolite queries, and that aggressive trimming can discard critical relational clues. For drug → biological-process reasoning, BTE-RAG delivers targeted gains in the top decile of similarity scores, provided the complete knowledge-graph context is supplied.

To assess sensitivity to similarity-threshold selection, we evaluated performance at cutoffs of 0.85, 0.90, and 0.95 (Supplementary Table S5). For GPT-4o-mini, BTE-RAG produced minimal changes across thresholds, with accuracy increasing from 13.3 to 13.8% at 0.85, from 13.1 to 13.7% at 0.90, and showing no change at 0.95. GPT-4o exhibited modest improvements, with accuracy increasing from 15.6 to 17.3% at 0.85, from 15.3 to 17.1% at 0.90, and from 15.3 to 16.5% at 0.95. The modest and inconsistent gains across thresholds underscore the inherent difficulty of this task, where linking drugs to biological processes requires more abstract mechanistic reasoning than gene- or metabolite-level queries.

Paired per-item similarity analysis showed no systematic advantage of BTE-RAG over the baseline model for drug-to-biological-process reasoning. Performance differences were small and not statistically distinguishable under paired nonparametric testing, with effect sizes indicating negligible practical impact.

These findings suggest that drug-to-biological-process reasoning represents a more challenging inference regime in which retrieval augmentation provides limited overall benefit.

We conducted a detailed error analysis of all instances in which BTE-RAG underperformed the LLM-only baseline across the three benchmarks for the GPT-4o-mini model. As summarized in Supplementary Table S6, most degradations arose from context dilution (relevant mechanistic entities retrieved but obscured by numerous less-specific entities) or evidence coverage gaps (missing or incomplete mechanistic edges in the federated knowledge graph), with smaller contributions from filtering errors (relevant evidence pruned during similarity-based selection) and derivative mismatches (e.g., chloramphenicol palmitate returned instead of chloramphenicol).

These findings reinforce that optimal evidence granularity is task-dependent: concise, high-relevance snippets suffice for gene- and metabolite-level inference, whereas pathway-level queries benefit from a richer contextual fabric. The variation in retrieval effectiveness across tasks, from substantial improvements in gene-level binary classification to modest gains in drug-to-biological-process similarity, highlights that different query types pose distinct mechanistic reasoning challenges. By grounding LLM outputs within curated, biologically meaningful pathways, BTE-RAG consistently accelerates accurate inference, reduces residual errors, and demonstrates considerable promise for advancing automated biomedical hypothesis generation and therapeutic repurposing workflows.

Discussion

The rapid advancement of LLMs has profoundly reshaped biomedical natural language processing [47]. Despite these advances, current LLMs predominantly operate as opaque systems with implicit knowledge representation, rendering their factual accuracy challenging to verify and limiting their applicability in high-stakes biomedical environments. Recent efforts, such as the knowledge-graph augmented retrieval approach [21], have successfully enhanced biomedical reasoning by integrating disease-specific embeddings from specialized knowledge graphs such as SPOKE [48]. We developed BTE-RAG, a novel RAG pipeline that strategically incorporates explicit mechanistic evidence from BTE [27]. By leveraging the extensive and federated biomedical KG of BTE, our method substantially broadens the applicability of knowledge-graph augmented strategies to address diverse query types, including those involving genes, proteins, metabolites, biological processes, diseases, and chemical substances. This capability allows BTE-RAG to support complex, multi-domain biomedical inquiries, significantly extending beyond disease-centric queries alone. Our comparative analysis, utilizing a direct “LLM-only” approach versus the BTE-augmented strategy (Fig. 1A) across three rigorously constructed DrugMechDB benchmarks (Fig. 1B), demonstrates that incorporating explicit, structured context significantly elevates answer accuracy, enhances transparency, and allows smaller, more computationally efficient models to perform competitively with leading-edge systems. The granularity, explicit mechanistic grounding, and high-quality source attribution inherent in these benchmarks uniquely position them for probing the causal inference capabilities of language models. Comparable mechanistically focused datasets remain scarce in the biomedical domain, as existing resources like PubMedQA or Natural Questions predominantly target document-level retrieval or summarization rather than deep mechanistic inference [49, 50].

Traditional LLMs accumulate domain-specific knowledge implicitly during pre-training by statistically modeling large collections of biomedical texts. Although this method yields linguistically coherent responses, it inherently exposes models to the risk of hallucinations, particularly in scenarios involving sparse biomedical facts or multi-step mechanistic reasoning. By contrast, RAG explicitly anchors model predictions in verifiable external sources, constraining generation to well-substantiated evidence. BTE-RAG advances this paradigm by dynamically federating 61 authoritative biomedical APIs into a single cohesive meta-graph, thereby enabling real-time inclusion of newly curated knowledge in generated responses and ensuring reproducible benchmarking through cached retrievals.

Four critical design principles underpin the efficacy of the BTE-RAG framework. First, the framework leverages an API-centric federation layer that integrates trusted biomedical data sources, including MyGene.info, Gene Ontology, CTDbase, PubMed Central, ChEBI, Disease Ontology, DrugBank, and more, through the unified interface of BTE [29, 40, 51–53]. Second, it employs semantic query templates aligned with the TRAPI standard, selectively retrieving only the most relevant relationships for each question, thereby avoiding extraneous contextual noise. Third, retrieved knowledge-graph triples are translated into succinct, directionally explicit declarative statements, seamlessly integrating structured knowledge with natural-language prompts. Fourth, BTE-RAG incorporates flexible context-selection strategies: full-context utilization and cosine similarity-based pruning for scenarios requiring concise, highly relevant context subsets.
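The third principle, verbalizing triples as declarative statements, can be sketched with a toy converter. Everything below, the predicate phrase table, the function name, and the example triples, is our own illustration rather than BTE-RAG's actual implementation:

```python
# Hedged sketch: map Biolink-style predicates to directionally explicit phrases
PREDICATE_PHRASES = {
    "biolink:treats": "is used to treat",
    "biolink:interacts_with": "interacts with",
    "biolink:decreases_activity_of": "decreases the activity of",
}

def triple_to_statement(subject: str, predicate: str, obj: str) -> str:
    """Render a knowledge-graph triple as a short declarative sentence."""
    default = predicate.replace("biolink:", "").replace("_", " ")
    phrase = PREDICATE_PHRASES.get(predicate, default)
    return f"{subject} {phrase} {obj}."

# Build a context block from retrieved triples for inclusion in the prompt
context = "\n".join(
    triple_to_statement(s, p, o)
    for s, p, o in [
        ("Imatinib", "biolink:decreases_activity_of", "ABL1"),
        ("Imatinib", "biolink:treats", "chronic myelogenous leukemia"),
    ]
)
```

Short, direction-preserving sentences of this kind let the structured evidence slot into a natural-language prompt without requiring the model to parse raw graph identifiers.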

Across diverse mechanistic tasks, including gene-centric, metabolite-centric, and drug-centric benchmarks derived from DrugMechDB [31], BTE-augmented prompting consistently outperformed the LLM-only approach. Notably, the smaller GPT-4o-mini model achieved over a 60% improvement in accuracy on the gene-centric task and an 82% improvement on the metabolite task when provided with structured BTE evidence. Even GPT-4o, the larger flagship model, demonstrated substantial accuracy gains, underscoring that high-quality, explicit mechanistic context can reduce the need for extremely large model sizes and suggesting a cost-efficient pathway toward domain-specific accuracy.

While BTE offers comprehensive coverage across numerous biomedical domains, certain areas such as single-cell data, epigenomic profiles, and microbiome interactions remain sparsely represented. Furthermore, variations in curation quality across federated APIs could inadvertently propagate erroneous edges into model-generated contexts. Although our evaluation leveraged the meticulously curated, high-confidence knowledge graph of DrugMechDB, real-world applications may require strategies for managing lower-confidence or conflicting evidence. A limitation of our current evaluation is that the benchmarks define a single ground-truth mechanistic entity per question, whereas real biological mechanisms often involve multiple interacting components; extending BTE-RAG to multi-entity mechanistic reasoning represents an important direction for future work. Our study employed deterministic prompting to maintain comparability; exploring guided, chain-of-thought prompting strategies could further enhance complex reasoning capabilities but may simultaneously reintroduce hallucinatory risks.

Future developments of BTE-RAG may involve integration into autonomous agent systems capable of iterative querying, generation, self-critiquing, and re-querying, thus facilitating automated self-verification workflows. Expanding the underlying knowledge graph to incorporate resources such as LINCS transcriptomic signatures, tissue-specific interaction networks, and multi-omics datasets would further enrich the mechanistic coverage and broaden applicability [54]. Expanding benchmarking efforts beyond DrugMechDB to encompass open-world biomedical queries could rigorously evaluate and strengthen the capacity of BTE-RAG for reliable, contextually grounded inference. Furthermore, adopting frameworks like the Model Context Protocol could harmonize comparisons across diverse generative models, facilitate rigorous auditing, and support real-time decision-making in clinical and regulatory contexts.

In conclusion, BTE-RAG demonstrates the substantial value derived from strategically integrating explicit mechanistic evidence into biomedical language modeling workflows. By significantly improving answer accuracy, interpretability, and computational efficiency, this approach provides a scalable, transparent, and robust foundation for future biomedical AI systems, effectively balancing accuracy, affordability, and trustworthiness.

Availability of source code and requirements

Project name: BTE-RAG

Project home page: https://github.com/janjoy/BTE-RAG

Operating system(s): Linux (Ubuntu)

Programming language: Python 3.10.9

Other requirements: OpenAI Python SDK (openai==1.61.0); additional dependencies listed in requirements file in the repository.

License: Apache-2.0 license

RRID:SCR_027297

Supplementary Material

giag007_Supplemental_File
giag007_Authors_Response_To_Reviewer_Comments_original_submission
giag007_GIGA-D-25-00307_Original_Submission
giag007_GIGA-D-25-00307_Revision_1
giag007_Reviewer_1_Report_Original_Submission

Christopher Tabone, Ph.D. -- 9/3/2025

giag007_Reviewer_1_Report_Revision_1

Christopher Tabone, Ph.D. -- 12/27/2025

giag007_Reviewer_2_Report_Original_Submission

Sajib Acharjee Dip -- 9/4/2025

giag007_Reviewer_2_Report_Revision_1

Sajib Acharjee Dip -- 1/2/2026

Acknowledgments

We thank Willow Callaghan, Mikhael Astorga, and Karthik Soman for insightful discussions, and Everaldo Rodolpho for technical support with high-performance computing resources and server infrastructure.

Contributor Information

Janet Joy, Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA.

Andrew I Su, Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd, La Jolla, CA, 92037, USA.

Additional files

Supplementary Figure S1: Detailed pipeline for BTE-RAG.

Supplementary Figure S2: Contingency table analysis of BTE-RAG performance gains on the GeneTuring benchmark.

Supplementary Figure S3: Contingency table analysis of BTE-RAG performance gains on the mechanistic gene benchmark.

Supplementary Figure S4: Performance of BTE–RAG versus an LLM–only baseline on the gene–centric benchmark using gpt–4o–mini.

Supplementary Figure S5: Performance of BTE–RAG versus an LLM–only baseline on the gene–centric benchmark using gpt–4o.

Supplementary Figure S6: Cosine-similarity profile for the metabolite-centric benchmark using GPT-4o-mini in LLM-only mode.

Supplementary Figure S7: Distribution of answer similarities for the metabolite-centric benchmark using GPT-4o-mini in BTE-RAG mode.

Supplementary Figure S8: Distribution of answer similarities for the metabolite-centric benchmark using GPT-4o in BTE-RAG mode.

Supplementary Figure S9: Rank-ordered cosine similarities between model predictions and ground-truth answers on the metabolite-centric benchmark, across context filtering thresholds.

Supplementary Figure S10: Cosine-similarity profile for the drug-centric benchmark using GPT-4o-mini in LLM-only mode.

Supplementary Figure S11: Distribution of answer similarities for the drug-centric benchmark using GPT-4o-mini in BTE-RAG mode.

Supplementary Figure S12: Distribution of answer similarities for the drug-centric benchmark using GPT-4o in BTE-RAG mode.

Supplementary Figure S13: Rank-ordered cosine similarities between model predictions and ground-truth answers on the drug-centric benchmark, across context filtering thresholds.

Supplementary Table S1: Zero-shot entity recognition performance.

Supplementary Table S2: Statistical comparison of BTE-RAG with baseline methods on the GeneTuring gene–disease association benchmark.

Supplementary Table S3: Statistical comparison of BTE-RAG with baseline methods on the mechanistic gene benchmark.

Supplementary Table S4: Sensitivity analysis of similarity thresholds for high-fidelity drug–metabolite predictions.

Supplementary Table S5: Sensitivity analysis of similarity thresholds for high-fidelity drug–biological process-disease predictions.

Supplementary Table S6: Analysis of degradation cases across gene, metabolite, and drug–biological process benchmarks.

Supplementary Table S7: System prompts.

Author contributions

J.J. and A.S. conceived the project and proposed the benchmark curation. J.J. implemented the code, created the benchmarks, and wrote the manuscript. A.S. supervised the study. All authors read and approved the final manuscript.

Funding

Support for this work was provided by the National Institute on Aging (NIA, award R01AG066750), and by the National Center for Advancing Translational Sciences through the Biomedical Data Translator program (NCATS, awards 1OT2TR003427 and 1OT2TR005710) and the Scripps Research Translational Institute (SRTI, award UM1TR004407). Any opinion expressed in this document does not necessarily reflect the views of NIA, NCATS, SRTI, NIH (National Institutes of Health), individual Translator team members, or affiliated organizations and institutions.

Data availability

All BTE retrieval caches, LLM outputs, and evaluation metrics used in this study are archived in our public GitHub repository [55]. Cached files include complete API responses with query identifiers, entity identifiers, predicates, source names, and all intermediate analysis results.

Competing interests

The authors declare no competing interests.

References

  • 1. Hou  W, Ji  Z. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nat Methods. 2024;21:1462–1465. 10.1038/s41592-024-02235-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Rives  A, Meier  J, Sercu  T  et al.  Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci. 2021;118:e2016239118. 10.1073/pnas.2016239118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Lin  Z, Akin  H, Rao  R  et al.  Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–1130. 10.1126/science.ade2574. [DOI] [PubMed] [Google Scholar]
  • 4. Meier  J. Language models enable zero-shot prediction of the effects of mutations on protein function. 2021. bioRxiv. Preprint at 2021.07.09.450648. 10.1101/2021.07.09.450648. [DOI]
  • 5. Zheng Y, Koh HY, Ju J et al. Large language models for drug discovery and development. Patterns. 2025. 10.1016/j.patter.2025.101346. [DOI] [PMC free article] [PubMed]
  • 6. Miller  K, Bedrick  S, Lu  Q  et al.  Dynamic few-shot prompting for clinical note section classification using lightweight, open-source large language models. J Am Med Inform Assoc. 2025;32:1164–1173. 10.1093/jamia/ocaf084. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Ji  Z, Lee  N, Frieske  R  et al.  Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55:1–38. 10.1145/3571730. [DOI] [Google Scholar]
  • 8. Vaswani  A, Shazeer  N, Parmar  N  et al.  Attention is all you need. 2023. Preprint at 10.48550/arXiv.1706.03762. [DOI]
  • 9. Kim  Y, Jeong  H, Chen  S  et al.  Medical hallucinations in foundation models and their impact on healthcare. 2025. Preprint at 10.48550/arXiv.2503.05777. [DOI]
  • 10. Idnay  B, Xu  Z, Adams  WG  et al.  Environment scan of generative AI infrastructure for clinical and translational science. npj Health Syst. 2025;2:4. 10.1038/s44401-024-00009-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Ibrahim  M, Khalil  YA, Amirrajab  S  et al.  Generative AI for synthetic data across multiple medical modalities: a systematic review of recent developments and challenges. Comput Biol Med. 2025;189:109834. 10.1016/j.compbiomed.2025.109834. [DOI] [PubMed] [Google Scholar]
  • 12. Maynez  J, Narayan  S, Bohnet  B  et al.  On faithfulness and factuality in abstractive summarization. 2020. Preprint at 10.48550/arXiv.2005.00661. [DOI]
  • 13. Yang  Y, Jin  Q, Huang  F  et al.  Adversarial attacks on large language models in medicine. 2024. Preprint at 10.48550/arXiv.2406.12259. [DOI]
  • 14. Luo  R, Sun  L, Xia  Y  et al.  BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings Bioinf. 2022;23:bbac409. 10.1093/bib/bbac409. [DOI] [PubMed] [Google Scholar]
  • 15. Kojima  T, Gu  SS, Reid  M  et al.  Large language models are zero-shot reasoners. 2023. Preprint at 10.48550/arXiv.2205.11916. [DOI]
  • 16. Brown  TB, Mann  B, Ryder  N  et al.  Language models are few-shot learners. 2020. Preprint at 10.48550/arXiv.2005.14165. [DOI]
  • 17. Lewis  P, Perez  E, Piktus  A  et al.  Retrieval-augmented generation for knowledge-intensive NLP tasks. 2021. Preprint at 10.48550/arXiv.2005.11401. [DOI]
  • 18. Izacard  G, Grave  E. Leveraging passage retrieval with generative models for open domain question answering. 2021. Preprint at 10.48550/arXiv.2007.01282. [DOI]
  • 19. Karpukhin  V, Oguz  B, Min  S  et al.  Dense passage retrieval for open-domain question answering. 2020. Preprint at 10.48550/arXiv.2004.04906. [DOI]
  • 20. Zhang  G, Xu  Z, Jin  Q  et al.  Leveraging long context in retrieval augmented language models for medical question answering. NPJ Digital Med. 2025;8:239. 10.1038/s41746-025-01651-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Soman  K, Rose  PW, Morris  JH  et al.  Biomedical knowledge graph-optimized prompt generation for large language models. Bioinformatics. 2024;40:btae560. 10.1093/bioinformatics/btae560. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Hou  W, Ji  Z. GeneTuring tests GPT models in genomics. 2023. bioRxiv. Preprint at 10.1101/2023.03.11.532238. [DOI]
  • 23. Bizon  C, Cox  S, Balhoff  J  et al.  ROBOKOP KG and KGB: integrated knowledge graphs from federated sources. J Chem Inf Model. 2019;59:4968–4973. 10.1021/acs.jcim.9b00683. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Mungall  CJ, McMurry  JA, Köhler  S  et al.  The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2017;45:D712–D722. 10.1093/nar/gkw1128. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Pan  S, Luo  L, Wang  Y  et al.  Unifying large language models and knowledge graphs: a roadmap. IEEE Trans Knowl Data Eng. 2024;36:3580–3599. 10.1109/TKDE.2024.3352100. [DOI] [Google Scholar]
  • 26. Evangelista  JE, Clarke  DJB, Xie  Z  et al.  Toxicology knowledge graph for structural birth defects. Commun Med. 2023;3:98. 10.1038/s43856-023-00329-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Callaghan  J, Xu  CH, Xin  J  et al.  BioThings Explorer: a query engine for a federated knowledge graph of biomedical APIs. Bioinformatics. 2023;39:btad570. 10.1093/bioinformatics/btad570. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Carbon  S, Douglass  E, Good  BM  et al.  The Gene ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021;49:D325–D334. 10.1093/nar/gkaa1113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Knox  C, Wilson  M, Klinger  CM  et al.  DrugBank 6.0: the DrugBank Knowledgebase for 2024. Nucleic Acids Res. 2024;52:D1265–D1275. 10.1093/nar/gkad976. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Fecho  K, Thessen  AE, Baranzini  SE  et al.  Progress toward a universal biomedical data translator. Clin Transl Sci. 2022;15:1838–1847. 10.1111/cts.13301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Gonzalez-Cavazos  AC, Tanska  A, Mayers  M  et al.  DrugMechDB: a curated database of drug mechanisms. Sci Data. 2023;10:632. 10.1038/s41597-023-02534-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. GPT-4o System Card. https://openai.com/index/gpt-4o-system-card/. Accessed 17 December 2025.
  • 33. Deka  P, Jurek-Loughrey  A, Padmanabhan  D. Improved methods to aid unsupervised evidence-based fact checking for online heath news. J Data Intell. 2022;3:474–504. 10.26421/JDI3.4-5. [DOI] [Google Scholar]
  • 34. Deka P, Jurek-Loughrey A, Padmanabhan D. Evidence extraction to validate medical claims in fake news detection. In: Traina A, ed. Health Information Science, vol. 13705. Cham: Springer Nature Switzerland; 2022: 3–15. [Google Scholar]
  • 35. pritamdeka/S-PubMedBert-MS-MARCO · Hugging Face. https://huggingface.co/pritamdeka/S-PubMedBert-MS-MARCO. Accessed 17 December 2025.
  • 36. pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb · Hugging Face. https://huggingface.co/pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb. Accessed 17 December 2025.
  • 37. Jin  Q, Yang  Y, Chen  Q  et al.  GeneGPT: augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics. 2024;40:btae075. 10.1093/bioinformatics/btae075. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Shang X, Liao X, Ji Z, Hou W. Benchmarking large language models for genomic knowledge with GeneTuring. Brief Bioinform. 2025. 10.1093/bib/bbaf492. [DOI] [PMC free article] [PubMed]
  • 39. Unni  DR, Moxon  SAT, Bada  M  et al.  Biolink Model: a universal schema for knowledge graphs in clinical, biomedical, and translational science. Clin Transl Sci. 2022;15:1848–1855. 10.1111/cts.13302. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Wu  C, Macleod  I, Su  AI. BioGPS and MyGene.Info: organizing online, gene-centric information. Nucleic Acids Res. 2013;41:D561–D565. 10.1093/nar/gks1114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. OpenAI, Achiam J, Adler S et al. GPT-4 technical report. 2024. Preprint at 10.48550/arXiv.2303.08774. [DOI]
  • 42. GPT-4o mini Model | OpenAI API. https://platform.openai.com. Accessed 17 December 2025.
  • 43. GPT-4o Model | OpenAI API. https://platform.openai.com.  Accessed 17 December 2025.
  • 44. Chen  Y, Gao  J, Petruc  M  et al.  Iterative Prompt Refinement for Mining Gene Relationships from ChatGPT.
  • 45. Marvin  G, Hellen Raudha  N, Jjingo  D  et al.  Prompt engineering in large language models. In: Data Intelligence and Cognitive Informatics. ICDICI 2023. Algorithms for Intelligent Systems. 2024, 387–402. 10.1007/978-981-99-7962-2_30. [DOI]
  • 46. Sahoo  P, Singh  AK, Saha  S  et al.  A systematic survey of prompt engineering in large language models: techniques and applications. 2025. Preprint at 10.48550/arXiv.2402.07927. [DOI]
  • 47. Li  M, Kilicoglu  H, Xu  H  et al.  BiomedRAG: a retrieval augmented large language model for biomedicine. J Biomed Inform. 2025;162:104769. 10.1016/j.jbi.2024.104769. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Morris  JH, Soman  K, Akbas  RE  et al.  The scalable precision medicine open knowledge engine (SPOKE): a massive knowledge graph of biomedical information. Bioinformatics. 2023;39:btad080. 10.1093/bioinformatics/btad080. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Jin  Q, Dhingra  B, Liu  Z  et al.  PubMedQA: a dataset for biomedical research question answering. 2019. Preprint at 10.48550/arXiv.1909.06146. [DOI]
  • 50. Kwiatkowski  T, Palomaki  J, Redfield  O  et al.  Natural questions: a benchmark for question answering research. Trans Assoc Comput Linguist. 2019;7:452–466. [Google Scholar]
  • 51. Davis  AP, Wiegers  TC, Johnson  RJ  et al.  Comparative Toxicogenomics Database (CTD): update 2023. Nucleic Acids Res. 2023;51:D1257–D1262. 10.1093/nar/gkac833. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Schriml  LM, Arze  C, Nadendla  S  et al.  Disease ontology: a backbone for disease semantic integration. Nucleic Acids Res. 2012;40:D940–D946. 10.1093/nar/gkr972. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Bateman  A, Martin  M-J, Orchard  S  et al.  UniProt: the Universal protein knowledgebase in 2025. Nucleic Acids Res. 2025;53:D609–D617. 10.1093/nar/gkae1010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Pilarczyk  M, Fazel-Najafabadi  M, Kouril  M  et al.  Connecting omics signatures and revealing biological mechanisms with iLINCS. Nat Commun. 2022;13:4678. 10.1038/s41467-022-32205-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. BTE-RAG/data/analysis_results at main · janjoy/BTE-RAG. GitHub. https://github.com/janjoy/BTE-RAG/tree/main/data/analysis_results. Accessed 17 December 2025.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

  1. Jin  Q, Dhingra  B, Liu  Z  et al.  PubMedQA: a dataset for biomedical research question answering. 2019. Preprint at 10.48550/arXiv.1909.06146. [DOI]

Supplementary Materials

giag007_Supplemental_File
giag007_Authors_Response_To_Reviewer_Comments_original_submission
giag007_GIGA-D-25-00307_Original_Submission
giag007_GIGA-D-25-00307_Revision_1
giag007_Reviewer_1_Report_Original_Submission

Christopher Tabone, Ph.D. -- 9/3/2025

giag007_Reviewer_1_Report_Revision_1

Christopher Tabone, Ph.D. -- 12/27/2025

giag007_Reviewer_2_Report_Original_Submission

Sajib Acharjee Dip -- 9/4/2025

giag007_Reviewer_2_Report_Revision_1

Sajib Acharjee Dip -- 1/2/2026

Data Availability Statement

All BTE retrieval caches, LLM outputs, and evaluation metrics used in this study are archived in our public GitHub repository [55]. Cached files include complete API responses with query identifiers, entity identifiers, predicates, source names, and all intermediate analysis results.


Articles from GigaScience are provided here courtesy of Oxford University Press
