NPJ Digital Medicine. 2026 Jan 21;9:152. doi: 10.1038/s41746-025-02336-0

HealthContradict: Evaluating biomedical knowledge conflicts in language models

Boya Zhang 1, Alban Bornet 1, Rui Yang 2, Nan Liu 3,4, Douglas Teodoro 1
PMCID: PMC12901028  PMID: 41565976

Abstract

How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict, an expert-verified dataset comprising 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents presenting contradictory stances. We consider several prompt settings, including correct, incorrect or contradictory context, and measure their impact on model outputs. Compared to existing medical question-answering evaluation benchmarks, HealthContradict provides greater distinctions of language models’ contextual reasoning capabilities. Our experiments show that the strength of fine-tuned biomedical language models lies not only in their parametric knowledge from pretraining, but also in their ability to exploit correct context while resisting incorrect context.

Subject terms: Computational biology and bioinformatics, Health care, Mathematics and computing

Introduction

Language models are susceptible to generating reasonable yet nonfactual content1. This issue raises concerns about the reliability of language models in providing medical advice, as there are significant risks when they generate convincing but incorrect information, which could influence people’s health-related decisions2. Additionally, knowledge and misinformation in the biomedical domain both evolve rapidly, especially during medical crises3, with unverified information spreading quickly across the internet4, impacting pre-training and in-context learning for these models.

Existing methods to mitigate misinformation use static fact sources for hallucination detection5,6 or verified evidence to refute false claims7–10. These strategies have often been combined in retrieval-augmented generation (RAG)11 pipelines, as one of the most effective methods to attenuate hallucinations in the biomedical domain12. Despite some attempts considering information quality13, current approaches for biomedical RAG14–16 primarily focus on improving relevance in the retrieval pipeline17–21. However, in the real world, contradictory sources could be used to verify the same claim, leading to knowledge conflicts22 in RAG paradigms. For instance, consider the situation illustrated in Fig. 1, where a language model has its own parametric knowledge, i.e., learned during pre-training, stating that coffee aids in weight loss. However, when utilizing an in-context learning approach, the model is presented with two contradictory passages as contextual knowledge, i.e., information from the external source material, while answering the question. In this case, the conflicts arise from the contradictions between Passage 1 and Passage 2, as well as between the model’s parametric knowledge and Passage 2.

Fig. 1. Biomedical knowledge conflicts confuse language models.


Conflicting knowledge confuses the model when it answers a health question.

The behavior of language models is influenced by knowledge conflicts22, which are the contradictions within parametric knowledge learned at training time23 and contextual knowledge given at inference time24–27. Language models are receptive to coherent contextual knowledge when it conflicts with parametric knowledge28. Multi-turn persuasive conversations as contextual knowledge can even manipulate language models’ factual parametric knowledge29. On the other hand, language models are biased toward parametric knowledge when the contextual knowledge is self-contradictory28,30. They also have difficulty generating answers that reflect the self-contradiction of the contextual knowledge31, especially for implicit conflicts that require reasoning32. Besides, language models struggle with self-contradictions in long documents that require more nuance and context33.

Various context-aware methods have been proposed to overcome language models’ confusion regarding knowledge conflicts. Context-aware decoding overrides a model’s parametric knowledge when it contradicts the contextual knowledge34, while ContextCite traces back the parts of the contextual knowledge that led a model to generate a particular statement to improve the explainability of language models35. However, these methods focus on either parametric or contextual knowledge alone. COMBO36, on the other hand, leverages both parametric and contextual knowledge by using discriminators trained on silver labels to assess passage compatibility. In addition, DisentQA37 trains a model that predicts two types of answers for a given question, one based on contextual knowledge and one on parametric knowledge. Contrastive decoding further maximizes the difference between logits under knowledge conflicts and calibrates the model’s probability of the correct answer30. Solutions have also been proposed to mitigate the harmful behavior of language models. At the training phase, counterfactual and irrelevant contexts are injected into standard supervised datasets to perform knowledge-aware fine-tuning, enhancing language models’ robustness38; in-context pre-training likewise improves performance on complex contextual reasoning. At the inference phase, defense strategies of misinformation detection, vigilant prompting, and reader ensembles mitigate misinformation generated by language models39; query augmentation searches for robust answers to defend against poisoning attacks40; and fact duration prediction identifies which facts are prone to rapid change, helping models avoid reciting outdated information41. Current approaches prioritize mitigating either contextual conflicts or harmful behaviors of language models. However, both context-awareness and truthfulness are important for improving the answers of language models in the biomedical domain.

Language models’ behavior on general-domain knowledge conflicts has been evaluated with synthetic datasets featuring explicit, simple contradictions31,33,42–44, as well as real-world datasets featuring implicit, complex contradictions32,45. While research on knowledge conflicts primarily focuses on general domains, its impact on the biomedical domain remains underexplored. Conflicts in biomedical knowledge are complex due to the domain’s distinctive lexicon and the complex syntax of long sentences46. ManConCorpus47 collected contradictory claims from biomedical literature addressing 24 cardiovascular research questions. Meanwhile, COVID-19 NLI48 automatically identified contradictory claims about COVID-19 drug efficacy from a subset of CORD-1949. In addition, ClashEval44 sampled drug information pages from UpToDate.com and modified the numerical drug dosages with GPT-4o to create contradictions. On a broader range of medical topics, MedNLI50 was manually curated by creating contradicting, entailing, and neutral sentences paired with clinical descriptions. In contrast to manually curated sentences, another study51 focused on identifying naturally occurring sentences containing clinical outcomes and detecting potential contradictions using the SNOMED-CT ontology52. These datasets aim to identify contradictions in biomedical sentences but lack evidence to determine which claims are correct. Furthermore, systems that identify sentence-level contradictions are not helpful when contradictions are conveyed across multiple sentences in longer texts.

To address these limitations, we propose HealthContradict, a dataset consisting of 920 unique instances, each comprising a health-related question and two documents with contradictory stances. In addition, each instance has a factual answer supported by scientific evidence. Using HealthContradict, we evaluated several language models, from 1B to 8B parameters, including general-domain models and their biomedical counterparts, on answering health-related questions in the presence of knowledge conflicts. Given a biomedical context and a language model, our benchmark evaluates: (i) how language models answer biomedical questions in the presence of knowledge conflicts; and (ii) how the biomedical context provided to the language models acts as a causal factor in inducing the answer. To do so, we include correct, incorrect, or contradictory context in different prompt scenarios, and assess models’ accuracy and probability distribution in answering health-related questions.

Our contributions are the following: (i) a novel dataset—HealthContradict—designed to evaluate language models in the presence of conflicting information in the biomedical domain; (ii) a comprehensive evaluation of language models against HealthContradict, assessing their ability to reason over long, conflicting biomedical contexts using interpretable quantitative metrics; and (iii) a comparison of general-domain language models with their fine-tuned biomedical counterparts, revealing that the strength of the latter lies in their ability to exploit correct contextual knowledge while resisting incorrect contextual knowledge.

Results

HealthContradict benchmark

We created the HealthContradict benchmark, a dataset of 920 instances. Each instance is a health-related question with a factual answer supported by scientific evidence, paired with two documents presenting contradictory stances. Each document appears only once in the entire dataset to ensure unbiased evaluation. Table 1 shows an example: for a given health-related question (“Can coffee help you lose weight?”), two contradictory documents are provided (yes: “... useful for weight loss...” vs. no: “... Of course not!...”), together with the factual answer (yes) supported by scientific evidence (“... caffeine intake might promote weight, BMI and body fat reduction...”).

Table 1.

An example instance from HealthContradict dataset

Field Content Stance
Question Can coffee help you lose weight?
Correct document ... Green coffee can be taken before or after meals, and it is also known to be useful for weight loss... Yes
Incorrect document ... Does coffee can magically lose you weight? Of course not!... No
Scientific evidence ... Overall, the current meta-analysis demonstrated that caffeine intake might promote weight, BMI and body fat reduction. Yes

Each instance includes a health question, two contradictory documents, the factual answer, and supporting scientific evidence.

In total, the dataset contains 81 questions, each addressing a health issue and a potential treatment. These issues span 50 disease and condition categories, such as “Cancer”, “Low back pain in adults”, and “AIDS”, as well as one general well-being category, “Other”, which covers topics such as weight management. Figure 2 illustrates the 10 disease and condition categories most commonly addressed by the health questions. The complete list of questions and their corresponding categories is provided in Supplementary Table S4.

Fig. 2. Example disease and condition categories in HealthContradict.

Fig. 2

The 10 most common disease and condition categories are represented in the dataset’s health questions, covering a diverse range of clinical topics.

The questions are associated with 1840 documents (920 with stance yes and 920 with stance no). The factual answers are “yes” for 26.5% and “no” for 73.5% of the instances. We chose not to balance the dataset, as real-world scenarios are often imbalanced, and our goal is to evaluate models under practical use cases. The average document length is 2347 words, ranging from 23 to 30,444 words. See Supplementary Tables S1 and S2 for further information on the dataset.

To investigate how language models respond to real-world biomedical knowledge conflicts, we develop five prompt templates to evaluate their performance under different question-answering scenarios: one without context (Prompt NC) and four with varying context configurations. As illustrated in Table 2, for each annotated instance from the HealthContradict dataset, we generate five question prompts based on these pre-defined templates. These prompts were presented to each model independently, and the model did not retain any state or output from other prompts. The controlled set of prompt templates enables us to evaluate the effect of correct, incorrect, or conflicting context without introducing additional variability in the phrasing of the prompt. This design choice is consistent with prior work, such as WikiContradict53, which also employed minimal instructional phrasing to perform comparative analysis. Specifically, Prompt NC evaluates models’ parametric knowledge (control template), while Prompts CC and IC examine their performance with a single document provided as context (correct and incorrect, respectively). Prompts CIC and ICC, in turn, assess a model’s ability to handle health questions in the presence of conflicting contextual information; the difference between them evaluates whether the position of the contradictory document influences the model’s answer. For each of the 920 instances, we applied the four context-based prompt templates, and we additionally included one Prompt NC per question (81 prompts), leading to a total of 3761 prompts used for comparative evaluation. We focused on yes/no questions to enable controlled analysis of model behavior under conflicting contexts, as binary classification offers a clear setting for evaluation in this scenario.

Table 2.

Prompt templates for contextual evaluation in HealthContradict

Prompt template
NC Instruction: Answer the following question with only YES or NO based on your parametric knowledge. Question: {Question}
CC Instruction: Answer the following question with only YES or NO based on the given contextual knowledge. Question: {Question} Context: {Correct Document}
IC Instruction: Answer the following question with only YES or NO based on the given contextual knowledge. Question: {Question} Context: {Incorrect Document}
CIC Instruction: Answer the following question with only YES or NO based on the given contextual knowledge. Question: {Question} Context: {Correct Document} {Incorrect Document}
ICC Instruction: Answer the following question with only YES or NO based on the given contextual knowledge. Question: {Question} Context: {Incorrect Document} {Correct Document}

NC no context, CC correct context, IC incorrect context, CIC correct + incorrect context, ICC incorrect + correct context.
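The five templates in Table 2 can be assembled programmatically. The sketch below is illustrative only: the field names (`question`, `correct_doc`, `incorrect_doc`) are hypothetical, not the dataset's actual schema, and the instruction strings follow the wording shown in Table 2.

```python
# Illustrative sketch of the five HealthContradict prompt variants.
# Field names are hypothetical; instruction wording follows Table 2.
INSTR_PARAMETRIC = ("Answer the following question with only YES or NO "
                    "based on your parametric knowledge.")
INSTR_CONTEXTUAL = ("Answer the following question with only YES or NO "
                    "based on the given contextual knowledge.")

def build_prompts(question, correct_doc, incorrect_doc):
    """Return the five prompt variants (NC, CC, IC, CIC, ICC)."""
    def ctx(*docs):
        return (f"{INSTR_CONTEXTUAL} Question: {question} "
                f"Context: {' '.join(docs)}")
    return {
        "NC": f"{INSTR_PARAMETRIC} Question: {question}",
        "CC": ctx(correct_doc),
        "IC": ctx(incorrect_doc),
        "CIC": ctx(correct_doc, incorrect_doc),
        "ICC": ctx(incorrect_doc, correct_doc),
    }

# 920 instances x 4 context prompts + 81 unique NC prompts = 3761 total
assert 920 * 4 + 81 == 3761
```

Because Prompt NC depends only on the question, it is generated once per question (81 prompts) rather than once per instance, which yields the 3761-prompt total reported above.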

Baseline models

Our baseline selection was motivated by two objectives: (i) to evaluate whether biomedical fine-tuning enhances performance within the biomedical domain and (ii) to assess whether increasing model size leads to performance gains. We also selected language models with extended context lengths to process long documents. Details of the selected models are shown in Table 3. Each biomedical model was obtained by fine-tuning the general-domain model presented in the same row. Instruct refers to models that have been fine-tuned to follow user instructions (i.e., instruction-tuned). We do not fine-tune any models and perform zero-shot inference with the selected baselines. We evaluate open-source language models ranging from 1B to 8B parameters for reproducibility in resource-limited healthcare settings.

Table 3.

Domains, parameter sizes and context lengths of selected language models

General domain Biomedical Size Context length
Llama-3.2-1B-Instruct64 BioMed-Llama-3.2-1B65 1B 128K
Qwen2.5-7B66 Meditron3-Qwen2.5-7B67 7B 131K
Llama-3.1-8B-Instruct68 Meditron3-8B69 8B 128K

Baseline benchmarks

We compare our benchmark to three widely used multiple-choice question-answering benchmarks: MedMCQA18, MedQA-4-Option17, and PubMedQA54. MedMCQA18 and MedQA-4-Option17 are derived from medical exam questions and evaluate a model's performance on clinical medical knowledge. PubMedQA54 is derived from PubMed55 articles and evaluates a model's performance on theoretical medical knowledge.

We evaluate the selected baseline models using the Language Model Evaluation Harness56. All results are reported using accuracy, which measures the proportion of questions answered correctly. For MedMCQA18, we use the validation split, which contains 4183 four-option multiple-choice questions. For MedQA-4-Option17, we evaluate on the test split, comprising 1273 four-option multiple-choice questions. For PubMedQA54, we use the test split, which includes 500 three-option multiple-choice questions.

As shown in Table 4, the differences among the larger language models (7–8B) are minor. Moreover, the fine-tuned biomedical model MEDITRON3-QWEN2.5-7B underperforms QWEN2.5-7B. These state-of-the-art evaluation benchmarks are weak at discriminating among models’ capabilities because they primarily assess a model’s parametric knowledge.

Table 4.

Existing medical QA benchmarks show limited discriminative power across language models

Model MedMCQA MedQA PubMedQA
Llama-3.2-1B-Instruct64 41.33 39.59 60.20
BioMed-Llama-3.2-1B65 34.88 37.39 60.40
Qwen2.5-7B66 60.10 64.49 75.20
Meditron3-Qwen2.5-7B67 57.14 61.82 74.40
Llama-3.1-8B-Instruct68 56.99 60.25 74.20
Meditron3-8B69 57.83 63.00 76.80

Bold values indicate the highest accuracy among the compared models.

Evaluations on HealthContradict

We assess how language models use biomedical contextual knowledge to provide a complementary view of their actual performance. We report Accuracy and Macro F1 as evaluation metrics (definitions in Eq. (7), (8), and (9)). Accuracy is a simple measurement of the correctness of predictions, while Macro F1 offers a balanced evaluation by equally weighting the performance of each class.
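The two metrics can be computed directly from the per-instance labels. The following is a minimal sketch for binary YES/NO answers; the paper's Eq. (7)–(9) give the formal definitions, and this code only illustrates the standard formulas.

```python
# Minimal sketch of accuracy and macro F1 for binary YES/NO answers
# (standard definitions; cf. Eq. (7)-(9) in the paper).
def accuracy(y_true, y_pred):
    """Fraction of predictions matching the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=("yes", "no")):
    """Unweighted mean of per-class F1, so both classes count equally."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because the dataset is imbalanced (73.5% “no”), macro F1 penalizes a model that collapses to the majority class, which plain accuracy would partly mask.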

As shown in Table 5, when the correct context is provided, i.e., Prompt CC, MEDITRON3-8B achieves the highest accuracy (91.1%) on the HealthContradict benchmark, outperforming its parametric knowledge (Prompt NC—control) by 8.7 percentage points (p < 0.001). The second-best-performing model, MEDITRON3-QWEN2.5-7B, achieves an accuracy of 87.6%, an improvement of 3.8 percentage points when adding correct context (Prompt CC) over using only its parametric knowledge (Prompt NC—control) (p < 0.001). As expected, the worst-performing scenario for all the models is when only an incorrect context is provided (Prompt IC). For example, the correct parametric knowledge of MEDITRON3-8B is overridden by the incorrect context (i.e., a reduction of 21.6 percentage points relative to the control, p < 0.001). Interestingly, when a conflicting context is provided (Prompts CIC and ICC), all models lose performance compared to the parametric setting (Prompt NC). This effect is least pronounced in MEDITRON3-8B, with a drop in accuracy between 2.6 percentage points (Prompt ICC, p < 0.001) and 2.8 percentage points (Prompt CIC, p < 0.001).
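The paper reports p-values for paired accuracy differences between prompt settings but does not name the test in this section. One standard choice for comparing paired binary outcomes on the same instances is McNemar's exact test, sketched below purely as an illustration of how such p-values could be obtained.

```python
# Illustrative sketch: exact two-sided McNemar test for paired binary
# outcomes. NOTE: the paper does not state which test produced its
# p-values; this is one standard option, not the authors' method.
from math import comb

def mcnemar_exact(b, c):
    """p-value from discordant counts: b = correct only under setting A,
    c = correct only under setting B (same instances, two prompt settings)."""
    n = b + c
    if n == 0:
        return 1.0
    # Exact binomial tail under H0: discordant pairs split 50/50.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the instances whose correctness flips between the two settings (the discordant pairs) carry information under this test.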

Table 5.

Accuracy and macro F1(%) for language models across prompt templates on HealthContradict

Model Prompt NC Prompt CC Prompt IC Prompt CIC Prompt ICC
Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1
Llama-3.2-1B-Instruct64 38.3 36.9 28.6 24.1 26.0 20.8 27.0 22.1 27.6 22.9
BioMed-Llama-3.2-1B65 33.8 31.3 48.3 48.2 35.2 35.1 51.4 48.6 57.3 54.8
Qwen2.5-7B66 83.3 77.2 76.7 73.0 54.7 52.0 66.0 62.1 68.5 64.3
Meditron3-Qwen2.5-7B67 83.8 78.9 87.6 85.0 41.5 39.9 70.2 67.7 63.2 60.2
Llama-3.1-8B-Instruct68 77.9 76.5 70.0 69.1 37.8 37.3 59.6 59.2 54.4 54.1
Meditron3-8B69 82.4 77.2 91.1 88.0 60.8 54.2 79.6 72.5 79.8 70.9
GPT-4.1-mini70 80.2 78.0 95.5 94.5 42.3 40.8 83.3 80.1 78.8 75.6
GPT-4o71 77.5 76.1 97.5 96.9 64.5 61.1 94.5 93.1 81.1 77.7

† indicates the highest accuracy or macro F1 among the compared models.

Bold values indicate the highest accuracy or macro F1 among the compared open-source models.

Underlined values indicate the lowest accuracy or macro F1 among the compared models.

The smaller language models, LLAMA-3.2-1B-INSTRUCT and BIOMED-LLAMA-3.2-1B, have the lowest performance, with only 38.3% and 33.8% accuracy when using their parametric knowledge in Prompt NC. When these models are provided with correct context in Prompt CC, accuracy drops 9.7 percentage points for the general model (p < 0.001), which predicts “yes” most of the time (recall near 1 for that class), whereas for the biomedical model, accuracy increases by 14.5 percentage points (p < 0.001). In Prompt IC, the biomedical model BIOMED-LLAMA-3.2-1B resists the incorrect context with an accuracy 9.2 percentage points higher than the general model LLAMA-3.2-1B-INSTRUCT (p < 0.001). The biomedical model benefits from a later position of the correct document in conflicting contexts, with a 5.9 percentage point difference between Prompts CIC and ICC (p = 0.003), and shows a strong ability to exploit long biomedical context when provided with conflicting context.

The larger language models show better performance. When using only parametric knowledge, MEDITRON3-8B outperforms LLAMA-3.1-8B-INSTRUCT by 4.5 percentage points (p = 0.006), whereas MEDITRON3-QWEN2.5-7B shows a 0.5 percentage points difference from QWEN2.5-7B (p = 0.551). It is hard to tell whether biomedical domain fine-tuning has improved the models’ capacity under this setting. However, when the correct biomedical context is introduced, MEDITRON3-8B outperforms LLAMA-3.1-8B-INSTRUCT by 21.1 percentage points (p < 0.001), and MEDITRON3-QWEN2.5-7B also outperforms QWEN2.5-7B by 10.9 percentage points (p < 0.001). These results suggest that the fine-tuned biomedical models can exploit correct context much better than their general-domain counterparts.

Although introducing incorrect biomedical context reduces performance across all models, MEDITRON3-8B remains more robust and achieves 60.8% accuracy, 23.0 percentage points higher than LLAMA-3.1-8B-INSTRUCT (p < 0.001). In contrast, MEDITRON3-QWEN2.5-7B does not show the same resistance. These results suggest that the instruction-fine-tuned biomedical model MEDITRON3-8B exhibits greater robustness under misleading context than the non-instruction-fine-tuned biomedical model MEDITRON3-QWEN2.5-7B. However, MEDITRON3-QWEN2.5-7B performs 7.0 percentage points better under Prompt CIC than Prompt ICC (p < 0.001), indicating that it benefits from an earlier position of the correct document in conflicting contexts.

We also evaluated commercial LLMs. Both GPT-4.1-MINI and GPT-4O show the same performance patterns as the open-source models: they are positively influenced by correct contextual knowledge and negatively influenced by incorrect contextual knowledge. When both are present, GPT-4.1-MINI and GPT-4O perform better when the correct contextual knowledge appears before the incorrect one. Compared to the open-source biomedical models, MEDITRON3-QWEN2.5-7B outperforms both GPT-4.1-MINI and GPT-4O when only parametric knowledge is used, and MEDITRON3-8B outperforms GPT-4.1-MINI in resisting incorrect contextual knowledge. However, GPT-4O performs best across the contextual prompts.

We next focus on the two best-performing open-source models, MEDITRON3-8B and its general-domain counterpart LLAMA3.1-8B-INSTRUCT, comparing their error types and contextual reasoning abilities, and illustrating the findings with a case study.

Error types

We analyze two failure modes, with conditional failure rates detailed in the Methods section. The first, over-reliance on parametric knowledge, occurs when the model fails to update an incorrect answer even when provided with correct context (Eq. (5)). The second, vulnerability to contextual knowledge, occurs when the model initially answers correctly but changes to an incorrect answer after being shown misleading context (Eq. (6)). As shown in Fig. 3, the biomedical model MEDITRON3-8B exhibits over-reliance on parametric knowledge in 38.3% of cases (vs. 66.5% for its base model, LLAMA-3.1-8B-INSTRUCT) and is misled by incorrect context in 31.9% of cases (vs. 58.7% for its base model). Overall, the biomedical model makes fewer errors, and both models are more likely to fail due to over-reliance on parametric knowledge than due to misleading context.
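The two conditional rates can be sketched as follows. This is one plausible reading of Eq. (5)–(6) (the exact conditioning is defined in the Methods section), with illustrative variable names: each list holds per-instance correctness flags under the corresponding prompt.

```python
# Sketch of the two conditional failure rates (one reading of Eq. (5)-(6);
# see the Methods section for the formal definitions).
def over_reliance_rate(nc_correct, cc_correct):
    """P(wrong under CC | wrong under NC):
    the model fails to adopt the correct context."""
    wrong_nc = [i for i, ok in enumerate(nc_correct) if not ok]
    return sum(not cc_correct[i] for i in wrong_nc) / len(wrong_nc)

def vulnerability_rate(nc_correct, ic_correct):
    """P(wrong under IC | right under NC):
    the model is misled by the incorrect context."""
    right_nc = [i for i, ok in enumerate(nc_correct) if ok]
    return sum(not ic_correct[i] for i in right_nc) / len(right_nc)
```

Conditioning each rate on the Prompt NC outcome separates "could not be fixed by good context" from "was broken by bad context", which is what Fig. 3 contrasts.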

Fig. 3. Failure rates across error types.

Fig. 3

The biomedical model makes fewer errors, and both models are more likely to fail due to over-reliance on parametric knowledge than due to misleading context.

To further characterize these error patterns, we analyze models’ predictions across prompt templates. Figure 4a shows the percentage of context-induced answers. Using the correct biomedical context, MEDITRON3-8B switches to the correct answer on 10.9% of instances, while LLAMA3.1-8B-INSTRUCT is confused and switches to incorrect answers on 15.3% of instances. When the incorrect context is introduced, MEDITRON3-8B shows a much lower rate of switching to incorrect answers (23.5%) than LLAMA3.1-8B-INSTRUCT (40.3%). Meanwhile, when they encounter contradictory context, MEDITRON3-8B switches to correct answers in 8.8% of cases under Prompt CIC and 8.2% under Prompt ICC, and switches to incorrect answers in 6.0% (Prompt CIC) and 5.5% (Prompt ICC) of cases. LLAMA3.1-8B-INSTRUCT shows a much higher rate of confusion, switching to incorrect answers in 24.1% (Prompt CIC) and 27.8% (Prompt ICC) of cases.

Fig. 4. Consistency of predictions across prompt templates.

Fig. 4

a Impact of Context on Model Predictions reports the percentage of instances where contextual information induces a switch between correct and incorrect answers, while b Prediction Agreement between Prompt Templates presents pairwise prediction agreement across different prompt templates.

We calculate the agreement of predictions across different templates to complement the error-type analysis. As shown in Fig. 4b, for MEDITRON3-8B, the largest disagreement is between Prompts CC and IC, with an agreement of 0.66, while Prompts CIC and ICC show the closest agreement, at 0.90. For LLAMA3.1-8B-INSTRUCT, the largest disagreement is also between Prompts CC and IC, at 0.62, and Prompts CC and CIC show the closest agreement, at 0.78.

Context reasoning

As shown in Fig. 5, when there is no context, both models answer health questions with their parametric knowledge, and MEDITRON3-8B has a 0.7 percentage point Macro F1 advantage over LLAMA-3.1-8B-INSTRUCT. However, with the context of a correct document, MEDITRON3-8B outperforms by 18.9 percentage points. Although the incorrect context misleads both models, MEDITRON3-8B resists it better, leading LLAMA-3.1-8B-INSTRUCT by 16.9 percentage points. When encountering contradictory contextual knowledge, MEDITRON3-8B also outperforms LLAMA-3.1-8B-INSTRUCT, by 13.3 and 16.8 percentage points, respectively. The comparative analysis indicates that models fine-tuned for the biomedical domain can exploit correct biomedical context while resisting incorrect context. HealthContradict can differentiate models’ capacity for long-context biomedical reasoning, particularly in generating factual answers when presented with conflicting biomedical contextual knowledge.

Fig. 5. Macro F1(%) of MEDITRON3-8B and LLAMA-3.1-8B-INSTRUCT across prompt templates.

Fig. 5

Model performance varies with biomedical fine-tuning and different prompt settings.

As shown in Fig. 6, the x-axis represents the predicted probability p̂_i of the model’s answer ŷ_i on the HealthContradict dataset, ranging from 0.5 (low probability) to 1.0 (high probability). The predicted probability p̂_i is obtained by extracting the output logits and computing softmax probabilities over the candidate labels YES and NO (Eq. (1), (2) and (3)). The y-axis represents the estimated probability density f(p) (Eq. (4)), which shows how often the model produces predictions at different probability levels. A higher density value indicates that a larger fraction of predicted probabilities is concentrated within that probability range.
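For two candidate labels, the softmax in Eq. (1)–(3) reduces to a logistic function of the logit difference, and the density f(p) in Eq. (4) can be estimated with a kernel density estimator. The sketch below illustrates both; the bandwidth value is an assumption for illustration, not the paper's setting.

```python
# Sketch of the probability computation behind Fig. 6.
# The softmax over {YES, NO} reduces to a logistic of the logit gap
# (cf. Eq. (1)-(3)); the Gaussian KDE bandwidth h is illustrative only.
import math

def answer_probability(logit_yes, logit_no):
    """Return (predicted label, its probability); probability is in [0.5, 1.0]."""
    p_yes = 1.0 / (1.0 + math.exp(logit_no - logit_yes))
    return ("YES", p_yes) if p_yes >= 0.5 else ("NO", 1.0 - p_yes)

def density_estimate(probs, h=0.05):
    """Gaussian kernel density estimate f(p) over predicted probabilities
    (cf. Eq. (4)); returns a callable density function."""
    n = len(probs)
    z = n * h * math.sqrt(2.0 * math.pi)
    return lambda p: sum(math.exp(-0.5 * ((p - q) / h) ** 2) for q in probs) / z
```

By construction the winning label's probability never falls below 0.5, which is why the x-axis in Fig. 6 spans [0.5, 1.0].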

Fig. 6. Model probability distributions across prompt templates.

Fig. 6

The probability distributions indicate how the model modulates its prediction probability based on the provided context.

LLAMA-3.1-8B-INSTRUCT exhibits consistently high probability across all templates, with distributions concentrated near 1.0, suggesting high-probability predictions even when the provided context is factually incorrect or contradictory. In contrast, MEDITRON3-8B shows an adaptive probability. For Prompt CC, which includes correct biomedical context, MEDITRON3-8B shows a right shift in its probability scores, indicating increased certainty in its predictions. For Prompt IC, which includes an incorrect biomedical context, the probability scores shift left, indicating decreased certainty in its predictions. Furthermore, when presented with conflicting contextual knowledge in Prompt CIC and ICC, MEDITRON3-8B has a broader spread of predicted probabilities. The probability distributions indicate that the biomedical domain-adapted model modulates its prediction probability based on the factuality of the provided context.

We further examine how the models’ contextual reasoning abilities change as context length increases. We partitioned HealthContradict into four groups of equal sample size based on the context length within each prompt template. As shown in Fig. 7, both models perform best when the context length is short, but MEDITRON3-8B is more robust than its base model on longer contexts.
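The equal-size partition used for the Fig. 7 analysis can be sketched as sorting instances by context length and cutting at the quartile boundaries. The function and its arguments are illustrative, not the authors' code.

```python
# Illustrative sketch of the equal-size, length-based partition behind Fig. 7.
def quartile_groups(instances, length_of):
    """Split instances into four equal-size groups by ascending context
    length; length_of maps an instance to its context length in tokens/words."""
    ordered = sorted(instances, key=length_of)
    n = len(ordered)
    cuts = [round(i * n / 4) for i in range(5)]
    return [ordered[cuts[i]:cuts[i + 1]] for i in range(4)]
```

Equal-size (rather than equal-width) bins keep the per-group macro F1 estimates comparably stable, since each group contains the same number of instances.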

Fig. 7. Impact of input length on macro-F1.

Fig. 7

Both models perform better on shorter contexts, and MEDITRON3-8B is more robust on longer contexts.

We then present the context reasoning in an interpretable format through a case study for the question “Can cell phones cause cancer?”, illustrated in Table 6. According to scientific evidence, the factual answer is no. Using parametric knowledge, both MEDITRON3-8B and LLAMA-3.1-8B-INSTRUCT predict the correct label (ŷ = no) with probability scores of 0.74 and 0.72, respectively. Adding the correct context, MEDITRON3-8B increases its probability to 0.96, while LLAMA-3.1-8B-INSTRUCT reduces its probability to 0.57. Adding the incorrect context, MEDITRON3-8B maintains the correct answer with moderate probability (0.59), whereas LLAMA-3.1-8B-INSTRUCT is misled and predicts yes with high probability (0.87). With the contradictory context, both models perform better when the correct information appears later: MEDITRON3-8B gives the correct answer with probability scores of 0.62 (Prompt CIC) and 0.86 (Prompt ICC), whereas LLAMA-3.1-8B-INSTRUCT fails to identify the correct label in Prompt CIC and only recovers it in Prompt ICC, with a lower probability score of 0.58. The case study shows, in an interpretable way, the fine-tuned biomedical language model’s capability to better integrate correct contextual knowledge while resisting incorrect knowledge in the answer.

Table 6.

Model probability scores for MEDITRON3-8B and LLAMA-3.1-8B-INSTRUCT on the question “Can cell phones cause cancer?”


Discussion

The HealthContradict benchmark evaluates language models’ robustness when encountering biomedical knowledge conflicts. Unlike conventional QA tasks, our use-case-oriented design assumes that users may provide incorrect context and often cannot verify the factual accuracy of retrieved documents. The Prompt IC setting specifically tests whether the model follows the instruction to use context or defaults to parametric knowledge. Our findings show that biomedical domain-adapted language models can outperform general-purpose models in biomedical applications when users provide incomplete or incorrect information and lack the expertise to verify factual accuracy. This work shows promising directions for developing safer and more inclusive digital health systems. However, the current accuracy is not yet sufficient for communities with limited medical expertise to rely on. The results demonstrate the critical role of human experts and accountability, especially in settings where misinformation can have serious consequences.

To ensure reproducibility in resource-limited healthcare environments, we focused on language models with 1B to 8B parameters. While larger models such as ChatGPT are of considerable interest, their limited transparency, fine-tuning restrictions, and unresolved privacy concerns limit their suitability for clinical deployment.

This study has limitations. We were unable to perform experiments on larger language models because of resource constraints. Our analysis did not fully capture the reasoning processes behind model predictions in natural language, as model-generated justifications may not reflect the actual decision process57, and we had limited access to human evaluators. Our prompt templates are grounded in prior work53. However, differences exist between Prompt NC (using only parametric knowledge) and the other four prompts (different formats of contextual knowledge), which could influence model performance. While the benchmark is constrained by the scope of the TREC resources and may not fully capture evolving public health issues, it approximates real-world scenarios as closely as possible. We relied on the existing TREC annotations and did not conduct a clinical risk analysis due to limited access to clinicians. The question topics were selected by the TREC organizers, and the documents were judged by NIST assessors following the official guidelines58. NIST does not disclose the number of assessors used in the TREC Health Misinformation track.

In conclusion, we present HealthContradict, a new benchmark that uses interpretable quantitative metrics to evaluate language models’ ability to reason over long and conflicting biomedical contexts. Compared to state-of-the-art medical QA benchmarks, HealthContradict better captures differences in model performance. In our evaluation, language models adapted to the biomedical domain show improved ability to (i) leverage correct contextual information, (ii) resist incorrect contextual information, and (iii) adjudicate between conflicting contextual information.

Methods

Data collection

Our benchmark uses expert-annotated questions and documents from the TREC Health Misinformation Tracks 2019, 2021, and 2022 (refs. 59–61). The selected tracks focused on questions from people seeking health advice online. Each question pairs a health treatment with a health issue. The document pools are ClueWeb12-B13 (ref. 62) for the 2019 track and the no-clean version of the C4 dataset63 for the 2021 and 2022 tracks. Experts annotated each question with a factual answer supported by a separate, credible webpage that referenced relevant scientific evidence. Web documents were retrieved using either manual or automated methods and were annotated by experts for relevance, efficacy, and credibility. We excluded the 2020 track because it is incompatible with the settings of the other years: it focuses on COVID-19 and uses CommonCrawl News (January–April 2020) as the document pool.

To the best of our knowledge, the TREC Health Misinformation tracks remain the only publicly available resource that provides (1) expert-curated health questions with ground-truth answers and supporting scientific evidence, and (2) document pools containing both supporting and refuting evidence, with verdicts assigned by experts. These components are essential for constructing health questions with ground-truth answers and pairs of contradictory documents to evaluate biomedical knowledge conflicts in language models.

In HealthContradict, we focus on relevant and credible documents whose efficacy annotations indicate whether the document supports the answer to the query. We unified labels by mapping the 2019 annotations “effective”/“ineffective” and the 2021 annotations “supportive”/“dissuades” to the “yes”/“no” format, consistent with the 2022 labels. If two documents for the same question take opposite stances (one yes, one no), they are considered a contradictory pair. Moreover, we define a document as correct if its stance aligns with the scientific evidence and incorrect if its stance contradicts it.

The original collections include 130 expert-annotated questions (50 in 2019, 35 in 2021, and 45 in 2022). We retained questions with yes/no stance annotations that have both supporting and refuting documents, reducing the set to 110 questions. We then paired supporting and refuting documents into contradiction pairs, ensuring that each document appears only once, as some documents are associated with multiple questions. Questions that could not form at least one contradiction pair due to insufficient unique documents were excluded. This results in a final set of 81 questions and 920 pairs of contradictory documents.
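The filtering-and-pairing procedure can be sketched as follows. The tuple-based annotation format, the function name, and the rule of assigning a document to the first question it appears with are illustrative assumptions, not the TREC schema or the authors' actual pipeline.

```python
from itertools import product

def build_contradiction_pairs(annotations):
    """Form contradiction pairs of supporting ('yes') and refuting ('no')
    documents per question.

    `annotations` is a list of (question_id, doc_id, stance) tuples;
    this flat format is an illustrative assumption.
    """
    # Assign each document to the first question it appears with, so that
    # a document associated with several questions is used only once.
    doc_to_q, stance_of = {}, {}
    for qid, doc_id, stance in annotations:
        doc_to_q.setdefault(doc_id, qid)
        stance_of[doc_id] = stance

    per_q = {}
    for doc_id, qid in doc_to_q.items():
        per_q.setdefault(qid, {"yes": [], "no": []})[stance_of[doc_id]].append(doc_id)

    pairs = []
    for qid, docs in per_q.items():
        # Questions lacking at least one supporting and one refuting
        # document cannot form a contradiction pair and are dropped.
        pairs.extend((qid, sup, ref) for sup, ref in product(docs["yes"], docs["no"]))
    return pairs
```

In this sketch, a question with m supporting and n refuting documents yields m × n pairs; the paper's exact pairing rule may differ.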

Answer prediction

Each prompt xi is tokenized and processed. We extract the output logits z at the final token position and compute softmax probabilities over the candidate labels YES and NO:

$$p(y_i \mid x_i) = \frac{e^{z_{y_i}}}{e^{z_{\mathrm{YES}}} + e^{z_{\mathrm{NO}}}} \quad \text{for } y_i \in \{\mathrm{YES}, \mathrm{NO}\}. \tag{1}$$

The predicted label $\hat{y}_i$ is the candidate with the highest probability:

$$\hat{y}_i = \arg\max_{y \in \{\mathrm{YES},\, \mathrm{NO}\}} p(y \mid x_i). \tag{2}$$

We denote by $\hat{p}_i$ the corresponding predicted probability:

$$\hat{p}_i = p(\hat{y}_i \mid x_i). \tag{3}$$
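The prediction step of Eqs. (1)–(3) can be sketched in plain Python. The function name and the two-logit interface are illustrative assumptions; in practice the logits would come from the model's output at the final token position.

```python
import math

def predict_label(z_yes, z_no):
    """Softmax over the two candidate-label logits (Eqs. 1-3).

    z_yes, z_no: logits for the YES and NO labels at the final token
    position. Returns the predicted label and its probability p_hat.
    """
    # Eq. (1): two-way softmax over the candidate labels
    p_yes = math.exp(z_yes) / (math.exp(z_yes) + math.exp(z_no))
    # Eq. (2): argmax over {YES, NO}
    label = "YES" if p_yes >= 0.5 else "NO"
    # Eq. (3): probability of the predicted label
    p_hat = max(p_yes, 1.0 - p_yes)
    return label, p_hat
```
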

To visualize the overall distribution of the predicted probabilities, we computed their empirical probability density using normalized histograms. For each model and prompt template, the predicted probabilities $\hat{p}_i$ were grouped into bins of width $\Delta p$, and the density at each bin center $p_j$ was defined as

$$f(p_j) = \frac{n_j}{N\,\Delta p}, \tag{4}$$

where $n_j$ denotes the number of predictions within bin $j$, $N$ is the total number of predictions, and $\Delta p$ is the bin width. The densities were normalized so that $\sum_j f(p_j)\,\Delta p = 1$.
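A minimal sketch of the density computation in Eq. (4), assuming equal-width bins over [0, 1]; the function name and interface are illustrative.

```python
def empirical_density(p_hat, n_bins=20):
    """Normalized histogram of predicted probabilities (Eq. 4) over
    equal-width bins on [0, 1]."""
    delta_p = 1.0 / n_bins
    counts = [0] * n_bins
    for p in p_hat:
        # clamp p = 1.0 into the last bin
        j = min(int(p / delta_p), n_bins - 1)
        counts[j] += 1
    n = len(p_hat)
    centers = [(j + 0.5) * delta_p for j in range(n_bins)]
    # Eq. (4): f(p_j) = n_j / (N * delta_p)
    density = [c / (n * delta_p) for c in counts]
    return centers, density, delta_p
```

By construction, the densities integrate to one: sum_j f(p_j) * delta_p = 1.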

Failure modes

We defined two failure modes to further analyze the model’s performance:

Over-reliance on parametric knowledge (OR) occurs when the model fails to update an incorrect answer even after being provided with correct context (i.e., comparing Prompt NC and Prompt CC). OR measures how much the model depends on its parametric knowledge and how resistant it is to incorporating correct contextual knowledge. To compute it, we denote by $I_{NC}$ the set of instances with incorrect model answers for Prompt NC, by $I_{CC}$ the set of instances with incorrect model answers for Prompt CC, and by $N_S$ the number of instances in a set $S$. The OR rate is then defined as:

$$P_{\mathrm{OR}} = P(I_{CC} \mid I_{NC}) = \frac{P(I_{CC} \cap I_{NC})}{P(I_{NC})} = \frac{N_{I_{CC} \cap I_{NC}}}{N_{I_{NC}}}. \tag{5}$$

Vulnerability to misleading context (VM) occurs when the model initially provides a correct answer using parametric knowledge but switches to an incorrect answer after being provided with incorrect context (i.e., comparing Prompt NC and Prompt IC). VM measures how easily the model is misled by incorrect contextual knowledge. To compute it, we denote by $C_{NC}$ the set of instances with correct model answers for Prompt NC and by $I_{IC}$ the set of instances with incorrect model answers for Prompt IC. The VM rate is then defined as:

$$P_{\mathrm{VM}} = P(I_{IC} \mid C_{NC}) = \frac{P(I_{IC} \cap C_{NC})}{P(C_{NC})} = \frac{N_{I_{IC} \cap C_{NC}}}{N_{C_{NC}}}. \tag{6}$$
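Both failure rates in Eqs. (5)–(6) can be computed from per-instance correctness flags. The boolean-list interface below is an illustrative assumption.

```python
def failure_rates(correct_nc, correct_cc, correct_ic):
    """OR and VM rates (Eqs. 5-6) from per-instance correctness flags.

    Each argument is a list of booleans: whether the model answered
    instance i correctly under Prompt NC, CC, and IC, respectively.
    """
    inc = [i for i, ok in enumerate(correct_nc) if not ok]  # I_NC
    cnc = [i for i, ok in enumerate(correct_nc) if ok]      # C_NC
    # Eq. (5): fraction of NC-incorrect instances still incorrect with CC
    or_rate = sum(not correct_cc[i] for i in inc) / len(inc) if inc else 0.0
    # Eq. (6): fraction of NC-correct instances flipped to incorrect by IC
    vm_rate = sum(not correct_ic[i] for i in cnc) / len(cnc) if cnc else 0.0
    return or_rate, vm_rate
```
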

Evaluation metrics

We report model performance using two standard metrics: accuracy and macro F1. Accuracy measures the proportion of questions for which the model predicts the correct answer, and is defined as:

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i), \tag{7}$$

where N is the total number of instances, yi is the ground-truth label, and y^i is the predicted label. Precision, Recall and F1-score are given by:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \tag{8}$$

where TP, FP, and FN denote true positives, false positives, and false negatives for each class. Macro-F1 is then obtained by averaging F1-score across the two classes:

$$\text{Macro-}F1 = \frac{1}{2}\left(F1_{\mathrm{YES}} + F1_{\mathrm{NO}}\right). \tag{9}$$

Statistical analysis

Statistical significance was assessed using McNemar’s test, which is appropriate for paired binary predictions. For each comparison, we constructed a 2 × 2 contingency table from instance-level correctness and reported the chi-square statistic and two-sided p values with continuity correction.
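A minimal sketch of the test statistic, assuming the discordant counts b and c have already been tallied from the 2 × 2 table; for one degree of freedom, the chi-square survival function reduces to erfc(√(x/2)), so the p value can be computed with the standard library.

```python
import math

def mcnemar(b, c):
    """McNemar's chi-square test with continuity correction for paired
    binary predictions.

    b, c: discordant cell counts of the 2x2 table (instances correct
    under one prompt but not the other). Assumes b + c > 0.
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # chi-square survival function for 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```
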

Experiments and hardware

During the evaluation, we use pre-trained language models and tokenizers from HuggingFace’s transformers library. Models are loaded with FP16 precision and Flash Attention 2. All experiments are run on a single NVIDIA A100 40GB GPU.

Supplementary information

Supplementary information (111.3KB, pdf)

Acknowledgements

We thank Dr. Irene Li for her valuable feedback during the early development of this manuscript.

Author contributions

B.Z., A.B., and D.T. contributed to the concept, B.Z. conducted the experiments, analyzed the results and drafted the manuscript. B.Z., A.B., R.Y., N.L., and D.T. contributed to the review and revision of the paper.

Data availability

All code and data used in this study are openly available at https://github.com/tinaboya/HealthContradict. The repository contains all scripts and data necessary to interpret, replicate, and build upon the findings reported in this study.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41746-025-02336-0.

References

  • 1.Huang, L. et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst.43, 42:1–42:55 (2025). [Google Scholar]
  • 2.Pogacar, F. A., Ghenai, A., Smucker, M. D. & Clarke, C. L. The positive and negative influence of search results on people’s decisions about the efficacy of medical treatments. In Proc. ICTIR ’17 209–216 (2017).
  • 3.Boutron, I. & Ravaud, P. Misrepresentation and distortion of research in biomedical literature. Proc. Natl. Acad. Sci.115, 2613–2619 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Wang, Y., McKee, M., Torbica, A. & Stuckler, D. Systematic literature review on the spread of health-related misinformation on social media. Soc. Sci. Med.240, 112552 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Farquhar, S., Kossen, J., Kuhn, L. & Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature630, 625–630 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Li, J. et al. The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics. Vol. 1, 10879–10899 (Association for Computational Linguistics, 2024).
  • 7.Thorne, J., Vlachos, A., Christodoulopoulos, C. & Mittal, A. FEVER: a large-scale dataset for fact extraction and VERification. In Proc. NAACL-HLT, 809–819 (ACM, 2018).
  • 8.Wadden, D. et al. Fact or fiction: verifying scientific claims. In Proc. EMNLP 2020 7534–7550 (2020).
  • 9.Stammbach, D., Zhang, B. & Ash, E. The choice of textual knowledge base in automated claim checking. ACM J. Data Inf. Qual.15, 1–22, 10.1145/3561389 (2023). [Google Scholar]
  • 10.Vladika, J., Schneider, P. & Matthes, F. HealthFC: Verifying health claims with evidence-based medical fact-checking. In Proc. LREC-COLING 2024 8095–8107 (2024).
  • 11.Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst.33, 9459–9474 (2020). [Google Scholar]
  • 12.He, J. et al. Retrieval-augmented generation in biomedicine: a survey of technologies, datasets, and clinical applications. Available at https://arxiv.org/abs/2505.01146 (2025).
  • 13.Zhang, B., Naderi, N., Mishra, R. & Teodoro, D. Online health search via multidimensional information quality assessment based on deep language models: algorithm development and validation. JMIR AI3, e42630 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Jeong, M., Sohn, J., Sung, M. & Kang, J. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics40, i119–i129 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Li, M., Kilicoglu, H., Xu, H. & Zhang, R. Biomedrag: a retrieval augmented large language model for biomedicine. J. Biomed. Inform.162, 104769 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Yang, R. et al. Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Syst.2, 2 (2025). [Google Scholar]
  • 17.Jin, D. et al. What disease does this patient have? A large-scale open-domain question answering dataset from medical exams. Appl. Sci.11, 6421 (2021).
  • 18.Pal, A., Umapathi, L. K. & Sankarasubbu, M. Medmcqa: a large-scale multi-subject multi-choice dataset for medical domain question answering. Proc. Mach. Learn. Res.174, 248–260 (2022).
  • 19.Taboureau, O. et al. Chemprot: a disease chemical biology database. Nucleic Acids Res.39, D367–D372 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Segura-Bedmar, I., Martínez, P. & Herrero-Zazo, M. Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013). In Proc. SemEval 2013 341–350 (2013).
  • 21.Gurulingappa, H. et al. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. J. Biomed. Inform.45, 885–892 (2012). [DOI] [PubMed] [Google Scholar]
  • 22.Xu, R. et al. Knowledge conflicts for LLMs: A survey. In Proc. EMNLP 2024 8541–8565 (2024).
  • 23.Singhal, K. et al. Large language models encode clinical knowledge. Nature620, 172–180 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv.55, 1–35 (2023).
  • 25.Zaghir, J. et al. Prompt engineering paradigms for medical applications: scoping review. J. Med. Internet Res.26, e60501 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Fan, W. et al. A survey on rag meeting llms: towards retrieval-augmented large language models. In Proc. KDD ’24 6491–6501 (2024).
  • 27.Shi, W. et al. REPLUG: retrieval-augmented black-box language models. In Proc. NAACL 2024 8371–8384 (2024).
  • 28.Xie, J., Zhang, K., Chen, J., Lou, R. & Su, Y. REPLUG: retrieval-augmented black-box language models. In Proc. NAACL 2024 8371–8384 (2024).
  • 29.Xu, R. et al. The earth is flat because...: Investigating LLMs’ belief towards misinformation via persuasive conversation. In Proc. ACL 2024 16259–16303 (2024).
  • 30.Jin, Z. et al. Tug-of-war between knowledge: exploring and resolving knowledge conflicts in retrieval-augmented language models. In Proc. LREC-COLING 2024 16867–16878 (2024).
  • 31.Chen, H.-T., Zhang, M. & Choi, E. Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence. In Proc. EMNLP 2022 2292–2307 (2022).
  • 32.Hou, Y. et al. Wikicontradict: a benchmark for evaluating LLMs on real-world knowledge conflicts from wikipedia. In NeurIPS 2024 Datasets and Benchmarks Track (2024).
  • 33.Li, J., Raheja, V. & Kumar, D. ContraDoc: Understanding self-contradictions in documents with large language models. In Proc. NAACL 2024 6509–6523 (2024).
  • 34.Shi, W. et al. Trusting your evidence: Hallucinate less with context-aware decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), 783–791 (2024).
  • 35.Cohen-Wang, B., Shah, H., Georgiev, K. & Madry, A. Contextcite: Attributing model generation to context. In ICML 2024 Workshop on Foundation Models in the Wild (2024).
  • 36.Zhang, Y. et al. Merging generated and retrieved knowledge for open-domain QA. In Proc. EMNLP 2023 4710–4728 (2023).
  • 37.Neeman, E. et al. DisentQA: disentangling parametric and contextual knowledge with counterfactual question answering. In Proc. ACL 2023 10056–10070 (2023).
  • 38.Li, D. et al. Large language models with controllable working memory. Findings of ACL 2023 1774–1793 (2023).
  • 39.Pan, Y. et al. On the risk of misinformation pollution with large language models. Findings of EMNLP 2023 1389–1403 (2023).
  • 40.Weller, O., Khan, A., Weir, N., Lawrie, D. & Van Durme, B. Defending against disinformation attacks in open-domain question answering. In Proc. EACL 2024 402–417 (2024).
  • 41.Zhang, M. & Choi, E. Mitigating temporal misalignment by discarding outdated facts. In Proc. EMNLP 2023 14213–14226 (2023).
  • 42.Longpre, S. et al. Entity-based knowledge conflicts in question answering. In Proc. EMNLP 2021 7052–7063 (2021).
  • 43.Wang, Y. et al. Resolving knowledge conflicts in large language models. In Proc. CoLM 2024 (2024).
  • 44.Wu, K., Wu, E. & Zou, J. Y. Clasheval: quantifying the tug-of-war between an llm’s internal prior and external evidence. Adv. Neural Inf. Process. Syst.37, 33402–33422 (2024). [Google Scholar]
  • 45.Hsu, C., Li, C.-T., Saez-Trumper, D. & Hsu, Y.-Z. WikiContradiction: detecting self-contradiction articles on wikipedia. In Proc. IEEE Big Data 2021 427–436 (2021).
  • 46.Ondov, B., Attal, K. & Demner-Fushman, D. A survey of automated methods for biomedical text simplification. J. Am. Med. Inform. Assoc.29, 1976–1988 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Alamri, A. & Stevenson, M. A corpus of potentially contradictory research claims from cardiovascular research abstracts. J. Biomed. Semant.7, 1–9 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Sosa, D., Suresh, M., Potts, C. & Altman, R. Detecting contradictory COVID-19 drug efficacy claims from biomedical literature. In Proc. ACL 2023 694–713 (2023).
  • 49.Wang, L. L. et al. CORD-19: The COVID-19 open research dataset. In Proc. NLP for COVID-19 at ACL 2020 (2020).
  • 50.Romanov, A. & Shivade, C. Lessons from natural language inference in the clinical domain. In Proc. EMNLP 2018 1586–1596 (2018).
  • 51.Makhervaks, D., Gillis, P. & Radinsky, K. Clinical contradiction detection. In Proc. EMNLP 2023 1248–1263 (2023).
  • 52.Stearns, M. Q., Price, C., Spackman, K. A. & Wang, A. Y. Snomed clinical terms: overview of the development process and project status. Proc. AMIA Symp. 662 (2001). [PMC free article] [PubMed]
  • 53.Hou, Y. et al. Wikicontradict: a benchmark for evaluating LLMs on real-world knowledge conflicts from wikipedia. In NeurIPS 2024 (2024).
  • 54.Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. Pubmedqa: a dataset for biomedical research question answering. In Proc. EMNLP-IJCNLP 2019 2567–2577 (2019).
  • 55.Pubmed. https://pubmed.ncbi.nlm.nih.gov/ (2025).
  • 56.Gao, L. et al. The language model evaluation harness https://zenodo.org/records/12608602 (2024).
  • 57.Chen, Y. et al. Reasoning models don’t always say what they think https://arxiv.org/abs/2505.05410 (2025).
  • 58.Trec health misinformation track. https://trec-health-misinfo.github.io/ (2025).
  • 59.Abualsaud, M., Lioma, C., Maistro, M., Smucker, M. D. & Zuccon, G. Overview of the TREC 2019 decision track https://api.semanticscholar.org/CorpusID:221857114 (2019).
  • 60.Clarke, C. L. A., Maistro, M. & Smucker, M. D. Overview of the TREC 2021 health misinformation track. In Proc. TREC 2021 (2021).
  • 61.Clarke, C. L. A., Maistro, M., Seifikar, M. & Smucker, M. D. Overview of the TREC 2022 health misinformation track. In Proc. TREC 2022 (2022).
  • 62.Clueweb12. https://lemurproject.org/clueweb12/ (2025).
  • 63.allenai/c4. https://huggingface.co/datasets/allenai/c4 (2025).
  • 64.Meta. Llama-3.2-1b-instruct. https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct (2024).
  • 65.ContactDoctor. Bio-medical-1b-cot. https://huggingface.co/ContactDoctor/Bio-Medical-Llama-3-2-1B-CoT-012025 (2025).
  • 66.Alibaba Group. Qwen-2.5-7b. https://huggingface.co/Qwen/Qwen2.5-7B (2025).
  • 67.Open Meditron team. Meditron3-qwen-2.5-7b. https://huggingface.co/OpenMeditron/Meditron3-Qwen2.5-7B (2025).
  • 68.Meta. Llama-3.1-8b-instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct (2024).
  • 69.Open Meditron team. Meditron3-8b. https://huggingface.co/OpenMeditron/Meditron3-8B (2025).
  • 70.Gpt-4.1-mini. https://platform.openai.com/docs/models/gpt-4-1 (2025).
  • 71.Gpt-4o. https://platform.openai.com/docs/models/gpt-4o (2025).
