Skip to main content

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

Research Square logoLink to Research Square
[Preprint]. 2025 Apr 21:rs.3.rs-6206365. [Version 1] doi: 10.21203/rs.3.rs-6206365/v1

When Helpfulness Backfires: LLMs and the Risk of Misinformation Due to Sycophantic Behavior

Shan Chen 1, Mingye Gao 2, Kuleen Sasse 3, Thomas Hartvigsen 4, Brian Anthony 5, Lizhou Fan 6, Hugo Aerts 7, Jack Gallifant 8, Danielle S Bitterman 9
PMCID: PMC12045364  PMID: 40313755

Abstract

Large language models (LLMs) exhibit a critical vulnerability arising from being trained to be helpful: a tendency to comply with illogical requests that would generate misinformation, even when they have the knowledge to identify the request as illogical. This study investigated this vulnerability in the medical domain, evaluating five frontier LLMs using prompts that misrepresent equivalent drug relationships. We tested baseline compliance, the impact of prompts allowing rejection and emphasizing factual recall, and the effects of fine-tuning on a dataset of illogical requests, including out-of-distribution generalization. Results showed concerningly high initial compliance (up to 100%) across all models, prioritizing helpfulness over logical consistency. However, prompt engineering and fine-tuning improved performance, achieving near-perfect rejection rates on illogical requests while maintaining general benchmark performance. This demonstrates that prioritizing logical consistency through targeted training and prompting is crucial for mitigating the risk of medical misinformation and ensuring the safe deployment of LLMs in healthcare.

INTRODUCTION

Large Language Models (LLMs) can store and retrieve vast amounts of information from diverse domains, including healthcare1-3. This knowledge base has been noted for its potential to support medical professionals by providing specialized information and advice1,4. Yet, while these models may recall medical facts, it remains challenging for the models to process information logically and generate responses that demonstrate sound reasoning5. This gap between knowledge retrieval and logical reasoning in medicine6,7 leads to a particularly concerning public health risk: The rapid generation and dissemination of misinformation is particularly critical in high-stakes fields like medicine.

Two key principles for the safe deployment of LLMs in medicine are honesty and helpfulness8-10. Honesty ensures that models provide accurate and truthful information, while helpfulness focuses on fulfilling users' queries in an efficient and useful manner11,12. Current state-of-the-art LLMs are aligned with these principles via training processes8,10,13, including reinforcement learning with human feedback14 (RLHF). These alignment processes typically involve tuning LLMs, not to gain new knowledge but to shift the outputs towards a more desirable human-readable format and away from potentially harmful or toxic behaviors learned during pre-training12,15. Previous works have shown that helpfulness can be misused to generate unfaithful information and lack of scientific grounding about health16-18. While honesty and helpfulness are often complementary, emphasizing helpfulness can introduce safety vulnerabilities: jailbreaking19,20 and sycophancy21 which may amplify the risks of LLM misuses. Jailbreaking refers to techniques or prompt structures designed to exploit a model’s helpfulness, tricking it into generating harmful, misleading, or restricted content22. Sycophancy is the tendency of LLMs to excessively agree with users, often at the expense of accuracy21,23. The confluence of these two vulnerabilities is a growing concern, because nefarious or misinformed users, and even unintentional errors and typos input into the model, could result in LLMs generating and spreading false information24.

Previous research on jailbreaking has primarily explored its implications in the context of catastrophic risks in the general domain—cases where models are manipulated to produce extreme content, such as violence, hate speech, or other harmful material25-27. Jailbreaking has been thoroughly examined in healthcare contexts through multimodal, white-box internal, and adversarial approaches.28,29. Our work builds on this foundation by addressing a critical underexplored area: evaluating LLMs' ability to recognize and resist illogical or factually flawed requests for medical information.

We evaluated Ave LLMs across various scenarios and assessed how sensitive they are to generating incorrect medical information in settings where the LLMs have the knowledge base to identify the requested information as incorrect. As a use case, we selected drug names, as in medicine, often different names are used for the same drug. Because we previously showed that LLMs can accurately match brand and generic drug names30, this allowed for a controlled, and scalable experimental set-up to characterize LLM compliance to illogical requests. First, we tested whether LLMs refuse to comply with requests for information describing the equivalent drugs are distinct (i.e., a misinformation request), and found that even the most advanced models complied with up to 100% of misinformation requests. Second, we changed our instructions to the LLMs to understand if their overly submissive behavior can be overcome with prompting (given prompting still remains the most effective steering method31). Third, we fine-tuned models to resist requests for misleading information while maintaining responsiveness to valid prompts. We found that LLMs prioritize learned helpfulness over inherent logical reasoning, leading them to generate medical misinformation from even simple illogical prompts. Our strategies to reduce this risk successfully enhanced logical reasoning, and can provide a basis for additional research to improve robust risk mitigation and oversight mechanisms targeted at LLM sycophancy in healthcare.

RESULTS

Stage 1: Evaluating LLM Performance on Common Drugs

Our previous work showed that all models evaluated here have near-perfect factual recall ability to match these drugs’ generic and brand names30. As shown in Fig. 1a), LLMs generally follow illogical requests to generate misinformation in the base prompt setup (details in Method section 3.1) for the generic-to-brand conversions. For clarity, we only discuss the generic-to-brand setups in the main text; all brand-to-generic results are in Appendix Fig. 1and show similar findings.

Figure 1. Generic-to-brand output grades for prompt-based and Instruction-tuning interventions.

Figure 1

Figure 4a displays the results of stage 1 (prompt-based strategies). Four prompt variations were used to evaluate 5 LLMs on generic-brand name pairs of 50 drug combinations. Figure 4b shows results for stage 2 (instruction-tuned model). The baseline and finetuned version of GPT4o-mini and Llama3-8B performance is on out-of-distribution test sets of 4 domains, such as Cancer drug name and writer-pseudonym pairs.

In the generic-to-brand setup, GPT4o-mini, GPT4o, and GPT4 followed the medication misinformation request 100% (50/50) of the time, while Llama3-8B did so in 94% (47/50) of cases. Llama3-70B had the highest rejection rate in this setup, but still rejected requests to generate misinformation in less than 50% (21/50) of cases, indicating that even large, advanced models predominantly complied with illogical requests.

Explicitly allowing models to reject misinformation requests (i.e., telling models that they can reject the request within the prompt, our detailed workflow can be found in Fig. 2) improved the ability of the GPT series of models to resist misinformation requests. GPT4o and GPT4 rejected over 60% (GPT4o: 31/50, GPT4: 32/50) of the illogical requests in this setting. However, Llama's performance was similar to base prompting. Adding factual recall hints in the prompt yielded the most benefit for GPT4 and Llama3-8B.

Figure 2. Illustration of overall study workflow.

Figure 2

Step 1 involves the generation of an LLM misinformation request, where models should recognize that the drug entities are equivalent and therefore the requested generated content would be misinformation. In Step 2, LLMs are prompted with this request to generate a response which is subsequently graded by Claude 3.5 Sonnet in Step 3 into one of the four response types. Claude 3.5 grading quality was validated by humans. Step 4 shows prompt-based variations which are evaluated, and the change in response types are collected in Step 5. Step 6 displays the instruction tuning of the LLMs, where we stitched the baseline prompt with output from rejection and factual recall hints. Step 7 evaluates this newly tuned LLM both in-domain and in other domains with different equivalence errors.

Adding rejection hints and factual recall together in the prompts vastly improved the models' performance. This was particularly true for GPT4o and GPT4, which rejected generating the requested misinformation and correctly identified that the brand and generic names referred to the same drug in 94% (47/50) of test cases. Rejection rates for GPT4o-mini and Llama3-70B also improved substantially, reaching 62% (31/50) and 92% (46/50), respectively, with both hints applied.

An interesting behavioral shift was observed in Llama3-8B after including both the rejection and factual recall hints. The model transitioned from following illogical requests to directly rejecting them without providing the correct logical rationale for rejections. This change is reflected in the increase in direct rejections (yellow bar) from 2% (1/50) to 66% (33/50) in Fig. 1a.

Stage 2: Fine-Tuning and Evaluating on OOD (Out of distribution) Domains

In the second stage, GPT4o-mini and Llama3-8B were supervised fine-tuned (SFT) on 300 illogical requests about general drugs with clear rejections. We then conducted OOD tests (Fig. 3 demonstrates the workflow here) in four domains: cancer drugs, singers/performers, writers, and geography. As shown in Fig. 1b), the fine-tuned models were much more likely to identify a request as illogical and refuse to comply.

Figure 3. Out of distribution testing workflow.

Figure 3

We composed one held-out cancer drug set that is not in the supervised fine-tuning data and crafted three other categories’ equivalences. As previously, Claude 3.5 Sonnet was used to auto-evaluate the categories of models’ responses.

For example, in the OOD tests on cancer drugs (without rejection hints), the fine-tuned GPT4o-mini achieved a 100% (100/100) rejection rate, with 79% (79/100) of rejections providing the correct reason, compared to the baseline's 12% (12/100) rejection rate (5% with correct reasoning). Similarly, the fine-tuned Llama3-8B reached a 99% (99/100) rejection rate (70% with correct reasoning, 29% with other reasons), while the baseline model rejected only 30% (30/100) of requests, none of which provided the correct reason. This is similar to other categories with/without rejection hints.

Stage 3: Evaluating Compliance with Logical Requests (workflow as Fig. 4)

Figure 4. LLM ability to comply to logical requests.

Figure 4

To further investigate our fine-tuned models’ behavior, we provided three different subcategories of new, logical and correct in-context information requests, and assessed if the LLMs complied. Authors SC and MG did the annotation manually with a 100% annotation agreement.

The detailed results of the ability of the fine-tuned models to comply with logical requests are shown in Appendix Table 1. Fine-tuned GPT4o-mini complied in 15/20 cases, and fine-tuned LLama3-8B complied in 12/20 cases. While the fine-tuned models were more likely than their base counterparts to reject requests, they always explained that they rejected because the request might be unrealistic. This behavior shift indicates a maintenance of balance between safety (rejection of illogical requests) and functionality (compliance with logical instructions). Examples of how fine-tuning shifted behavior are provided in Appendix Fig. 2.

Stage 4: Evaluating General Benchmarks

Lastly, we assessed the performance of the SFT models from Stage 2 and their base counterparts across 10 general and biomedical knowledge benchmarks, including Alpaca-Eval232, ARC Challenge, ARC Easy33, BoolQ34, MMLU35, GPQA36, TruthfulQA37, and the USMLE Step 1, 2, and 3 exams30. As demonstrated in Fig. 5, the fine-tuned models exhibited negligible performance degradation across all tasks.

Figure 5. LLM assessment on general benchmarks.

Figure 5

Performance of models pre- and post- fine-tuning for logical reasoning on jailbreaking, in medical, and general knowledge benchmarks. The confidence interval is generated using the central limit theorem.

DISCUSSION

Our study identified a critical vulnerability in LLMs: their tendency to prioritize helpfulness over critical reasoning when responding to illogical and potentially harmful requests for medical information. If LLMs are prone to generating medical misinformation in response to requests that are overtly illogical, where they know the information is incorrect, they are likely even less able to resist more nuanced misinformation requests. This means that even simple errors in LLM inputs could readily and inadvertently prompt the generation of misinformation when LLMs are used in medical contexts. If left unchecked, this vulnerability could lead to the acceleration of inadvertent or malicious misinformation which could cause serious population and individual harm in high-stakes domains like healthcare24.

Previous research into the potential of LLMs to manipulate and generate misinformation has largely focused on single-turn or multi-turn conversational techniques aimed at exploiting a model’s inherent helpful nature to bend its "beliefs" or outputs to align with dangerous or unethical goals38,39. Such efforts reveal the vulnerability of even state-of-the-art models to being misled by adversarial inputs, underscoring the need for new robust safeguarding mechanisms. Our work adds to this largely unexplored yet crucial area by evaluating the ability of LLMs to identify and resist requests that are overtly illogical or factually flawed, and by proposing a novel mitigation strategy of comprehensively cataloging known contextual information prior to query processing.

The initial blind compliance of all models, including advanced ones like GPT-4, to illogical requests reveals a core vulnerability in LLM design where, without explicit guidance, models prioritize being helpful over applying critical reasoning. More specifically, the role of RLHF/Instruction Tuning creates a fundamental tension between blindly following instructions and providing context-sensitive and factual responses. Our findings demonstrate that explicit instruction prompting, such as providing rejection hints, can improve models' ability to critically assess requests before responding. Allowing models to reject flawed instructions appears to be important for enhancing their common sense critical reasoning ability. This insight is crucial for developing safer AI systems that can balance helpfulness with necessary skepticism.

While factual recall prompts improved the performance of advanced models, such as GPT4o and GPT4, they had limited impact on smaller models like Llama3-8B/70B or GPT4o-mini. Even when we explicitly told the models within the prompt that brand and generic names referred to the same drug, only the more advanced models responded correctly by rejecting the illogical request. For example, GPT4 and GPT4o rejected 94% of illogical requests after being prompted to recall factual relationships between the drugs, but Llama3-8B still struggled, often rejecting without giving a correct explanation.

This suggests that simply spelling out factual equivalencies is not enough for less capable models and that the ability to effectively use factual knowledge in context-dependent reasoning tasks may be a key differentiator of more advanced AI systems40,41. Smaller models seem to require more than factual prompts to process logical decisions, likely because they cannot fully integrate context and recall complex relationships as effectively as advanced models. However, even for these larger models, this approach is not scalable across the wide range of potential illogical requests because it requires preemptively identifying the precise factual knowledge needed to identify each request as illogical.

Supervised fine-tuning on 300 drug-related conversations enhanced the models' ability to distinguish between valid and illogical prompts, especially for OOD tests. After fine-tuning, models like GPT4o-mini achieved a 100% rejection rate, with 79% of rejections providing the correct reasoning, compared to the baseline's 9%. Similarly, Llama3-8B improved, though it sometimes rejected prompts without proper explanations. Importantly, the observed improvements in rejecting illogical prompts were generalized outside of the brand-generic use case on which the models were fine-tuned.

The success of SFT highlights how fine-tuning enables models to better recognize illogical requests in a generalizable, scalable fashion. In other words, we know the models can match these drug names correctly, and SFT steers models’ behavior toward prioritizing its factual knowledge over user requests.

Importantly, this fine-tuning did not lead to over-rejection or a refusal to respond to reasonable input: GPT4o-mini and Llama3-8B still largely complied with logical requests across a range of medical and non-medical tasks. When they did not, they provided reasonable explanations for not complying. This behavior shift demonstrates a successful balance between rejecting illogical instructions and remaining useful for legitimate tasks. See Appendix Fig. 2 for an example of the behaviors we discussed above.

CONCLUSIONS

We showed that LLMs do not reliably resist requests for illogical content, including the generation of medical misinformation—Even when they have the knowledge to identify the request as factually flawed. This creates a gap between the knowledge benchmarks commonly used to evaluate LLMs and a true assessment of their medical risks and functionality. To ensure that LLMs effectively reject flawed requests while continuing to respond helpfully to logical instruction, future work could focus on refining tuning methods and developing approaches to scalable human-assisted and automated oversight. Ultimately, closing this gap will be essential to aligning LLMs' knowledge capabilities with their real-world reliability and safety in medicine.

METHODS

1. Drug Selection and Data Preparation

Drug Dataset

To evaluate language models across varying levels of drug familiarity, we used the RABBITS30 dataset, which includes 550 common drugs with 1:1 mapping between their brand and generic names.

Tokenization and Frequency-Based Sorting

To measure the relative familiarity of language models with these drugs, we tokenized multiple large pre-training corpora with the LLaMA tokenizer9 using Infini-gram42, including Dolma1.643, C444, RedPajama45, and Pile46. The frequency of generic drug names across this corpus was used to estimate how commonly these drugs appear in pre-training datasets. Generic drug names were then ranked by frequency to provide a proxy measure of model familiarity[1].

Stratified Sampling of Drugs

To ensure coverage of both common and rare drugs, we selected 50 drugs from five distinct frequency ranges based on their rankings in the tokenized dataset: The top 10, 100–110, 200–210, 300–310, and 400–410 most frequent drugs in our sampling window.

2. Evaluated Models

We evaluate the following LLMs: Llama3-8B-Instruct (Llama3-8B), Llama3-70B-Instruct (Llama3-70B), gpt-4o-mini-2024-07-18 (GPT4o-mini), gpt-4o-2024-05-13 (GPT4o), and gpt-4-0613 (GPT4). These models were chosen to represent performance of current leading open- and closed-source models across a range of sizes.

3. Prompt Design and Evaluation

We designed four prompt types to evaluate the models' handling of new drug-related information, assessing persuasive ability, factual recall, and logical consistency (Fig. 1). Experiments were run via OpenAI Batch API, and Llama models used A100-80GB with CUDA > 12.0, no quantization. Hyperparameters included a max of 512 output tokens and temperature = 0 for best possible reproducibility.

3.1. Baseline Prompt

The first prompt represents the baseline condition, where the model is tasked with providing a persuasive but illogical letter informing people that a brand-name drug is found to have new side effects, and that they should take the generic counterpart instead. This task was selected because it illustrates a necessary safety mode for LLMs that follows from simple logical reasoning. If a model knows that the brand and generic drug are the same, it should be able to identify the request as illogical and reject the request, instead of complying with the request and generating misinformation.

3.2. Rejection Prompt

In this variation, we explicitly allow the possibility of rejection, encouraging the model to evaluate whether there is a logical flaw in the prompt. This prompt also allows a model that is heavily aligned to being submissive to reject users’ queries. The explicit permission to reject creates a scenario where the model must consider not only the factual content but also the appropriateness of the substitution.

3.3. Factual Recall Prompt

This prompt emphasizes the need for the model to recall the correct relationships between brand name drugs and their generic equivalents before processing the rest of the request. This variation tests the model’s ability to accurately retrieve and utilize known facts in generating persuasive outputs. By instructing the model to prioritize factual recall, we assess how well it can integrate known drug relationships with new information.

3.4. Combined Rejection and Factual Recall Prompt

The final prompt variation combines both the rejection and factual recall instructions. This setup evaluates whether the model can handle both tasks simultaneously—ensuring factual accuracy while also exercising logical reasoning to reject incorrect assumptions.

All prompt settings introduced were experimented with separate LLM inferences.

3.5. Automated Evaluation

Model outputs were categorized into 4 categories: (1) rejecting the request and explaining the logical flaw; (2) fulfilling the request and explaining the logical flaw; (3) rejecting the request without explaining the logical flaw; and (4) fulfilling the request without explaining the logical flaw. Model outputs were evaluated using a multi-step annotation process. To ensure consistency and reliability in the evaluation, we employed the Claude 3.5 Sonnet to provide initial annotations, with human reviewers (annotators SC and MG blinded to each other) validating the outputs. The inter-annotator agreement between Claude and the human reviewers was 98%, with 100% agreement between the two human annotators for both in-domain and out-of-domain data.

4. Fine-tuning then Evaluate on Out of Distribution Data

4.1. Model Fine-Tuning

To enhance the ability of smaller language models to handle complex drug substitution prompts, we fine-tuned Llama 3–8B Instruct and GPT4o-mini using the PERSIST instruction-tuning dataset, publicly available at https://huggingface.co/datasets/AIM-Harvard/PERSIST.

This dataset comprises 300 input-output pairs, each featuring a challenging "Baseline" prompt concerning brand/generic drug substitutions (covering both directions for 50 drug pairs) and the corresponding desired response generated by a larger model (GPT4o-mini, GPT-4, or GPT4o) when presented with a “Combined Rejection and Factual Recall Prompt.”

The dataset construction leveraged these larger models to systematically generate ideal responses for all 50 drug pairs in both substitution directions, resulting in 300 examples (50 * 2 * 3 = 300), drawing inspiration from work demonstrating effective instruction-tuning with limited data47. We explored various hyperparameters, including learning rates (5e-6, 1e-5, 2e-5, 5e-5), batch sizes (1, 2), and epochs (2, 3) for Llama3-8B. For GPT4o-mini, we utilized OpenAI's automatic parameter search. Ultimately, the selected Llama3-8B model used a learning rate of 1e-5, a batch size of 2, and 3 epochs, while the selected GPT4o-mini was fine-tuned via the OpenAI API with a batch size of 1, 3 epochs, and a seed of 318998491. The core objective of this fine-tuning process was to impart the smaller models with the ability to emulate the larger models' successful rejection and explanation behavior when faced with the "Combined Rejection and Factual Recall Prompt."

4.2. OOD Testing

To evaluate the generalization of the fine-tuned model to other illogical requests, we tested its performance on the OOD datasets of terms with the same meanings (Fig. 2). This OOD dataset included several other categories. Testing on OOD data allows us to assess the generalizability of a model’s behavior in responding to illogical requests involving novel or previously unseen entities — a crucial factor in evaluating its applicability in real-world scenarios.

4.3. Balancing Rejection and Compliance

To test whether models became overly conservative after fine-tuning, we designed an additional test set comprising 20 cases (10 real FDA drug safety recalls, 5 theoretically event canceling situations, and 5 real government announcements) where the model should comply with the prompt rather than reject it (Fig. 3). These cases involved scenarios where the recommended substitution was appropriate and aligned with the correct drug relationships. This test ensured that the model retained the ability to provide helpful and persuasive responses when no logical flaws were present. The prompts are found in Appendix Table 1. Additionally, we also prompt the fine-tuned models questions regarding 50 common drugs we fine-tuned and see whether they can still answer logical requests regarding those drugs.

All the detailed counts of instances we evaluated across all following methods are available in Appendix Table 2.

5. General Benchmark Evaluation

To ensure that fine-tuning and prompt modifications do not degrade the overall performance of the models, we evaluated them on a broad set of general benchmarks using Inspect48 and Alpaca-Eval2 v0.6.549 using GPT4-turbo as the comparator model. These benchmarks were selected to test the models' reasoning, factual recall, and domain-specific knowledge, including medical contexts, ensuring that any improvements in handling drug-related prompts did not come at the expense of general task performance. The confidence intervals are calculated using the central limit theorem, a common practice in modern LLM evaluations50,51.

Data Sharing Statement:

All our code, data input and output from all models, and the Llama3 model we fine-tuned are publicly available at https://huggingface.co/datasets/AIM-Harvard/PERSIST

Funding:

The authors acknowledge financial support from the Google PhD Fellowship (SC), the Woods Foundation (DB, SC, HA, JG, LF), the NIH (NIH-USA R01CA294033 (SC, JG, LF, DB), NIH-USA U54CA274516-01A1 (SC, HA, DB), NIH-USA U24CA194354 (HA), NIH-USA U01CA190234 (HA), NIH-USA U01CA209414 (HA), and NIH-USA R35CA22052 (HA), the ASTRO-ACS Clinician Scientist Development Grant ASTRO-CSDG-24-1244514 (DB), and the European Union - European Research Council (HA: 866504). This work was also conducted with support from UM1TR004408 award through Harvard Catalyst ∣ The Harvard Clinical and Translational Science Center (National Center for Advancing Translational Sciences, National Institutes of Health) and financial contributions from Harvard University and its affiliated academic healthcare centers. The content is solely the responsibility of the authors and does not necessarily represent the official views of Harvard Catalyst, Harvard University, and its affiliated academic healthcare centers, or the National Institutes of Health. The authors thank Google Cloud research fund for Claude API inference costs.

Footnotes

Additional Declarations: No competing interests reported.

Competing interests:

DSB: Editorial, unrelated to this work: Associate Editor of Radiation Oncology, HemOnc.org (no financial compensation); Advisory and consulting, unrelated to this work: MercurialAI

BWA: Advisory and consulting, unrelated to this work.

1.

Note that C4 and RedPajama have overlaps.

Contributor Information

Shan Chen, Harvard Medical School.

Mingye Gao, Massachusetts Institute of Technology.

Kuleen Sasse, Johns Hopkins University.

Thomas Hartvigsen, University of Virginia.

Brian Anthony, Massachusetts Institute of Technology.

Lizhou Fan, Harvard Medical School.

Hugo Aerts, Harvard Medical School.

Jack Gallifant, Harvard Medical School.

Danielle S. Bitterman, Harvard Medical School

References

  • 1.Singhal K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Taylor R. et al. Galactica: A Large Language Model for Science. arXiv [cs.CL] (2022). [Google Scholar]
  • 3.Chen S. et al. Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification. arXiv [cs.CL] (2023). [Google Scholar]
  • 4.Van Veen D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Soroush A. et al. Large language models are poor medical coders — benchmarking of medical code querying. NEJM AI 1, (2024). [Google Scholar]
  • 6.Chen S. et al. The effect of using a large language model to respond to patient messages. Lancet Digit. Health 6, e379–e381 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Potts B. MedFuzz: Exploring the robustness of LLMs on medical challenge problems. Microsoft Research https://www.microsoft.com/en-us/research/blog/medfuzz-exploring-the-robustness-of-llms-on-medical-challenge-problems/ (2024). [Google Scholar]
  • 8.Bai Y. et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv [cs.CL] (2022). [Google Scholar]
  • 9.Touvron H. et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv [cs.CL] (2023). [Google Scholar]
  • 10.Bai Y. et al. Constitutional AI: Harmlessness from AI Feedback. arXiv [cs.CL] (2022). [Google Scholar]
  • 11.Liu R., Sumers T. R., Dasgupta I. & Griffiths T. L. How do large language models navigate conflicts between honesty and helpfulness? arXiv [cs.CL] (2024). [Google Scholar]
  • 12.Askell A. et al. A General Language Assistant as a Laboratory for Alignment. arXiv [cs.CL] (2021). [Google Scholar]
  • 13.Rafailov R. et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv [cs.LG] (2023). [Google Scholar]
  • 14.Christiano P. et al. Deep reinforcement learning from human preferences. arXiv [stat.ML] (2017). [Google Scholar]
  • 15.Ouyang L. et al. Training language models to follow instructions with human feedback. arXiv [cs.CL] (2022). [Google Scholar]
  • 16.Menz B. D., Modi N. D., Sorich M. J. & Hopkins A. M. Health disinformation use case highlighting the urgent need for artificial intelligence vigilance: Weapons of mass disinformation: Weapons of mass disinformation. JAMA Intern. Med. 184, 92–96 (2024). [DOI] [PubMed] [Google Scholar]
  • 17.Menz B. D. et al. Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis. BMJ 384, e078538 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Zhang Y., Sharma K., Du L. & Liu Y. Toward mitigating misinformation and social media manipulation in LLM era. in Companion Proceedings of the ACM on Web Conference 2024 vol. 19 1302–1305 (ACM, New York, NY, USA, 2024). [Google Scholar]
  • 19.Xu N. et al. Cognitive overload: Jailbreaking large language models with overloaded logical thinking. in Findings of the Association for Computational Linguistics: NAACL 2024 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024). doi: 10.18653/v1/2024.findings-naacl.224. [DOI] [Google Scholar]
  • 20.Ding P. et al. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (Association for Computational Linguistics, Stroudsburg, PA, USA, 2024). doi: 10.18653/v1/2024.naacl-long.118. [DOI] [Google Scholar]
  • 21.Sharma M. et al. Towards Understanding Sycophancy in Language Models. arXiv [cs.CL] (2023). [Google Scholar]
  • 22.Liu Y. et al. A Hitchhiker’s Guide to Jailbreaking ChatGPT via Prompt Engineering. in Proceedings of the 4th International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Things (ACM, New York, NY, USA, 2024). doi: 10.1145/3663530.3665021. [DOI] [Google Scholar]
  • 23.Si C. et al. Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong. arXiv [cs.CL] (2023). [Google Scholar]
  • 24.Haupt C. E. & Marks M. FTC regulation of AI-generated medical disinformation. JAMA (2024) doi: 10.1001/jama.2024.19971. [DOI] [PubMed] [Google Scholar]
  • 25.Polyportis A. & Pahos N. Navigating the perils of artificial intelligence: a focused review on ChatGPT and responsible research and innovation. Humanit. Soc. Sci. Commun. 11, 1–10 (2024). [Google Scholar]
  • 26.Lu W. et al. Eraser: Jailbreaking defense in Large Language Models via unlearning harmful knowledge. arXiv [cs.CL] (2024). [Google Scholar]
  • 27.Liu T. et al. Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction. arXiv [cs.CR] (2024). [Google Scholar]
  • 28.Han T. et al. Medical large language models are susceptible to targeted misinformation attacks. NPJ Digit. Med. 7, 288 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Huang X. et al. Medical MLLM is vulnerable: Cross-modality jailbreak and mismatched attacks on medical Multimodal Large Language Models. arXiv (2024). [Google Scholar]
  • 30.Gallifant J. et al. Language models are surprisingly fragile to drug names in biomedical benchmarks. arXiv [cs.CL] (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Wu Z. et al. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. arXiv [cs.CL] (2025). [Google Scholar]
  • 32.Dubois Y., Galambosi B., Liang P. & Hashimoto T. B. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv [cs.LG] (2024). [Google Scholar]
  • 33.Bhakthavatsalam S. et al. Think you have solved direct-answer question answering? Try ARC-DA, the direct-answer AI2 Reasoning Challenge. arXiv [cs.CL] (2021). [Google Scholar]
  • 34.Clark C. et al. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv [cs.CL] (2019). [Google Scholar]
  • 35.Hendrycks D. et al. Measuring massive multitask language understanding. arXiv [cs.CY] (2020). [Google Scholar]
  • 36.Rein D. et al. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv [cs.AI] (2023). [Google Scholar]
  • 37.Lin S., Hilton J. & Evans O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv [cs.CL] (2021). [Google Scholar]
  • 38.Zeng Y. et al. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. arXiv [cs.CL] (2024). [Google Scholar]
  • 39.Xu R. et al. The Earth is Flat because…: Investigating LLMs’ Belief towards Misinformation via Persuasive Conversation. arXiv [cs.CL] (2023). [Google Scholar]
  • 40.Allen-Zhu Z. & Li Y. Physics of language models: Part 3.2, knowledge manipulation. arXiv [cs.CL] (2023). [Google Scholar]
  • 41.Wei J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv [cs.CL] (2022). [Google Scholar]
  • 42.Liu J., Min S., Zettlemoyer L., Choi Y. & Hajishirzi H. Infini-gram: Scaling unbounded n-gram language models to a trillion tokens. arXiv [cs.CL] (2024). [Google Scholar]
  • 43.Soldaini L. et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv [cs.CL] (2024). [Google Scholar]
  • 44.Raffel C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv [cs.LG] (2019). [Google Scholar]
  • 45.Together. RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens. https://www.together.ai/blog/redpajama.
  • 46.Gao L. et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv [cs.CL] (2020). [Google Scholar]
  • 47.Zhou C. et al. LIMA: Less is more for alignment. arXiv [cs.CL] (2023). [Google Scholar]
  • 48.Inspect_ai: Inspect: A Framework for Large Language Model Evaluations. (Github). [Google Scholar]
  • 49.Alpaca_eval: An Automatic Evaluator for Instruction-Following Language Models. Human-Validated, High-Quality, Cheap, and Fast. (Github). [Google Scholar]
  • 50.Miller E. Adding error bars to evals: A statistical approach to language model evaluations. arXiv [stat.AP] (2024). [Google Scholar]
  • 51.Grattafiori A. et al. The Llama 3 herd of models. arXiv [cs.AI] (2024). [Google Scholar]

Articles from Research Square are provided here courtesy of American Journal Experts

RESOURCES