Abstract
This study aimed to determine how subjective or authoritative misinformation embedded in user prompts affects large language model (LLM) accuracy on a clinical question with a known gold-standard answer (the treatment line of aripiprazole). Five leading LLMs answered the clinical question under three prompt conditions: (1) neutral, (2) an incorrect “self-recalled” memory, and (3) an incorrect statement attributed to an authority. Each model–scenario pair was repeated ten times (250 total responses). Accuracy differences were tested with χ² and Cramér’s V, and score shifts were analyzed with van Elteren tests. All models were correct under the neutral prompt (100% accuracy). Accuracy dropped to 45% with self-recall prompts and to 1% with authoritative prompts, indicating a strong prompt–accuracy association (Cramér’s V = 0.75, P < 0.001). Efficacy and tolerability ratings fell in parallel, yet models’ self-rated confidence under authoritative prompting stayed high and was statistically indistinguishable from baseline. LLMs are highly susceptible to misleading cues, especially those invoking authority, while remaining overconfident. These findings call for stronger validation standards, user education, and design safeguards before deploying LLMs in healthcare.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-026-38019-3.
Keywords: Authority effect, Artificial intelligence, Clinical inquiry, Large language model, Bias, AI in medicine
Subject terms: Health care, Medical research, Psychology
Introduction
Large language models (LLMs), a key development in generative artificial intelligence (AI), have demonstrated remarkable capabilities across various domains. In the medical field, advanced models have achieved performance in professional assessments that is comparable to, or even surpasses, that of human experts1–3, highlighting their vast knowledge base and potential.
This technological maturation is catalyzing a paradigm shift in how clinical knowledge is acquired and disseminated across the entire healthcare ecosystem. It empowers patients to move beyond passive information reception toward proactive health inquiry. It enables learners to transition from static textbooks to dynamic, inquiry-based learning4,5. For clinicians, LLMs offer a readily accessible digital consultant for exploring clinical scenarios, complementing traditional peer consultation. The sheer accessibility and immediacy of LLMs ensure that these active, self-directed inquiries are becoming integral to modern healthcare6.
However, beneath this promise lies a critical question of reliability, especially when LLMs operate outside the sterile conditions of benchmark evaluations. While LLMs excel at answering neutral, fact-based questions, their vulnerability to intrinsic biases and factual hallucinations is well-documented7,8. Moreover, existing research has predominantly evaluated their knowledge using standardized, neutral question sets, thereby overlooking the complexity of authentic interactions9. In practice, inquiries from clinicians, learners, or patients are rarely context-free10. They are frequently framed by personal impressions, pre-existing beliefs, or, critically, cues from perceived authorities. When these hidden assumptions are inaccurate, they can reinforce misconceptions11, a risk that is amplified in interactions with a seemingly objective AI. While research has explored LLM vulnerabilities through adversarial attacks, these prompts are often artificial and lack the naturalistic context of genuine clinical inquiries12. To date, research has largely neglected to evaluate LLM performance under these noisy and non-neutral conditions, leaving a significant blind spot in our understanding of their real-world safety and utility.
Therefore, our study employed a simulated medical student inquiry as a controlled experimental model to investigate a fundamental question: How do subjective impressions and authoritative cues embedded in user prompts influence the accuracy and reliability of LLM responses in a healthcare context? By systematically introducing these biases into inquiries regarding major depressive disorder management, we evaluate potential vulnerabilities in five state-of-the-art LLMs. For readers both within and outside the medical and AI domains, our findings aim to enhance critical AI literacy among healthcare stakeholders and the wider public, provide a foundation for developing more robust AI-assisted tools, and ultimately promote safer and more effective human–AI collaboration.
Methods
As this research did not involve human subjects or their data, it was exempt from institutional review board (IRB) review. We collected data for this study on May 22 and 23, 2025.
Selection of LLMs
The subjects of the experiment were five state-of-the-art (SOTA) models ranked at the top of the large model arena (LMArena) at the time of the study13. LMArena is a platform that ranks LLMs based on community-driven comparative evaluations. The five selected models were OpenAI’s GPT-4o and o3, Google’s Gemini 2.5 Pro, and two variants of Google’s Gemini 2.5 Flash: one with standard reasoning capabilities and one configured through a user interface setting for non-thinking (non-reasoning), direct-answer generation. We selected these models for three reasons: (1) they represented the current pinnacle of publicly available technology; (2) they originated from two major development institutions, ensuring representativeness; and (3) their high-accessibility user interfaces aligned with the real-world usage scenarios of medical learners.
Experimental task and gold standard
Our study employed a clinical question with a definitive gold standard answer for evaluation. The question was based on the 2023 Canadian Network for Mood and Anxiety Treatments (CANMAT) guidelines for management of major depressive disorder7,14, which are widely adopted in psychiatry. A key strength of the CANMAT guidelines is their transparent, evidence-based framework, which classifies treatments into lines (e.g., First, Second, Third) based on a dual-axis system: the quality of scientific evidence (Levels 1–4) and expert consensus on clinical factors like tolerability and safety. Crucially, a first-line designation is the highest recommendation, reserved for treatments supported by top-tier evidence (Level 1 or 2), which typically signifies data from multiple randomized controlled trials (RCTs) or meta-analyses. These guidelines clearly label aripiprazole as a first-line adjunctive therapy for difficult-to-treat depression (DTD), a standard of care widely reflected in clinical practice. The clarity and clinical significance of this fact made it an ideal benchmark for assessing the knowledge accuracy of LLMs.
Prompt design and experimental condition
To investigate the influence of prior impressions and authoritative cues, we designed three distinct informational conditions within each prompt. Each was prefaced with a simulated background status for the LLM: “I am a medical student using a language model to assist in learning about adjunctive pharmacological treatments for difficult-to-treat depression.” The baseline prompt (control group) used a neutral, non-leading question to directly inquire about the treatment line of aripiprazole in DTD adjunctive therapy. The prior impression prompt embedded a simulated student’s incorrect memory, such as, “As far as I remember, aripiprazole is considered a second-line treatment…” (referred to as a self-recalled impression). The authority effect prompt cited an authoritative figure, for example, “My teacher mentioned that, according to expert consensus, aripiprazole is considered a third-line treatment…” Both the prior impression and authority effect prompts came in two variants, one suggesting second-line and one suggesting third-line placement. The detailed prompt structures are available in Supplementary Table 1.
Data collection and outcome measures
Considering the stochastic nature of LLM generation, we tested each model 10 times under each of the five prompt scenarios (one baseline, two prior impression variants, and two authority effect variants). This resulted in a total of 250 data points (5 models × 5 scenarios × 10 repetitions). To facilitate a multi-dimensional assessment, we instructed each LLM to provide four metrics for the medication: Treatment Line & Confidence, where the model had to specify a treatment line (1, 2, or 3) and rate its confidence on a scale of 0 to 10; Efficacy & Tolerability Scores, where the model provided two 0-to-10 scores based on (1) evidence-based efficacy and (2) tolerability, safety, and feasibility, with each metric clearly defined in the prompt.
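The factorial design described above can be made concrete with a short enumeration. The following Python sketch is illustrative only and is not the authors' data-collection code: the dictionary keys and the function name `build_trials` are hypothetical, while the model names, scenario structure, and repetition count come from the text.

```python
from itertools import product

# Models as named in the "Selection of LLMs" section.
MODELS = [
    "GPT-4o",
    "o3",
    "Gemini 2.5 Pro",
    "Gemini 2.5 Flash",
    "Gemini 2.5 Flash (non-thinking)",
]

# Five prompt scenarios: (information style, treatment line suggested in the prompt).
SCENARIOS = [
    ("baseline", None),
    ("self-recall", 2),
    ("self-recall", 3),
    ("authoritative", 2),
    ("authoritative", 3),
]

REPETITIONS = 10


def build_trials():
    """Enumerate every model x scenario x repetition combination."""
    return [
        {"model": model, "style": style, "suggested_line": line, "repetition": rep}
        for model, (style, line), rep in product(
            MODELS, SCENARIOS, range(1, REPETITIONS + 1)
        )
    ]


trials = build_trials()
print(len(trials))  # 5 models x 5 scenarios x 10 repetitions
```

Counting trials by style gives 50 baseline, 100 self-recall, and 100 authoritative responses, which matches the sample sizes reported in Table 1.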
Statistical analysis
We conducted all data analysis using RStudio (version 2025.05.1 + 513) (Posit Software, PBC, Boston, MA) based on R (version 4.5.0) and set statistical significance at a two-tailed P < 0.05. We used descriptive statistics to calculate the accuracy of treatment line classification and to present efficacy, tolerability, and confidence scores as mean with standard deviation. To test whether prompt conditions affected accuracy, we performed a chi-squared test and calculated Cramér’s V as a measure of effect size15. To compare the scores for efficacy, tolerability, and confidence across conditions, we employed a stratified Wilcoxon signed-rank test (van Elteren test). This non-parametric test is suitable for paired repeated-measures data, and we used the LLM model as a stratifying variable to control for inherent differences between models. We executed this analysis using the coin package in R16.
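For readers unfamiliar with stratified rank tests, the Python sketch below shows one common textbook formulation of the van Elteren statistic, in which a Wilcoxon rank-sum statistic is computed within each stratum (here, each model), centred on its expectation, weighted by 1/(N_k + 1), summed across strata, and referred to a normal approximation. It is not the authors' code, which used the coin package in R, and an exact or permutation-based p-value may differ from this approximation. The function names and the toy scores are hypothetical.

```python
import math
from collections import Counter


def midranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def van_elteren(strata):
    """strata maps a stratum name to (scores_a, scores_b).

    Returns (z, two-sided p) under the normal approximation.
    """
    statistic = 0.0
    variance = 0.0
    for scores_a, scores_b in strata.values():
        m, n = len(scores_a), len(scores_b)
        total = m + n
        pooled = list(scores_a) + list(scores_b)
        ranks = midranks(pooled)
        rank_sum_a = sum(ranks[:m])
        expected = m * (total + 1) / 2
        # Tie correction for the rank-sum variance.
        ties = sum(t ** 3 - t for t in Counter(pooled).values())
        stratum_var = m * n / 12 * ((total + 1) - ties / (total * (total - 1)))
        weight = 1 / (total + 1)
        statistic += weight * (rank_sum_a - expected)
        variance += weight ** 2 * stratum_var
    if variance == 0:
        return 0.0, 1.0  # all scores identical: no evidence of a shift
    z = statistic / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical scores for two models: (baseline condition, misleading condition).
toy = {
    "model_a": ([9, 9, 8, 9, 9], [7, 8, 7, 7, 8]),
    "model_b": ([8, 9, 9, 8, 9], [7, 7, 6, 7, 8]),
}
z, p = van_elteren(toy)
print(round(z, 2), p < 0.01)
```

Stratifying by model means that scores are only ranked against other scores from the same model, so a model that rates everything higher does not distort the comparison between prompt conditions.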
Results
Impact of information style on response accuracy
The study first evaluated the overall impact of the three different prompt styles on LLM response accuracy (Table 1; Fig. 1). In the baseline condition, all models correctly identified aripiprazole as a first-line treatment with 100% accuracy. However, the introduction of misleading prompts led to a substantial decrease in performance. Under the authoritative prompt condition, accuracy fell to nearly zero (1.0%), regardless of whether the authoritative cue incorrectly suggested a second- or third-line placement. In the self-recall condition, model performance was intermediate, with an overall accuracy of 45%. A chi-squared test confirmed a statistically significant association between prompt style and response accuracy (χ²(2) = 141.2, P < 0.001), with a Cramér’s V of 0.75, indicating a large effect size.
Table 1.
Summary of large language model response accuracy and scores of efficacy, tolerability, and confidence by information style.
| Information style | Treatment line suggested in prompt | Number of samples | Correct rate | Efficacy mean (Standard deviation) | Tolerability mean (Standard deviation) | Confidence mean (Standard deviation) |
|---|---|---|---|---|---|---|
| Authoritative | 2 | 50 | 0.02 | 8.08 (0.53) | 6.10 (0.46) | 8.96 (0.78) |
| Authoritative | 3 | 50 | 0 | 7.76 (0.52) | 5.22 (0.65) | 8.82 (1.14) |
| Authoritative | Overall | 100 | 0.01 | 7.92 (0.54) | 5.66 (0.71) | 8.89 (0.97) |
| Self-Recall | 2 | 50 | 0.38 | 8.48 (0.50) | 6.20 (0.49) | 8.66 (0.56) |
| Self-Recall | 3 | 50 | 0.52 | 8.31 (0.58) | 5.74 (0.63) | 8.26 (0.85) |
| Self-Recall | Overall | 100 | 0.45 | 8.40 (0.55) | 5.97 (0.61) | 8.46 (0.74) |
| Baseline | Overall | 50 | 1 | 8.74 (0.44) | 6.72 (0.61) | 8.84 (0.42) |
Fig. 1.
Stacked bar charts with correct and incorrect response counts by information style.
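The reported test statistic and effect size can be checked from the correct and incorrect counts implied by Table 1 (50 of 50 correct at baseline, 45 of 100 with self-recall prompts, 1 of 100 with authoritative prompts). The Python sketch below computes the Pearson chi-squared statistic without continuity correction and Cramér's V; the function names are hypothetical, and this is an illustration rather than the authors' R code.

```python
import math


def chi_square_independence(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, n


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V effect size for an n_rows x n_cols table."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Rows: baseline, self-recall, authoritative. Columns: correct, incorrect.
counts = [
    [50, 0],
    [45, 55],
    [1, 99],
]

chi2, dof, n = chi_square_independence(counts)
v = cramers_v(chi2, n, len(counts), len(counts[0]))
print(round(chi2, 1), dof, round(v, 2))  # 141.2 2 0.75
```

Because the table has only two columns, Cramér's V reduces to the square root of χ²/N, so 141.2/250 gives approximately 0.75.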
Analysis of individual model performance
Further analysis of individual model performance revealed substantial variations in their resilience to misleading prompts (Fig. 2). In the authoritative prompt condition, most models, including GPT-4o and Gemini 2.5 Pro/Flash, showed very low accuracy, consistently adopting the incorrect suggestion. The only exception was OpenAI’s o3, which provided the correct answer in one instance. The response to the self-recall prompt was more divergent. OpenAI’s GPT-4o had the lowest accuracy among the reasoning models in this condition. In contrast, Google’s Gemini 2.5 Flash achieved an accuracy greater than 50%, which was higher than its larger counterpart, Gemini 2.5 Pro. Notably, the non-reasoning model was unaffected by the self-recall prompt, maintaining 100% accuracy.
Fig. 2.
Correct-answer rate by large language model and prompt style. NT non-thinking, Auth authoritative, Self self-recall, Base baseline.
Analysis of confidence, efficacy, and tolerability scores
To investigate the reasoning process behind the LLM responses, we analyzed the models’ confidence, efficacy, and tolerability scores (Fig. 3). The van Elteren test revealed that prompt style significantly influenced both efficacy and tolerability ratings. Compared to the baseline (efficacy mean = 8.74, tolerability mean = 6.72), the authoritative prompt significantly lowered the models’ mean scores for efficacy (mean = 7.92) and tolerability (mean = 5.66) (P < 0.001). The self-recall prompt had an intermediate effect, resulting in scores significantly lower than the baseline but higher than those in the authoritative condition (Fig. 3). A key observation was made regarding the confidence scores. Despite the near-zero accuracy in the authoritative prompt condition, the mean confidence score (mean = 8.89) was not statistically different from that of the baseline condition (mean = 8.84) (P = 0.14). In the self-recall condition, the mean confidence score (mean = 8.46) was significantly lower than in both the baseline and authoritative conditions (P < 0.001).
Fig. 3.
Mean scores with standard error for confidence, efficacy, and tolerability by information style. Statistical significance was assessed with the stratified Wilcoxon test (van Elteren test) with Bonferroni adjustment. ***P < 0.001.
Discussion
To our knowledge, this study was the first to systematically quantify the impact of prior impressions and the authority effect on the reliability of LLM responses in a simulated clinical inquiry context. Our findings showed that a clinical question answered with 100% accuracy in the baseline scenario suffered a dramatic collapse in accuracy when paired with a misleading prompt. This phenomenon appeared to highlight a critical vulnerability of LLMs when facing non-neutral inquiries. This type of query, which incorporates a user’s existing knowledge or an authority’s opinion, is common in real-world learning scenarios. Users, including clinicians, researchers, and patients, often frame questions around what they think they know precisely because they are uncertain17. Our research suggests that this natural mode of interaction may expose a key weakness of current-generation LLMs.
The nature of the misleading cue determined the severity of this vulnerability in our study. Authoritative prompts were particularly potent, with accuracy dropping from 45% in the self-recall condition to just 1% in the authoritative condition. When we embedded cues such as “my teacher mentioned” or “according to expert consensus” in the prompt, nearly all tested SOTA models appeared to abandon their internal knowledge and conform to the erroneous information. This behavior echoes the known AI problem of sycophancy, where models tend to align their responses with the user’s presented viewpoint18,19. In a high-stakes field like healthcare, such deference could be exceptionally dangerous. For clinicians, it could lead to flawed clinical reasoning and adverse patient outcomes. For patients and caregivers, it could result in the adoption of harmful self-care advice or a misunderstanding of their conditions. In contrast, the influence of a self-recalled memory was more heterogeneous. Models exhibited partial resistance, yet the substantial drop in accuracy still underscored how easily LLM stability can be disrupted by subjective user input in this context.
A deeper analysis of individual model performance yielded some counterintuitive results. For instance, the architecturally leaner Gemini 2.5 Flash demonstrated greater robustness to misleading self-recall prompts, achieving 52% accuracy compared with 18% for its larger Gemini 2.5 Pro counterpart. Similar observations have been reported in other research7, and this may suggest that for tasks involving such noisy inputs, a more complex reasoning chain could become a liability, causing the model to incorrectly integrate irrelevant prompt elements into its decision-making process. Furthermore, the slight resistance shown by the OpenAI o3 model was intriguing. Its reasoning trace revealed that the model had accessed the correct CANMAT guidelines via web search, yet in most cases, its final output chose to ignore this evidence in favor of the flawed authoritative cue. This serves as a reminder that even when LLMs possess tool-use or web-browsing capabilities, their final information integration and judgment mechanisms remain imperfect. While a transparent reasoning process is valuable, users must maintain a critical perspective on the final conclusion.
Our findings posed a serious challenge to the notions of LLM explainability and confidence. Some have proposed that one cannot simply ask an LLM for its reasoning to understand its decision18. Similarly, our research suggested that such explanations are likely unreliable post-hoc rationalizations. When a model was misled into providing an incorrect treatment line, its accompanying scores for efficacy and tolerability also systematically shifted to support that erroneous conclusion. More alarmingly, the self-assessed confidence of LLMs was highly misleading. When conforming to incorrect authoritative prompts, the models’ average confidence was statistically indistinguishable from their confidence when providing correct answers in the baseline condition. This means an LLM can appear just as confident when disseminating misinformation as it does when stating facts. Although confidence dropped slightly but significantly in the self-recall condition, its absolute value remained high (mean > 8.4), aligning with previous findings that suggest that LLM confidence scores are often inflated and cannot be used as a reliable proxy for accuracy20.
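The mismatch between stated confidence and observed accuracy can be illustrated with a simple calculation on the Table 1 summary values. The Python sketch below rescales mean confidence from the 0-to-10 scale to 0-to-1 and subtracts accuracy. Treating a self-rated confidence score as if it were a probability is an assumption made purely for illustration, and the "gap" measure and function name are hypothetical rather than part of the study's analysis.

```python
# Overall accuracy and mean confidence per information style, from Table 1.
SUMMARY = {
    "baseline": {"accuracy": 1.00, "mean_confidence": 8.84},
    "self-recall": {"accuracy": 0.45, "mean_confidence": 8.46},
    "authoritative": {"accuracy": 0.01, "mean_confidence": 8.89},
}


def overconfidence_gap(accuracy, mean_confidence, scale_max=10.0):
    """Rescaled confidence minus accuracy; positive values suggest overconfidence.

    Assumes (for illustration only) that a 0-10 confidence rating can be
    read as a rough probability of being correct.
    """
    return mean_confidence / scale_max - accuracy


gaps = {
    style: overconfidence_gap(row["accuracy"], row["mean_confidence"])
    for style, row in SUMMARY.items()
}

for style, gap in gaps.items():
    print(f"{style}: {gap:+.3f}")
```

Under this reading, the gap is slightly negative at baseline, grows in the self-recall condition, and is largest in the authoritative condition, where confidence stays near 8.9 while accuracy is 1%.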
Based on these findings, we propose several implications for different stakeholders in the healthcare ecosystem and for the wider public engaging with LLMs on health-related topics. (1) For healthcare professionals and other users, it is imperative to cultivate AI literacy, such as prompt hygiene, when using LLMs as knowledge-seeking tools21. They should frame questions as neutrally and objectively as possible, avoiding the inclusion of subjective guesses, personal memories, or secondhand authoritative opinions. Crucially, professionals must critically appraise the generated output, recognizing its inherent limitations. Users must recognize that an LLM’s response, even if delivered with high confidence and seemingly sound reasoning, may be entirely incorrect. (2) For health informatics developers and implementers, our results suggest the importance of implementing guardrails in LLM-based educational applications22. A system could be designed to actively detect authoritative or subjective cues within a user’s prompt, such as the phrases “my doctor said” or “I read that.” Upon detection, the interface could generate an advisory, cautioning the user that their phrasing might compromise the model’s objectivity and recommending they reframe the question neutrally to receive the most accurate information. Furthermore, model selection should be flexible. For tasks requiring definitive answers, a simpler or specially fine-tuned model may be more cost-effective and stable than the most complex SOTA model. (3) For policymakers, regulations for LLM applications in education could draw from frameworks like the European Union (EU) AI Act for high-risk AI systems23. This must include a clear obligation for developers and educational institutions to disclose potential risks and limitations to users. Evaluation criteria must extend beyond static knowledge accuracy to include stability, interpretability, and resilience to interference in dynamic, real-world interactive scenarios.
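The guardrail idea in point (2) can be sketched as a simple pre-processing step that scans a prompt for cue phrases before it reaches the model. The Python example below is a minimal illustration under stated assumptions: the cue categories, regular expressions, advisory wording, and function names are all hypothetical, and a real system would need a much broader, validated cue list and likely more than keyword matching.

```python
import re

# Illustrative cue patterns only (assumptions for this sketch), seeded with
# phrases quoted in this article.
CUE_PATTERNS = {
    "authority": [
        r"\bmy (?:teacher|doctor|professor) (?:said|mentioned|told me)\b",
        r"\baccording to expert consensus\b",
    ],
    "subjective": [
        r"\bas far as i remember\b",
        r"\bi read that\b",
    ],
}

# Hypothetical advisory text shown to the user when a cue is found.
ADVISORY = (
    "Your question includes a remembered or secondhand claim. "
    "Consider rephrasing it neutrally to reduce the risk of a biased answer."
)


def detect_cues(prompt):
    """Return a list of (category, matched text) for every cue found."""
    found = []
    for category, patterns in CUE_PATTERNS.items():
        for pattern in patterns:
            for match in re.finditer(pattern, prompt, flags=re.IGNORECASE):
                found.append((category, match.group(0)))
    return found


def advisory_for(prompt):
    """Return the advisory text if any cue is detected, otherwise None."""
    return ADVISORY if detect_cues(prompt) else None


example = (
    "My teacher mentioned that, according to expert consensus, "
    "aripiprazole is considered a third-line treatment"
)
print(detect_cues(example))
```

Keyword matching of this kind is easy to audit but will miss paraphrased cues, so it is better viewed as a first-pass filter than as a complete safeguard.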
Limitations
Our study had several limitations. First, we employed an in-depth analysis of a single clinical question, representing a deductive inquiry designed to expose a specific LLM vulnerability. The generalizability of these findings requires validation across a broader range of clinical scenarios and knowledge domains. Second, LLM technology is evolving rapidly, and today’s SOTA models may soon be replaced. However, we believe the core issues we uncovered, such as sycophancy and the illusion of confidence, are intrinsic to current LLM architectures and will remain relevant in the near future. Finally, our study’s use of simulated prompts within a medical student context represents a limitation. While designed to approximate real-world queries, this setting may not fully capture the dynamic nature of authentic interactions, and the identified vulnerability to authority cues likely poses even more acute risks for practicing clinicians and patients seeking direct medical advice.
Conclusions
Our research demonstrated that embedding prior impressions and authoritative cues in prompts could have a devastating impact on the reliability of LLM responses. This finding offers critical insight into the potential risks of LLMs and underscores the urgent need to foster critical AI literacy. While LLMs exhibit astonishing potential in their knowledge capacity, deficiencies in their stability and explainability could be significant barriers to their integration into the broader healthcare system. Future research should continue to explore the latent flaws of LLMs in complex interactions and focus on developing more robust and reliable models. Only then can this transformative technology be leveraged to safely and effectively empower the entire spectrum of healthcare participants, from providers to patients.
Author contributions
Conceptualization, Y.C., M.-H. H. and C.-C. C.; Writing—Original Draft, Y.C. ; Writing—Review and Editing, C.-C. C.; Visualization, P.-C. J.; Supervision, C.-C. C. All authors have read and agreed to the published version of the manuscript.
Funding
No funding.
Data availability
The data that support the findings of this study are available on request from the corresponding author.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Chang, Y., Huang, S. S., Hsu, W. Y. & Liu, Y. C. Evaluating chatbots in psychiatry: Rasch-based insights into clinical knowledge and reasoning. PLOS ONE 20, e0330303 (2025).
- 2. Chang, Y., Su, C. Y. & Liu, Y. C. Assessing the performance of chatbots on the Taiwan psychiatry licensing examination using the Rasch model. Healthcare 12, 2305 (2024).
- 3. Goh, E. et al. Large language model influence on diagnostic reasoning: A randomized clinical trial. JAMA Netw. Open 7, e2440969 (2024).
- 4. Li, Z. et al. Large language models and medical education: a paradigm shift in educator roles. Smart Learn. Environ. 11, 26 (2024).
- 5. Boscardin, C. K., Gin, B., Golde, P. B. & Hauer, K. E. ChatGPT and generative artificial intelligence for medical education: potential impact and opportunity. Acad. Med. 99, 22 (2024).
- 6. Han, Y., Hong, S., Li, Z. & Lim, C. Defining and classifying the roles of intelligent learning companion systems: A scoping review of the literature. TechTrends 69, 567–581 (2025).
- 7. Chang, Y., Liu, Y. C., Huang, S. S. & Hsu, W. Y. Assessing bias in AI-driven psychiatric recommendations: A comparative cross-sectional study of chatbot-classified and CANMAT 2023 guideline for adjunctive therapy in difficult-to-treat depression. Psychiatry Res. 348, 116501 (2025).
- 8. Huang, L. et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43, 42:1–42:55 (2025).
- 9. Bedi, S. et al. Testing and evaluation of health care applications of large language models: A systematic review. JAMA (2024). doi:10.1001/jama.2024.21700.
- 10. Motzkus, C. et al. Pre-clinical medical student reflections on implicit bias: implications for learning and teaching. PLOS ONE 14, e0225058 (2019).
- 11. Dwyer, C. P. An evaluative review of barriers to critical thinking in educational and real-world settings. J. Intell. 11, 105 (2023).
- 12. Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. npj Digit. Med. 8, 149 (2025).
- 13. Chiang, W. L. et al. Chatbot Arena: An open platform for evaluating LLMs by human preference. Preprint at 10.48550/arXiv.2403.04132 (2024).
- 14. Lam, R. W. et al. Canadian Network for Mood and Anxiety Treatments (CANMAT) 2023 update on clinical guidelines for management of major depressive disorder in adults: Réseau canadien pour les traitements de l’humeur et de l’anxiété (CANMAT) 2023: mise à jour des lignes directrices cliniques pour la prise en charge du trouble dépressif majeur chez les adultes. Can. J. Psychiatry 69, 641–687 (2024).
- 15. Sun, S., Pan, W. & Wang, L. L. A comprehensive review of effect size reporting and interpreting practices in academic journals in education and psychology. J. Educ. Psychol. 102, 989–1004 (2010).
- 16. Hothorn, T., Hornik, K., van de Wiel, M. A. & Zeileis, A. Implementing a class of permutation tests: the coin package. J. Stat. Softw. 28, 1–23 (2008).
- 17. Chin, C. & Osborne, J. Students’ questions: a potential resource for teaching and learning science. Stud. Sci. Educ. 44, 1–39 (2008).
- 18. Rueda, A. et al. Understanding LLM scientific reasoning through promptings and model’s explanation on the answers. Preprint at https://arxiv.org/abs/2505.01482v1 (2025).
- 19. Sharma, M. et al. Towards understanding sycophancy in language models. Preprint at https://arxiv.org/abs/2310.13548v4 (2023).
- 20. Omar, M., Agbareia, R., Glicksberg, B. S., Nadkarni, G. N. & Klang, E. Benchmarking the confidence of large language models in answering clinical questions: Cross-sectional evaluation study. JMIR Med. Inf. 13, e66917 (2025).
- 21. Walter, Y. Embracing the future of artificial intelligence in the classroom: the relevance of AI literacy, prompt engineering, and critical thinking in modern education. Int. J. Educ. Technol. High. Educ. 21, 15 (2024).
- 22. Lexman, R. R., Krishna, A. & Sam, M. P. AI guardrails in business and education: bridging minds and markets. Dev. Learn. Organizations: Int. J. ahead-of-print (2025).
- 23. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 Laying down Harmonised Rules on Artificial Intelligence and Amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act) (Text with EEA Relevance) (2024).