Scientific Reports. 2026 Jan 7;16:4798. doi: 10.1038/s41598-026-35406-8

Disagreement between human and AI evaluation of treatment plans

Dipayan Sengupta 1, Saumya Panda 2
PMCID: PMC12873255  PMID: 41501144

Abstract

Large language models (LLMs) are increasingly advocated for clinical decision support, yet it remains unclear how their recommendations are evaluated relative to human-authored plans when presentation cues are controlled. We conducted a vignette-based study in dermatology in which experienced clinicians (n = 10) and two LLMs (a generalist model and a deliberative reasoning model) drafted treatment plans for five de-identified cases. All plans were normalized for structure, length and tone before being blindly scored on a rubric by the same clinician cohort and by an AI judge. The primary outcome was the difference in plan scores as a function of evaluator identity (human vs. AI). We observed a consistent evaluator effect: clinician raters tended to assign higher scores to clinician-authored plans, whereas the AI judge tended to assign higher scores to AI-authored plans. Because plans were standardized and authorship was masked, this divergence is unlikely to be explained solely by surface presentation, suggesting that humans and AI apply partly different internal criteria when judging plan quality. The study is exploratory and limited by the small sample, synthetic vignettes and the use of a single AI judge; scores reflect perceived quality rather than patient outcomes. These findings indicate that evaluator identity systematically shapes judgments of clinical plans under controlled conditions and motivate multi-metric, context-aware evaluation frameworks that capture the multidimensional nature of clinical reasoning, along with human–AI interfaces that make assumptions and criteria explicit.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-35406-8.

Keywords: Clinical decision support, Large language models, Evaluator effect, Blinded evaluation, Dermatology, Human–AI collaboration, Benchmarking

Subject terms: Computational biology and bioinformatics, Health care, Mathematics and computing, Medical research, Scientific community

Introduction

The integration of artificial intelligence (AI), particularly large language models (LLMs), into medicine is accelerating across education, decision support and administration, where reported performance remains mixed but promising1,2, even as ethical and practical concerns persist about reliability, transparency and accountability in deployment3. Beyond technical performance, a central challenge is to understand how LLMs should be positioned within the cognitive and contextual fabric of clinical decision-making, rather than assuming a single role a priori.

Clinical accuracy is necessary but insufficient: Therapeutic reasoning typically requires reconciling multiple objectives—maximizing efficacy while minimizing adverse effects and cost4,5—in problems that frequently admit no single correct answer6. Substantial inter-practitioner variation has long been documented, even among experienced clinicians making similar decisions in similar contexts7 and across health systems8, including within informatics-focused studies9. Such variability indicates that clinical reasoning is not merely computational; it is adaptive and context-rich.

Under uncertainty and time pressure, clinicians lean on heuristics and pattern recognition shaped by experience, which can introduce bias yet enable timely action10,11, especially when cognitive or temporal limits are reached12,13. LLMs, in contrast, operate under a different rationality bound: they optimize over broad textual corpora rather than situational context14, and their capabilities remain constrained by the scope and limitations of the underlying literature15. Consistent with these differences, empirical analyses show that LLMs exhibit bias and error profiles that diverge from human patterns (e.g., non-causal “shortcut” strategies rather than experience-grounded heuristics)16,17, a gap also noted from a comparative cognition perspective18. Consequently, even with identical case details, humans and LLMs may reach plausible plans via potentially different pathways.

Evaluator preferences can further compound these differences. Humans often down-weight algorithmic recommendations after observing errors—even when algorithms are, on average, more accurate19—while LLMs can prefer model-generated outputs over human-generated ones when judging text20. Whether such evaluator effects persist once stylistic cues are carefully normalized remains an open question.

Context remains central to clinical reasoning. Clinicians integrate tacit and explicit factors—patient preferences, resource constraints and cultural norms—at scales that are difficult to enumerate and harder to encode in text21,22, with recent reviews cataloguing hundreds to thousands of contextual influences on decisions23. Despite decades of scholarship, there is no single, universally accepted theory of clinical or therapeutic reasoning24,25, though most accounts converge on the view that heuristic, context-sensitive judgment complements formal evidence appraisal in routine care26,27.

These considerations expose the limits of traditional question–answer benchmarking for medical LLMs: Fixed answer keys track knowledge retrieval but only partially reflect the quality of reasoning that integrates evidence with context28,29. Recent work therefore proposes human evaluation frameworks that emphasize clinical applicability, transparent criteria and practical utility alongside correctness30. In short, assessments that mirror real clinical work must look beyond single best answers toward comparative, context-aware evaluation, ideally represented by a holistic instrument that captures clinical appropriateness, safety, feasibility, and patient-centred considerations.

Dermatology offers a suitable testbed for such inquiry: Beyond image recognition, the field involves complex, longitudinal management of chronic and inflammatory conditions where plans must balance multiple modalities against comorbidities, access and patient preferences. In this study, we compare plans authored by experienced dermatologists with plans generated by two LLMs that differ in design emphasis—a generalist model (GPT-4o) and a deliberative reasoning model (o3)—and then evaluate these plans under two lenses: blinded clinician peers and an AI judge. Our objective is not to adjudicate superiority but to examine whether evaluator identity (human vs. AI) systematically shifts perceived plan quality in a realistic, rubric-based setting, and to quantify the extent of such divergence, while noting emergent evaluations of the o-series reasoning family in clinical contexts31–33.

Results

The study revealed a clear, statistically significant divergence in the evaluation of treatment plans that depended on the identity of the evaluator. Human experts and the AI judge reached opposing conclusions about the quality of plans generated by human versus AI agents.

Phase 1: human experts favour peer-generated plans

When evaluated by the ten human dermatologists, treatment plans authored by their peers were scored significantly higher than those generated by either AI model. The 450 evaluations of human-generated plans yielded an aggregate mean score of 7.62 (SD 1.74), whereas the 100 evaluations of AI-generated plans produced a mean score of 7.16 (SD 1.73). A Wilcoxon signed-rank test confirmed that this difference was statistically significant (p = 0.0313).
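This case-level comparison can be reproduced with a paired Wilcoxon signed-rank test in scipy; the sketch below is illustrative only, and the file name and column names are assumptions rather than the authors' actual schema (the Plan_Source_Type values follow the Methods).

```python
# Illustrative sketch only: the file name and column names are assumptions.
import pandas as pd
from scipy.stats import wilcoxon

scores = pd.read_csv("phase1_scores.csv")  # hypothetical export of Supplementary Data 2

# Collapse to one mean score per case for human-authored and AI-authored plans,
# then compare the paired case-level means (n = 5 cases).
per_case = (
    scores
    .assign(ai_authored=scores["Plan_Source_Type"].isin(["GPT-4o", "o3"]))
    .groupby(["Case_ID", "ai_authored"])["Score"]
    .mean()
    .unstack("ai_authored")
)

stat, p = wilcoxon(per_case[False], per_case[True])  # two-sided by default
print(f"Wilcoxon signed-rank statistic = {stat:.3f}, p = {p:.4f}")
```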

The participant rankings reflected this preference, with human experts securing the top five positions. Neither AI model reached the top five: GPT-4o ranked 6th with a mean score of 7.383 (SD 1.562), while the advanced reasoning model, o3, ranked 11th with a mean score of 6.974 (SD 1.836). The full ranking for all 12 participants is detailed in Table 1.

Table 1.

Full summary of human evaluation (Phase 1).

Rank  Plan source  Plan source ID  Mean  SD  Count
1 Human uid_ss84Ly70 8.172 1.408 45
2 Human uid_adB7pL21 7.838 1.546 45
3 Human uid_agH7mQ29 7.812 1.558 45
4 Human uid_ms91Dw04 7.713 1.379 45
5 Human uid_ag65Re18 7.428 1.904 45
6 AI_GPT4o uid_gpt4_W1 × 2 7.383 1.562 50
7 Human uid_ac9ZkM54 7.289 1.67 45
8 Human uid_saH3vN63 7.274 1.884 45
9 Human uid_sd52Px46 7.252 1.629 45
10 Human uid_sp42qT88 7.196 1.537 45
11 AI_o3 uid_o3_Y6Z5 6.974 1.836 50
12 Human uid_sb34UxQ9 6.872 2.27 45
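The per-participant summary in Table 1 can be recomputed from the disaggregated Phase 1 scores with a simple pandas aggregation; a minimal sketch follows, with hypothetical file and column names.

```python
# Illustrative sketch only: file name and column names are assumptions.
import pandas as pd

scores = pd.read_csv("phase1_scores.csv")  # hypothetical export of Supplementary Data 2

table1 = (
    scores.groupby(["Plan_Source", "Plan_Source_ID"])["Score"]
    .agg(Mean="mean", SD="std", Count="count")
    .round(3)
    .sort_values("Mean", ascending=False)
    .reset_index()
)
table1.insert(0, "Rank", range(1, len(table1) + 1))
print(table1.to_string(index=False))
```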

To quantify the consistency of scoring among the human evaluators, we calculated the intraclass correlation coefficient (ICC). The analysis revealed an ICC of 0.561 (95% CI [0.385, 0.741]), indicating a moderate level of agreement among raters.
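The Methods specify a two-way random-effects, single-rater, absolute-agreement ICC (ICC2); a minimal pingouin sketch, again with assumed column names, is shown below. Note that the actual design is incomplete (raters did not score their own plans), which would need more careful handling than this sketch provides.

```python
# Illustrative sketch only: column names are assumptions, and the incomplete
# rating design (raters did not score their own plans) needs explicit handling.
import pandas as pd
import pingouin as pg

scores = pd.read_csv("phase1_scores.csv")  # hypothetical export of Supplementary Data 2
scores["Target"] = scores["Plan_Source_ID"].astype(str) + "_" + scores["Case_ID"].astype(str)

icc = pg.intraclass_corr(
    data=scores,
    targets="Target",        # each plan-by-case combination is one rated target
    raters="Evaluator_ID",
    ratings="Score",
    nan_policy="omit",       # drops targets with missing ratings; a crude workaround
)
print(icc.loc[icc["Type"] == "ICC2", ["ICC", "CI95%"]])
```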

To further dissect these findings, a linear mixed-effects model was employed, controlling for variability introduced by different cases and raters. The model confirmed that the source of the plan was a significant predictor of the score. Specifically, it showed that plans generated by the o3 model were scored significantly lower than those from human experts (p = 0.012), even after accounting for other variables.
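A model of this form can be fitted in statsmodels by expressing the crossed case and rater intercepts as variance components within a single dummy group; the sketch below is one way to write it and is not the authors' exact code.

```python
# Illustrative sketch only: crossed random intercepts for case and rater are
# expressed as variance components within a single dummy group.
import pandas as pd
import statsmodels.formula.api as smf

scores = pd.read_csv("phase1_scores.csv")  # hypothetical export of Supplementary Data 2
scores["one_group"] = 1  # single grouping level so both random effects enter via vc_formula

model = smf.mixedlm(
    "Score ~ C(Plan_Source_Type, Treatment(reference='Human'))",
    data=scores,
    groups="one_group",
    re_formula="0",  # no extra random intercept for the dummy group
    vc_formula={
        "Case_ID": "0 + C(Case_ID)",
        "Evaluator_ID": "0 + C(Evaluator_ID)",
    },
)
result = model.fit(reml=True)
print(result.summary())
```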

Phase 2: AI judge favours AI-generated plans

The evaluation conducted by the Gemini 2.5 Pro model produced a substantial reversal of the Phase 1 findings. The AI judge tended to score AI-generated plans higher than human-written plans (Wilcoxon signed-rank p = 0.0313 across five cases), with considerable overlap in score distributions. The mean score for AI-generated plans was 7.75 (SD 1.67), compared to 6.79 (SD 1.74) for human-generated plans.

This reversal was most evident in the participant rankings. The o3 model, which was ranked 11th by humans, was elevated to 1st place by the AI judge with a mean score of 8.20 (SD 1.77). GPT-4o was ranked 2nd with a mean score of 7.30 (SD 1.51). Consequently, all ten human experts were ranked below the two AI models. The full ranking from the Gemini evaluation is shown in Table 2.

Table 2.

Full summary of LLM judge evaluation (Phase 2).

Rank  Plan source  Plan source ID  Mean  SD  Count
1 AI_o3 uid_o3_Y6Z5 8.2 1.772 5
2 AI_GPT4o uid_gpt4_W1 × 2 7.3 1.512 5
3 Human uid_saH3vN63 7.22 1.482 5
4 Human uid_sp42qT88 7.16 1.099 5
5 Human uid_ms91Dw04 7.12 1.489 5
6 Human uid_agH7mQ29 7 1.938 5
7 Human uid_ac9ZkM54 6.9 1.821 5
8 Human uid_adB7pL21 6.86 2.116 5
9 Human uid_ag65Re18 6.84 1.808 5
10 Human uid_sd52Px46 6.64 2.032 5
11 Human uid_ss84Ly70 6.2 1.859 5
12 Human uid_sb34UxQ9 6 2.155 5

Evidence of an evaluator effect

In summary, we observe an evaluator effect: Rankings differ systematically by evaluator type (human vs. AI judge), so the preferred treatment plans depended on the identity of the evaluator. A comparison of the shift in rankings for each participant between the two evaluation phases is presented in Table 3, with a visual summary in Fig. 1.

Table 3.

Comparison of rankings and scores across evaluation Phases.

Plan source  Rank (human eval)  Mean score (human eval)  Rank (Gemini eval)  Mean score (Gemini eval)  Rank change (human - Gemini)
AI_o3 11 6.974 1 8.2 +10
AI_GPT4o 6 7.383 2 7.3 +4
Human 1 8.172 11 6.2 -10
Human 2 7.838 8 6.86 -6
Human 3 7.812 6 7 -3
Human 4 7.713 5 7.12 -1
Human 5 7.428 9 6.84 -4
Human 7 7.289 7 6.9 0
Human 8 7.274 3 7.22 +5
Human 9 7.252 10 6.64 -1
Human 10 7.196 4 7.16 +6
Human 12 6.872 12 6 0
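The rank shifts in Table 3 follow directly from joining the two phase summaries; a minimal sketch, assuming the summaries are exported with hypothetical file and column names:

```python
# Illustrative sketch only: file and column names are assumptions.
import pandas as pd

phase1 = pd.read_csv("table1_human_eval.csv")   # columns assumed: Plan_Source_ID, Rank, Mean
phase2 = pd.read_csv("table2_gemini_eval.csv")  # columns assumed: Plan_Source_ID, Rank, Mean

comparison = phase1.merge(phase2, on="Plan_Source_ID", suffixes=("_human", "_gemini"))
# Positive values indicate a better (numerically lower) rank from the AI judge.
comparison["Rank_Change"] = comparison["Rank_human"] - comparison["Rank_gemini"]
print(comparison.sort_values("Rank_Change", ascending=False).to_string(index=False))
```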

Fig. 1.

Divergent rankings reveal evidence of the evaluator effect.

The complete, disaggregated scoring data for all 550 human evaluations and 60 AI judge evaluations are available as Supplementary Data 2.

Discussion

We observed an evaluator effect under blinded, rubric-based scoring: Clinician raters tended to award higher scores to clinician-authored plans, whereas the AI judge tended to award higher scores to AI-authored plans. Because plan structure, length and tone were normalized before scoring and authorship was blinded, this pattern is unlikely to be explained by surface presentation. Rather, it suggests that humans and AI apply partly different internal criteria for what constitutes a good clinical plan, which we treat as exploratory and hypothesis-generating, given the small sample size and the use of a single AI judge.

The direction of preferences is consistent with prior demonstrations that evaluators can favour outputs aligned with their own mode of reasoning or communication: Humans reduce reliance on algorithms after observing errors—even when algorithms remain, on average, more accurate19—and large language models can prefer model-generated text when judging text quality20. In our setting, these alignment preferences persisted despite stylistic normalization, which points toward cognitive rather than stylistic differences in evaluation. Clinician judgments plausibly privilege plans that cohere with experiential heuristics and local feasibility constraints, whereas an AI judge, operating over broad textual distributions, plausibly privileges statistical regularities and evidence patterns14,15. We do not claim this mechanism is conclusively established here; it is an interpretation to be tested with explicit manipulations (e.g., surfacing or suppressing local constraints) and with multiple, diverse AI judges in follow-up studies31–33.

Our observations have two practical implications. First, evaluation objectives for medical LLMs need to be rethought. Fixed question–answer tests are informative for knowledge retrieval, but they under-represent the construct of interest when the goal is to judge multi-objective, context-sensitive reasoning quality in treatment planning28,29. Calls for alternative evaluation regimes increasingly emphasize multi-metric, human-centered protocols that make scoring criteria explicit and clinically meaningful30. From the perspective of measurement theory, scores are only interpretable if the evaluation represents the intended construct and does not inadvertently privilege human-typical solution paths34–36, and modern machine learning (ML) cautions that comparable benchmark scores can mask different off-benchmark behaviour (underspecification)37. Methodologically, this argues for a multi-lens, context-aware evaluation: Evaluate multiple outcome dimensions (e.g., feasibility, safety flags, cost-awareness, patient-centeredness), analyze the structure of disagreement rather than collapsing to a single metric, and complement single-model judges with multi-model or ensemble judges. In concrete terms, federated, real-world benchmarking platforms (e.g., MedPerf) demonstrate how multi-site, privacy-preserving evaluation can align testing with clinical deployment across diverse contexts38. In addition, there is growing recognition that widely used medical exam-style benchmarks are fundamentally limited as surrogates for clinical reasoning and should not be over-interpreted for deployment decisions39. Holistic paradigms that combine multi-metric evaluation with richer task taxonomies (e.g., HELM) offer a principled direction for future healthcare-specific suites40.

Second, the observed divergence suggests that AI-generated plans may act as a useful counterpoint that helps clinicians reflect on—and, at times, reconsider—their own decision patterns, creating opportunities for constructive clinician–AI interaction. Human expertise contributes contextualization, feasibility judgments and ethical salience; AI contributes scale, internal consistency and coverage of diffuse evidence. Realizing this complementarity requires interactions that calibrate trust and make evaluative criteria legible to clinicians, and contestable where appropriate. Evidence indicates that clinician trust depends on perceived reliability, transparency and contextual fit41,42, and that explanation quality can either increase or decrease trust depending on design43,44. There is also a risk that human–AI feedback loops, if poorly designed, can amplify bias45, whereas interactionist approaches may promote mutual error correction46. Practically, interfaces should expose assumptions about availability, affordability and regulatory status; allow controllable weighting of criteria so recommendations align with patient and system constraints; provide contestable rationales clinicians can interrogate; and, encourage reflective checks such as “consider-the-opposite,” which have been shown to debias judgments47. This direction is consonant with the hybrid intelligence vision, in which complementary strengths are deliberately orchestrated rather than incidentally juxtaposed48, and motivates prospective studies comparing collaboration modes (advice-taking, co-authoring, or adjudication) with downstream clinical utility endpoints—for example, workflows in which clinicians and AI systems independently score treatment recommendations and any scoring discrepancies trigger a focused joint review.

Several limitations qualify our conclusions. The study used ten clinicians and five dermatology vignettes, which constrains power and generalizability. Vignettes, while necessary for blinding and control, cannot reproduce the full contextual richness of real clinical encounters; we also could not link plan ratings to patient outcomes, so our endpoint is perceived quality rather than clinical effectiveness. We used a single AI judge, raising the possibility of model-specific effects; multi-judge or ensemble judging would help assess robustness. Although we normalized plan style and tone to reduce detectability of origin, residual signatures may persist. A formal manipulation check of source detectability (for example, a pilot test of normalization effectiveness with raters), together with brief measures of rater attitudes toward AI (e.g., trust calibration, prior exposure), would help clarify remaining confounders and should be incorporated into future studies of this kind. Qualitative inquiry (e.g., think-aloud while scoring) could illuminate the cognitive processes that produce divergence, and expanding to additional specialties and health systems would test external validity.

In summary, under blinded, rubric-based scoring, clinicians and an AI judge applied systematically different evaluative logics to clinical treatment plans, yielding an evaluator effect that persisted beyond surface cues. The result is not a verdict on superiority but an empirical indication that judgment varies by evaluator identity. Developing evaluation practices—and practical decision-support interfaces—that acknowledge and productively harness these differences may offer a safer, more trusted and more context-sensitive path to clinical AI.

Methods

Study design and participants

This study was conducted as a comparative, observational analysis with two distinct evaluation phases. The research involved hypothetical clinical scenarios and did not include any real patient data or intervention; therefore, formal institutional review board approval was not sought. Ten board-certified dermatologists practising in India, with post-residency experience ranging from 1 to 15 years, were recruited via personal invitation. Participation was voluntary and uncompensated. All participants provided informed consent and were aware of the study’s purpose and design.

Clinical case scenarios

Five complex clinical vignettes were developed (Supplementary Note 1) by the study authors to represent common yet challenging treatment decisions in dermatology. The cases were designed to be open-ended, lacking a single correct answer, thereby requiring nuanced clinical judgment. The scenarios included: (1) a 52-year-old male with psoriasis and multiple comorbidities; (2) a 29-year-old female with PCOS-related acne intending to conceive; (3) a 34-year-old female with severe atopic dermatitis and recent tuberculosis exposure; (4) a frail, elderly woman with bullous pemphigoid and significant osteoporosis; and, (5) a 40-year-old woman with Fitzpatrick skin type V and melasma.

Treatment plan generation

The five case scenarios were presented to the ten human experts and the two AI models. The AI models were accessed via the OpenAI API on April 18, 2025; the models used were gpt-4o-latest and o3-2025-04-16. A single, direct prompt was used for all cases and models, and the exact prompts used for plan generation are provided in Supplementary Note 2.
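For illustration, a plan-generation call of this kind via the OpenAI Python SDK might look like the following sketch; the prompt text is a placeholder rather than the study prompt (see Supplementary Note 2), and only the model identifiers are taken from the text above.

```python
# Illustrative sketch only: the prompt shown is a placeholder, not the study prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

case_vignette = "..."  # one of the five clinical vignettes (Supplementary Note 1)

response = client.chat.completions.create(
    model="o3-2025-04-16",  # the reasoning model; the generalist model was gpt-4o-latest
    messages=[
        {
            "role": "user",
            "content": "Provide a treatment plan for the following dermatology case:\n"
            + case_vignette,
        }
    ],
)
print(response.choices[0].message.content)
```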

Response normalization

To mitigate biases from variations in writing style and length, all 60 responses underwent a blinded, two-step normalization process. First, each response was processed using a separate GPT-4 API call to standardize language and length while preserving core clinical information. This was followed by a manual author review to verify accuracy. Both the original, unedited versions and the final normalized versions of all treatment plans are available for comparison in Supplementary Data 1.

Evaluation protocol

The study proceeded in two evaluation phases using a detailed scoring rubric provided to all evaluators (full rubric available in the Supplementary Information). The rubric domains were adapted from widely used clinical assessment constructs (efficacy, safety, feasibility, patient-centredness) and were reviewed by both study authors for face validity prior to deployment.

  • Phase 1: Human Expert Evaluation. The 60 normalized responses were anonymized and presented in a randomized order to the 10 human experts via a custom web form; experts were blinded to the source and did not score their own submissions. Experts assigned a single overall quality score between 0.0 and 10.0 based on a holistic assessment of efficacy, safety, patient-centeredness, feasibility, and overall clinical appropriateness.

  • Phase 2: AI Judge Evaluation. Following human evaluation, all 60 responses were evaluated by the gemini-2.5-pro-preview-05-06 model, accessed on June 3, 2025. The AI judge was instructed to perform a blind assessment using the exact same holistic scoring instructions (Supplementary Note 2) provided to the human experts.
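For illustration, the Phase 2 judging call described above could be issued through the Google generative AI SDK as sketched below; the scoring instructions shown are a compressed placeholder for the full rubric in Supplementary Note 2.

```python
# Illustrative sketch only: the instructions are a placeholder for the full rubric.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
judge = genai.GenerativeModel("gemini-2.5-pro-preview-05-06")

normalized_plan = "..."  # one of the 60 normalized, anonymized treatment plans

prompt = (
    "You are evaluating an anonymized dermatology treatment plan.\n"
    "Assign a single overall quality score between 0.0 and 10.0, based on a "
    "holistic assessment of efficacy, safety, patient-centeredness, feasibility "
    "and overall clinical appropriateness. Return only the numeric score.\n\n"
    "Plan:\n" + normalized_plan
)

response = judge.generate_content(prompt)
print(response.text)  # parsed downstream into a numeric score
```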

Statistical analysis

All statistical analyses were performed using Python (v.3.9) with the pandas, scipy, pingouin and statsmodels libraries. Descriptive statistics (mean, standard deviation) were calculated for all scores. To assess consistency among human raters, the intraclass correlation coefficient (ICC) was calculated using a two-way random-effects model based on a single rater and absolute agreement (ICC2).

To determine the statistical significance of differences between scores for human-generated versus AI-generated plans, a two-sided Wilcoxon signed-rank test was used. This non-parametric test was chosen due to the small sample size of cases (n = 5). A linear mixed-effects model (LMM) was applied to the human evaluation data to control for variability from different cases and raters. In the LMM, Score was the dependent variable, Plan_Source_Type (Human, GPT-4o, o3) was the fixed effect, and Case_ID and Evaluator_ID were included as random effects. A p-value < 0.05 was considered statistically significant.

LLM usage in manuscript preparation

A large language model (Gemini by Google) was used to assist with data analysis, with Python code generation for statistical tests and figure creation, and with drafting and editing portions of the manuscript. All outputs were reviewed, verified and edited by the authors to ensure accuracy and to reflect the authors’ own voice and interpretation.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (9.7KB, xlsx)
Supplementary Material 2 (18.7KB, xlsx)
Supplementary Material 3 (216KB, docx)

Acknowledgements

We are grateful to the following dermatology experts for their invaluable time and clinical insights: Dr. Ananya Chandra, Dr. Aniruddha Ghosh, Dr. Aparajita Ghosh, Dr. Arunima Dhabal, Dr. Mahimanjan Saha, Dr. Shatanik Bhattacharya, Dr. Shreya Poddar, Dr. Sk Shahriar Ahmed, Dr. Souvik Sardar, Dr. Swastika Debbarma.

Author contributions

D.S.: Conceptualization, Methodology, Investigation, Data Curation, Formal Analysis, Writing – Original Draft. S.P.: Conceptualization, Investigation, Supervision, Writing – Review & Editing. All authors reviewed and approved the final manuscript.

Data availability

The complete, disaggregated scoring data generated and analyzed during this study are included as Supplementary Data. Further datasets are available from the corresponding author on reasonable request.

Code availability

The custom Python code used for the statistical analysis and figure generation in this study is available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Vrdoljak, J., Boban, Z., Vilović, M., Kumrić, M. & Božić, J. A review of large language models in medical education, clinical decision support, and healthcare administration. Healthcare 13 (6), 603 (2025). [DOI] [PMC free article] [PubMed]
  • 2.Khosravi, M., Zare, Z., Mojtabaeian, S. M. & Izadi, R. Artificial intelligence and decision-making in healthcare: a thematic analysis of a systematic review of reviews. Health Serv. Res. Managerial Epidemiol.11, 23333928241234863 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Harrer, S. Attention is not all you need: the complicated case of ethically using large language models in healthcare and medicine. EBioMedicine90, 104456 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Brennan, P. F. Patient satisfaction and normative decision theory. J. Am. Med. Inform. Assoc.2 (4), 250–259 (1995). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Brennan, M. et al. Multiobjective optimization challenges in perioperative anesthesia: A review. Surgery170 (1), 320–324. 10.1016/j.surg.2020.11.005 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Felli, J. C., Noel, R. A. & Cavazzoni, P. A. A multiattribute model for evaluating the benefit–risk profiles of treatment alternatives. Med. Decis. Making. 29 (1), 104–115 (2009). [DOI] [PubMed] [Google Scholar]
  • 7.Davis, P., Gribben, B., Scott, A. & Lay-Yee, R. The supply hypothesis and medical practice variation in primary care: testing economic and clinical models of inter-practitioner variation. Soc. Sci. Med.50 (3), 407–418 (2000). [DOI] [PubMed] [Google Scholar]
  • 8.Corallo, A. N. et al. A systematic review of medical practice variation in OECD countries. Health Policy. 114 (1), 5–14 (2014). [DOI] [PubMed] [Google Scholar]
  • 9.Sohn, S., Moon, S., Prokop, L. J., Montori, V. M. & Fan, J. W. A scoping review of medical practice variation research within the informatics literature. Int. J. Med. Informatics. 165, 104833 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Haselton, M. G. et al. Adaptive rationality: an evolutionary perspective on cognitive bias. Soc. Cogn.27 (5), 733–763 (2009). [Google Scholar]
  • 11.Saposnik, G., Redelmeier, D., Ruff, C. C. & Tobler, P. N. Cognitive biases associated with medical decisions: a systematic review. BMC Med. Inf. Decis. Mak.16 (1), 138 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Murawski, C. & Bossaerts, P. How humans solve complex problems: the case of the knapsack problem. Sci. Rep.6 (1), 34851 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Bossaerts, P. & Murawski, C. Computational complexity and human decision-making. Trends Cogn. Sci.21 (12), 917–929 (2017). [DOI] [PubMed] [Google Scholar]
  • 14.Tikhomirov, L. et al. Medical artificial intelligence for clinicians: the lost cognitive perspective. Lancet Digit. Health. 6 (8), e589–e594 (2024). [DOI] [PubMed] [Google Scholar]
  • 15.Selman, F., Obletz, K., Vismara, V., Putko, R. & Perry, N. P. J. Editorial commentary: Artificial intelligence and language learning models can be improved by curated input of medical training data but still face the limitations of available literature and require continued human oversight. Arthroscopy: The Journal of Arthroscopic & Related Surgery. 41 (8), 2758–2760 (2025). 10.1016/j.arthro.2025.03.042 [DOI] [PubMed]
  • 16.Macmillan-Scott, O. & Musolesi, M. (Ir)rationality and cognitive biases in large language models. Royal Soc. Open. Sci.11 (6), 240255 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell.2 (11), 665–673 (2020). [Google Scholar]
  • 18.Sacco, P. L. Biases, evolutionary mismatch and the comparative analysis of human versus artificial cognition: a comment on Macmillan-Scott and Musolesi (2024). Royal Soc. Open. Sci.12 (2), 241017 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Dietvorst, B. J., Simmons, J. P. & Massey, C. Algorithm aversion: people erroneously avoid algorithms after seeing them err. J. Exp. Psychol. Gen.144 (1), 114 (2015). [DOI] [PubMed] [Google Scholar]
  • 20.Laurito, W. et al. AI–AI bias: large language models favor communications generated by large language models. Proc. Natl. Acad. Sci.122 (31), e2415697122 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Olson, A. et al. The inseparability of context and clinical reasoning. J. Eval. Clin. Pract.30 (4), 533–538 (2024). [DOI] [PubMed] [Google Scholar]
  • 22.Doreswamy, N. & Horstmanshof, L. Attributes that influence human decision-making in complex health services: scoping review. JMIR Hum. Factors. 10, e46490 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Schuler, K. et al. Context factors in clinical decision-making: a scoping review. BMC Med. Inf. Decis. Mak.25 (1), 133 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Yazdani, S. & Hoseini Abardeh, M. Five decades of research and theorization on clinical reasoning: a critical review. Adv. Med. Educ. Pract.27, 703–716 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Duong, Q. H. et al. A scoping review of therapeutic reasoning process research. Adv. Health Sci. Educ.28 (4), 1289–1310 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Marewski, J. N. & Gigerenzer, G. Heuristic decision making in medicine. Dialog. Clin. Neurosci.14 (1), 77–89. 10.31887/DCNS.2012.14.1/jmarewski (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Whelehan, D. F., Conlon, K. C. & Ridgway, P. F. Medicine and heuristics: cognitive biases and medical decision-making. Ir. J. Med. Sci. (1971–). 189 (4), 1477–1484 (2020). [DOI] [PubMed] [Google Scholar]
  • 28.Budler, L. C. et al. A brief review on benchmarking for large language models evaluation in healthcare. Wiley Interdisciplinary Reviews: Data Min. Knowl. Discovery. 15 (2), e70010 (2025). [Google Scholar]
  • 29.McCoy, L. G. et al. Assessment of large language models in clinical reasoning: A novel benchmarking study. NEJM AI. 2 (10), AIdbp2500120 (2025). [Google Scholar]
  • 30.Tam, T. Y. et al. A framework for human evaluation of large language models in healthcare derived from literature review. NPJ Digit. Med.7 (1), 258 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Mondillo, G., Masino, M., Colosimo, S., Perrotta, A. & Frattolillo, V. Evaluating AI reasoning models in pediatric medicine: a comparative analysis of o3-mini and o3-mini-high. medRxiv [Preprint] (2025). 10.1101/2025.02.27.25323028
  • 32.Degany, O., Laros, S., Idan, D. & Einav, S. Evaluating the o1 reasoning large language model for cognitive bias: a vignette study. Crit. Care. 29 (1), 376 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Lin, Z. et al. Performance analysis of large language models ChatGPT-4o, OpenAI o1, and OpenAI o3 mini in clinical treatment of pneumonia: a comparative study. Clin. Experimental Med.25 (1), 213 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Cronbach, L. J. & Meehl, P. E. Construct validity in psychological tests. Psychol. Bull.52 (4), 281–302 (1955). [DOI] [PubMed] [Google Scholar]
  • 35.Messick, S. Validity. In Educational Measurement 3rd edn (ed. Linn, R. L.) 13–103 (American Council on Education; Macmillan, 1989). [Google Scholar]
  • 36.Kane, M. T. Validating the interpretations and uses of test scores. J. Educ. Meas.50 (1), 1–73 (2013). [Google Scholar]
  • 37.D’Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. J. Mach. Learn. Res.23 (226), 1–61 (2022). [Google Scholar]
  • 38.Karargyris, A. et al. Federated benchmarking of medical artificial intelligence with MedPerf. Nat. Mach. Intell.5 (7), 799–810 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Raji, I. D., Daneshjou, R. & Alsentzer, E. It’s time to bench the medical exam benchmark. NEJM AI. 2 (2), AIe2401235 (2025). [Google Scholar]
  • 40.Liang, P. et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022).
  • 41.Tun, H. M., Rahman, H. A., Naing, L. & Malik, O. A. Trust in artificial intelligence-based clinical decision support systems among health care workers: systematic review. J. Med. Internet. Res.27, e69678 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Tucci, V., Saary, J. & Doyle, T. E. Factors influencing trust in medical artificial intelligence for healthcare professionals: a narrative review. J. Med. Artif. Intell. 5, 4 (2022).
  • 43.Micocci, M. et al. Attitudes towards trusting artificial intelligence insights and factors to prevent the passive adherence of GPs: a pilot study. J. Clin. Med.10 (14), 3101 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Rosenbacke, R., Melhus, Å., McKee, M. & Stuckler, D. How explainable artificial intelligence can increase or decrease clinicians’ trust in AI applications in health care: systematic review. JMIR AI. 3, e53207 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Glickman, M. & Sharot, T. How human–AI feedback loops alter human perceptual, emotional and social judgements. Nat. Hum. Behav.9 (2), 345–359 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.von Felten, N. Beyond Isolation: Towards an Interactionist Perspective on Human Cognitive Bias and AI Bias. arXiv preprint arXiv:2504.18759 (2025).
  • 47.Ludolph, R. & Schulz, P. J. Debiasing health-related judgments and decision making: a systematic review. Med. Decis. Making. 38 (1), 3–13 (2018). [DOI] [PubMed] [Google Scholar]
  • 48.Dellermann, D., Ebel, P., Söllner, M. & Leimeister, J. M. Hybrid intelligence. Bus. Inform. Syst. Eng.61 (5), 637–643 (2019). [Google Scholar]
