Cell Reports Medicine. 2026 Jan 5;7(1):102547. doi: 10.1016/j.xcrm.2025.102547

Model confrontation and collaboration: A debate intelligence framework for enhancing medical reasoning in large language models

Xinti Sun 1,3, Qiyang Hong 1,3, Mengyan Zhang 1, Yuyan Li 1, Tingwei Chen 1, Zigeng Huang 1, Guihan Liang 2, Wenjun Tang 1, Sulin Xu 1, Xiaolin Ni 1, Junling Pang 1, Peixing Wan 2,∗, Erping Long 1,4,∗∗
PMCID: PMC12866169  PMID: 41494532

Summary

Medical reasoning is fundamental to clinical decision-making, underpinning tasks such as patient communication, diagnosis, and treatment planning. Inspired by psychological findings that peer interaction promotes self-correction, we introduce model confrontation and collaboration (MCC), a debate intelligence framework that transcends static ensemble methods: it integrates critique and self-reflection to iteratively refine reasoning through structured, multi-round confrontation and collaboration among diverse large language models (LLMs). On multiple-choice benchmarks, MCC achieved mean accuracies of 92.6% on MedQA and 84.8% on PubMedQA and demonstrated strong performance on the medical subsets of MMLU. In long-form medical question answering, MCC outperformed all individual LLMs and the domain-specific LLM Med-PaLM 2 in both physician and layperson evaluations. In diagnostic dialog tasks, MCC further excelled in both history-taking and diagnostic accuracy, reaching a top-1 diagnosis rate of 80%. These results position MCC as a scalable, model-agnostic framework that advances medical reasoning through collaborative deliberation.

Keywords: large language models, medical AI benchmarks, medical reasoning


Highlights

  • Introduces MCC, a debate intelligence framework that enhances medical reasoning

  • Enables confrontation and collaboration to drive self-correction and consensus

  • Outperforms individual and domain-specific LLMs across multiple medical tasks

  • Shows superior multi-dimensional performance under layperson-clinician assessments


Sun et al. introduce model confrontation and collaboration (MCC), a humanized intelligence framework that integrates advanced LLMs from multiple vendors into a dynamic, debate-driven roundtable. MCC achieves strong performance across diverse medical tasks, serving as a clinical decision-support tool that can potentially improve diagnostic accuracy and reduce clinician cognitive load.

Introduction

Medical reasoning is fundamental to decision-making in clinical practice.1 In recent years, large language models (LLMs) tailored for healthcare have achieved state-of-the-art (SOTA) performance on a range of medical benchmarks.2,3,4 However, these models typically rely on a single-LLM architecture, which inherently lacks mechanisms for external validation, cross-perspective critique, and self-correction5—key elements for ensuring reliability in complex, high-stakes medical scenarios.

One promising direction is the use of multi-LLM techniques. Clinically, multi-LLM debate mirrors both multidisciplinary team (MDT) discussions and collective diagnostic reasoning, where independent expert perspectives challenge and cross-validate each other to reduce diagnostic errors and enhance reliability—providing a direct clinical rationale for adopting multi-LLM reasoning. Multi-LLM approaches have recently been applied to biomedical tasks such as diagnosis of rare diseases6 and cell-type annotation.7 However, existing multi-LLM frameworks still face critical limitations that restrict their reasoning potential. First, the design of decision-making protocols remains an open question. Prior studies rely on fixed mechanisms such as consensus voting8,9 or stepwise reasoning10 to aggregate independent LLM outputs but lack the dynamic, debate-style interaction essential for robust reasoning. Second, existing systems often overlook the issue of algorithmic singularity. Each LLM reflects a unique combination of model architecture, training data, and learning algorithms,11 leading to differences in reasoning patterns across models; yet current frameworks rely on a single model family (e.g., the GPT series6), limiting epistemic diversity and opportunities for cross-validation. Third, while recent studies, including DDxTutor,12 with its structured differential diagnosis approach for clinical reasoning tutoring, and MAI-DxO,13 which employs virtual physician panels for sequential diagnosis, demonstrate LLMs' capabilities in specific medical tasks, multi-LLM reasoning for complex clinical scenarios remains underdeveloped. Key challenges include unverified performance in adversarial settings and open-ended diagnostic dialogs, undefined usage strategies, and a lack of scalable evaluation methods, which collectively hinder real-world clinical implementation.

In this study, we introduce model confrontation and collaboration (MCC), a humanized debate intelligence framework that integrates advanced reasoning LLMs from multiple vendors into a dynamic, debate-driven roundtable. We evaluate MCC across classical medical multiple-choice benchmarks, expert-level reasoning assessments, and adversarial testing scenarios. We further extend its application to long-form question answering and diagnostic dialogs in a zero-shot setting, with the goal of enabling generalizable and transparent clinical decision support. To support clinically oriented evaluation, we incorporate physician-layperson ratings, rubric-based evaluation, and Objective Structured Clinical Examination (OSCE)-style diagnostic simulations. Overall, MCC could be explored as a decision-support tool for synthesizing clinical evidence and presenting reasoned perspectives, potentially reducing cognitive load and improving diagnostic accuracy in complex cases.

Results

Study overview

The MCC workflow comprises three components: (1) initial response generation, in which integrated LLMs (GPT-o1,14 Qwen-QwQ,15 and DeepSeek-R116) independently generate responses; (2) adversarial activation, which is triggered only when responses diverge, thereby initiating multi-round debates; and (3) consensus optimization, in which the final output is selected by majority voting if consensus is not reached within three rounds (Figure 1A). We integrated multiple benchmarks, including multiple-choice questions and long-form question answering, to evaluate the effectiveness of MCC (Figure 1A). Furthermore, we applied MCC to diagnostic dialogs modeled on the OSCE framework to assess its performance in history-taking and diagnostic accuracy (Figure 1B).
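
In code, this control flow reduces to a short loop. The following is a minimal sketch in which the model wrappers and the per-round debate function are illustrative placeholders, not code from our repository (see the key resources table for the actual implementation):

```python
from collections import Counter

MAX_ROUNDS = 3  # debate rounds allowed before falling back to majority voting

def mcc_answer(question, models, debate_round):
    """Minimal MCC control flow for one multiple-choice question.

    `models` maps a model name to a callable that returns an option letter;
    `debate_round` runs one critique-and-revision round over the current
    answers. Both are illustrative placeholders for the actual API wrappers.
    """
    # (1) Initial response generation: each LLM answers independently.
    answers = {name: ask(question) for name, ask in models.items()}

    # (2) Adversarial activation: debate is triggered only on divergence.
    for _ in range(MAX_ROUNDS):
        if len(set(answers.values())) == 1:        # consensus reached
            return next(iter(answers.values()))
        answers = debate_round(question, answers)  # critique and revision

    # (3) Consensus optimization: majority vote after three rounds.
    return Counter(answers.values()).most_common(1)[0][0]
```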

Figure 1. The MCC framework

(A) We evaluated the MCC framework using two input format benchmarks: multiple-choice questions and long-form medical questions. We present the detailed dataset composition underlying the benchmarks. The execution of MCC involves three sequential stages: (1) initial response generation, where three LLMs—GPT-o1, Qwen-QwQ, and DeepSeek-R1—independently generate responses; (2) adversarial activation, which is triggered only when the initial responses diverge, thereby initiating multi-round debates with critique and revision; and (3) consensus optimization, where the process terminates if consensus is achieved, or otherwise proceeds to a majority voting after a maximum of three rounds.

(B) Implementation process of the diagnostic dialog task: (left) the history-taking phase; (right) the disease diagnosis phase.

MCC achieved SOTA performance on classical multiple-choice benchmarks

We first tested the performance of MCC on three classical multiple-choice benchmarks (MedQA,17 PubMedQA,18 and MMLU19; Table S1) under the zero-shot setting. To account for the inherent stochasticity of LLMs, we report results from 10 independent runs per benchmark as mean accuracy with 95% confidence intervals (CIs), moving beyond the single-run point estimates common in prior works.2,3,4 On the MedQA benchmark of 1,273 questions, MCC achieved SOTA accuracy of 92.6% ± 0.3%, outperforming the previous SOTA of 91.1% by Med-Gemini3 and individual reasoning LLMs (GPT-o1, 84.8% ± 0.4%; Qwen-QwQ, 81.8% ± 0.5%; and DeepSeek-R1, 89.7% ± 0.4%; all p < 0.001), as well as the medical LLM Med-PaLM 22 (86.5%) (Table 1). Detailed records of the LLMs' stance shifts throughout the entire process are provided in Data S1.

At the initial stage of MCC, consensus was reached among the three models on 1,019 questions, of which 938 (73.7%) were answered correctly and 81 (6.4%) incorrectly (Figure 2A). Among the 254 non-consensus cases, DeepSeek-R1 achieved the highest initial accuracy with 204 correct responses (80.3%), followed by GPT-o1 with 146 (57.5%) and Qwen-QwQ with 105 (41.3%) (Figure 2B). Notably, relying solely on majority voting over the initial LLM responses would have answered only 201 of these cases (79.1%) correctly (Figure 2B). In the first debate round, the models reached a correct consensus on 104 questions and an incorrect consensus on 6. In the second round, 138 more correct consensuses were achieved, along with 8 incorrect ones. After the third round, 89 questions still remained without consensus; final answers for these were determined via majority voting. Ultimately, MCC provided correct answers for 241 (94.9%) of the 254 initially unresolved cases (Figure 2B).

We further investigated the view flow within individual LLMs. Across the three debate rounds, GPT-o1 corrected 82.4% of its initial errors (from 108 to 19), Qwen-QwQ corrected 55.7% (from 149 to 66), and DeepSeek-R1 corrected 30% (from 50 to 35) (Figure S1). A representative case of medication management for a pregnant patient with bipolar disorder illustrates this dynamic: only GPT-o1 selected the correct answer at the initial response stage, but both Qwen-QwQ and DeepSeek-R1 were persuaded to correct their choices during the debate rounds (Figure 2C; full debates in Data S2).

Of the 13 cases in which at least one LLM initially answered correctly yet MCC reached an erroneous consensus (Data S3), we identified two failure types. In "persuasion-induced errors" (9 cases), a correctly inclined LLM was swayed by two incorrect peers (often in the first debate round) using phrases such as "upon reevaluation…" or "deeper analysis reveals…", suggesting susceptibility to social pressure or over-compliance with majority views. In "suppression-induced errors" (4 cases), the correct model maintained its position with statements such as "I still maintain…" but failed to persuade the incorrect majority, leading to a wrong final vote and indicating possible entrenchment or mutual reinforcement of erroneous views.
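
The view-flow statistics above (Figure S1) can be tallied directly from the per-round debate logs in Data S1. The following is a minimal sketch with illustrative argument names, not code from our pipeline:

```python
def tally_transitions(initial, final, gold):
    """Tally one model's answer transitions across the debate (cf. Figure S1).

    `initial` and `final` map question ids to the model's first and last
    answers; `gold` maps ids to the correct option. Names are illustrative.
    """
    flips = {"incorrect_to_correct": 0, "correct_to_incorrect": 0,
             "stayed_correct": 0, "stayed_incorrect": 0}
    for qid, truth in gold.items():
        before, after = initial[qid] == truth, final[qid] == truth
        if after and not before:
            flips["incorrect_to_correct"] += 1
        elif before and not after:
            flips["correct_to_incorrect"] += 1
        elif before:
            flips["stayed_correct"] += 1
        else:
            flips["stayed_incorrect"] += 1
    return flips

# Toy example: the model fixes q2 during debate and keeps its correct q1 answer.
print(tally_transitions({"q1": "A", "q2": "C"}, {"q1": "A", "q2": "B"},
                        {"q1": "A", "q2": "B"}))
```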

Table 1.

Comparative accuracy of MCC and leading LLMs on classical multiple-choice benchmarks

Dataset Previous SOTA Med-PaLM 2 GPT-o1 Qwen-QwQ DeepSeek-R1 MCC
MedQA (USMLE) 91.1 (Med-Gemini) 86.5 84.8 ± 0.4b 81.8 ± 0.5b 89.7 ± 0.4b 92.6 ± 0.3
PubMedQA 83.9 (MDTeamGPT) 81.8 74.5 ± 0.4b 81.8 ± 0.7b 80.9 ± 0.6b 84.8 ± 0.4
MMLU Clinical Knowledge 95.8 (Medprompt) 88.7 85.7 ± 0.7b 90.3 ± 0.6b 92.5 ± 0.9b 94.0 ± 0.4
MMLU Medical Genetics 98.0 (Medprompt) 92.0 98.5 ± 0.4b 97.8 ± 0.5b 98.8 ± 0.6a 99.7 ± 0.3
MMLU Anatomy 89.6 (Medprompt) 84.4 83.8 ± 1.4b 82.3 ± 0.7b 91.0 ± 1.0a 92.8 ± 0.3
MMLU Professional Medicine 95.2 (Med-PaLM 2) 95.2 92.4 ± 0.6b 93.2 ± 0.6b 95.1 ± 0.5b 97.1 ± 0.4
MMLU College Biology 97.9 (Medprompt) 95.8 95.2 ± 0.8b 96.2 ± 0.8b 98.5 ± 0.4a 99.0 ± 0.3
MMLU College Medicine 89.0 (Medprompt) 83.2 82.0 ± 1.3b 85.3 ± 1.0b 88.3 ± 0.9b 91.0 ± 0.4

Previous SOTA values and Med-PaLM 2 are reported as point estimates from prior work; results for GPT-o1, Qwen-QwQ, DeepSeek-R1, and MCC are reported as mean ±95% CI across 10 independent runs (t-distribution, df = 9).

Backbone LLMs and providers: Med-Gemini (Gemini; Google), MDTeamGPT (multi-agent method applied to gpt-4-turbo; OpenAI), Medprompt (prompting method applied to GPT-4; OpenAI), Med-PaLM 2 (PaLM 2; Google), GPT-o1 (o1-mini; OpenAI), Qwen-QwQ (QwQ-32B; Alibaba), and DeepSeek-R1 (DeepSeek-R1-671B; DeepSeek AI). MCC integrates o1-mini, QwQ-32B, and DeepSeek-R1-671B.

a: p < 0.01.

b: p < 0.001 (two-sided paired t test vs. MCC across runs).

Figure 2. MCC performance and decision dynamics on the MedQA benchmark

(A) The upper pie chart shows the distribution of LLMs’ agreement at the initial response stage. Green indicates questions where all three LLMs answered correctly, yellow indicates questions where all three were incorrect, and gray denotes questions with disagreement among models—these 254 cases triggered the MCC debate process. The three smaller pie charts display each model’s individual accuracy in cases with disagreement.

(B) Changes in the number of correct responses across three debate rounds (R1–R3). At each round, a subset of cases reaches correct consensus (green) or incorrect consensus (yellow), while the rest proceed to the next round. The 89 cases remaining after R3 are resolved via majority voting. The lower two pie charts summarize the final performance of MCC and of direct majority voting without MCC.

(C) A simplified illustration of how MCC enables correction through structured debate (with portions of the dialog abbreviated for brevity; see Data S2 for the full transcript).

On the PubMedQA benchmark of 500 questions, MCC achieved SOTA accuracy of 84.8% ± 0.4%, outperforming the previous SOTA of 83.9% by MDTeamGPT9 and individual reasoning LLMs (GPT-o1, 74.5% ± 0.4%; Qwen-QwQ, 81.8% ± 0.7%; and DeepSeek-R1, 80.9% ± 0.6%), as well as a medical LLM (Med-PaLM 2, 81.8%) (Table 1). Compared to simple majority voting on initial responses (80.0%, 72/90), MCC improved the correction rate of non-consensus cases by 15.6 percentage points, reaching 95.6% (86/90) (Figures S2A and S2B; Data S4). A representative PubMedQA case showed how MCC debate enabled the LLMs to resolve semantic ambiguity and reach a correct consensus on an osteoarthritis pain question (Figure S2C; full debates in Data S5).

On the MMLU benchmarks, MCC achieved SOTA performance on five of six clinical topics, with accuracies from 91.0% to 99.7%, consistently outperforming Med-PaLM 2 and individual LLMs (Table 1; Data S6, S7, S8, S9, S10, and S11). In tasks such as clinical knowledge, anatomy, and professional medicine, MCC resolved 100% of disagreement cases or outperformed simple majority voting by up to 30%. It also exceeded Medprompt4 in medical genetics, college biology, and college medicine, demonstrating robust and generalizable reasoning across diverse medical topics.

We further performed a computational cost analysis, as described in the STAR Methods, to compare MCC with majority voting and single-LLM baselines. As shown in Table S2, compared with majority voting, MCC required +6.9 h in total time (23.7 h vs. 16.8 h) and +19.5 s per case on average (67.0 s vs. 47.5 s), largely dominated by the slowest constituent model. In tokens, MCC consumed +6,127,963 additional tokens in total (16,379,935 vs. 10,251,972), i.e., +4,814 tokens per case on average (12,867 vs. 8,053). In monetary cost, MCC incurred +$6.5 in total ($25.5 vs. $19.0) and +$0.005 per case on average ($0.020 vs. $0.015). Although these overheads, mainly resulting from multi-round debates and dependent on consensus efficiency, increase computational demands, the substantial accuracy gains achieved by MCC justify its adoption in high-stakes medical applications.

Together, these results demonstrate how MCC fosters positive mutual influence among LLMs through multi-round debates, ultimately surpassing previous SOTA and individual models across three multiple-choice benchmarks.

Configuration experiments

We conducted configuration experiments to examine how the choice of backbone LLMs and the number of models influence the performance of the MCC framework. All experiments were performed on the MedQA benchmark, owing to its large sample size and representativeness. We first evaluated three two-model configurations: GPT-o1 + Qwen-QwQ, GPT-o1 + DeepSeek-R1, and Qwen-QwQ + DeepSeek-R1. All pairwise combinations produced tie cases, defined as instances where the debate failed to reach consensus after three rounds. To estimate accuracy conservatively, tie cases in which either model selected the correct answer were counted with a weight of 0.5 toward the correct cases, as sketched below. The results showed that overall accuracy in two-model MCC frameworks declined relative to the full MCC setting, in some cases even dropping below the performance of individual LLMs, primarily because ineffective debates ended in ties (Table S3). Next, we assessed three-replica configurations (three GPT-o1, three Qwen-QwQ, or three DeepSeek-R1), i.e., homogeneous ensembles created by replicating a single LLM three times. These also showed limited improvement, only slightly surpassing individual-model performance, indicating little complementarity among instances of the same model (Table S3). Finally, we evaluated the effect of scaling the ensemble by introducing a newly released reasoning model, Qwen3-30B-A3B-Thinking-2507 (hereafter referred to as Qwen3), forming a four-model configuration (GPT-o1 + Qwen-QwQ + DeepSeek-R1 + Qwen3) that boosted accuracy to 91.1%, albeit at increased computational cost.
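
The conservative tie weighting described above amounts to a one-line computation; the following sketch uses illustrative counts rather than the Table S3 values:

```python
def two_model_accuracy(consensus_correct, ties_any_correct, total):
    """Conservative accuracy for a two-model MCC configuration.

    Tie cases (no consensus after three rounds) in which either model chose
    the correct answer contribute a weight of 0.5, per the rule above.
    """
    return (consensus_correct + 0.5 * ties_any_correct) / total

# Illustrative counts only (not the Table S3 values).
print(two_model_accuracy(consensus_correct=1080, ties_any_correct=60, total=1273))
```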

MCC generalized effectively to advanced reasoning and adversarial benchmarks

We further introduced MedXpertQA,21 a benchmark designed to evaluate expert-level medical knowledge comprehension and advanced reasoning. On this benchmark, MCC achieved an accuracy of 40.2% on the 1,861 reasoning questions and 40.1% on the 589 understanding questions, outperforming all individual LLM baselines (GPT-o1, Qwen-QwQ, and DeepSeek-R1). MCC also exceeded the results reported in the original benchmark study for leading models such as GPT-4o (reasoning: 30.6%; understanding: 29.5%) and LLaMA-3.3-70B (reasoning: 23.9%; understanding: 26.5%) (Table S4).

We next extended our evaluation to the adversarial benchmark MetaMedQA,22 a modified extension of the MedQA-USMLE dataset specifically designed to evaluate the metacognitive capabilities of LLMs in medical question answering (Table S5). MCC achieved an overall accuracy of 84.1%, outperforming the individual LLMs (GPT-o1: 76.1%; Qwen-QwQ: 72.6%; DeepSeek-R1: 81.7%) and the best-performing model reported in the original study (GPT-4o: 78.8%). On the missing-answer recall metric, MCC also reached the best performance of 75.7% across all LLMs. For the unknown recall metric, MCC achieved 43.2%, exceeding all LLMs (GPT-o1: 32.7%; Qwen-QwQ: 10.5%; DeepSeek-R1: 37.0%) except GPT-4o (51.8%).

On the RABBITS (Robust Assessment of Biomedical Benchmarks Involving Drug Term Substitutions) benchmark,23 designed to assess the robustness of LLMs in handling drug name variability (brand vs. generic) in medical question answering, MCC again demonstrated robustness (Table S6). In the MedQA subsets, accuracy declined modestly for GPT-o1 (from 90.2% to 87.8%, −2.4%), Qwen-QwQ (from 88.6% to 86.8%, −1.8%), and DeepSeek-R1 (from 91.8% to 90.7%, −1.1%), while MCC showed minimal impact (from 94.2% to 93.7%, −0.5%). In the MedMCQA subsets, GPT-o1 (from 89.7% to 86.8%, −2.9%) and Qwen-QwQ (from 89.1% to 85.9%, −3.2%) exhibited declines, while DeepSeek-R1 (from 91.1% to 89.7%, −1.4%) showed a moderate drop and MCC (from 93.4% to 92.8%, −0.6%) remained comparatively stable. Notably, across all four dataset variants, MCC consistently achieved the highest accuracy, underscoring its superior robustness and stability relative to single-LLM baselines.

MCC outperformed individual LLMs on long-form medical questions

We further evaluated MCC’s performance on long-form medical question answering tasks using the MultiMedQA 140 benchmark2,24 (Data S12). Following the evaluation protocol,2 we compared MCC’s outputs with those of Med-PaLM 2, GPT-o1, Qwen-QwQ, and DeepSeek-R1. We first recruited 10 physician raters to assess responses across 12 independent rating dimensions (Table S7). MCC consistently outperformed all individual LLMs across all 12 physician-rated criteria (Figure 3A), achieving notable gains in correct retrieval (11.8%, p < 0.001), correct reasoning (8.8%, p < 0.001), and possibility of bias reduction (11.1%, p < 0.001). Across these 12 dimensions, MCC achieved 1.6%–11.2% lower flawed-content rates, with statistically significant improvements in 10 (permutation p < 0.05), most prominently in comprehension (7.6%, p < 0.0001) and recall (11.2%, p < 0.0001) (Figure 3B).

Figure 3. Multi-dimensional assessment of long-form medical question outputs

(A) Independent multi-dimensional evaluation of long-form responses by physician raters (n = 10). Each value represents the average score assigned by physician raters to responses from five LLMs (Med-PaLM 2, GPT-o1, Qwen-QwQ, DeepSeek-R1, and MCC) across 12 clinical quality dimensions, visualized as a radar plot.

(B) Rows represent the 12 physician-rated evaluation dimensions, and columns represent different language models. Each cell shows the average defect score, with lower values indicating better performance. Asterisks next to the dimension labels denote levels of statistical significance relative to other models (∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001; ns, not significant; permutation test).

(C) Ranking-based evaluation of long-form answers by a mixed panel of physician raters (n = 10).

(D) Independent evaluation of long-form responses by layperson raters (n = 10). Bar plots show the percentage distribution of ratings across three categories ("directly," "indirectly," and "not") for two aspects of model responses: user helpfulness (top) and alignment with question intent (bottom). The x axis indicates the percentage of responses in each category, and the y axis lists the evaluated models.

We then extended the physician evaluation to a blinded comparative-ranking assessment using nine predefined quality metrics (Table S8). MCC significantly reduced answer defect rates by 3.2%–9.1% across these metrics, with statistically significant gains (p < 0.05) in seven dimensions. The largest improvements were observed in reading comprehension (9.1%, p < 0.0001), knowledge recall (7.5%, p < 0.0001), consensus alignment (6.3%, p = 0.0006), and demographic bias reduction (6.4%, p = 0.0010) (Figure 3C).

Complementing the physician assessment, we engaged 10 layperson raters to evaluate model responses in terms of helpfulness and relevance (Table S9). MCC’s responses were more frequently rated as directly helpful and better aligned with user intent than those of all individual LLMs (Figure 3D).

A representative example illustrates the comparative outputs (Data S13). In response to the question “What is losing balance a symptom of?”, GPT-o1 omitted differential diagnoses such as acoustic neuroma and safety concerns related to vestibular suppressants, Qwen-QwQ mentioned medication side effects but did not address spinal or environmental causes, and DeepSeek-R1 included rarer etiologies but overlooked metabolic, psychiatric, and age-related factors. The MCC output integrated relevant elements from all baseline responses, covering neurological, otolaryngological, metabolic, psychiatric, cardiovascular, and geriatric considerations. It additionally included diagnostic suggestions, risk stratification, and environmental factors, with individual knowledge points explicitly attributed to their source models.

Furthermore, given the small sample size of MultiMedQA 140 and the unavoidable subjectivity of manual evaluation, we additionally evaluated MCC on the HealthBench benchmark.25 On the HealthBench Consensus subset (n = 3,671), MCC achieved a score of 0.921, outperforming the baseline LLMs (GPT-o1: 0.873; Qwen-QwQ: 0.866; DeepSeek-R1: 0.887) (Table S10). In a representative emergency case, MCC satisfied both evaluation criteria by providing an immediate emergency referral while converting follow-up questions into a concise clinical checklist (full records in Data S14). On the more challenging HealthBench Hard subset, MCC achieved a score of 0.304, substantially exceeding the individual models (GPT-o1: 0.116; Qwen-QwQ: 0.175; DeepSeek-R1: 0.213) (Table S10). In a complex German-language query about stem-cell therapy, GPT-o1 scored −0.041, Qwen-QwQ scored 0.061, and DeepSeek-R1 scored 0.265. The single-LLM outputs correctly rejected absolute guarantees but exhibited notable inconsistencies across the 10 rubric criteria, such as omitted regulatory information, insufficient enumeration of concrete risks, and a lack of evidence-based alternatives. Under adversarial critique and iterative self-revision, MCC consistently pushed the response toward a safer and more complete structure, ultimately improving the rubric score to 0.612 (full records in Data S15).
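
HealthBench grades each response against a physician-written rubric whose criteria carry signed point values. As we read the published scoring scheme, the per-example score divides earned points by the achievable positive points, which is how a single-model response can fall below zero, as in the stem-cell example above. The sketch below reflects that reading with a toy rubric; it is not the official grading code (which relies on a model-based judge):

```python
def rubric_score(criteria_met, rubric):
    """Grade one response against a HealthBench-style rubric (sketch).

    `rubric` maps criterion ids to signed points; `criteria_met` is the set
    of criteria a judge deems satisfied. Dividing by the achievable positive
    points lets violated negative criteria push a score below zero.
    """
    earned = sum(pts for cid, pts in rubric.items() if cid in criteria_met)
    achievable = sum(pts for pts in rubric.values() if pts > 0)
    return earned / achievable

# Toy rubric: two positive criteria and one penalty criterion (all hypothetical).
rubric = {"urges_emergency_referral": 5, "lists_concrete_risks": 3,
          "guarantees_a_cure": -4}
print(rubric_score({"urges_emergency_referral", "guarantees_a_cure"}, rubric))  # 0.125
```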

Collectively, these results highlight MCC’s advantage in enabling dynamic, multi-LLM refinement to achieve safer, more comprehensive, and clinically rigorous long-form responses for both expert and general audiences.

MCC enhances information capture and diagnostic reasoning in diagnostic dialog tasks

Extending beyond long-form medical questions, we next evaluated MCC in diagnostic dialog tasks, where LLMs must actively capture, interpret, and reason over clinical information in real time. We conducted a structured OSCE-style evaluation using 16 simulated patient scenarios covering a broad range of clinical contexts, evaluating MCC performance in two stages: history-taking and disease diagnosis (STAR Methods).

For the history-taking stage, we evaluated two metrics: patient information capture rate (PICR) and patient-relevant question rate (PRQR). For PICR, MCC achieved scores above 80% in 14 of 16 cases. In contrast, GPT-o1 did not exceed the 80% threshold in any case, while Qwen-QwQ and DeepSeek-R1 reached this level in only 4 and 8 cases, respectively (Figure 4A). Formal paired analyses indicated that MCC’s PICR was significantly higher than that of all baseline LLMs, with median paired improvements of +0.29, +0.15, and +0.11 compared to GPT-o1, Qwen-QwQ, and DeepSeek-R1, respectively (all Bonferroni-adjusted p < 0.01). For PRQR, MCC achieved values above 80% in 14 cases, whereas GPT-o1, Qwen-QwQ, and DeepSeek-R1 exceeded this threshold in 10, 13, and 14 cases, respectively (Figure 4A). MCC significantly outperformed GPT-o1 (+0.08; Bonferroni-adjusted p < 0.001), while the improvements over Qwen-QwQ and DeepSeek-R1 were smaller and not statistically significant.

Figure 4. Performance of MCC on diagnostic dialogue tasks

(A) Evaluation metrics across 16 role-play samples. PICR denotes patient information capture rate. The gray category (“not applicable”) indicates cases in which the corresponding role package did not specify a correct answer for either the primary or supporting diagnosis.

(B) Output record for a representative case (with portions of the dialog abbreviated for brevity; see Data S16 for the full transcript).

In the diagnostic stage, we assessed top-1 primary diagnosis accuracy (PDA) and supporting differential diagnosis accuracy (SDA). Among the 15 cases with a clearly defined primary diagnosis, MCC achieved a PDA of 80.0% (12/15), outperforming GPT-o1 and DeepSeek-R1 (both 66.7%, 10/15) and Qwen-QwQ (60.0%, 9/15) (Figure 4A). However, exact McNemar tests did not reach significance (p = 0.50, 0.25, and 0.50 vs. GPT-o1, Qwen-QwQ, and DeepSeek-R1, respectively), as only 2–3 discordant pairs were observed per comparison, indicating limited statistical power. For SDA, MCC correctly identified the full set of supporting differentials in 10 of 15 cases, compared to 6 for DeepSeek-R1, 5 for GPT-o1, and 3 for Qwen-QwQ (Figure 4A). Paired analysis confirmed that MCC significantly outperformed GPT-o1 (+0.25; Bonferroni-adjusted p = 0.01) and Qwen-QwQ (+0.33; Bonferroni-adjusted p = 0.02), while differences vs. DeepSeek-R1 were not significant.

A representative case involving a patient with weight loss and metabolic symptoms is shown in Figure 4B. The case begins with a 56-year-old woman reporting 6 weeks of excessive thirst, frequent urination (including 3–4 nighttime awakenings), fatigue, and persistent hyperglycemia despite diet control and newly prescribed glimepiride; two weeks earlier, she had been diagnosed with type 2 diabetes. Each LLM independently generated initial diagnostic questions, after which MCC initiated a debate process: models critiqued and revised each other's outputs, culminating in a consensus-driven question set synthesized by a moderator model. A key turning point came when DeepSeek-R1 uniquely proposed asking about pancreatic history, which became the critical 22nd question. The patient's response revealed significant unintended weight loss (12 kg over 6 months) and persistent epigastric pain radiating to the back. Qwen-QwQ was the first to suggest secondary diabetes due to pancreatic disease; through iterative debate, both GPT-o1 and DeepSeek-R1 revised their assessments and reached the same diagnosis. The final consensus correctly identified pancreatic cancer as the underlying cause. The full dialog of this case is presented in Data S16.

Overall, the results indicate that MCC outperformed individual LLMs in both clinical information capture and diagnostic accuracy, suggesting that collaborative reasoning shows beneficial potential in simulated clinical settings. Notably, MCC showed clear advantages in history-taking (PICR and PRQR vs. GPT-o1) and supporting differential diagnosis (vs. GPT-o1 and Qwen-QwQ), whereas improvements in primary diagnosis were directionally favorable but underpowered due to the limited sample size (n = 16), precluding statistical significance.

Discussion

In this study, we systematically demonstrated the utility of MCC in three medical reasoning tasks: multiple-choice questions, long-form medical questions, and diagnostic dialogs. MCC achieved SOTA performance on three multiple-choice benchmarks: MedQA, PubMedQA, and the medical subsets of MMLU. For open-ended long-form medical questions, MCC outperformed individual LLMs (GPT-o1, DeepSeek-R1, and Qwen-QwQ) and the domain-specific LLM Med-PaLM 2 across multiple evaluation dimensions rated by physicians and laypersons. Importantly, in diagnostic dialogs, MCC exhibited higher history-taking ability and diagnostic accuracy than individual LLMs.

In MCC debates on multiple-choice questions, an interesting observation in the MedQA benchmark is that most choice transitions run from incorrect to correct, while reverse transitions—from correct to incorrect—are rare. We interpret this asymmetry from two perspectives. First, although reasoning-capable LLMs possess the capacity to self-correct, their initial responses often reflect a form of prompt-induced cognitive rigidity—known as the "Prefix Dominance Trap"5—in which models fixate on superficial cues in the question and fail to activate the optimal reasoning trajectory. MCC mitigates this by exposing models to structured counterarguments and alternative reasoning paths, often grounded in authoritative evidence, which serve to reorient flawed initial reasoning. As a result, it is common for models to move from incorrect to correct answers when presented with credible peer input. In contrast, transitions from correct to incorrect are rare because a model that arrives at a correct answer typically does so via a robust, guideline-aligned reasoning path that is less susceptible to disruption by weaker or inconsistent peer critiques. This dynamic explains the directional asymmetry observed in MCC transitions. Second, correct initial answers exhibit higher robustness to perturbation, consistent with findings that well-grounded LLM reasoning resists noise when supported by coherent internal representations.26,27 MCC's consensus mechanism further reinforces this by requiring convergent evidence, similar to ensemble-based uncertainty calibration in medical image segmentation.28 This bidirectional optimization mechanism, which corrects initially incorrect answers while preserving correct ones, enables MCC's superior performance over simple voting schemes on contentious multiple-choice questions.

Prior studies have noted the presence of malformed and incomplete questions in the MedQA benchmark,22 consistent with our observations during evaluation. Across 10 repeated trials, MCC consistently encountered items for which all initial responses failed, suggesting a ceiling effect that likely reflects inherent limitations in the benchmark design—such as ambiguous, underspecified, or flawed items—rather than the true upper bound of model capability. This observation motivated us to evaluate MCC on more challenging adversarial benchmarks. In MetaMedQA, an adaptation of MedQA-USMLE designed to assess metacognitive capacity, MCC achieved a SOTA accuracy of 84.1%, with a missing-answer recall of 75.7%, markedly higher than single-LLM baselines, and an unknown-answer recall comparable to DeepSeek-R1, demonstrating a balanced metacognitive profile critical for clinical contexts. Extending to pharmacology, the RABBITS benchmark showed that MCC consistently outperformed single LLMs on both MedQA and MedMCQA subsets, maintaining higher accuracy and reduced performance loss under drug name perturbations—an essential feature for real-world medical applications. In the MedXpertQA benchmark, which probes expert-level medical knowledge and advanced reasoning, MCC again surpassed all baselines in both reasoning and comprehension dimensions, with scores exceeding 40% in both categories. Together, these findings establish MCC as a robust and generalizable framework for medical question answering, advancing beyond the limitations of traditional benchmarks toward clinically relevant performance.

Although multiple-choice question benchmarks are straightforward to evaluate, their closed-ended design constrains the assessment of broader clinical applications and provides limited insight into real-world clinical utility.29 Moreover, such benchmarks harbor inherent vulnerabilities, as LLMs may sometimes arrive at the correct answer through flawed reasoning pathways.30 We therefore extended our evaluation of MCC to open-ended tasks. In the long-form medical question task, we applied a dual-perspective evaluation framework2—incorporating both physician and layperson assessments—to systematically examine how well model outputs align with human expectations for high-quality medical answers. Across all evaluated dimensions, MCC outperformed not only individual LLMs (GPT-o1, Qwen-QwQ, and DeepSeek-R1) but also the medical LLM Med-PaLM 2. Mimicking a peer-review process, MCC integrates structured debate, cross-model critique, and iterative refinement, reducing key error rates by 3.2%–9.1%. It achieved statistically significant gains in comprehension, factual recall, and bias mitigation. Notably, MCC's responses were not only broader in medical coverage but also more coherent and actionable, reflecting integration of diverse disciplinary perspectives—unlike the fragmented outputs typical of single models. Furthermore, MCC explicitly annotated knowledge contributions from each participating model, enhancing transparency and traceability—an essential advance for clinical AI systems where explainability and accountability are critical.31 Considering that all raters in our dual-perspective evaluation were recruited primarily from a single geographic region (China), and that the limited sample size inevitably introduces subjectivity and potential bias, we further validated MCC on the more challenging HealthBench benchmark. MCC demonstrated strong performance in both the Consensus and Hard subsets, achieving scores of 0.921 and 0.304, respectively, underscoring its potential in open-ended long-form dialog tasks.

In recent years, multiple efforts have sought to apply LLMs to the medical domain. For example, Articulate Medical Intelligence Explorer (AMIE),1 a domain-specific dialog system, improves performance through expert-annotated reasoning chains and iterative self-play, whereas Med-PaLM 2 adopts a similar strategy by aligning models with curated medical question-answering corpora via supervised fine-tuning. Other efforts, such as MedRAG,32 have focused on systematically benchmarking retrieval-augmented generation (RAG) in the medical domain, highlighting the substantial gains achievable through optimized retriever-corpus configurations and ensemble retrieval strategies and offering valuable methodological insights. AgentMD33 represents a tool-augmented paradigm for clinical decision support, in which LLMs execute medical tool-based reasoning by invoking a large-scale repository of clinical calculators automatically constructed from authoritative biomedical literature and clinical guidelines, thereby enabling precise risk stratification and quantitative diagnostic assessment. By contrast, MCC operates without task-specific training, external knowledge injection, or tool invocation; instead, it enhances reasoning by leveraging structured peer interactions among diverse LLMs oriented toward the reasoning process.

Several multi-LLM frameworks, such as MedAgents,8 MDTeamGPT,9 and MDAgents,34 employ role specialization (e.g., proposer, reviewer, simulated expert, or specialty-specific expert pools) to enhance accuracy. Table S11 presents accuracy comparisons on the MedQA and PubMedQA benchmarks, where MCC consistently outperforms existing methods. Specifically, on MedQA, MCC achieves 92.6% ± 0.3%, surpassing the previous best method MDTeamGPT (90.1%) by +2.5 percentage points. On PubMedQA, MCC attains 84.8% ± 0.4%, exceeding MDTeamGPT (83.9%) by +0.9 percentage points. We attributed MCC’s advantages to several design distinctions. First, while role-specialized frameworks offer domain-specific insights, their fixed assignments can constrain reasoning breadth. For instance, in headache diagnosis, neurology and ophthalmology agents may pursue parallel reasoning paths without effective reconciliation. Second, contextual interaction is equally critical in multi-agent systems, yet approaches such as Mixture-of-Agents,10 although introducing hierarchical architectures and network-based communication, primarily refine immediate predecessor outputs without revisiting earlier contexts, a limitation that may restrict their applicability in dynamic clinical discussions. Third, many systems utilize homogeneous model families, whereas our ablation studies confirm that limited diversity in knowledge sources and reasoning approaches reduces overall performance. In contrast, MCC enables flexible, role-agnostic collaboration free from architectural constraints, integrating heterogeneous models while preserving the full dialog context to support perspective integration and reduce collective bias.

MCC demonstrates significant potential for real-world healthcare applications. In clinical diagnostics, its debate-driven mechanism serves as a reasoning aid for complex cases, providing comprehensive perspectives to support multidisciplinary treatment planning and precision decision-making. The transparent reasoning process and self-correction capabilities enhance trustworthiness and safety in medical AI applications.

To facilitate the efficient application of MCC in real-time clinical decision-making—enhancing its usability while preserving interpretability under clinical time constraints—we developed a structured, prompt-based summarization process (prompt provided in Data S17) that condenses the full debate logs into concise, human-readable summaries. In a pilot case, a complete debate of over 500 lines (Data S2) was distilled into a concise 34-line summary (Table S12) while retaining the key reasoning chains, stance shifts, and supporting evidence. This concise format enables clinicians to efficiently review the core reasoning pathways and consensus points under time constraints, while the complete debate logs remain available for in-depth traceability and analysis.

Our OSCE-style evaluations further suggest that MCC may serve as a promising training and evaluation platform in medical education. During its interactions with simulated patients, MCC makes its questioning logic and reasoning pathways explicitly observable, enabling learners to understand why specific questions are asked, how differential diagnoses are structured, and what clinical cues drive decision-making. By exposing participants to these articulated reasoning steps in real time, MCC can help reinforce key diagnostic concepts and deepen learners' intuitive grasp of clinical reasoning.

Under layperson-clinician dual assessments and expert rubrics, MCC generated higher-quality, more logically rigorous, and safer responses, indicating its potential relevance to doctor-patient communication and patient health education. Notably, we emphasize MCC's role as an assistive tool that augments rather than replaces clinical judgment, presenting consensus views while highlighting uncertainties. Future research should establish robust human-AI collaboration frameworks35,36 to validate MCC's performance in real clinical settings. Considering that MCC possesses zero-shot scalability, continued advances in the field of LLMs will enable future researchers to integrate higher-quality models to further explore its performance boundaries. For real-world deployment, it will be essential to balance computational cost with clinical value and to explore feasible pathways for localized implementation, while also developing adaptive mechanisms to handle ambiguous presentations and conflicting information in real-world clinical settings, thereby ensuring MCC's safe and robust integration into healthcare systems.

Limitations of the study

This study had three main limitations. First, the geographic homogeneity of the human evaluators (all recruited from a single region in China) may limit the generalizability of the findings to broader populations. Second, MCC’s performance boundaries remain uncertain when dealing with highly ambiguous, inconsistent, or sparse real-world clinical data, where debate may amplify rather than resolve uncertainty. Accordingly, we emphasize the role of MCC as a supportive—rather than autonomous—tool for reasoning under uncertainty. Third, MCC requires increased computational cost and longer inference time compared with single-model baselines, which may affect its scalability and practical deployability in clinical settings.

Resource availability

Lead contact

Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Erping Long (erping.long@ibms.pumc.edu.cn).

Materials availability

This study did not generate new unique reagents.

Data and code availability

The MCC code and debate logs are available on GitHub (https://github.com/sunxinti/MCC) and archived on Zenodo (https://doi.org/10.5281/zenodo.17834284), as listed in the key resources table.

Acknowledgments

This study is supported by the National Natural Science Foundation of China (Excellent Youth Scholars Program, 32300483, 32470635, and 82090011), the National Key R&D Program (2024ZD0528500), the State Key Laboratory Special Fund (2060204), and the Chinese Academy of Medical Sciences Innovation Fund (2023-I2M-3-010 and 2023-I2M-2-001). We thank the Bioinformatics Center of Institute of Basic Medical Sciences and High-Performance Computing Center of Chinese Academy of Medical Sciences for computing support.

Author contributions

E.L. and P.W. contributed to the conception and design of the work. X.S., Q.H., M.Z., and Y.L. contributed to the data acquisition, curation, and analysis. T.C., Z.H., G.L., W.T., S.X., X.N., and J.P. provided valuable assistance. All authors contributed to the drafting and revising of the manuscript.

Declaration of interests

The authors declare no competing interests.

STAR★Methods

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data

MCC debate log This paper https://github.com/sunxinti/MCC

Software and algorithms

GPT-o1 OpenAI https://openai.com/
Qwen-QwQ Alibaba https://huggingface.co/Qwen/QwQ-32B
DeepSeek-R1 DeepSeek https://huggingface.co/deepseek-ai/DeepSeek-R1
Qwen3-30B-A3B-Thinking-2507 Alibaba https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
Python 3.11 Python https://www.python.org/
R 4.1.3 R project https://www.r-project.org/
MCC This paper https://doi.org/10.5281/zenodo.17834284; https://github.com/sunxinti/MCC

Method details

Multiple-choice questions dataset

For evaluation on multiple-choice questions, we used the MultiMedQA benchmark suite17 (Table S1). Specifically, the benchmarks include three datasets (MedQA,37 PubMedQA,18 and MMLU19).

MedQA consists of multiple-choice questions written in the style of medical licensing examinations, designed to assess the professional competence of medical practitioners. To enable fair comparisons with prior work,2,4 we focus on the United States Medical Licensing Examination (USMLE)-style subset, which includes 1,273 questions.

PubMedQA evaluates a model’s ability to answer biomedical research questions with “yes,” “no,” or “uncertain” responses based on background information provided from PubMed abstracts. The benchmark offers two evaluation settings: reasoning-required and reasoning-free. In the reasoning-free setting, a long-form answer containing an explanation is also provided. Consistent with prior studies,4 we report results under the reasoning-required setting, where models must answer solely based on the abstract without access to additional explanations. This dataset contains 500 questions in total.

MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark spanning 57 datasets across STEM, humanities, and social sciences. Following previous research,2 we benchmark our models on the medically relevant MMLU subset, which covers clinical knowledge, medical genetics, anatomy, professional medicine, college biology, and college medicine.

MedXpertQA is a recently introduced, rigorously curated benchmark that is both highly challenging and comprehensive, specifically designed to assess LLMs on expert-level medical knowledge and advanced clinical reasoning.21 The benchmark consists of 4,460 questions spanning 17 medical specialties and 11 body systems, and is structured into two complementary subsets: a text-based subset and a multimodal subset. For the purposes of this study, we focused on the text subset, comprising 2,450 questions in total, including 589 targeting understanding skills and 1,861 targeting reasoning skills.

MetaMedQA is an adversarial benchmark comprising 1,373 multiple-choice items systematically derived from the MedQA-USMLE dataset.22 The benchmark incorporates three structured augmentations: fictional questions generated from a synthetic organ system (Glianorex) to assess recognition of knowledge blind spots; malformed questions identified through expert review to capture items lacking essential information or dependent on missing media; and modified questions produced by altering correct options, distractors, or stems to introduce adversarial perturbations. Model performance is evaluated using overall accuracy, missing-answer recall (ability to recognize “None of the above” questions), and unknown-answer recall (ability to identify unanswerable questions), providing an integrated measure of metacognitive ability under both standard and adversarial conditions.

RABBITS is a benchmark framework designed to evaluate the robustness of LLMs in handling drug name variability (brand vs. generic) in medical question answering.23 The dataset was constructed through systematic substitution of drug names in MedQA and MedMCQA,38 with all replacements reviewed by physician experts to ensure contextual and semantic consistency. The MedQA-derived subset comprises 378 questions covering approximately 821 unique drug variants, while the MedMCQA-derived subset includes 348 questions involving around 650 unique variants. By introducing clinically relevant brand–generic substitutions, RABBITS enables rigorous assessment of whether models maintain consistent reasoning performance and accuracy in the presence of subtle lexical variation in drug terminology.

Implementation of MCC framework on multiple-choice questions

We utilized three advanced reasoning LLMs: GPT-o1 (implemented via the o1-mini model from OpenAI), Qwen-QwQ (implemented via the QwQ-32B model from Alibaba’s Qwen series), and DeepSeek-R1 (implemented via the DeepSeek-R1-671B model from DeepSeek). The specific LLM versions used in this study were selected based on availability at the time of experimentation. The core workflow of MCC comprises three components: (1) initial response generation, in which integrated LLMs independently generate responses; (2) adversarial activation, a component specifically designed to assess whether adversarial mechanisms should be engaged, conditionally triggered upon detection of response divergence to initiate a structured multi-round debate process; (3) consensus optimization, responsible for optimizing each model’s reasoning path by leveraging a structured framework of adversarial critique and cooperative refinement; if consensus is not reached within three rounds, the final output is determined by majority voting. For multiple-choice tasks, consensus is established when the final selected options align across all models.

Long-form medical questions dataset

For evaluation on long-form medical questions, we used the MultiMedQA 140 dataset (Data S12) from the study by Natarajan et al.2 The 140 MultiMedQA questions are a structured sample consisting of 100 questions from HealthSearchQA,17 20 questions from LiveQA,39 and 20 questions from MedicationQA.40 In addition, we incorporated HealthBench,25 a large-scale benchmark released by OpenAI that evaluates model performance against 48,562 criteria authored by 262 physicians. Each dialogue in HealthBench is paired with a physician-designed rubric for response grading, enabling fine-grained assessment of model outputs. We adopted two curated subsets: HealthBench Consensus (n = 3,671), which emphasizes highly validated evaluation, and HealthBench Hard (n = 1,000), designed to capture challenging, unsaturated cases. HealthBench dialogues simulate realistic interactions between LLMs and patients or clinicians, with the model tasked to provide the optimal response to the user's final query.

Implementation of MCC framework on long-form medical questions

Given the differing task characteristics for consensus determination in long-form medical questions, we made minimal adaptations to the core MCC framework. Specifically, while maintaining the three core components used for multiple-choice tasks, we developed a consensus evaluation method based on multi-dimensional standard deviation to quantify the consistency of model outputs in question-answering tasks. The method includes the following steps.

1. Evaluation dimensions: Three core dimensions were selected, each scored on a scale from 1 to 10. Factual accuracy refers to the objective correctness of the response content; completeness refers to the extent to which the response includes key and relevant information; and safety refers to the degree to which the response avoids potential medical risks.

2. Dimension consistency score calculation: For each evaluation dimension, we collected the scores assigned to all LLM responses, denoted as the set $S_d = \{s_{d,1}, s_{d,2}, \ldots, s_{d,M}\}$, where $d$ represents the evaluation dimension and $M$ is the total number of scores collected.

3. Overall consensus score integration: For each dimension $d$, we first compute the mean score

$u_d = \frac{1}{M} \sum_{i=1}^{M} s_{d,i}$

and the standard deviation of the scores

$\sigma_d = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (s_{d,i} - u_d)^2}$,

which is mapped to a per-dimension consistency score

$A_d = \max\left(0, 1 - \frac{\sigma_d}{4}\right)$.

Letting $D$ denote the set of evaluation dimensions, the overall consensus score is defined as

$\mathrm{ConsensusScore} = \frac{1}{|D|} \sum_{d \in D} A_d$.

4. Consensus determination criteria: When the overall consensus score is ≥0.8, the models are considered to have reached consensus, and no further debate is required. When the overall consensus score is <0.8, consensus is not yet achieved, and additional rounds of critique-improvement debate are conducted (a code sketch of this computation follows below).
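
This scoring transcribes directly into a few lines of Python; the sketch below uses illustrative scores, and statistics.pstdev implements the population standard deviation σd defined above:

```python
import statistics

def consensus_score(scores_by_dimension):
    """Compute the MCC consensus score from peer-assigned ratings.

    `scores_by_dimension` maps each dimension to the list of 1-10 scores
    that the models assigned to one another's responses in this round.
    """
    per_dim = []
    for scores in scores_by_dimension.values():
        sigma = statistics.pstdev(scores)          # population SD, i.e., sigma_d
        per_dim.append(max(0.0, 1.0 - sigma / 4))  # A_d = max(0, 1 - sigma_d / 4)
    return sum(per_dim) / len(per_dim)             # mean of A_d over dimensions

# Illustrative scores for one debate round.
scores = {"factual_accuracy": [8, 9, 8], "completeness": [7, 9, 6], "safety": [9, 9, 10]}
score = consensus_score(scores)
print(round(score, 3), "consensus" if score >= 0.8 else "another debate round")
```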

Following the initial round of independent response generation, adversarial activation is automatically triggered, wherein each LLM evaluates the responses of the other LLMs across the three predefined dimensions. During this phase, each LLM assigns dimensional scores to the others and provides critical feedback and suggestions for improvement. Subsequently, the process enters the third stage, consensus optimization, where each LLM refines its response based on the feedback received. This adversarial evaluation and optimization cycle is repeated iteratively. Once the consensus score reaches 0.8, the system considers that sufficient agreement among the three LLMs has been achieved and terminates further debate. At this point, a designated moderator LLM (GPT-o1) is introduced, whose sole purpose is to consolidate the optimized responses from the three LLMs without generating new content, producing the final integrated output.

Individualized evaluation of long-form question-answering

Long-form answers generated by Med-PaLM 2,24 GPT-o1, Qwen-QwQ, DeepSeek-R1, and MCC were independently evaluated by physician and layperson raters using the scoring framework proposed in the Med-PaLM 2 study by Natarajan et al.2 Specifically, physicians and laypersons used task-specific evaluation rubrics: physicians assessed answers across 12 dimensions (Table S7), while laypersons evaluated responses across 2 dimensions (Table S9). All raters were blinded to the source of the responses and conducted their evaluations independently. Evaluations were conducted across all responses in the MultiMedQA 140 dataset, with each response assessed by 10 independent raters from the corresponding group (either physicians or laypersons). We further conducted a groupwise ranking evaluation to enable direct comparisons of responses from different sources. This evaluation was designed to explicitly compare the relative performance of Med-PaLM 2, GPT-o1, DeepSeek-R1, Qwen-QwQ, and MCC. For each question, raters reviewed all model-generated responses and selected the one they judged as superior, based on a nine-dimension evaluation framework defined previously27 (Table S8). Whereas independent evaluations assessed the positive and negative attributes of each response across individual dimensions, the groupwise ranking evaluation integrated these aspects to determine which response exhibited overall higher quality. Raters were blinded to the source of each response, and the display order of responses was randomized to prevent ordering bias.
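
The significance levels reported in Figure 3 were obtained with permutation tests; the following is a generic two-sided sketch on per-question defect indicators, shown to illustrate the test family rather than to reproduce the exact analysis script:

```python
import random

def permutation_p_value(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test on per-question defect indicators.

    `a` and `b` are lists of 0/1 flags (1 = flawed answer) for two models;
    shuffling the pooled labels builds the null distribution of the
    difference in defect rates.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:len(a)]) / len(a) - sum(pooled[len(a):]) / len(b))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)  # add-one smoothing keeps p strictly > 0
```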

Composition of physician and layperson raters

This study recruited a total of 20 raters (10 licensed physicians and 10 laypersons). Licensed physicians (5 male and 5 female) were from hospitals and academic institutions in China, all of whom had passed the Chinese National Medical Licensing Examination and had ≥3 years of independent clinical practice in internal medicine, surgery, or general practice. Laypersons (5 male and 5 female) had diverse educational and occupational backgrounds (e.g., teachers, engineers, service workers) but no formal medical training. All raters were recruited voluntarily through institutional networks and community outreach and received modest compensation.

OSCE study design for diagnostic dialogue tasks

To evaluate MCC’s performance on diagnostic dialogue tasks, we designed a simulation-based assessment inspired by OSCE principles.1 Specifically, we used 16 simulated case packages sourced from the UK Royal College of Physicians website, each containing a gold-standard (primary) diagnosis and a set of acceptable (secondary) diagnoses. These cases covered a broad range of clinical domains, including cardiovascular (1 case), respiratory (3 cases), gastrointestinal (2 cases), neurological (2 cases), genitourinary (1 case), obstetrics and gynecology (1 case), and general internal medicine (6 cases). We recruited 16 enrolled students, each randomly assigned an instruction package, to interact with a custom-designed online dialogue system. Simulated patients were instructed not to disclose any information absent from the role package, and if the model posed a question not covered by the scenario, they were to truthfully respond, “I don’t know.”

The evaluation metrics were divided into two main categories: (1) history-taking metrics, including patient information capture rate (PICR), defined as the number of patient details successfully elicited by the model’s questions divided by the total predefined patient details, and patient-relevant question rate (PRQR), defined as the proportion of the model’s questions that effectively targeted relevant patient information; (2) diagnostic accuracy metrics, including primary diagnosis accuracy (PDA), measured by accuracy on the gold-standard diagnosis, and supporting differential diagnosis accuracy (SDA), measured by accuracy across the set of acceptable diagnoses.
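
Both history-taking metrics are simple ratios over the annotated transcript; a minimal sketch with illustrative argument names:

```python
def history_taking_metrics(elicited, predefined, relevant_questions, total_questions):
    """PICR and PRQR for one simulated case, per the definitions above.

    All arguments are counts taken from the annotated dialogue transcript
    (argument names are illustrative).
    """
    picr = elicited / predefined                  # patient information capture rate
    prqr = relevant_questions / total_questions   # patient-relevant question rate
    return picr, prqr

# Example: 18 of 21 scripted details elicited, with 24 of 27 questions on target.
print(history_taking_metrics(18, 21, 24, 27))  # (0.857..., 0.888...)
```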

We applied non-parametric paired statistical tests to assess differences across models. For PDA (binary), MCC was compared with each baseline using the exact McNemar test (binomial test on discordant pairs). For continuous outcomes (PICR, PRQR, and SDA), we used paired Wilcoxon signed-rank tests. Overall differences across all four models were assessed by the Friedman test, followed by Bonferroni-adjusted pairwise Wilcoxon tests. For model-level accuracies, 95% confidence intervals (CIs) were computed using both Clopper–Pearson exact and Wilson score intervals. For paired differences in PICR/PRQR/SDA, 95% CIs were estimated via non-parametric bootstrap (10,000 resamples). Two-sided α = 0.05 was used for all tests.
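
The sketch below shows how these tests could be run with standard scientific Python libraries, assuming paired per-case arrays for MCC and one baseline; variable names and the data layout are illustrative.

```python
# Sketch of the paired comparisons described above, using scipy/statsmodels.
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.proportion import proportion_confint

def exact_mcnemar_p(mcc, base):
    # Exact McNemar test (binomial on discordant pairs) for binary PDA outcomes.
    mcc, base = np.asarray(mcc, bool), np.asarray(base, bool)
    table = np.array([[np.sum(mcc & base), np.sum(mcc & ~base)],
                      [np.sum(~mcc & base), np.sum(~mcc & ~base)]])
    return mcnemar(table, exact=True).pvalue

def paired_wilcoxon_p(x, y):
    # Paired Wilcoxon signed-rank test for continuous PICR/PRQR/SDA scores.
    return wilcoxon(x, y).pvalue

def friedman_p(*per_case_scores):
    # Omnibus test across all four models (one aligned score vector per model).
    return friedmanchisquare(*per_case_scores).pvalue

def accuracy_cis(n_correct, n_total):
    # Clopper-Pearson ("beta") exact and Wilson score 95% CIs.
    return (proportion_confint(n_correct, n_total, 0.05, method="beta"),
            proportion_confint(n_correct, n_total, 0.05, method="wilson"))

def bootstrap_diff_ci(x, y, n_boot=10_000, seed=0):
    # Non-parametric bootstrap (10,000 resamples) of the mean paired difference.
    rng = np.random.default_rng(seed)
    d = np.asarray(x) - np.asarray(y)
    boots = [rng.choice(d, d.size, replace=True).mean() for _ in range(n_boot)]
    return np.percentile(boots, [2.5, 97.5])
```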

Implementation of MCC framework on diagnostic dialogue tasks

Given that diagnostic dialogue tasks inherently involve multi-turn interactions with patients, we employed a moderator LLM (GPT-o1) to determine whether a patient’s query warrants the initiation of a full debate process. This design aims to minimize the time cost for virtual patients and better simulate realistic communication. For simple exchanges—such as greetings or expressions of thanks—the moderator LLM responds directly, bypassing the history-taking and diagnostic processes.
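
One way such moderator gating could be prompted is sketched below; the prompt wording and the `moderator_call` helper are hypothetical stand-ins for the actual API call, not the released implementation.

```python
# Hypothetical moderator gating: decide whether a patient turn warrants a full
# debate or a brief direct reply. Prompt text is illustrative only.
GATE_PROMPT = (
    "You are the moderator of a clinical dialogue. Given the patient's message "
    "below, answer DEBATE if it requires history-taking or diagnostic "
    "reasoning, or DIRECT if a brief direct reply suffices.\n"
    "Patient: {message}"
)

def needs_debate(moderator_call, message):
    verdict = moderator_call(GATE_PROMPT.format(message=message))
    return verdict.strip().upper().startswith("DEBATE")
```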

The implementation of diagnostic dialogues involves two phases. In the history-taking phase, we made minimal modifications to the free-form response system. In addition to the original three evaluation dimensions (factual accuracy, completeness, and safety), we introduced a fourth dimension, readability.36 This dimension evaluates whether the response is clearly and coherently articulated, appropriately structured, and expressed in language easily understood by the average patient; in real-world medical interactions, overly verbose or highly technical responses can increase cognitive load and raise psychological barriers for patients. If the moderator LLM determines that history-taking is required, it triggers the generation of initial questions by the three LLMs, each of which independently produces a set of history-taking questions. After this initial round, adversarial activation is automatically triggered: each LLM evaluates the responses of the others across the four predefined dimensions, assigning dimensional scores and providing targeted feedback and suggestions for improvement. During consensus optimization, each LLM revises its response based on the feedback it receives. This cycle of adversarial evaluation and cooperative refinement repeats iteratively; once the consensus score exceeds 0.8, the system determines that sufficient agreement has been reached among the LLMs, and the debate terminates. The moderator LLM then consolidates the agreed-upon history-taking question set and delivers it to the virtual patient.
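
A minimal sketch of this adversarial-evaluation and consensus-optimization loop follows; the injected helpers (`generate`, `critique`, `revise`, `consensus_score`, `consolidate`) are hypothetical stand-ins for the underlying API calls, and the wiring beyond what is described above is assumed.

```python
# Sketch of one history-taking debate cycle. All LLM interactions are injected
# as callables so the control flow can be shown without vendor-specific code.
DIMENSIONS = ("factual accuracy", "completeness", "safety", "readability")
CONSENSUS_THRESHOLD = 0.8  # agreement level at which the debate terminates

def debate_history_taking(llms, context, generate, critique, revise,
                          consensus_score, consolidate, max_rounds=5):
    # Each LLM independently drafts an initial set of history-taking questions.
    answers = {m: generate(m, context) for m in llms}
    for _ in range(max_rounds):
        # Adversarial activation: every LLM scores the others' answers across
        # the four predefined dimensions and returns targeted feedback.
        feedback = {m: [critique(peer, answers[m], DIMENSIONS)
                        for peer in llms if peer != m]
                    for m in llms}
        # Consensus optimization: each LLM revises its own answer using the
        # feedback it received.
        answers = {m: revise(m, answers[m], feedback[m]) for m in llms}
        if consensus_score(answers) > CONSENSUS_THRESHOLD:
            break  # sufficient agreement reached among the LLMs
    # The moderator consolidates the agreed-upon question set for the patient.
    return consolidate(answers)
```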

In the diagnostic phase, the consensus mechanism is similar to that used in multiple-choice tasks. However, since different LLMs may express the same diagnosis in varied linguistic forms, simple lexical matching is insufficient to determine agreement. To address this, the moderator LLM is tasked with assessing whether the primary diagnoses proposed by the LLMs are semantically equivalent. If equivalence is confirmed, consensus is considered achieved. For secondary diagnoses, we adopt a collective summarization approach, aggregating and synthesizing the secondary diagnostic suggestions provided by all participating models.
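
One way the moderator's semantic-equivalence check could be prompted is sketched below; the prompt wording and the `moderator_call` helper are illustrative assumptions.

```python
# Hypothetical prompt-based check that the primary diagnoses proposed by the
# LLMs refer to the same underlying condition despite differing phrasing.
EQUIVALENCE_PROMPT = (
    "Do the following primary diagnoses refer to the same underlying "
    "condition? Answer strictly YES or NO.\n{diagnoses}"
)

def primary_diagnoses_equivalent(moderator_call, diagnoses):
    prompt = EQUIVALENCE_PROMPT.format(
        diagnoses="\n".join(f"- {d}" for d in diagnoses))
    return moderator_call(prompt).strip().upper().startswith("YES")
```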

Computational cost analysis

The MCC framework supports flexible API-based integration with a wide range of commercially available LLMs, enabling access without code modification. The computational costs, expressed in terms of runtime, token usage, and financial expenditure, vary with dataset size and the proportion of disagreement-triggered cases requiring multi-round deliberation. At the time of the study, API pricing was as follows: GPT-o1 (o1-mini), $0.00122 per 1K input tokens and $0.00489 per 1K output tokens; Qwen-QwQ (QwQ-32B), $0.000139 per 1K input tokens and $0.000556 per 1K output tokens; and DeepSeek-R1 (deepseek-reasoner), $0.000069 per 1K input tokens and $0.00167 per 1K output tokens. These rates are subject to change, as LLM pricing typically fluctuates and trends downward as newer models are released. To obtain a detailed breakdown, we traversed the MedQA dataset (n = 1,273) sequentially, executing cases in order from sample 1 to 1,273 while logging runtime, token utilization, and monetary costs. For each API call, cost was calculated as (input tokens / 1,000 × input price) + (output tokens / 1,000 × output price). We report per-call, per-case, and overall statistics, including minimum, maximum, mean, and median runtime, token usage, and cost. Notably, the actual cost and runtime of MCC vary with the number of debate rounds and whether consensus is reached; per-round estimates of inference time, token usage, and cost can be reasonably approximated from the per-case means.
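
For concreteness, the per-call cost formula can be expressed directly in code using the per-1K-token rates reported above; the `PRICES` table layout and model keys are our own, for illustration.

```python
# Per-call cost bookkeeping using the rates reported at the time of the study.
PRICES = {  # (USD per 1K input tokens, USD per 1K output tokens)
    "gpt-o1-mini":       (0.00122,  0.00489),
    "qwq-32b":           (0.000139, 0.000556),
    "deepseek-reasoner": (0.000069, 0.00167),
}

def call_cost(model, input_tokens, output_tokens):
    price_in, price_out = PRICES[model]
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out

# Example: one call with 2,400 input and 900 output tokens on DeepSeek-R1.
print(f"{call_cost('deepseek-reasoner', 2400, 900):.6f}")  # ~0.001669 USD
```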

Configuration experiments

To evaluate the contribution of different components within the MCC framework, we performed configuration experiments using the MedQA dataset (n = 1,273), chosen for its representative scale. We first assessed the impact of vendor diversity by configuring three replicas of the same LLM (DeepSeek-R1, GPT-o1, or Qwen-QwQ). We then examined the effect of agent number and composition by testing pairwise combinations of two distinct LLMs (GPT-o1 + DeepSeek-R1, GPT-o1 + Qwen-QwQ, and Qwen-QwQ + DeepSeek-R1). Finally, we evaluated performance in a four-agent setting by adding an additional reasoning model (DeepSeek-R1 + Qwen-QwQ + GPT-o1 + Qwen3-30B-A3B-Thinking-2507,20 hereafter referred to as Qwen3) to examine whether increased model diversity further improves performance.

Quantification and statistical analysis

All analyses were conducted using Python v3.11 and R v4.1.3. For the classical multiple-choice benchmarks, each model was evaluated over 10 independent runs; we report the mean ± standard deviation and the mean with 95% confidence intervals based on a Student’s t-distribution (df = 9). Paired two-sided t-tests were used to compare MCC with baseline models across matched runs (α = 0.05). For the OSCE analysis, we applied exact McNemar tests for binary outcomes (e.g., PDA), Wilcoxon signed-rank tests for continuous outcomes (PICR, PRQR, and SDA), and Friedman tests followed by Bonferroni-adjusted pairwise comparisons across all models. Confidence intervals for model-level metrics were estimated using the Clopper–Pearson and Wilson methods, while paired metric differences were bootstrapped (10,000 iterations). Long-form evaluation results were assessed via clustered bootstrap sampling at the answer level, with two-tailed permutation tests (10,000 iterations) used for hypothesis testing. In cases involving multiple human ratings per answer, permutations were clustered by answer identity to preserve the rater structure. All analyses used a two-sided α = 0.05.
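
A minimal sketch of the answer-level clustered permutation test is given below, under the assumption that ratings are paired by question and swapped between sources as whole clusters so that all ratings of the same answer move together; the data layout is illustrative.

```python
# Clustered two-tailed permutation test: labels are permuted between the two
# sources at the answer level, preserving the rater structure within answers.
import numpy as np

def clustered_permutation_test(ratings_a, ratings_b, n_perm=10_000, seed=0):
    """ratings_a / ratings_b: lists of per-answer rating arrays (one array of
    human ratings per answer), paired by question across the two sources."""
    rng = np.random.default_rng(seed)
    def stat(a, b):  # difference in mean per-answer rating between sources
        return (np.mean([np.mean(r) for r in a])
                - np.mean([np.mean(r) for r in b]))
    observed = stat(ratings_a, ratings_b)
    extreme = 0
    for _ in range(n_perm):
        flip = rng.random(len(ratings_a)) < 0.5  # swap whole answers
        pa = [b if f else a for a, b, f in zip(ratings_a, ratings_b, flip)]
        pb = [a if f else b for a, b, f in zip(ratings_a, ratings_b, flip)]
        extreme += abs(stat(pa, pb)) >= abs(observed)
    return extreme / n_perm  # two-tailed p value
```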

Published: January 5, 2026

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.xcrm.2025.102547.

Contributor Information

Peixing Wan, Email: peixing@bjmu.edu.cn.

Erping Long, Email: erping.long@ibms.pumc.edu.cn.

Supplemental information

Document S1. Figures S1 and S2 and Tables S1–S12
mmc1.pdf (362.7KB, pdf)
Data S1. Detailed overview of the MedQA benchmark, related to Figure 2
mmc2.xlsx (627.9KB, xlsx)
Data S2. Full output of a representative case from MCC on the MedQA benchmark, related to Figure 2
mmc3.xlsx (27.7KB, xlsx)
Data S3. Detailed overview of 13 representative cases with incorrect debate outcomes, related to Figure 2
mmc4.xlsx (11.3KB, xlsx)
Data S4. Detailed overview of the PubMedQA benchmark, related to Figure S2
mmc5.xlsx (51.3KB, xlsx)
Data S5. Full output of a representative case from MCC on the PubMedQA benchmark, related to Figure S2
mmc6.xlsx (21.8KB, xlsx)
Data S6. Detailed overview of the Clinical knowledge benchmark, related to Table 1
mmc7.xlsx (68.3KB, xlsx)
Data S7. Detailed overview of the Medical genetics benchmark, related to Table 1
mmc8.xlsx (34.7KB, xlsx)
Data S8. Detailed overview of the Anatomy benchmark, related to Table 1
mmc9.xlsx (32KB, xlsx)
Data S9. Detailed overview of the Professional medicine benchmark, related to Table 1
mmc10.xlsx (117.2KB, xlsx)
Data S10. Detailed overview of the College biology benchmark, related to Table 1
mmc11.xlsx (41.9KB, xlsx)
Data S11. Detailed overview of the College medicine benchmark, related to Table 1
mmc12.xlsx (54.7KB, xlsx)
Data S12. Detailed overview of the MultiMedQA 140 benchmark, related to Figure 3
mmc13.xlsx (15.9KB, xlsx)
Data S13. Representative case of a long-form medical question-answering response, related to Figure 3
mmc14.xlsx (43.7KB, xlsx)
Data S14. Representative case from the HealthBench Consensus benchmark, related to Table S10
mmc15.xlsx (31.7KB, xlsx)
Data S15. Representative case from the HealthBench Hard benchmark, related to Table S10
mmc16.xlsx (46.8KB, xlsx)
Data S16. Representative case of diagnostic dialog task, related to Figure 4
mmc17.xlsx (61.6KB, xlsx)
Data S17. Structured prompt used for debate summarization, related to Figure 2
mmc18.xlsx (10.6KB, xlsx)
Document S2. Article plus supplemental information
mmc19.pdf (4.4MB, pdf)

References

1. Tu T., Schaekermann M., Palepu A., Saab K., Freyberg J., Tanno R., Wang A., Li B., Amin M., Cheng Y., et al. Towards conversational diagnostic artificial intelligence. Nature. 2025;642:442–450. doi:10.1038/s41586-025-08866-7.
2. Singhal K., Tu T., Gottweis J., Sayres R., Wulczyn E., Amin M., Hou L., Clark K., Pfohl S.R., Cole-Lewis H., Neal D. Toward expert-level medical question answering with large language models. Nat. Med. 2025;31:943–950. doi:10.1038/s41591-024-03423-7.
3. Saab K., Tu T., Weng W.H., Tanno R., Stutz D., Wulczyn E., Zhang F., Strother T., Park C., Vedadi E., Chaves J.Z. Capabilities of Gemini Models in Medicine. Preprint at arXiv. 2024. doi:10.48550/arXiv.2404.18416.
4. Nori H., Lee Y.T., Zhang S., Carignan D., Edgar R., Fusi N., King N., Larson J., Li Y., Liu W., Luo R. Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. Preprint at arXiv. 2023. doi:10.48550/arXiv.2311.16452.
5. Luo T., Du W., Bi J., Chung S., Tang Z., Yang H., Zhang M., Wang B. Learning from Peers in Reasoning Models. Preprint at arXiv. 2025. doi:10.48550/arXiv.2505.07787.
6. Chen X., Yi H., You M., Liu W., Wang L., Li H., Zhang X., Guo Y., Fan L., Chen G., et al. Enhancing diagnostic capability with multi-agents conversational large language models. npj Digit. Med. 2025;8:159. doi:10.1038/s41746-025-01550-0.
7. Yang C., Zhang X., Chen J. Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data. Preprint at bioRxiv. 2025. doi:10.1101/2025.04.10.647852.
8. Tang X., Zou A., Zhang Z., Li Z., Zhao Y., Zhang X., Cohan A., Gerstein M. MedAgents: Large Language Models as Collaborators for Zero-Shot Medical Reasoning. Association for Computational Linguistics; 2024.
9. Chen K., Li X., Yang T., Wang H., Dong W., Gao Y. MDTeamGPT: A Self-Evolving LLM-based Multi-Agent Framework for Multi-Disciplinary Team Medical Consultation. Preprint at arXiv. 2025. doi:10.48550/arXiv.2503.13856.
10. Wang J., Wang J., Athiwaratkun B., Zhang C., Zou J. Mixture-of-Agents Enhances Large Language Model Capabilities. In: Yue Y., ed. International Conference on Learning Representations (ICLR); 2025:33944–33963.
11. Moritz M., Topol E., Rajpurkar P. Coordinated AI agents for advancing healthcare. Nat. Biomed. Eng. 2025;9:432–438. doi:10.1038/s41551-025-01363-2.
12. Wu Q., Gao Z., Gou L., Dou Q. DDxTutor: Clinical Reasoning Tutoring System with Differential Diagnosis-Based Structured Reasoning. Association for Computational Linguistics; 2025.
13. Nori H., Daswani M., Kelly C., Lundberg S., Ribeiro M.T., Wilson M., Liu X., Sounderajah V., Carlson J., Lungren M.P., Gross B. Sequential Diagnosis with Language Models. Preprint at arXiv. 2025. doi:10.48550/arXiv.2506.22405.
14. OpenAI. Learning to Reason with LLMs. 2024. https://openai.com/index/learning-to-reason-with-llms/
15. Qwen Team. Qwen-QwQ-32B: Embracing the Power of Multi-Agent Collaboration. 2025. https://qwenlm.github.io/blog/qwq-32b/
16. Guo D., Yang D., Zhang H., Song J., Wang P., Zhu Q., Xu R., Zhang R., Ma S., Bi X., et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature. 2025;645:633–638. doi:10.1038/s41586-025-09422-z.
17. Singhal K., Azizi S., Tu T., Mahdavi S.S., Wei J., Chung H.W., Scales N., Tanwani A., Cole-Lewis H., Pfohl S., et al. Large language models encode clinical knowledge. Nature. 2023;620:172–180. doi:10.1038/s41586-023-06291-2.
18. Jin Q., Dhingra B., Liu Z., Cohen W., Lu X. PubMedQA: A Dataset for Biomedical Research Question Answering. Association for Computational Linguistics; 2019.
19. Hendrycks D., Burns C., Basart S., Zou A., Mazeika M., Song D., Steinhardt J. Measuring Massive Multitask Language Understanding. International Conference on Learning Representations (ICLR); 2021.
20. Yang A., Li A., Yang B., Zhang B., Hui B., Zheng B., Yu B., Gao C., Huang C., Lv C., Zheng C. Qwen3 Technical Report. Preprint at arXiv. 2025. doi:10.48550/arXiv.2505.09388.
21. Zuo Y., Qu S., Li Y., Chen Z., Zhu X., Hua E., Zhang K., Ding N., Zhou B. MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding. In: Aarti S., ed. Proceedings of the 42nd International Conference on Machine Learning. PMLR; 2025:80961–80990.
22. Griot M., Hemptinne C., Vanderdonckt J., Yuksel D. Large Language Models lack essential metacognition for reliable medical reasoning. Nat. Commun. 2025;16:642. doi:10.1038/s41467-024-55628-6.
23. Gallifant J., Chen S., Moreira P.J.F., Munch N., Gao M., Pond J., Celi L.A., Aerts H., Hartvigsen T., Bitterman D. Language Models Are Surprisingly Fragile to Drug Names in Biomedical Benchmarks. Association for Computational Linguistics; 2024.
24. Pfohl S.R., Cole-Lewis H., Sayres R., Neal D., Asiedu M., Dieng A., Tomasev N., Rashid Q.M., Azizi S., Rostamzadeh N., McCoy L.G. A toolbox for surfacing health equity harms and biases in large language models. Nat. Med. 2024;30:3590–3600. doi:10.1038/s41591-024-03258-2.
25. Arora R.K., Wei J., Hicks R.S., Bowman P., Quiñonero-Candela J., Tsimpourlas F., Sharman M., Shah M., Vallone A., Beutel A., Heidecke J. HealthBench: Evaluating Large Language Models Towards Improved Human Health. Preprint at arXiv. 2025. doi:10.48550/arXiv.2505.08775.
26. Jiang E., Xu C., Singh N., Singh G. Misaligning Reasoning with Answers - A Framework for Assessing LLM CoT Robustness. Preprint at arXiv. 2025. doi:10.48550/arXiv.2505.17406.
27. Li Z., Qiu W., Ma P., Li Y., Li Y., He S., Jiang B., Wang S., Gu W. An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios. Preprint at arXiv. 2024. doi:10.48550/arXiv.2402.01723.
28. Buddenkotte T., Escudero Sanchez L., Crispin-Ortuzar M., Woitek R., McCague C., Brenton J.D., Öktem O., Sala E., Rundo L. Calibrating ensembles for scalable uncertainty quantification in deep learning-based medical image segmentation. Comput. Biol. Med. 2023;163:107096. doi:10.1016/j.compbiomed.2023.107096.
29. Raji I.D., Daneshjou R., Alsentzer E. It's Time to Bench the Medical Exam Benchmark. NEJM AI. 2025;2.
30. Jin Q., Chen F., Zhou Y., Xu Z., Cheung J.M., Chen R., Summers R.M., Rousseau J.F., Ni P., Landsman M.J., et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digit. Med. 2024;7:190. doi:10.1038/s41746-024-01185-7.
31. Amann J., Blasimme A., Vayena E., Frey D., Madai V.I., Precise4Q consortium. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Med. Inform. Decis. Mak. 2020;20:310. doi:10.1186/s12911-020-01332-6.
32. Xiong G., Jin Q., Lu Z., Zhang A. Benchmarking Retrieval-Augmented Generation for Medicine. Association for Computational Linguistics; 2024.
33. Jin Q., Wang Z., Yang Y., Zhu Q., Wright D., Huang T., Khandekar N., Wan N., Ai X., Wilbur W.J., et al. AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning. Nat. Commun. 2025;16:9377. doi:10.1038/s41467-025-64430-x.
34. Kim Y. MDAgents: an adaptive collaboration of LLMs for medical decision-making. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. Curran Associates Inc.; 2025.
35. Tanno R., Barrett D.G.T., Sellergren A., Ghaisas S., Dathathri S., See A., Welbl J., Lau C., Tu T., Azizi S., et al. Collaboration between clinicians and vision–language models in radiology report generation. Nat. Med. 2025;31:599–608. doi:10.1038/s41591-024-03302-1.
36. Wan P., Huang Z., Tang W., Nie Y., Pei D., Deng S., Chen J., Zhou Y., Duan H., Chen Q., Long E. Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial. Nat. Med. 2024;30:2878–2885. doi:10.1038/s41591-024-03148-7.
37. Jin D., Pan E., Oufattole N., Weng W.H., Fang H., Szolovits P. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Appl. Sci. (Basel). 2021;11:6421.
38. Pal A., Umapathi L.K., Sankarasubbu M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. In: Gerardo F., ed. Proceedings of the Conference on Health, Inference, and Learning. PMLR; 2022:248–260.
39. Abacha A.B., Agichtein E., Pinter Y., Demner-Fushman D. Overview of the Medical Question Answering Task at TREC 2017 LiveQA. Text Retrieval Conference; 2017.
40. Abacha A.B., Mrabet Y., Sharp M., Goodwin T.R., Shooshan S.E., Demner-Fushman D. Bridging the Gap Between Consumers' Medication Questions and Trusted Answers. Stud. Health Technol. Inform. 2019;264:25–29. doi:10.3233/SHTI190176.
