Scientific Reports. 2026 Jan 24;16:6121. doi: 10.1038/s41598-026-37022-y

A benchmark for evaluating diagnostic questioning efficiency of LLMs in patient conversations

Mai Werthaim 1, Maya Kimhi 1, Alexander Apartsin 1, Yehudit Aperstein 2
PMCID: PMC12901287  PMID: 41580557

Abstract

Physician–patient diagnostic interactions often rely on incomplete, uncertain, and noisy symptom descriptions expressed in non-specialist language. To reach an accurate diagnosis efficiently, physicians employ adaptive sequences of focused questions in which each inquiry depends on prior patient responses. Similar requirements apply to AI assistants engaged in clinical dialogue, which must strategically select questions to elicit diagnostic information efficiently. While several datasets and benchmarks address clinical reasoning, few focus on evaluating the strategic inquiry process itself. To address this gap, we introduce Q4Dx (Question-Driven Diagnosis), a benchmark for assessing goal-directed diagnostic questioning. Q4Dx consists of synthetically generated patient cases derived from curated symptom–disease knowledge. Each case is instantiated at multiple information levels (100%, 80%, and 50% symptom exposure) to evaluate performance under varying informational constraints. We simulate patient–physician interactions using GPT-4.1 and GPT-4o-mini agents to generate clinician questions, patient responses, and intermediate diagnostic hypotheses. Performance is evaluated using Zero-shot Diagnostic Accuracy (ZDA), Mean Questions to Correct Diagnosis (MQD), and Interrogation Sequence Efficiency (ISE), which measures convergence toward the correct diagnosis. Q4Dx provides a reusable framework for benchmarking large language models in strategic clinical inquiry and supports future research on AI-assisted diagnostic training. The dataset and benchmark are publicly available at: https://github.com/MaiWert/MedQDx.

Keywords: Diagnostic reasoning, Interactive questioning, Large language models (LLMs), Clinical dialogue simulation, Medical AI benchmark

Subject terms: Computational biology and bioinformatics, Health care, Mathematics and computing, Medical research

Introduction

The rapid adoption of large language models (LLMs) in healthcare has sparked growing interest in their potential to perform medical diagnoses, assist in clinical decision-making, and interpret patient narratives. Recent studies have demonstrated that general-purpose and medically fine-tuned LLMs can achieve impressive performance on benchmark datasets such as MedQA, PubMedQA, and MMLU Medical1–3, which typically frame diagnosis as a static classification task based on fully specified inputs. However, these benchmarks fall short of capturing the dynamic and interactive nature of real-world diagnostic reasoning.

In clinical practice, physicians rarely receive complete information upfront. Instead, diagnosis unfolds through iterative exploration: a process of asking targeted questions, interpreting responses, and refining hypotheses. This interactive diagnostic loop is critical, particularly when dealing with layperson expressions, ambiguous descriptions, or missing details. However, existing evaluation protocols for LLMs overlook this fundamental aspect, leaving a gap in our understanding of how they might support or simulate dialogue-based clinical reasoning.

To address this gap, we propose Q4Dx, a benchmark for evaluating LLMs in interactive diagnosis via patient interrogation. Our benchmark simulates realistic diagnostic uncertainty by initially revealing only partial symptom descriptions. The model must then proactively generate follow-up questions to uncover additional information, refining its diagnosis after each response. This interactive setting mirrors the decision-making challenges faced by human clinicians, enabling a more rigorous evaluation of model or human performance under uncertainty.

To construct Q4Dx, we sample disease records and the associated symptom sets from a structured clinical database. We first construct naturalistic narratives describing complete patient cases. From each of these, we derive partial variants exposing approximately 80% or 50% of the original symptoms. We simulate diagnostic dialogues between two autonomous agents, using models from the GPT family. One agent serves as the interrogating clinician, while the other acts as the patient, answering based on a hidden ground-truth diagnosis. After each answer, the clinician-agent attempts a diagnosis, producing a sequence of question–answer–diagnosis steps that capture the agent’s reasoning trajectory.

Our contributions are fourfold:

  1. We introduce Q4Dx, the first benchmark specifically designed to evaluate large language models (LLMs) in interactive diagnostic reasoning through patient interrogation.

  2. We present a controlled data generation pipeline that creates naturalistic case descriptions with varying degrees of symptom completeness.

  3. We release a structured dataset of diagnostic dialogues that supports multi-level evaluation across interaction steps, question quality, and diagnosis accuracy.

  4. We demonstrate how the model’s questioning trajectory can serve as a reference for training and assessing human diagnostic performance in under-specified clinical scenarios.

By emphasizing not only outcomes but also the interactive reasoning process, Q4Dx provides a novel foundation for developing and evaluating LLMs capable of adaptive, goal-driven clinical dialogue, while also offering a reusable benchmark for comparing human and model performance in strategic diagnostic reasoning.

Literature review

Recent research on medical question answering with large language models spans static exam-style QA, interactive diagnostic dialogue, and conversational multi-agent systems. While static benchmarks primarily assess medical knowledge, interactive and multi-agent approaches aim to model the adaptive questioning, reasoning, and collaboration observed in real clinical practice4.

Static question-answering

Recent work has demonstrated that LLMs can approach expert performance on medical question-answering (QA) tasks, particularly when tailored to the clinical domain. For example, Singhal et al.1 present Med-PaLM 2, an LLM fine-tuned with retrieval and ensemble methods, which achieves 86.5% accuracy on the USMLE-style MedQA benchmark, a 19% improvement over its predecessor. Med-PaLM 2 also demonstrated significant improvements on related datasets, including MedMCQA and PubMedQA, and was preferred by physicians in human evaluations. These results suggest that LLMs can encode substantial clinical knowledge when optimized. However, most such evaluations have focused on static question answering (e.g., exam questions) rather than the real-world interactive process of diagnosis.

Strategic reasoning can also be enhanced by employing prompting techniques. For example, chain-of-thought or self-reflection prompts encourage longer, detailed reasoning. Chen et al.5 compared chain-of-thought, self-consistency, and multi-agent prompting, finding that multi-agent methods yielded richer reasoning paths and higher accuracy than simple CoT or single-pass prompting. Similarly, retrieval-augmented prompts (as in Med-PaLM 2) ground answers in external knowledge, which is especially important when follow-up questions depend on up-to-date medical information1.

Interactive diagnostic dialogues

Unlike conventional QA benchmarks, real-world diagnosis is inherently multi-turn and adaptive. In conversational QA, users iteratively ask related questions, and the system must leverage dialogue context to provide answers6. In medical settings, the patient-doctor dialogue exemplifies this: each answer provides new evidence that informs the subsequent question and the differential diagnosis. Thus, strategic QA requires the model not only to produce answers, but to plan subsequent inquiries based on partial information.

LLMs are now being developed explicitly for interactive diagnostic reasoning. Tu et al.7 describe AMIE, a conversational AI trained via self-play simulation for diagnostic dialogue. AMIE was evaluated in an OSCE-style study with patient actors, outperforming primary care physicians in diagnostic accuracy and communication metrics in many cases. In effect, AMIE exemplifies an LLM that asks follow-up questions to refine its differential diagnosis. Similarly, Wu et al.8 propose a Diagnostic-Reasoning Chain-of-Thought (DR-CoT) prompting strategy. By mimicking the clinician’s reasoning steps, DR-CoT elicited significant gains, with diagnostic accuracy improving by ~15% (and 18% out-of-domain) compared to standard LLM prompting. Wu et al. also explicitly model the clinical workflow as an iterative QA process: the physician asks about symptoms, updates a differential diagnosis list, and then asks the next question based on that list. In this view, the doctor’s hidden chain of thought guides which queries to make next.

Conversational multiagent approaches

Large-scale multi-agent strategies have also been explored. Chen et al.9 introduce a Multi-Agent Conversation (MAC) framework that simulates a team of “doctor” agents and a supervisor agent collaborating on diagnosis. MAC achieved higher accuracy on rare-disease consultations than single-agent LLMs and outperformed chain-of-thought or self-refinement methods. Notably, MAC generated substantially more output (in terms of questions and reasoning) than simpler prompting techniques, enabling it to explore diverse diagnostic paths. These examples underscore that a richer, more strategic interrogation, through chaining, teamwork, or simulation, can significantly enhance LLM diagnostic performance.

Simulation environments have also been developed. Yu et al.10 introduce a language-model role-playing framework in which an LLM simulates the patient, enabling fully free-text dialogues. Google’s AMIE was trained via self-play in an automated dialogue-learning environment. Other platforms rely on standardized patients, human actors, or LLMs to simulate patients. Such simulations enable the evaluation of an LLM’s strategic questioning; for example, Tu et al.7 used OSCE-style text chats with patient actors to assess AMIE across 149 cases.

Available datasets

Research on diagnostic QA relies on a variety of datasets and benchmarks, but few directly evaluate multi-turn questioning strategy. Representative resources include:

MedDialog11: An English/Chinese corpus of doctor–patient chats (≈0.3 M English and 1.1 M Chinese dialogues; the Chinese portion contains roughly 4 M utterances). MedDialog provides realistic conversational data (from haodf.com) but little in the way of diagnostic labels or tasks.

KaMed12: A Chinese medical dialogue corpus (60 K consultations) enriched with external medical knowledge (entities from CMeKG). KaMed is used to train knowledge-aware chatbots, but tasks are typically limited to generating plausible responses given the history.

IMCS-2113: A multi-level annotated pediatric consultation dataset (4,116 dialogues, 164 K utterances) designed as a benchmark. It defines five tasks, including medical named entity recognition (NER), symptom labelling (present/absent/unknown), dialogue act classification, report generation, and diagnostic decision policy (selecting the next question). By providing fine-grained labels for each utterance, IMCS-21 supports research on question-asking strategy (e.g., predicting which symptom to ask about next).

GMD-1214: A Chinese dataset focused on diagnosis dialogue with interpretable inquiry logic. It contains ~17 K conversations with structured symptom annotations. The authors propose a modular model that mimics doctors’ decision-making process, achieving significant accuracy gains. This highlights the value of interpretable question policies in data-driven systems.

USMLE-style QA datasets (e.g., MedQA15, MedMCQA16) are widely used to assess medical knowledge through multiple-choice questions. Singhal et al.1 utilized these to demonstrate LLM progress; however, these tasks do not involve multi-turn dialogue or sequential inference.

DDXPlus17: A synthetic dataset of ~1.3 million patient cases covering 49 pathologies. Each case includes a comprehensive differential diagnosis, the ground-truth pathology, and detailed information on symptoms and antecedents. DDXPlus is designed to train and evaluate automated symptom checkers, emphasizing that incorporating differential-diagnosis data improves learning efficiency and interpretability.

MDD-5k18: The largest Chinese dataset for psychiatric diagnosis, consisting of 5,000 long patient–psychiatrist conversations generated via a neuro-symbolic LLM framework. Each dialogue is labelled with the correct diagnosis. MDD-5k demonstrates that LLMs can synthesize realistic diagnostic interviews; however, it is currently domain-specific (mental health) and synthetic in nature.

CPC/Vignette datasets: Collections of clinical vignettes (e.g., in JAMIA or educational settings) with target diagnoses. These tools allow for the assessment of reasoning but are typically designed for single-turn interactions, presenting all information at once.

Research on LLMs for diagnostic applications has advanced rapidly, but significant gaps remain. Many studies highlight that clinical reasoning is naturally collaborative and iterative, yet few benchmarks truly reflect this aspect. Chen et al.13 explicitly identify a “large gap” in frameworks and datasets for automated medical consultation, noting that most existing corpora are designed for response generation and feature only coarse annotations, which limits their usefulness for training systems to ask questions strategically. They and other researchers also note that available datasets are typically small, comprising only hundreds to a few thousand dialogues, thereby limiting models’ capacity to learn effective multi-turn policies.

Evaluation methods have mainly focused on the accuracy and completeness of differential diagnoses. For instance, Wu et al. reported that clinicians assisted by AMIE produced “more comprehensive” differentials, while also acknowledging that complex cases still challenge the model’s holistic reasoning. This highlights the need for evaluation criteria that assess the quality of questioning strategies, including the speed with which the correct diagnosis is reached, the relevance of the questions asked, and the effectiveness of communication. Although multi-step question-answering tasks in fields such as gaming and education use metrics like information gain or turn count, similar standards have not yet been widely adopted in medicine.

In summary, prior work on conversational medical diagnosis, including datasets such as IMCS-21 and MDD-5k, and systems like AMIE, has substantially advanced the modeling of clinician–patient interactions and diagnostic reasoning. However, these efforts primarily assess dialogue quality or overall diagnostic accuracy and do not explicitly evaluate the strategic inquiry process itself, including the efficiency and informativeness of individual questions. Q4Dx complements this body of work by focusing on question selection as the primary evaluation target and introducing efficiency-focused metrics, intermediate diagnostic assessments, and controlled information exposure to enable systematic comparison of diagnostic inquiry strategies.

Methodology

Q4Dx dataset generation

The Q4Dx dataset is constructed through a step-by-step, controlled-generation process designed to isolate the roles of information availability and questioning strategy in diagnostic reasoning.

  1. Selection of disease–symptom anchors Dataset generation begins by sampling from the SDPD (Symptoms and Disease Pairings) dataset19, which contains labeled associations between representative symptom sets and their corresponding diagnoses across a wide range of medical conditions. From this source, we randomly select 100 disease–symptom pairings. Each selected pairing serves as a fixed diagnostic anchor, defining a realistic disease profile together with its associated symptoms. Fixing the diagnosis at this stage ensures that all downstream case variants remain comparable and differ only in the information revealed to the model.

  2. Natural-language case realization For each selected disease–symptom pairing, we generate synthetic patient self-descriptions in free-form natural language using GPT-4o-mini. This step converts structured symptom lists into realistic patient narratives that reflect how symptoms are typically described in clinical encounters rather than as explicit medical inventories.

  3. Controlled symptom disclosure To model varying levels of initial uncertainty, we generate three versions of each patient description that differ in the proportion of symptoms revealed: a full-disclosure case (100%), a medium-disclosure case (approximately 80%), and a low-disclosure case (approximately 50%). All variants correspond to the same underlying diagnosis but progressively omit diagnostically relevant details. This controlled reduction of the initial evidence mirrors real clinical scenarios in which patients often provide incomplete information and enables systematic analysis of how models adapt their inquiry strategies as uncertainty increases. Representative examples are shown in Table 1.

  4. Verification of information degradation To validate the generation process, we quantify the overlap between symptoms expressed in each generated description and the reference symptom set. Specifically, we compute the Jaccard similarity (Eq. 1) between the reference symptom list A from the curated disease–symptom record and the symptom set B extracted from each generated case. As expected, similarity decreases with lower disclosure levels (0.37 for 100%, 0.32 for 80%, and 0.23 for 50%), confirming a systematic reduction in information. The relatively low absolute values reflect lexical variation, paraphrasing, and limitations of symptom extraction from free-text narratives.
    J(A, B) = |A ∩ B| / |A ∪ B|  (1)
  5. Interactive dialogue simulation and reference generation Using the generated cases, we simulate diagnostic interactions with two LLMs assuming distinct roles. A patient agent, powered by GPT-4o-mini, has full access to the complete symptom set associated with the selected diagnosis and responds consistently to physician inquiries. A physician agent, implemented with GPT-4.1, is initially presented with a partial-disclosure case description (50% or 80%), reflecting realistic diagnostic conditions in which clinicians begin with incomplete information. GPT-4.1 is intentionally chosen for the physician role due to its stronger reasoning capabilities relative to GPT-4o-mini, allowing it to generate high-quality diagnostic question sequences that are retained as reference interrogation strategies within the dataset.

  6. Dialogue recording and dataset construction For each simulated interaction, we record the full conversational trace, including the sequence of physician questions, patient responses, and intermediate diagnostic decisions. These traces form the core of the Q4Dx dataset, capturing the evolution of diagnostic reasoning under incremental information disclosure.

    The overall workflow is illustrated in Fig. 1.

  7. Limitations Q4Dx uses LLM-generated reference question sequences rather than clinician-authored ones, which is a key limitation. Although disease–symptom associations are grounded in a real, curated dataset, the diagnostic questioning trajectories are synthetic. This choice reflects practical constraints, as collecting large-scale, multi-turn traces of clinician reasoning under controlled information disclosure is costly and difficult to standardize. GPT-4.1 was used to generate consistent reference sequences due to its stronger reasoning and medical QA performance; these are not treated as clinical ground truth, but as reference policies for relative model comparison. The reference sequences are used to support relative benchmarking of question selection and efficiency, not to assert clinical optimality.
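The Jaccard check in step 4 above can be sketched in a few lines of Python; this is a minimal illustration assuming symptom extraction has already been performed, and the example sets are invented for demonstration rather than drawn from the dataset.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two symptom sets (Eq. 1)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are conventionally identical
    return len(a & b) / len(a | b)

# Illustrative sets: symptoms from a curated record vs. those recovered
# from a 50%-disclosure narrative (hypothetical values, not dataset entries).
reference = {"fatigue", "weight loss", "leg weakness", "urinary urgency"}
extracted = {"weight loss", "urinary urgency"}
print(jaccard_similarity(reference, extracted))  # 2 shared / 4 total = 0.5
```

As in the paper, lower disclosure levels shrink the extracted set B while the reference A stays fixed, so the ratio falls monotonically with the amount of omitted information.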

Table 1.

Generated self-descriptions with varying levels of symptom disclosure.

Symptoms revealed Case description
100% I am 58 years old, and over the past few weeks, I have been feeling increasingly exhausted, regardless of how much I rest. I have also lost significant weight without trying. Lately, my legs have started to feel weak, and it is becoming increasingly difficult for me to walk or perform simple tasks around the house. On top of that, I have been having trouble controlling my bladder and often feel a sudden, urgent need to urinate that I cannot always hold back. My doctors ran some scans, and they told me that there is a tumor affecting my spine.
80% I am 58 years old, and recently I have been feeling very weak and tired all the time. I have also unintentionally lost weight. My legs have become weak, and it is becoming increasingly difficult for me to move around. Lately, I have been having trouble controlling my bladder and often feel an urgent need to urinate that I cannot hold back.
50% I am 58 years old, and I have been unintentionally losing weight. Lately, I have also had trouble controlling my bladder, often feeling a sudden, urgent need to urinate that I cannot always manage to hold back.

Fig. 1. Q4Dx data generation and simulation workflow.

Benchmarking LLM models using the Q4Dx dataset

We utilize the generated dataset to benchmark the diagnostic capabilities of several state-of-the-art models, including both general-purpose language models and models specifically designed for healthcare. In this evaluation, GPT-4.1 serves as our reference model for diagnostic reasoning, while comparisons are made with alternative models under identical conditions.

Model performance is assessed using three primary evaluation metrics:

Zero-Shot Diagnostic Accuracy (ZDA): ZDA measures a model’s ability to produce the correct final diagnosis from the initial patient description alone, without any interactive questioning. A prediction is considered correct if the model’s top-ranked diagnosis exactly matches the reference diagnosis assigned to the case, with both selected from a predefined diagnosis list. ZDA is computed only for cases with complete symptom disclosure (100% reveal), where all diagnostically relevant information is already available, and interactive questioning would be unnecessary.
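Given parallel lists of model predictions and reference diagnoses for the 100%-disclosure cases, ZDA reduces to exact-match accuracy. A minimal sketch follows; the function name and the example labels are illustrative, not taken from the benchmark code.

```python
def zero_shot_diagnostic_accuracy(predictions, references):
    """Fraction of cases whose top-ranked diagnosis exactly matches the reference.

    Both lists hold diagnosis labels from the predefined diagnosis list;
    per the metric definition, only 100%-disclosure cases should be passed in.
    """
    assert len(predictions) == len(references), "lists must be parallel"
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(zero_shot_diagnostic_accuracy(
    ["Gastritis", "GERD", "Dyspepsia"],
    ["Gastritis", "GERD", "Gastritis"]))  # 2 of 3 exact matches
```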

Mean Questions to Correct Diagnosis (MQD). MQD quantifies diagnostic efficiency by computing the average number of questions a model asks before producing the correct diagnosis for the first time. A question is counted if it is explicitly posed to the patient agent and elicits new clinical information. Interrogation is capped at five questions to standardize evaluation across models; if the correct diagnosis is not produced within this limit, the case is marked as a failure and assigned the maximum question count. MQD is then averaged over all evaluated cases, with failed cases contributing the capped value.
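The MQD computation described above can be sketched as follows, applying the five-question cap to failed cases. The per-case counts in the example are illustrative; deciding whether a question "elicits new clinical information" happens upstream and is not modelled here.

```python
MAX_QUESTIONS = 5  # interrogation cap used in the benchmark

def mean_questions_to_diagnosis(question_counts):
    """Average number of questions before the first correct diagnosis.

    Each entry is the 1-based index of the question after which the model
    first produced the correct diagnosis, or None if it never did within
    the cap. Failed cases contribute the capped value MAX_QUESTIONS.
    """
    counts = [MAX_QUESTIONS if n is None else min(n, MAX_QUESTIONS)
              for n in question_counts]
    return sum(counts) / len(counts)

# Three solved cases plus one failure: (1 + 3 + 5 + 2) / 4
print(mean_questions_to_diagnosis([1, 3, None, 2]))  # 2.75
```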

Interrogation Sequence Efficiency (ISE). ISE assesses the quality of a model’s questioning strategy by comparing its generated question sequence against a reference sequence produced during dataset construction. Semantic similarity between individual model-generated and reference questions is computed using BERTScore, which captures differences in phrasing and wording. To account for differences in question ordering, we apply sequence alignment to identify the mapping between model-generated and reference questions that maximizes the cumulative BERTScore across the two sequences. This alignment allows questions to be matched even when they appear in different positions, while insertions and deletions penalize extra or missing questions. Higher ISE values indicate that the model selects semantically appropriate diagnostic questions and avoids unnecessary or omitted inquiries, with question order playing a secondary role relative to question selection.
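One plausible realization of the ISE alignment is a brute-force search over one-to-one matchings between model and reference questions, which is cheap here because dialogues are capped at five questions. In this self-contained sketch, BERTScore is replaced by a simple token-overlap proxy so the code runs without model downloads; a real implementation would substitute the `bert_score` similarity, and the normalization choice is an assumption, not the paper's exact formula.

```python
from itertools import permutations

def token_overlap(q1, q2):
    """Stand-in for BERTScore: Jaccard overlap of lowercase tokens."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def ise_score(model_qs, ref_qs, sim=token_overlap):
    """Best one-to-one matching of model questions to reference questions.

    Maximizes total pairwise similarity regardless of question order;
    unmatched questions contribute nothing, so extra or missing inquiries
    lower the normalized score. Brute force is fine for <= 5 questions.
    """
    if not ref_qs or not model_qs:
        return 0.0
    short, long_ = sorted([model_qs, ref_qs], key=len)
    best = 0.0
    for perm in permutations(range(len(long_)), len(short)):
        total = sum(sim(short[i], long_[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(ref_qs)  # missing reference questions reduce the score
```

With identical question sets in a different order, the matching still pairs each question with its counterpart, so the score is unaffected by ordering, consistent with the statement that question order plays a secondary role.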

Results

We evaluate the diagnostic reasoning capabilities of large language models using the Q4Dx benchmark, focusing on both zero-shot and interactive performance.

In the Zero-shot Diagnostic Accuracy (ZDA) setting, the model is asked to generate a diagnosis based solely on the initial patient description, without posing any questions. Performance in this setting depends heavily on the completeness of the available clinical information. Specifically, GPT-4.1 achieves a ZDA of 51% when provided with the complete set of symptoms (100% case), which declines to 44% and 37% when only 80% and 50% of the symptoms are available, respectively. This trend confirms that the richness of the initial input significantly influences diagnostic accuracy.

Beyond static prediction, we evaluate interactive diagnostic performance by analyzing the model’s ability to converge on the correct diagnosis through multi-turn questioning. We report a Mean Questions to Correct Diagnosis (MQD) of 2.4, indicating that the model typically identifies the correct condition within just a few targeted interactions, well before reaching the maximum allowable five-question limit. This highlights the model’s diagnostic efficiency and its ability to rapidly narrow the differential diagnosis.

To further assess the quality of the model’s inquiry strategy, we introduce the Interrogation Sequence Efficiency (ISE) metric, which quantifies how closely the model’s sequence of questions aligns with the LLM-generated reference questioning paths provided in the dataset. Using the BERTScore similarity metric, we measure semantic overlap between each model-generated question and its corresponding reference counterpart. The ISE analysis reveals a gradual improvement in alignment across rounds, with average similarity scores increasing from 0.623 in the first round to 0.635 in the third, and the mean of the maximum similarity across rounds reaching 0.657. These results suggest that the model’s question selection becomes progressively more targeted and diagnostically informative with each turn.

To illustrate this interactive process, Table 2 presents a representative example in which a patient reports gastrointestinal symptoms. At each round, the model poses a clarifying question, receives a response, and updates its diagnostic hypothesis accordingly:

Table 2.

An example dialogue during model evaluation using Q4Dx.

Round Question Answer Diagnosis ISE
1 Have you noticed if your stomach pain or acidity symptoms are related to eating specific foods or occurring at particular times of the day? I have not noticed that. Dyspepsia 0.4787
2 Have you experienced any unintentional weight loss, changes in appetite, or difficulty swallowing along with your stomach pain? No, I have not noticed any unintentional weight loss, changes in appetite, or difficulty swallowing Gastritis 0.5124
3 Have you had any episodes of black or tarry stools, blood in your stool, or noticed any changes in your bowel habits recently? No, I have not noticed any episodes of black or tarry stools, blood in my stool, or changes in my bowel habits recently. Gastroesophageal reflux disease (GERD) 1.0000

This example demonstrates how the model’s line of questioning becomes increasingly aligned with the reference diagnostic reasoning. By the third round, the model not only arrives at the correct diagnosis but also poses a question that is semantically identical to the reference question, as captured by a perfect BERTScore. The combination of strong ISE scores, low MQD, and the ZDA results underlines the model’s capacity to simulate meaningful and efficient clinical dialogue under varying degrees of information availability.

Conclusions and future research

This work introduces Q4Dx, the first benchmark explicitly designed to evaluate LLMs in interactive diagnostic reasoning, where performance depends on both medical knowledge and the strategic formulation of clarifying questions. By modelling under-specified clinical scenarios and multi-turn patient–physician interactions, Q4Dx addresses a key gap in existing diagnostic evaluation frameworks.

We present a controlled data-generation pipeline that produces realistic clinical cases with varying levels of information completeness and constructs structured multi-turn diagnostic dialogues between physician and patient agents. We further introduce task-specific metrics, Zero-shot Diagnostic Accuracy (ZDA), Mean Questions to Correct Diagnosis (MQD), and Interrogation Sequence Efficiency (ISE), to jointly assess diagnostic accuracy and inquiry efficiency.

Baseline experiments with GPT-4.1 and related models show that diagnostic performance degrades under partial information and improves through strategic dialogue, highlighting the need for goal-directed reasoning capabilities in high-stakes clinical settings. Beyond benchmarking, Q4Dx provides a foundation for studying and training diagnostic inquiry, enabling systematic comparison between LLMs and human clinicians and supporting future extensions to broader clinical scenarios and evaluation settings.

Future work will extend Q4Dx along several dimensions, including expanding the benchmark to larger and more diverse case sets, incorporating multiple baseline models and variance-aware evaluation, and exploring alternative reference strategies to better capture the multiplicity of valid diagnostic inquiry paths. In addition, integrating human-authored dialogues or clinician-in-the-loop validation remains an important direction for strengthening clinical realism and interpretability.

Author contributions

All authors contributed to the study conception and design. Conceptualization: A.A., Y.A.; Methodology: A.A., M.W., M.K., Y.A.; Formal analysis and investigation: A.A., M.W., M.K., Y.A.; Code: M.W., M.K.; Supervision: A.A.; Writing—original draft preparation: A.A., Y.A.; Writing—review and editing: A.A., Y.A. All authors read and approved the final manuscript.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Data availability

The dataset and benchmark are publicly available at: https://github.com/MaiWert/MedQDx.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Singhal, K. et al. Toward expert-level medical question answering with large language models. NPJ Digit. Med. 8(1), 10–23 (2025).
  • 2. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620(7972), 172–180 (2023).
  • 3. Liu, F. et al. Large language models in the clinic: A comprehensive benchmark. arXiv preprint arXiv:2405.00716 (2024).
  • 4. Wang, W. et al. A survey of LLM-based agents in medicine: How far are we from Baymax? arXiv preprint arXiv:2502.11211 (2025).
  • 5. Chen, Y. et al. CoMM: Collaborative multi-agent, multi-reasoning-path prompting for complex reasoning with large language models. arXiv preprint arXiv:2404.17729 (2024).
  • 6. Liu, W. et al. MedDG: An entity-centric medical consultation dataset for entity-aware medical dialogue generation. In CCF International Conference on Natural Language Processing and Chinese Computing, 447–459 (Springer, Cham, 2022).
  • 7. Tu, T. et al. Towards conversational diagnostic AI. arXiv preprint arXiv:2401.05654 (2024).
  • 8. Wu, C. K., Chen, W. L. & Chen, H. H. Large language models perform diagnostic reasoning. arXiv preprint arXiv:2307.08922 (2023).
  • 9. Chen, X. et al. Enhancing diagnostic capability with multi-agents conversational large language models. NPJ Digit. Med. 8(1), 159 (2025).
  • 10. Yu, H. et al. Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education. Commun. Med. 6, 27 (2025).
  • 11. He, X. et al. MedDialog: Two large-scale medical dialogue datasets. arXiv preprint arXiv:2004.03329 (2020).
  • 12. Li, D. et al. Semi-supervised variational reasoning for medical dialogue generation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 544–554 (2021).
  • 13. Chen, W. et al. A benchmark for automatic medical consultation system: Frameworks, tasks and datasets. Bioinformatics 39(1), btac817 (2023).
  • 14. Liu, W. et al. "My nose is running. Are you also coughing?": Building a medical diagnosis agent with interpretable inquiry logics. arXiv preprint arXiv:2204.13953 (2022).
  • 15. Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11(14), 6421 (2021).
  • 16. Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, 248–260 (PMLR, 2022).
  • 17. Fansi Tchango, A. et al. DDXPlus: A new dataset for automatic medical diagnosis. Adv. Neural Inf. Process. Syst. 35, 31306–31318 (2022).
  • 18. Yin, C. et al. MDD-5k: A new diagnostic conversation dataset for mental disorders synthesized via neuro-symbolic LLM agents. In Proceedings of the AAAI Conference on Artificial Intelligence 39(24), 25715–25723 (2025).
  • 19. Tucker, J. SymbiPredict, Mendeley Data, V1. 10.17632/dv5z3v2xyd.1 (2024).


