Abstract
This cross-sectional study assesses the ability of a large language model to process medical data and display clinical reasoning compared with that of attending physicians and residents.
Large language models (LLMs) have shown promise in clinical reasoning, but their ability to synthesize clinical encounter data into problem representations remains unexplored.1,2,3 We compared an LLM’s reasoning abilities against human performance using standards developed for physicians.
Methods
We recruited internal medicine residents and attending physicians at 2 academic medical centers in Boston, Massachusetts, from July to August 2023. We used 20 clinical cases, each comprising 4 sections representing sequential stages of clinical data acquisition.4 We developed a survey instructing physicians to write a problem representation and a prioritized differential diagnosis with justifications for each section (eTable 1 in Supplement 1). Each physician received the survey with 1 randomly selected case (4 sections). We developed a prompt with identical instructions (eTable 2 in Supplement 1) and ran all sections in GPT-4 (OpenAI) on August 17-18, 2023.5 The Massachusetts General Hospital and Beth Israel Deaconess Medical Center Institutional Review Boards deemed this cross-sectional study exempt from review. Written informed consent was obtained. We followed the STROBE reporting guideline.
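As a rough illustration of how a case section can be submitted to GPT-4 programmatically, the sketch below posts one section to the OpenAI chat completions endpoint from R (the language used for the study analyses). This is not the authors' script: the prompt text is reduced to placeholders (the full instructions are in eTable 2 in Supplement 1), and all object names are illustrative assumptions.

```r
# Minimal sketch (not the authors' exact code): send one case section to GPT-4
# via the OpenAI chat completions REST endpoint using httr.
library(httr)

prompt_instructions <- "..."  # placeholder for the instructions in eTable 2
case_section_text   <- "..."  # placeholder for one of the 4 sequential case sections

resp <- POST(
  url = "https://api.openai.com/v1/chat/completions",
  add_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))),
  body = list(
    model = "gpt-4",
    messages = list(
      list(role = "system", content = prompt_instructions),
      list(role = "user",   content = case_section_text)
    )
  ),
  encode = "json"
)

# The model's problem representation and prioritized differential for this section
chatbot_answer <- content(resp)$choices[[1]]$message$content
```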
The primary outcome was the Revised-IDEA (R-IDEA) score, a validated 10-point scale evaluating 4 core domains of clinical reasoning documentation (eTable 3 in Supplement 1).6 To establish reliability, we (D.R., Z.K., A.R.) independently scored 29 section responses from 8 nonparticipants, showing substantial scoring agreement (mean Cohen weighted κ = 0.61). Secondary outcomes included the presence of correct and incorrect reasoning, diagnostic accuracy, and inclusion of cannot-miss diagnoses (eMethods in Supplement 1). Scorers were blinded to respondent type (chatbot, attending, or resident).
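For readers unfamiliar with the reliability metric, the sketch below shows how a Cohen weighted κ for one rater pair might be computed in R. The scores are invented for illustration (they are not study data), and the quadratic weighting is an assumption rather than a detail reported here; the authors' exact procedure is described in the eMethods.

```r
# Minimal sketch: Cohen weighted kappa for one pair of raters scoring the same
# section responses on the 0-10 R-IDEA scale. Values below are illustrative only.
library(psych)

ratings <- data.frame(
  rater_1 = c(10, 8, 7, 9, 6, 4),
  rater_2 = c( 9, 8, 6, 9, 7, 5)
)

ck <- cohen.kappa(ratings)   # uses quadratic (squared) weights by default
ck$weighted.kappa            # weighted kappa for this rater pair
# Repeating this for each rater pair and averaging yields a mean weighted kappa.
```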
Descriptive statistics were calculated for all outcomes. R-IDEA scores were binarized as low (0-7) or high (8-10). Associations between respondent type and score category were evaluated using mixed-effects logistic regression, with random effects accounting for repeated responses from the same respondent (eMethods in Supplement 1). Significance was defined as 2-sided P < .05. Analyses were performed using R version 4.2.1 (R Core Team).
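A minimal sketch of this analysis is shown below, assuming a long-format data frame with one row per completed section and columns r_idea (0-10), respondent_type, and participant_id. The column names are illustrative, and the random-intercept specification is a plausible reading of the description above rather than the authors' exact model.

```r
# Minimal sketch of the primary analysis under the stated assumptions.
library(lme4)
library(emmeans)

d$high_score <- as.integer(d$r_idea >= 8)   # binarize: high (8-10) vs low (0-7)

fit <- glmer(
  high_score ~ respondent_type + (1 | participant_id),  # random intercept per respondent
  data   = d,
  family = binomial
)

summary(fit)

# Estimated probability of a high R-IDEA score for each respondent type,
# back-transformed to the probability scale with 95% CIs.
emmeans(fit, ~ respondent_type, type = "response")
```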
Results
The sample included 21 attendings and 18 residents, each of whom provided responses to a single case; the chatbot provided responses to all 20 cases. A total of 236 sections were completed, yielding 232 unique combinations of respondent type and section.
Median (IQR) R-IDEA scores were 10 (9-10) for the chatbot, 9 (6-10) for attendings, and 8 (4-9) for residents (Table). In the logistic regression analysis, the chatbot had the highest estimated probability of achieving a high R-IDEA score (0.99; 95% CI, 0.98-1.00), followed by attendings (0.76; 95% CI, 0.51-1.00) and residents (0.56; 95% CI, 0.23-0.90); the chatbot's probability was significantly higher than that of attendings (P = .002) and residents (P < .001) (Figure). In Wilcoxon signed-rank tests, the chatbot had significantly higher R-IDEA scores than attendings (154; P = .003) and residents (127; P = .002). Attendings' scores did not differ significantly from residents' scores.
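For illustration, a paired comparison of this kind could be run in R as below, assuming chatbot and attending R-IDEA scores have been matched by case and section into a data frame; the object and column names are assumptions for illustration, not the authors' code.

```r
# Minimal sketch: paired Wilcoxon signed-rank test comparing matched
# section-level R-IDEA scores for the chatbot vs attendings.
wilcox.test(matched$chatbot_score, matched$attending_score, paired = TRUE)
```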
Table. Descriptive Statistics of Clinical Reasoning Outcomes Stratified by Respondent Type.
| Outcome | All (N = 232)ᵃ,ᵇ | Chatbot (n = 80) | Attending physician (n = 80)ᵃ | Resident (n = 72) |
|---|---|---|---|---|
| R-IDEA score | | | | |
| Median (IQR) | 9 (7-10) | 10 (9-10) | 9 (6-10) | 8 (4-9) |
| Score group, No. (%) | | | | |
| Low: 0-7 | 90 (38.8) | 4 (5.0) | 29 (36.3) | 33 (45.8) |
| High: 8-10 | 142 (61.2) | 76 (95.0) | 51 (63.8) | 39 (54.2) |
| Correct clinical reasoning instances, No. (%) | 222 (95.7) | 80 (100) | 79 (98.8) | 63 (87.5) |
| Incorrect clinical reasoning instances, No. (%) | 23 (9.9) | 11 (13.8) | 10 (12.5) | 2 (2.8) |
| Diagnostic accuracy, % | | | | |
| Median (IQR) | 100 (100-100) | 100 (100-100) | 100 (100-100) | 100 (100-100) |
| Score group, No. (%) | | | | |
| Low: <75% | 32 (13.8) | 8 (10.0) | 14 (17.5) | 10 (13.9) |
| High: 75%-100% | 200 (86.2) | 72 (90.0) | 66 (82.5) | 62 (86.1) |
| Cannot-miss diagnoses included, median (IQR), %ᵇ | 66.7 (33.3-100) | 66.7 (50.0-100) | 50.0 (27.1-100) | 66.7 (33.3-81.2) |
Abbreviation: R-IDEA, Revised IDEA.
ᵃ There was 1 instance in which 2 attending physicians provided responses for the same case. Means of section scores were calculated prior to analysis.
ᵇ For the cannot-miss diagnoses outcome, which reflects the percentage of cannot-miss diagnoses included in the initial differential (first section of each case), the sample size for all respondents was 52, with 18 responses each from chatbot and attending physicians and 16 responses from residents. Two cases without identified cannot-miss diagnoses were excluded.
Figure. Distribution of Revised-IDEA (R-IDEA) Scores.
Distribution of R-IDEA scores is displayed across 232 sections, stratified by respondent type: chatbot (n = 80), attending physicians (n = 80), and residents (n = 72). The dashed vertical line delineates the low score (0-7) and high score (8-10) categories used in the regression analysis. For visual clarity, scores of 0 were aggregated with scores of 1.
The chatbot performed similarly to attendings and residents in diagnostic accuracy, correct clinical reasoning, and inclusion of cannot-miss diagnoses. The median (IQR) inclusion rate of cannot-miss diagnoses in the initial differential was 66.7% (50.0%-100%) for the chatbot, 50.0% (27.1%-100%) for attendings, and 66.7% (33.3%-81.2%) for residents (Table). The chatbot had more frequent instances of incorrect clinical reasoning (13.8%) than residents (2.8%; P = .04) but not attendings (12.5%; P = .89).
Discussion
An LLM performed better than physicians at processing medical data and demonstrating clinical reasoning within recognizable frameworks, as measured by R-IDEA scores. Several other clinical reasoning outcomes showed no differences between physicians and the chatbot, although the chatbot had more instances of incorrect clinical reasoning than residents. This observation underscores the importance of multifaceted evaluation of LLM capabilities before their integration into clinical workflows.
Study limitations include that clinical data were provided in simulated cases; it is unclear how the chatbot would perform in real clinical scenarios. Physicians' vignette responses may also differ from their performance in clinical practice; however, physician R-IDEA scores in clinical documentation were lower than physician scores in this study.6 We also used a zero-shot approach for the chatbot's prompt. Iterative training could enhance LLM performance, suggesting that the results may have underestimated its capabilities. Future research should assess the clinical reasoning of the LLM-physician interaction, as LLMs will more likely augment, not replace, the human reasoning process.
eTable 1. Survey Instruction Given to Physician Participants
eTable 2. Prompt for GPT-4 That Includes the Same Instruction Given to Physician Participants but Included Prompt Engineering Principles From OpenAI
eTable 3. The Revised-IDEA Assessment Tool for Clinical Reasoning Documentation by Schaye and Colleagues
eMethods.
eReferences
Data Sharing Statement
References
- 1. Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78-80. doi:10.1001/jama.2023.8288
- 2. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180. doi:10.1038/s41586-023-06291-2
- 3. Strong E, DiGiammarino A, Weng Y, et al. Chatbot vs medical student performance on free-response clinical reasoning examinations. JAMA Intern Med. 2023;183(9):1028-1030. doi:10.1001/jamainternmed.2023.2909
- 4. Abdulnour REE, Parsons AS, Muller D, Drazen J, Rubin EJ, Rencic J. Deliberate practice at the virtual bedside to improve clinical reasoning. N Engl J Med. 2022;386(20):1946-1947. doi:10.1056/NEJMe2204540
- 5. OpenAI. Best practices for prompt engineering with OpenAI API. Accessed November 14, 2023. https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
- 6. Schaye V, Miller L, Kudlowitz D, et al. Development of a clinical reasoning documentation assessment tool for resident and fellow admission notes: a shared mental model for feedback. J Gen Intern Med. 2022;37(3):507-512. doi:10.1007/s11606-021-06805-6

