2024 Jan 24;7:20. doi: 10.1038/s41746-024-01010-1

Table 5.

GPT-4 challenge set performance with differential diagnosis reasoning prompts compared to traditional chain-of-thought (CoT) prompts.

| Prompt | Correct responses (%) | Difference in percentage (confidence interval) | p value |
| --- | --- | --- | --- |
| Chain of thought | 38% | | |
| Differential diagnosis | 34% | −4.2% (−11.4%, +2.1%) | 0.09 |

GPT-4 performance on the NEJM challenge question set with both traditional chain-of-thought and differential diagnosis reasoning prompting.