Table 5.

GPT-4 challenge set performance with differential diagnosis reasoning prompts compared to traditional chain-of-thought (CoT) prompts.

Prompt	Correct responses (%)	Difference in percentage points (confidence interval)	p value
Chain of thought	38%	–	–
Differential diagnosis	34%	−4.2% (−11.4%, +2.1%)	0.09

GPT-4 performance on the NEJM challenge question set with both traditional chain-of-thought and differential diagnosis reasoning prompting.