2024 Jan 24;7:20. doi: 10.1038/s41746-024-01010-1

Table 3.

GPT-3.5 MEDQA performance with diagnostic reasoning prompts compared with traditional chain of thought (CoT).

Prompt	Correct responses (%)	Difference in percentage (confidence interval)	p value^a
Chain of thought	46%	–	–
Intuitive reasoning	48%	1.7% (−2.5%, 5.9%)	0.4
Analytic reasoning	40%	−6.0% (−11%, −1.5%)	0.001
Differential diagnosis	38%	−8.9% (−14%, −3.4%)	<0.001
Bayesian inference	42%	−4.4% (−9.1%, 0.2%)	0.02

GPT-3.5 performance on a free-response MEDQA question set using the traditional chain-of-thought prompting strategy and four clinical reasoning prompts: intuitive reasoning, analytic reasoning, differential diagnosis, and Bayesian inference.

^aPercentage differences and p values are relative to the traditional chain-of-thought prompt.