2024 Jan 24;7:20. doi: 10.1038/s41746-024-01010-1

Table 4.

GPT-4 MEDQA performance with diagnostic reasoning prompts compared with traditional chain-of-thought (CoT) prompting.

| Prompt | Correct responses (%) | Difference in percentage (confidence interval) | p value^a |
| --- | --- | --- | --- |
| Chain of thought | 76% | | |
| Intuitive reasoning | 77% | 0.8% (−3.6%, 5.2%) | 0.73 |
| Analytic reasoning | 78% | 1.6% (−2.4%, 5.6%) | 0.35 |
| Differential diagnosis | 78% | 2.2% (−2.3%, 6.7%) | 0.24 |
| Bayesian inference | 72% | −3.4% (−9.1%, 1.2%) | 0.07 |

GPT-4 performance on a free-response MEDQA question set with traditional chain-of-thought prompting and with four clinical reasoning prompts: intuitive reasoning, analytic reasoning, differential diagnosis, and Bayesian inference.

^a Percentage difference and p value are computed relative to traditional chain-of-thought.
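For readers who want to see how a difference in accuracy, its confidence interval, and a p value of this kind can be obtained, the following is a minimal sketch. It assumes that each prompt is scored on the same questions as chain-of-thought, so the comparison is paired, and it uses a Wald interval with an exact McNemar test. The table does not state which test or interval method produced its values, so this choice is an assumption, and the function names and the counts in the usage lines are hypothetical, not data from the study.

```python
import math


def paired_difference(b, c, n, z=1.96):
    """Difference in accuracy between a prompt and the baseline on paired items.

    b: questions the prompt answered correctly and the baseline did not
    c: questions the baseline answered correctly and the prompt did not
    n: total number of questions (hypothetical inputs, not study data)
    Returns the difference and an approximate 95% Wald confidence interval.
    """
    diff = (b - c) / n
    # Standard error of a difference between two paired proportions.
    se = math.sqrt(b + c - (b - c) ** 2 / n) / n
    return diff, (diff - z * se, diff + z * se)


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p value (binomial test on discordant pairs)."""
    m = b + c
    if m == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(math.comb(m, i) for i in range(k + 1)) / 2 ** m
    return min(1.0, 2 * tail)


# Hypothetical counts chosen only to illustrate the calls.
diff, (low, high) = paired_difference(b=10, c=6, n=100)
p_value = mcnemar_exact_p(b=10, c=6)
print(f"difference = {diff:.1%}, CI = ({low:.1%}, {high:.1%}), p = {p_value:.2f}")
```

Only the discordant questions (those answered correctly under one prompt but not the other) drive both the interval and the p value, which is why a prompt can differ from chain-of-thought by a point or two in overall accuracy without reaching significance.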