2024 Jan 24;7:20. doi: 10.1038/s41746-024-01010-1

Table 3.

GPT-3.5 MEDQA performance with diagnostic reasoning prompts compared to traditional chain-of-thought (CoT) prompting.

Prompt                   Correct responses (%)   Difference in percentage (confidence interval)   p value (a)
Chain of thought         46%
Intuitive reasoning      48%                     1.7% (−2.5%, 5.9%)                               0.4
Analytic reasoning       40%                     −6.0% (−11%, −1.5%)                              0.001
Differential diagnosis   38%                     −8.9% (−14%, −3.4%)                              <0.001
Bayesian inference       42%                     −4.4% (−9.1%, 0.2%)                              0.02

GPT-3.5 performance on a free-response MEDQA question set using traditional chain-of-thought prompting and four clinical reasoning prompts: intuitive reasoning, analytic reasoning, differential diagnosis, and Bayesian inference.

(a) Percentage difference and p value statistics compared to traditional chain-of-thought.