Table 4.
GPT-4 MEDQA performance with diagnostic reasoning prompts compared to traditional CoT.
Prompt | Correct responses (%) | Difference in percentage (confidence interval) | p value<sup>a</sup> |
---|---|---|---|
Chain of thought | 76% | – | – |
Intuitive reasoning | 77% | 0.8% (−3.6%, 5.2%) | 0.73 |
Analytic reasoning | 78% | 1.6% (−2.4%, 5.6%) | 0.35 |
Differential diagnosis | 78% | 2.2% (−2.3%, 6.7%) | 0.24 |
Bayesian inference | 72% | −3.4% (−9.1%, 1.2%) | 0.07 |
GPT-4 performance on a free-response MEDQA question set with traditional chain-of-thought prompting and with four clinical reasoning prompts: intuitive reasoning, analytic reasoning, differential diagnosis, and Bayesian inference.
<sup>a</sup>Percentage differences and p values are relative to traditional chain-of-thought.