Table 3.
Prompt | Correct responses (%) | Difference in percentage (confidence interval) | p value<sup>a</sup> |
---|---|---|---|
Chain of thought | 46% | – | – |
Intuitive reasoning | 48% | 1.7% (−2.5%, 5.9%) | 0.4 |
Analytic reasoning | 40% | −6.0% (−11%, −1.5%) | 0.001 |
Differential diagnosis | 38% | −8.9% (−14%, −3.4%) | <0.001 |
Bayesian inference | 42% | −4.4% (−9.1%, 0.2%) | 0.02 |
GPT-3.5 performance on a free-response MEDQA question set using traditional chain-of-thought prompting and four clinical reasoning prompts: intuitive reasoning, analytic reasoning, differential diagnosis, and Bayesian inference.
<sup>a</sup> Percentage differences and p values are relative to traditional chain-of-thought prompting.
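
As a reading aid, the sketch below transcribes the table values and, for each clinical reasoning prompt, recomputes the difference from the rounded percentages and checks whether the reported interval excludes zero. The names `rows` and `summarize` are hypothetical and not part of the study. The recomputed differences can deviate from the reported ones, presumably because the reported values were derived from unrounded data (an assumption). The sketch does not reproduce the statistical test behind the p values, which the table does not specify.

```python
# Values transcribed from Table 3 (percentages as printed, already rounded).
BASELINE = "Chain of thought"

rows = {
    "Chain of thought": {"correct": 46, "diff": None, "ci": None},
    "Intuitive reasoning": {"correct": 48, "diff": 1.7, "ci": (-2.5, 5.9)},
    "Analytic reasoning": {"correct": 40, "diff": -6.0, "ci": (-11.0, -1.5)},
    "Differential diagnosis": {"correct": 38, "diff": -8.9, "ci": (-14.0, -3.4)},
    "Bayesian inference": {"correct": 42, "diff": -4.4, "ci": (-9.1, 0.2)},
}


def summarize(rows, baseline=BASELINE):
    """Compare each prompt with the baseline using only the printed values."""
    base = rows[baseline]["correct"]
    out = {}
    for name, r in rows.items():
        if name == baseline:
            continue  # the baseline has no difference or interval
        lo, hi = r["ci"]
        out[name] = {
            # Difference recomputed from the rounded percentages.
            "rounded_diff": r["correct"] - base,
            # Difference as reported in the table.
            "reported_diff": r["diff"],
            # True when the reported interval lies entirely on one side of 0.
            "ci_excludes_zero": not (lo <= 0 <= hi),
        }
    return out


if __name__ == "__main__":
    for name, s in summarize(rows).items():
        print(name, s)
```

Calling `summarize(rows)` returns one entry per clinical reasoning prompt, so the rounded and reported differences can be compared side by side.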