Table 2. Accuracy (%) of language models, a non-LLM baseline, and human performance on the MedQA, PubMedQA, and MedMCQA benchmarks.
Model | Learning | MedQA (USMLE) | PubMedQA (r.r./r.f.) | MedMCQA (dev/test) |
---|---|---|---|---|
Human expert [31] | – | 87.0 | 78.0/90.4 | 90.0 |
Human passing [31] | – | 60.0 | – | 50.0 |
Med-PaLM 2 [30] | Mixed | **86.5**ᵃ | **81.8**/– | 72.3ᵃ/– |
GPT-4 [29] | Few-shot | 86.1 | 80.4/– | **73.7**/– |
FLAN-PaLM [8] | Few-shot | 67.6 | 79.0/– | 57.6/– |
GPT-3.5 [31] | Few-shot | 60.2 | 78.2/– | 59.7/**62.7** |
Galactica [25] | Mixed | 44.4 | 77.6ᵃ/– | 52.9ᵃ/– |
BioMedLM [6] | Fine-tune | 50.3 | 74.4/– | – |
BioGPT [7] | Fine-tune | – | –/**81.0** | – |
PMC-LLaMA [9] | Fine-tune | 44.7 | 69.5/– | –/50.5 |
Non-LLM SOTA [64] | Fine-tune | 47.5 | 73.4/– | – |
Note: All numbers are accuracy in percentages. Bold values denote the best performance among language models. r.r.: reasoning-required; r.f.: reasoning-free; –: not reported.
ᵃ Fine-tuning.