
Table 2.

Performance of LLMs on biomedical QA tasks

| Model | Learning | MedQA (USMLE) | PubMedQA (r.r./r.f.) | MedMCQA (dev/test) |
|---|---|---|---|---|
| Human expert [31] | − | 87.0 | 78.0/90.4 | 90.0 |
| Human passing [31] | − | 60.0 | − | 50.0 |
| Med-PaLM 2 [30] | Mixed | **86.5**ᵃ | **81.8**/− | 72.3ᵃ/− |
| GPT-4 [29] | Few-shot | 86.1 | 80.4/− | **73.7**/− |
| FLAN-PaLM [8] | Few-shot | 67.6 | 79.0/− | 57.6/− |
| GPT-3.5 [31] | Few-shot | 60.2 | 78.2/− | 59.7/**62.7** |
| Galactica [25] | Mixed | 44.4 | 77.6ᵃ/− | 52.9ᵃ/− |
| BioMedLM [6] | Fine-tune | 50.3 | 74.4/− | − |
| BioGPT [7] | Fine-tune | − | −/**81.0** | − |
| PMC-LLaMA [9] | Fine-tune | 44.7 | 69.5/− | −/50.5 |
| Non-LLM SOTA [64] | Fine-tune | 47.5 | 73.4/− | − |

Note: All numbers are accuracy in percentages. Bold values denote the best performance achieved by a language model on each task. r.r.: reasoning-required; r.f.: reasoning-free; −: not reported.

ᵃFine-tuning.
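
Since every entry in the table is plain multiple-choice (or yes/no/maybe) accuracy, the metric itself is straightforward to reproduce. Below is a minimal sketch of that computation; the function and variable names are illustrative and not taken from any benchmark's official evaluation script.

```python
def accuracy_pct(predictions, gold_labels):
    """Percentage of exact matches between predicted and gold answer labels,
    e.g. option letters for MedQA/MedMCQA or yes/no/maybe for PubMedQA."""
    assert len(predictions) == len(gold_labels), "prediction/label count mismatch"
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return 100.0 * correct / len(gold_labels)

# Illustrative usage with made-up answers (not real benchmark data):
preds = ["A", "C", "B", "D"]
gold = ["A", "C", "D", "D"]
print(f"Accuracy: {accuracy_pct(preds, gold):.1f}%")  # Accuracy: 75.0%
```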