Patterns. 2024 Mar 1;5(3):100943. doi: 10.1016/j.patter.2024.100943

Table 4.

Zero-shot answering accuracy of InstructGPT (text-davinci-002) on the MedQA-USMLE (test), MedMCQA (valid), and PubMedQA (test) datasets

| Model | Grounding | Prompt | USMLE | MedMCQA | PubMedQA |
|---|---|---|---|---|---|
| InstructGPT | – | direct | 46.0 | 44.0 | 73.2 |
| InstructGPT | – | CoT #1–#5 | 46.1 ± 0.7 | 40.4 ± 2.2 | 59.9 ± 3.5 |
| InstructGPT | BM25 | direct | 47.3 | 46.7 | – |
| InstructGPT | BM25 | CoT #1–#5 | 46.4 ± 0.7 | 42.5 ± 1.7 | – |
| InstructGPT | – | ensemble (n = 6) | 50.0 | 42.4 | 70.4 |
| InstructGPT | BM25 | ensemble (n = 6) | 49.3 | 48.8 | – |
| InstructGPT | – + BM25 | ensemble (n = 12) | 53.1 | 47.6 | – |
| Fine-tuned BERT | BM25, DPR | – | 44.6 | 43.0 | 72.2 |
| Human (passing score) | – | – | 60.0 | 50.0 | – |
| Human (expert score) | – | – | 87.0 | 90.0 | 78.0 |

We report the best fine-tuned BERT-based methods. We tested five domain-specific CoT cues (#1–#5) and report the mean performance with standard deviation. Fine-tuned BERT: BioLinkBERT (ref. 58); DPR: dense passage retrieval (ref. 59). When multiple results are aggregated, we report the mean and standard deviation (±).
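The footnote describes two aggregation schemes used in the table: averaging accuracy over the five CoT cues (mean ± SD rows) and majority-vote ensembling over several sampled answers per question (ensemble rows). Below is a minimal sketch of both computations, not the authors' code; it assumes each prediction is a per-question option letter, and the function names (`accuracy`, `mean_std_over_cues`, `majority_vote`) are illustrative only.

```python
# Illustrative sketch (not from the paper): aggregating zero-shot answers.
from collections import Counter
from statistics import mean, stdev

def accuracy(preds, gold):
    """Fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def mean_std_over_cues(preds_per_cue, gold):
    """Mean +/- sample SD of accuracy across CoT cues (as in the CoT rows)."""
    accs = [accuracy(preds, gold) for preds in preds_per_cue]
    return mean(accs), stdev(accs)

def majority_vote(preds_per_run):
    """Ensemble n runs per question by majority vote (as in the ensemble rows)."""
    ensembled = []
    for answers in zip(*preds_per_run):  # answers for one question across runs
        ensembled.append(Counter(answers).most_common(1)[0][0])
    return ensembled

if __name__ == "__main__":
    # Toy data: 3 questions, 5 CoT cues (hypothetical values, for illustration).
    gold = ["A", "C", "B"]
    cue_preds = [["A", "C", "B"], ["A", "B", "B"], ["D", "C", "B"],
                 ["A", "C", "A"], ["A", "C", "B"]]
    print(mean_std_over_cues(cue_preds, gold))       # ≈ (0.80, 0.18)
    print(accuracy(majority_vote(cue_preds), gold))  # ensemble accuracy, here 1.0
```

In this toy example, the majority-vote ensemble recovers the correct answer even where individual cues disagree, which is the behavior the ensemble rows of the table are measuring.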