Patterns. 2024 Mar 1;5(3):100943. doi: 10.1016/j.patter.2024.100943

Table 4.

Zero-shot answering accuracy of InstructGPT (text-davinci-002) on the MedQA-USMLE (test), MedMCQA (valid), and PubMedQA (test) datasets

| Model | Grounding | Prompt | USMLE | MedMCQA | PubMedQA |
|---|---|---|---|---|---|
| InstructGPT | – | direct | 46.0 | 44.0 | 73.2 |
| InstructGPT | – | CoT #1–#5 | 46.1 ± 0.7 | 40.4 ± 2.2 | 59.9 ± 3.5 |
| InstructGPT | BM25 | direct | 47.3 | 46.7 | – |
| InstructGPT | BM25 | CoT #1–#5 | 46.4 ± 0.7 | 42.5 ± 1.7 | – |
| InstructGPT | – | ensemble (n = 6) | 50.0 | 42.4 | 70.4 |
| InstructGPT | BM25 | ensemble (n = 6) | 49.3 | 48.8 | – |
| InstructGPT | – + BM25 | ensemble (n = 12) | 53.1 | 47.6 | – |
| Fine-tuned BERT | BM25, DPR | – | 44.6 | 43.0 | 72.2 |
| Human (passing score) | – | – | 60.0 | 50.0 | – |
| Human (expert score) | – | – | 87.0 | 90.0 | 78.0 |

We report the best fine-tuned BERT-based methods. We tested five domain-specific CoT cues (#1–#5) and report the mean performance with standard deviation. Fine-tuned BERT: BioLinkBERT (ref. 58); DPR: dense passage retrieval (ref. 59). When multiple results are aggregated, we report the mean and standard deviation (±).
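The footnote describes two aggregation schemes used in the table: averaging accuracy over the five CoT cues (mean ± SD rows) and majority-vote ensembling over several sampled answers per question (ensemble rows). Below is a minimal sketch of both computations, not the authors' code; it assumes each prediction is a per-question option letter, and the function names (`accuracy`, `mean_std_over_cues`, `majority_vote`) are illustrative only.

```python
# Illustrative sketch (not from the paper): aggregating zero-shot answers.
from collections import Counter
from statistics import mean, stdev

def accuracy(preds, gold):
    """Fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def mean_std_over_cues(preds_per_cue, gold):
    """Mean +/- sample SD of accuracy across CoT cues (as in the CoT rows)."""
    accs = [accuracy(preds, gold) for preds in preds_per_cue]
    return mean(accs), stdev(accs)

def majority_vote(preds_per_run):
    """Ensemble n runs per question by majority vote (as in the ensemble rows)."""
    ensembled = []
    for answers in zip(*preds_per_run):  # answers for one question across runs
        ensembled.append(Counter(answers).most_common(1)[0][0])
    return ensembled

if __name__ == "__main__":
    # Toy data: 3 questions, 5 CoT cues (hypothetical values, for illustration).
    gold = ["A", "C", "B"]
    cue_preds = [["A", "C", "B"], ["A", "B", "B"], ["D", "C", "B"],
                 ["A", "C", "A"], ["A", "C", "B"]]
    print(mean_std_over_cues(cue_preds, gold))       # ≈ (0.80, 0.18)
    print(accuracy(majority_vote(cue_preds), gold))  # ensemble accuracy, here 1.0
```

In this toy example, the majority-vote ensemble recovers the correct answer even where individual cues disagree, which is the behavior the ensemble rows of the table are measuring.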