Table 4.
Model | Grounding | Prompt | USMLE (acc. %) | MedMCQA (acc. %) | PubMedQA (acc. %) |
---|---|---|---|---|---|
InstructGPT | – | direct | 46.0 | 44.0 | 73.2 |
InstructGPT | – | CoT #1–#5 | 46.1 ± 0.7 | 40.4 ± 2.2 | 59.9 ± 3.5 |
InstructGPT | BM25 | direct | 47.3 | 46.7 | – |
InstructGPT | BM25 | CoT #1–#5 | 46.4 ± 0.7 | 42.5 ± 1.7 | – |
InstructGPT | – | ensemble (n = 6) | 50.0 | 42.4 | 70.4 |
InstructGPT | BM25 | ensemble (n = 6) | 49.3 | 48.8 | – |
InstructGPT | – + BM25 | ensemble (n = 12) | 53.1 | 47.6 | – |
Fine-tuned BERT | BM25, DPR | – | 44.6 | 43.0 | 72.2 |
Human (passing score) | – | – | 60.0 | 50.0 | – |
Human (expert score) | – | – | 87.0 | 90.0 | 78.0 |
We report the best fine-tuned BERT-based methods. We tested five domain-specific CoT cues (#1–#5) and report the mean performance with standard deviation. Fine-tuned BERT, BioLinkBERT [58]; DPR, dense passage retrieval [59]. When multiple results are aggregated, we report the mean and standard deviation (±).
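The sketch below illustrates the aggregation described in the note: the mean ± standard deviation over the per-prompt results, plus a majority vote over per-prompt answers as one plausible reading of the ensemble rows. The per-prompt accuracies, the `majority_vote` helper, and the voting rule are illustrative assumptions, not values or methods taken from the table.

```python
import numpy as np
from collections import Counter

# Hypothetical per-prompt accuracies (%) for CoT cues #1-#5 on one dataset;
# the actual per-prompt numbers are not given in the table.
cot_accuracies = np.array([46.9, 45.5, 46.3, 45.8, 46.0])

# Aggregation reported in the table: mean ± standard deviation across prompts.
mean = cot_accuracies.mean()
std = cot_accuracies.std(ddof=1)  # sample SD; the exact ddof choice is not stated
print(f"{mean:.1f} ± {std:.1f}")

def majority_vote(answers_per_prompt):
    """Majority-vote per-question answers across prompts (assumed ensemble rule).

    answers_per_prompt: list of per-prompt answer lists, one answer per question.
    """
    n_questions = len(answers_per_prompt[0])
    voted = []
    for q in range(n_questions):
        votes = Counter(answers[q] for answers in answers_per_prompt)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Example: 3 prompts, 4 multiple-choice questions with options A-D.
preds = [["A", "B", "C", "D"],
         ["A", "B", "B", "D"],
         ["A", "C", "C", "D"]]
print(majority_vote(preds))  # ['A', 'B', 'C', 'D']
```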