Table 7. Performance of models per dataset.
Krippendorff’s alpha (α) per model and dataset, averaged over prompts. Each model/dataset pair covers 330 items; best result per dataset in bold. N total = 11,880.
| Dataset | Command-XL | Flan-T5-XXL | GPT-3.5-turbo | GPT-4 | Davinci-002 | Davinci-003 |
|---|---|---|---|---|---|---|
| CommonsenseQA | .57 (.50, .64) | .81 (.75, .85) | .70 (.64, .76) | **.82 (.76, .87)** | .68 (.62, .74) | .68 (.62, .74) |
| MedQA | .06 (.01, .13) | .02 (.00, .07) | .40 (.32, .47) | **.55 (.47, .61)** | .09 (.03, .15) | .17 (.11, .24) |
| MedMCQA | .08 (.01, .14) | .10 (.03, .17) | .51 (.44, .58) | **.73 (.67, .79)** | .20 (.13, .27) | .21 (.14, .28) |
| OpenBookQA | .43 (.36, .50) | .69 (.63, .76) | .77 (.71, .83) | **.91 (.87, .95)** | .45 (.37, .52) | .66 (.59, .72) |
| StrategyQA | .10 (.00, .21) | .23 (.12, .34) | .44 (.33, .55) | **.69 (.61, .76)** | .20 (.09, .32) | .22 (.12, .31) |
| WorldTree v2 | .67 (.61, .73) | .77 (.72, .83) | .89 (.85, .93) | **.97 (.95, .99)** | .84 (.79, .89) | .84 (.80, .89) |
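
The totals are consistent with the table: 6 models × 6 datasets × 330 items = 11,880. As a reading aid, the sketch below shows one way a single cell could be computed, assuming each prompt's multiple-choice answers are scored against the gold labels with the `krippendorff` Python package and the per-prompt alphas are then averaged; the data layout and the variable `answers_per_prompt` are hypothetical stand-ins, not the authors' exact pipeline.

```python
# Hedged sketch of one Table 7 cell, not the authors' exact pipeline.
# Assumes multiple-choice answers ("A"-"E") scored against gold labels with
# the `krippendorff` package (pip install krippendorff).
import numpy as np
import krippendorff

LETTER_TO_INT = {c: i for i, c in enumerate("ABCDE")}  # encode choices numerically

def alpha_vs_gold(model_answers, gold_answers):
    """Krippendorff's alpha between one prompt's model answers and gold labels."""
    coded = np.array([
        [LETTER_TO_INT[a] for a in model_answers],  # "coder" 1: the model
        [LETTER_TO_INT[g] for g in gold_answers],   # "coder" 2: the gold standard
    ], dtype=float)
    return krippendorff.alpha(reliability_data=coded, level_of_measurement="nominal")

# One cell: average per-prompt alphas for a model/dataset pair of 330 items
# (6 models x 6 datasets x 330 items = 11,880 in total).
# cell = np.mean([alpha_vs_gold(ans, gold) for ans in answers_per_prompt])
```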