PeerJ Comput. Sci. 2024 Apr 30;10:e1999. doi: 10.7717/peerj-cs.1999

Table 7. Performance of models per dataset.

Krippendorff’s alpha (α) performance of models per dataset, averaged over prompts. Each average covers 330 items per model/dataset pair; best results per dataset in bold. N total = 11,880 (6 models × 6 datasets × 330 items).

| Dataset | Command-XL | Flan-T5-XXL | GPT-3.5-turbo | GPT-4 | Davinci-002 | Davinci-003 |
|---|---|---|---|---|---|---|
| CommonsenseQA | .57 (.50, .64) | .81 (.75, .85) | .70 (.64, .76) | **.82** (.76, .87) | .68 (.62, .74) | .68 (.62, .74) |
| MedQA | .06 (.01, .13) | .02 (.00, .07) | .40 (.32, .47) | **.55** (.47, .61) | .09 (.03, .15) | .17 (.11, .24) |
| MedMCQA | .08 (.01, .14) | .10 (.03, .17) | .51 (.44, .58) | **.73** (.67, .79) | .20 (.13, .27) | .21 (.14, .28) |
| OpenBookQA | .43 (.36, .50) | .69 (.63, .76) | .77 (.71, .83) | **.91** (.87, .95) | .45 (.37, .52) | .66 (.59, .72) |
| StrategyQA | .10 (.00, .21) | .23 (.12, .34) | .44 (.33, .55) | **.69** (.61, .76) | .20 (.09, .32) | .22 (.12, .31) |
| WorldTree v2 | .67 (.61, .73) | .77 (.72, .83) | .89 (.85, .93) | **.97** (.95, .99) | .84 (.79, .89) | .84 (.80, .89) |
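For readers reproducing agreement numbers like those in Table 7, Krippendorff's alpha for nominal data is α = 1 − D_o/D_e, computed from a coincidence matrix of rating pairs. The sketch below is a minimal, generic implementation for complete (no-missing-data) nominal ratings; the function name and data layout are illustrative, and it does not reproduce the paper's exact aggregation over prompts and items or its confidence intervals.

```python
from collections import defaultdict

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal ratings with no missing data.

    `units` is a list of units (items); each unit is a sequence of two or
    more category labels, one per rater.
    """
    # Coincidence matrix o[(c, k)]: within each unit, every ordered pair of
    # ratings contributes 1 / (m_u - 1), where m_u is the unit's rating count.
    o = defaultdict(float)
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit needs at least two ratings to form pairs
        for i, c in enumerate(ratings):
            for j, k in enumerate(ratings):
                if i != j:
                    o[(c, k)] += 1.0 / (m - 1)

    # Marginal totals n_c per category and grand total n.
    n_c = defaultdict(float)
    for (c, _k), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())

    # Observed and expected disagreement (nominal delta: 1 iff c != k).
    d_o = sum(v for (c, k), v in o.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    if d_e == 0:
        return 1.0  # only one category observed: no expected disagreement
    return 1.0 - d_o / d_e
```

In the table's setting, each "unit" would pair a model's answer with the reference answer for one item; well-established packages (e.g. the `krippendorff` package on PyPI) offer tested implementations with support for missing data and other distance metrics.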