Table 6. Performance of models.
Krippendorff’s alpha (α) for each model, averaged over datasets and prompts; best result in bold. Total N = 11,880.
| Model | α (CI), n = 1,980 per model |
|---|---|
| GPT-4 | **.78** (.76, .81) |
| GPT-3.5-turbo | .62 (.59, .65) |
| Davinci-003 | .47 (.45, .50) |
| Flan-T5-XXL | .45 (.42, .47) |
| Davinci-002 | .41 (.38, .44) |
| Command-XL | .32 (.29, .35) |
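
For readers unfamiliar with the metric, the following is a minimal sketch of how Krippendorff's α can be computed for nominal labels. It makes several assumptions not stated in this table: labels are treated as nominal categories, each unit is a list of the labels it received (for example, one human code and one model code), and the function name `krippendorff_alpha_nominal` is hypothetical. The paper may use a different distance metric, and this sketch does not reproduce the confidence intervals reported above.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data (illustrative sketch).

    units: list of lists; each inner list holds the labels that one
    coded item received from its raters (e.g. [human_label, model_label]).
    """
    # Build the coincidence matrix: every ordered pair of labels within
    # a unit contributes 1 / (m - 1), where m is the unit's label count.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (nominal metric: 1 if a != b).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)

    if expected == 0:
        return float("nan")  # only one category present; alpha is undefined
    return 1.0 - observed / expected


# Hypothetical toy data: four items, each with two labels.
toy_units = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "a"]]
toy_alpha = krippendorff_alpha_nominal(toy_units)
```

On this toy data the two raters agree on half of the items, which is barely better than chance for two equally frequent categories, so α comes out close to zero. Values such as the .78 reported for GPT-4 indicate agreement well above chance.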