2024 Apr 30;10:e1999. doi: 10.7717/peerj-cs.1999

Table 6. Performance of models.

Krippendorff’s alpha (α) for each model, averaged over datasets and prompts; best results in bold. N total = 11,880.

Model            α (CI), n per model = 1,980
GPT-4            .78 (.76, .81)
GPT-3.5-turbo    .62 (.59, .65)
Davinci-003      .47 (.45, .50)
Flan-T5-XXL      .45 (.42, .47)
Davinci-002      .41 (.38, .44)
Command-XL       .32 (.29, .35)
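
For readers unfamiliar with the agreement statistic reported in the table, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is an illustration only and not the authors' code: the function name `krippendorff_alpha_nominal`, the input layout (one tuple of labels per annotated item), and the choice of the nominal distance metric are all assumptions, since the table does not state which variant of alpha or which software was used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data (illustrative sketch).

    `units` is an iterable of label sequences, one per annotated item,
    e.g. (model_label, human_label). None marks a missing label.
    The nominal metric is an assumption, not taken from the source.
    """
    # Coincidence matrix: each ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in that unit.
    coincidences = Counter()
    for values in units:
        values = [v for v in values if v is not None]
        m = len(values)
        if m < 2:
            continue  # units with fewer than two labels carry no pairing information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category occurs")

    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical toy data: (model label, human label) per text.
    toy_units = [("pos", "pos"), ("pos", "neg"), ("neg", "neg"), ("neg", "pos")]
    print(krippendorff_alpha_nominal(toy_units))
```

An alpha of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance, which is the scale on which the model scores in the table should be read.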