Table 3.
Inter-Rater Reliability Comparison Expressed Through Cohen’s Kappa. This table reports pairwise inter-rater reliability among the LLMs. Kappa values below 0.20 indicate minimal agreement, 0.21–0.39 weak agreement, and 0.40–0.59 moderate agreement [23], [28]. No model pair achieved a kappa value ≥ 0.60, the threshold for strong agreement. The p-value for each comparison is given in parentheses.
LLM | GPT-4o | GPT-4Turbo | GPT-4mini | GPT-3.5 | Llama-3.1 | Gemma 2 | Mistral-Nemo | Gemini 1.5 | Gemini 1.0 |
---|---|---|---|---|---|---|---|---|---|
GPT-4o | 1 | 0.448 (<0.001) | 0.401 (<0.001) | 0.222 (0.008) | 0.140 (0.008) | 0.365 (<0.001) | 0.250 (<0.001) | 0.348 (<0.001) | 0.275 (0.002) |
GPT-4Turbo | 0.448 (<0.001) | 1 | 0.512 (<0.001) | 0.336 (<0.001) | 0.134 (0.065) | 0.430 (<0.001) | 0.297 (<0.001) | 0.534 (<0.001) | 0.410 (<0.001) |
GPT-4mini | 0.401 (<0.001) | 0.512 (<0.001) | 1 | 0.382 (<0.001) | 0.234 (0.004) | 0.425 (<0.001) | 0.403 (<0.001) | 0.404 (<0.001) | 0.348 (<0.001) |
GPT-3.5 | 0.222 (0.008) | 0.336 (<0.001) | 0.382 (<0.001) | 1 | 0.167 (0.052) | 0.422 (<0.001) | 0.341 (<0.001) | 0.184 (0.072) | 0.130 (0.204) |
Llama-3.1 | 0.140 (0.008) | 0.134 (0.065) | 0.234 (0.004) | 0.167 (0.052) | 1 | 0.122 (0.122) | 0.564 (<0.001) | 0.256 (0.001) | 0.295 (<0.001) |
Gemma 2 | 0.365 (<0.001) | 0.430 (<0.001) | 0.425 (<0.001) | 0.422 (<0.001) | 0.122 (0.122) | 1 | 0.353 (<0.001) | 0.387 (<0.001) | 0.328 (0.001) |
Mistral-Nemo | 0.250 (<0.001) | 0.297 (<0.001) | 0.403 (<0.001) | 0.341 (<0.001) | 0.564 (<0.001) | 0.353 (<0.001) | 1 | 0.470 (<0.001) | 0.470 (<0.001) |
Gemini 1.5 | 0.348 (<0.001) | 0.534 (<0.001) | 0.404 (<0.001) | 0.184 (0.072) | 0.256 (0.001) | 0.387 (<0.001) | 0.470 (<0.001) | 1 | 0.597 (<0.001) |
Gemini 1.0 | 0.275 (0.002) | 0.410 (<0.001) | 0.348 (<0.001) | 0.130 (0.204) | 0.295 (<0.001) | 0.328 (0.001) | 0.470 (<0.001) | 0.597 (<0.001) | 1 |
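For reference, the pairwise agreement statistics in Table 3 could be computed along the lines of the sketch below. This is a minimal illustration, not the authors' code: it uses `sklearn.metrics.cohen_kappa_score` for kappa and a permutation test for the p-value, since the paper does not state how its p-values were obtained; the `ratings` data and sample size are placeholders.

```python
# Minimal sketch (assumed, not the authors' pipeline): pairwise Cohen's kappa
# between LLM "raters" plus a one-sided permutation p-value.
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

def kappa_with_pvalue(ratings_a, ratings_b, n_permutations=10_000):
    """Cohen's kappa for two raters and a permutation-based p-value
    (null hypothesis: no agreement beyond chance)."""
    observed = cohen_kappa_score(ratings_a, ratings_b)
    shuffled = np.array(ratings_b, copy=True)
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        rng.shuffle(shuffled)          # break any association between raters
        null[i] = cohen_kappa_score(ratings_a, shuffled)
    p_value = (np.sum(null >= observed) + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical usage: each LLM's labels on the same set of items.
ratings = {
    "GPT-4o":  rng.integers(0, 3, size=150),
    "GPT-3.5": rng.integers(0, 3, size=150),
}
for model_a, model_b in combinations(ratings, 2):
    k, p = kappa_with_pvalue(ratings[model_a], ratings[model_b])
    print(f"{model_a} vs {model_b}: kappa={k:.3f} (p={p:.3f})")
```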