
Table 3.

Inter-Rater Reliability Comparison Expressed Through Cohen’s Kappa. The table reports pairwise Cohen’s Kappa values quantifying inter-rater reliability among the LLMs. Values of 0.20 or below indicate minimal agreement, 0.21–0.39 weak agreement, and 0.40–0.59 moderate agreement [23], [28]. No model pair reached a Kappa value ≥ 0.60, the threshold for strong agreement. The p-value for each comparison is given in parentheses.

| LLM | GPT-4o | GPT-4Turbo | GPT-4mini | GPT-3.5 | Llama-3.1 | Gemma 2 | Mistral-Nemo | Gemini 1.5 | Gemini 1.0 |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 1 | 0.448 (<0.001) | 0.401 (<0.001) | 0.222 (0.008) | 0.140 (0.008) | 0.365 (<0.001) | 0.250 (<0.001) | 0.348 (<0.001) | 0.275 (0.002) |
| GPT-4Turbo | 0.448 (<0.001) | 1 | 0.512 (<0.001) | 0.336 (<0.001) | 0.134 (0.065) | 0.430 (<0.001) | 0.297 (<0.001) | 0.534 (<0.001) | 0.410 (<0.001) |
| GPT-4mini | 0.401 (<0.001) | 0.512 (<0.001) | 1 | 0.382 (<0.001) | 0.234 (0.004) | 0.425 (<0.001) | 0.403 (<0.001) | 0.404 (<0.001) | 0.348 (<0.001) |
| GPT-3.5 | 0.222 (0.008) | 0.336 (<0.001) | 0.382 (<0.001) | 1 | 0.167 (0.052) | 0.422 (<0.001) | 0.341 (<0.001) | 0.184 (0.072) | 0.130 (0.204) |
| Llama-3.1 | 0.140 (0.008) | 0.134 (0.065) | 0.234 (0.004) | 0.167 (0.052) | 1 | 0.122 (0.122) | 0.564 (<0.001) | 0.256 (0.001) | 0.295 (<0.001) |
| Gemma 2 | 0.365 (<0.001) | 0.430 (<0.001) | 0.425 (<0.001) | 0.422 (<0.001) | 0.122 (0.122) | 1 | 0.353 (<0.001) | 0.387 (<0.001) | 0.328 (0.001) |
| Mistral-Nemo | 0.250 (<0.001) | 0.297 (<0.001) | 0.403 (<0.001) | 0.341 (<0.001) | 0.564 (<0.001) | 0.353 (<0.001) | 1 | 0.470 (<0.001) | 0.470 (<0.001) |
| Gemini 1.5 | 0.348 (<0.001) | 0.534 (<0.001) | 0.404 (<0.001) | 0.184 (0.072) | 0.256 (0.001) | 0.387 (<0.001) | 0.470 (<0.001) | 1 | 0.597 (<0.001) |
| Gemini 1.0 | 0.275 (0.002) | 0.410 (<0.001) | 0.348 (<0.001) | 0.130 (0.204) | 0.295 (<0.001) | 0.328 (0.001) | 0.470 (<0.001) | 0.597 (<0.001) | 1 |
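For readers who wish to reproduce this kind of pairwise analysis, the sketch below illustrates how a Kappa matrix such as Table 3 could be computed. It is a minimal illustration rather than the authors’ pipeline: the `ratings` dictionary holds hypothetical placeholder labels, and the permutation-based p-value is only one possible significance test; the p-values in Table 3 may derive from a different procedure (e.g., a normal-approximation z-test).

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical example ratings (placeholder data, not from the study):
# one categorical label per evaluated item, per model.
ratings = {
    "GPT-4o":     ["yes", "no", "yes", "yes", "no", "yes", "no", "no"],
    "GPT-4Turbo": ["yes", "no", "no",  "yes", "no", "yes", "yes", "no"],
    "Gemini 1.5": ["yes", "yes", "no", "yes", "no", "no",  "yes", "no"],
}

def kappa_with_permutation_p(a, b, n_perm=10_000):
    """Cohen's Kappa plus a permutation p-value for H0: agreement is due to chance."""
    observed = cohen_kappa_score(a, b)
    b = np.asarray(b)
    # Null distribution: Kappa recomputed after shuffling one rater's labels.
    null = np.array([cohen_kappa_score(a, rng.permutation(b)) for _ in range(n_perm)])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value

# Pairwise comparison over all model pairs, mirroring the layout of Table 3.
for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
    kappa, p = kappa_with_permutation_p(a, b)
    print(f"{name_a} vs {name_b}: kappa = {kappa:.3f} (p = {p:.3f})")
```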