Table 3.
Inter-Rater Reliability Comparison Expressed Through Cohen’s Kappa. This table reports pairwise inter-rater reliability among the LLMs. Kappa values below 0.20 indicate minimal agreement, 0.21–0.39 weak agreement, and 0.40–0.59 moderate agreement [23], [28]. No model pair achieved a kappa value ≥ 0.60, the threshold for strong agreement. The p-value for each comparison is given in parentheses.
LLM | GPT-4o | GPT-4Turbo | GPT-4mini | GPT-3.5 | Llama-3.1 | Gemma 2 | Mistral-Nemo | Gemini 1.5 | Gemini 1.0 |
---|---|---|---|---|---|---|---|---|---|
GPT-4o | 1 | 0.448 (<0.001) | 0.401 (<0.001) | 0.222 (0.008) | 0.140 (0.008) | 0.365 (<0.001) | 0.250 (<0.001) | 0.348 (<0.001) | 0.275 (0.002) |
GPT-4Turbo | 0.448 (<0.001) | 1 | 0.512 (<0.001) | 0.336 (<0.001) | 0.134 (0.065) | 0.430 (<0.001) | 0.297 (<0.001) | 0.534 (<0.001) | 0.410 (<0.001) |
GPT-4mini | 0.401 (<0.001) | 0.512 (<0.001) | 1 | 0.382 (<0.001) | 0.234 (0.004) | 0.425 (<0.001) | 0.403 (<0.001) | 0.404 (<0.001) | 0.348 (<0.001) |
GPT-3.5 | 0.222 (0.008) | 0.336 (<0.001) | 0.382 (<0.001) | 1 | 0.167 (0.052) | 0.422 (<0.001) | 0.341 (<0.001) | 0.184 (0.072) | 0.130 (0.204) |
Llama-3.1 | 0.140 (0.008) | 0.134 (0.065) | 0.234 (0.004) | 0.167 (0.052) | 1 | 0.122 (0.122) | 0.564 (<0.001) | 0.256 (0.001) | 0.295 (<0.001) |
Gemma 2 | 0.365 (<0.001) | 0.430 (<0.001) | 0.425 (<0.001) | 0.422 (<0.001) | 0.122 (0.122) | 1 | 0.353 (<0.001) | 0.387 (<0.001) | 0.328 (0.001) |
Mistral-Nemo | 0.250 (<0.001) | 0.297 (<0.001) | 0.403 (<0.001) | 0.341 (<0.001) | 0.564 (<0.001) | 0.353 (<0.001) | 1 | 0.470 (<0.001) | 0.470 (<0.001) |
Gemini 1.5 | 0.348 (<0.001) | 0.534 (<0.001) | 0.404 (<0.001) | 0.184 (0.072) | 0.256 (0.001) | 0.387 (<0.001) | 0.470 (<0.001) | 1 | 0.597 (<0.001) |
Gemini 1.0 | 0.275 (0.002) | 0.410 (<0.001) | 0.348 (<0.001) | 0.130 (0.204) | 0.295 (<0.001) | 0.328 (0.001) | 0.470 (<0.001) | 0.597 (<0.001) | 1 |
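For reference, the pairwise agreement statistics in Table 3 could be computed along the lines of the sketch below. This is a minimal illustration, not the authors' code: it uses `sklearn.metrics.cohen_kappa_score` for kappa and a permutation test for the p-value, since the paper does not state how its p-values were obtained; the `ratings` data and sample size are placeholders.

```python
# Minimal sketch (assumed, not the authors' pipeline): pairwise Cohen's kappa
# between LLM "raters" plus a one-sided permutation p-value.
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

def kappa_with_pvalue(ratings_a, ratings_b, n_permutations=10_000):
    """Cohen's kappa for two raters and a permutation-based p-value
    (null hypothesis: no agreement beyond chance)."""
    observed = cohen_kappa_score(ratings_a, ratings_b)
    shuffled = np.array(ratings_b, copy=True)
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        rng.shuffle(shuffled)          # break any association between raters
        null[i] = cohen_kappa_score(ratings_a, shuffled)
    p_value = (np.sum(null >= observed) + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical usage: each LLM's labels on the same set of items.
ratings = {
    "GPT-4o":  rng.integers(0, 3, size=150),
    "GPT-3.5": rng.integers(0, 3, size=150),
}
for model_a, model_b in combinations(ratings, 2):
    k, p = kappa_with_pvalue(ratings[model_a], ratings[model_b])
    print(f"{model_a} vs {model_b}: kappa={k:.3f} (p={p:.3f})")
```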