2025 Aug 22;272(9):586. doi: 10.1007/s00415-025-13261-3

Table 2.

Models' performance differences

| Features | AI | Fleiss kappa (AI) | AI–physician | Fleiss kappa (AI–physician) | Cohen's kappa # | Physician | Fleiss kappa (physician) | Three-way p value | p value, intervention vs. control |
|---|---|---|---|---|---|---|---|---|---|
| Finding Score (0–1 range) | 0.70 ± 0.27 | 0.32 | 0.92 ± 0.14 | 0.69 | 0.12 | 0.92 ± 0.16 | 0.76 | < 0.001 | 0.43 |
| Clinical diagnosis score (−5 to 0 range) | −1.45 ± 1.45 | 0.33 | −0.24 ± 0.44 | 0.67 | 0.08 | −0.24 ± 0.53 | 0.74 | < 0.001 | 0.51 |
| Clinical diagnosis score (normalized 0–1 range) † | 0.71 ± 0.29 | 0.33 | 0.95 ± 0.09 | 0.67 | 0.08 | 0.95 ± 0.11 | 0.74 | < 0.001 | 0.51 |
| AIGERS score ‡ | 0.70 ± 0.26 | 0.32 | 0.94 ± 0.11 | 0.68 | 0.1 | 0.94 ± 0.13 | 0.75 | < 0.001 | 0.32 |
| Report word count | 204.60 ± 38.35 | | 93.47 ± 46.05 | | | 77.29 ± 40.83 | | < 0.001 | 0.019 |
| Report recommendations (number) | 3.54 ± 0.92 | | 0.5 ± 0.62 | | | 0.09 ± 0.29 | | < 0.001 | < 0.001 |
| Report non-relevant recommendations (number) | 0.10 ± 0.3 | | 0 | | | 0 | | < 0.001 | |

Values are presented as mean ± standard deviation. The Fleiss kappa values represent inter-rater reliability among the three physicians (MK, GK, IT) who scored each report. Three-way p values were calculated using Kruskal–Wallis tests across all groups. The intervention vs. control column shows p values for the direct comparison between the AI–physician (intervention) and physician-only (control) groups.
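As a rough illustration of how a Fleiss kappa of the kind reported above can be computed, the sketch below implements the standard formula for a fixed number of raters. It assumes each rater assigns one categorical score per report. The function name `fleiss_kappa` and the example ratings are hypothetical and are not taken from the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss kappa for a list of reports, each a list of per-rater categories.

    Assumes every report was scored by the same number of raters (at least 2).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each report needs the same number (>= 2) of ratings")

    # Count how many raters chose each category, per report.
    counts = [Counter(row) for row in ratings]

    # Observed agreement: mean over reports of the share of agreeing rater pairs.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_subjects

    # Chance agreement from the overall proportion of each category.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical scores from three raters on four reports (illustration only).
example = [[0, 0, 0], [-1, -1, 0], [-2, -2, -2], [0, -1, 0]]
print(round(fleiss_kappa(example), 3))
```

Values near 1 indicate strong agreement among raters, and values near 0 indicate agreement no better than chance.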

*Statistically significant (p < 0.05)

†Clinical Diagnosis Score normalized from original − 5 to 0 range to 0–1 scale using the formula: normalized score = (original score + 5)/5

‡AIGERS score calculated as 50% Finding Score + 50% normalized Clinical Diagnosis Score

#Cohen's kappa values represent agreement between the AI and AI–physician groups

Bold values are statistically significant
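The two score formulas in the footnotes can be sketched in a few lines. The function names below are hypothetical. Applying the formulas to the group means from the table is only a consistency check: it works because both formulas are linear, and it is not how the per-report scores were produced.

```python
def normalize_clinical_diagnosis(score):
    """Map a Clinical Diagnosis Score from the -5..0 range onto 0..1."""
    if not -5 <= score <= 0:
        raise ValueError("Clinical Diagnosis Score must lie between -5 and 0")
    return (score + 5) / 5


def aigers_score(finding_score, clinical_diagnosis_score):
    """50% Finding Score + 50% normalized Clinical Diagnosis Score."""
    if not 0 <= finding_score <= 1:
        raise ValueError("Finding Score must lie between 0 and 1")
    return 0.5 * finding_score + 0.5 * normalize_clinical_diagnosis(
        clinical_diagnosis_score
    )


# Consistency check against the AI-physician group means in the table.
print(round(normalize_clinical_diagnosis(-0.24), 2))  # 0.95
print(round(aigers_score(0.92, -0.24), 2))            # 0.94
```

Both printed values match the normalized Clinical Diagnosis Score and AIGERS score reported for the AI–physician group.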