. 2025 Aug 22;272(9):586. doi: 10.1007/s00415-025-13261-3

Table 2.

Models' performance differences

Features	AI	Fleiss kappa (AI)	AI–physician	Fleiss kappa (AI–physician)	Cohen kappa #	Physician	Fleiss kappa (physician)	Three-way p value	p value intervention vs. control
Finding Score (0–1 range)	0.70 ± 0.27	0.32	0.92 ± 0.14	0.69	0.12	0.92 ± 0.16	0.76	< 0.001	0.43
Clinical diagnosis score (−5 to 0 range)	−1.45 ± 1.45	0.33	−0.24 ± 0.44	0.67	0.08	−0.24 ± 0.53	0.74	< 0.001	0.51
Clinical diagnosis score (Normalized 0–1 range) †	0.71 ± 0.29	0.33	0.95 ± 0.09	0.67	0.08	0.95 ± 0.11	0.74	< 0.001	0.51
AIGERS score ‡	0.70 ± 0.26	0.32	0.94 ± 0.11	0.68	0.1	0.94 ± 0.13	0.75	< 0.001	0.32
Report word count	204.60 ± 38.35	–	93.47 ± 46.05	–	–	77.29 ± 40.83	–	< 0.001	0.019
Number of report recommendations	3.54 ± 0.92	–	0.5 ± 0.62	–	–	0.09 ± 0.29	–	< 0.001	< 0.001
Number of non-relevant report recommendations	0.10 ± 0.3	–	0	–	–	0	–	< 0.001	–

Values are presented as mean ± standard deviation. The Fleiss kappa values represent inter-rater reliability among the three physicians (MK, GK, IT) who scored each report. Three-way p values were calculated using Kruskal–Wallis tests across all groups. The intervention vs. control column shows p values for the direct comparison between the AI–physician (intervention) and physician-only (control) groups.

*Statistically significant (p < 0.05)

†Clinical Diagnosis Score normalized from the original −5 to 0 range to a 0–1 scale using the formula: normalized score = (original score + 5)/5

‡AIGERS score calculated as 50% Finding Score + 50% normalized Clinical Diagnosis Score

#Cohen's kappa values represent agreement between the AI and AI–physician groups

Bold values are statistically significant
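
The two formulas given in the footnotes (the normalization of the Clinical Diagnosis Score and the 50/50 AIGERS weighting) can be illustrated with a minimal sketch. The function names `normalize_diagnosis_score` and `aigers_score` are hypothetical and chosen here for illustration only; they are not taken from the study. The range check is likewise an assumption added for safety. Because both formulas are linear, applying them to the group means in the table should reproduce the tabulated means to within rounding.

```python
def normalize_diagnosis_score(original: float) -> float:
    """Map a Clinical Diagnosis Score from the -5..0 range onto 0..1.

    Implements the footnote formula: normalized = (original + 5) / 5.
    The range check is an added assumption, not part of the source.
    """
    if not -5.0 <= original <= 0.0:
        raise ValueError("Clinical Diagnosis Score must lie in [-5, 0]")
    return (original + 5.0) / 5.0


def aigers_score(finding_score: float, diagnosis_original: float) -> float:
    """Combine the two components with equal weight, as the footnote states.

    AIGERS = 50% Finding Score + 50% normalized Clinical Diagnosis Score.
    """
    normalized = normalize_diagnosis_score(diagnosis_original)
    return 0.5 * finding_score + 0.5 * normalized


if __name__ == "__main__":
    # Group means taken from the table: (Finding Score, Clinical Diagnosis Score).
    group_means = {
        "AI": (0.70, -1.45),
        "AI-physician": (0.92, -0.24),
        "Physician": (0.92, -0.24),
    }
    for group, (finding, diagnosis) in group_means.items():
        print(
            f"{group}: normalized diagnosis = "
            f"{normalize_diagnosis_score(diagnosis):.3f}, "
            f"AIGERS = {aigers_score(finding, diagnosis):.3f}"
        )
```

For the AI group, for example, the normalized diagnosis score is (−1.45 + 5)/5 = 0.71, matching the normalized row of the table.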