Table 2.
Models' performance differences
| Features | AI | Fleiss kappa (AI) | AI–physician | Fleiss kappa (AI–physician) | Cohen's kappa # | Physician | Fleiss kappa (Physician) | Three-way p value | p value, intervention vs. control |
|---|---|---|---|---|---|---|---|---|---|
| Finding Score (0–1 range) | 0.70 ± 0.27 | 0.32 | 0.92 ± 0.14 | 0.69 | 0.12 | 0.92 ± 0.16 | 0.76 | < 0.001 | 0.43 |
| Clinical diagnosis score (−5 to 0 range) | −1.45 ± 1.45 | 0.33 | −0.24 ± 0.44 | 0.67 | 0.08 | −0.24 ± 0.53 | 0.74 | < 0.001 | 0.51 |
| Clinical diagnosis score (Normalized 0–1 range) † | 0.71 ± 0.29 | 0.33 | 0.95 ± 0.09 | 0.67 | 0.08 | 0.95 ± 0.11 | 0.74 | < 0.001 | 0.51 |
| AIGERS score ‡ | 0.70 ± 0.26 | 0.32 | 0.94 ± 0.11 | 0.68 | 0.1 | 0.94 ± 0.13 | 0.75 | < 0.001 | 0.32 |
| Report word count | 204.60 ± 38.35 | – | 93.47 ± 46.05 | – | – | 77.29 ± 40.83 | – | < 0.001 | 0.019 |
| Number of recommendations per report | 3.54 ± 0.92 | – | 0.5 ± 0.62 | – | – | 0.09 ± 0.29 | – | < 0.001 | < 0.001 |
| Number of non-relevant recommendations per report | 0.10 ± 0.3 | – | 0 | – | – | 0 | – | < 0.001 | – |
Values are presented as mean ± standard deviation. The Fleiss kappa values represent inter-rater reliability among the three physicians (MK, GK, IT) who scored each report. Three-way p values were calculated using Kruskal–Wallis tests across all groups. The intervention vs. control column shows p values for the direct comparison between the AI–physician (intervention) and physician-only (control) groups.
*Statistically significant (p < 0.05)
†Clinical Diagnosis Score normalized from the original −5 to 0 range to a 0–1 scale using the formula: normalized score = (original score + 5)/5
‡AIGERS score calculated as 50% Finding Score + 50% normalized Clinical Diagnosis Score
#Cohen's kappa values represent agreement between the AI and AI–physician groups
Bold values are statistically significant
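
The normalization (†) and the AIGERS weighting (‡) described in the footnotes can be illustrated with a minimal Python sketch. The function names `normalize_diagnosis_score` and `aigers_score` are hypothetical and chosen only for illustration; they are not taken from the study's own code, and the range check is an added assumption.

```python
def normalize_diagnosis_score(original: float) -> float:
    """Map a Clinical Diagnosis Score from the -5..0 range onto 0..1.

    Implements the footnote formula: (original score + 5) / 5.
    The range check is an assumption added for safety.
    """
    if not -5.0 <= original <= 0.0:
        raise ValueError("Clinical Diagnosis Score must lie in [-5, 0]")
    return (original + 5.0) / 5.0


def aigers_score(finding_score: float, diagnosis_score_original: float) -> float:
    """Combine both components with equal weights, as the footnote states.

    finding_score is expected on the 0..1 scale; the diagnosis score is
    passed on its original -5..0 scale and normalized here.
    """
    if not 0.0 <= finding_score <= 1.0:
        raise ValueError("Finding Score must lie in [0, 1]")
    normalized = normalize_diagnosis_score(diagnosis_score_original)
    return 0.5 * finding_score + 0.5 * normalized


# Check against the AI column of the table: a mean of -1.45 on the
# original scale corresponds to the reported normalized mean of 0.71.
print(round(normalize_diagnosis_score(-1.45), 2))  # 0.71
```

Because both formulas are linear, applying them to the group means reproduces the tabulated means up to rounding, and the standard deviation of the normalized score is the original standard deviation divided by 5 (for example, 1.45/5 = 0.29 in the AI column).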