2020 Nov 22;62:103121. doi: 10.1016/j.ebiom.2020.103121

Table 3.

Comparison of model performance with subspecialists on cross-validation. For Cohen's kappa scores and categorical accuracy, 95% confidence intervals were generated using 10,000 bootstrap samples. Permutation tests with 10,000 iterations were used to calculate p-values.

| Subgroup | Reader | Accuracy | Cohen's κ (95% CI) | Difference in κ, rater minus model (95% CI) | p-value |
|---|---|---|---|---|---|
| Total (n=1065) | Model | 72.1% | 0.548 (0.504, 0.590) | | |
| | Rater 1 | 74.6% | 0.605 (0.564, 0.644) | 0.057 (0.007, 0.107) | 0.03 |
| | Rater 2 | 72.1% | 0.565 (0.523, 0.607) | 0.017 (-0.034, 0.068) | 0.52 |
| Age <12 (n=268) | Model | 73.9% | 0.557 (0.473, 0.641) | | |
| | Rater 1 | 71.3% | 0.544 (0.464, 0.625) | -0.013 (-0.106, 0.079) | 0.77 |
| | Rater 2 | 73.9% | 0.587 (0.506, 0.666) | 0.030 (-0.069, 0.128) | 0.56 |
| Age 12-18 (n=277) | Model | 76.7% | 0.617 (0.537, 0.693) | | |
| | Rater 1 | 77.4% | 0.646 (0.570, 0.721) | 0.029 (-0.065, 0.126) | 0.55 |
| | Rater 2 | 75.6% | 0.615 (0.534, 0.689) | -0.002 (-0.098, 0.094) | 0.96 |
| Age 19-36 (n=263) | Model | 75.8% | 0.610 (0.523, 0.692) | | |
| | Rater 1 | 77.8% | 0.653 (0.571, 0.731) | 0.043 (-0.062, 0.148) | 0.43 |
| | Rater 2 | 70.6% | 0.541 (0.451, 0.628) | -0.069 (-0.174, 0.036) | 0.22 |
| Age >36 (n=257) | Model | 62.2% | 0.384 (0.291, 0.473) | | |
| | Rater 1 | 72.1% | 0.558 (0.472, 0.641) | 0.174 (0.065, 0.284) | 0.003 |
| | Rater 2 | 68.3% | 0.499 (0.413, 0.583) | 0.115 (0.004, 0.227) | 0.05 |

Rater 1 and 2 are subspecialists.
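
The caption names two resampling procedures: bootstrap confidence intervals for Cohen's κ and permutation tests for the difference in κ. The sketch below shows one common way such quantities can be computed; it is not the authors' code. Several details are assumptions the caption does not state: that κ is unweighted and computed against a reference label per case, that the interval is a percentile bootstrap over cases, and that the permutation test is a paired one that randomly swaps the two readers' labels case by case. All function names are hypothetical.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two marginal label distributions.
    p_exp = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_exp == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is a convention chosen for this sketch
    return (p_obs - p_exp) / (1.0 - p_exp)


def bootstrap_ci(truth, pred, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa (assumed scheme: resample cases with replacement)."""
    rng = random.Random(seed)
    n = len(truth)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([truth[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def permutation_p(truth, pred_a, pred_b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test for kappa(pred_b) - kappa(pred_a).

    Assumed scheme: under the null the two readers are exchangeable, so their
    labels are swapped independently for each case with probability 0.5.
    """
    rng = random.Random(seed)
    observed = cohen_kappa(truth, pred_b) - cohen_kappa(truth, pred_a)
    hits = 0
    for _ in range(n_perm):
        shuf_a, shuf_b = [], []
        for x, y in zip(pred_a, pred_b):
            if rng.random() < 0.5:
                x, y = y, x
            shuf_a.append(x)
            shuf_b.append(y)
        diff = cohen_kappa(truth, shuf_b) - cohen_kappa(truth, shuf_a)
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

With per-case label lists `truth`, `model`, and `rater1` (hypothetical names), `bootstrap_ci(truth, model)` would give an interval of the kind shown in the κ column, and `permutation_p(truth, model, rater1)` would give a difference in κ (rater minus model) with its p-value. A pure-Python loop is used to keep the sketch dependency-free; a vectorized implementation would be faster at 10,000 iterations on roughly a thousand cases.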