Table 3.
Comparison of model and subspecialist performance on cross-validation. For Cohen's kappa scores and categorical accuracy, 95% confidence intervals were generated using 10,000 bootstrap samples. Permutation tests with 10,000 iterations were used to calculate p-values.
| Subgroup | Rater | Accuracy | Cohen's κ (95% CI) | Difference in κ vs. model (95% CI) | p-value |
|---|---|---|---|---|---|
| Total (n=1065) | Model | 72.1% | 0.548 (0.504, 0.590) | | |
| | Rater 1 | 74.6% | 0.605 (0.564, 0.644) | 0.057 (0.007, 0.107) | 0.03 |
| | Rater 2 | 72.1% | 0.565 (0.523, 0.607) | 0.017 (-0.034, 0.068) | 0.52 |
| Age (<12, n=268) | Model | 73.9% | 0.557 (0.473, 0.641) | | |
| | Rater 1 | 71.3% | 0.544 (0.464, 0.625) | -0.013 (-0.106, 0.079) | 0.77 |
| | Rater 2 | 73.9% | 0.587 (0.506, 0.666) | 0.030 (-0.069, 0.128) | 0.56 |
| Age (12-18, n=277) | Model | 76.7% | 0.617 (0.537, 0.693) | | |
| | Rater 1 | 77.4% | 0.646 (0.570, 0.721) | 0.029 (-0.065, 0.126) | 0.55 |
| | Rater 2 | 75.6% | 0.615 (0.534, 0.689) | -0.002 (-0.098, 0.094) | 0.96 |
| Age (19-36, n=263) | Model | 75.8% | 0.610 (0.523, 0.692) | | |
| | Rater 1 | 77.8% | 0.653 (0.571, 0.731) | 0.043 (-0.062, 0.148) | 0.43 |
| | Rater 2 | 70.6% | 0.541 (0.451, 0.628) | -0.069 (-0.174, 0.036) | 0.22 |
| Age (>36, n=257) | Model | 62.2% | 0.384 (0.291, 0.473) | | |
| | Rater 1 | 72.1% | 0.558 (0.472, 0.641) | 0.174 (0.065, 0.284) | 0.003 |
| | Rater 2 | 68.3% | 0.499 (0.413, 0.583) | 0.115 (0.004, 0.227) | 0.05 |
Raters 1 and 2 are subspecialists.
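
The bootstrap confidence intervals and permutation p-values described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that each κ is computed against a shared reference label per case, that the interval is a percentile bootstrap over cases, and that the permutation test swaps model and rater labels within each case. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both sequences use one identical label;
        # returning 0.0 here is an assumption made for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


def bootstrap_kappa_difference_ci(reference, model, rater,
                                  n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa(rater) - kappa(model), resampling cases."""
    rng = random.Random(seed)
    n = len(reference)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample cases with replacement
        ref = [reference[i] for i in idx]
        k_model = cohen_kappa(ref, [model[i] for i in idx])
        k_rater = cohen_kappa(ref, [rater[i] for i in idx])
        diffs.append(k_rater - k_model)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def permutation_p_value(reference, model, rater, n_perm=10_000, seed=0):
    """Two-sided paired permutation test for the difference in kappa."""
    rng = random.Random(seed)
    observed = cohen_kappa(reference, rater) - cohen_kappa(reference, model)
    extreme = 0
    for _ in range(n_perm):
        perm_model, perm_rater = [], []
        for m, r in zip(model, rater):
            if rng.random() < 0.5:  # swap the two labels for this case
                m, r = r, m
            perm_model.append(m)
            perm_rater.append(r)
        diff = (cohen_kappa(reference, perm_rater)
                - cohen_kappa(reference, perm_model))
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the study.
    reference = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
    model = [0, 1, 1, 1, 0, 2, 0, 0, 2, 2]
    rater = [0, 1, 2, 1, 0, 2, 1, 1, 2, 1]
    print(bootstrap_kappa_difference_ci(reference, model, rater))
    print(permutation_p_value(reference, model, rater))
```

The sign convention, rater κ minus model κ, matches the table: for the total cohort, 0.605 - 0.548 = 0.057 for Rater 1.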