Table 3.
Cohen’s kappa values comparing manual- and auto-scoring against three different comparators
| | All stages | | W | | N1 | | N2 | | N3 | | R | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Manual | Auto | Manual | Auto | Manual | Auto | Manual | Auto | Manual | Auto | Manual | Auto |
| **Dataset A** | | | | | | | | | | | | |
| vs. individual scorers | 0.62 ± 0.071 | 0.69 ± 0.054 | 0.80 ± 0.053 | 0.83 ± 0.052 | 0.32 ± 0.119 | 0.39 ± 0.113 | 0.59 ± 0.110 | 0.67 ± 0.081 | 0.42 ± 0.156 | 0.56 ± 0.185 | 0.79 ± 0.041 | 0.84 ± 0.041 |
| vs. unbiased consensus of scorers | 0.69 ± 0.063 | 0.78 ± 0.013 | 0.85 ± 0.046 | 0.89 ± 0.004 | 0.42 ± 0.099 | 0.51 ± 0.025 | 0.68 ± 0.094 | 0.77 ± 0.017 | 0.51 ± 0.151 | 0.69 ± 0.058 | 0.83 ± 0.032 | 0.90 ± 0.007 |
| vs. any scorer | 0.90 ± 0.050 | 0.96 ± 0.005 | 0.93 ± 0.045 | 0.96 ± 0.001 | 0.80 ± 0.089 | 0.88 ± 0.008 | 0.90 ± 0.072 | 0.96 ± 0.006 | 0.89 ± 0.113 | 0.99 ± 0.013 | 0.93 ± 0.034 | 0.98 ± 0.002 |
| **Dataset B** | | | | | | | | | | | | |
| vs. individual scorers | 0.62 ± 0.062 | 0.66 ± 0.033 | 0.76 ± 0.047 | 0.79 ± 0.030 | 0.33 ± 0.097 | 0.41 ± 0.095 | 0.60 ± 0.090 | 0.65 ± 0.048 | 0.65 ± 0.093 | 0.72 ± 0.071 | 0.80 ± 0.056 | 0.82 ± 0.040 |
| vs. unbiased consensus of scorers | 0.69 ± 0.038 | 0.75 ± 0.005 | 0.82 ± 0.042 | 0.85 ± 0.003 | 0.42 ± 0.053 | 0.53 ± 0.028 | 0.67 ± 0.057 | 0.74 ± 0.008 | 0.72 ± 0.065 | 0.81 ± 0.020 | 0.85 ± 0.047 | 0.87 ± 0.005 |
| vs. any scorer | 0.95 ± 0.028 | 0.97 ± 0.002 | 0.96 ± 0.021 | 0.98 ± 0.002 | 0.92 ± 0.036 | 0.95 ± 0.005 | 0.95 ± 0.036 | 0.98 ± 0.004 | 0.96 ± 0.033 | 1.00 ± 0.001 | 0.96 ± 0.039 | 0.97 ± 0.002 |
| **Dataset C** | | | | | | | | | | | | |
| vs. individual scorers | 0.60 ± 0.055 | 0.64 ± 0.038 | 0.75 ± 0.049 | 0.76 ± 0.041 | 0.32 ± 0.086 | 0.39 ± 0.084 | 0.57 ± 0.068 | 0.62 ± 0.052 | 0.50 ± 0.177 | 0.55 ± 0.080 | 0.84 ± 0.050 | 0.88 ± 0.036 |
| vs. unbiased consensus of scorers | 0.69 ± 0.045 | 0.76 ± 0.004 | 0.82 ± 0.042 | 0.82 ± 0.006 | 0.44 ± 0.078 | 0.54 ± 0.012 | 0.66 ± 0.057 | 0.76 ± 0.006 | 0.59 ± 0.133 | 0.75 ± 0.005 | 0.88 ± 0.046 | 0.93 ± 0.003 |
| vs. any scorer | 0.96 ± 0.024 | 0.99 ± 0.001 | 0.97 ± 0.021 | 0.98 ± 0.002 | 0.94 ± 0.036 | 0.97 ± 0.002 | 0.96 ± 0.026 | 0.99 ± 0.001 | 0.96 ± 0.034 | 1.00 ± 0.001 | 0.97 ± 0.033 | 1.00 ± 0.001 |
Data are presented as mean ± SD Cohen’s kappa values. Vs. individual scorers: the evaluated scorer (manual- or auto-scoring) is compared pairwise with each remaining scorer, and the resulting kappa values are averaged. Vs. unbiased consensus of scorers: each evaluated scorer is compared to the consensus of the remaining scorers (the unbiased consensus), and the auto-scoring is compared to the same unbiased consensus for each scorer; the resulting kappa values are averaged. Vs. any scorer: each evaluated scorer is compared to all remaining scorers, with an epoch counted as correct if at least one of the remaining scorers agrees with the evaluated scorer; the auto-scoring is compared to the same combinations of scorers, and the resulting kappa values are averaged. Kappa values of 0.21–0.40, 0.41–0.60, 0.61–0.80, and >0.80 represent fair, moderate, substantial, and almost-perfect agreement, respectively [32].
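As an illustration of the metric underlying the table, Cohen’s kappa and the averaged pairwise scheme (“vs. individual scorers”) can be sketched as follows. This is a minimal sketch, not the authors’ implementation; the function names and the representation of a hypnogram as a list of stage labels per epoch are assumptions.

```python
from collections import Counter

# AASM sleep stages used in the table (assumed label set).
STAGES = ["W", "N1", "N2", "N3", "R"]

def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length epoch-by-epoch hypnograms."""
    n = len(a)
    # Observed agreement: fraction of epochs with identical labels.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of marginal label frequencies, summed.
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[s] / n) * (cb[s] / n) for s in STAGES)
    return (po - pe) / (1 - pe)

def kappa_vs_individual_scorers(evaluated, others):
    """'Vs. individual scorers' scheme: pairwise kappas against each
    remaining scorer, then averaged (illustrative helper)."""
    kappas = [cohens_kappa(evaluated, other) for other in others]
    return sum(kappas) / len(kappas)
```

The consensus and any-scorer schemes would wrap the same `cohens_kappa` function around a different reference sequence (a majority-vote hypnogram, or a per-epoch best-match label, respectively).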