Table 3.
Cohen’s kappa values comparing manual- and auto-scoring against three different comparators
| | All stages | | W | | N1 | | N2 | | N3 | | R | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Manual | Auto | Manual | Auto | Manual | Auto | Manual | Auto | Manual | Auto | Manual | Auto |
| **Dataset A** | | | | | | | | | | | | |
| vs. individual scorers | 0.62 ± 0.071 | 0.69 ± 0.054 | 0.80 ± 0.053 | 0.83 ± 0.052 | 0.32 ± 0.119 | 0.39 ± 0.113 | 0.59 ± 0.110 | 0.67 ± 0.081 | 0.42 ± 0.156 | 0.56 ± 0.185 | 0.79 ± 0.041 | 0.84 ± 0.041 |
| vs. unbiased consensus of scorers | 0.69 ± 0.063 | 0.78 ± 0.013 | 0.85 ± 0.046 | 0.89 ± 0.004 | 0.42 ± 0.099 | 0.51 ± 0.025 | 0.68 ± 0.094 | 0.77 ± 0.017 | 0.51 ± 0.151 | 0.69 ± 0.058 | 0.83 ± 0.032 | 0.90 ± 0.007 |
| vs. any scorer | 0.90 ± 0.050 | 0.96 ± 0.005 | 0.93 ± 0.045 | 0.96 ± 0.001 | 0.80 ± 0.089 | 0.88 ± 0.008 | 0.90 ± 0.072 | 0.96 ± 0.006 | 0.89 ± 0.113 | 0.99 ± 0.013 | 0.93 ± 0.034 | 0.98 ± 0.002 |
| **Dataset B** | | | | | | | | | | | | |
| vs. individual scorers | 0.62 ± 0.062 | 0.66 ± 0.033 | 0.76 ± 0.047 | 0.79 ± 0.030 | 0.33 ± 0.097 | 0.41 ± 0.095 | 0.60 ± 0.090 | 0.65 ± 0.048 | 0.65 ± 0.093 | 0.72 ± 0.071 | 0.80 ± 0.056 | 0.82 ± 0.040 |
| vs. unbiased consensus of scorers | 0.69 ± 0.038 | 0.75 ± 0.005 | 0.82 ± 0.042 | 0.85 ± 0.003 | 0.42 ± 0.053 | 0.53 ± 0.028 | 0.67 ± 0.057 | 0.74 ± 0.008 | 0.72 ± 0.065 | 0.81 ± 0.020 | 0.85 ± 0.047 | 0.87 ± 0.005 |
| vs. any scorer | 0.95 ± 0.028 | 0.97 ± 0.002 | 0.96 ± 0.021 | 0.98 ± 0.002 | 0.92 ± 0.036 | 0.95 ± 0.005 | 0.95 ± 0.036 | 0.98 ± 0.004 | 0.96 ± 0.033 | 1.00 ± 0.001 | 0.96 ± 0.039 | 0.97 ± 0.002 |
| **Dataset C** | | | | | | | | | | | | |
| vs. individual scorers | 0.60 ± 0.055 | 0.64 ± 0.038 | 0.75 ± 0.049 | 0.76 ± 0.041 | 0.32 ± 0.086 | 0.39 ± 0.084 | 0.57 ± 0.068 | 0.62 ± 0.052 | 0.50 ± 0.177 | 0.55 ± 0.080 | 0.84 ± 0.050 | 0.88 ± 0.036 |
| vs. unbiased consensus of scorers | 0.69 ± 0.045 | 0.76 ± 0.004 | 0.82 ± 0.042 | 0.82 ± 0.006 | 0.44 ± 0.078 | 0.54 ± 0.012 | 0.66 ± 0.057 | 0.76 ± 0.006 | 0.59 ± 0.133 | 0.75 ± 0.005 | 0.88 ± 0.046 | 0.93 ± 0.003 |
| vs. any scorer | 0.96 ± 0.024 | 0.99 ± 0.001 | 0.97 ± 0.021 | 0.98 ± 0.002 | 0.94 ± 0.036 | 0.97 ± 0.002 | 0.96 ± 0.026 | 0.99 ± 0.001 | 0.96 ± 0.034 | 1.00 ± 0.001 | 0.97 ± 0.033 | 1.00 ± 0.001 |
Data are presented as mean ± SD Cohen’s kappa values. Vs. individual scorers: the evaluated scorer (manual- or auto-scoring) is compared pairwise with each remaining scorer, and the resulting kappa values are averaged. Vs. unbiased consensus of scorers: each evaluated scorer is compared to the consensus of the remaining scorers (the unbiased consensus), and the auto-scoring is compared to the same unbiased consensus for each scorer; the resulting kappa values are averaged. Vs. any scorer: each evaluated scorer is compared to all remaining scorers, with an epoch counted as correct if at least one of the remaining scorers agrees with the evaluated scorer; the auto-scoring is compared to the same combinations of scorers, and the resulting kappa values are averaged. Kappa values of 0.21–0.40, 0.41–0.60, 0.61–0.80, and >0.80 represent fair, moderate, substantial, and almost-perfect agreement, respectively [32].
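As an illustration of the metric underlying the table, Cohen’s kappa and the averaged pairwise scheme (“vs. individual scorers”) can be sketched as follows. This is a minimal sketch, not the authors’ implementation; the function names and the representation of a hypnogram as a list of stage labels per epoch are assumptions.

```python
from collections import Counter

# AASM sleep stages used in the table (assumed label set).
STAGES = ["W", "N1", "N2", "N3", "R"]

def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length epoch-by-epoch hypnograms."""
    n = len(a)
    # Observed agreement: fraction of epochs with identical labels.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of marginal label frequencies, summed.
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[s] / n) * (cb[s] / n) for s in STAGES)
    return (po - pe) / (1 - pe)

def kappa_vs_individual_scorers(evaluated, others):
    """'Vs. individual scorers' scheme: pairwise kappas against each
    remaining scorer, then averaged (illustrative helper)."""
    kappas = [cohens_kappa(evaluated, other) for other in others]
    return sum(kappas) / len(kappas)
```

The consensus and any-scorer schemes would wrap the same `cohens_kappa` function around a different reference sequence (a majority-vote hypnogram, or a per-epoch best-match label, respectively).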