2021 Apr 15;4:72. doi: 10.1038/s41746-021-00440-5

Table 3.

Consensus score results on datasets (a) DOD-H and (b) DOD-O.

DOD-H: Healthy controls, N = 25
Scorer	Fit	Wake	N1	N2	N3	REM	Mean
Expert 1	–	0.83 ± 0.11	0.49 ± 0.15	0.86 ± 0.12	0.78 ± 0.24	0.84 ± 0.16	0.76 ± 0.11
Expert 2	–	0.83 ± 0.14	0.52 ± 0.11	0.88 ± 0.05	0.78 ± 0.23	0.89 ± 0.06	0.78 ± 0.07
Expert 3	–	0.84 ± 0.12	0.54 ± 0.13	0.88 ± 0.05	0.74 ± 0.25	0.93 ± 0.05	0.79 ± 0.07
Expert 4	–	0.73 ± 0.18	0.40 ± 0.15	0.83 ± 0.07	0.75 ± 0.22	0.90 ± 0.09	0.72 ± 0.11
Expert 5	–	0.83 ± 0.14	0.53 ± 0.12	0.89 ± 0.04	0.76 ± 0.24	0.90 ± 0.09	0.78 ± 0.08
U-Sleep	✗	0.88 ± 0.10	0.56 ± 0.14	0.86 ± 0.05	0.73 ± 0.23	0.93 ± 0.05	0.79 ± 0.06
SimpleNet	✓	0.83 ± 0.13	0.57 ± 0.14	0.90 ± 0.04	0.80 ± 0.23	0.90 ± 0.09	0.80 ± 0.07
DeepSleepNet	✓	0.84 ± 0.10	0.56 ± 0.13	0.90 ± 0.05	0.79 ± 0.24	0.88 ± 0.10	0.79 ± 0.07
SeqSleepNet	✓	0.81 ± 0.18	0.54 ± 0.14	0.87 ± 0.08	0.73 ± 0.25	0.86 ± 0.12	0.76 ± 0.11

DOD-O: OSA patients, N = 55
Scorer	Fit	Wake	N1	N2	N3	REM	Mean
Expert 1	–	0.87 ± 0.11	0.38 ± 0.15	0.82 ± 0.13	0.59 ± 0.31	0.81 ± 0.25	0.69 ± 0.12
Expert 2	–	0.87 ± 0.09	0.46 ± 0.17	0.82 ± 0.11	0.61 ± 0.29	0.86 ± 0.22	0.72 ± 0.12
Expert 3	–	0.88 ± 0.09	0.42 ± 0.16	0.83 ± 0.13	0.46 ± 0.33	0.85 ± 0.22	0.69 ± 0.11
Expert 4	–	0.89 ± 0.09	0.46 ± 0.15	0.84 ± 0.07	0.52 ± 0.33	0.83 ± 0.24	0.71 ± 0.12
Expert 5	–	0.90 ± 0.08	0.48 ± 0.15	0.86 ± 0.08	0.62 ± 0.33	0.85 ± 0.22	0.74 ± 0.11
U-Sleep	✗	0.89 ± 0.09	0.53 ± 0.14	0.85 ± 0.08	0.66 ± 0.30	0.88 ± 0.20	0.76 ± 0.10
SimpleNet	✓	0.89 ± 0.09	0.52 ± 0.16	0.88 ± 0.11	0.63 ± 0.35	0.85 ± 0.22	0.75 ± 0.11
DeepSleepNet	✓	0.86 ± 0.11	0.46 ± 0.17	0.87 ± 0.10	0.67 ± 0.30	0.84 ± 0.22	0.74 ± 0.12
SeqSleepNet	✓	0.84 ± 0.13	0.46 ± 0.20	0.86 ± 0.10	0.59 ± 0.33	0.77 ± 0.28	0.71 ± 0.14

Highest scores from human experts and U-Sleep are highlighted in bold. Scores where one of the trained ML models (last three rows) performed as well as or better than U-Sleep are underlined. However, these models were fit to the particular datasets, whereas U-Sleep did not see any data from DOD-H or DOD-O during model building and training; this is indicated by checkmarks (fit) or crosses (not fit) in the Fit column. Numbers shown are the mean ± 1 standard deviation of per-subject F1 scores computed between the output of a single model or human expert and the consensus scores. The consensus is generated from the 4 (N − 1) remaining human annotators when comparing a human to the consensus, or from the 4 best human annotators when comparing a model to the consensus.
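
The human-versus-consensus scoring described in the note above could be sketched roughly as follows. This is a minimal Python illustration and not the authors' code. It assumes sleep stages are encoded as integers 0 to 4 and builds the consensus by a plain per-epoch majority vote, with ties going to the lowest stage index; the note does not state how the consensus is formed or how ties are broken, so both are assumptions. The function names `majority_consensus`, `per_stage_f1`, and `human_vs_consensus` are hypothetical.

```python
import numpy as np

STAGES = ("Wake", "N1", "N2", "N3", "REM")  # assumed integer encoding 0..4


def majority_consensus(annotations):
    """Per-epoch majority vote over an (n_scorers, n_epochs) integer array.

    Assumption: ties go to the lowest stage index, because the source
    does not state its tie-breaking rule.
    """
    annotations = np.asarray(annotations)
    # counts[s, e] = number of scorers who assigned stage s to epoch e
    counts = np.stack(
        [(annotations == s).sum(axis=0) for s in range(len(STAGES))]
    )
    return counts.argmax(axis=0)


def per_stage_f1(pred, ref):
    """F1 of `pred` against `ref` per stage (NaN if a stage occurs in neither)."""
    pred, ref = np.asarray(pred), np.asarray(ref)
    scores = []
    for s in range(len(STAGES)):
        tp = np.sum((pred == s) & (ref == s))
        fp = np.sum((pred == s) & (ref != s))
        fn = np.sum((pred != s) & (ref == s))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else np.nan)
    return np.array(scores)


def human_vs_consensus(annotations, scorer):
    """Score one human against the consensus of the remaining N - 1 scorers."""
    annotations = np.asarray(annotations)
    others = np.delete(annotations, scorer, axis=0)
    return per_stage_f1(annotations[scorer], majority_consensus(others))


# Toy hypnogram for one subject: 5 scorers (rows) x 8 epochs (columns).
toy = np.array([
    [0, 0, 1, 2, 2, 3, 4, 4],
    [0, 0, 2, 2, 2, 3, 4, 4],
    [0, 1, 1, 2, 2, 3, 3, 4],
    [0, 0, 1, 2, 2, 3, 4, 4],
    [0, 0, 1, 2, 3, 3, 4, 0],
])
for name, f1 in zip(STAGES, human_vs_consensus(toy, scorer=4)):
    print(f"{name}: {f1:.2f}")
```

Under these assumptions, each table cell would correspond to applying such a function to every subject in a dataset and then taking the mean and standard deviation across subjects (for example with `np.nanmean` and `np.nanstd`). For a model row, the model's prediction would instead be compared against a consensus built from the 4 best human annotators; how those 4 are selected is not described in the note and is left out of this sketch.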