2021 Apr 15;4:72. doi: 10.1038/s41746-021-00440-5

Table 3.

Consensus score results on datasets (a) DOD-H and (b) DOD-O.

DOD-H: Healthy controls, N = 25

| Scorer | Fit | Wake | N1 | N2 | N3 | REM | Mean |
|---|---|---|---|---|---|---|---|
| Expert 1 | | 0.83 ± 0.11 | 0.49 ± 0.15 | 0.86 ± 0.12 | 0.78 ± 0.24 | 0.84 ± 0.16 | 0.76 ± 0.11 |
| Expert 2 | | 0.83 ± 0.14 | 0.52 ± 0.11 | 0.88 ± 0.05 | 0.78 ± 0.23 | 0.89 ± 0.06 | 0.78 ± 0.07 |
| Expert 3 | | 0.84 ± 0.12 | 0.54 ± 0.13 | 0.88 ± 0.05 | 0.74 ± 0.25 | 0.93 ± 0.05 | 0.79 ± 0.07 |
| Expert 4 | | 0.73 ± 0.18 | 0.40 ± 0.15 | 0.83 ± 0.07 | 0.75 ± 0.22 | 0.90 ± 0.09 | 0.72 ± 0.11 |
| Expert 5 | | 0.83 ± 0.14 | 0.53 ± 0.12 | 0.89 ± 0.04 | 0.76 ± 0.24 | 0.90 ± 0.09 | 0.78 ± 0.08 |
| U-Sleep | ✗ | 0.88 ± 0.10 | 0.56 ± 0.14 | 0.86 ± 0.05 | 0.73 ± 0.23 | 0.93 ± 0.05 | 0.79 ± 0.06 |
| SimpleNet | ✓ | 0.83 ± 0.13 | 0.57 ± 0.14 | 0.90 ± 0.04 | 0.80 ± 0.23 | 0.90 ± 0.09 | 0.80 ± 0.07 |
| DeepSleepNet | ✓ | 0.84 ± 0.10 | 0.56 ± 0.13 | 0.90 ± 0.05 | 0.79 ± 0.24 | 0.88 ± 0.10 | 0.79 ± 0.07 |
| SeqSleepNet | ✓ | 0.81 ± 0.18 | 0.54 ± 0.14 | 0.87 ± 0.08 | 0.73 ± 0.25 | 0.86 ± 0.12 | 0.76 ± 0.11 |

DOD-O: OSA patients, N = 55

| Scorer | Fit | Wake | N1 | N2 | N3 | REM | Mean |
|---|---|---|---|---|---|---|---|
| Expert 1 | | 0.87 ± 0.11 | 0.38 ± 0.15 | 0.82 ± 0.13 | 0.59 ± 0.31 | 0.81 ± 0.25 | 0.69 ± 0.12 |
| Expert 2 | | 0.87 ± 0.09 | 0.46 ± 0.17 | 0.82 ± 0.11 | 0.61 ± 0.29 | 0.86 ± 0.22 | 0.72 ± 0.12 |
| Expert 3 | | 0.88 ± 0.09 | 0.42 ± 0.16 | 0.83 ± 0.13 | 0.46 ± 0.33 | 0.85 ± 0.22 | 0.69 ± 0.11 |
| Expert 4 | | 0.89 ± 0.09 | 0.46 ± 0.15 | 0.84 ± 0.07 | 0.52 ± 0.33 | 0.83 ± 0.24 | 0.71 ± 0.12 |
| Expert 5 | | 0.90 ± 0.08 | 0.48 ± 0.15 | 0.86 ± 0.08 | 0.62 ± 0.33 | 0.85 ± 0.22 | 0.74 ± 0.11 |
| U-Sleep | ✗ | 0.89 ± 0.09 | 0.53 ± 0.14 | 0.85 ± 0.08 | 0.66 ± 0.30 | 0.88 ± 0.20 | 0.76 ± 0.10 |
| SimpleNet | ✓ | 0.89 ± 0.09 | 0.52 ± 0.16 | 0.88 ± 0.11 | 0.63 ± 0.35 | 0.85 ± 0.22 | 0.75 ± 0.11 |
| DeepSleepNet | ✓ | 0.86 ± 0.11 | 0.46 ± 0.17 | 0.87 ± 0.10 | 0.67 ± 0.30 | 0.84 ± 0.22 | 0.74 ± 0.12 |
| SeqSleepNet | ✓ | 0.84 ± 0.13 | 0.46 ± 0.20 | 0.86 ± 0.10 | 0.59 ± 0.33 | 0.77 ± 0.28 | 0.71 ± 0.14 |

Highest scores from human experts and U-Sleep are highlighted in bold. Scores where one of the trained ML models (last three rows) performed as well as or better than U-Sleep are indicated by underlined numbers. However, these models were fit to the particular datasets, while U-Sleep did not see any data from DOD-H or DOD-O during model building and training, as indicated by checkmarks or crosses in the Fit column. Numbers shown are the mean ± 1 standard deviation of per-subject F1 scores, computed between the output of a single model or human expert and consensus scores. The consensus is generated from the four (N − 1) remaining human annotators when comparing a human to the consensus, or from the four best human annotators when comparing a model to the consensus.
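
The scoring protocol described in the note above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The integer stage encoding, the function names (`majority_vote`, `per_class_f1`, `expert_vs_consensus`, `summarize`), the use of a plain majority vote to form the consensus, and the tie-breaking rule (ties go to the lowest stage index) are all assumptions made for illustration; the source does not specify how the consensus is formed or how ties are resolved.

```python
import numpy as np

# Assumed integer encoding of sleep stages (hypothetical, for this sketch only).
STAGES = ["Wake", "N1", "N2", "N3", "REM"]


def majority_vote(annotations):
    """Per-epoch majority vote over annotators.

    annotations: int array of shape (n_annotators, n_epochs).
    Ties go to the lowest stage index (a simplification, not the paper's rule).
    """
    annotations = np.asarray(annotations)
    counts = np.stack([(annotations == k).sum(axis=0) for k in range(len(STAGES))])
    return counts.argmax(axis=0)


def per_class_f1(pred, target):
    """F1 score per stage; NaN when a stage is absent from both sequences."""
    pred, target = np.asarray(pred), np.asarray(target)
    scores = []
    for k in range(len(STAGES)):
        tp = np.sum((pred == k) & (target == k))
        fp = np.sum((pred == k) & (target != k))
        fn = np.sum((pred != k) & (target == k))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else np.nan)
    return np.array(scores)


def expert_vs_consensus(annotations, expert_idx):
    """Score one expert against the consensus of the remaining N - 1 experts."""
    annotations = np.asarray(annotations)
    others = np.delete(annotations, expert_idx, axis=0)
    return per_class_f1(annotations[expert_idx], majority_vote(others))


def summarize(per_subject_scores):
    """Mean and (population) standard deviation across subjects, ignoring NaNs."""
    scores = np.asarray(per_subject_scores, dtype=float)
    return np.nanmean(scores, axis=0), np.nanstd(scores, axis=0)
```

For one subject, `expert_vs_consensus(annotations, 0)` returns five per-stage F1 scores for the first expert. Stacking these rows over all subjects and passing them to `summarize` gives mean ± standard deviation values of the kind reported in the table. Scoring a model would instead compare its predictions with a consensus built from the four best human annotators, whose selection is not described here.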