Abstract
Study Objectives
Conventional sleep staging relies on 30-s epochs, potentially concealing transient sleep stage intrusions and reducing precision. Building on our previous study of mini-epochs, we investigated whether U-Sleep, an existing automatic deep learning-based sleep staging model with high performance in 30-s epochs, could be optimized to a similar performance level in 5-s mini-epoch scoring, thereby enabling more detailed sleep characterization.
Methods
We created a dataset of 48 000 human-scored 5-s mini-epochs from 100 polysomnographies. We first compared human-scored mini-epochs to human-scored epochs; U-Sleep was then optimized using transfer learning and evaluated on a test set. Model performance was assessed using F1-scores, confusion matrices, stage distributions, and transition rates, comparing the scorings of the original U-Sleep (before transfer learning) and the optimized U-Sleep (after transfer learning) to human-scored mini-epochs.
Results
Compared to human-scored epochs, human-scored mini-epochs captured significantly more transitions (1.70/min vs. 0.21/min, p < .001), and significantly more wake (8.4 per cent vs. 5.4 per cent), N1 (7.2 per cent vs. 5.4 per cent), and N2 (51.8 per cent vs. 40.9 per cent), less N3 (15.4 per cent vs. 25.2 per cent), and REM sleep (16.7 per cent vs. 23.0 per cent) (all p < .001). Optimizing U-Sleep improved its performance significantly from F1 = 0.74 to F1 = 0.81 (p < .05) and gave increased transition rates in the test set (original U-Sleep: 1.06/min, optimized U-Sleep: 1.34/min, human-scored mini-epochs: 1.70/min). Stage distributions did not differ between optimized U-Sleep’s scorings and human-scored mini-epochs.
Conclusion
After optimization, U-Sleep performance in mini-epochs matched the high performance levels previously reported in both human and automated 30-s epoch scoring. This demonstrates the feasibility of precise, automated high-resolution sleep staging. Future work should include external validation and application to full-night recordings.
Statement of Significance
Conventional 30-s epochs limit temporal resolution in sleep staging and may conceal transient intrusions of wake or sleep stages. However, no validated methods are available for high-resolution scoring. In this study, we trained and validated the state-of-the-art deep learning model U-Sleep for accurate automatic 5-s mini-epoch scoring using a large dataset of human-scored mini-epochs. The optimized model achieved a high performance, matching levels from previously reported automatic and human epoch scoring. Compared to epoch scoring, mini-epochs captured significantly more stage transitions, supporting their ability to uncover sleep dynamics that are otherwise lost. Our findings show the potential of high-resolution sleep staging for more detailed characterization of sleep architecture and demonstrate the feasibility of precise, automatic mini-epoch scoring.
Keywords: polysomnography, mini-epochs, sleep staging, automatic sleep classification, U-sleep, artificial intelligence
Graphical Abstract
Introduction
Accurate sleep staging is fundamental for the assessment of sleep disorders and the understanding of sleep physiology. Traditionally, sleep stages are scored manually in 30-s epochs based on visual evaluation of polysomnography (PSG) recordings. Early sleep staging relied on paper recordings, where each paper sheet could include 30 s, establishing the 30-s epoch length [1]. The first sleep stage scoring manual was defined by Rechtschaffen and Kales (R&K) in 1968 [2]. The American Academy of Sleep Medicine (AASM) updated the scoring guidelines in 2007 [3], refining staging criteria while maintaining the 30-s epoch length for simplicity [4, 5].
However, the fixed 30-s epoch length does not fully reflect the continuous and dynamic nature of sleep, and characteristics from multiple sleep stages can appear within a single 30-s epoch. This has been illustrated by several studies including hypnodensity plots, which show the probability distribution of stages co-occurring within the same epoch [6–8]. The AASM guideline addresses the issue of co-occurring stages by instructing scorers to classify an epoch based on the predominant sleep stage [3]. While this approach potentially simplifies scoring, it may also contribute to inter-rater variability when scorers must, for example, decide which of several co-occurring sleep stages fills the majority of the epoch [9]. Moreover, clinical experience suggests that 30-s epochs may conceal transient sleep stage intrusions and microstructural dynamics, such as frequent stage transitions, which could be clinically relevant. Several studies further suggest that important sleep dynamics can occur within shorter time windows than the conventional 30-s epochs allow [10–15]. In line with this, a recent review described 30-s epoch sleep staging as “temporal compression” and highlighted high-resolution sleep staging (scoring of shorter segments) as a solution for more precise sleep characterization, though such approaches are generally challenged by the lack of a gold standard [16]. Imbach et al. [17] systematically analyzed sleep in human-scored 5-s intervals based on AASM scoring rules, first scoring 30-s epochs and then splitting these into 5-s intervals to visually adapt the sleep stage when needed. Their small, unblinded study demonstrated that approximately 15 per cent of 30-s epochs contained more than one sleep stage, suggesting that the higher temporal resolution of 5-s scoring may describe sleep more accurately.
They further observed interhemispheric oscillations that were not visible at the 30-s resolution, indicating that clinically relevant dynamics may be masked by conventional epoch scoring. Similarly, Schoch et al. [18] applied human 5-s scoring in narcolepsy type 1 (NT1) patients and healthy controls and reported 83.5 per cent agreement between 5-s and 30-s scorings.
However, scoring a full PSG in 5-s segments is not feasible for human scorers due to the workload involved. Automation is therefore required to make such approaches feasible in clinical practice, though human-scored data remain essential as a reference anchor to ensure clinical trust and acceptance.
Automated sleep staging of shorter segments has been explored [6, 12, 13, 19, 20]. Cesari et al. [21] applied 5-s scoring in patients with Parkinson’s disease and found increased microstructural sleep instability, i.e. fragmented NREM sleep and short REM episodes, that was not captured in 30-s scoring, and Perslev et al. [22] found that analyzing segments shorter than 30-s epochs improved automatic detection of sleep apnea. While these studies demonstrate the potential of higher temporal resolution sleep stage scoring, none have systematically retrained and validated a deep learning model for high-resolution sleep staging, achieving a performance equivalent to human epoch scoring (i.e. a clinically acceptable performance) on a large dataset of human-scored mini-epochs.
U-Sleep, a state-of-the-art deep learning model with flexible scoring resolution, has shown high performance (F1-score of 0.79) in automated scoring of conventional 30-s epochs [22]. However, despite its flexibility to score sleep at higher temporal resolutions, it has only been trained on and validated against human-scored 30-s epochs. Previously, our group evaluated the originally published U-Sleep model’s performance in 5-s mini-epochs by comparing its mini-epoch scorings to human-scored 5-s mini-epochs [23]. We found significantly more stage transitions when the PSG was assessed in mini-epochs than in 30-s epochs, supporting that epochs may conceal important aspects of sleep dynamics. Although the original U-Sleep model was indeed able to classify sleep in shorter segments, its agreement with human-scored mini-epochs was lower than with human-scored epochs (F1-score of 0.54 for mini-epochs vs. 0.78 for epochs) [23]. This was likely due to the model’s original epoch-based training (i.e. it was trained to browse through longer intervals and ignore smaller intermittent stage characteristics if another stage was overall predominant), suggesting that it must be trained on human-scored mini-epochs to obtain a high performance in the mini-epoch setting. However, training an automated classifier to score 5-s mini-epochs at a level comparable to how the original U-Sleep model performs in 30-s epochs would require a substantial number of human-scored mini-epochs, an infeasible human workload. For instance, the original U-Sleep model was trained on 19 924 full-night PSGs, comprising millions of 30-s epochs [22].
A potential solution to this challenge is transfer learning, which enables models originally trained on large datasets to be fine-tuned and optimized for specific tasks using limited new data. Transfer learning involves re-training a pre-trained model to adapt it to a new but related task. Ganglberger et al. [24] showed that transfer learning could effectively fine-tune existing sleep scoring models across different datasets, reducing the dependency on large human-scored datasets. Similarly, Guillot and Thorey [25] applied transfer learning to adapt an automated sleep classifier to diverse PSG montages and demographics suggesting it as a relevant tool for optimizing existing models.
In the present study, we apply transfer learning to U-Sleep, aiming to optimize its performance in 5-s mini-epoch sleep staging using 48 000 human-scored mini-epochs for re-training, validation, and testing. Moreover, we compare our human and automatically scored mini-epochs to the conventional human-scored 30-s epochs. High-resolution sleep staging has the potential to approximate “biological sleep” more closely, capturing brief wake intrusions and other short stage shifts that are mostly ignored in conventional epoch scoring. Clinically, high-resolution staging may enable a more precise and clinically relevant description of sleep, which is particularly needed in complex sleep disorders such as narcolepsy, where sleep is known to be highly fragmented [26, 27], and in the narcolepsy borderland, where diagnostic test–retest stability is low and specific biomarkers are missing [28, 29]. The goal of the study is to facilitate a fast, more detailed, and precise sleep stage analysis by optimizing automated scoring of 5-s mini-epochs.
Materials and Methods
Cohort
In the period from February 2015 to July 2024, 155 NT1 patients and their 123 non-narcoleptic siblings were recruited to The Norwegian Centre of Expertise for Neurodevelopmental Disorders and Hypersomnias (NevSom), Department of Rare Diagnosis, Oslo University Hospital (OUS). Diagnosis was confirmed in patients and excluded in siblings according to the ICSD-3 criteria [30] by European Sleep Research Society (ESRS)-certified somnologists and NT1 experts (S.K.H. and B.H.H.). Aiming to include mini-epochs across different health profiles for the optimization of U-Sleep, we randomly selected 100 PSGs from this cohort (38 NT1 patients (age 22.3 ± 8.7 years, 23 females [60.5 per cent]) and 62 non-narcoleptic siblings (age 23.2 ± 12.3 years, 36 females [58.1 per cent])), with no significant differences in age or sex between groups, to be included in the present study. As the selection of individuals was random, the sample included both NT1 patients with and without a non-narcoleptic sibling, and non-narcoleptic siblings with and without a related NT1 patient in the sample. PSG exclusion criteria were a total sleep time <6 h and/or an apnea-hypopnea index >5.
The patients paused their use of stimulants, antidepressants, and/or sodium oxybate for 14 days prior to the PSG. One included patient only paused modafinil 9 days before the PSG. Siblings were not required to pause medication. Two patients and five siblings had either allergies, epilepsy, ulcerative colitis, or hypothyroidism and used medications with potential side effects of drowsiness or somnolence reported in 1–10 per cent of users (loratadine, desloratadine, lamotrigine, mesalamine, and levothyroxine); however, the siblings reported no subjective increase in sleepiness and their PSGs were normal.
The study is ethically approved by the Norwegian South-East Regional Committees for Medical and Health Research Ethics (REK) (2014/450), and written informed consent was signed by all individuals or, in the case of minors <16 years, their parents. A subset of the individuals was included in previous studies conducted by our research group [23, 31–41].
Data
The 100 PSGs used for 5-s mini-epoch scorings were recorded using the SOMNOmedics Plus system (DOMINO software, version 2.9.0, SOMNOmedics, Randersacker, Germany), following the recording standards specified by the AASM [3]. The PSG data included the following channels: EOG1:A2, EOG2:A1, C3:A2, C4:A1, F3:A2, F4:A1, O1:A2, O2:A1 (sampled at 256 Hz), and two electromyography (EMG) m. submentalis channels (sampled at 256 or 512 Hz). To enhance the chance of covering all sleep stages, two 20-min segments (240 mini-epochs per segment, i.e. a total of 480 mini-epochs per PSG) were randomly selected from each PSG between the “lights off” and “lights on” marks. To obtain a diverse dataset, we selected 100 PSGs from our cohort as a compromise between including more PSGs with shorter segments or fewer PSGs with longer segments. We specifically chose the 20-min segment length to balance the workload of human mini-epoch scoring with the need to retain sufficient temporal context for the model, and to align with the segment length of 17.5 min used in the original U-Sleep study by Perslev et al. [22] In total, the dataset comprised 48 000 human-scored 5-s mini-epochs, all manually scored by an experienced human scorer (J.V., an ESRS-certified somnologist and sleep technician).
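The random selection of two non-overlapping 20-min segments per PSG can be sketched as follows. This is an illustrative rejection-sampling approach, assuming indices on the 5-s mini-epoch grid between the lights-off and lights-on marks; `pick_segments` is a hypothetical helper, not the authors' actual selection code.

```python
import numpy as np

def pick_segments(n_miniepochs, seg_len=240, n_segments=2, rng=None):
    """Randomly pick non-overlapping segment start indices (in 5-s
    mini-epochs); seg_len=240 corresponds to a 20-min segment."""
    if rng is None:
        rng = np.random.default_rng()
    starts = []
    while len(starts) < n_segments:
        s = int(rng.integers(0, n_miniepochs - seg_len + 1))
        # reject candidates overlapping an already chosen segment
        if all(abs(s - t) >= seg_len for t in starts):
            starts.append(s)
    return sorted(starts)

# Example: an 8-h night between the marks holds 8 * 60 * 12 = 5760 mini-epochs
starts = pick_segments(5760, rng=np.random.default_rng(0))
```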
Human mini-epoch scoring procedure
The scoring procedure followed a structured protocol developed specifically for 5-s mini-epochs. An overview of the procedure is shown in Figure 1. Initially, the human scorer spent 10 min scrolling in SOMNOmedics DOMINO (the scoring software used in our clinical practice) to get familiar with the PSG data. The human scorer was blinded to the individuals’ age, sex, and patient/sibling status, as well as to the nocturnal timing and the full hypnogram.
Figure 1.
Overview of the human scoring procedure for 5-s mini-epochs. The human scorer first reviewed the full-night PSG in DOMINO for 10 min, then scored two 20-min segments (240 mini-epochs each) per PSG in MATLAB. The scoring decision for each mini-epoch follows a structured decision tree: If no clear characteristics were present, the most likely stage was determined based on surrounding context and background EEG activity, and the mini-epoch was marked as “without characteristics.” If stage-specific characteristics from only a single sleep stage were present, the corresponding stage was scored. If characteristics from multiple stages were present, the predominant stage was scored. This procedure was repeated for all 100 PSGs in the dataset.
The mini-epochs were then shown and scored in a MATLAB 2020b based scoring interface with a setup mimicking the SOMNOmedics DOMINO software layout. To illustrate the scoring interface, an example of a mini-epoch is provided in Figure S1. The human scorer was allowed to scroll back and forth between the mini-epochs within a 20-min segment. Each mini-epoch was scored as independently as possible; however, it was allowed to rely on context from previous or subsequent mini-epochs to support decision-making.
Scoring rules
Our proposed rules for mini-epoch scoring are an adaptation of the AASM rules developed collaboratively in a group of sleep clinicians, including two ESRS certified somnologists (S.K.H. and J.V.), and sleep researchers/technicians (L.F.F., J.A.E.C., R.V., and H.T.J.). This process involved thorough discussions and repeated examinations and reviews of mini-epochs to refine the scoring rules. The full mini-epochs scoring manual is available in Supplementary Material.
The mini-epochs were scored based on the presence of stage-specific characteristics, as summarized in Table 1 and exemplified in Figure 1. Notably, if clear characteristics were absent in a mini-epoch, the surrounding context was used to determine the most likely stage, and the given mini-epoch (N1, N2, or REM) was additionally marked as “without characteristics”; this was not applicable for wake and N3 (where characteristics are mandatory).
Table 1.
Stage-specific characteristics used for human sleep stage scoring in 5-s mini-epochs.
| Stage and characteristics | Scoring rules |
|---|---|
| Wake: ≥50% alpha activity | Score wake if alpha activity is present in more than 50% of the mini-epoch. |
| N1: slow eye movements, vertex sharp waves | Score N1 if a slow eye movement or vertex sharp wave is present. If characteristics are absent, but the surrounding mini-epochs are scored as N1, and/or the background EEG resembles N1 or does not suggest another stage, score “N1 without characteristics.” |
| N2: sleep spindles, K-complexes | Score N2 if a sleep spindle or K-complex is present. If characteristics are absent, but the surrounding mini-epochs are scored as N2, and/or the background EEG resembles N2 or does not suggest another stage, score “N2 without characteristics.” |
| N3: high-amplitude, low-frequency delta waves (SWS) | Score N3 only if high-amplitude, low-frequency delta waves (SWS) are present in more than 50% of the mini-epoch. |
| REM: rapid eye movements | Score REM if a rapid eye movement is present. If no rapid eye movements are observed, but the surrounding mini-epochs are scored as REM, and/or the background EEG resembles REM or does not suggest another stage, score “REM without characteristics.” |
Automated mini-epoch scoring procedure
As in our previous smaller study of 5-s mini-epochs [23], we used U-Sleep 2.0, a deep learning model developed by Perslev et al. [22], as the basis for automated sleep stage classification. The U-Sleep model is a fully convolutional neural network that returns a hypnogram based on a single electroencephalography/electrooculography (EEG/EOG) channel combination. If data from more channels are present in the PSG, the hypnogram is determined by majority voting across all possible EEG/EOG combinations. U-Sleep pre-processes the input data by resampling the signals to 128 Hz and removing signal outliers.
In the present study, U-Sleep predicted sleep stages in each 20-min segment, using the six EEG channels (C3:A2, C4:A1, F3:A2, F4:A1, O1:A2, and O2:A1) and the two EOG channels (EOG1:A2 and EOG2:A1). Application of the original U-Sleep model per segment created a baseline mini-epoch scoring.
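The aggregation of per-channel-pair predictions into one hypnogram can be sketched as below. Whether the original U-Sleep code votes on hard labels or averages softmax outputs is an implementation detail of that code; this sketch assumes hard-label majority voting over the 12 EEG/EOG pairs.

```python
import numpy as np

def majority_vote(per_pair_probs):
    """Combine per-channel-pair predictions by majority vote.

    per_pair_probs: array of shape (n_pairs, n_miniepochs, n_stages)
    holding one softmax output per EEG/EOG combination (here, 6 EEG x
    2 EOG = 12 pairs). Each pair casts one hard vote per mini-epoch.
    """
    votes = per_pair_probs.argmax(axis=-1)           # (n_pairs, n_miniepochs)
    n_stages = per_pair_probs.shape[-1]
    # count the votes each stage received at every mini-epoch
    counts = np.stack([(votes == s).sum(axis=0) for s in range(n_stages)])
    return counts.argmax(axis=0)                     # (n_miniepochs,)
```

Summing probabilities instead of counting hard votes is a common alternative that also resolves ties smoothly.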
The source code for the automated U-Sleep model [22] was obtained via a link available in the original article (https://github.com/perslev/U-Time, downloaded March 29, 2023), and the weights for the pre-trained neural network were kindly made available by the original U-Sleep authors on request.
Transfer learning
After establishing the U-Sleep baseline mini-epoch scoring, we optimized the original U-Sleep model via transfer learning on the 20-min segments of human-scored 5-s mini-epochs; since the original U-Sleep model was trained on 17.5-min segments, this provided the model with a similar duration of input [22]. The human-scored PSG segments were divided into three sets used for training (n = 80), validation (n = 10), and testing (n = 10) (Table 2), ensuring that both 20-min segments from each PSG remained within the same set. This prevented data leakage between training and testing. The distribution of NT1 status, age, and sex was maintained across the sets. The U-Sleep model was both trained and evaluated using all 12 pairwise combinations of the six EEG and two EOG electrodes in all segments. We modified the model to train on and predict sleep stages every 5 s, and fine-tuned all components (encoder, decoder, and segment classifier) with a learning rate of 10⁻⁷ using an Adam optimizer and a cross-entropy loss function. Convergence was monitored using the F1-score on the validation set, and model training was terminated after 500 consecutive mini-epochs with no apparent increase in validation F1. The code was implemented in TensorFlow version 2.8 [42] based on the original source code.
Table 2.
Number of PSGs and mini-epochs in the training, validation, and test sets used to optimize and evaluate the U-Sleep model.
| Set | PSGs (n) | Mini-epochs (n) |
|---|---|---|
| Training | 80 | 38 400 |
| Validation | 10 | 4800 |
| Test | 10 | 4800 |
| Grand total | 100 | 48 000 |
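The convergence criterion described above (stop after 500 consecutive checks without improvement in validation F1) can be sketched as a generic early-stopping loop. This is framework-agnostic pseudillustration: `step_fn` and `val_f1_fn` are hypothetical stand-ins for one training update and a validation pass, and do not correspond to named functions in the U-Sleep code.

```python
def train_with_early_stopping(step_fn, val_f1_fn, patience=500, max_steps=100_000):
    """Run training steps until the validation F1 has not improved for
    `patience` consecutive checks; return the best F1 and its step."""
    best, since_best, best_step = -1.0, 0, 0
    for step in range(max_steps):
        step_fn()                       # one training update
        f1 = val_f1_fn()                # monitor validation F1
        if f1 > best:
            best, since_best, best_step = f1, 0, step
        else:
            since_best += 1
            if since_best >= patience:  # no improvement for too long
                break
    return best, best_step
```

In a TensorFlow 2.8 setting, the same criterion could be expressed with a `tf.keras.callbacks.EarlyStopping` callback monitoring a logged validation-F1 metric.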
To assess the robustness of the fine-tuned model, we additionally performed fivefold cross-validation, where in each fold 20 PSGs were held out for testing, 70 PSGs were used for training, and 10 PSGs were used as a validation set for monitoring convergence.
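The fold construction for this cross-validation can be sketched as follows; the split is at PSG level, so both 20-min segments of a recording always land in the same set. The function name and the random permutation are illustrative assumptions.

```python
import numpy as np

def fivefold_splits(n_psgs=100, n_val=10, seed=0):
    """Yield (train, val, test) PSG-index arrays for fivefold CV: each
    fold holds out 20 PSGs for testing; of the remaining 80, 10 are
    reserved for convergence monitoring and 70 are used for training."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_psgs)
    folds = np.array_split(order, 5)          # five disjoint test folds of 20
    for k in range(5):
        test = folds[k]
        rest = np.concatenate([folds[j] for j in range(5) if j != k])
        val, train = rest[:n_val], rest[n_val:]
        yield train, val, test
```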
Statistical analysis
To assess inter-scorer agreement, confusion matrices were constructed for each pairwise scorer combination (human-scored epochs vs. human-scored mini-epochs, U-Sleep vs. human-scored epochs, and U-Sleep vs. human-scored mini-epochs). Mini-epochs labeled with artifacts (i.e. that could not be scored) by the human scorer (221/48 000 mini-epochs, 0.46 per cent) were removed from the analysis before constructing the confusion matrices. As a result, the total number of observations in each segment’s confusion matrix reflects only the artifact-free mini-epochs.
All analyses were performed at the PSG level. From each PSG, the two 20-min segments were combined to generate one single confusion matrix per PSG, to reduce random variation and mitigate the problem that the 20-min segments occasionally lacked one or more sleep stages, resulting in missing agreement estimates. This ensured that stage-wise agreement estimates were based on a more complete dataset across PSGs. From each confusion matrix, i.e. per PSG, the F1-score, i.e. harmonic mean of precision and recall, ranging from 0 (no agreement) to 1 (perfect agreement), was then calculated for each sleep stage and overall, i.e. across all stages, using the standard formula to evaluate model performance:
F1 = 2 · (precision · recall) / (precision + recall),

where precision = TP / (TP + FP) and recall = TP / (TP + FN) were computed for each sleep stage. TP, FP, and FN denote true positives, false positives, and false negatives, respectively. To account for class imbalance, each stage’s F1-score was weighted according to its prevalence in the dataset when calculating the mean F1-score across all stages.
Since one or more sleep stages were sometimes absent from a PSG’s combined two 20-min segments, missing values had to be handled carefully. For stages that were absent in a given PSG in both the truth and the prediction (TP = 0 and FP = 0), the corresponding F1-score was zero, representing a missing agreement value rather than true disagreement. To avoid artificially lowering mean values, these zeros were excluded from stage-wise averages. The number of PSGs contributing to each mean value is reported in the results.
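The per-PSG F1 computation, including the exclusion of stages absent from both truth and prediction, can be sketched as below. This is an illustrative implementation, not the authors' code; absent stages are returned as NaN so they are excluded from averages rather than counted as zero agreement, matching the handling described above.

```python
import numpy as np

def stagewise_f1(conf):
    """Per-stage F1 from a confusion matrix (rows = truth, cols =
    prediction). Stages absent from both truth and prediction
    (TP = FP = FN = 0) are returned as NaN (missing agreement)."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = 2 * tp + fp + fn            # 2TP/(2TP+FP+FN) == harmonic mean
    return np.where(denom > 0, 2 * tp / np.where(denom > 0, denom, 1), np.nan)

def weighted_f1(conf):
    """Overall F1: stage F1-scores weighted by stage prevalence
    (row totals), restricted to stages that are present."""
    conf = np.asarray(conf, dtype=float)
    f1 = stagewise_f1(conf)
    support = conf.sum(axis=1)
    mask = ~np.isnan(f1) & (support > 0)
    return float((f1[mask] * support[mask]).sum() / support[mask].sum())
```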
Bootstrapping (n = 1000) was used to estimate means and 95 per cent confidence intervals for F1-scores and sleep stage fractions, as these measures were not normally distributed. The non-parametric Wilcoxon signed-rank test was applied to compare paired F1-scores, sleep stage fractions, and stage transitions between scoring methods across PSGs. Effect sizes for the performance comparison (before/after U-Sleep model optimization) were calculated as the correlation coefficient r (0.1, 0.3, and 0.5 indicating small, medium, and large effects [43]). All comparisons were pre-planned and related to the stated aims of the study. Because the endpoints are highly correlated (stage-wise and overall F1-scores derived from the same PSGs), we did not apply correction for multiple testing.
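These statistics can be sketched with NumPy and SciPy. The helper names are illustrative, and recovering |Z| from the two-sided p-value via the normal quantile is an assumption (the paper does not specify how r was derived), so treat this as one plausible implementation rather than the authors' own.

```python
import numpy as np
from scipy import stats

def bootstrap_mean_ci(x, n_boot=1000, seed=0):
    """Bootstrap mean and 95% CI for a per-PSG measure (e.g. F1)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(n_boot)])
    return x.mean(), np.percentile(means, [2.5, 97.5])

def paired_wilcoxon_with_r(a, b):
    """Wilcoxon signed-rank test on paired per-PSG scores, with the
    effect size r = |Z| / sqrt(N) (Z recovered from the two-sided
    p-value under the normal approximation -- an assumption here)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    res = stats.wilcoxon(a, b)
    z = stats.norm.isf(res.pvalue / 2)
    return res.pvalue, z / np.sqrt(len(a))
```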
The following scorings were compared pairwise: human-scored 5-s mini-epochs; conventional human-scored 30-s epochs (splitting each 30-s epoch into six 5-s mini-epochs and assigning the epoch’s label to each mini-epoch); automated 5-s mini-epoch scorings by the original U-Sleep model [22]; and automated 5-s mini-epoch scorings by the optimized U-Sleep model.
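For these pairwise comparisons, the conventional 30-s hypnogram must be brought onto the 5-s grid. A minimal sketch of this label upsampling (the integer stage coding is an assumption for illustration):

```python
import numpy as np

def epochs_to_miniepochs(epoch_labels):
    """Expand a 30-s hypnogram to the 5-s grid by assigning each
    epoch's label to its six constituent mini-epochs."""
    return np.repeat(np.asarray(epoch_labels), 6)

# e.g. two epochs (coded here as 2 = N2, 4 = REM) become twelve mini-epoch labels
assert list(epochs_to_miniepochs([2, 4])) == [2] * 6 + [4] * 6
```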
In addition, we conducted an exploratory subgroup analysis evaluating model performance separately in NT1 patients and siblings, and further we computed percentages of human-scored mini-epochs without stage-specific characteristics for the subgroups. Given the limited number of PSGs in the test set (four NT1 and six siblings), these analyses were not powered for statistical testing and are hence reported descriptively.
For visualization purposes, confusion matrices were either row- or column-normalized, depending on the evaluation objective. When comparing two human scorings (e.g. epochs vs. mini-epochs), the confusion matrix was precision-weighted (column-normalized) to reflect scorer alignment. When evaluating automated versus human scorings, matrices were recall-weighted (row-normalized) to reflect performance across sleep stages.
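The two normalizations differ only in the axis over which counts are divided; a small sketch (illustrative helper, rows = truth, columns = prediction):

```python
import numpy as np

def normalize_confusion(conf, axis):
    """Normalize a confusion matrix row-wise (axis=1, recall-weighted:
    each row sums to 1, showing where the reference stage's mini-epochs
    went) or column-wise (axis=0, precision-weighted: each column sums
    to 1, showing where a predicted stage's mini-epochs came from)."""
    conf = np.asarray(conf, dtype=float)
    sums = conf.sum(axis=axis, keepdims=True)
    # leave all-zero rows/columns as zeros instead of dividing by zero
    return np.divide(conf, sums, out=np.zeros_like(conf), where=sums > 0)
```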
Lastly, we analyzed sleep–wake transitions, defined as the number of transitions between any sleep stage and stage wake, normalized per minute, across all 200 PSG segments. We used the entire dataset to maximize statistical power. For each PSG, indices from the two segments were combined, resulting in 100 outcomes. Comparisons were made between patients with NT1 and siblings. To account for the dependency introduced by related individuals, a linear mixed-effects model was applied with Family ID as a random effect, and the model was adjusted for age and sex. Model assumptions were evaluated by inspection of residuals, and since residuals showed mild deviations from normality, a square-root transformation was applied to all dependent variables prior to analysis. Standardized effect sizes (Cohen’s d) were derived as the estimated group (NT1 vs. sibling) coefficient divided by the model residual standard deviation on the transformed scale (0.2, 0.5, and 0.8 indicating small, medium, and large effects [43]). Four different scoring approaches were investigated as outcomes: sleep–wake transition indices from (1) human-scored epochs, (2) human-scored mini-epochs, (3) the original U-Sleep model, and (4) the optimized U-Sleep model.
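The sleep–wake transition index described above can be sketched from a mini-epoch hypnogram; the stage coding (0 = wake) is an assumption for illustration.

```python
import numpy as np

WAKE = 0  # assumed stage coding: 0 = W, 1 = N1, 2 = N2, 3 = N3, 4 = REM

def sleep_wake_transitions_per_min(labels, miniepoch_s=5):
    """Count transitions between wake and any sleep stage in a
    mini-epoch hypnogram, normalized per minute of scored data."""
    labels = np.asarray(labels)
    is_wake = labels == WAKE
    # a sleep-wake transition occurs wherever the wake flag changes
    n_trans = int(np.sum(is_wake[1:] != is_wake[:-1]))
    minutes = len(labels) * miniepoch_s / 60
    return n_trans / minutes
```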
Results
Mini-epochs versus epochs
To contextualize the 48 000 human-scored mini-epochs, we first compared them to the conventional human-scored 30-s epochs. We examined differences in sleep stage distribution and classification to assess how increased temporal resolution affects sleep staging.
Figure 2A shows the distribution of sleep stages across PSGs in mini-epochs and epochs. Though the scored segments represent only a selection of the full-night PSG recordings, the fractions were similar to typical full-night sleep stage fractions [44] and followed the typical sleep distribution, with N2 being the dominant stage and wake and N1 appearing less often. However, mini-epoch scorings contained significantly more wake, N1, and N2, and less N3 and REM sleep than the epochs (all p < .001). Furthermore, we analyzed sleep stage transitions per PSG. Analyzing mini-epochs resulted in an average of 1.70 ± 0.79 stage transitions per minute, while epoch analysis yielded significantly fewer, only 0.21 ± 0.12 per minute (p < .001), reflecting that the higher temporal resolution enables detection of frequent stage shifts not captured by epochs.
Figure 2.

The figure is based on 100 PSGs, each contributing two 20-min PSG segments. (A) Mean fraction of each sleep stage (W, N1, N2, N3, REM) in human-scored 30-s epochs and human-scored mini-epochs. The bar heights show the means across PSGs, and the whiskers represent 95% confidence intervals obtained via bootstrapping. Asterisks indicate significant differences (p < .001) obtained with the Wilcoxon signed-rank test. (B) Precision-weighted, column-wise normalized confusion matrix between human-scored epochs (rows) and human-scored mini-epochs (columns). Diagonal values represent agreement for each sleep stage. Off-diagonal values show misclassification patterns (e.g. row 3, column 2 indicates that 31 per cent of the mini-epochs scored as N1 in the human mini-epoch scorings come from N2 in the human-scored epochs).
Figure 2B visualizes the column-normalized (precision) confusion matrix between mini-epochs and epochs (the row-normalized (recall) confusion matrix is shown in Figure S2). The rows represent the sleep stages scored in the epochs, while the columns represent the corresponding scorings in the mini-epochs. Because the matrix was column-wise normalized (precision-weighted), each value represents the proportion of times a given mini-epoch classification originated from a specific epoch sleep stage. As seen, the agreement between epochs and mini-epochs, presented in the diagonal of the matrix, was highest in N3 and REM sleep, and lowest in N1 sleep.
Of the stage wake-scored mini-epochs, 20 per cent came from N2 epochs whereas 11 per cent and 12 per cent came from N1 and REM sleep epochs, respectively. Of the N1-scored mini-epochs, 31 per cent came from N2 epochs and 24 per cent came from REM sleep epochs. Regarding the N2 scored mini-epochs, 20 per cent came from N3 scored epochs, whereas 9 per cent came from REM sleep epochs. N3 mini-epochs mainly came from N3 epochs, with only 4 per cent coming from N2 epochs. Of the REM-scored mini-epochs, 3 per cent came from N1, and 5 per cent from N2 scored epochs.
U-Sleep’s performance before and after transfer learning
To assess the effect of optimizing U-Sleep via transfer learning, we compared predictions from both the original U-Sleep model and the transfer-learning-optimized U-Sleep model to the human-scored mini-epochs in the test set (20 segments from 10 PSGs not used in model training or validation, 4800 mini-epochs in total).
Figure 3 summarizes the mean performance of U-Sleep in mini-epochs before and after transfer learning. Figure 3A shows stage-wise and overall performance of U-Sleep in terms of the F1-scores. Effect sizes are reported in Table S2. Transfer learning improved U-Sleep's mean performance across all stages except REM, where mean performance decreased slightly. The overall F1-scores across PSGs improved significantly from 0.74 to 0.81, with a large effect of optimization (r = 0.66, p < .05). A significant improvement was also observed in stage wake (r = 0.69, p < .05), while changes in other stages did not reach significance. Although the increase in N1 was visibly large, the difference was not significant, most likely due to high F1-score variability across PSGs in N1.
Figure 3.

The figure is based on the test data from 10 PSGs, each contributing two 20-min PSG segments. (A) Performance of U-Sleep before and after transfer learning in terms of the F1-score per stage and overall (i.e. the weighted mean F1-score across all stages) when compared to the human-scored mini-epochs. Mean F1-scores per stage were calculated across PSGs with at least one mini-epoch in the given stage (scored by either the human scorer or the model). Included PSGs per mean: Human vs. original U-Sleep model–wake (n = 10), N1 (n = 8), N2 (n = 9), N3 (n = 8), REM (n = 6); human vs. optimized U-Sleep model–wake (n = 10), N1 (n = 10), N2 (n = 10), N3 (n = 8), REM (n = 6). (B and C) Recall-weighted, row-wise normalized confusion matrices between human-scored mini-epochs vs. (B) the original and (C) the optimized U-Sleep model. Each cell shows the fraction of mini-epochs scored in a given stage by U-Sleep (columns), conditional on the human mini-epoch scorings (rows). Diagonal values represent agreement per stage, and off-diagonal values show misclassifications (e.g. row 2, column 3 shows that U-Sleep scored N2 in 54 per cent [original model] and 31 per cent [optimized model] of the mini-epochs scored as N1 in the human-scored mini-epochs). (D) Mean sleep stage fractions scored by the human, the original U-Sleep, and the optimized U-Sleep. In (A) and (D), the bar heights show the means across PSGs, and the whiskers show bootstrap-derived 95% confidence intervals. Asterisks denote significant differences (p < .05) obtained with the Wilcoxon signed-rank test.
To evaluate robustness, we also performed fivefold cross-validation (Table S3). As expected, performance during fivefold cross-validation was slightly lower (F1 = 0.79) than when the optimized U-Sleep model was trained on the full dataset (F1 = 0.81) but remained higher than the original U-Sleep model (F1 = 0.74).
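A fivefold cross-validation of this kind can be sketched as below, assuming the folds are drawn at the PSG level so that all segments from one recording stay in the same fold (the exact fold assignment is not specified here; the function name is illustrative).

```python
import numpy as np

def fivefold_splits(n_psgs, seed=0):
    """Yield (train, test) PSG-index arrays for fivefold cross-validation.

    Splitting at the PSG level keeps both 20-min segments of a recording
    in the same fold, avoiding leakage between train and test sets.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_psgs)
    folds = np.array_split(order, 5)
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train_idx, test_idx
```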
To further evaluate the scoring discrepancies, recall-weighted confusion matrices were constructed with human-scored mini-epochs as the ground truth versus U-Sleep before transfer learning (original U-Sleep), and U-Sleep after transfer learning (optimized U-Sleep), respectively, as predictions (Figure 3B and C). Diagonal values indicate stage-wise agreement, which improved after transfer learning for wake, N1, and N2, at the cost of REM agreement. Notably, stages REM, N1, and N2 included mini-epochs both with and without stage-specific characteristics. As specified in Figure 4, the reduced REM agreement was primarily driven by mini-epochs without stage-specific characteristics, i.e. mini-epochs resembling stage REM EEG-wise (and possibly EMG-wise) but lacking rapid eye movements.
Figure 4.

The figure is based on the test set data from 10 PSGs, each contributing two 20-min PSG segments. (A) F1-scores between human-scored mini-epochs and U-Sleep before and after transfer learning, stratified by whether the mini-epochs included stage-specific characteristics or not (with/without) according to the human scorer. Note that the option to mark a mini-epoch as “without characteristics” did not apply for wake and N3. Due to limited data points, statistical comparisons were underpowered and formal significance testing was not performed. Bar heights show the mean F1-score across PSGs for N1, N2, and REM, with whiskers representing 95% confidence intervals obtained via bootstrapping. (B and C) Recall-weighted, row-wise normalized confusion matrices between human-scored mini-epochs (rows) and U-Sleep scorings (columns). The human scorings are stratified into whether the mini-epochs included stage-specific characteristics or not (with/without). Panel (B) shows the performance of the original U-Sleep model, and panel (C) shows the performance of the optimized model. Each matrix cell represents the fraction of mini-epochs classified as a given stage by U-Sleep (columns), conditional on the human mini-epoch scoring (rows). Diagonal values reflect agreement per stage, while off-diagonal values represent misclassifications.
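The row-wise normalized confusion matrices described in the captions above can be sketched as follows (a minimal numpy illustration under the stated convention, rows = human scoring, columns = model scoring; not the authors' code).

```python
import numpy as np

STAGES = ["W", "N1", "N2", "N3", "REM"]

def row_normalized_confusion(human, model, stages=STAGES):
    """Confusion matrix with rows = human scoring, columns = model scoring.

    Each row is normalized to sum to 1, so the diagonal holds the per-stage
    recall and off-diagonal cells hold the fractions misclassified into
    other stages, conditional on the human label.
    """
    idx = {s: i for i, s in enumerate(stages)}
    counts = np.zeros((len(stages), len(stages)))
    for h, m in zip(human, model):
        counts[idx[h], idx[m]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(row_sums > 0, counts / row_sums, 0.0)
```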
After transfer learning, U-Sleep’s most frequent misclassifications occurred between N1 and N2. Of the human-scored N1 mini-epochs, 31 per cent were predicted as N2 and 18 per cent as REM by the optimized U-Sleep model. For human-scored wake mini-epochs, 23 per cent were predicted as N2 and 14 per cent as N1. Human-scored REM mini-epochs were predicted as N2 in 16 per cent of the cases.
Figure 3D shows the sleep stage distributions based on mini-epoch scorings by the human scorer, the original U-Sleep model, and the optimized U-Sleep model, respectively. Before transfer learning, the original U-Sleep model significantly underestimated N1 and overestimated N3 compared to the human-scored mini-epochs; however, these differences were no longer present after model optimization.
In terms of sleep stage shifting, the number of stage transitions differed significantly between both U-Sleep models and the human-scored mini-epochs. Analyzing human-scored mini-epochs resulted in an average of 1.70 ± 0.86 transitions per minute, whereas the original U-Sleep model produced fewer transitions (1.06 ± 0.99, p < .05). The optimized U-Sleep model also produced fewer transitions compared to the human-scored mini-epochs (1.34 ± 0.80, p < .05); however, the number of transitions was significantly higher than the original U-Sleep model (p < .05), reflecting an improved ability to capture sleep fragmentation after transfer learning.
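The transition rate used in the comparison above can be computed directly from a mini-epoch label sequence, counting a transition whenever two consecutive labels differ. The following is a minimal sketch (function name illustrative, not the authors' code).

```python
import numpy as np

def transitions_per_minute(labels, epoch_len_s=5.0):
    """Stage transitions per minute in a sequence of (mini-)epoch labels.

    A transition is counted whenever two consecutive labels differ;
    epoch_len_s = 5.0 for mini-epochs, 30.0 for conventional epochs.
    """
    labels = np.asarray(labels)
    n_transitions = int(np.sum(labels[1:] != labels[:-1]))
    duration_min = len(labels) * epoch_len_s / 60.0
    return n_transitions / duration_min
```

Note how the same underlying sequence yields a higher attainable rate at 5-s resolution simply because more boundaries exist per minute, which is part of the interpretation given later in the Discussion.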
In Figure 5, the mean overall F1-scores are shown for all pairwise comparisons between human-scored epochs, human-scored mini-epochs, and U-Sleep scorings before and after transfer learning, respectively, in the test set. Before transfer learning, the original U-Sleep model agreed more with the conventional epoch scoring than with the mini-epoch scorings (F1 = 0.85 vs. 0.74), reflecting that the model was originally trained on epochs. After transfer learning, the optimized U-Sleep model’s agreement with epochs decreased (F1 = 0.81), while agreement with mini-epochs increased significantly (F1 = 0.81, p < .05).
Figure 5.

Pairwise agreement rates between conventional epochs, human-scored mini-epochs, and U-Sleep before and after transfer learning across the 10 PSGs in the test set represented as mean overall F1-scores (i.e. across all sleep stages) and bootstrapped 95% confidence interval.
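The bootstrapped 95% confidence intervals for mean F1-scores, used throughout the figures, can be sketched as a percentile bootstrap over per-PSG scores (a minimal numpy illustration; the number of resamples and the exact bootstrap variant used by the authors are not specified here).

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-PSG scores (e.g. overall F1)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample PSGs with replacement and record the mean of each resample
    means = np.array([
        rng.choice(values, size=len(values), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```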
Subgroup analyses suggested small differences in model performance between NT1 patients and siblings. For the original U-Sleep model, the overall mean F1-score was 0.74 (95% CI 0.65–0.81) in siblings and 0.75 (95% CI 0.63–0.87) in NT1 patients. After transfer learning, the optimized U-Sleep model performed slightly better in siblings (0.83, 95% CI 0.74–0.90) than in NT1 patients (0.78, 95% CI 0.70–0.87). Stage-wise subgroup results are provided in Table S4. Given the limited sample size (four NT1 and six siblings), no formal statistical testing was performed.
U-Sleep’s performance in mini-epochs with and without stage-specific characteristics
To investigate how U-Sleep performed in mini-epochs without clear stage-specific characteristics, we stratified the mini-epoch dataset based on whether the human scorer had marked the mini-epochs as “without characteristics” or not. On average, 30.8 ± 16.6 per cent of the mini-epochs were marked as “without characteristics.” Specifically, 38 per cent of N1 and 43.9 per cent of N2 were marked without characteristics, and 37.7 per cent of REM mini-epochs lacked eye movements.
Figure 4 presents an analysis of performance of the original and optimized U-Sleep models across mini-epochs with and without characteristics in the test set. As shown in Figure 4A, in N1 and REM, U-Sleep agreed more with human-scored mini-epochs with characteristics than mini-epochs without characteristics, both before and after transfer learning (e.g. after transfer learning the F1-scores were 0.45 vs. 0.37 for N1 with vs. without characteristics, respectively, and 0.84 vs. 0.56 for REM with vs. without characteristics, respectively). For N2, the opposite pattern was observed for U-Sleep both before and after transfer learning, although the difference was minimal after transfer learning (F1-scores were 0.81 with vs. 0.83 without characteristics after transfer learning). We did not perform significance testing in this analysis because the limited number of data points made statistical comparisons underpowered.
The recall-weighted confusion matrices in Figure 4B and C support these findings. The most common misclassifications in mini-epochs without characteristics occurred between N1 and N2, but these decreased notably after transfer learning. Computing the percentage of human-scored mini-epochs without characteristics per PSG, separately for NT1 patients and siblings in the test set, showed higher percentages in NT1 patients (39.4 ± 14.7 per cent) than in siblings (26.5 ± 18.6 per cent).
Comparison of sleep–wake transition indices in NT1 patients and siblings
As NT1 patients are known to have increased sleep–wake fragmentation [26, 27], we specifically investigated whether automatic scoring of mini-epochs better detects a difference in sleep–wake transitions by analyzing sleep–wake transitions per minute in epochs and mini-epochs (Figure 6, Table S5). We found that NT1 patients had significantly higher rates of sleep–wake transitions than siblings with all four scoring approaches (all p < .05). However, the three mini-epoch scoring approaches had larger effects (Cohen’s d for human: 0.55, original U-Sleep: 0.80, optimized U-Sleep: 0.75) in separating patients from siblings than the 30-s epoch approach (Cohen’s d for human: 0.46).
Figure 6.
Sleep–wake transitions per minute in PSG segments from patients with narcolepsy type 1 (NT1) and siblings (Sib) based on the four different approaches: human-scored 30-s epochs, human-scored 5-s mini-epochs, and automatically scored mini-epochs by the original and optimized U-Sleep model, respectively. Boxplots show medians and interquartile ranges, and dots represent individual values obtained by combining transition indices from the two 20-min segments per PSG. p-values are from linear mixed-effects models adjusted for age, sex, and family ID. Numbers are provided in Table S5.
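The Cohen's d effect sizes quoted above are standardized mean differences; a common pooled-standard-deviation form can be sketched as below (a minimal numpy illustration; the authors may have used a different variant alongside their mixed-effects models).

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d between two groups using the pooled (Bessel-corrected) SD."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))
```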
Discussion
In this study, we investigated whether U-Sleep, an existing automated sleep stage classifier [22] with high performance in 30-s epochs, could be optimized via transfer learning to more accurately score sleep in 5-s mini-epochs. Using a dataset of 48 000 human-scored mini-epochs following AASM-inspired scoring rules, we demonstrated that transfer learning significantly improved the model’s agreement with human scoring and achieved a high performance with an overall F1-score of 0.81, supporting the feasibility of automated mini-epoch sleep staging. We additionally found indications of generally higher performance in siblings, and that the effect of NT1 versus siblings on sleep–wake transitions was more pronounced when using mini-epoch than epoch scorings, suggesting that higher temporal resolution enhances the ability to detect group differences in sleep–wake instability.
Before assessing model performance, we compared our human-scored mini-epochs to conventional epoch scorings to understand how increased temporal resolution affects sleep stage distribution and structure. The same human (J.V.) scored both the mini-epochs and the epochs, which may have introduced a scoring bias, although the two scorings were performed at least 1 year apart. However, this approach was intentionally chosen to avoid disagreement between mini-epochs and epochs reflecting differences between scorers. We found that the number of stage transitions was significantly higher in the human-scored mini-epochs than in epochs, showing that mini-epochs provide a more detailed representation of sleep architecture, as also suggested by our previous smaller study [23]. Moreover, mini-epoch scoring resulted in significantly more wake, N1, and N2, and less N3 and REM compared to conventional 30-s epoch scoring. This difference might partly be explained by the AASM epoch scoring rule, where ≥20 per cent of an epoch with N3 activity results in the entire epoch being scored as N3. When N3 appears together with another stage, the non-dominant stage (typically N2) is ignored in epoch scoring because the ≥20 per cent rule favors N3. However, this non-dominant stage can be captured in mini-epoch scoring, likely contributing to the higher N2 and lower N3 proportions observed in mini-epochs. The reduced REM sleep proportion in mini-epochs may be due to our exclusion of low muscle tone as a REM-specific characteristic, causing some ambiguous REM-like mini-epochs (i.e. REM-like mini-epochs without rapid eye movements) to be scored as N1 or N2. This was, however, chosen because low muscle tone was also observed in mini-epochs with clear NREM or W characteristics, i.e. we concluded that low muscle tone could not be used as a REM-specific feature in mini-epochs without rapid eye movements.
The decision was also motivated by the inclusion of NT1 patients in our study, a sleep disorder in which REM sleep often lacks atonia. In such cases, applying standard AASM scoring criteria based on muscle tone may increase risk of misclassification. It remains unknown how the sleep stage distributions would look if we had included low muscle tone as a REM-specific characteristic.
A key advantage of mini-epochs is the ability to capture frequent stage transitions and brief events that are missed in epochs. For example, brief intrusions of wake or sleep may appear only for a few seconds and would be averaged out in traditional epoch scoring. Previous studies have shown that repeated multiple sleep latency tests (MSLTs) in the same patient can give different results and even lead to changes in diagnosis [28, 29]. Although this could reflect changes in disease state (progression, remission), the known problems with the 30-s epoch length (several sleep stages in one epoch; inter-rater variability, etc.) are also likely to contribute. Hence, such diagnostic instability underlines the need for more precise and physiologically sensitive sleep analysis methods. Mini-epochs could potentially contribute new disease biomarkers and play a role in improving diagnostic accuracy, particularly in disorders like narcolepsy and its borderland, where frequent stage transitions are common and may be missed in traditional scoring frameworks [26].
However, higher temporal resolution also comes with trade-offs. By requiring a sleep stage label for each 5-s period, mini-epochs increase the number of decisions, which can be difficult particularly in biologically gradual transition periods, possibly introducing more scoring noise and increasing human inter-rater variability. On the other hand, shorter intervals might reduce the well-known problem of epochs containing characteristics from multiple stages. Younes et al. [9] reported that such “equivocal epochs” are the main contributors to epoch scoring disagreement and that equivocal epochs often result in nearly random scoring. Mini-epochs help reduce this problem by using shorter time intervals, which means that each mini-epoch is less likely to include characteristics from more than one sleep stage. This was also reported by the human scorer in the present study. Whether inter-rater variability in human mini-epoch scoring will increase or decrease compared to conventional epochs remains to be evaluated in future studies.
Despite the different advantages, human scoring of mini-epochs is far too time-consuming to be feasible in clinical settings. To address this limitation, we applied an automated solution by using our human-scored mini-epoch dataset as ground truth for optimizing the U-Sleep model via transfer learning. We specifically chose the U-Sleep model due to its previously shown high performance [22, 45], however in 30-s epochs, and its flexibility to handle different temporal resolutions.
In our test set, optimizing U-Sleep via transfer learning led to a significant increase in the overall F1-score from 0.74 to 0.81 when compared to human-scored mini-epochs. Hence, the optimized U-Sleep model’s overall score falls within the range reported in previous studies of automated sleep staging (F1 = 0.70–0.87) [22, 46–49], though in epochs, and is notably higher than the F1-score observed in our previous study of the original U-Sleep model’s performance in human-scored mini-epochs (F1 = 0.54) [23]. Our present stage-wise results also showed trends of improvement, particularly in wake and N1, although only wake reached statistical significance. The lowest performance of U-Sleep was observed in N1, consistent with previous studies (in epochs) reporting that automated sleep stage classifiers generally perform worst in N1 [19, 22, 24, 46, 50, 51]. N1 is also the stage with the greatest human inter-rater variability [9, 52, 53]. The highest optimized U-Sleep model performance was achieved in N3, which has likewise previously been reported as the stage in which automated classifiers perform best [54], although these findings were based on 30-s epochs.
The relatively lower agreement in REM sleep after U-Sleep optimization via transfer learning likely reflects both methodological and biological challenges. Our mini-epoch scoring rules (Supplementary Material) excluded low muscle tone as a REM-specific characteristic, potentially making stage REM more difficult to score consistently. While this aligned the REM scoring rules with the U-Sleep input channels (EEG and EOG, but no EMG), it also represented a deviation from the AASM standard (as explained above), i.e. the scoring rules on which U-Sleep’s original training was based [22]. This possibly required U-Sleep to learn a new representation of REM sleep in our present study, which may have increased the learning costs, particularly if our human scorings of ambiguous REM mini-epochs were not fully consistent. On the other hand, our choice of excluding low muscle tone as a REM-specific characteristic may enhance the clinical relevance of mini-epoch scoring, especially in sleep disorders where REM sleep lacks atonia and AASM rules are less applicable.
Additionally, many REM-scored mini-epochs lacked rapid eye movements and often resembled light NREM sleep, especially N2 (if the mini-epochs also lacked spindles and K-complexes). The optimized U-Sleep model frequently misclassified such mini-epochs as N2 (30 per cent), which may reflect true biological ambiguity or mixed-stage phenomena rather than genuine model error.
While transfer learning improved U-Sleep’s overall performance in mini-epochs (F1 = 0.81), this performance did not exceed the performance of the optimized U-Sleep model in epochs (F1 = 0.81). This may be due to several factors. First, Perslev et al. [22] trained the original U-Sleep model on large-scale 30-s epoch data, which has had a sustained influence on its predictions in our present study even after transfer learning. For instance, the original U-Sleep model’s scorings resulted in the fewest stage transitions per segment, while the optimized U-Sleep model’s scorings resulted in more transitions, but still fewer than the human-scored mini-epochs. This suggests that though the optimized U-Sleep model did become much better at scoring mini-epochs and reached a high performance, its predictions remained smoother (i.e. with fewer transitions between stages).
Second, both U-Sleep and the human mini-epoch scorings may contain frequent transitions, but if these are not exactly aligned, the scorings will disagree. Even small temporal mismatches such as one transition occurring one mini-epoch earlier or later will count as disagreement. In contrast, epoch-based scoring inherently averages over brief sleep or wake intrusions, which may lead to better agreement—even if the underlying biologically true stage transitions are not perfectly captured.
Third, some of the disagreement between human and automated scorings of mini-epochs might reflect true uncertainty in the data. For example, the boundaries between N1 and N2 sleep are biologically gradual and sometimes ambiguous, even to expert human scorers [52]. This ambiguity can limit the achievable agreement, not because the model performs poorly, but because the underlying signal lacks a clear categorical boundary. So, even a perfect algorithm might struggle to agree with a human scorer on these transitional periods, just as two trained human scorers can disagree [52, 53]. In fact, studies reporting inter-rater agreement for human scorers using the F1-score have found values ranging between 0.70 and 0.79 [22, 46, 49].
On average, 30.8 per cent of the human-scored mini-epochs in the present study were marked as lacking defining stage-specific characteristics, making them difficult to score—a problem likely less common in 30-s epochs, where the longer time windows provide more contextual information. These ambiguous mini-epochs without characteristics contribute to uncertainty and may potentially lead to variability in both human and automated sleep stage scoring. We therefore explored how the presence or absence of stage-specific characteristics in mini-epochs influenced U-Sleep’s performance. As expected, agreement between U-Sleep and the human scorer was higher in mini-epochs with characteristics for N1 and REM, both before and after U-Sleep optimization, underlining the model’s reliance on distinct visual characteristics such as slow or rapid eye movements. Interestingly, for N2, U-Sleep performed better in mini-epochs without characteristics, both before and after transfer learning. This counterintuitive result may reflect the model's ability to detect underlying background patterns associated with N2 even when spindles or K-complexes are absent. However, many of the human-scored N1 mini-epochs without characteristics were also scored as N2 by U-Sleep (72 per cent by the original U-Sleep model and 38 per cent by the optimized U-Sleep model). This indicates a tendency of the U-Sleep model to default to N2 when clear characteristics are absent, possibly due to the high prevalence of N2 in both the original epoch-based training dataset and our mini-epoch training data. Together, these findings highlight the trade-off of mini-epoch scoring: it reduces equivocal cases with mixed characteristics but introduces more stage decisions where clear defining characteristics are missing.
An exploratory subgroup analysis suggested that the optimized U-Sleep model performance was slightly higher in siblings with normal sleep patterns than in NT1 patients. This finding was expected, as a previous study investigating automatic epoch-based scoring performance in PSGs from patients with narcolepsy and from a community cohort also found the lowest performance in the narcolepsy group [6]. Narcolepsy is notoriously difficult to score due to abnormal, complex, and fragmented sleep, often requiring adaptations of traditional scoring rules. In line with this, we observed a higher proportion of mini-epochs without stage-specific characteristics in NT1 patients compared to siblings, further challenging sleep staging in NT1. Despite this challenge, the optimized U-Sleep model still reached high performance in NT1 indicating that the optimized model has potential not only in healthy populations but also in complex sleep disorders.
When analyzing sleep–wake transitions in the full dataset of segments from 100 different PSGs, we, as expected, found higher sleep–wake transition indices in all three human and automatic mini-epoch scorings compared to human 30-s epochs. This increase is partly explained by the fact that higher temporal resolution allows for more transition opportunities, but it likely also reflects a more biologically accurate representation of the sleep–wake dynamics. The NT1 versus sibling difference was already significant for human 30-s epochs (Cohen’s d = 0.46, p = .040) and became stronger with human mini-epochs (d = 0.55, p = .014). The largest group effects were, however, observed with automated mini-epoch scoring (original U-Sleep: d = 0.80, p = .0002; optimized U-Sleep: d = 0.75, p = .0004), indicating that mini-epoch scoring increases sensitivity to detect clinically meaningful sleep–wake fragmentation. It is noteworthy that the original U-Sleep model, trained on 30-s epochs, separated the groups more effectively in this particular subtask than the optimized U-Sleep model. However, these results are based on a task analyzing only two states, sleep (including both N1–N3 and REM sleep) and wake, and do not generalize to overall performance. Additionally, the present results are based on relatively short PSG segments, and full-night analyses in a larger dataset are needed to investigate differences in clinical applicability of the two models.
We acknowledge that our study has limitations. Although the cross-validation and subsequent hold-out evaluation of the optimized model were consistent, the transfer learning and evaluation of U-Sleep were performed within one study population, which may increase the risk of overfitting and thus limit generalizability. Future work should focus on external validation in independent datasets. We did not investigate nor consider human inter-rater variability in mini-epochs in this study, as all human scorings were performed by a single scorer; however, this should be explored further in larger studies. Several statistical tests were performed in the study (up to six per comparison family), increasing the risk of false positives; however, in this context strict p-value adjustment could also inflate the risk of false negatives, especially as we have strongly correlated endpoints [55]. Therefore, we reported the unadjusted p-values of the study. Statistical comparisons were limited by relatively small numbers of mini-epochs per stage in the test set, which potentially affected the stage-wise F1-scores. In addition, the subgroup analyses were based on only four NT1 patients and six siblings (the test set) and should therefore be interpreted with caution. Larger and more diverse datasets will be needed to confirm whether the optimized model generalizes robustly across clinical groups.
Additionally, we evaluated model performance on 20-min segments rather than full-night PSGs, in which performance remains unknown. While we followed the original U-Sleep approach using 17.5-min training windows, we did not investigate whether training on shorter windows could influence performance. Furthermore, in this study, we implemented a simple transfer learning approach and did not investigate layer-freezing or more advanced transfer learning strategies, as our results from re-training the full model significantly improved performance. These approaches should be investigated in future work. Mini-epochs were scored without knowledge of their nocturnal timing, which is a deviation from standard clinical scoring procedures and may have influenced scoring decisions in unknown ways. The human scorer reviewed the full PSG prior to scoring to gain familiarity with each PSG’s sleep patterns. However, during actual segment scoring, the human scorer could not scroll outside the 20-min segments, which we acknowledge may have reduced contextual information at the segment boundaries. Excluding the first and last 30 s of each segment gave results consistent with the main analyses, suggesting that boundary effects did not influence our conclusions. However, since the U-Sleep model was likewise not trained on out-of-context data, this represents a logical step to investigate further in future work.
Mini-epochs may also challenge existing diagnostic frameworks. For example, REM-latency or sleep-onset REM criteria are defined in terms of epochs and may not translate directly to high-resolution mini-epoch scoring. Traditional diagnostic criteria may need to be reconsidered or adapted if applied to mini-epochs in the future.
Finally, high-resolution sleep staging can only be considered directly applicable to clinical practice once it demonstrates robust performance across diverse populations and establishes the clinical relevance of new derived sleep metrics (e.g. sleep dynamics such as more frequent stage transitions). Without these future steps, its utility in real-world clinical settings remains unknown.
This study demonstrated that automated sleep staging at high temporal resolution is both feasible and accurate. The mini-epoch framework enabled a more detailed representation of sleep architecture, capturing significantly more stage transitions than traditional epoch-based scoring. By optimizing the deep learning model U-Sleep on 5-s mini-epochs using human-scored sleep stage data, we achieved a significant improvement in agreement between the optimized model and the human scorer compared to the original model, both in patients with NT1 and in siblings. The optimized U-Sleep model’s performance in mini-epochs reached levels as high as those previously published for other automated classifiers, and for agreement between human scorers, in 30-s epochs. Importantly, in this dataset, differences between NT1 and siblings in sleep–wake transitions were more pronounced with mini-epoch analyses, highlighting the potentially clinically meaningful added sensitivity of high-resolution sleep staging.
Supplementary Material
Acknowledgments
We especially thank all the patients and siblings of our study. We also thank Cathrine Brunborg for assistance in statistical analyses. This study was funded by a grant from the South-Eastern Norway Regional Health Authority (RV grant: 2017070; HTJ grant: 2019032), the Norwegian Ministry of Health and Care Services (LFF, RV, JV, HTJ, JAEC, SKH), the Lundbeck Foundation (ANZ grant: R347-2020-2439), the Independent Research Fund Denmark (MP grant: 9131-00099B).
Contributor Information
Louise Frøstrup Follin, Department of Rare Disorders, Norwegian Centre of Expertise for Neurodevelopmental Disorders and Hypersomnias (NevSom), Oslo University Hospital, Oslo, Norway; Institute of Clinical Medicine, University of Oslo, Oslo, Norway.
Julie Anja Engelhard Christensen, Department of Rare Disorders, Norwegian Centre of Expertise for Neurodevelopmental Disorders and Hypersomnias (NevSom), Oslo University Hospital, Oslo, Norway.
Janita Vevelstad, Department of Rare Disorders, Norwegian Centre of Expertise for Neurodevelopmental Disorders and Hypersomnias (NevSom), Oslo University Hospital, Oslo, Norway.
Hilde T Juvodden, Department of Rare Disorders, Norwegian Centre of Expertise for Neurodevelopmental Disorders and Hypersomnias (NevSom), Oslo University Hospital, Oslo, Norway.
Rannveig Viste, Department of Rare Disorders, Norwegian Centre of Expertise for Neurodevelopmental Disorders and Hypersomnias (NevSom), Oslo University Hospital, Oslo, Norway.
Berit Hjelde Hansen, Department of Rare Disorders, Norwegian Centre of Expertise for Neurodevelopmental Disorders and Hypersomnias (NevSom), Oslo University Hospital, Oslo, Norway.
Mathias Perslev, Danish Center for Sleep Medicine, Rigshospitalet, Copenhagen, Denmark.
Tobias Kaufmann, Centre for Precision Psychiatry, Institute of Clinical Medicine, University of Oslo, Oslo, Norway; Department of Psychiatry and Psychotherapy, University of Tübingen, Tübingen, Germany; German Center for Mental Health (DZPG), Partner Site Tübingen, Tübingen, Germany.
Alexander Neergaard Zahid, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kgs. Lyngby, Denmark; AI Center of Excellence, WS Audiology, Lynge, Denmark.
Stine Knudsen-Heier, Department of Rare Disorders, Norwegian Centre of Expertise for Neurodevelopmental Disorders and Hypersomnias (NevSom), Oslo University Hospital, Oslo, Norway; Institute of Clinical Medicine, University of Oslo, Oslo, Norway.
Disclosure statement
Financial disclosure: None declared.
Non-financial disclosure: S.K.H. has served as an expert consultant for the Norwegian state.
Data availability
The data underlying this article cannot be shared publicly, since we do not have ethical approval to share the data, in order to protect the privacy of the individuals in the study.
References
- 1. Schulz H. Rethinking sleep analysis. J Clin Sleep Med. 2008;4(2):99–103. 10.5664/jcsm.27124 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Rechtschaffen A, Kales A. A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects. Los Angeles: Brain Information Service/Brain Research Institute, University of California. 1968. [Google Scholar]
- 3. Iber C, Ancoli-Israel S, Chesson AL, Quan SF. The AASM Manual for the Scoring of Sleep and Associated Events, 1st ed. Westchester, IL: American Academy of Sleep Medicine. 2007.
- 4. Grigg-Damberger MM. The AASM scoring manual: a critical appraisal. Curr Opin Pulm Med. 2009;15(6):540–549. 10.1097/MCP.0b013e328331a2bf [DOI] [PubMed] [Google Scholar]
- 5. Grigg-Damberger MM. The AASM scoring manual four years later. J Clin Sleep Med. 2012;8(3):323–332. 10.5664/jcsm.1928 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Stephansen JB, Olesen AN, Olsen M, et al. Neural network analysis of sleep stages enables efficient diagnosis of narcolepsy. Nat Commun. 2018;9(1):5229–5215. 10.1038/s41467-018-07229-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Huijben IAM, Hermans LWA, Rossi AC, Overeem S, van Gilst MM, van Sloun RJG. Interpretation and further development of the hypnodensity representation of sleep structure. Physiol Meas. 2023;44(1):015002. 10.1088/1361-6579/aca641
- 8. Bakker JP, Ross M, Cerny A, et al. Scoring sleep with artificial intelligence enables quantification of sleep stage ambiguity: hypnodensity based on multiple expert scorers and auto-scoring. Sleep. 2023;46(2):1–12. 10.1093/sleep/zsac154
- 9. Younes M, Raneri J, Hanly P. Staging sleep in polysomnograms: analysis of inter-scorer variability. J Clin Sleep Med. 2016;12(6):885–894. 10.5664/jcsm.5894
- 10. Moul DE, Germain A, Cashmere JD, Quigley M, Miewald JM, Buysse DJ. Examining initial sleep onset in primary insomnia: a case-control study using 4-second epochs. J Clin Sleep Med. 2007;3(5):479–488. 10.5664/jcsm.26912
- 11. Tanaka H, Hayashi M, Hori T. Statistical features of hypnagogic EEG measured by a new scoring system. Sleep. 1996;19(9):731–738. 10.1093/sleep/19.9.731
- 12. Younes M, Ostrowski M, Soiferman M, et al. Odds ratio product of sleep EEG as a continuous measure of sleep state. Sleep. 2015;38(4):641–654. 10.5665/sleep.4588
- 13. Koch H, Jennum P, Christensen JAE. Automatic sleep classification using adaptive segmentation reveals an increased number of rapid eye movement sleep transitions. J Sleep Res. 2019;28(2):e12780. 10.1111/jsr.12780
- 14. Younes M. The case for using digital EEG analysis in clinical sleep medicine. Sleep Sci Pract. 2017;1(1):2. 10.1186/s41606-016-0005-0
- 15. Banluesombatkul N, Mignot E, Wilaiprasitporn T. Comprehensive evaluation of odds ratio product and spectral slope as continuous sleep depth measures: advancing sleep staging and clinical applications. Sleep. 2025;48(10). 10.1093/sleep/zsaf130
- 16. Hermans LW, Huijben IA, van Gorp H, et al. Representations of temporal sleep dynamics: review and synthesis of the literature. Sleep Med Rev. 2022;63:101611. 10.1016/j.smrv.2022.101611
- 17. Imbach LL, Werth E, Kallweit U, Sarnthein J, Scammell TE, Baumann CR. Inter-hemispheric oscillations in human sleep. PLoS One. 2012;7(11):e48660. 10.1371/journal.pone.0048660
- 18. Schoch SF, Werth E, Poryazova R, Scammell TE, Baumann CR, Imbach LL. Dysregulation of sleep behavioral states in narcolepsy. Sleep. 2017;40(12). 10.1093/sleep/zsx170
- 19. Olesen AN, Jørgen Jennum P, Mignot E, Sorensen HBD. Automatic sleep stage classification with deep residual networks in a mixed-cohort setting. Sleep. 2021;44(1). 10.1093/sleep/zsaa161
- 20. Aboalayon KAI, Faezipour M, Almuhammadi WS, Moslehpour S. Sleep stage classification using EEG signal analysis: a comprehensive survey and new investigation. Entropy. 2016;18(9):1–31. 10.3390/e18090272
- 21. Cesari M, Christensen JAE, Muntean M, et al. A data-driven system to identify REM sleep behavior disorder and to predict its progression from the prodromal stage in Parkinson’s disease. Sleep Med. 2021;77:238–248. 10.1016/j.sleep.2020.04.010
- 22. Perslev M, Darkner S, Kempfner L, Nikolic M, Jennum PJ, Igel C. U-sleep: resilient high-frequency sleep staging. NPJ Digit Med. 2021;4(1):72. 10.1038/s41746-021-00440-5
- 23. Follin L, Neergaard A, Viste R, et al. An inter-rater variability study between human and automatic scorers in 5-s mini-epochs of sleep. Sleep Med. 2025;128:139–150. 10.1016/j.sleep.2025.02.005
- 24. Ganglberger W, Nasiri S, Sun H, et al. Refining sleep staging accuracy: transfer learning coupled with scorability models. Sleep. 2024;47(11). 10.1093/sleep/zsae202
- 25. Guillot A, Thorey V. RobustSleepNet: transfer learning for automated sleep staging at scale. IEEE Trans Neural Syst Rehabil Eng. 2021;29:1441–1451. 10.1109/TNSRE.2021.3098968
- 26. Roth T, Dauvilliers Y, Mignot E, et al. Disrupted nighttime sleep in narcolepsy. J Clin Sleep Med. 2013;9(9):955–965. 10.5664/jcsm.3004
- 27. Maski K, Mignot E, Plazzi G, Dauvilliers Y. Disrupted nighttime sleep and sleep instability in narcolepsy. J Clin Sleep Med. 2022;18(1):289–304. 10.5664/jcsm.9638
- 28. Trotti LM, Staab BA, Rye DB. Test-retest reliability of the multiple sleep latency test in narcolepsy without cataplexy and idiopathic hypersomnia. J Clin Sleep Med. 2013;9(8):789–795. 10.5664/jcsm.2922
- 29. Lopez R, Doukkali A, Barateau L, et al. Test–retest reliability of the multiple sleep latency test in central disorders of hypersomnolence. Sleep. 2017;40(12). 10.1093/sleep/zsx164
- 30. American Academy of Sleep Medicine. International Classification of Sleep Disorders (ICSD). 3rd ed. Darien, IL: American Academy of Sleep Medicine. 2014.
- 31. Viste R, Soosai J, Vikin T, Thorsby PM, Nilsen KB, Knudsen S. Long-term improvement after combined immunomodulation in early post-H1N1 vaccination narcolepsy. Neurol Neuroimmunol Neuroinflamm. 2017;4(5):e389. 10.1212/NXI.0000000000000389
- 32. Juvodden HT, Alnæs D, Lund MJ, et al. Widespread white matter changes in post-H1N1 patients with narcolepsy type 1 and first-degree relatives. Sleep. 2018;41(10):1–11. 10.1093/sleep/zsy145
- 33. Juvodden HT, Alnæs D, Lund MJ, et al. Hypocretin-deficient narcolepsy patients have abnormal brain activation during humor processing. Sleep. 2019;42(7). 10.1093/sleep/zsz082
- 34. Nordstrand SEH, Hansen BH, Rootwelt T, et al. Psychiatric symptoms in patients with post-H1N1 narcolepsy type 1 in Norway. Sleep. 2019;42(4). 10.1093/sleep/zsz008
- 35. Nordstrand SEH, Juvodden HT, Viste R, et al. Obesity and other medical comorbidities among NT1 patients after the Norwegian H1N1 influenza epidemic and vaccination campaign. Sleep. 2020;43(5). 10.1093/sleep/zsz277
- 36. Viste R, Lie BA, Viken MK, Rootwelt T, Knudsen-Heier S, Kornum BR. Narcolepsy type 1 patients have lower levels of effector memory CD4+ T cells compared to their siblings when controlling for H1N1-(Pandemrix™)-vaccination and HLA DQB1*06:02 status. Sleep Med. 2021;85:271–279. 10.1016/j.sleep.2021.07.024
- 37. Viste R, Viken MK, Lie BA, et al. High nocturnal sleep fragmentation is associated with low T lymphocyte P2Y11 protein levels in narcolepsy type 1. Sleep. 2021;44(8). 10.1093/sleep/zsab062
- 38. Juvodden HT, Alnæs D, Lund MJ, et al. Larger hypothalamic volume in narcolepsy type 1. Sleep. 2023;46(11). 10.1093/sleep/zsad173
- 39. Viste R, Follin LF, Kornum BR, et al. Increased muscle activity during sleep and more RBD symptoms in H1N1-(Pandemrix)-vaccinated narcolepsy type 1 patients compared with their non-narcoleptic siblings. Sleep. 2023;46(3). 10.1093/sleep/zsac316
- 40. Hansen BH, Norsted H, Gjesvik J, Thorsby PM, Naerland T, Knudsen-Heier S. Associations between psychiatric comorbid disorders and executive dysfunctions in hypocretin-1 deficient pediatric narcolepsy type 1. Sleep Med. 2023;109:149–157. 10.1016/j.sleep.2023.06.021
- 41. Juvodden HT, Alnæs D, Agartz I, et al. Cortical thickness and sub-cortical volumes in post-H1N1 narcolepsy type 1: a brain-wide MRI case-control study. Sleep Med. 2024;116:81–89. 10.1016/j.sleep.2024.02.031
- 42. Abadi M, Barham P, Chen J, et al. TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16). USENIX Association; 2016:265–283.
- 43. Ellis PD. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge: Cambridge University Press; 2010. 10.1017/CBO9780511761676
- 44. Keenan S, Hirshkowitz M. Sleep stage scoring. In: Principles and Practice of Sleep Medicine. 6th ed. Philadelphia, PA: Elsevier; 2017:1567–1575. 10.1016/B978-0-323-24288-2.00161-6
- 45. Fiorillo L, Monachino G, van der Meer J, et al. U-Sleep’s resilience to AASM guidelines. NPJ Digit Med. 2023;6(1):33–39. 10.1038/s41746-023-00784-0
- 46. Cesari M, Stefani A, Penzel T, et al. Interrater sleep stage scoring reliability between manual scoring from two European sleep centers and automatic scoring performed by the artificial intelligence-based Stanford-STAGES algorithm. J Clin Sleep Med. 2021;17(6):1237–1247. 10.5664/jcsm.9174
- 47. Fiorillo L, Pedroncelli D, Agostini V, Favaro P, Faraci FD. Multi-scored sleep databases: how to exploit the multiple-labels in automated sleep scoring. Sleep. 2023;46(5). 10.1093/sleep/zsad028
- 48. Guillot A, Sauvet F, During EH, Thorey V. Dreem open datasets: multi-scored sleep datasets to compare human and automated sleep staging. IEEE Trans Neural Syst Rehabil Eng. 2020;28(9):1955–1965. 10.1109/TNSRE.2020.3011181
- 49. Moeller AL, Perslev M, Paulsrud C, et al. Artificial intelligence or sleep experts: comparing polysomnographic sleep staging in children and adolescents. Sleep. 2025;48(7). 10.1093/sleep/zsaf053
- 50. Lee YJ, Lee JY, Cho JH, Choi JH. Interrater reliability of sleep stage scoring: a meta-analysis. J Clin Sleep Med. 2022;18(1):193–202. 10.5664/jcsm.9538
- 51. Younes M, Hanly PJ. Minimizing interrater variability in staging sleep by use of computer-derived features. J Clin Sleep Med. 2016;12(10):1347–1356. 10.5664/jcsm.6186
- 52. Rosenberg RS, Van Hout S. The American Academy of Sleep Medicine inter-scorer reliability program: sleep stage scoring. J Clin Sleep Med. 2013;9(1):81–87. 10.5664/jcsm.2350
- 53. Danker-Hopfe H, Kunz D, Gruber G, et al. Interrater reliability between scorers from eight European sleep laboratories in subjects with different sleep disorders. J Sleep Res. 2004;13(1):63–69. 10.1046/j.1365-2869.2003.00375.x
- 54. Fiorillo L, Puiatti A, Papandrea M, et al. Automated sleep scoring: a review of the latest approaches. Sleep Med Rev. 2019;48:101204. 10.1016/j.smrv.2019.07.007
- 55. Rothman KJ. Six persistent research misconceptions. J Gen Intern Med. 2014;29(7):1060–1064. 10.1007/s11606-013-2755-z