. 2026 Feb 6;49(4):zsag022. doi: 10.1093/sleep/zsag022

Mamba-based deep learning approach for sleep staging on a wireless multimodal wearable system without electroencephalography

Andrew H Zhang 1,2,3,#, Alex He-Mo 4,5,#, Richard Fei Yin 6,7,#, Chunlin Li 8, Yuzhi Tang 9, Dharmendra Gurve 10, Veronique van der Horst 11, Aron S Buchman 12, Nasim Montazeri Ghahjaverestan 13,14, Maged Goubran 15,16, Bo Wang 17,18,19, Andrew S P Lim 20,21,
PMCID: PMC13089490  PMID: 41649157

Abstract

Study Objectives

We investigate a Mamba-based deep learning approach for sleep staging on signals from ANNE One (Sibel Health, Chicago, IL), a non-intrusive dual-module wireless wearable system measuring chest electrocardiography, triaxial accelerometry, chest temperature, and finger photoplethysmography and finger temperature.

Methods

We obtained wearable sensor recordings from 357 adults undergoing concurrent polysomnography at a tertiary care sleep lab. Each polysomnography recording was manually scored, and these annotations served as ground truth labels for training and evaluation of our models. Polysomnography and wearable sensor data were automatically aligned using their electrocardiography channels with manual confirmation by visual inspection. We trained a Mamba-based recurrent neural network architecture on these recordings. Ensembling of model variants with similar architectures was performed.

Results

After ensembling, the model attains a 3-class (wake, non-rapid eye movement sleep, rapid eye movement sleep) balanced accuracy of 84.02 per cent, F1 score of 84.23 per cent, Cohen’s κ of 72.89 per cent, and a Matthews correlation coefficient (MCC) score of 73.00 per cent; a 4-class (wake, light NREM [N1/N2], deep NREM [N3], REM) balanced accuracy of 75.30 per cent, F1 score of 74.10 per cent, Cohen’s κ of 61.51 per cent, and MCC score of 61.95 per cent; and a 5-class (wake, N1, N2, N3, REM) balanced accuracy of 65.11 per cent, F1 score of 66.15 per cent, Cohen’s κ of 53.23 per cent, and MCC score of 54.38 per cent.

Conclusions

Our Mamba-based deep learning model can successfully infer major sleep stages from the ANNE One, a wearable system without electroencephalography, and can be applied to data from adults attending a tertiary care sleep clinic.

Keywords: deep learning, sleep staging, wearable devices


Statement of Significance Sleep is traditionally measured by in-lab polysomnography followed by visual identification of sleep stages by specialized technologists. However, the need for specialized personnel and equipment makes polysomnography difficult to scale and poses a barrier to many older adults. To address this, we developed a Mamba-based AI model for inferring sleep stage from the ANNE One sensor system, a pair of minimally intrusive wearable sensors placed on the chest and finger, and evaluated it in 357 patients attending a tertiary care sleep clinic. We show good performance at distinguishing sleep stages across a range of ages and co-morbid sleep pathologies, suggesting that this may represent a scalable, accurate, minimally intrusive approach to the ambulatory assessment of sleep in clinical populations.

Graphical Abstract


Introduction

Sleep staging and polysomnography

Sleep is essential to physical and mental health. Quantitative evaluation of sleep architecture and physiology is central to the diagnosis of many sleep disorders and plays an important role in evaluating individuals with and at risk for neurodegenerative diseases [1]. Sleep is traditionally and most reliably measured by polysomnography (PSG), in which brain activity, eye movements, heart rate (HR), muscle tone, and breathing patterns are recorded simultaneously by sensors attached to a sleeping patient in a laboratory environment. This is followed by sleep staging, the categorization of time periods into sleep stages based on a trained sleep technician’s visual inspection of the collected physiological data. The exact sleep staging methodology depends on the specific framework; under the American Academy of Sleep Medicine’s (AASM) [2] conventions, recordings are divided into 30-second windows, or “epochs.” Electroencephalography (EEG), electromyography (EMG), and electro-oculography (EOG) are used to classify epochs into three stages: awake, rapid eye movement (REM) sleep, and non-rapid eye movement (NREM) sleep, which can be further subcategorized as N1, N2, and N3 sleep, with N1 as the lightest and N3 as the deepest NREM sleep. EEG is the principal signal used for sleep staging [2–4].

Wearable devices and automated sleep staging

The need for specialized personnel, equipment, and laboratory space makes PSGs expensive to conduct and difficult to scale. The cumbersome sensor setup and unfamiliar resting environment may also discomfort patients, resulting in sleep that is unrepresentative of that in the home environment. Particularly sick or frail older adults may be physically unable to attend in-laboratory PSG, in addition to those living in remote areas without PSG services. Furthermore, monitoring patients throughout the night and manually labeling sleep epochs is a time-inefficient process that limits the volume of sleep analyses.

In response to these challenges, wearable devices have been developed that record a subset of conventional PSG signals. Devices that record EEG have the advantage of directly measuring brain activity, but the disadvantage of the discomfort associated with sensors on the head, similar to PSGs. On the other hand, EEG-less devices are less obtrusive and more easily set up for in-home use without a technician, but infer rather than directly measure sleep stages.

There has been considerable interest in pairing wearable devices with automated sleep stage inference in order to minimize the need for specialized technologists for labeling and hence improve scalability. Indeed, even for the analysis of conventional in-laboratory PSG, automated sleep staging has been explored as a means of addressing the time-intensiveness of sleep epoch labeling. The focus on pattern recognition and loosely rule-based nature of sleep staging is ideal for machine learning approaches, which have employed model architectures like convolutional neural networks (CNNs) [5], recurrent neural networks (RNNs) [6], hybrids of the previous two (e.g. convolutional recurrent neural networks (CRNN)) [7], transformers [8], and graph neural networks [9]. Deep learning models trained on PSG data, most notably the EEG signal or some combination of EEG, EMG, and EOG, have surpassed inter-scorer agreement levels, with 5-class accuracy and Cohen’s κ as high as 92 and 86 per cent, respectively, and can make predictions orders of magnitude faster than a human evaluator at the cost of sacrificed interpretability [10, 11].

Recurrent neural networks and mamba

RNNs are a class of neural networks that take in a sequence of vectors (e.g. a time series of pulse waves or breathing) and produce a vector output per input, allowing them to execute a wide breadth of time-series tasks. Prevalent modern RNN architectures include long short-term memory (LSTM) [12] and gated recurrent units (GRU) [13], which are known to forget very long-range dependencies and whose architecture cannot exploit GPU acceleration, limiting the training speed. Transformers [14] address these problems but scale poorly in speed for longer sequences, both for training and inference.

Mamba [15], a new generation of RNN, addresses these issues as a type of state space model (SSM) [16] that combines the excellent long-range dependency modeling of SSMs with a time-dependent selection mechanism for focusing on particular features, as well as a hardware-friendly, computationally efficient algorithm (see S.1 for details). Empirically, Mamba has achieved state-of-the-art performance rivaling that of transformers on supervised and unsupervised metrics on datasets of text, audio waveforms, and deoxyribonucleic acid (DNA) sequences [15], and Mamba-based models have performed sleep staging using EEG and other PSG signals with high accuracy [17, 18].

Contribution

This paper applies a Mamba-based deep learning approach to the problem of sleep staging using data from the ANNE One [19] sensor. ANNE One, as seen in Figure 1A, is an FDA-cleared flexible, minimally-intrusive, and clinical-grade wireless dual-module wearable system by Sibel Health composed of a finger module, which measures photoplethysmography (PPG) and limb temperature, and a separate module attached to the chest via an adhesive, which measures electrocardiography (ECG), triaxial accelerometry, and chest temperature. The raw signals of a typical recording are plotted in Figure 1B alongside a hypnogram, a technician-labeled plot of sleep stage over time.

Figure 1.


A. The finger and chest modules of ANNE One (image by Sibel Health [19]). B. Raw physiological signals recorded by ANNE One during a full-night sleep recording and its accompanying hypnogram. C. Demographics and sleep characteristics of the subjects used in this study.

A prior study centered on ANNE One [20] primarily emphasized the development of person-specific models that were trained and evaluated on data from the same patient; in contrast, its population model, trained and evaluated on data from different patients, performed much more poorly. We aim to develop a superior population model that achieves accurate sleep staging from ANNE One signals recorded from previously unseen patients.

There is a considerable body of work applying machine learning to the problem of sleep stage inference from non-EEG signals. For the most part, these have been trained and tested using the cardiopulmonary sensors of conventional PSG [21–27], which are typically artifact-free. Relatively few studies have applied these approaches to data from wearable devices, which are, in general, more prone to poor signal quality and signal loss. Consequently, such studies often need to reject epochs of poorer-quality signal prior to model training and prediction. Moreover, these works have focused on analysis of data from single sensors, typically ECG or PPG [28, 29]. Finally, models developed for wearable sensors have largely trained on data from relatively young and healthy volunteers, rather than patients presenting to sleep clinics, raising a question of generalizability to clinical populations.

The current study builds on and extends this body of work in several ways. First, we leverage the ANNE One device’s multi-sensor nature, which may, in principle, allow for predictions even in settings where one or more sensors are reading low-quality signals. Second, we train and test on data from relatively older patients with a range of sleep disorders (41% were over the age of 65 and 46% had sleep apnea; Table 1) rather than younger and healthier patients, ensuring applicability to older adults with sleep complaints. This is particularly important as age, sleep, and neurological conditions can all plausibly alter the relationship between sleep stage and autonomic, cardiac, and pulmonary function, meaning that models trained on younger adults may generalize poorly to older adults. Finally, this work extends the Mamba architecture to EEG-less sleep staging by exploring whether Mamba’s relative performance gains [17, 18] are also observed in this context.

Table 1.

Sleep characteristics of study participants

Statistic μ σ Range/count
Recordings 357
Male/female 181/176 (51%/49%)
Recording duration [h] 8.16 1.39 1.75–12.94
Awake [%] 40.96 16.58 10.47–95.24
N1 sleep [%] 11.13 6.24 1.05–38.42
N2 sleep [%] 28.37 10.61 0.58–63.59
N3 sleep [%] 10.85 8.12 0.00–54.83
REM sleep [%] 8.70 6.08 0.00–29.14
Age [y] 57.49 17.56 18.17–92.07
AHI 11.54 17.21 0.00–127.00
TST [h] 5.10 1.46 0.66–9.16
BMI [kg m−2] 27.88 5.59 16.90–50.40
PLMI 15.01 26.99 0.00–169.50
Sleep efficiency [%] 71.83 17.91 11.60–98.50

Demographics and sleep characteristics of the subjects used in this study.

Methods

Dataset

ANNE wearable sensors were used to collect full-night recordings from 360 adults undergoing concurrent clinical PSG at a tertiary care clinical sleep laboratory in Sunnybrook Health Sciences Centre, Toronto, Canada. PSGs were recorded according to AASM practice parameters for in-laboratory PSGs with a Grael PSG system (Compumedics, Victoria, Australia). The study was conducted in accordance with the latest version of the Declaration of Helsinki. All subjects provided written informed consent for the full study protocol.

Participants were first instrumented with the ANNE One, whose recordings were started ahead of the setup and initiation of PSG recordings. In order to mimic what would occur in a home environment, the ANNE signals were not monitored in real time, and no attempt was made to correct sensor-related issues (e.g. poor skin contact) overnight. Upon completion, each PSG recording was manually labeled by one of four technologists according to AASM standards [2]. In addition, the portion of the ANNE recordings preceding the onset of PSG recording was labeled as wake based on direct observation of participants by study staff. PSG and wearable sensor data were automatically aligned by their shared ECG signals using cross-correlation analysis and verified by visual inspection.
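The cross-correlation alignment step can be illustrated with a minimal sketch (the function name and setup here are hypothetical; the paper's actual implementation is not shown): the lag that maximizes the cross-correlation between the shared ECG channels gives the sample offset between the PSG and wearable recordings.

```python
import numpy as np

def find_lag(sig_a, sig_b):
    """Estimate how many samples sig_a lags behind sig_b by locating
    the peak of their full cross-correlation."""
    a = sig_a - np.mean(sig_a)   # remove DC offset so it does not bias the peak
    b = sig_b - np.mean(sig_b)
    xcorr = np.correlate(a, b, mode="full")
    # Convert the argmax index to a signed lag (positive: sig_a is delayed)
    return int(np.argmax(xcorr)) - (len(b) - 1)

# Toy check: a copy of the signal delayed by 5 samples
rng = np.random.default_rng(0)
sig = rng.standard_normal(1000)
delayed = np.concatenate([np.zeros(5), sig])[:1000]
```

In practice the estimated lag would be applied to one recording before manual verification by visual inspection, as described above.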

Three participants who did not sleep at all were excluded from analysis, as were time periods prior to the sensor placement or after the sensors were taken off, leaving a total of 357 patients with 2914 hours of recording. The characteristics of the study’s participants are outlined in Table 1. Generally, older participants (age ≥ 65) and those with sleep apnea had poorer sleep quality, with lower total sleep time (TST), poorer sleep efficiency, and increased N1 sleep time relative to other sleep stages (Table S1, S.2).

The machine learning task is defined as multiclass 1D-signal segmentation, wherein the ANNE One signals constitute the features, the PSG-annotated sleep stages form the labels, and class predictions are made per point in time to form a hypnogram, as demonstrated in Figure 1B.

Preprocessing

The ECG data were recorded at 512 Hz and processed by a multilevel 1D Coiflet transform with 5 moments. Peaks in the resulting signal were identified and used to define windows; within each window, the original ECG signal was searched for local maxima and minima to determine the precise timing of the R-peak. The wavelet-transformed ECG data were subsequently downsampled to 100 Hz for additional analyses. We then computed an ECG signal quality index (SQI) as follows. Trains of 5 or more physiological QRS complexes were identified on the basis of relatively stable HR (differing by no more than 10 beats per minute between consecutive beats) and amplitude (standard deviation of amplitude <25). All identified ECG peaks in the recording were then clustered using HDBSCAN [30] based on amplitude and morphology; peaks that clustered with the template QRS complexes were considered to have passed quality control (QC). Segments of ECG of 2 seconds on either side of these peaks were assigned an SQI of 1, while all other segments of ECG were given an SQI of 0.
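The stable-heart-rate criterion for identifying physiological QRS trains can be sketched as follows. This simplified illustration covers only the rate criterion (the amplitude criterion and HDBSCAN clustering are omitted), and the function name and index conventions are illustrative.

```python
import numpy as np

def stable_trains(r_peaks, fs=512.0, min_len=5, max_hr_jump=10.0):
    """Return (first_beat, last_beat) index pairs for runs of >= min_len
    consecutive beats whose beat-to-beat heart rate changes by
    <= max_hr_jump bpm. r_peaks are R-peak sample indices at rate fs."""
    rr = np.diff(r_peaks) / fs            # RR intervals in seconds
    hr = 60.0 / rr                         # instantaneous HR in bpm, one per interval
    stable = np.abs(np.diff(hr)) <= max_hr_jump
    trains, start = [], None
    for i, ok in enumerate(stable):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            # stable flags start..i-1 cover beats start..i+1 -> i+2-start beats
            if i + 2 - start >= min_len:
                trains.append((start, i + 1))
            start = None
    if start is not None and len(stable) + 2 - start >= min_len:
        trains.append((start, len(stable) + 1))
    return trains
```

For example, six beats exactly one second apart followed by a 0.2-second RR interval yield one stable train covering beats 0 through 5.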

The PPG data were recorded at 128 Hz, then band-pass filtered between 0.5 and 5 Hz and downsampled to 100 Hz. Trains of 5 or more physiological pulse waves were identified on the basis of relatively stable HR (differing by no more than 10 beats per minute between consecutive beats) and amplitude (differing by no more than twofold between beats). These high-quality pulse waves were normalized to a standard amplitude and duration, then clustered using HDBSCAN. For each cluster, the mean of its normalized pulse waves was taken as a template wave for that cluster. Beat-to-beat template matching was then performed to determine the extent to which each pulse wave in the rest of the recording matched one of the template waves, with the lowest sum-squared difference from a template constituting a measure of signal quality for each PPG pulse wave. Pulse waves with minimum sum squared <0.4 were considered to have passed QC. The PPG SQI was set to 1 for segments within 2 seconds of a pulse wave passing QC, and 0 otherwise.
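The template-matching step above can be sketched in a few lines. The resampled length (50), min-max amplitude normalization, and use of the mean squared distance are assumptions made for illustration; only the 0.4 QC threshold comes from the text.

```python
import numpy as np

def normalize_pulse(wave, length=50):
    """Resample a pulse wave to a fixed length and scale it to unit amplitude
    (assumed normalization; the paper does not specify the exact scheme)."""
    x_old = np.linspace(0.0, 1.0, len(wave))
    x_new = np.linspace(0.0, 1.0, length)
    resampled = np.interp(x_new, x_old, wave)
    resampled -= resampled.min()
    peak = resampled.max()
    return resampled / peak if peak > 0 else resampled

def passes_qc(wave, templates, threshold=0.4):
    """A pulse wave passes QC if its smallest mean squared distance
    to any template wave falls below the threshold."""
    w = normalize_pulse(wave)
    dists = [np.mean((w - t) ** 2) for t in templates]
    return min(dists) < threshold
```

A half-sine "pulse" matches a half-sine template closely regardless of its original sampling, while an inverted wave does not.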

The chest XYZ accelerometry data were recorded at 210 Hz, downsampled to 100 Hz as for the ECG, and band-pass filtered over 0.005–0.2 Hz with a second-order Butterworth filter.

Log-magnitude spectrograms were computed for each signal by means of a Short-Time Fourier Transform (STFT) with a Hamming window of size equal to the signal’s sampling rate (i.e. 100 samples) and with no overlap, creating one spectral frame per second (k = 1 Hz, a value determined experimentally). Only the first 12 frequency bins of the XYZ data were kept, as the higher-frequency bins were made redundant by the band-pass filter. Finally, the frequency bins were concatenated with limb/chest temperature, which was recorded at 1 Hz, and the sleep stage labels, which were available at 1/30 Hz (one label per 30-second window) and upsampled to 1 Hz by repeating each label 30 times.
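The spectrogram step can be sketched with plain NumPy: non-overlapping Hamming-windowed frames of one second each, followed by a real FFT and a log-magnitude. The epsilon and natural-log convention are assumptions for numerical illustration.

```python
import numpy as np

def log_spectrogram(signal, fs=100, n_bins=None):
    """Non-overlapping Hamming-windowed STFT producing one log-magnitude
    spectral frame per second; optionally keep only the first n_bins bins
    (as done for the band-pass-filtered XYZ channels)."""
    n_frames = len(signal) // fs
    frames = signal[: n_frames * fs].reshape(n_frames, fs)
    windowed = frames * np.hamming(fs)
    mag = np.abs(np.fft.rfft(windowed, axis=1))   # fs//2 + 1 frequency bins
    spec = np.log(mag + 1e-8)                      # eps avoids log(0); assumed choice
    return spec[:, :n_bins] if n_bins else spec
```

A 10-second signal at 100 Hz yields a (10, 51) spectrogram, or (10, 12) when truncated to the first 12 bins.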

Model

A Mamba-based RNN model (summarized in Figure 2B) was designed to process batches of 2D tensors with axes for the number of 1-second windows and the number of features, which include chest temperature, limb temperature, and frequency bins for ECG, PPG, and XYZ. Logits for each sleep stage are returned at each time step. A model with c output classes has 70 K parameters, comprising:

Figure 2.


A. Overview of deep learning pipeline of this study. B. Implementation of the RNN architecture in this study. Batch normalizations precede each of the five linear layers in the MLPs. LeakyReLU is the activation function (see S.3 for details). C. The inference-time ensembling pipeline.

  • 1) An element-wise multi-layer perceptron (MLP) with h = 100 hidden dimensions that diminishes the feature-size axis to d = 10 dimensions, where h and d are experimentally determined.

  • 2) A bidirectional three-layer Mamba block. This consists of two Mamba blocks [31] that process time windows sequentially in opposite directions, whose outputs are concatenated along the feature axis, widening the receptive field to detect patterns in signals more effectively.

  • 3) An element-wise MLP that downsizes the feature size dimension into c logits.
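The bidirectional wiring of step 2 can be sketched as follows. Identity placeholders stand in for the actual Mamba blocks (which are not reproduced here); the sketch shows only how the reversed pass is flipped back and concatenated so that both directions align per time step.

```python
import numpy as np

def bidirectional(seq, forward_block, backward_block):
    """Run one block left-to-right and another over the reversed sequence,
    then concatenate outputs along the feature axis.
    seq: (T, d) array; each block maps (T, d) -> (T, d)."""
    fwd = forward_block(seq)
    # Process in reversed time order, then flip back so outputs align per step
    bwd = backward_block(seq[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=-1)   # (T, 2d)
```

With identity blocks, a (5, 3) input produces a (5, 6) output whose two halves both equal the input, confirming the alignment.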

In accordance with AASM practice, we form 30-second window predictions by grouping 1-second logits into contiguous buckets of 30, taking the mean logit for each class, and outputting the sleep stage with the largest corresponding logit.
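The epoch-level aggregation just described can be written as a small NumPy routine (function name illustrative): per-second logits are grouped into buckets of 30, averaged per class, and the argmax taken.

```python
import numpy as np

def epoch_predictions(logits_1s, epoch_len=30):
    """Collapse per-second class logits (T, c) into one prediction per
    30-second epoch by averaging logits within each bucket, then
    taking the class with the largest mean logit."""
    n_epochs = len(logits_1s) // epoch_len
    buckets = logits_1s[: n_epochs * epoch_len].reshape(n_epochs, epoch_len, -1)
    return buckets.mean(axis=1).argmax(axis=1)
```

For example, 60 seconds of logits favoring class 2 then class 0 yield the two epoch predictions [2, 0].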

Training

The training, validation, and testing sets were randomly split in a 75/10/15 per cent ratio to sizes 267/36/54. During training, recordings were clipped to random segments of length 5000 to allow for mini-batch training (n = 4) and to prevent the model from overfitting to the high-level structure of most recordings, i.e. wake periods at the start and end with sleep in the middle. The objective function was cross-entropy loss, weighted by inverse class frequency to adjust for the moderate class imbalance shown in Table 1, then averaged across batches and time windows. The model was optimized with Adam [32], a widely used adaptive learning rate optimizer, with learning rate 0.001. The parameters from the training iteration that maximized balanced validation accuracy were chosen for the final model.
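The inverse-class-frequency weighting can be illustrated with a minimal NumPy sketch. The normalization of weights to mean 1 and the simple averaging convention are assumptions (frameworks such as PyTorch normalize weighted cross-entropy differently); the sketch shows only the principle of up-weighting rare stages.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency class weights, normalized to mean 1 (assumed)."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    w = 1.0 / np.maximum(counts, 1)   # guard against empty classes
    return w / w.mean()

def weighted_cross_entropy(logits, labels, weights):
    """Mean per-window cross-entropy, each term scaled by its class weight."""
    # Numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * nll))
```

With labels [0, 0, 0, 1], class 1 receives three times the weight of class 0, so errors on the rarer class are penalized more heavily.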

Three models of varying sleep stage resolution were built: a 3-class, 4-class, and 5-class sleep staging model. The 5-class model has classes for wake, N1, N2, N3, and REM sleep, while the 3-class model groups N1, N2, and N3 sleep into non-REM sleep. The 4-class model designates N1 and N2 sleep as “light sleep” as opposed to N3 or “deep sleep,” a common practice in automatic sleep staging [4]. Separate models were trained for each class count so as to treat each class equally.
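The three stage-grouping schemes amount to simple label mappings. The integer label convention below (0 = wake through 4 = REM) is an assumption for illustration; only the groupings themselves come from the text.

```python
# Assumed 5-class label convention: 0=wake, 1=N1, 2=N2, 3=N3, 4=REM
TO_4_CLASS = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}   # wake, light (N1/N2), deep (N3), REM
TO_3_CLASS = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2}   # wake, NREM (N1/N2/N3), REM

def regroup(labels, mapping):
    """Map fine-grained stage labels onto a coarser staging scheme."""
    return [mapping[x] for x in labels]
```

Note, however, that the paper trains separate models per class count rather than regrouping one model's outputs; the mapping applies to the ground-truth labels used for each model.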

Ensembling

For each of the n classes inside an n-class ensemble model, a new binary model of the same architecture was trained from scratch to predict the presence or absence of that class. The logits of these binary models were then stacked with logits from the regular 3/4/5-class models. This collection of logits formed the inputs to a suite of decision tree-based architectures (see S.3 for details), including AdaBoost [33, 34], random forests [34, 35], extra-tree classifiers [34, 36], histogram gradient boosting [34, 37], and XGBoost [38]. These architectures process time windows in a time-independent manner, progressively dividing them into smaller and more homogeneous groups by thresholding on feature values (i.e. logit values for sleep stages); the thresholds and choice of features are learned parameters. Sleep stage predictions in a single decision tree are made from the plurality sleep label of training-set time windows inside a terminal node, while the proportions of each sleep label become the class probabilities. The model among the suite that maximized balanced validation accuracy was selected as the final ensembler model. A visual representation of the ensembling pipeline is presented in Figure 2C.
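The logit-stacking step can be sketched as a simple per-window feature concatenation. The exact layout (two logits per binary model, ordering of blocks) is an assumption; the sketch only shows how per-window logits become a flat feature matrix for the tree-based ensembler.

```python
import numpy as np

def stack_ensemble_features(multiclass_logits, binary_logits_list):
    """Concatenate per-window logits from the n-class model with logits
    from the n one-vs-rest binary models into one feature matrix that a
    tree-based classifier can threshold window by window.
    multiclass_logits: (T, n); each binary entry: (T, k)."""
    return np.concatenate([multiclass_logits, *binary_logits_list], axis=1)
```

For a 5-class model with five binary models each emitting two logits, every 30-second window contributes a 15-dimensional feature vector to the ensembler.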

Statistical analyses

Results were compiled for 3-class, 4-class, and 5-class sleep staging for the regular and ensemble RNN models. Overall performance was evaluated by aggregating all test set recordings and computing the balanced accuracy, weighted precision, weighted recall, weighted F1, Cohen’s kappa (κ), and MCC.
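Two of the headline metrics can be computed directly from a confusion matrix, as sketched below with their standard definitions (weighted precision/recall/F1 and MCC are omitted for brevity).

```python
import numpy as np

def confusion(y_true, y_pred, n):
    """n x n confusion matrix with true labels as rows."""
    cm = np.zeros((n, n), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def balanced_accuracy(cm):
    """Mean of per-class recalls (diagonal over row sums)."""
    return float(np.mean(np.diag(cm) / cm.sum(axis=1)))

def cohens_kappa(cm):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    total = cm.sum()
    p_o = np.trace(cm) / total                      # observed agreement
    p_e = (cm.sum(axis=1) @ cm.sum(axis=0)) / total**2  # chance agreement
    return float((p_o - p_e) / (1 - p_e))
```

Unlike raw accuracy, balanced accuracy weights each stage equally, which matters given the class imbalance in Table 1.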

For the best model, performance was plotted on a healthy subset of the test set. Ablation studies were carried out using various subsets of the ANNE One sensor data. We also retrained the model without downsampling the data, and considered alternative architectures, including the use of LSTM instead of Mamba, and a CRNN model utilizing the same preprocessing steps as our RNN model. For the best 5-class model, Bland–Altman plots were created to evaluate class-wise biases, followed by analyses of per-recording metrics individually as well as across differing signal quality scenarios, clinical variables including age, sex, Periodic Limb Movement Index (PLMI), Apnea-Hypopnea Index (AHI), Body Mass Index (BMI), TST, and sleep efficiency, and sleep disorders including insomnia, restless legs syndrome, and REM sleep behavior disorder. Because individual recordings did not necessarily contain instances of all five sleep stages, balanced accuracy may not have been a sensible metric for recording-wise comparisons; hence, accuracy was presented in lieu of balanced accuracy for those cases.

Results

Model evaluation

Typical-case recording predictions for all 5-class models are shown in Figure 3 as hypnograms along with predicted point-wise output class probabilities, known as hypnodensity graphs [39]. A Uniform Manifold Approximation and Projection (UMAP) [40] of an intermediate model output is also provided in S.4 (Figure S4), showing the separation of sleep stages as clusters in the learned high-dimensional representation space.

Figure 3.


Predicted sleep stages (hypnograms) and class probabilities (hypnodensities) for a typical single full-night recording in the test set for 5-class regular and ensemble models. The hypnogram prediction at each epoch is the class with the highest probability. Probabilities are colored by ground truth sleep stage: Darker to lighter colors represent classes as ordered in the hypnograms from bottom to top. The methods of calculating both hypnodensities can be found in S.3. An alternative method of calculating hypnodensities is shown in S.4 (Figure S3).

Figure 4 displays the confusion matrices for all n-class ensemble models and Table 2 displays the overall metrics across all n-class regular and ensemble models. We plotted performance in both 1-second and 30-second epochs and found similar results irrespective of the epoch length (balanced accuracies of 1-second RNN ensembles are 83.74%, 75.06%, 64.86% for 3/4/5-class models). Accordingly, all results in this paper are for 30-second windows. Compared to a base RNN model, the ensembling procedure resulted in improved overall performance (5-class balanced accuracies of 86.19%, 69.62%, 67.72%, 81.16%, and 86.60% for wake, N1, N2, N3, and REM sleep respectively, S.5 Table S2). Most misclassifications were made between N1 and every other class, as well as between N2 and N3 for 5-class sleep staging.

Figure 4.


Test set confusion matrices of ensembled RNN model for 3, 4, and 5 class sleep staging (top).

Table 2.

Model evaluation metrics

Classes Model Metric (%)
Balanced accuracy Precision Recall F1 Cohen’s κ MCC
5 RNN 60.20 69.48 61.20 63.68 49.16 50.18
RNN Ensemble 65.11 71.61 64.41 66.15 53.23 54.38
4 RNN 67.59 73.08 66.59 68.34 51.95 53.06
RNN Ensemble 75.30 75.80 73.68 74.10 61.51 61.95
3 RNN 81.09 83.20 81.98 82.35 69.59 69.77
RNN Ensemble 84.02 84.76 84.04 84.23 72.89 73.00

Macro-evaluation metrics for all n-class models on the test set, where the best-performing model for each metric is bolded.

In a set of young and healthy sleepers (age < 40, AHI < 5, PLMI < 5), the RNN ensemble model performed even better, as shown in Figure 5 and Table 3, although removal of these participants from the test set did not substantially worsen model performance (S.7, Figure S7), indicating that overall model performance was not driven by this small number of healthy participants.

Figure 5.


Confusion matrices of 3-class, 4-class, and 5-class RNN ensemble models on a healthy subset of the test set (n = 11) defined as age <40, AHI <5, and PLMI <5.

Table 3.

Model evaluation metrics for healthy subset

Classes Metric (%)
Balanced accuracy Precision Recall F1 Cohen’s κ MCC
5 69.41 75.30 68.73 69.58 59.83 61.17
4 79.89 78.40 75.83 75.57 66.42 67.58
3 87.67 88.18 87.69 87.78 79.09 79.29

Macro-evaluation metrics of 3-class, 4-class, and 5-class RNN ensemble models on a healthy subset of the test set (n = 11) defined as age < 40 and AHI < 5 and PLMI <5.

Feature ablation study

We next investigated which sets of sensors (chest module vs. limb module; ECG vs. PPG vs. other) were most influential to model performance. Multiple instances of the RNN ensemble model were trained to determine the utility of its components. Firstly, features solely from the ANNE One’s chest module (ECG, accelerometry, chest temperature) or limb module (PPG, limb temperature) were examined. As shown in Table 4, the chest-only model trails closely behind the original, while the limb-only model exhibits a marked degradation (10%) across all metrics. This strongly suggests that the chest module is the principal driver of model accuracy, although the limb module alone can recover sleep staging information to a rough degree. Next, because the ECG and PPG sensors themselves are the most likely to experience poor signal quality, we examined the performance of a model trained on only the accelerometry and temperature sensors. Remarkably, exclusion of ECG and PPG data resulted in minimal loss of accuracy, suggesting that the information in the ECG and PPG is also being captured effectively by the accelerometry signals, with temperature providing additional complementary information. Further exclusion of temperature data resulted in an accelerometry-only model that performed similarly to or even slightly better than the No ECG/PPG model (Table S3), indicating that chest accelerometry alone captures most of the discriminative information used by the network. This is corroborated by a per-sensor feature importance analysis with Integrated Gradients [41] (Figure S17), which suggests that accelerometry-derived features contribute the most to model performance relative to temperature, PPG, and ECG-derived features.

Table 4.

Model evaluation metrics for feature ablation study

Classes Model Metric (%)
Balanced accuracy Precision Recall F1 Cohen’s κ MCC
5 Full 65.11 71.61 64.41 66.15 53.23 54.38
Chest module 63.94 70.66 62.63 64.22 51.23 52.59
Limb module 54.83 62.29 55.08 56.56 41.60 42.59
No ECG/PPG 63.85 70.68 63.35 65.03 51.97 53.12
4 Full 75.30 75.80 73.68 74.10 61.51 61.95
Chest module 73.46 74.63 71.64 72.26 58.88 59.47
Limb module 64.89 68.63 63.59 64.66 48.22 49.13
No ECG/PPG 73.03 74.46 71.52 72.10 58.70 59.30
3 Full 84.02 84.76 84.04 84.23 72.89 73.00
Chest module 81.77 83.58 82.50 82.83 70.39 70.53
Limb module 73.28 78.10 74.71 75.84 58.21 58.69
No ECG/PPG 82.78 84.26 83.18 83.51 71.71 71.56

Evaluation metrics for the 3-class, 4-class, and 5-class RNN ensemble models using different subsets of the entire feature set. The best-performing model for each metric is bolded.

To better characterize the information available in chest accelerometry, we probed the signals within these channels. At the individual-participant level, aggregated accelerometry spectrograms aligned to the hypnogram (Section S.8, Figures S12, S13, S14, S15, S16) show stage-dependent structure, with clear shifts in power concentrated in a low-frequency band overlapping with the range of expected respiratory rates in sleep. This supports the hypothesis that very low-frequency components of chest accelerometry are important for distinguishing sleep stages. To examine this quantitatively, we (1) trained a model using accelerometry channels alone and then (2) replaced the band-pass filter with a 0.5 Hz high-pass filter to attenuate the expected frequency range of resting respirations. Results are shown in Table S3. The accelerometry-only model performed marginally better than the No ECG/PPG model but worse than the full model; this may be attributable to the data’s unimodal nature, which could reduce the complexity of learning structural patterns in multiple data types during training, thus improving convergence. Meanwhile, the reduced performance of the high-passed accelerometry-only model is compatible with the possibility that some information used by the model lies in the frequency range of human respirations.

Alternative approaches

To guard against overfitting and reduce the computation burden, we downsampled the ECG, PPG, and accelerometry signals to 100 Hz prior to use in our model. To explore the extent to which this impacted performance, we retrained the model without downsampling (i.e. using 512 Hz ECG, 128 Hz PPG, and 210 Hz accelerometry data) or band-pass filtering. This did not result in any decisive improvement in model performance (S.7, Figure S6). A strong correlation between high-frequency and low-frequency bins might have contributed to redundancy and increased the likelihood of overfitting to noise. In support of this hypothesis, we note the difference in balanced accuracy between the training and test sets was much smaller for the downsampled data (≈5%) than for the non-downsampled data (≈10%).

Two variants of the RNN model were produced to determine the value of the Mamba architecture and the 1-second window approach. The first, an LSTM-based RNN, replaces the Mamba blocks with LSTM blocks of identical input, hidden, and output dimension sizes, resulting in a parameter count comparable to the original RNN model. Following our previous study [42], the second is a CRNN (≈1271 K parameters) that groups every 30 time windows into buckets, extracts bucket features with a CNN, and processes those features sequentially using a Mamba-based RNN, resulting in predictions made per 30 seconds (see S.6 for details). The technique of extracting features from 30 seconds of data for an autoregressive model is common and has seen many variants [22, 23, 29], making the CRNN suitable as a “baseline” approach to windowing. As shown in Table 5 (confusion matrices in S.7, Figure S6), both the LSTM-based RNN models and the CRNN models show modestly reduced performance across all metrics, confirming that the Mamba and 1-second window approaches are beneficial.

Table 5.

Evaluation metrics for model architecture variants

Classes | Model | Balanced accuracy | Precision | Recall | F1 | Cohen’s κ | MCC (all values in %)
5 | CRNN | 60.02 | 56.30 | 60.02 | 56.37 | 46.96 | 47.92
5 | LSTM-based RNN | 62.64 | 69.46 | 62.47 | 64.03 | 50.76 | 51.87
5 | RNN | 65.11 | 71.61 | 64.41 | 66.15 | 53.23 | 54.38
4 | CRNN | 70.59 | 63.68 | 70.59 | 66.09 | 55.13 | 55.47
4 | LSTM-based RNN | 72.82 | 73.95 | 71.64 | 72.06 | 58.70 | 59.17
4 | RNN | 75.30 | 75.80 | 73.68 | 74.10 | 61.51 | 61.95
3 | CRNN | 80.21 | 73.43 | 80.21 | 75.76 | 65.26 | 65.46
3 | LSTM-based RNN | 81.06 | 82.90 | 81.37 | 81.82 | 68.71 | 68.95
3 | RNN | 84.02 | 84.76 | 84.04 | 84.23 | 72.89 | 73.00

Evaluation metrics for the 3-class, 4-class, and 5-class ensemble RNN, LSTM, and CRNN models. The best-performing model for each metric is bolded.
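The metrics reported here are standard and can be reproduced from epoch-level label arrays; below is a minimal sketch with scikit-learn, assuming macro averaging (the toy labels are illustrative, not study data).

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score)

# Toy epoch-level labels: 0=wake, 1=N1, 2=N2, 3=N3, 4=REM.
y_true = np.array([0, 1, 2, 2, 3, 4, 4, 0])
y_pred = np.array([0, 2, 2, 2, 3, 4, 0, 0])

metrics = {
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
    "recall": recall_score(y_true, y_pred, average="macro"),
    "f1": f1_score(y_true, y_pred, average="macro"),
    "cohens_kappa": cohen_kappa_score(y_true, y_pred),
    "mcc": matthews_corrcoef(y_true, y_pred),
}
```

Note that balanced accuracy is the macro average of per-class recall, which is why the two track each other closely in the tables.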

Global sleep metrics

As a means of evaluating model bias and verifying model robustness, Figure 6 contains scatter plots and Bland–Altman plots for the 5-class RNN ensemble, comparing predicted vs. actual values for a number of summary metrics, including time asleep, % N1 sleep, % N2 sleep, % N3 sleep, and % REM sleep. The model trends toward slightly underestimating N2 sleep and wake (as shown by the overestimated sleep in the first column) and slightly overestimating N1 and N3 sleep.
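The Bland–Altman statistics underlying these plots (bias and 95% limits of agreement) reduce to a few lines; the sketch below uses hypothetical per-recording total-sleep-time values, not study data.

```python
import numpy as np

def bland_altman(predicted, actual):
    """Return the bias (mean difference) and the 95% limits of agreement
    plotted as the center and dotted lines of a Bland-Altman plot."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    diff = predicted - actual
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical per-recording total sleep time (minutes): model vs. PSG.
tst_pred = [402, 371, 455, 340, 410]
tst_psg = [390, 380, 440, 350, 400]
bias, (lo, hi) = bland_altman(tst_pred, tst_psg)
```

A positive bias for total sleep time would correspond to the slight overestimation of sleep described above.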

Figure 6.

Figure 6

Predicted vs. actual sleep stage scatter plots (top row) and Bland–Altman plots (bottom row) for the 5-class RNN ensemble model on the test set. The center horizontal line is the mean difference. The two dotted lines are 1.96 standard deviations above and below the mean.

Recording-wise performance

We computed recording-wise performance metrics for the 5-class RNN ensemble model, the distributions of which are summarized in Table 6. We then considered recording-level accuracy as a function of clinical, demographic, and pathology features, and of ECG and PPG signal quality, using linear regression models or t-tests (plots in S.7, Figures S8, S9, S10, S11). The model performed similarly on recordings from individuals with insomnia as compared to those without (accuracy +0.03 for those with as compared to without insomnia, p = .42) and on those with restless legs syndrome (RLS) compared to those without (accuracy +0.03, p = .40). Accuracy was lower for those with REM sleep behavior disorder (RBD; accuracy −0.12, p = .02), largely driven by poorer classification of REM sleep (the 5-class RNN ensemble detected 79% of REM sleep for those without RBD vs. 39% for those with RBD). Accuracy also declined with higher AHI (p = .04); however, this was a modest effect, with accuracy dropping only 2 per cent for every 10-unit increase in AHI. Accuracy was not significantly associated with age, sex, periodic limb movement index, sleep efficiency, or TST.

Table 6.

Per-recording accuracy by sleep statistic, sleep disorder, and SQI

Metric (%) μ σ
Accuracy 64.54 10.77
F1 66.72 10.28
Cohen’s κ 50.58 14.50
MCC 52.92 13.40
Sleep statistic R2 p
Age 0.0618 .0700
Sex 0.0334 .1861
Periodic Limb Movement Index 0.0338 .1830
AHI 0.0751 .0449
BMI 0.0350 .1753
TST 0.0055 .5940
Sleep efficiency 0.0002 .9281
Disorder (n) R2 p
Restless Legs Syndrome (9) 0.0134 .4044
Insomnia (14) 0.0135 .4032
REM Sleep Behavior Disorder (5) 0.0965 .0222
% of Recording with good SQI in... R2 p
ECG 0.0623 .0687
PPG 0.0005 .8753
ECG or PPG 0.0913 .0264
ECG and PPG 0.0136 .4017

Per-recording accuracy for the 5-class RNN ensemble model (top); R2 values and slope/t-test p-values from linear regressions of accuracy on various sleep statistics (middle) and SQI statistics (bottom). Statistically significant p-values (<.05) are bolded.

Model accuracy did not vary significantly with the proportion of the recording with bad quality ECG (estimate −0.1% decline in accuracy for each 1% increase in the proportion of poor-quality ECG, p = .07) or PPG (estimate −0.01% decline in accuracy per 1% increase in the proportion of poor-quality PPG, p = .88), but it was significantly worse in recordings with a high proportion of epochs with neither good ECG nor good PPG SQI (estimate −0.3% decline in accuracy for each 1% increase in the proportion of time where both ECG and PPG SQI were poor, p = .03). However, this effect was modest (R2 = 0.09) and we note that the proportion of time where this occurred was generally low, with all but 6 recordings having at least one of ECG or PPG with good quality for at least 90 per cent of the recording. Even in recordings in which 40 per cent or more of the recording had neither good quality ECG nor good quality PPG, 5-class model accuracy remained above 0.50.
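Associations of this kind can be estimated with a univariate least-squares fit; the sketch below uses scipy.stats.linregress on simulated data whose slope mimics the reported ≈2% accuracy drop per 10-unit AHI (all values here are synthetic, not study data).

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)

# Simulated per-recording covariate and accuracy, with a true slope of
# -0.002 (i.e. -2% accuracy per 10-unit AHI) plus noise.
ahi = rng.uniform(0, 60, size=100)
accuracy = 0.68 - 0.002 * ahi + rng.normal(0, 0.05, size=100)

fit = linregress(ahi, accuracy)
# fit.slope estimates the per-unit change in accuracy;
# fit.rvalue ** 2 is the R^2 reported in Table 6; fit.pvalue tests the slope.
```

The same call, applied per covariate, yields the R2 and p-values tabulated above; binary covariates (e.g. insomnia vs. none) correspond to the t-test comparisons.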

Four annotators were involved in generating the ground truth labels used in this study, and the agreement between the model and the ground truth labels did not differ by the individual who generated the annotations (p = .89, one-way F-test).
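The annotator comparison is a one-way ANOVA across per-annotator groups of recordings; a sketch with hypothetical per-recording agreement values (the numbers are illustrative only):

```python
from scipy.stats import f_oneway

# Hypothetical per-recording agreement (e.g. kappa) grouped by PSG annotator.
annotator_a = [0.51, 0.48, 0.55, 0.50]
annotator_b = [0.49, 0.53, 0.47, 0.52]
annotator_c = [0.50, 0.54, 0.46, 0.51]

stat, p = f_oneway(annotator_a, annotator_b, annotator_c)
# A large p-value indicates no evidence that agreement differs by annotator.
```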

Discussion

Key findings

In this work, we developed a Mamba-based RNN ensemble model that accurately infers the major sleep stages from ANNE One recordings. The model demonstrates good performance across a wide range of patient characteristics, without the need to screen and exclude recordings and epochs on the basis of signal quality.

Related studies

Previous studies have undertaken the challenge of inferring sleep stage from non-EEG signals (Tables 7 and 8) using a mixture of cardiopulmonary (ECG, PPG, respiratory impedance bands, respiratory flow cannulas) and accelerometry data. Many studies used data from conventional PSG recordings [21–27], although a handful used data from purpose-designed wearable sensors [20, 22, 24, 25, 28, 29, 41]. In general, models trained on and applied to PSG data performed better than models utilizing wearable sensor data. This is not surprising, since signal quality is often much higher for attended in-laboratory PSGs, where more intrusive sensors can be used, signal quality is continuously monitored, and issues can be addressed by the technologist as they occur. Even so, despite using wearable rather than PSG sensor data, the macro-averaged performance of our ensemble RNN model approached [21–23, 25, 26] or exceeded [24, 27] the performance reported for models trained on much larger sets of PSG data.

Table 7.

PSG-based sleep staging models in related literature

Study (year) | Dataset | Metrics (%) | Signal features
Radha et al. (2019) [21] | SIESTA (7 sleep laboratories): 584 recordings (292 subjects), 25% testing | 4-stage Acc. 77, κ 61 | ECG
Sun et al. (2020) [22] | MGH (sleep laboratory): 8682 recordings (7208 subjects), 20% testing | 3-stage B.Acc. 86.8, F1 84.2, κ 76.0; 5-stage B.Acc. 70.3, F1 68.1, κ 58.5 | ECG, Resp. Effort
Kotzen et al. (2022) [23] | SHHS Visit 1 (multi-center): 5758 recordings (pretraining); MESA (general population): 2056 recordings, 204 testing | 4-stage Acc. 84, B.Acc. 76, F1 83, κ 75 | PPG
Kotzen et al. (2022) [23] | SHHS Visit 1 (multi-center): 5758 recordings (pretraining); CFS Visit-5 v1 (family study): 324 recordings, 80 testing | 4-stage Acc. 82, B.Acc. 80, F1 82, κ 74 | PPG
Pini et al. (2022) [24] | Unspecified (academic sleep centers): 12 404 recordings; CinC (sleep laboratory): 994 recordings, external testing set | 3-stage B.Acc. 74.0, F1 81.4, κ 61.6, MCC 61.7; 4-stage B.Acc. 63.6, F1 72.0, κ 53.6, MCC 54.0 | ECG
Topalidis et al. (2023) [25] | Unspecified (general population w/ sleep complaints): 314 recordings (185 subjects), 25% testing, ambulatory home recordings | 4-stage B.Acc. 85.4, F1 86.3, κ 79.2, MCC 79.2 | ECG
Jones et al. (2024) [26] | CCSHS, CFS, CHAT, MESA, WSC: 4000 randomly sampled recordings, 500 testing | 5-stage B.Acc. 74, median κ 72.5 | ECG
Kazemi et al. (2024) [27] | UC Irvine Sleep Center: 123 recordings, 20% testing | 3-stage κ 71.4; 4-stage κ 55.0; 5-stage κ 61.6 | PPG, Resp. Flow Rate, Resp. Effort

Sleep staging models in related literature trained and evaluated on PSG data. For papers with confusion matrices, additional metrics were manually calculated and included in tandem with those explicitly reported by the paper. Acc. = accuracy, B.Acc. = balanced accuracy, κ = Cohen’s kappa, MCC = Matthews correlation coefficient, Resp. = respiratory.

Table 8.

Wearable-based sleep staging models in related literature

Study (year) | Dataset | Metrics (%) | Signal features
Zhang et al. (2018) [29] | Beijing General Hospital of the Air Force (general population): 39 recordings, 12.5% testing | 5-stage Prec. 58.5, Rec. 61.1, F1 58.5 | HR, Actigraphy
Sun et al. (2020) [22] (model from Table 7) | SHHS Visits 1 & 2 (multi-center): 1000 randomly sampled recordings, external testing set | 3-stage B.Acc. 80.8, F1 80.2, κ 69.7; 5-stage B.Acc. 62.9, F1 58.6, κ 53.3 | ECG, Resp. Effort
Wulterkens et al. (2021) [43] | SOMNIA, N2N, HHS, HealthBed: 835 recordings, 35% testing | 4-stage B.Acc. 75.4, F1 76.4, κ 63.7, MCC 63.7 | PPG, Accel.
Chen et al. (2022) [20] (individual-based model) | Shirley Ryan AbilityLab (inpatient unit): 10 recordings, 10% testing | 3-class mean κ 60.0; 4-class B.Acc. 73.0, mean κ 61.7 | ECG, PPG, Accel., Temp.
Chen et al. (2022) [20] (population model) | Shirley Ryan AbilityLab (inpatient unit): 10 recordings, 10% testing | 3-class mean κ 17.1; 4-class mean κ 6.1 | ECG, PPG, Accel., Temp.
Pini et al. (2022) [24] (model from Table 7) | Z3Pulse (general population): 156 recordings (52 subjects), ECG-based wearable, external testing set | 3-stage B.Acc. 72.5, F1 79.8, κ 61.2, MCC 61.4; 4-stage B.Acc. 64.2, F1 69.8, κ 53.2, MCC 53.6 | ECG
Topalidis et al. (2023) [25] (model from Table 7) | Unspecified (general population w/ sleep complaints): 314 recordings (185 subjects), 25% testing | 4-stage B.Acc. 84.4, F1 83.9, κ 76.2, MCC 76.3 | ECG, PPG
Schipper et al. (2024) [44] | Sleep Medicine Center Kempenhaeghe and OLVG Hospital (at-home recordings): 323 recordings | 4-stage B.Acc. 75.6, F1 79.4, κ 67.1, MCC 67.2 | Accel.
Silva et al. (2024) [28] | Instituto do Sono (general population): 1522 recordings (1430 subjects), 586 testing | 4-stage mean Acc. 70.7, mean κ 56 | HR, Accel.

Sleep staging models in related literature that are trained (unless otherwise specified) and evaluated on wearable data. For papers with confusion matrices, additional metrics were manually calculated and included in tandem with those explicitly reported by the paper. Resp. = respiratory, Accel. = accelerometry, Temp. = temperature, Acc. = accuracy, B.Acc. = balanced accuracy, Prec. = precision, Rec. = recall, κ = Cohen’s kappa, MCC = Matthews correlation coefficient.

Our model exceeded [20, 22, 24, 28, 29] or matched [41, 43] the benchmarks of most wearable sensor-based models with the exception of one [25]. Of note, this study examined a younger set of participants (mean age 45) who were free of psychiatric and neurological co-morbidity, which may influence the relationship between sleep stage and its autonomic, cardiac, and pulmonary manifestations. In contrast, we studied an older (mean age 57.49) set of participants with significant sleep co-morbidity (45.81% with AHI > 5). When we limit our analyses to younger individuals without sleep apnea or significant periodic limb movements, model performance improves further (Figure 5). Another contrast between [25] and the present study is that [25] performed much stricter QC for recording quality, with 54/136 or nearly 40 per cent of nights of wearable sensor recording excluded from analysis due to poor signal quality. In contrast, in the interest of evaluating real-world performance, we did not exclude any recording or any epochs on the basis of signal quality, and the reported results reflect consideration of all available epochs of recording, irrespective of signal quality. The robustness of our model to missing data or poor signal quality is likely related in part to its use of multiple data streams, in particular accelerometry, which appeared to capture much of the information that would otherwise have been captured by ECG or PPG, such that sensors returning poor-quality data are compensated for in the model by those returning good-quality data. Indeed, poor signal quality was present in at least one sensor module for 866 (29.7%) of the 2914 hours of recording in our dataset. However, for 725 (83.7%) of these hours, the other module returned sufficient quality signal to allow accurate predictions.

Of particular comparative interest is the one previous paper addressing sleep stage inference from ANNE One data [20]. In that study, success in developing a population model was limited, and while there was some success in generating person-specific models, the performance of our ensemble RNN model exceeded even the best person-specific model in that paper, which had a 4-class balanced accuracy of 73.4 per cent.

Model performance

The RNN model architecture obtained its best per-class performance for wake and REM, and its worst for N1 and N2 sleep. As a transitory state, periods of N1 sleep are short and unstable, making its discrimination a difficult task. Physiological similarities between N2 and N3 sleep, such as their cardiorespiratory signatures, similarly complicate the model’s ability to differentiate them, as evidenced by N2–N3 confusion being the most frequent misprediction for 5-class models (as well as N1/N2–N3 confusion for 4-class models). Of note, these classes are also the hardest to distinguish from each other in sleep staging by human annotators [45]. Conversely, wake and REM sleep have clearer differentiating characteristics, such as higher HR, which allows for easier classification.

The 5-class ensemble model exhibited no statistically significant variation in accuracy with age, sex, PLMI, BMI, TST, sleep efficiency, insomnia, or RLS, an important advantage for application in clinical contexts. The marginally significant (p = .0449) association between AHI and accuracy may reflect the fact that individuals with sleep apnea experience sleep fragmentation, resulting in more N1 sleep, one of the most difficult stages to separate; nonetheless, the association was modest (R2 = 0.0751). The model showed poorer performance in individuals with RBD, driven by difficulties in classifying REM sleep. This may reflect in part the centrality of accelerometry to our model, which, in the context of abnormal movements in REM sleep, may be apt to misclassify REM sleep.

The series of ablation analyses establishes that much of the RNN model’s performance was likely driven by chest module features (i.e. ECG, temperature, and triaxial accelerometry) rather than limb module features (i.e. PPG and temperature), meaning that omitting the limb module may be reasonable if participant burden is a concern and collection of oximetry data is not necessary. Excluding both ECG and PPG data while retaining only accelerometry and temperature data resulted in model performance almost indistinguishable from use of all the chest module sensors, suggesting that the inclusion of traditional HR sensors (PPG, ECG) is not necessary. This highlights the richness of accelerometry data obtained at the chest, which may capture respiratory activity as well as measures of movement, and may also contribute to the robustness of our model, given the relative reliability of the accelerometers compared to the ECG/PPG sensors. At the same time, the attenuation of low-frequency activity (S.9) produced only modest degradations in performance, suggesting that the model exploits a nuanced combination of low-frequency respiratory motion, postural changes, and other movements, rather than any single narrow frequency band alone.

Limitations, strengths, and future directions

The core strengths of this study are: (1) the ANNE One system itself, which records multiple channels of data from a pair of minimally intrusive sensors, providing an element of redundancy should one sensor return low-quality signal; (2) the Mamba architecture, which allowed relatively high performance with a relatively parameter-scarce model; and (3) the large and varied study population, which included participants from across a spectrum of sleep disorders.

A few methodological limitations are also worth considering. (1) The black-box nature of the RNN makes it difficult to ascertain the specific mechanisms used to infer sleep stage from the sensor data. (2) The recordings were overnight with only a brief period of wakefulness before lights out, resulting in a preponderance of sleep vs. wake; it is therefore difficult to know how the model will perform in detecting daytime naps during long ambulatory recordings, and further studies with ambulatory recordings in clinical populations will be informative. (3) The recordings were all obtained from a single site, and additional work is needed to assess the extent to which these results will generalize to ANNE recordings made in different settings on different patient populations.

The choice of 1-second windows created both advantages and disadvantages. It reduced the number of frequency bins per window, saving on model parameter count, and inflated the number of epochs by 30-fold, which Mamba’s long-term dependencies could effectively capture. However, this setup complicated interpretability: while high-frequency content (>1 Hz) emerged in the model’s frequency bin features, low-frequency content (<1 Hz) manifested as changes across time windows, making their effects more difficult to capture with feature importance techniques.
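The trade-off described above can be made concrete: with 1-second windows at 100 Hz, the spectral features have 1 Hz resolution, so any sub-1 Hz structure must appear as variation across successive windows rather than within one. The window parameters below are assumptions for illustration, not the authors' exact feature pipeline.

```python
import numpy as np
from scipy.signal import stft

fs = 100                         # assumed 100 Hz input, as after downsampling
x = np.random.randn(fs * 120)    # 2 minutes of signal

# Non-overlapping 1-second windows -> one spectral vector per second,
# with frequency bins spaced 1 Hz apart (0, 1, ..., 50 Hz).
f, t, Z = stft(x, fs=fs, nperseg=fs, noverlap=0, boundary=None, padded=False)
features = np.abs(Z).T           # shape: (120 windows, 51 bins)
```

Content below 1 Hz collapses into the lowest bins of each window, so respiratory-rate dynamics are carried by the window-to-window sequence, which is what the Mamba layers model.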

An important future direction will be to better delineate how chest accelerometry contributes so strongly to model robustness, particularly in the absence of ECG and PPG. Our results suggest that some of this may be related to low-frequency, possibly respiratory, activity. However, the model appears to be making use of other information as well, and defining the physiological basis of this will be critical to better understanding the potential and limitations of chest accelerometry-based sleep staging.

Conclusions

This work introduces a Mamba-based RNN ensemble model that automatically performs 3-class, 4-class, and 5-class sleep staging using data from the ANNE One dual sensor, trained against sleep stages scored from concurrent PSG recordings. The model meets or surpasses performance metrics in the related literature when applied to our ANNE One dataset, especially relative to other wearable devices, and demonstrates robustness across a wide range of clinical characteristics. Moreover, this was achieved without prescreening and excluding large numbers of patients or epochs for reasons of signal quality. Compared to PSG, the ANNE One sensor is a far less intrusive device and procedure, which is significant for expanding the accessibility of sleep measurement for older and remote-living adults at greater risk of sleep-related illness. Our model provides a scalable, accurate, minimally intrusive approach to the ambulatory assessment of sleep staging without the need for EEG.

Supplementary Material

nov_supplementary_zsag022

Acknowledgments

Special thanks to Prof. Paul Kushner and Prof. Dylan Jones from the Department of Physics at the University of Toronto for offering various forms of support for this research during its infancy.

Contributor Information

Andrew H Zhang, Department of Computer Science, University of Toronto, Toronto, ON, Canada; Department of Medicine, Sunnybrook Research Institute, Toronto, ON, Canada; Department of Computer Science, Vector Institute for Artificial Intelligence, Toronto, ON, Canada.

Alex He-Mo, Department of Computer Science, University of Toronto, Toronto, ON, Canada; Department of Medicine, Sunnybrook Research Institute, Toronto, ON, Canada.

Richard Fei Yin, Department of Computer Science, University of Toronto, Toronto, ON, Canada; Department of Medicine, Sunnybrook Research Institute, Toronto, ON, Canada.

Chunlin Li, Department of Computer Science, University of Toronto, Toronto, ON, Canada.

Yuzhi Tang, Department of Computer Science, University of Toronto, Toronto, ON, Canada.

Dharmendra Gurve, Department of Medicine, Sunnybrook Research Institute, Toronto, ON, Canada.

Veronique van der Horst, Department of Neurology, Beth Israel Deaconess Medical Center, Boston, MA, United States.

Aron S Buchman, Department of Neurological Sciences, Rush University Medical Center, Boston, MA, United States.

Nasim Montazeri Ghahjaverestan, Department of Medicine, Sunnybrook Research Institute, Toronto, ON, Canada; Department of Physical Sciences, Sunnybrook Research Institute, Toronto, ON, Canada.

Maged Goubran, Dept. of Medical Biophysics, University of Toronto, Toronto, ON, Canada; Department of Electrical and Computer Engineering, Queen's University, Kingston, ON, Canada.

Bo Wang, Department of Computer Science, Vector Institute for Artificial Intelligence, Toronto, ON, Canada; Dept. of Computer Science, University of Toronto, Toronto, ON, Canada; Department of Laboratory Medicine, University Health Network, Toronto, ON, Canada.

Andrew S P Lim, Dept. of Medicine, University of Toronto, Toronto, ON, Canada; Department of Medicine, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, ON, Canada.

Disclosure statement

Financial disclosure: The authors have no financial conflicts of interest to disclose. This research is supported by the Centre for Aging and Brain Health Innovation, Canadian Institutes of Health Research, and National Institute on Aging. This research is not funded by Sibel Health, the manufacturer of the ANNE One device. This manuscript has been deposited in preprint form on https://arxiv.org/abs/2412.15947

Non-financial disclosure: The authors have no non-financial conflicts of interest to disclose.

Data availability

The data and code underlying this article will be shared on reasonable request to the corresponding author.

References

  • 1. Iranzo  A, Santamaria  J. Sleep in neurodegenerative diseases. In: Chokroverty S, Billiard M, eds. Sleep Medicine: A Comprehensive Guide to its Development. Clinical Milestones, and Advances in Treatment. New York, NY: Springer New York; 2015: 271–283. 10.1007/978-1-4939-2089-1_32 [DOI] [Google Scholar]
  • 2. Berry  RB, Gamaldo  CE, Harding  SM, et al.  The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications, Version 2.2. Darien (IL): American Academy of Sleep Medicine; 2015. [Google Scholar]
  • 3. Rechtschaffen  A, Kales  A. A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects. Los Angeles (CA): UCLA Brain Information Service/Brain Research Institute; 1968. [Google Scholar]
  • 4. Imtiaz  SA. A systematic review of sensing technologies for wearable sleep staging. Sensors. 2021;21(5):1562. 10.3390/s21051562 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Tsinalis  O, Matthews  PM, Guo  Y, Zafeiriou  S. Automatic sleep stage scoring with single-channel EEG using convolutional neural networks [preprint]. arXiv. 2016. arXiv:1610.01683. https://arxiv.org/abs/1610.01683
  • 6. Malafeev  A, Laptev  D, Bauer  S, et al.  Automatic human sleep stage scoring using deep neural networks. Front Neurosci. 2016;12:781. 10.3389/fnins.2018.00781 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Seo  H, Back  S, Lee  S, Park  D, Kim  T, Lee  K. Intra-and inter-epoch temporal context network (IITNet) using sub-epoch features for automatic sleep scoring on raw single-channel eeg. Biomed Signal Process Control. 2020;61:102037. 10.1016/j.bspc.2020.102037 [DOI] [Google Scholar]
  • 8. Phan  H, Mikkelsen  K, Chén  OY, Koch  P, Mertins  A, De Vos  M. SleepTransformer: automatic sleep staging with interpretability and uncertainty quantification. IEEE Trans Biomed Eng. 2022;69(8):2456–2467. 10.1109/TBME.2022.3147187 [DOI] [PubMed] [Google Scholar]
  • 9. Jia  Z, Lin  Y, Wang  J, et al.  Graphsleepnet: adaptive spatial-temporal graph convolutional networks for sleep stage classification. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. Palo Alto (CA): International Joint Conferences on Artificial Intelligence Organization; 2020: 1324–1330. 10.24963/ijcai.2020/184 [DOI]
  • 10. Fiorillo  L, Puiatti  A, Papandrea  M, et al.  Automated sleep scoring: a review of the latest approaches. Sleep Med Rev. 2019;48:101204. 10.1016/j.smrv.2019.101204 [DOI] [PubMed] [Google Scholar]
  • 11. Phan  H, Mikkelsen  K. Automatic sleep staging of EEG signals: recent development, challenges, and future directions. Physiol Meas. 2022;43(4). 10.1088/1361-6579/ac6049 [DOI] [PubMed] [Google Scholar]
  • 12. Hochreiter  S, Schmidhuber  J. Long short-term memory. Neural Comput. 1997;9(8):1735–1780. 10.1162/neco.1997.9.8.1735 [DOI] [PubMed] [Google Scholar]
  • 13. Cho  K. Learning phrase representations using RNN encoder-decoder for statistical machine translation  [Preprint]. arXiv. 2014. arXiv:1406.1078. https://arxiv.org/abs/1406.1078
  • 14. Vaswani  A, Shazeer  N, Parmar  N, et al.  Attention is all you need. In: Guyon I, von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems. Vol. 30. Red Hook (NY): Curran Associates, Inc.; 2017. [Google Scholar]
  • 15. Gu  A, Dao  T. Mamba: linear-time sequence modeling with selective state spaces [preprint]. arXiv. 2023. arXiv:2312.00752. https://arxiv.org/abs/2312.00752
  • 16. Gu  A, Johnson  I, Goel  K, et al.  Combining recurrent, convolutional, and continuous-time models with linear state space layers. In Advances in Neural Information Processing Systems. Red Hook (NY): Curran associates. Inc.  2021;34:572–585. 10.48550/arXiv.2110.13985 [DOI] [Google Scholar]
  • 17. Zhang  C, Cui  W, Guo  J. MSSC-BiMamba: multimodal sleep stage classification and early diagnosis of sleep disorders with bidirectional mamba [preprint]. arXiv. 2024. arXiv:2405.20142. https://arxiv.org/abs/2405.20142
  • 18. Zhou  X, Han  Y, Liu  C, Ding  Y, Jia  Z, Liu  Y. BiT-MamSleep: bidirectional temporal mamba for EEG sleep staging [preprint]. arXiv. 2024. arXiv:2401.01589. 10.48550/arXiv.2411.01589 [DOI]
  • 19. Sibel Health. Clinical care | ANNE One [internet]. Chicago (IL): Sibel Health; n.d..  Retrieved May 29, 2024, from https://sibelhealth.com/clinical-care/
  • 20. Chen  P-W, O’Brien  MK, Horin  AP, et al.  Sleep monitoring during acute stroke rehabilitation: toward automated measurement using multimodal wireless sensors. Sensors.  2022;22(16):6190. 10.3390/s22166190 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Radha  M, Fonseca  P, Moreau  A, et al.  Sleep stage classification from heart-rate variability using long short-term memory neural networks. Sci Rep. 2019;9(1):14149. 10.1038/s41598-019-49703-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Sun  H, Ganglberger  W, Panneerselvam  E, et al.  Sleep staging from electrocardiography and respiration with deep learning. Sleep.  2020;43(7). 10.1093/sleep/zsz306 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Kotzen  K, Charlton  PH, Salabi  S, Amar  L, Landesberg  A, Behar  JA. Sleep-PPG-Net: a deep learning algorithm for robust sleep staging from continuous photoplethysmography. IEEE J Biomed Health Inform. 2022;27(2):924–932. 10.1109/JBHI.2022.3225363 [DOI] [PubMed] [Google Scholar]
  • 24. Pini  N, Ong  JL, Yilmaz  G, et al.  An automated heart rate-based algorithm for sleep stage classification: validation using conventional polysomnography and an innovative wearable electrocardiogram device. Front Neurosci. 2022;16:974192. 10.3389/fnins.2022.974192 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Topalidis  PI, Baron  S, Heib  DPJ, Eigl  E-S, Hinterberger  A, Schabus  M. From pulses to sleep stages: towards optimized sleep classification using heart-rate variability. Sensors.  2023;23(22):9077. 10.3390/s23229077 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Jones  AM, Itti  L, Sheth  BR. Expert-level sleep staging using an electrocardiography-only feed-forward neural network. Comput Biol Med. 2024;176:108545. 10.1016/j.compbiomed.2024.108545 [DOI] [PubMed] [Google Scholar]
  • 27. Kazemi  K, Abiri  A, Zhou  Y, et al.  Improved sleep stage predictions by deep learning of photoplethysmogram and respiration patterns. Comput Biol Med. 2024;179:108679. 10.1016/j.compbiomed.2024.108679 [DOI] [PubMed] [Google Scholar]
  • 28. Silva  FB, Uribe  LFS, Cepeda  FX, et al.  Sleep staging algorithm based on smartwatch sensors for healthy and sleep apnea populations. Sleep Med. 2024;119:535–548. 10.1016/j.sleep.2024.05.033 [DOI] [PubMed] [Google Scholar]
  • 29. Zhang  X, Kou  W, Eric  I, et al.  Sleep stage classification based on multi-level feature learning and recurrent neural networks via wearable device. Comput Biol Med. 2018;103:71–81. 10.1016/j.compbiomed.2018.10.010 [DOI] [PubMed] [Google Scholar]
  • 30. McInnes  L, Healy  J, Astels  S. Hdbscan: hierarchical density based clustering. J Open Source Softw. 2017;2(11):205. 10.21105/joss.00205 [DOI] [Google Scholar]
  • 31. Torres-Leguet  A. Mamba.Py: a simple, hackable and efficient mamba implementation in pure pytorch and mlx [computer software]. GitHub.  2014. https://github.com/alxndrTL/mamba.py [Google Scholar]
  • 32. Kingma  DP, Ba  JL. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference for Learning Representations. Amherst (MA): OpenReview; 2017.
  • 33. Freund  Y, Schapire  RE. A decision-theoretic generalization of on-line learning and an application to boosting. In: Vitányi PMB, ed. European Conference on Computational Learning Theory. Berlin, Heidelberg: Springer; 1995: 23–37. 10.1007/3-540-59119-2_166 [DOI] [Google Scholar]
  • 34. Pedregosa  F, Varoquaux  G, Gramfort  A, et al.  Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830. [Google Scholar]
  • 35. Ho  TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell. 1998;20(8):832–844. 10.1109/34.709601 [DOI] [Google Scholar]
  • 36. Geurts  P, Ernst  D, Wehenkel  L. Extremely randomized trees. Mach Learn. 2006;63:3–42. 10.1007/s10994-006-6226-1 [DOI] [Google Scholar]
  • 37. Cui  J, Hang  H, Wang  Y, Lin  Z. Gbht: gradient boosting histogram transform for density estimation. In: Meila M, Zhang T, eds. International Conference on Machine Learning. Brookline (MA): PMLR. 2021; 2233–2243. [Google Scholar]
  • 38. Chen  T, Guestrin  C. XGBoost: a scalable tree boosting system. In: Krishnapuram B, Shah M, Smola AJ, Aggarwal CC, Shen D, Rastogi R, eds. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . New York (NY): ACM. 2016; 785–794. 10.1145/2939672.2939785 [DOI] [Google Scholar]
  • 39. Anderer  P, Ross  M, Cerny  A, Vasko  R, Shaw  E, Fonesca  P. Overview of the hypnodensity approach to scoring sleep for polysomnography and home sleep testing. Front  Sleep.  2023;2:1163477. 10.3389/frsle.2023.1163477 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. McInnes  L, Healy  J, Melville  JU. Uniform manifold approximation and projection for dimension reduction. J Open Source Softw. 2018;3(29):861. 10.21105/joss.00861 [DOI] [Google Scholar]
  • 41. Sundararajan  M, Taly  A, Yan  Q. Axiomatic Attribution for Deep Networks. In International Conference on Machine Learning. Brookline (MA): PMLR; 2017:3319–3328. [Google Scholar]
  • 42. Zhang  A, Li  C, Tang  Y, et al.  1122 a deep learning model for inferring sleep stage from a flexible wireless dual sensor wearable system without eeg. Sleep.  2024;47(Supplement_1). 10.1093/sleep/zsae067.1120 [DOI] [Google Scholar]
  • 43. Wulterkens  BM, Fonseca  P, Hermans  LWA, et al.  It is all in the wrist: wearable sleep staging in a clinical population versus reference polysomnography. Nat Sci Sleep. 2021;13:885–897. 10.2147/NSS.S306808 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Schipper  F, Grassi  A, Ross  M, et al.  Overnight sleep staging using chest-worn accelerometry. Sensors.  2024;24(17):1424–8220. 10.3390/s24175717 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Rosenberg  RS, Van Hout  S. The American Academy of Sleep Medicine inter-scorer reliability program: sleep stage scoring. J Clin Sleep Med. 2013;9(1):81–87. 10.5664/jcsm.2350 [DOI] [PMC free article] [PubMed] [Google Scholar]
