Author manuscript; available in PMC: 2026 Apr 1.
Published in final edited form as: IEEE Trans Biomed Eng. 2026 Apr;73(4):1596–1608. doi: 10.1109/TBME.2025.3612158

AI-Driven Sleep Staging Using Instantaneous Heart Rate and Accelerometry: Insights from an Apple Watch Study

Tzu-An Song 1, Yubo Zhang 2, Ziyuan Zhou 3, Luke Hou 4, Masoud Malekzadeh 5, Aida Behzad 6, Joyita Dutta 7
PMCID: PMC12931632  NIHMSID: NIHMS2149179  PMID: 40971274

Abstract

Polysomnography, the gold standard for sleep evaluations, involves complex setup and data acquisition protocols and requires manual scoring of sleep data. Smartwatches and other multi-sensor consumer wearable devices with automated sleep staging capabilities offer a promising and scalable alternative for routine and long-term sleep evaluations in individuals. We conducted a multi-night study using a smartwatch for sleep assessment and created an AI-driven automated sleep staging framework based on instantaneous heart rate (IHR) and accelerometry data using sleep stage labels based on electroencephalography (EEG) as the reference. 47 healthy adults were recruited to record their sleep for up to seven consecutive nights using an Apple Watch Series 6 and a Dreem 2 Headband. Our sleep staging framework relies on a sequence-to-sequence long short-term memory (LSTM) model with additional convolutional layers. Our model yields a sleep staging accuracy of 71% for classifying every 30-s epoch into four classes: wake, light sleep, deep sleep, and rapid eye movement (REM) sleep. We show through an ablation study that an intra-epoch learning LSTM, incorporation of IHR sampling frequency information, and skip connections from early to late stages of the network are three key architectural advancements that enhance overall sleep staging performance. Our overall contributions include a dedicated Apple Watch app for multi-night raw data acquisition, an open-source library for automated four-class sleep staging, and a public dataset for future investigations.

Index Terms: Sleep staging, Apple Watch, smartwatch, instantaneous heart rate, accelerometry, deep learning

I. Introduction

Sleep is vital for overall health and well-being and plays a key role in various bodily functions, including cognitive performance, metabolism, and immune function. Disrupted sleep has been linked to cardiovascular [1], metabolic [2], neurodegenerative [3], neuropsychiatric [4], and other disorders. Consequently, tracking sleep patterns could be crucial not only for the management of sleep disorders but also for diagnosing and managing a range of other medical conditions. The gold standard for clinical sleep evaluations is overnight polysomnography (PSG), which provides a comprehensive multimodal assessment encompassing electroencephalography (EEG), electrocardiography (ECG), pulse oximetry, electrooculography, electromyography (EMG), and respiratory tracking. Following PSG, trained professionals analyze sleep time series data by segmenting them into 30-second epochs and classifying them into one of five stages: wake, rapid eye movement (REM) sleep, and three non-REM (NREM) groups, N1, N2, and N3, in accordance with the American Academy of Sleep Medicine (AASM) guidelines [5]. PSG is expensive (typically costing over $1,500 a night in the US [6]), involves a complex setup and manual sleep scoring by trained sleep technologists, and, consequently, is unsuitable for long-term and repeated sleep assessment. The current digital landscape offers many wearable technologies (e.g., smartwatches, smart headbands, and smart rings) [7] that are available for consumer use and are promising for sleep assessment because of their low cost and low healthcare burden, unobtrusive and passive data collection, suitability for repeated recordings over many days or weeks, and consumer popularity. Recent studies have reported varying degrees of accuracy of sleep metrics derived from popular consumer wearables [8]–[11].

Among wearable devices for sleep tracking, the most popular ones are wristworn wearables, including the Apple Watch, Fitbit Sense, Google Pixel Watch, and many other popular smartwatch varieties. These devices are typically equipped with sensors with a multitude of functions like photoplethysmography (PPG), accelerometry, and skin temperature measurement and leverage recent advancements in artificial intelligence (AI) to perform automated sleep staging from multi-sensor data.

As part of our previous work, we developed a sequence-to-sequence long short-term memory (LSTM) model for automated mobile sleep staging (SLAMSS), which can use simple metrics that can be acquired from consumer-grade wristworn devices, namely mean activity, mean heart rate, and standard deviation of heart rate computed over 30-s epochs [12]. SLAMSS achieved reasonable accuracies for three-class (wake, NREM, and REM) and four-class (wake, light, deep, and REM) sleep staging. Most wrist wearables nowadays are capable of leveraging PPG data to collect densely sampled measures of cardiac rhythms and yield instantaneous heart rate (IHR) data. IHR is the heart rate measured as heartbeats per minute (bpm) at a specific instant of time. Recent studies suggest that IHR, with its higher temporal resolution than epoch-level measures, offers greater sensitivity to physiological changes and enhanced accuracy in sleep staging [13]–[15].

In contrast with conventional coarse heart rate measurements averaged over 30-s epochs, IHR provides deeper insight into cardiac rhythms. IHR can fluctuate quickly due to various physical states, such as exercise, stress levels, or changes in sleep cycles. Observing IHR is very useful for understanding its effects on health and overall well-being. In both clinical and research settings, IHR data is increasingly being used to examine the relationships between heart rate dynamics and various biological functions such as sleep patterns, cardiovascular health, and overall fitness. A strong physiological link exists between sleep and IHR. During deep sleep, heart rate typically decreases as the parasympathetic nervous system promotes restorative processes. In comparison, REM sleep, which is characterized by increased brain activity and vivid dreaming, leads to greater heart rate variability (HRV), sometimes resembling wakefulness. Fluctuations in IHR can provide key insights into sleep quality, with persistent irregularities potentially indicating stress, sleep disturbances, or disorders like sleep apnea [16] and cardiovascular diseases [17]. IHR can be measured using a wide variety of technologies, including both clinical-grade and mobile ECG, pulse oximetry, PPG in wristworn wearables, camera-based pulse PPG, ballistocardiography in bed sensors, and many other modalities. By enabling multi-night IHR monitoring during sleep, these wearable technologies provide deeper insights into HRV and sleep patterns, supporting the assessment of overall health, the evaluation of sleep efficiency, and the detection of potential sleep disturbances.

Given the informational richness of IHR data and its easy accessibility from modern wristworn devices, this work seeks to extend and adapt the SLAMSS framework to take advantage of IHR and instantaneous accelerometry information to enable automated four-class sleep staging. We conducted a multi-night sleep study on a cohort of 47 healthy adults using the Apple Watch Series 6. Leveraging a concurrently used Dreem 2 Headband, which yields EEG-derived epoch-wise sleep stage labels as the reference, we trained and validated our automated sleep staging model.

To the best of our knowledge, this is the first open-source four-class sleep staging model developed from a multi-night Apple Watch sleep study. While our model operates on a per-night basis, the dataset itself contains multi-night recordings from healthy individuals, making it well-suited for future research on longitudinal or continuous monitoring. Our key contributions include: (1) a dedicated Apple Watch app for collecting multi-night IHR and accelerometry raw data, (2) an open-source library for four-class sleep staging using IHR and accelerometry, and (3) a de-identified multi-night public dataset openly available for future research investigations.

II. Related Work

In our previous work, we developed the SLAMSS model [12], capable of classifying sleep into three stages (wake, NREM, REM) or four stages (wake, light, deep, REM) using activity data from wrist accelerometry and two coarse heart rate measures, both readily obtainable from consumer-grade wristworn devices. Traditionally, automated sleep staging has relied on EEG signals as the primary and most reliable input for determining sleep stage labels in wearable devices [18]–[21]. But, despite the availability of wearable EEG devices, these devices have yet to achieve widespread consumer adoption. A significant barrier to their adoption is user comfort, as wearing an EEG headband throughout the night can be inconvenient and disruptive to sleep. Recently, there has been increasing interest in ECG-based sleep staging [22]–[27], particularly in utilizing ECG-derived IHR and HRV for more accurate sleep assessment [15], [28]–[30]. A summary of representative studies is provided in Table I, which categorizes prior work by modality, architecture, task, and dataset size. ECG offers valuable insights into sleep physiology by capturing autonomic nervous system activity and HRV across different sleep stages. However, while smartwatches like the Apple Watch offer single-lead ECG capabilities, these measurements require deliberate user interaction, such as placing a finger on the sensor, which makes passive and continuous sleep monitoring impractical.

TABLE I.

Summary of published sleep staging studies utilizing photoplethysmography (PPG), electrocardiography (ECG), and related cardiac signal features. Studies are organized by sensing modality, model or architecture, and classification task.

Reference | Year | Modality | Model / Architecture | Sleep Staging Task
Krauss et al. [30] | 2025 | ECG-derived HRV and RRV, actigraphy motion | LSTM | 3/5-class
Sharan et al. [14] | 2024 | ECG-derived IHR, ECG-derived respiration | CNN + GRU | 2/3/4/5-class
Attia et al. [34] | 2024 | PPG | SleepPPG-Net2 (CNN) | 4-class
Song et al. [12] | 2023 | ECG-derived HRV, actigraphy motion | SLAMSS (CNN+LSTM) | 3/4-class
Ganglberger et al. [27] | 2023 | ECG-derived HRV, respiration | Deep neural network | 3/5-class
Mathunjwa et al. [28] | 2023 | ECG-derived IHR | CNN | 4/5-class
Kotzen et al. [31] | 2023 | PPG | SleepPPG-Net (CNN) with TL | 4-class
Huttunen et al. [33] | 2023 | PPG, SpO2, nasal pressure | CNN with MTL | 4-class
Motin et al. [35] | 2023 | PPG-derived HRV | Support vector machine | 2/3/4-class
Radha et al. [32] | 2021 | PPG | LSTM with TL | 4-class
Sridhar et al. [15] | 2020 | ECG-derived IHR | Dilated CNN | 4-class
Sun et al. [24] | 2020 | ECG, respiration | CNN + LSTM | 4-class
Fonseca et al. [25] | 2020 | ECG-derived HRV, body movement | LSTM | 4-class
Walch et al. [13] | 2019 | PPG-derived HRV, actigraphy motion | Shallow neural network | 2/3-class
Radha et al. [22] | 2019 | ECG-derived HRV | LSTM | 4-class
Boe et al. [23] | 2019 | ECG-derived HRV, actigraphy motion, temperature | Bagging classifier | 2/3/4-class
Zhang et al. [39] | 2018 | PPG-derived HRV, actigraphy motion | LSTM with feature learning | 5-class
Beattie et al. [36] | 2017 | PPG-derived HRV, actigraphy motion | Linear discriminant classifier | 4-class

RRV: Respiration Rate Variability, LSTM: Long Short-Term Memory, CNN: Convolutional Neural Network, GRU: Gated Recurrent Unit, TL: Transfer Learning, MTL: Multi-Task Learning

In contrast, PPG sensors embedded in smartwatches allow for continuous, non-invasive heart rate monitoring, making them well-suited for sleep tracking. As a result, there has been a growing number of studies exploring sleep staging using PPG-derived data [31]–[35], including PPG-based IHR and HRV [13], [36]–[39]. These studies demonstrate the potential of PPG-based data in differentiating sleep stages, providing a scalable and user-friendly alternative to traditional methods. With the widespread availability of consumer-grade wearables equipped with PPG and accelerometer sensors, an automated sleep staging approach that relies solely on activity and heart rate data holds great promise for broad adoption. Such a method could enable more accessible, long-term sleep monitoring, benefiting both clinical research and everyday consumer applications.

While recent studies have investigated PPG-derived features for sleep staging, most have relied on processed HRV or other derived physiological metrics rather than the simple IHR signal that is ubiquitously available across virtually all consumer smartwatches. This distinction is important because IHR is a standardized, low-complexity measure that can be acquired continuously without specialized preprocessing pipelines, making it more generalizable across devices and datasets. Leveraging such universally available PPG-derived IHR, in combination with accelerometry, enables the development of sleep staging models that are not constrained to specific brands, sensor configurations, or proprietary signal processing algorithms. From this perspective, our work aims to fill a gap in the current literature by demonstrating that robust, multi-class sleep staging can be achieved using only these universally accessible signals, thus supporting both broad scalability and cross-device applicability.

III. Methods

A. Study Population

We collected sleep data from 47 healthy adult volunteers with no history of sleep disorders, recruited from the local community through study advertisement flyers. The study protocol was approved by the University of Massachusetts Lowell Institutional Review Board. Participants were selected without regard for gender or ethnicity and provided written informed consent. To ensure a representative sample, all participants met the following criteria: a healthy weight (BMI <40 kg/m²), as obesity is known to impact sleep; a self-reported generally healthy status; and a self-reported relatively normal daily sleep pattern. Individuals with shift work or irregular day-night schedules were excluded. The demographic details of the participants are summarized in Table II. Each participant simultaneously wore a Dreem 2 Headband and an Apple Watch for up to seven consecutive nights, as shown in Fig. 1.

TABLE II.

Participant Details and Demographics

Metric Value
# of Subjects (# Female) 47 (30)
# of Nights 253
Age in years (Mean ± Std. Dev.) 34.87 ± 7.49
White 38
Asian 7
Black 1
Latino 1

Fig. 1.

Multi-night sleep monitoring setup: A participant wears a Dreem 2 Headband to collect EEG data and an Apple Watch Series 6 to collect sleep IHR and accelerometry data.

B. Device Details

1). Apple Watch:

The wristworn wearable used in this study was the Apple Watch Series 6, a smartwatch equipped with advanced health and fitness tracking capabilities. Designed to be comfortable and non-intrusive, the Apple Watch does not significantly disrupt sleep or cause discomfort during use. It employs PPG sensors to monitor heart rate and a triaxial micro-electromechanical system (MEMS) accelerometer to track movement and activity levels. These sensors work in unison to provide comprehensive insights into the user’s physical activity and overall health. For this study, the Apple Watch was specifically utilized to measure heart rate and motion during sleep. The accelerometer captures acceleration along the x, y, and z axes (measured in units of g = 9.8 m/s²), while heart rate is monitored via PPG on the dorsal side of the wrist.

To interface with the Apple Watch, we developed BIDSleep, a companion iOS application that records IHR and accelerometer signals during sleep. The app runs in the background with minimal battery impact, automatically managing nightly sessions and saving data in a standardized comma-separated values (CSV) format stored directly in the iOS Files app for easy export. BIDSleep also supports multi-night tracking, enabling consistent and convenient monitoring across consecutive nights.

2). Dreem 2 Headband:

The Dreem 2 Headband was used to collect sleep EEG data and derive sleep stages in this study. According to [40], Dreem Headband’s proprietary algorithm for automated sleep staging is comparable to a consensus of five expert scorers relative to clinical-grade PSG data. This lightweight, cost-effective wearable device provides high-quality EEG recordings for home use. It has four EEG channels (FpZ-O1, FpZ-O2, FpZ-F7, F8-F7) to capture brain activity during sleep. In addition, the headband includes two pulse oximeters (red and infra-red) to measure blood oxygen saturation and heart rate, as well as three accelerometer channels (x, y, and z) to track head position and respiratory rate. The device provides automated sleep stage results and raw EEG for analysis, along with personalized sleep insights through a smartphone app.

C. Data Retrieval and Preprocessing

Fig. 2 illustrates the full variety of signal types used in this work. Sleep data from the Apple Watch was recorded through BIDSleep, which runs on both iPhones and Apple Watches, interfacing with HealthKit to access health data and with WatchKit to support device interaction between the two devices.

Fig. 2.

Data inputs to the SLAMSS model were collected from the Apple Watch: instantaneous heart rate (IHR), accelerometer data (Accel), epoch-wise heart rate sampling frequency (Freq), and epoch-wise time information (Time). The model’s target output comprises epoch-wise sleep stage labels (Stage) computed from EEG signals derived from the Dreem Headband.

1). Apple Watch IHR Data:

Apple Watch sleep IHR data is extracted via HealthKit from the device’s PPG sensor. Outliers are removed by applying a threshold of 3 to z-scores. The raw IHR data from the Apple Watch are irregularly sampled, with the sampling rate being higher when the wearer is more active. As a first preprocessing step, we apply linear interpolation to obtain a uniformly sampled 1-Hz signal, indicated as IHR in Fig. 2.
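The two preprocessing steps described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the released implementation: the function names, the use of the population standard deviation, and the choice to compute z-scores over a whole recording are assumptions that the text does not specify.

```python
import math
from statistics import mean, pstdev


def remove_outliers(times, values, z_thresh=3.0):
    """Drop samples whose absolute z-score exceeds z_thresh (assumed: per-recording z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return list(times), list(values)
    kept = [(t, v) for t, v in zip(times, values) if abs(v - mu) / sigma <= z_thresh]
    return [t for t, _ in kept], [v for _, v in kept]


def resample_1hz(times, values):
    """Linearly interpolate irregular samples onto a 1-s grid.

    `times` are seconds, strictly increasing, with at least two samples.
    """
    grid = list(range(math.ceil(times[0]), math.floor(times[-1]) + 1))
    out = []
    j = 0  # index of the left neighbour of the current grid point
    for g in grid:
        while j + 1 < len(times) - 1 and times[j + 1] < g:
            j += 1
        t0, t1 = times[j], times[j + 1]
        v0, v1 = values[j], values[j + 1]
        w = (g - t0) / (t1 - t0)
        out.append(v0 + w * (v1 - v0))
    return grid, out
```

For example, raw samples of 60, 70, and 60 bpm at 0, 5, and 10 s would be resampled to eleven 1-Hz values that rise linearly to 70 bpm at 5 s and fall back to 60 bpm at 10 s.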

2). Apple Watch Accelerometer Data:

The Apple Watch accelerometer records movement along the x, y, and z axes and also generates irregularly sampled data. To improve measurement accuracy, outliers are removed by applying a threshold of 3 to z-scores. After outlier removal, the signals are interpolated to a uniform 1-Hz sampling rate, ensuring consistency across epochs. Finally, an overall movement magnitude, denoted as Accel, is computed as:

$$\mathrm{Accel} = \sqrt{a_x^2 + a_y^2 + a_z^2}, \qquad (1)$$

where $a_x$, $a_y$, and $a_z$ represent acceleration along the three Cartesian coordinate axes.
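As a small worked example of Eq. (1), the sketch below computes the magnitude sample by sample. It reads Eq. (1) as the Euclidean norm of the three axes, which is an assumption consistent with the description of Accel as a movement magnitude; the function name is hypothetical.

```python
import math


def accel_magnitude(ax, ay, az):
    """Per-sample Euclidean norm of triaxial acceleration (same units as the inputs, here g)."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]
```

Under this reading, a wrist at rest with gravity along a single axis, e.g. (0, 0, 1) in units of g, yields a magnitude of 1 g.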

3). IHR Sampling Rate:

The IHR sampling rate, denoted as Freq, refers to the frequency at which the Apple Watch samples and records heart rate data. Before applying linear interpolation to standardize the irregularly sampled IHR, we first calculate the original sampling rate for each epoch. Typically, the raw IHR sampling rate is around 0.2 Hz. A higher sampling rate indicates more frequent data collection, often corresponding to increased movement or activity. The enhanced temporal resolution during active phases allows the Apple Watch to capture more detailed variations in heart rate and motion, providing a more accurate representation of the subject’s physiological state over time.
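One plausible way to obtain an epoch-wise Freq value is to count the raw (pre-interpolation) IHR samples falling in each 30-s epoch and divide by the epoch length. The text does not give the exact formula, so the sketch below is an assumption, as are the function and parameter names.

```python
def epoch_sampling_rate(sample_times, n_epochs, epoch_len=30.0):
    """Return the raw sampling rate (Hz) per epoch.

    `sample_times` are raw IHR timestamps in seconds from the start of the
    recording. Assumed definition: samples in the epoch / epoch length.
    """
    counts = [0] * n_epochs
    for t in sample_times:
        idx = int(t // epoch_len)
        if 0 <= idx < n_epochs:
            counts[idx] += 1
    return [c / epoch_len for c in counts]
```

With one sample every 5 s, an epoch contains six samples, which gives 6 / 30 = 0.2 Hz and matches the typical rate quoted above.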

4). Time Variables:

To estimate the circadian contributions to sleep, this study utilizes a “clock proxy” feature [13], denoted here as the variable Time, which models the body’s internal clock using two approaches: a simple cosine wave or a personalized circadian clock model. The latter is based on step data from the Apple Watch, serving as a proxy for light exposure, one of the primary regulators of circadian rhythm.

Step data is first collected and then mapped to estimated light exposure levels based on the time of day, as physical activity often correlates with ambient light availability. Higher step counts during daylight hours may indicate greater exposure to natural light, which helps regulate sleep-wake cycles, while lower step counts in the evening may suggest reduced light exposure, signaling the body to prepare for sleep. By incorporating step-derived light estimation, the personalized model provides a more dynamic and individualized representation of circadian phase, improving the accuracy of sleep-wake predictions.
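The simpler of the two options, the cosine wave, can be sketched as follows. The 24-hour period and the `peak_hour` phase parameter are assumptions introduced for illustration; the text does not state the phase used, and the personalized step-based model is not reproduced here.

```python
import math


def cosine_clock_proxy(clock_hours, peak_hour, period_hours=24.0):
    """Cosine 'clock proxy' evaluated at each epoch's clock time (in hours).

    `peak_hour` is a hypothetical phase parameter: the clock time at which the
    cosine reaches its maximum of 1.
    """
    return [
        math.cos(2.0 * math.pi * (h - peak_hour) / period_hours)
        for h in clock_hours
    ]


def epoch_clock_hours(start_hour, n_epochs, epoch_len_s=30.0):
    """Clock time (hours, wrapped to 0-24) at the start of each 30-s epoch."""
    return [(start_hour + k * epoch_len_s / 3600.0) % 24.0 for k in range(n_epochs)]
```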

5). Dreem 2 Headband Sleep Stage Labels:

The Dreem 2 Headband utilizes EEG sensors with dry electrodes for signal capture. It uses a proprietary algorithm to distinguish between different sleep stages, including wake, N1, N2, N3, and REM. For this study, these five stages are consolidated into four categories: wake, light sleep (comprising N1 and N2), deep sleep (N3), and REM sleep. The headband produces a hypnogram of sleep stage label assignments for every 30-s epoch of the night. We denote the sleep stage label variable here as Stage. To ensure accurate alignment of sleep staging data with other physiological signals, both Apple Watch and Dreem Headband data are recorded using the same Apple iPhone. This synchronization guarantees that both devices share a common time reference, minimizing potential misalignment issues. As a result, Apple Watch data (including time series for variables IHR, Accel, Freq, and Time) can be seamlessly integrated with the sleep stage labels (Stage) from the Dreem Headband. In addition, all Dreem EEG signals and the corresponding auto-scored hypnograms were carefully reviewed by a sleep expert to verify quality and resolve obvious artifacts before analysis.
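The five-to-four stage consolidation described above amounts to a simple lookup. In the sketch below, the label strings are hypothetical placeholders and may differ from the labels in the Dreem export.

```python
# Mapping stated in the text: N1 and N2 -> light, N3 -> deep; wake and REM unchanged.
FIVE_TO_FOUR = {
    "Wake": "Wake",
    "N1": "Light",
    "N2": "Light",
    "N3": "Deep",
    "REM": "REM",
}


def consolidate_hypnogram(stages):
    """Map a five-class epoch-wise hypnogram to the four-class Stage labels."""
    return [FIVE_TO_FOUR[s] for s in stages]
```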

D. Data Selection and Quality Assurance

In this study, we focused on ensuring high-quality data by excluding recordings that were shorter than 5 hours. Collecting multi-night sleep data presented challenges such as unexpected device shutdowns, battery depletion, and participants staying awake for extended periods (for example, to take phone calls); we excluded the affected recordings. Recordings were also excluded if the device was unintentionally removed or was not put back on after the participant returned to bed. These measures helped to maintain a dataset with accurate, complete, and reliable data, ensuring the integrity of our analysis.

E. Network Design

For this study, we build on the SLAMSS model as the backbone of our network [12]. The SLAMSS architecture integrates a convolutional neural network (CNN) with an LSTM-based encoder-decoder sequence-to-sequence structure [41], which is further enhanced in this work by an attention mechanism to improve feature extraction and temporal dependencies in sleep staging.

Originally, the SLAMSS model was designed to process coarse heart rate and activity variables, including epoch-wise mean and standard deviation of heart rate and epoch-wise activity counts. The modified SLAMSS model in this work accepts the IHR and Accel inputs from the Apple Watch, which are instantaneous (as opposed to epoch-wise) signals, and predicts epoch-wise Stage labels based on sleep stages derived from the Dreem 2 Headband. The network modifications described in this paper allow for a more fine-grained analysis of physiological signals, improving the model’s capability to capture sleep dynamics. The overall network architecture of the modified SLAMSS model is shown in Fig. 3(a). This model consists of two learning components. The first, intra-epoch learning, focuses on extracting information within a single epoch, while the second component handles inter-epoch learning, which involves processing information across multiple epochs.

Fig. 3.

(a) The SLAMSS-IFS architecture for four-class sleep staging from Apple Watch data. The SLAMSS network has instantaneous heart rate (IHR) and accelerometer (Accel) signals as model inputs, and sleep stage labels (Stage), Wake (W), REM (R), Light (L), and Deep (D), as the model outputs. The SLAMSS-IFS network processes input sequences of 1200 epochs (each lasting 30 seconds), which are passed into separate CNN and LSTM subnetworks. The features extracted by both the CNN and LSTM are concatenated, followed by the concatenation of epoch-wise Time (Time) and Frequency (Freq) sequences. The combined features are then input into an attention-guided encoder-decoder system to generate the output labels. (b) The intra-epoch learning process of the CNN subnetwork for processing the 1200-epoch input sequences. (c) The intra-epoch learning process of the LSTM subnetwork for processing the 1200-epoch input sequences.

In the initial stage of the SLAMSS model, intra-epoch learning is applied to extract meaningful representations from IHR and Accel signals within each epoch. This process involves two specialized subnetworks: a CNN and an LSTM network, each designed to capture distinct aspects of the sleep signals.

The CNN subnetwork, responsible for capturing intra-epoch local patterns, comprises five convolutional layers, each with 128 channels, a kernel size of 5, zero-padding of 2, and a stride of 1. To enhance feature extraction and reduce dimensionality, two max-pooling operations are applied: the first, with a kernel size of 6 and a stride of 6, follows the first convolutional layer; the second, with a kernel size of 5 and a stride of 5, follows the fourth convolutional layer. A 50% dropout layer is applied at the end to mitigate overfitting. All convolutional layers use leaky rectified linear units (ReLUs) as activation functions, except for the final convolutional layer. This CNN subnetwork extracts epoch-wise localized features, as illustrated in Fig. 3(b).
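The layer settings above can be checked with a short length calculation. The sketch assumes that each epoch enters the CNN as 30 samples (a 30-s epoch at the 1-Hz rate described earlier) and that pooling uses no padding with floor rounding; both are assumptions, and the helper names are hypothetical.

```python
def conv_out_len(n, kernel, stride=1, padding=0):
    """Output length of a 1D convolution (standard formula, dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def pool_out_len(n, kernel, stride):
    """Output length of 1D max pooling without padding."""
    return (n - kernel) // stride + 1


def cnn_feature_length(n_in=30):
    """Trace the per-epoch sequence length through the five-layer CNN."""
    n = conv_out_len(n_in, kernel=5, stride=1, padding=2)  # conv 1 keeps the length
    n = pool_out_len(n, kernel=6, stride=6)                # pool after conv 1
    for _ in range(3):                                     # conv 2 to conv 4
        n = conv_out_len(n, kernel=5, stride=1, padding=2)
    n = pool_out_len(n, kernel=5, stride=5)                # pool after conv 4
    n = conv_out_len(n, kernel=5, stride=1, padding=2)     # conv 5
    return n
```

Under these assumptions the length goes 30, 5, 1, so each epoch is summarized by a single position across the 128 channels, i.e. a 128-element feature vector.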

In contrast, the LSTM subnetwork is designed to model temporal dependencies within each epoch. It consists of two bidirectional LSTM layers, each with a hidden size of 32, followed by a fully connected layer with a hidden size of 128 and a 50% dropout layer. Temporal features are obtained by passing the final sample of each epoch through the fully connected layer, allowing the model to capture long-range dependencies in the data, as shown in Fig. 3(c).

The output features from the CNN and LSTM subnetworks are concatenated to form a compact, low-dimensional representation for each epoch. To further enhance this representation, additional epoch-wise Time and Freq signals are integrated before being processed by an LSTM-based encoder-decoder subnetwork. This architecture facilitates inter-epoch learning, significantly improving the accuracy of sleep stage predictions. The LSTM encoder consists of an LSTM layer with a hidden size of 96, followed by a 20% dropout layer to mitigate overfitting. Similarly, the LSTM decoder is composed of an LSTM layer with a hidden size of 96, a 20% dropout layer, and a fully connected layer responsible for generating the final sleep stage predictions.

An attention mechanism is incorporated to refine the encoder’s hidden representation by dynamically integrating features across all epochs. This allows the model to focus on the most informative segments of the sequence, enhancing its ability to capture temporal dependencies. Through an attention-driven optimization process [42], the model learns to prioritize relevant features, thereby improving its predictive accuracy. The attention mechanism functions by calculating alignment scores, which measure the similarity between the encoder’s and decoder’s hidden states. These scores are normalized via a softmax activation function to derive attention weights. The model then constructs a context vector by computing a weighted sum of the encoder’s hidden states, assigning higher weights to the most pertinent time steps. This approach prevents the model from over-relying on the final elements of the sequence, a common limitation of traditional, non-attention-based decoders [42]. Instead, the model intelligently allocates focus to key temporal patterns, enhancing its ability to learn robust representations and generate more accurate sleep stage predictions.
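The three steps described above (alignment scores, softmax normalization, weighted sum) can be illustrated for a single decoder step. The text does not specify the scoring function, so the dot product used here is an assumption; it requires encoder and decoder states of equal size, which holds for the hidden size of 96 stated above. Function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(encoder_states, decoder_state):
    """Return (attention weights, context vector) for one decoder step.

    Alignment score (assumed): dot product between each encoder hidden state
    and the current decoder hidden state.
    """
    scores = [
        sum(h_k * d_k for h_k, d_k in zip(h, decoder_state))
        for h in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context
```

Because the context vector is a weighted sum over all encoder states, an epoch far from the end of the sequence can still dominate the context when its alignment score is high.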

Finally, as shown in Fig. 3, a skip connection is introduced to allow the decoder to directly access the low-dimensional epoch-wise representations. This connection facilitates a more efficient learning process, improving the model’s overall capacity to predict sleep stages with greater precision and accuracy.

Hereafter, we refer to the developed model as the SLAMSS-IFS model, where the “I” highlights the use of an intra-epoch learning LSTM, the “F” refers to the use of the sampling frequency variable Freq as an information source, and the “S” refers to the use of skip connections to the decoder.

F. Experimental Setup

We compared the model to several competitive approaches, namely SeqSleepNet [43], Dilated CNN [15], U-Sleep [44], as well as custom transformer and LSTM models incorporating a CNN front-end. Table III summarizes the network properties of all models used. Each model was assessed on its ability to perform four-class sleep staging utilizing whole night IHR and Accel inputs, with specific components and configurations as described below.

TABLE III.

Network Architectures

Network | Skip Connection | Input Types
SLAMSS-IFS | Yes | IHR, Accel, Time, Freq
SeqSleepNet [43] | No | IHR, Accel
Dilated CNN [15] | Yes | IHR, Accel
U-Sleep [44] | Yes | IHR, Accel
CNN+Transformer [45] | No | IHR, Accel, Time, Freq
CNN+LSTM [46] | No | IHR, Accel, Time, Freq

1). SLAMSS-IFS:

The SLAMSS-IFS model builds on the original SLAMSS model [12] with three additional components: an intra-epoch learning LSTM (“I”), a frequency variable (“F”), and a skip connection (“S”). The intra-epoch learning LSTM processes temporal dependencies within individual epochs. The frequency variable captures irregularities in heart rate data. Skip connections enable the decoder to access epoch-wise low-dimensional representations.

2). SeqSleepNet:

SeqSleepNet [43] was adapted to IHR by first using a 1D convolutional layer to extract short-range temporal features from each epoch’s IHR signal while preserving its length. An intra-epoch bidirectional gated recurrent unit (GRU) with attention summarizes each epoch into a single vector, and an inter-epoch bidirectional GRU models relationships across consecutive epochs. A final linear layer outputs the predicted stage for each epoch.

3). Dilated CNN:

We implemented a dilated CNN for benchmarking, as proposed by Sridhar et al. for IHR-based sleep staging [15]. This model employs dilated convolutional layers to capture multi-scale temporal patterns in the sleep data. The dilated CNN approach offers an efficient way to analyze long-range dependencies while reducing computational complexity.

4). U-Sleep:

U-Sleep [44], originally developed for multi-channel PSG data, was adapted to IHR with an encoder-decoder architecture that includes skip connections between symmetric stages. The encoder compresses the full-night input into progressively higher-level temporal representations, and the decoder reconstructs these features to generate a sequence of epoch-level predictions covering the entire recording.

5). CNN+Transformer:

We implemented a state-of-the-art transformer model, a deep learning architecture known for its effectiveness in handling sequential data [45]. Prior to the transformer layer, a CNN model is employed to extract short-range features. The transformer uses self-attention mechanisms to model long-range dependencies within the sleep data, allowing for better performance in capturing complex temporal relationships across sleep stages.

6). CNN+LSTM:

We used a standard LSTM backbone and enhanced the network with a CNN front-end to extract short-range features [46]. By incorporating a memory cell and gating mechanisms, the LSTM is capable of modeling temporal relationships within sleep data and making accurate predictions of sleep stages.

G. Ablation Study

We conducted an ablation study to examine the proposed upgrades and refinements to the SLAMSS model. We compared the sleep-staging accuracy of the new SLAMSS-IFS model with the following variants: SLAMSS-FS, SLAMSS-IS, SLAMSS-IF, SLAMSS-S, SLAMSS-F, SLAMSS-I, and SLAMSS. All variants share the same backbone and training protocol to ensure a fair comparison.

  • SLAMSS-IFS: The full model combines all three components: an intra-epoch LSTM to capture intra-epoch temporal dynamics (I), the sampling-frequency variable to account for rate-related effects (F), and skip connections to expose encoder representations to the decoder (S).

  • SLAMSS-FS & IS & IF: SLAMSS-FS retains the frequency variable and skip connections (no intra-epoch LSTM). SLAMSS-IS retains the intra-epoch LSTM and skip connections (no frequency variable). SLAMSS-IF retains the intra-epoch LSTM and the frequency variable (no skip connections).

  • SLAMSS-S: This variant retains only the skip connections, removing both the intra-epoch LSTM and the frequency variable. It tests whether feature bypass from encoder to decoder alone provides measurable benefit.

  • SLAMSS-F: This variant incorporates solely the frequency variable, eliminating the intra-epoch LSTM and skip connections. It measures the impact of modeling the sampling frequency independently.

  • SLAMSS-I: This variant includes only the intra-epoch LSTM, removing the frequency variable and skip connections. It serves as a sequential baseline focused on within-epoch temporal modeling.

  • SLAMSS: The baseline model removes all three components (no I, F, or S) while keeping the same backbone and training protocol. It provides a baseline reference for all SLAMSS variants.
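The eight variants above correspond to every on/off combination of the three components. A minimal sketch of that enumeration follows; the function and dictionary names are hypothetical and are not taken from the released SLAMSS-IFS code.

```python
from itertools import product

# I: intra-epoch LSTM, F: frequency variable, S: skip connections.
COMPONENTS = ("I", "F", "S")


def variant_name(flags):
    """Build the variant label from booleans ordered as (I, F, S)."""
    suffix = "".join(c for c, enabled in zip(COMPONENTS, flags) if enabled)
    return "SLAMSS-" + suffix if suffix else "SLAMSS"


# Map each variant label to its component switches.
variants = {
    variant_name(flags): dict(zip(COMPONENTS, flags))
    for flags in product([False, True], repeat=len(COMPONENTS))
}

for name, config in variants.items():
    print(name, config)
```

Keeping the switches in one table like this makes it straightforward to train every variant with an otherwise identical backbone and protocol.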

H. Additional Analyses

To further evaluate model performance, we conducted five-fold cross-validation to ensure robustness and generalizability across subjects. In addition, we extended the classification task to five-class sleep staging (wake, N1, N2, N3, and REM), providing a more fine-grained assessment beyond the standard four-class setting.
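The text does not describe how the folds were formed. The sketch below shows one plausible subject-wise scheme, in which every subject is held out exactly once and no subject appears in both the training and test partitions of a fold. The subject IDs, seed, and function name are assumptions for illustration.

```python
import random


def subject_kfold(subject_ids, k=5, seed=0):
    """Yield (train_subjects, test_subjects); each subject is tested exactly once."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    # Deal the shuffled subjects into k folds of near-equal size.
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


subjects = [f"S{n:02d}" for n in range(1, 48)]  # 47 hypothetical subject IDs
for train, test in subject_kfold(subjects):
    print(len(train), len(test))
```

Splitting by subject rather than by night or epoch keeps all recordings from one person on the same side of the split, which is what a claim of generalizability across subjects requires.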

I. SLAMSS-IFS Training, Validation, and Testing

The SLAMSS-IFS framework was implemented on the PyTorch platform using an NVIDIA RTX 3090 Ti graphics card for efficient processing. The data were partitioned subject-wise into independent training, validation, and testing subsets: the training set consists of 31 subjects, the validation set includes 5 subjects, and the test set contains the remaining 11 subjects. Hyperparameters such as the learning rate and batch size were set to 0.00015 and 20, respectively. The learning rate was selected based on preliminary experiments, as it consistently yielded the highest validation performance. The loss function used in this study is real-world (RW) weighting loss [47], which modifies the traditional cross-entropy loss to penalize both the omission of positive instances and the misclassification of negatives as positives. Detailed information on the RW loss function for sleep staging can be found in our previous work [12]. The model was trained for 500 optimization epochs using the Adam optimizer. The validation dataset was utilized to identify the optimal model parameters, which were then applied to generate the final results from the test set.
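As a rough illustration of the partitioning described above, the following sketch shuffles subject IDs once and cuts them into 31 training, 5 validation, and 11 test subjects. The constants repeat values stated in the text; the subject IDs, the random seed, and all names are hypothetical, and the actual assignment of subjects to subsets is not given here.

```python
import random

# Values stated in the text; the constant names are illustrative.
LEARNING_RATE = 0.00015
BATCH_SIZE = 20
N_OPTIMIZATION_EPOCHS = 500


def split_subjects(subject_ids, n_train=31, n_val=5, seed=0):
    """Shuffle subjects once and cut the list into train/validation/test groups."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # all remaining subjects
    return train, val, test


subjects = [f"S{n:02d}" for n in range(1, 48)]  # 47 hypothetical subject IDs
train, val, test = split_subjects(subjects)
print(len(train), len(val), len(test))  # prints: 31 5 11
```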

J. Evaluation Metrics

The performance of the SLAMSS-IFS model was assessed using multiple classifier evaluation metrics: overall accuracy, sensitivity, specificity, precision, weighted F1 score, and weighted Matthews correlation coefficient (MCC). The formulas for computing these metrics are provided in Table IV. We computed the full confusion matrices for each of our experimental studies, comparing the SLAMSS-IFS model with all reference methods. These matrices highlight the model’s strengths and weaknesses in classifying different sleep stages. In addition, we report several clinically relevant sleep metrics to assess the model’s practical utility: sleep efficiency, sleep onset latency, light sleep time, deep sleep time, REM sleep time, total sleep time, and the fractions of light, deep, and REM sleep. These sleep metrics are defined below.

TABLE IV.

Classifier Evaluation Metrics

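| Metric Name | Metric Definition¹ |
|---|---|
| Overall Accuracy | $\dfrac{\sum_{i=1}^{k} TP_i}{\sum_{i=1}^{k} (TP_i + FN_i)}$ |
| Sensitivity/Recall | $\sum_{i=1}^{k} \dfrac{TP_i}{k \times (TP_i + FN_i)}$ |
| Specificity | $\sum_{i=1}^{k} \dfrac{TN_i}{k \times (TN_i + FP_i)}$ |
| Precision | $\sum_{i=1}^{k} \dfrac{TP_i}{k \times (TP_i + FP_i)}$ |
| F1 Score | $\dfrac{TP}{TP + \frac{1}{2}(FP + FN)}$ |
| MCC | $\dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ |
| Weighted F1 Score | $\sum_{i=1}^{k} F1_i \times w_i$ |
| Weighted MCC | $\sum_{i=1}^{k} MCC_i \times w_i$ |

¹ TP: True Positive, FN: False Negative, TN: True Negative, FP: False Positive, w: Inverse Class Frequency, k: Number of Classes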
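The definitions in Table IV can be evaluated directly from a confusion matrix, as in the following sketch. The example matrix holds made-up counts, the function names are hypothetical, and normalizing the inverse class frequencies so that they sum to one is an assumption, since the table only states that w is the inverse class frequency.

```python
import math


def per_class_counts(cm):
    """Return (TP, FN, FP, TN) per class; cm[i][j] counts true class i predicted as j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    counts = []
    for i in range(k):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(cm[r][i] for r in range(k)) - tp
        tn = total - tp - fn - fp
        counts.append((tp, fn, fp, tn))
    return counts


def inverse_frequency_weights(cm):
    """Inverse class frequency, normalized to sum to one (normalization is assumed)."""
    inverse = [1.0 / sum(row) for row in cm]
    norm = sum(inverse)
    return [value / norm for value in inverse]


def evaluate(cm, weights):
    """Compute the Table IV metrics; assumes every class occurs and is predicted at least once."""
    counts = per_class_counts(cm)
    k = len(counts)
    f1 = [tp / (tp + 0.5 * (fp + fn)) for tp, fn, fp, tn in counts]
    mcc = [
        (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        for tp, fn, fp, tn in counts
    ]
    return {
        "overall_accuracy": sum(c[0] for c in counts) / sum(c[0] + c[1] for c in counts),
        "sensitivity": sum(tp / (k * (tp + fn)) for tp, fn, fp, tn in counts),
        "specificity": sum(tn / (k * (tn + fp)) for tp, fn, fp, tn in counts),
        "precision": sum(tp / (k * (tp + fp)) for tp, fn, fp, tn in counts),
        "weighted_f1": sum(f * w for f, w in zip(f1, weights)),
        "weighted_mcc": sum(m * w for m, w in zip(mcc, weights)),
    }


# Made-up four-class counts (rows: true wake, light, deep, REM).
cm = [
    [60, 30, 2, 8],
    [20, 300, 40, 40],
    [1, 30, 90, 4],
    [5, 40, 3, 120],
]
for name, value in evaluate(cm, inverse_frequency_weights(cm)).items():
    print(f"{name}: {value:.4f}")
```

Sensitivity, specificity, and precision are averaged with equal weight 1/k per class, so a rare class such as wake influences them as much as the dominant light-sleep class.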

1). Sleep Efficiency:

Sleep efficiency represents the percentage of time a person spends asleep while in bed. High sleep efficiency indicates that a person is spending less time awake during the night, which is generally considered a sign of good sleep quality. The formula for calculating sleep efficiency is as follows:

$$\text{Sleep Efficiency} = \frac{\text{Number of Sleep Epochs}}{\text{Total Number of Epochs}}, \tag{2}$$

where the “Number of Sleep Epochs” refers to the epochs where the individual is actually asleep, and the “Total Number of Epochs” includes both sleep and wake epochs during the time the person is in bed.

2). Sleep Onset Latency:

Sleep onset latency refers to the time it takes for an individual to transition from being awake to falling asleep after turning the lights off. It is an important indicator of the sleep initiation process and can be a useful measure in identifying potential sleep disorders, such as insomnia. To calculate sleep onset latency, we first identify the moment when the individual begins to fall asleep. This is determined by finding the first epoch in which sleep is detected. Once this initial epoch is identified, sleep onset latency is defined as the time between the initial epoch and the first three consecutive epochs of sleep. The use of three consecutive epochs ensures that this metric captures a stable transition.
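The wording above leaves some room for interpretation, so the following is only a sketch under a stated assumption: latency is measured from a chosen starting epoch (by default the first epoch of the recording) to the start of the first run of three consecutive sleep epochs. The `start_index` parameter is hypothetical and lets the starting point be moved, for example to the first detected sleep epoch.

```python
EPOCH_SECONDS = 30  # sleep is scored in 30-s epochs


def sleep_onset_latency(is_sleep, run_length=3, start_index=0):
    """Seconds from start_index to the first run of `run_length` consecutive sleep epochs.

    `is_sleep` is a sequence of truthy/falsy values, one per epoch.
    Returns None if no such run exists.
    """
    run = 0
    for i in range(start_index, len(is_sleep)):
        run = run + 1 if is_sleep[i] else 0
        if run == run_length:
            onset = i - run_length + 1  # first epoch of the stable run
            return (onset - start_index) * EPOCH_SECONDS
    return None


epochs = [0, 0, 1, 0, 0, 1, 1, 1, 1]
print(sleep_onset_latency(epochs))                 # prints: 150
print(sleep_onset_latency(epochs, start_index=2))  # prints: 90
```

In the example, the isolated sleep epoch at index 2 is ignored because it is not followed by two further sleep epochs; the stable run begins at index 5.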

3). Deep, Light, REM, and Total Sleep Time:

The durations of the light, deep, and REM sleep stages play a critical role in understanding the structure and quality of an individual’s sleep cycle. Each stage serves a distinct physiological function, contributing to various aspects of sleep health. Disruptions in the duration or balance of these stages may indicate underlying sleep disorders or poor sleep quality. To calculate the duration of each sleep stage, we sum the time spent in light sleep, deep sleep, and REM sleep across all epochs during the sleep period. Additionally, total sleep time (TST) is calculated as the total duration of sleep across all stages, encompassing the entire sleep session.

4). Deep, Light, and REM Fraction:

In addition to measuring the total duration of each sleep stage, it is also important to assess the proportion or fraction of time spent in each stage relative to the total sleep time. These fractions provide a deeper understanding of the individual’s sleep architecture and can offer clues about the quality of their sleep. The fractions of light sleep, deep sleep, and REM sleep are calculated by dividing the time spent in each stage by the TST. These values represent the percentage of the total sleep time allocated to each stage, with each stage contributing uniquely to restorative processes. A balanced distribution of these stages is often associated with healthy sleep, while imbalances may suggest disruptions in the sleep cycle that could affect overall sleep quality.
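The sleep efficiency of Eq. (2), the stage durations, total sleep time, and stage fractions can all be derived from a per-epoch hypnogram, as in this sketch. The stage label strings and the function name are assumptions for illustration; the 30-s epoch length follows the study design.

```python
from collections import Counter

EPOCH_MINUTES = 0.5  # one 30-s epoch
SLEEP_STAGES = ("light", "deep", "rem")  # hypothetical label strings


def sleep_summary(hypnogram):
    """Summarize a list of per-epoch stage labels ('wake', 'light', 'deep', 'rem')."""
    counts = Counter(hypnogram)
    n_sleep = sum(counts[stage] for stage in SLEEP_STAGES)
    summary = {
        "sleep_efficiency": n_sleep / len(hypnogram),
        "total_sleep_time_min": n_sleep * EPOCH_MINUTES,
    }
    for stage in SLEEP_STAGES:
        summary[f"{stage}_time_min"] = counts[stage] * EPOCH_MINUTES
        # Fractions are relative to total sleep time, not time in bed.
        summary[f"{stage}_fraction"] = counts[stage] / n_sleep if n_sleep else float("nan")
    return summary


hypnogram = ["wake"] * 2 + ["light"] * 4 + ["deep"] * 2 + ["rem"] * 2
for name, value in sleep_summary(hypnogram).items():
    print(name, value)
```

Applying the same function to the predicted and the reference hypnograms of a night gives the paired values needed to compare a model against the EEG-derived reference.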

By comparing our model with reference models using both classifier evaluation metrics and clinical sleep metrics, we can gain a broader understanding of its performance and relevance to real-world applications in sleep medicine.

IV. Results

Classifier evaluation metrics comparing the different four-class sleep staging models, SLAMSS-IFS (the proposed model), SeqSleepNet, Dilated CNN, U-Sleep, CNN+Transformer, and CNN+LSTM, are summarized in Table V. The confusion matrices, which illustrate the accuracy of each model’s predictions for individual sleep stages across the experimental studies and comparison methods using our Apple Watch dataset, are presented in Fig. 4. Additionally, the comparison of clinical sleep metrics for the four-class sleep staging models is displayed in Fig. 5, providing further insights into their practical performance. Table VI summarizes the classifier evaluation metrics for the ablation study that compare the SLAMSS variants: SLAMSS-IFS, SLAMSS-FS, SLAMSS-IS, SLAMSS-IF, SLAMSS-S, SLAMSS-F, SLAMSS-I, and baseline SLAMSS.

TABLE V.

Performance Comparison of SLAMSS-IFS with SeqSleepNet, Dilated CNN, U-Sleep, CNN+Transformer, and CNN+LSTM Baselines based on multiple classifier evaluation metrics, including overall accuracy, sensitivity, specificity, precision, weighted F1 score, and weighted Matthews correlation coefficient (MCC)

| Metric | SLAMSS-IFS | SeqSleepNet [43] | Dilated CNN [15] | U-Sleep [44] | CNN+Transformer [45] | CNN+LSTM [46] |
|---|---|---|---|---|---|---|
| Overall Accuracy | 0.7104 | 0.6915 | 0.6699 | 0.6460 | 0.6327 | 0.6120 |
| Sensitivity | 0.6998 | 0.6687 | 0.6497 | 0.6230 | 0.6039 | 0.5818 |
| Specificity | 0.8902 | 0.8818 | 0.8720 | 0.8618 | 0.8557 | 0.8470 |
| Precision | 0.6983 | 0.6890 | 0.6782 | 0.6460 | 0.6365 | 0.6354 |
| Weighted F1 Score | 0.7079 | 0.6883 | 0.6663 | 0.6415 | 0.6293 | 0.6105 |
| Weighted MCC | 0.5599 | 0.5300 | 0.4951 | 0.4573 | 0.4373 | 0.4086 |

Fig. 4.

Confusion matrices for different four-class automated sleep staging models for the Apple Watch: (a) SLAMSS-IFS, (b) SeqSleepNet, (c) Dilated CNN, (d) U-Sleep, (e) CNN+Transformer, (f) CNN+LSTM.

Fig. 5.

Comparison of clinical sleep metrics from the Apple Watch based on four-class sleep staging using SLAMSS-IFS, SeqSleepNet, Dilated CNN, U-Sleep, CNN+Transformer, and CNN+LSTM. The red dotted line corresponds to the assumed ground truth (EEG) value of each metric based on the Dreem Headband.

TABLE VI.

Ablation study of the SLAMSS-IFS framework, evaluating the impact of an intra-epoch LSTM (I), a frequency variable (F), and a skip connection (S) on classification performance. Metrics include overall accuracy, sensitivity, specificity, precision, weighted F1 score, and weighted Matthews correlation coefficient (MCC)

| Model | Intra-epoch Learning LSTM | Frequency Variable | Skip Connection | Overall Accuracy | Sensitivity | Specificity | Precision | Weighted F1 Score | Weighted MCC |
|---|---|---|---|---|---|---|---|---|---|
| SLAMSS | - | - | - | 0.5050 | 0.4396 | 0.7952 | 0.4676 | 0.4559 | 0.2012 |
| SLAMSS-I | ✓ | - | - | 0.5213 | 0.4686 | 0.8090 | 0.5134 | 0.5065 | 0.2589 |
| SLAMSS-F | - | ✓ | - | 0.5192 | 0.4427 | 0.7960 | 0.5064 | 0.4650 | 0.2168 |
| SLAMSS-S | - | - | ✓ | 0.6910 | 0.6909 | 0.8859 | 0.6696 | 0.6886 | 0.5381 |
| SLAMSS-IF | ✓ | ✓ | - | 0.5335 | 0.4912 | 0.8141 | 0.5331 | 0.5151 | 0.2774 |
| SLAMSS-IS | ✓ | - | ✓ | 0.7002 | 0.6796 | 0.8851 | 0.6962 | 0.6972 | 0.5424 |
| SLAMSS-FS | - | ✓ | ✓ | 0.7015 | 0.6843 | 0.8865 | 0.6898 | 0.6984 | 0.5460 |
| SLAMSS-IFS | ✓ | ✓ | ✓ | 0.7104 | 0.6998 | 0.8902 | 0.6983 | 0.7079 | 0.5599 |

A. SLAMSS-IFS vs. SeqSleepNet

SLAMSS-IFS achieved higher overall accuracy (0.7104 vs. 0.6915) and sensitivity (0.6998 vs. 0.6687) than SeqSleepNet, with improvements across all other metrics. Confusion matrices indicate better detection of deep (76.4% vs. 74.8%) and REM stages (70.4% vs. 66.5%), and fewer wake-to-light misclassifications. Clinical metrics show SLAMSS-IFS estimates more closely matched EEG-derived reference values, particularly for light time and deep time.

B. SLAMSS-IFS vs. U-Sleep vs. Dilated CNN

SLAMSS-IFS outperformed both U-Sleep (overall accuracy: 0.6460) and Dilated CNN (overall accuracy: 0.6699) across all metrics. In particular, REM detection was markedly higher for SLAMSS-IFS (70.4%) compared with U-Sleep (52.6%) and Dilated CNN (58.2%). Clinical metric comparisons further indicate that SLAMSS-IFS more closely matched EEG-derived reference values for light, deep, and REM sleep times, whereas both Dilated CNN and U-Sleep tended to overestimate light time, Dilated CNN overestimated deep time, and both underestimated REM time. Dilated CNN, however, achieved better sleep onset latency.

C. SLAMSS-IFS vs. CNN+Transformer vs. CNN+LSTM

Compared with CNN+Transformer (overall accuracy: 0.6327) and CNN+LSTM (overall accuracy: 0.6120) baselines, SLAMSS-IFS achieved higher accuracy, sensitivity, and more balanced stage classification. REM accuracy was 70.4% for SLAMSS-IFS versus 53.9% for CNN+Transformer and 53.0% for CNN+LSTM, while the corresponding numbers for deep accuracy were 76.4%, 60.3%, and 54.0% respectively. Clinical metrics further showed SLAMSS-IFS’s closer alignment with EEG-derived reference values.

D. Ablation Study Results

The ablation study evaluated the contribution of each SLAMSS-IFS component (an intra-epoch learning LSTM, a frequency variable, and a skip connection) to overall performance; the results are summarized in Table VI. The baseline SLAMSS model, without the intra-epoch LSTM, frequency variable, or skip connections, achieved the lowest performance (overall accuracy = 0.5050). Adding individual components improved performance to varying degrees: an intra-epoch LSTM (SLAMSS-I) and a frequency variable (SLAMSS-F) increased overall accuracy to 0.5213 and 0.5192, respectively, while a skip connection alone (SLAMSS-S) yielded a substantially higher accuracy of 0.6910. Combining components further improved overall accuracy relative to the corresponding single-component variants. SLAMSS-IS (0.7002) and SLAMSS-FS (0.7015) exceeded every single-component variant, with SLAMSS-FS achieving the highest overall accuracy among the two-component models, whereas SLAMSS-IF (0.5335), which lacks skip connections, improved on SLAMSS-I and SLAMSS-F but remained well below SLAMSS-S. The complete model, SLAMSS-IFS, integrating all three components, achieved the best results across all metrics (overall accuracy = 0.7104, sensitivity = 0.6998, specificity = 0.8902, precision = 0.6983, weighted F1 = 0.7079, weighted MCC = 0.5599), indicating that all three components contribute to the performance gains.
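One way to read Table VI is to average, for each component, the change in overall accuracy when that component is switched on while the other two are held fixed. The sketch below does this arithmetic using the accuracies reported in Table VI; the averaging scheme itself is an illustrative choice and is not an analysis performed in this study.

```python
from itertools import product

# Overall accuracy from Table VI, keyed by which of (I, F, S) are enabled.
ACCURACY = {
    (0, 0, 0): 0.5050,  # SLAMSS
    (1, 0, 0): 0.5213,  # SLAMSS-I
    (0, 1, 0): 0.5192,  # SLAMSS-F
    (0, 0, 1): 0.6910,  # SLAMSS-S
    (1, 1, 0): 0.5335,  # SLAMSS-IF
    (1, 0, 1): 0.7002,  # SLAMSS-IS
    (0, 1, 1): 0.7015,  # SLAMSS-FS
    (1, 1, 1): 0.7104,  # SLAMSS-IFS
}


def mean_gain(component_index):
    """Average accuracy change from enabling one component, over the four
    settings of the other two components."""
    gains = []
    for flags in product((0, 1), repeat=3):
        if flags[component_index] == 0:
            enabled = list(flags)
            enabled[component_index] = 1
            gains.append(ACCURACY[tuple(enabled)] - ACCURACY[flags])
    return sum(gains) / len(gains)


for index, name in enumerate(("I", "F", "S")):
    print(name, round(mean_gain(index), 4))
```

With these numbers, enabling skip connections raises overall accuracy by roughly 0.18 on average, while the intra-epoch LSTM and the frequency variable each add about 0.01.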

E. Results of Additional Analyses

To assess model robustness, we conducted five-fold cross-validation, where the proposed SLAMSS-IFS achieved a mean accuracy of 69.76 ± 1.31% across folds, indicating consistent performance with low variability. We further evaluated SLAMSS-IFS on the more challenging five-class sleep staging task, comparing it with SeqSleepNet, the strongest baseline method. As shown in Table VII, SLAMSS-IFS achieved superior performance across all metrics and sleep stages, with an overall accuracy of 61.59%, an improvement of 4.58 percentage points over SeqSleepNet (57.01%). The largest gain was observed in N2 detection (+8.8 percentage points), highlighting improved differentiation of light sleep from adjacent stages (wake, N1, N3, and REM). Smaller but consistent improvements in N1, REM, N3, and wake further suggest that SLAMSS-IFS provides greater robustness and potential generalizability to more fine-grained multi-class sleep staging tasks.

TABLE VII.

Performance Comparison for five-class sleep staging

| Model | Overall Accuracy | Wake | N1 | N2 | N3 | REM |
|---|---|---|---|---|---|---|
| SeqSleepNet | 57.01% | 63.6% | 36.8% | 41.3% | 75.9% | 68.8% |
| SLAMSS-IFS | 61.59% | 66.1% | 38.7% | 50.1% | 76.5% | 71.6% |

V. Discussion

To the best of our knowledge, we are the first to generate a public multi-night sleep dataset using a combination of a smartwatch (recording IHR and accelerometer data) and an EEG-based headband (providing sleep hypnogram data). While we acknowledge that large datasets available through the National Sleep Research Resource (NSRR), such as the Multi-Ethnic Study of Atherosclerosis (MESA) dataset [48], [49] and SleepAccel [13], [48], have collected both PSG and actigraphy/smartwatch data, these studies primarily include one-night PSG recordings alongside a week of wrist wearable data, with only a single night of overlap. In contrast, our dataset comprises multi-night smartwatch recordings along with corresponding EEG-based sleep stage labels, providing a more continuous and detailed perspective on sleep patterns over multiple nights. This distinction makes our dataset a unique contribution to the field of sleep research.

PSG remains the gold standard for sleep staging due to its comprehensive analysis of various physiological signals. However, its complexity and high cost make it impractical for long-term or home-based monitoring. In this study, we utilized the EEG-based Dreem Headband as the reference device to train our SLAMSS model. As demonstrated by Arnal et al. [40], the Dreem Headband provides sleep staging results comparable to PSG, offering a more scalable and unobtrusive alternative. However, this study has two key potential limitations. First, the Dreem Headband primarily relies on EEG data rather than the full range of signals captured by PSG, which may impact its accuracy in sleep stage classification. Second, some study participants experienced occasional loss of contact of the headband from their forehead during recordings, potentially degrading EEG signal quality and affecting sleep stage annotations. These challenges highlight the need for further hardware advancements in wearable sleep monitoring technology to enhance its reliability for continuous, multi-night sleep assessment.

We note that the Apple Watch currently offers a built-in sleep assessment feature classifying sleep stages into four categories: wake, REM, core (N1/N2), and deep. Benchmarking SLAMSS-IFS against Apple’s built-in staging method could have greatly strengthened this evaluation. However, data acquisition for this study was completed before Apple released its sleep staging feature. Therefore, a direct comparison could not be included in our analyses. We will address this gap in a future sleep study in our lab.

In this study, we observed that the accuracy of wake stage detection was relatively low, around 60%. Generally, wake stages should be easier to predict than other sleep stages (light, deep, and REM) since they are typically characterized by distinct physiological patterns, such as increased movement and elevated heart rate. However, this unexpectedly low accuracy prompted us to investigate the underlying causes. Interestingly, studies [12], [13], [37] that used Apple Watch data from the SleepAccel dataset and relied solely on basic inputs, heart rate and accelerometer data, reported similar accuracy levels, ranging from 60% to 66%. This suggests that the challenge in accurately classifying wake stages may not be specific to our model but rather an inherent limitation of the available sensor data. Additionally, we found that wake stage samples in both our dataset and the SleepAccel dataset were significantly underrepresented compared to other sleep stages, including deep sleep. This class imbalance likely contributes to the lower accuracy, as models generally struggle with underrepresented classes. The model’s tendency to favor light sleep when distinguishing between similar stages is likely due to two factors: its large sample size, which can bias decision boundaries toward this class, and the substantial physiological overlap between light sleep and adjacent stages (wake, deep, and REM). In the five-class setting, this ambiguity contributes to low N1 and N2 accuracy, while wake detection slightly improves, suggesting that IHR-based features have limited capacity to reliably separate these lighter sleep stages.

To further investigate this issue, we conducted an experiment in which we separated wake periods occurring before and after sleep from those occurring during sleep. Our results revealed a notable difference: wake stages at the beginning and end of sleep were classified with approximately 80% accuracy, while wake stages occurring during sleep had a lower accuracy of around 15%. This suggests that wake periods interspersed within sleep exhibit more variability in physiological signals, potentially due to transient arousals or movement artifacts that share similarities with sleep-like patterns. These findings provide valuable insights into the characteristics of wake detection using smartwatch-derived data. While wake classification during sleep is inherently more complex, incorporating additional features, improving class balance, or refining modeling techniques could further enhance performance and better capture the nuances of different wake states.
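The exact rule used to delimit these periods is not given here. The sketch below assumes that wake before the first and after the last reference sleep epoch counts as "edge" wake and that all other wake epochs are wake during sleep; the label strings and the function name are hypothetical.

```python
def split_wake_accuracy(true_stages, predicted_stages, wake="wake"):
    """Return (edge_wake_accuracy, intra_sleep_wake_accuracy).

    Edge wake: reference wake epochs before the first or after the last
    reference sleep epoch. Intra-sleep wake: all other reference wake epochs.
    """
    sleep_idx = [i for i, stage in enumerate(true_stages) if stage != wake]
    if not sleep_idx:
        raise ValueError("recording contains no sleep epochs")
    first_sleep, last_sleep = sleep_idx[0], sleep_idx[-1]

    edge, inner = [], []
    for i, (truth, pred) in enumerate(zip(true_stages, predicted_stages)):
        if truth != wake:
            continue
        bucket = inner if first_sleep < i < last_sleep else edge
        bucket.append(pred == wake)

    def accuracy(hits):
        return sum(hits) / len(hits) if hits else float("nan")

    return accuracy(edge), accuracy(inner)


truth = ["wake", "wake", "light", "light", "wake", "light", "rem", "wake", "wake"]
preds = ["wake", "wake", "light", "light", "light", "light", "rem", "wake", "light"]
print(split_wake_accuracy(truth, preds))  # prints: (0.75, 0.0)
```

Reporting the two accuracies separately shows whether a low overall wake accuracy is driven by brief awakenings inside the sleep period rather than by the wake periods surrounding it.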

VI. Conclusion

In this study, we recruited 47 healthy adult participants who recorded their sleep for up to seven consecutive nights using an Apple Watch Series 6 and a Dreem 2 Headband. We developed SLAMSS-IFS, an advanced version of our previous SLAMSS model, for four-class sleep staging using IHR and accelerometry signals from these wearable devices. Key innovations in the model, including an intra-epoch learning LSTM, frequency information incorporation, and skip connections, contribute to substantial performance improvements over other SLAMSS variants and other state-of-the-art models. Our results show that SLAMSS-IFS outperforms competing models in overall accuracy, sensitivity, specificity, precision, weighted F1 score, weighted MCC, and most clinical sleep metrics.

This work paves the way for more scalable and accessible routine sleep tracking solutions, particularly in clinical settings and personal health monitoring. To further enhance reproducibility and scalability, we also released BIDSleep, an iOS application that enables multi-night sleep data collection and data export from Apple Watches. In future work, we plan to refine our model by addressing the limitations in wake stage classification and to expand the dataset to improve the model’s robustness and generalizability. Additionally, integrating other physiological signals that can be obtained from wearable devices, such as blood oxygen level, respiratory rate, and temperature, could further enhance the model’s predictive power and clinical applicability. Overall, this study highlights the significant potential of wrist-worn wearable devices combined with advanced deep learning techniques for long-term, unobtrusive, and routine sleep monitoring.

Acknowledgment

This work was supported in part by the National Institutes of Health grants R21AG068890, R01AG082354, P30AG073107, and the Dreem Jury’s Prize. We would like to sincerely thank Drs. Shaun Purcell and Richa Saxena for their advice and valuable support at various points of this study. We also extend our gratitude to all the anonymous participants for their time and effort in recording their sleep data, which made this research possible.

Contributor Information

Tzu-An Song, Department of Biomedical Engineering at the University of Massachusetts Amherst, Amherst, MA 01003, USA.

Yubo Zhang, Department of Biomedical Engineering at the University of Massachusetts Amherst, Amherst, MA 01003, USA.

Ziyuan Zhou, Department of Biomedical Engineering at the University of Massachusetts Amherst, Amherst, MA 01003, USA.

Luke Hou, Department of Computer Science at the University of Massachusetts Amherst.

Masoud Malekzadeh, Department of Electrical and Computer Engineering at the University of Massachusetts Amherst.

Aida Behzad, Department of Biomedical Engineering at the University of Massachusetts Lowell, Lowell, MA 01854, USA.

Joyita Dutta, Department of Biomedical Engineering at the University of Massachusetts Amherst, Amherst, MA 01003, USA.

Code Availability

The code for the SLAMSS-IFS model utilized in this study is publicly accessible on GitHub at the following link: https://www.github.com/BIDSLabUMass/SLAMSS-IFS.

Data & App Availability

The IHR and accelerometry data, together with the sleep stage labels used to train the SLAMSS model, are publicly available via a download link on our GitHub page. Additionally, our iOS application for multi-night sleep monitoring, BIDSleep, is available on the Apple App Store.

References

  • [1] Malhotra A and Loscalzo J, “Sleep and cardiovascular disease: an overview,” Prog. Cardiovasc. Dis., vol. 51, pp. 279–284, Jan. 2009.
  • [2] McHill AW and Wright KP Jr, “Role of sleep and circadian disruption on energy expenditure and in metabolic predisposition to human obesity and metabolic disease,” Obes. Rev., vol. 18 Suppl 1, pp. 15–24, Feb. 2017.
  • [3] Shi L, Chen S-J, Ma M-Y, Bao Y-P, Han Y, Wang Y-M, Shi J, Vitiello MV, and Lu L, “Sleep disturbances increase the risk of dementia: A systematic review and meta-analysis,” Sleep Med. Rev., vol. 40, pp. 4–16, Aug. 2018.
  • [4] Freeman D, Sheaves B, Waite F, Harvey AG, and Harrison PJ, “Sleep disturbance and psychiatric disorders,” Lancet Psychiatry, vol. 7, pp. 628–637, July 2020.
  • [5] Moser D, Anderer P, Gruber G, Parapatics S, Loretz E, Boeck M, Kloesch G, Heller E, Schmidt A, Danker-Hopfe H, Saletu B, Zeitlhofer J, and Dorffner G, “Sleep classification according to AASM and Rechtschaffen & Kales: effects on sleep scoring parameters,” Sleep, vol. 32, pp. 139–149, Feb. 2009.
  • [6] Kim RD, Kapur VK, Redline-Bruch J, Rueschman M, Auckley DH, Benca RM, Foldvary-Schafer NR, Iber C, Zee PC, Rosen CL, Redline S, and Ramsey SD, “An economic evaluation of home versus laboratory-based diagnosis of obstructive sleep apnea,” Sleep, vol. 38, pp. 1027–1037, July 2015.
  • [7] Shelgikar AV, Anderson PF, and Stephens MR, “Sleep tracking, wearable technology, and opportunities for research and clinical care,” Chest, vol. 150, pp. 732–743, Sept. 2016.
  • [8] Birrer V, Elgendi M, Lambercy O, and Menon C, “Evaluating reliability in wearable devices for sleep staging,” NPJ Digit. Med., vol. 7, p. 74, Mar. 2024.
  • [9] Chinoy ED, Cuellar JA, Huwa KE, Jameson JT, Watson CH, Bessman SC, Hirsch DA, Cooper AD, Drummond SPA, and Markwald RR, “Performance of seven consumer sleep-tracking devices compared with polysomnography,” Sleep, vol. 44, p. zsaa291, Dec. 2020.
  • [10] Lee T, Cho Y, Cha KS, Jung J, Cho J, Kim H, Kim D, Hong J, Lee D, Keum M, Kushida CA, Yoon I-Y, and Kim J-W, “Accuracy of 11 wearable, nearable, and airable consumer sleep trackers: Prospective multicenter validation study,” JMIR MHealth UHealth, vol. 11, p. e50983, Nov. 2023.
  • [11] Kainec KA, Caccavaro J, Barnes M, Hoff C, Berlin A, and Spencer RMC, “Evaluating accuracy in five commercial sleep-tracking devices compared to research-grade actigraphy and polysomnography,” Sensors (Basel), vol. 24, Jan. 2024.
  • [12] Song T-A, Chowdhury SR, Malekzadeh M, Harrison S, Hoge TB, Redline S, Stone KL, Saxena R, Purcell SM, and Dutta J, “AI-driven sleep staging from actigraphy and heart rate,” PLOS ONE, vol. 18, pp. 1–29, May 2023.
  • [13] Walch O, Huang Y, Forger D, and Goldstein C, “Sleep stage prediction with raw acceleration and photoplethysmography heart rate data derived from a consumer wearable device,” Sleep, vol. 42, Dec. 2019.
  • [14] Sharan RV, Takeuchi H, Kishi A, and Yamamoto Y, “Macro-sleep staging with ECG-derived instantaneous heart rate and respiration signals and multi-input 1-D CNN–BiGRU,” IEEE Transactions on Instrumentation and Measurement, vol. 73, pp. 1–12, 2024.
  • [15] Sridhar N, Shoeb A, Stephens P, Kharbouch A, Shimol DB, Burkart J, Ghoreyshi A, and Myers L, “Deep learning for automated sleep staging using instantaneous heart rate,” NPJ Digit. Med., vol. 3, p. 106, 2020.
  • [16] Pathinarupothi RK, Vinaykumar R, Rangan E, Gopalakrishnan E, and Soman KP, “Instantaneous heart rate as a robust feature for sleep apnea severity detection using deep learning,” in 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pp. 293–296, 2017.
  • [17] Huang N and Lu H, “Association between instantaneous heart rate sequence during the awake period and cardiovascular events: a study based on sleep heart health study,” Minerva Cardiology and Angiology, vol. 72, no. 5, pp. 465–476, 2024.
  • [18] Sekkal RN, Bereksi-Reguig F, Ruiz-Fernandez D, Dib N, and Sekkal S, “Automatic sleep stage classification: From classical machine learning methods to deep learning,” Biomedical Signal Processing and Control, vol. 77, p. 103751, 2022.
  • [19] Mousavi S, Afghah F, and Acharya UR, “SleepEEGNet: Automated sleep stage scoring with sequence to sequence deep learning approach,” PLOS ONE, vol. 14, no. 5, p. e0216456, 2019.
  • [20] Fiorillo L, Puiatti A, Papandrea M, Ratti P-L, Favaro P, Roth C, Bargiotas P, Bassetti CL, and Faraci FD, “Automated sleep scoring: A review of the latest approaches,” Sleep Medicine Reviews, vol. 48, p. 101204, 2019.
  • [21] Li C, Qi Y, Ding X, Zhao J, Sang T, and Lee M, “A deep learning method approach for sleep stage classification with EEG spectrogram,” International Journal of Environmental Research and Public Health, vol. 19, no. 10, p. 6322, 2022.
  • [22] Radha M, Fonseca P, Moreau A, Ross M, Cerny A, Anderer P, Long X, and Aarts RM, “Sleep stage classification from heart-rate variability using long short-term memory neural networks,” Sci. Rep., vol. 9, p. 14149, Oct. 2019.
  • [23] Boe AJ, McGee Koch LL, O’Brien MK, Shawen N, Rogers JA, Lieber RL, Reid KJ, Zee PC, and Jayaraman A, “Automating sleep stage classification using wireless, wearable sensors,” NPJ Digit. Med., vol. 2, p. 131, 2019.
  • [24] Sun H, Ganglberger W, Panneerselvam E, Leone MJ, Quadri SA, Goparaju B, Tesh RA, Akeju O, Thomas RJ, and Westover MB, “Sleep staging from electrocardiography and respiration with deep learning,” Sleep, vol. 43, July 2020.
  • [25] Fonseca P, van Gilst MM, Radha M, Ross M, Moreau A, Cerny A, Anderer P, Long X, van Dijk JP, and Overeem S, “Automatic sleep staging using heart rate variability, body movements, and recurrent neural networks in a sleep disordered population,” Sleep, vol. 43, Sept. 2020.
  • [26] Werth J, Radha M, Andriessen P, Aarts RM, and Long X, “Deep learning approach for ECG-based automatic sleep state classification in preterm infants,” Biomedical Signal Processing and Control, vol. 56, p. 101663, 2020.
  • [27] Ganglberger W, Krishnamurthy PV, Quadri SA, Tesh RA, Bucklin AA, Adra N, Da Silva Cardoso M, Leone MJ, Hemmige A, Rajan S, Panneerselvam E, Paixao L, Higgins J, Ayub MA, Shao Y-P, Coughlin B, Sun H, Ye EM, Cash SS, Thompson BT, Akeju O, Kuller D, Thomas RJ, and Westover MB, “Sleep staging in the ICU with heart rate variability and breathing signals. An exploratory cross-sectional study using deep neural networks,” Frontiers in Network Physiology, vol. 3, 2023.
  • [28] Mathunjwa BM, Lin Y-T, Lin C-H, Abbod MF, Sadrawi M, and Shieh J-S, “Automatic IHR-based sleep stage detection using features of residual neural network,” Biomedical Signal Processing and Control, vol. 85, p. 105070, 2023.
  • [29] Jie Chen Y, Siting Z, Kishan K, and Patanaik A, “Instantaneous heart rate based sleep staging using deep learning models as a convenient alternative to polysomnography,” ERJ Open Research, vol. 7, no. suppl 7, 2021.
  • [30] Krauss D, Richer R, Küderle A, Jukic J, German A, Leutheuser H, Regensburger M, Winkler J, and Eskofier BM, “Incorporating respiratory signals for machine learning-based multimodal sleep stage classification: a large-scale benchmark study with actigraphy and heart rate variability,” Sleep, p. zsaf091, Apr. 2025.
  • [31] Kotzen K, Charlton PH, Salabi S, Amar L, Landesberg A, and Behar JA, “SleepPPG-Net: A deep learning algorithm for robust sleep staging from continuous photoplethysmography,” IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 2, pp. 924–932, 2023.
  • [32] Radha M, Fonseca P, Moreau A, Ross M, Cerny A, Anderer P, Long X, and Aarts R, “A deep transfer learning approach for wearable sleep stage classification with photoplethysmography,” NPJ Digit. Med., vol. 4, pp. 135:1–11, Sept. 2021.
  • [33] Huttunen R, Leppänen T, Duce B, Arnardottir E, Nikkonen S, Myllymaa S, Toyras J, and Korkalainen H, “A comparison of signal combinations for deep learning-based simultaneous sleep staging and respiratory event detection,” IEEE Transactions on Biomedical Engineering, vol. PP, pp. 1–12, Nov. 2022.
  • [34] Attia S, Hershkovich RS, Tabakhov A, Ang A, Haimov S, Tauman R, and Behar JA, “SleepPPG-Net2: Deep learning generalization for sleep staging from photoplethysmography,” arXiv preprint arXiv:2404.06869, 2024.
  • [35] Motin MA, Karmakar C, Palaniswami M, Penzel T, and Kumar D, “Multi-stage sleep classification using photoplethysmographic sensor,” Royal Society Open Science, vol. 10, Apr. 2023.
  • [36] Beattie Z, Oyang Y, Statan A, Ghoreyshi A, Pantelopoulos A, Russell A, and Heneghan C, “Estimation of sleep stages in a healthy adult population from optical plethysmography and accelerometer signals,” Physiol. Meas., vol. 38, pp. 1968–1979, Oct. 2017.
  • [37] Kudo S, Chen Z, Ono N, Altaf-Ul-Amin M, Kanaya S, and Huang M, “Deep learning-based sleep staging with acceleration and heart rate data of a consumer wearable device,” in 2022 IEEE 4th Global Conference on Life Sciences and Technologies (LifeTech), pp. 305–307, 2022.
  • [38] Chih H-Y, Ahmed T, Chiu AP, Liu Y-T, Kuo H-F, Yang AC, and Lien D-H, “Multitask learning for automated sleep staging and wearable technology integration,” Advanced Intelligent Systems, vol. 6, no. 1, p. 2300270, 2024.
  • [39] Zhang X, Kou W, Chang EI-C, Gao H, Fan Y, and Xu Y, “Sleep stage classification based on multi-level feature learning and recurrent neural networks via wearable device,” Computers in Biology and Medicine, vol. 103, pp. 71–81, 2018.
  • [40] Arnal P, Thorey V, Debellemaniere E, Ballard M, Hernandez A, Guillot A, Jourde H, Harris M, Guillard M, Beers P, Chennaoui M, and Sauvet F, “The Dreem Headband compared to polysomnography for EEG signal acquisition and sleep staging,” Sleep, vol. 43, May 2020.
  • [41].Supratak A, Dong H, Wu C, and Guo Y, “DeepSleepNet: A model for automatic sleep stage scoring based on raw single-channel EEG,” IEEE Trans Neural Syst Rehabil Eng, vol. 25, pp. 1998–2008, 11 2017. [DOI] [PubMed] [Google Scholar]
  • [42].Bahdanau D, Cho K, and Bengio Y, “Neural machine translation by jointly learning to align and translate,” arXiv:1409.0473, 2016. [Google Scholar]
  • [43].Phan H, Andreotti F, Cooray N, Chen OY, and De Vos M, “SeqSleepNet: End-to-end hierarchical recurrent neural network for sequence-to-sequence automatic sleep staging,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, p. 400–410, Mar. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [44].Perslev M, Darkner S, Kempfner L, Nikolic M, Jennum P, and Igel C, “U-Sleep: resilient high-frequency sleep staging,” npj Digital Medicine, vol. 4, p. 72, 04 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [45].Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, and Polosukhin I, “Attention is all you need,” 2023.
  • [46].Hochreiter S and Schmidhuber J, “Long short-term memory,” Neural Comput., vol. 9, p. 1735–1780, Nov. 1997. [DOI] [PubMed] [Google Scholar]
  • [47].Ho Y and Wookey S, “The real world-weight cross-entropy loss function: Modeling the costs of mislabeling,” IEEE Access, vol. 8, pp. 4806–4813, 2020. [Google Scholar]
  • [48].Zhang GQ, Cui L, Mueller R, Tao S, Kim M, Rueschman M, Mariani S, Mobley D, and Redline S, “The National Sleep Research Resource: towards a sleep data commons,” J Am Med Inform Assoc, vol. 25, pp. 1351–1358, 10 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [49].Chen X, Wang R, Zee P, Lutsey PL, Javaheri S, Alcá ntara C, Jackson CL, Williams MA, and Redline S, “Racial/ethnic differences in sleep disturbances: The Multi-Ethnic Study of Atherosclerosis (MESA),” Sleep, vol. 38, pp. 877–888, Jun 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

Data Availability Statement

The code for the SLAMSS-IFS model used in this study is publicly available on GitHub at https://www.github.com/BIDSLabUMass/SLAMSS-IFS.

The IHR and accelerometry data, together with the sleep stage labels used to train the SLAMSS model, are publicly available via a download link on our GitHub page. Additionally, our iOS application for multi-night sleep monitoring, BIDSleep, is available on the Apple App Store.
