Abstract
Sleep–wake scoring is a time-consuming, tedious, but essential component of clinical and preclinical sleep research. Sleep scoring is even more laborious and challenging in rodents due to the smaller EEG amplitude differences between states and the rapid state transitions that necessitate scoring in shorter epochs. Although many automated rodent sleep scoring methods exist, they do not perform as well when scoring new datasets, especially those involving changes in the EEG/electromyographic (EMG) profile. Thus, manual scoring by expert scorers remains the gold standard. Here we take a different approach to this problem by using a neural network to accelerate the scoring of expert scorers. Sleep-Deep-Learner creates a bespoke deep convolutional neural network model for each individual electroencephalographic (EEG) or local-field-potential (LFP) record via transfer learning of GoogLeNet, by learning from a small subset of manual scores of each EEG/LFP record provided by the end-user. Sleep-Deep-Learner then automates scoring of the remainder of the EEG/LFP record. A novel REM sleep scoring correction procedure further enhanced accuracy. Sleep-Deep-Learner reliably scores EEG and LFP data and retains sleep–wake architecture in wild-type mice, in sleep induced by the hypnotic zolpidem, in a mouse model of Alzheimer’s disease, and in a genetic knock-down study, when compared to manual scoring. Sleep-Deep-Learner reduced manual scoring time to one-twelfth. Since Sleep-Deep-Learner uses transfer learning on each independent recording, it is not biased by previously scored existing datasets. Thus, we find Sleep-Deep-Learner performs well when used on signals altered by a drug, disease model, or genetic modification.
Keywords: automated sleep–wake scoring, deep learning, transfer learning, translational sleep research, basic sleep research, in vivo pharmacology, hypnotics, Alzheimer’s disease model
Statement of Significance
Sleep medicine is often critically advanced by translational research based on in vivo electrophysiologic mouse data. A necessary but time-consuming step in this field is scoring epochs of recordings into wakefulness, non-rapid-eye-movement sleep, and rapid-eye-movement sleep. Despite efforts to automate this step, manual scoring remains the gold standard, since automatic methods handle poorly any data that are not sufficiently similar to the data used during development. Here, we describe a novel automated sleep-scoring method that retrains a deep convolutional neural network capable of computer vision to score sleep–wake patterns after learning from a small set of manual scores within each record. This avoids biasing the model to expect new data to resemble the training sets of previous records.
Assigning behavioral states to wakefulness, non-rapid-eye-movement (NREM) sleep, or rapid-eye-movement (REM) sleep has been an essential component of clinical and preclinical sleep research ever since the first all-night recordings of sleep in humans [1]. So-called sleep–wake scoring is a necessary step in the workflow from electroencephalographic (EEG) or local-field-potential (LFP) data collection to experimental observations in basic and translational sleep research, allowing analysis of the amount of each sleep state, sleep architecture, and alterations in specific sleep oscillations. However, manual sleep–wake scoring requires concentration and is time-consuming, labor-intensive, and tedious. New expert scorers require training from experienced scorers. Scoring data from rodents presents unique challenges compared with scoring data from humans, due to the smaller separation of EEG amplitudes between states [2] and the need to score shorter epochs, conventionally 4 or 10 seconds [3, 4], in contrast to the 30-second epochs standard in human data [5]. Moreover, for animals there exists no professional standard handbook for reference of the kind used to score data from humans [6, 7]. Instead, criteria for scoring rodent data come from the methods sections of research articles, typically described in a single paragraph [3]. Since subsequent automated analytic methods can be performed rapidly on EEG/LFP records once the sleep–wake record is scored, the manual scoring step can be considered a bottleneck. These considerable limitations of manual scoring have created a strong impetus to automate sleep–wake scoring, with efforts dating back decades [8]. Older approaches may work relatively well when the criteria for wake, REM sleep, and NREM sleep are clearly defined, as is typical of wild-type (WT) mouse data. However, these methods typically work less well in studies that require scoring mouse EEG/LFP/electromyographic (EMG) data involving an experimental intervention, such as a genetic modification, disease model, or drug treatment. These interventions may alter the quality of sleep, as defined by the EEG waveforms, thereby altering the defining features of each sleep–wake state. Thus, manual scoring by human scorers has remained the gold standard in the field [8].
Recently, machine learning-based algorithms have been developed to address the problem of automating sleep–wake scoring. In the last 4 years, at least seven machine learning-based sleep–wake scoring algorithms have been published. These algorithms work by creating a novel machine-learning model that is trained on a dataset, and a major strategy used to increase performance is to provide the largest dataset available for training the network. These models are then required to score new data based on what they have learned from the data with which they are familiar [9–16]. However, one concern with this approach is that networks often perform poorly when asked to classify new data that they have not encountered before [17]. This could be the case with data from a different laboratory or from unfamiliar experimental treatments, such as alterations in EEG/EMG produced by pharmacological agents, genetic manipulations, and/or disease models.
We decided to take a novel approach to this problem. Instead of building a new machine learning model trained on data and scores from our laboratory, we employ transfer learning to considerably speed up the scoring of data by an expert scorer. This is a deep-learning-based technique that leverages a highly sophisticated pretrained deep neural network already capable of computer vision, and retrains it at a superficial level to score images that represent NREM sleep, REM sleep, and wakefulness. For this, we utilized GoogLeNet, a pretrained deep neural network that won the 2014 ImageNet challenge [18]. Our new method, which we term Sleep-Deep-Learner, creates a newly tailored model for each individual file via transfer learning of GoogLeNet, which is retrained on a small set of manual scores from that file provided by the end-user. Since learning is based on a small subset of manual scores performed on each recording of a mouse’s EEG/LFP/EMG signal, Sleep-Deep-Learner automation remains faithful to the learned EEG/LFP/EMG pattern of each mouse. Sleep-Deep-Learner is not familiar with older existing mouse EEG/LFP/EMG data from our, or any other, labs. Thus, when a new investigator uses Sleep-Deep-Learner, it learns to score from them. Similar approaches have been used with human data and wearable technology [19, 20]. However, as far as we are aware, this approach has not been applied to data from mice, the most commonly used animal model in preclinical sleep research. It is important to note that no automated sleep-scoring method works perfectly on every file. Scoring REM sleep, in particular, appears to be one of the major challenges for automated scoring methods. An automatic indicator of poorly scored REM sleep would therefore be valuable in determining whether to use a particular set of automated scores. With Sleep-Deep-Learner, we developed a way to identify the files yielding the lowest REM sleep reliability, which provides the user with the option to either exclude or manually score those files. No other method we know of does this. We aim to make this freely available to the sleep research community to accelerate basic and translational research in mice.
Materials and Methods
Mouse sleep–wake datasets and surgical procedures
Two mouse models with EEG and one disease mouse model with LFP electrode recordings were used for developing and validating Sleep-Deep-Learner automated scoring. For EEG/EMG recordings, we used datasets previously obtained from mice expressing the enzyme Cas9 in parvalbumin neurons (PV-Cas9 mice), before and after CRISPR knockdown of the GABAA receptor α3 subunit in the thalamic reticular nucleus [21, 22]. For the pharmacology work, male C57BL/6 mice (Strain #000664, Jackson Laboratory) were used. As in our previously published work [21, 22], these mice were implanted with frontal cortex (AP +1.9 mm, ML ±1.5 mm relative to bregma) EEG screw electrodes (Pinnacle Technology Inc.; Kansas, United States; Part # 8403), with a reference screw electrode above the cerebellum (AP −6 mm, ML 0 relative to bregma) and a ground screw electrode at AP −3 mm, ML +2.7 mm. EMG electrodes were positioned under the neck muscles. For validations using LFP/EMG data, we used transgenic mice produced by crossing PV-Cre mice (Stock #: 017320; Jackson Laboratory) with APP/PS1 mice, a well-known β-amyloid-overexpressing model of Alzheimer’s disease (Stock #: 034829-JAX; MMRRC stock #34829). These PV-Cre/APP/PS1 mice were used for the LFP recordings, in which tungsten electrodes were implanted in the prefrontal cortex (AP +1.9 mm, ML −0.45 mm, DV −1.7 mm relative to bregma), with a reference and a ground screw electrode above the cerebellum (AP −5.8 mm, ML ±1.25 mm). For additional validations using data recorded in a brain region other than the frontal area, we performed LFP/EMG recordings in the dorsal hippocampus of PV-Cre mice (AP −2.0 mm, ML −1.8 mm, DV −1.25 mm relative to bregma). In all surgeries, EEG or LFP, the attached wires were soldered to Pinnacle Technology Inc. headmounts (Part # 8201-SS), which were then secured in place with Keystone Industries Bosworth Fastray dental cement (Part # 0921378). Mice were allowed at least 1 week to recover before recordings were made. Surgeries were performed on a Leica stereotaxic system under isoflurane (1.5%–4% in O2) anesthesia, monitored by breathing rate, pedal withdrawal, and tail pinch reflexes. Meloxicam (5 mg/kg; intraperitoneal or subcutaneous) was given immediately at the end of surgery and again 22–24 hours later to treat post-surgery pain.
Sleep–wake recording procedures
We recorded EEG/LFP and EMG on Pinnacle Technology Inc. three-channel (2 EEG and 1 EMG) mouse systems (Part # 8200-K1-SL), via the native software (Sirenia Acquisition). Mice were habituated by tethering them to their system via mouse pre-amplifiers (Pinnacle Technology Inc. Part # 8202-SL) for at least 24 hours. 24-hour recordings were collected from zeitgeber time (ZT) 0 to ZT 24. Data were sampled at 2 kHz, amplified 100×, and low-pass filtered at 600 Hz. Zolpidem tartrate was administered at 5 mg/kg, i.p., at ZT 6. All experiments were approved by the Institutional Animal Care and Use Committee of VA Boston Healthcare System and conformed to National Institutes of Health, Veterans Administration, and Harvard Medical School guidelines.
Manual sleep–wake scoring
We used Sirenia Sleep (version 2.0.1) to manually score a 2-hour section of each 24-hour EEG/EMG record, focusing on a part of the sleep–wake recording (ZT 2.5–4.5) that we could identify as including REM sleep, since REM sleep is much less abundant than wake and NREM sleep. For the pharmacology work, we manually scored a 20-minute period spanning, and roughly centered on, the injection of zolpidem. Zolpidem is short-acting [23], so we focused on a 3-hour period for analysis and hence also manually scored a shorter (20-minute) period for Sleep-Deep-Learner training. Manual scoring was performed as previously described [21], using 4-second epochs. EEG/EMG and LFP/EMG signals were exported as European data format (EDF) files using Sirenia’s native export options. All epochs, including those scored manually, were exported as .tsv files and converted to .txt files. Briefly, our method for manual scoring was as follows. Epochs with a desynchronized, low-amplitude EEG signal and a large EMG signal, which did not need to appear phasic, were labeled wake. Epochs containing mostly large-amplitude, slower waves and a low-amplitude EMG signal—with the exception of brief bursts suggesting a twitch—were labeled NREM sleep. Epochs with a repetitive, stereotyped “sawtooth” EEG signal oscillating at 5–9 Hz (theta) and a very flat EMG signal were labeled REM sleep. Epochs containing a mixture of states were labeled as the majority state, i.e., the state occupying more than half of the epoch. There were no scoring rules precluding a single-epoch bout flanked by other states, and no exclusion criteria precluded the use of any epochs in training Sleep-Deep-Learner.
Wavelet transform of signals
First, for each manually scored epoch, continuous wavelet transforms were performed using the MATLAB cwtfilterbank function with 32 voices per octave on three-epoch-wide sections of raw EEG/EMG or LFP/EMG; each section comprised the scored epoch flanked by its two neighboring epochs, and the score of the middle epoch was assigned to the section. The EEG/EMG or LFP/EMG wavelet transforms were converted to RGB images in the jet color scheme, with the top half showing the EEG or LFP wavelet transform and the bottom half showing the EMG wavelet transform for each epoch triplet. Images were sized to 224 × 224 pixels and saved as JPEG files. The JPEG files were then saved in folders labeled by category: “W” for wake, “N” for NREM sleep, and “R” for REM sleep. The same conversion to EEG/EMG or LFP/EMG wavelet-transform image files was performed for every epoch in the EEG/EMG or LFP/EMG record, and these images were saved in a folder to be classified by the algorithm once trained. We wrote and modified code to perform the wavelet transform of signals by consulting MATLAB documentation [24].
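To illustrate this step, the following is a minimal sketch of the scalogram-image generation, assuming 4-second epochs sampled at 2 kHz; the variable names (eeg, emg, labels) and folder layout are hypothetical, not the published Sleep-Deep-Learner code.

```matlab
% Minimal sketch of scalogram-image generation (hypothetical variable names).
% eeg, emg: raw signal column vectors; labels: cell array with 'W'/'N'/'R'
% for manually scored 4-s epochs, '' elsewhere. Folders W, N, R must exist.
fs = 2000;                              % sampling rate (Hz)
epochLen = 4 * fs;                      % samples per 4-s epoch
fb = cwtfilterbank('SignalLength', 3*epochLen, ...
    'SamplingFrequency', fs, 'VoicesPerOctave', 32);
for k = 2:numel(labels)-1               % middle epoch of each triplet is labeled
    if isempty(labels{k}), continue; end
    idx = (k-2)*epochLen + (1:3*epochLen);     % scored epoch plus its neighbors
    cfsEEG = abs(fb.wt(eeg(idx)));      % EEG/LFP scalogram magnitude
    cfsEMG = abs(fb.wt(emg(idx)));      % EMG scalogram magnitude
    im = ind2rgb(im2uint8([rescale(cfsEEG); rescale(cfsEMG)]), jet(256));
    im = imresize(im, [224 224]);       % GoogLeNet input size
    imwrite(im, fullfile(labels{k}, sprintf('epoch%06d.jpg', k)));
end
```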
Retraining GoogLeNet for each EEG/EMG or LFP/EMG record
We performed transfer learning of GoogLeNet using the MATLAB Deep Learning Toolbox. GoogLeNet was obtained from MATLAB’s pretrained deep neural networks. As it exists without modification, GoogLeNet classifies images into one of 1000 object categories, such as a mouse, keyboard, or pencil [25]. We retrained GoogLeNet to classify images representing wake, NREM sleep, and REM sleep—the JPEG wavelet-transform images described above. In a neural network, so-called convolutional layers extract features, which the final learnable layer and classification layer use to classify the images [24]. In GoogLeNet, “loss3-classifier” and “output” are the final learnable and classification layers, respectively, containing the information about how features are combined to create class probabilities and predictions [24]—i.e., sleep–wake scores once it is retrained. Prior to our modification, “loss3-classifier” corresponded to the 1000 categories pertaining to object recognition (mouse, keyboard, pencil, etc.). This “loss3-classifier” layer was replaced with a new layer that instead has the same number of categories as sleep–wake states present in the manual scores (wake, NREM sleep, and REM sleep). Of note, users must manually score examples of every state that is needed. For example, a user should not score a period of data devoid of, or poorly representing, REM sleep. The output layer was also replaced to reflect the number of sleep–wake states represented in the manual scores. We replaced a “dropoutLayer” with one that has a dropout probability of 0.6, to prevent over-fitting [26]. Otherwise, the GoogLeNet architecture was unchanged. Using these modifications to GoogLeNet, novel deep networks were trained, per file, using the MATLAB “trainNetwork” function. We wrote and modified code to retrain GoogLeNet by consulting MATLAB documentation [24]. The average numbers of scalogram images of sleep–wake epochs used to train and validate bespoke models for 24-hour EEG and LFP records were as follows: 499 epochs of wakefulness, 1115 epochs of NREM sleep, and 769 epochs of REM sleep. The images assigned for validation were 20% of the scalogram images generated from each class (sleep–wake state). To avoid biasing the selection of epochs assigned to training versus validation, assignment was made at random using the MATLAB splitEachLabel function with the “randomized” option. We selected the dropout probability based on MATLAB documentation [24]. The maximum number of training epochs was determined empirically with our publicly available dataset [22]. Elsewhere in the manuscript, the term “epoch” is used in the conventional sleep-research sense: a fixed period used to section and score a signal; in this paragraph, however, a “training epoch” describes one full pass through the training data. We chose 10 training epochs after initial trials with the default setting of 20 training epochs: validation accuracy and loss plateaued by 10 training epochs in all files, and F1 scores improved. Five training epochs led to lower F1 scores, so we used 10 training epochs for the rest of the study. We also tested other mini-batch sizes, weight learn-rate factors, and bias learn-rate factors, but these did not improve performance. Therefore, these and other user-defined settings were set to their defaults as follows: “Stochastic gradient descent with momentum,” “Mini-Batch Size” = 15, “Initial Learn Rate” = 1e-4, “Validation Data” = images assigned for validation, “Validation Frequency” = 10, “Verbose” = 1.
To increase speed, we used the GPU as the “Execution Environment.” Finally, a newly retrained model tailored to each file was used to classify the non-scored epochs for each file using the MATLAB classify function. There is an option to save each bespoke trained model, to ensure reproducibility should a file need to be scored a second time.
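For orientation, the layer replacement, training, and classification steps described above might be sketched as follows, based on MATLAB’s documented transfer-learning workflow. The GoogLeNet layer names (“loss3-classifier,” “output,” “pool5-drop_7x7_s1”) are those of the pretrained network; the folder and variable names are illustrative, and this is not the verbatim Sleep-Deep-Learner code.

```matlab
% Sketch of transfer learning and scoring for one EEG/EMG or LFP/EMG record.
net = googlenet;                           % pretrained network (Deep Learning Toolbox)
lgraph = layerGraph(net);
numClasses = 3;                            % wake, NREM sleep, REM sleep
lgraph = replaceLayer(lgraph, 'loss3-classifier', ...
    fullyConnectedLayer(numClasses, 'Name', 'fc_sleep'));
lgraph = replaceLayer(lgraph, 'output', classificationLayer('Name', 'sleep_out'));
lgraph = replaceLayer(lgraph, 'pool5-drop_7x7_s1', ...
    dropoutLayer(0.6, 'Name', 'drop60'));  % dropout 0.6 against over-fitting

% Manually scored scalograms live in subfolders W, N, and R.
imds = imageDatastore('scored_scalograms', 'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.8, 'randomized');   % 80/20 split

opts = trainingOptions('sgdm', ...         % stochastic gradient descent with momentum
    'MiniBatchSize', 15, 'MaxEpochs', 10, 'InitialLearnRate', 1e-4, ...
    'ValidationData', imdsVal, 'ValidationFrequency', 10, ...
    'ExecutionEnvironment', 'gpu', 'Verbose', true);
trainedNet = trainNetwork(imdsTrain, lgraph, opts);

% Score every epoch in the record and (optionally) keep the bespoke model.
imdsAll = imageDatastore('all_scalograms');
[autoScores, certainty] = classify(trainedNet, imdsAll);  % labels + class probabilities
save('bespoke_model_thisfile.mat', 'trainedNet');
```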
REM sleep quality control to identify files that could not be scored reliably by Sleep-Deep-Learner
In order to further improve the accuracy of REM sleep scoring, we developed post hoc corrections involving a few extra steps. We discovered that the MATLAB “classify” function does not simply assign the class based on its highest certainty estimate (one of the classify function’s output variables). In other words, some REM sleep epochs are assigned despite a higher certainty estimate for wake or NREM sleep. First, we relabeled each REM sleep epoch based on its highest certainty value. Next, we corrected cases where REM sleep immediately followed wake, an abnormal state transition observed only in specific cases such as mouse models of narcolepsy [27, 28]. We first found the wake-to-REM sleep cases and identified the surrounding scores for each case. If a wake-to-REM sleep case was preceded by more than three epochs of wake, the REM sleep was replaced with wake, on the assumption that an epoch within a wake bout had been incorrectly scored as REM sleep. If a wake-to-REM sleep case was followed by more REM sleep epochs and had fewer than five wake epochs prior to it, the epochs between the last NREM sleep and the wake-to-REM sleep case were replaced with REM sleep, on the assumption that the epochs at the transition from NREM sleep to REM sleep had been incorrectly scored as wake. Our post hoc correction of REM sleep scores effectively removed REM sleep from files that could not be scored accurately. This feature lets us identify the files that could not be reliably scored by Sleep-Deep-Learner. For the PV-Cas9 mouse dataset, only 3 of the 14 files used were flagged as unreliable. In a typical experiment, these three files could then be scored manually and used in a study. For simplicity, we removed those three from the validation data shown in the Results section.
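A simplified sketch of this correction logic is shown below, assuming autoScores is a categorical column vector whose categories match the column order of the certainty matrix returned by classify. The thresholds follow the text, but this is illustrative rather than the published function.

```matlab
% Relabel every REM epoch according to its highest certainty estimate.
classes = categories(autoScores);               % assumed to match certainty columns
remIdx = find(autoScores == 'R');
[~, best] = max(certainty(remIdx, :), [], 2);
autoScores(remIdx) = classes(best);

% Correct abnormal wake-to-REM transitions.
w2r = find(autoScores(1:end-1) == 'W' & autoScores(2:end) == 'R') + 1;
for k = w2r(:)'
    nWake = find(flipud(autoScores(1:k-1)) ~= 'W', 1) - 1;   % preceding wake run
    if isempty(nWake), nWake = k - 1; end
    if nWake > 3
        autoScores(k) = 'W';                    % stray REM inside a wake bout
    elseif k < numel(autoScores) && autoScores(k+1) == 'R' && nWake < 5
        lastN = find(autoScores(1:k-1) == 'N', 1, 'last');
        if ~isempty(lastN)
            autoScores(lastN+1:k-1) = 'R';      % missed NREM-to-REM transition
        end
    end
end
```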
Secondary optional post hoc correction of REM sleep epochs
In addition to the post hoc corrections of REM sleep described above, we wrote a custom function termed “rule_123,” which first searches for REM sleep epochs preceded by a single NREM sleep epoch, then relabels the REM sleep epoch and the following epoch as NREM sleep. Since Supplementary Figure S2A revealed REM sleep false positives following bouts of NREM sleep of up to eight epochs, we tried expanding this correction to target those REM sleep epochs as well. However, changing REM sleep scores preceded by two or three NREM sleep epochs did not increase our F1 scores for REM sleep. Therefore, we decided to correct only REM sleep epochs preceded by a single NREM sleep epoch. Next, rule_123 searches for short REM sleep bouts—from a single 4-second epoch up to four 4-second epochs—and relabels them as wakefulness.
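In outline, and with the same assumed categorical score vector as above, rule_123 might be sketched as follows; the bout-length threshold shown is the one described for 4-second epochs, and investigators should verify it against their own manually scored data.

```matlab
% Step 1: REM preceded by exactly one NREM epoch -> relabel that REM epoch
% and the epoch after it as NREM (triple sequence "W or R" -> N -> R).
for k = 3:numel(autoScores)-1
    if autoScores(k) == 'R' && autoScores(k-1) == 'N' && autoScores(k-2) ~= 'N'
        autoScores(k:k+1) = 'N';
    end
end

% Step 2: relabel very short REM bouts (up to four 4-s epochs) as wake.
d = diff([0; autoScores == 'R'; 0]);
bStart = find(d == 1); bEnd = find(d == -1) - 1;
for b = find((bEnd - bStart + 1) <= 4)'
    autoScores(bStart(b):bEnd(b)) = 'W';
end
```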
Hardware and software specifications
Work was performed on either a computer with an Intel(R) Core(TM) i7-8700 (clock speed: 3.20 GHz), 32.0 GB RAM, and an NVIDIA GeForce RTX 2080 GPU, or one with an Intel(R) Core(TM) i9-11900F (clock speed: 2.50 GHz), 128 GB RAM, and an NVIDIA GeForce RTX 3090 GPU. Both ran Windows 10 Pro and MATLAB R2023b. Using a GPU with minimum specifications similar to those listed here is highly recommended, since it speeds up the computationally demanding processes, such as the wavelet transform of signals and the network retraining, by a factor of roughly seven compared with using the CPU only.
The time-frequency spectrogram
The time-frequency spectrogram was produced using the multitaper method function of the Chronux Toolbox (chronux.org) [29], with five tapers and a 10-second sliding window advanced in 100-millisecond steps.
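For reference, a call consistent with these parameters might look like the following, assuming the Chronux toolbox is on the MATLAB path; the display frequency band is an assumption, not specified in the text.

```matlab
% Multitaper spectrogram with 5 tapers, 10-s window, 100-ms step (Chronux).
params.Fs     = 2000;          % sampling rate (Hz)
params.tapers = [3 5];         % time-bandwidth product 3 -> 5 tapers
params.fpass  = [0 30];        % display band (assumed)
movingwin     = [10 0.1];      % 10-s window advanced in 0.1-s steps
[S, t, f] = mtspecgramc(eeg, movingwin, params);
imagesc(t, f, 10*log10(S')); axis xy; xlabel('Time (s)'); ylabel('Hz')
```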
Workflow for scoring datasets
Figure 1 illustrates the workflow for the end-user (Figure 1A) and the subsequent scoring automated by Sleep-Deep-Learner (Figure 1B). First, the end-user needs to score one-twelfth of the epochs from each file of the dataset, taking care to label some of each state; we recommend scoring at least 100 epochs of each state. Epochs can be scored sporadically throughout the file; they do not need to come from a continuous section of the record. Following manual scoring, the EEG/LFP/EMG signals are exported as EDFs. We used Sirenia Sleep for manual scoring, but any software package that can export EDFs can be used. Next, the scores need to be exported and converted to .txt or .xls files for use with Sleep-Deep-Learner. Then, a spreadsheet needs to be created that Sleep-Deep-Learner reads to load files: the EDF file names are placed in the first column, and the second column indicates which EEG or LFP channel was used for scoring and should be used by Sleep-Deep-Learner (Figure 1A). Sleep-Deep-Learner iterates through each file (Figure 1B; a sketch of this batch loop follows Figure 1 below); for each one, GoogLeNet is loaded and the final learnable layers are replaced with new layers for the sleep–wake states wake, NREM sleep, and REM sleep [30]. Each new model (created for each file) is trained using images of wavelet transforms of epochs that have been scored wake, NREM sleep, or REM sleep by the end-user. Each trained model scores all the epochs in its file (Figure 1B). Finally, an .xls file is saved for each record containing Sleep-Deep-Learner’s automated scores, with each tab containing the scores produced with each optional post hoc correction.
Figure 1.
Illustration of the workflow for the end-user and subsequent automated scoring by Sleep-Deep-Learner. (A) The end-user first manually scores one-twelfth of each file in their preferred sleep-scoring software package; we used Sirenia Sleep. EEG/LFP/EMG signals are then exported as European data format files. Scores are exported/converted to .txt, .tsv, or .xls files. The manual scores for each file will be used to train a new model for each record. Next, a spreadsheet should be created with all the EDF file names in the first column and the chosen EEG or LFP channel in the second column. Sleep-Deep-Learner reads this spreadsheet to work through all the files without supervision by the end-user. Finally, the end-user instructs Sleep-Deep-Learner to complete scoring of the files via its native graphical user interface. (B) Sleep-Deep-Learner iterates through each file. For each iteration, GoogLeNet is loaded. The final learnable layers are replaced with new layers specific to each sleep–wake state. The model created for each file is then trained on a subset of wavelet-transform images labeled as wake, NREM sleep, or REM sleep by the end-user’s manual scoring of that specific file. The novel trained model created for each file is then used to classify (score) all epochs from that file.
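The batch loop driven by the file-list spreadsheet can be sketched as follows; the spreadsheet name and variable names are illustrative.

```matlab
% Iterate over the file-list spreadsheet (col 1: EDF name, col 2: channel).
fileList = readcell('file_list.xlsx');
for i = 1:size(fileList, 1)
    edfName = fileList{i, 1};
    chan    = fileList{i, 2};
    % For each record: load the EDF and the manual scores, generate
    % wavelet-transform images for chan plus EMG, retrain GoogLeNet on the
    % manually scored subset, classify every epoch, apply the optional
    % post hoc REM sleep corrections, and write one .xls of scores.
end
```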
Statistics
Comparisons were made by either Pearson’s correlations or unpaired t-tests, as described. F1 scores were calculated as the harmonic mean of precision and recall, where precision was the fraction of detections of a class that were correct, and recall was the fraction of epochs of a class, per the manual scores, that were detected.
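Concretely, the per-state F1 scores can be computed from a confusion matrix of manual (true) versus automated (predicted) labels; this is a generic sketch, not the authors’ exact code.

```matlab
% Per-state precision, recall, and F1 from manual vs automated labels.
C = confusionmat(manualScores, autoScores);   % rows: true, columns: predicted
precision = diag(C) ./ sum(C, 1)';            % correct / all detections of class
recall    = diag(C) ./ sum(C, 2);             % correct / all true epochs of class
F1        = 2 * (precision .* recall) ./ (precision + recall);
```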
Results
To assess how reliably our transfer learning-based algorithm, Sleep-Deep-Learner, classified scores after training on manually labeled scores, we first calculated F1 scores for the three vigilance states typically tabulated in mouse EEG, and overall. Automated scores by Sleep-Deep-Learner provided high F1 scores (Figure 2A) for wakefulness (0.97 ± 0.003), NREM sleep (0.95 ± 0.004), REM sleep (0.86 ± 0.011), and overall (0.96 ± 0.004) in eleven 24-hour EEG/EMG records. Importantly, the proportion of each vigilance state was similar whether EEG/EMG records were scored by Sleep-Deep-Learner or by a human (Figure 2, B and C). Two-tailed t-tests with Bonferroni corrections revealed no significant differences between manual scores and Sleep-Deep-Learner scores for wake (t (10) = −0.55, p = 1), NREM sleep (t (10) = 1.55, p = .4), or REM sleep (t (10) = −1.17, p = .76). Moreover, the proportions of states determined from automated versus manual scores correlated very highly for wakefulness (Figure 2D, ρ = .94, p = .000007), NREM sleep (Figure 2E, ρ = .85, p = .0005), and REM sleep (Figure 2F, ρ = .87, p = .0002), as determined by Pearson’s correlation. For this, we used 11 separate 24-hour records from six different mice. To address whether these correlations were significant merely by virtue of the within-animal replicates, we performed additional correlations using only a single record per mouse and still found significant correlations for wake (ρ = .95, p = .002), NREM sleep (ρ = .92, p = .005), and REM sleep (ρ = .85, p = .02). Of note, the 11 files shown here were a subset of 14 files from seven mice. Utilizing our “REM sleep Quality Control” (see Methods), we identified 3 of the 14 files as unreliably scored, as described next.
Figure 2.
Sleep-Deep-Learner scores mouse EEG/EMG signals with high reliability compared with human expert scores and retains the proportion of sleep–wake states. (A) F1 scores for wakefulness, NREM sleep, and REM sleep reveal reliable scoring relative to the manual scores of the expert scorer on which Sleep-Deep-Learner was trained. (B) The proportion of wake, NREM sleep, and REM sleep from 24-hour records as determined by manual scoring. (C) The proportion of wake, NREM sleep, and REM sleep from 24-hour records as determined by Sleep-Deep-Learner’s automated scoring, which is highly comparable to that obtained with manual scoring (B). On a 24-hour record-by-record basis, the proportions of wakefulness (D), NREM sleep (E), and REM sleep (F) correlate highly when determined by manual scoring versus Sleep-Deep-Learner’s automated scoring. Data were from eleven 24-hour recordings from six mice. Error bars represent SEM.
Initially, we noticed that REM sleep F1 scores were lower than those of wake and NREM sleep. We were therefore interested in developing an automated “REM sleep Quality Control” procedure that would reliably indicate EEG/EMG records in which Sleep-Deep-Learner REM sleep scores were unreliable. We thought this would be useful because investigators using Sleep-Deep-Learner would not be able to calculate F1 scores on new, unscored data. Unreliable EEG/LFP/EMG records could then be scored manually or excluded from further analysis. First, we examined “validation accuracy,” which is reported during network training and indicates how accurate classification is, based on the 20% of labeled epochs randomly assigned to the validation subset. Validation accuracy is a percentage calculated by dividing the number of correct labels by the total number of labels. However, this was a poor correlate of final REM sleep F1 scores (Supplementary Figure S1A, ρ = .03, p = .46). Using a simple “REM sleep Quality Control” procedure, we were able to slightly improve F1 scores in most of the files. Importantly, this had the effect of drastically lowering the REM sleep F1 scores that were already the lowest (Supplementary Figure S1B, red). These corrected REM sleep scores yielded F1 values that did correlate with validation accuracy (ρ = .54, p = .02). Unfortunately, two files with validation accuracy in the same lower range still produced high REM sleep F1 scores (Supplementary Figure S1B, orange fill), which would lead an investigator to score two files manually unnecessarily. However, we found that our REM sleep Quality Control procedure decimated REM sleep selectively in unreliably scored files, flagging only 3 of 14 files from our full dataset (Supplementary Figure S1C, red). The final proportion of REM sleep in each file thereby became an excellent indicator of F1 scores (Supplementary Figure S1C), correlating very highly (ρ = .81, p = .0002). In a typical experimental scenario, these three files could then be scored manually and used in a study. For simplicity, we removed those three from this validation. Unless the mouse models used in a study are known to exhibit direct wake-to-REM sleep transitions, we recommend that all Sleep-Deep-Learner users employ this post hoc correction, as it was necessary to identify files yielding poor reliability of REM sleep scoring.
Finally, we made a secondary, optional post hoc correction to REM sleep scores, which further improved F1 scores. This optional correction removed REM sleep that was preceded by only one epoch of NREM sleep or that had a very short bout length. Before developing rule_123, we had observed that recall for REM sleep (0.92 ± 0.01) was considerably higher than precision (0.82 ± 0.02). Thus, if we could identify major sources of false positives, we could relabel them as another state. We observed that the algorithm commonly scores REM sleep following short bouts of NREM sleep, whereas manual scoring produces these instances very rarely (Supplementary Figure S2A). Therefore, we employed a simple strategy: searching for the triple-epoch sequence “wake or REM sleep”→NREM sleep→REM sleep to identify REM sleep preceded by a single NREM sleep epoch in the Sleep-Deep-Learner scores. We observed that, in most of these epochs, manual scoring had identified the “REM sleep” epoch as NREM sleep, so we relabeled such Sleep-Deep-Learner-scored REM sleep as NREM sleep. Next, we observed that Sleep-Deep-Learner produced more time-weighted REM sleep bouts in the shortest bin at the expense of the next shortest bin (Supplementary Figure S2B). We found that relabeling very short REM sleep bouts as wake improved the REM sleep F1 score without reducing the wakefulness F1 score from 0.97, and corrected the profile of time-weighted REM sleep bouts (Figure 3, B and C, right). We determined the minimum REM sleep bout duration by increasing it one epoch at a time, up to four epochs; REM sleep F1 scores increased by 0.1 per epoch and plateaued at three REM sleep epochs. We propose this custom MATLAB function, termed “rule_123,” as an optional feature: investigators may decide whether to use it based on comparisons of F1 scores and time-weighted bout analyses in their own, previously manually scored datasets, and then apply those optional settings to novel datasets. We found it worked well with our EEG data.
Figure 3.
Sleep-Deep-Learner retains sleep architecture of manual scoring. (A) A representative multitaper time-frequency plot of EEG from a 24-hour record with aligned hypnograms drawn from manual scores and Sleep-Deep-Learner’s automated scores revealing highly matching sleep–wake architecture captured by manual versus automated scoring. (B) Time-weighted bout analysis reveals the proportion of time (y-axis) the mice spent in bouts of various binned durations (x-axis) of wakefulness, NREM sleep, and REM sleep as determined from manual scores. (C) The same time-weighted bout analysis is shown of wakefulness, NREM sleep, and REM sleep as determined from automated scores. (D) Scatter plots with fitted least squares lines reveal the close relationship between time-weighted bouts for all three sleep–wake states on a record-by-record basis. Pearson’s correlation coefficients reveal statistically highly significant correlations between all three sleep–wake states comparing manual versus Sleep-Deep-Learner’s automated scores. Data were from eleven 24-hour recordings from six mice. Error bars represent SEM.
Though high F1 scores are needed to indicate widespread agreement overall and within each state, a deeper analysis of sleep–wake architecture was necessary to determine whether Sleep-Deep-Learner would produce scores leading to results similar to those of an expert scorer if the method were used in real research applications. We therefore evaluated sleep architecture in the EEG/EMG records when scored by Sleep-Deep-Learner versus manually. Figure 3A shows a representative time-frequency plot of EEG with aligned hypnograms comparing manual scores with Sleep-Deep-Learner scores, illustrating close alignment across the entire 24-hour EEG/EMG record. Thus, Sleep-Deep-Learner-generated scores retain the sleep–wake architecture of manual scoring. For a fine-grained evaluation of sleep–wake dynamics, we performed bout analysis. For this, we analyzed time-weighted bouts as first described by Mochizuki et al. [31] and used subsequently by our group [21]. We found the profiles of time-weighted bouts for wake, NREM sleep, and REM sleep as scored manually (Figure 3B) and as scored by Sleep-Deep-Learner (Figure 3C) correlated very highly for all three states (Figure 3D): wake (ρ = .99998, p = 5.88037 × 10^−12), NREM sleep (ρ = .83, p = .02), and REM sleep (ρ = .987, p = .002).
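As an illustration of the time-weighted bout measure: each bout contributes its duration, rather than a count of one, to its duration bin, so the resulting profile reflects the proportion of total state time spent in bouts of each length. The bin edges below are illustrative, not the exact bins of Mochizuki et al.

```matlab
% Time-weighted bout analysis for REM sleep (4-s epochs assumed).
epochSec = 4;
isState = (autoScores == 'R');
d = diff([0; isState; 0]);
boutDur = (find(d == -1) - find(d == 1)) * epochSec;   % bout durations (s)
edges = [0 8 16 32 64 128 256 512 Inf];                % duration bins (s), illustrative
[~, ~, bin] = histcounts(boutDur, edges);
timeWeighted = accumarray(bin, boutDur, [numel(edges)-1, 1]) / sum(boutDur);
% timeWeighted(j): proportion of total REM time spent in bouts falling in bin j.
```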
Next, we tested the performance of Sleep-Deep-Learner under a condition known to affect the quality of sleep by altering NREM sleep EEG waveforms in mice [23, 32], to test the versatility of Sleep-Deep-Learner with data other than WT EEG/EMG signals. We recorded 3 hours of EEG/EMG from four mice and delivered a hypnotic dose of zolpidem 1 hour after the start of the recording. Figure 4A shows a representative time-frequency plot of the EEG in one mouse preceding and following the injection of zolpidem, with aligned hypnograms from manual scores and scores by Sleep-Deep-Learner. Zolpidem causes rapid-onset NREM sleep that presents with increased low-frequency power compared to wake but reduced broadband power compared to spontaneous NREM sleep. Qualitatively, the sleep architecture defined by manual scores is conserved by the automated scores generated by Sleep-Deep-Learner. Sleep-Deep-Learner produced F1 values in the same range as in WT data for NREM sleep and REM sleep, producing the highest F1 for NREM sleep, but performed worse than in WT data for wakefulness (Figure 4B). Despite the poorer performance for wake, this result supports the utility of Sleep-Deep-Learner in studies of hypnotics, since the main therapeutic effect of zolpidem is to induce NREM sleep. Thus, EEG signals of NREM sleep induced by zolpidem could be used to study the profile of the drug. Moreover, though epoch-to-epoch agreement for wake is lower than preferred, the proportion of states within the recording is preserved fairly well (Figure 4, C and D), with a percent change of only 26.08% for wake (± 13.23; t (3) = 2.57, p = .13), −6.86% for NREM sleep (± 2.74; t (3) = 2.55, p = .13), and 0.00% for REM sleep (± 13.61; t (3) = 0, p = 1).
Figure 4.
Sleep-Deep-Learner retains the sleep architecture of hypnotic-drug (zolpidem)-induced sleep compared with manual scoring. (A) A representative multitaper time-frequency plot of EEG from a 3-hour record in which 5 mg/kg zolpidem was delivered by intraperitoneal injection, with aligned hypnograms drawn from manual scores and Sleep-Deep-Learner’s automated scores, demonstrating highly matched manual and automated scoring. Of note, because the recording was 3 hours long, only 20 minutes of manual scoring was used for training the network for each file in this case. (B) F1 scores between manual and Sleep-Deep-Learner’s automated sleep scores reveal high reliability for drug-induced NREM sleep and overall, with the reliability of REM sleep in a similar range to WT records. The relative time spent in each sleep–wake state as determined by manual scores (C) versus Sleep-Deep-Learner’s automated scores (D) shows similar values. N = 4. Error bars represent SEM.
To further test the versatility of Sleep-Deep-Learner with data other than WT EEG/EMG signals, we recorded LFP from the frontal cortex of an Alzheimer’s disease mouse model [33–35] crossed with a Cre line. We used five mice that were 6 months old; at this age, symptoms including altered sleep–wake patterns and EEG are apparent [36]. Figure 5A shows high F1 scores for NREM sleep, wake, and overall, with slightly lower F1 scores for REM sleep, in a similar range to WT EEG data. Additionally, the relative proportion of each state is well conserved when scored by Sleep-Deep-Learner compared with manual scores (Figure 5, B and C; t-tests with Bonferroni correction, % wake: t (8) = 0.513, p = 1.00; % NREM sleep: t (8) = −1.016, p = 1.00; % REM sleep: t (8) = 1.271, p = .718). We found rule_123 did not improve F1 values in the LFP dataset, so it was not used here.
Figure 5.
Sleep-Deep-Learner retains sleep architecture in local-field-potential (LFP) data from a mouse model of Alzheimer’s disease (APP/PS1) compared with manual scoring. (A) F1 scores between manual and Sleep-Deep-Learner’s automated sleep scores were computed from 24-hour LFP recordings in the frontal cortex of 5 APP/PS1 mice. The F1 scores reveal that Sleep-Deep-Learner scored NREM sleep, wake, and all states combined with high accuracy. The F1 score for REM sleep was slightly lower than for the other states, but in a similar range to the WT case shown in Figure 2. The relative time spent in each sleep–wake state as determined by manual scores (B) was highly similar to that from Sleep-Deep-Learner’s automated scores (C). N = 5. Error bars represent SEM.
So far, the data used for these validations were recorded from cortical (frontal) regions of the brain. We further examined the versatility of Sleep-Deep-Learner with signals obtained from a subcortical brain region. For this validation, we recorded LFP data from the dorsal hippocampus of four PV-Cre mice at 6 months of age. We again found that F1 scores for NREM sleep, wake, and overall were high, with slightly lower F1 scores for REM sleep (Figure 6A). Moreover, the proportion of each state was well maintained when scored by Sleep-Deep-Learner compared with manual scores (Figure 6, B and C; t-tests with Bonferroni correction, % wake: t (6) = −0.082, p = 1.00; % NREM sleep: t (6) = −1.064, p = .98; % REM sleep: t (6) = 2.025, p = .27).
Figure 6.
Sleep-Deep-Learner retains sleep architecture in local-field-potential (LFP) data from the hippocampus of mice compared with manual scoring. (A) F1 scores between manual and Sleep-Deep-Learner’s automated scores were computed from 24-hour LFP recordings performed in the hippocampus of 4 PV-Cre mice. The F1 scores reveal high reliability of Sleep-Deep-Learner in scoring NREM sleep, wake, and all states combined. The F1 score for REM sleep was slightly lower than for the other states, but in a similar range to the WT case shown in Figure 2. The relative time spent in each sleep–wake state as determined by manual scores (B) was highly similar to that from Sleep-Deep-Learner’s automated scores (C). N = 4. Error bars represent SEM.
Finally, we wanted to reproduce findings involving changes in NREM sleep oscillations from data previously scored manually. We recently used mouse EEG combined with in vivo CRISPR-Cas9 knock-down of α3-containing GABAA receptors in the thalamic reticular nucleus to reveal deepened sleep, as measured by elevated delta wave power enriched at NREM sleep-to-REM sleep transitions [21]. We wanted to test whether we would have found this result had we been using Sleep-Deep-Learner. We used Sleep-Deep-Learner to re-score the data from mice that had their synaptic GABAA receptors knocked down selectively in the thalamic reticular nucleus (termed α3KD) and repeated the analysis of our previous publication (a simplified sketch of the delta-power measure is given after Figure 7). The post hoc REM sleep correction flagged two files, so their manual scores were used. Figure 7 shows graphs redrawn from our previous publication using manual scoring, in which “heightened delta power associated with α3KD was only evident during NREM sleep preceding transitions to REM sleep” (Figure 7, A–C; mean (SEM): baseline (BL) = 2.98 (0.24), α3KD = 3.3 (0.26); 11.5% change for all transitions occurring during the light period (±5.56); t (5) = 2.14, p = .04) [21]. Compared to this analysis, scores of the same data produced by Sleep-Deep-Learner would have provided essentially the same results (Figure 7, D–F; mean (SEM): BL = 2.63 (0.24), α3KD = 3.19 (0.23); 20% change for all transitions occurring during the light period) and led to the same conclusions, supported by statistical significance (one-tailed paired t-test: t (5) = 6.6, p = .0006).
Figure 7.
The significant increase in NREM sleep delta power due to knocking down the expression of GABAA α3 subunits in the thalamic reticular nucleus (Uygun et al., 2022) is reproduced when Sleep-Deep-Learner scored the data. Sleep–wake states were determined by manual scoring in panels (A)–(C), from Uygun et al., 2022. (A) Baseline time-frequency power dynamics reveal elevated delta (1.5–4 Hz) power during NREM sleep prior to rapid-eye-movement (REM) sleep transitions, across the whole 12-hour light period. (B) Following the knock-down of α3-containing GABAA receptors in the thalamic reticular nucleus (α3KD), the high delta power in NREM sleep prior to REM sleep transitions was further increased. (C) Compared with their own BL levels, α3KD mice produced more delta power during NREM sleep prior to REM sleep transitions [t (5) = 2.14, p = .04]. Sleep–wake states determined by Sleep-Deep-Learner’s automated scores in panels (D)–(F) reproduced the effects found by manual scoring [t (5) = 6.6, p = .0006]. Significance was tested using one-tailed paired t-tests. Thick lines indicate means; envelopes indicate SEM. N = 6.
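The sketch below illustrates the delta-power measure at NREM-to-REM transitions in simplified form; the published analysis tracked full time-frequency dynamics around each transition, and the variable names here (scores, eeg, fs, epochLen) are hypothetical.

```matlab
% Sketch: delta power in NREM epochs immediately preceding NREM-to-REM
% transitions, using scores (categorical per-epoch labels) and raw eeg.
fs = 2000; epochLen = 4 * fs;
trans = find(scores(1:end-1) == 'N' & scores(2:end) == 'R');   % NREM -> REM
delta = zeros(numel(trans), 1);
for j = 1:numel(trans)
    seg = eeg((trans(j)-1)*epochLen + (1:epochLen));           % preceding NREM epoch
    delta(j) = bandpower(seg, fs, [1.5 4]);                    % delta (1.5-4 Hz) power
end
meanDelta = mean(delta);   % compare baseline vs alpha3KD, e.g. by paired t-test
```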
Discussion
In this paper, we describe and validate a novel transfer learning-based approach to automate and greatly accelerate the sleep–wake scoring of mouse electrophysiological data by a trained scorer. Unlike previous methods for automated scoring of rodent sleep–wake data, Sleep-Deep-Learner creates a bespoke retrained model tailored to each EEG/EMG or LFP/EMG record and completes the scoring according to the end-user’s scoring style. This method is highly accurate, reliable, and versatile across different experimental conditions. Accuracy is further improved by our novel REM sleep-scoring correction procedures.
Each bespoke retrained model generated by Sleep-Deep-Learner is based on GoogLeNet, a deep convolutional neural network capable of computer vision, with fine-tuning of the high-level layers to perform the classification of wake, NREM sleep, and REM sleep, which it learns from a subset of scores from that particular file provided by the end-user. Using transfer learning on a deep convolutional neural network that was trained on images for object recognition [18], rather than on previously recorded mouse electrophysiology data, means there is no bias from familiarity with older data from a specific laboratory. We believe this approach is advantageous because the model adopts a scoring style from what it learns from the end-user, rather than from us—the investigators who developed it. Manual sleep scoring may be regarded as the most dependable approach and/or the ground truth. Thus, we argue for the importance of an automated scoring method that learns to score from the end-user rather than from the investigators who developed the algorithm.
It is informative to compare our method with previously published automated sleep–wake scoring methods for rodent data. Other methods work by training a model on an existing set of data and scores and then using that model to score new data. Most of these are freely available. One example is “SPINDLE” [13], a web-based service. Of the “pretrained” sleep–wake scoring models, we believe SPINDLE makes the best attempt to address the issue of differences between labs, by training on scores and signals from different labs using different experimental configurations. The nature of a web-based service may, however, create difficulties depending on institutional policies surrounding the sharing of raw unpublished data. For some investigators, the ability to obtain source code and run it on a local machine, as we provide, may be an attractive alternative. Sharing code also gives investigators the freedom to modify it for their specific needs and/or usability. Another example is “MC-SleepNet” [16]; one of its major advantages over some similar methods is the retention of time-domain information. Others extract features from Fourier-transformed power spectra of epochs. This can be problematic because large-amplitude yet short sections of an epoch may have greater representation in the power spectrum than the longer, lower-amplitude sections of the epoch. This scenario poorly mimics manual scoring. Sleep-Deep-Learner also maintains time-domain information by performing wavelet transforms rather than Fourier transforms. One limitation of MC-SleepNet is that it scores 20-second epochs; in our group and others, 4-second epochs are typically used and are considered necessary to capture the fast state transitions present in mouse sleep data. “WaveSleepNet” [12] is a MATLAB-based algorithm that also employs transfer learning of GoogLeNet. However, unlike Sleep-Deep-Learner, which generates novel models tailored to individual EEG/EMG or LFP/EMG records, WaveSleepNet is a single model that users deploy to score new data; as with the other models discussed here, it is familiar only with its developers’ datasets. The available code we found for WaveSleepNet did not process raw signals into the images that WaveSleepNet requires for classification, which may introduce challenges for investigators interested in using WaveSleepNet on their data. Additionally, we report considerably higher overall F1 scores—0.82 by WaveSleepNet versus 0.96 by Sleep-Deep-Learner. IntelliSleepScorer [15] is another example of the more common type of model that is trained on existing mouse EEG and expected to perform well on new data. IntelliSleepScorer provides 10-second epochs only, whereas Sleep-Deep-Learner does not require a predetermined epoch duration and handles any epoch duration automatically, notwithstanding that our post hoc adjustments were tested on 4-second epochs only. IntelliSleepScorer also assumes signals are EEG from a frontal screw electrode and a parietal screw electrode, requiring stringent channel designations; Sleep-Deep-Learner does not require this. Reliability metrics of Sleep-Deep-Learner were slightly higher than IntelliSleepScorer’s metrics on its own data, and considerably higher than IntelliSleepScorer’s metrics on data from an outside laboratory. That paper reported precision and recall separately, so we calculated F1 scores from its tables [15]. For REM sleep, IntelliSleepScorer had an F1 score of 0.72, versus 0.86 by Sleep-Deep-Learner; as they report, an F1 score of 0.86 is consistent with agreement between two human scorers.
For NREM sleep, IntelliSleepScorer had an F1 score of 0.89 versus 0.95 by Sleep-Deep-Learner. For wakefulness, IntelliSleepScorer had an F1 score of 0.92 versus 0.97 by Sleep-Deep-Learner.
One crucial consideration with most models trained on existing mouse EEG and scores is whether they will perform as well as in their source publications when presented with novel data [17]. Both the WaveSleepNet [12] and IntelliSleepScorer [15] papers address this by obtaining outside datasets, and in both papers the models performed worse with outside data than with in-house data. Presumably, this is because each model is simply more familiar with signals and scores from the laboratory that generated its training data. SPINDLE deals with this to an extent by training on data and scores from different labs, but we do not know whether a new end-user’s scores would align well with that training. It remains possible, or even likely, that the training SPINDLE received will not allow it to perform accurately under all experimental conditions. As new basic and translational research projects emerge, there may be a need for a re-trainable method to deal with new data that does not resemble previously encountered data. Our method has the advantage that the lower layers are not trained on mouse EEG/LFP, and the higher layers are trained on signals from the individual file being scored. Thus, when shown new data, the model is no less familiar with it than with the data used in this manuscript to validate the method. We included data generated and scored independently by different investigators, including both EEG and LFP signals. Moreover, Sleep-Deep-Learner is the only machine learning-based sleep-scoring method we know of to perform well with signals from subcortical regions. Despite these differences in recording method, we found similar reliability metrics. When we collated all the F1 scores generated in this study (Supplementary Table S1), only wakefulness from the zolpidem-treated group was significantly lower than in the other treatments. This presents a fundamental difference from the other published machine learning-based sleep-scoring algorithms that we have found.
Somnivore is proprietary software that reportedly works very well. It is the one exception we know of that may partially circumvent the issue of familiarity with training data, by retraining its network on a subset of manually scored epochs from each file [37]. The architecture of this model is not described in the publication [37], so it is difficult to ascertain whether the model is trained on time-domain or frequency-domain information, or whether the base model was originally trained on existing EEG datasets and manual scores available to its developers—unlike Sleep-Deep-Learner, which has no bias from familiarity with existing electrophysiologic data from specific laboratories. Moreover, since Somnivore is proprietary, it may not be available to many investigators who could benefit from a highly accurate sleep–wake algorithm. The reliability of Sleep-Deep-Learner is in a similar range to that of Somnivore. However, Somnivore reports poorer accuracy at transitions, whereas we have found our method handles transitions well, as shown in our analysis of NREM-to-REM sleep transitions. We feel this is an important attribute to preserve in sleep–wake scoring, since we routinely analyze power dynamics at state transitions and perform bout analyses. Additionally, Somnivore reported a loss in accuracy when scoring drug-induced NREM sleep, whereas Sleep-Deep-Learner performed well in this situation. Sleep-Deep-Learner users do need access to MATLAB, but it is otherwise freely available. Additionally, Sleep-Deep-Learner classified wakefulness and NREM sleep slightly better than Somnivore with rodent data. Somnivore is reportedly extremely fast; we have not tested this firsthand because we have not purchased the software. Sleep-Deep-Learner completed the scoring of one 24-hour file in about 10 minutes using the machine specifications given in our Methods section.
Finally, one of the major advantages of Sleep-Deep-Learner is its ability to identify files that yield the lowest score reliability. Our post hoc modification provides this information in the absence of a set of manual scores to serve as the ground truth, the existence of which would negate the need for an automated method. In other publications, especially for REM sleep, we see long distribution tails extending into low reliability metrics. In other words, while the average reliability of a dataset may be arguably acceptable, there are typically some files that yield far poorer REM sleep reliability [15]. Other methods have no means of flagging these unreliable scores, and investigators using them would need to accept that an unknown subset of the files in their dataset were unreliable. Our method enables investigators to remove those files or to include them after scoring them manually. This is advantageous to investigators who require the scores included in their dataset to be reliable.
Who is this method for? Our aim in developing this method was to provide a tool that would take the burden of manual sleep scoring away from investigators who would otherwise prefer to score their data manually. One potential concern with automated sleep–wake scoring methods, including machine learning-based ones, is that the automated scores may poorly match the investigator’s scores. In principle, supervised machine learning-based approaches should score more “like” humans because they are trained on epochs labeled by human experts. However, previously published models were trained on already existing epochs labeled manually by humans, and it is not clear whether one of these models would score similarly to a new investigator using the model for the first time. Indeed, different investigators do not consistently score manually like one another [15]. Thus, a user would need to be reasonably certain that their manual scoring style is similar to that of those who labeled the epochs used to train the model they are using. Our aim was to develop a method that would score “like” the investigator who is using it. Thus, our transfer learning-based approach retrains a bespoke model for each EEG/EMG or LFP/EMG record it is given, based on a subset of epochs scored by that investigator; it is not trained on, or familiar with, epochs labeled by other individuals. Additionally, users need enough experience with manual sleep scoring to ensure that all states are well represented in the training subset. For instance, scoring the first 2 hours of the dark period would be a mistake because NREM sleep and REM sleep would be poorly represented. REM sleep is of particular concern because it is physiologically underrepresented compared with wake and NREM sleep. We manually scored a 2–3-hour period around ZT 2.5–4.5, a region in which REM sleep appeared to be well represented. Who is this method not for? Since our method requires a subset of manually scored epochs, it is likely not useful to anyone who is not well versed in sleep–wake scoring. An investigator who requires an algorithm to score their data for them, because they lack experience scoring manually, will need to use one of the other methods, and we recommend one of those named above—especially SPINDLE, which may be less biased than other pretrained models due to its training on diverse scores and data. Finally, we propose that investigators with existing, manually scored datasets first test the secondary optional post hoc REM sleep correction we term “rule_123” on those data; the appropriate settings should be determined by comparison with their existing manual scores before rule_123 is used on new datasets.
The aim of this method is to reduce the burden of manual scoring without sacrificing the scoring approach employed by an individual expert scorer. Estimates from expert scorers suggest a single 24-hour record requires about two to six focused labor-hours to complete. However, this cannot simply be multiplied by the number of records in a dataset to obtain the total number of hours needed. Manual scoring requires concentration and leads to fatigue; expert scorers in our labs report being able to score at most a single 24-hour EEG/LFP/EMG record per day. More commonly, a 24-hour record is completed gradually, in separate sessions between other work requirements. This indicates that a dataset of 11 records would require, at a minimum, more than a week to complete. In contrast, manually scoring approximately 2 hours’ worth of data can be completed in about 10 minutes, amounting to roughly 2 hours for a batch of 11 records. Sleep-Deep-Learner then takes approximately 2 hours to complete the 11 records, which happens unsupervised on the computer, for example overnight. This represents a reduction from 1 to 2 weeks of intermittent human labor to under 4 hours of total time, with 80% of that time being completed without the investigator. Many preclinical datasets are much larger than this. For instance, consider a study with four groups and 10 mice per group: using the same estimates given above, these forty 24-hour recordings would require 2 months to score manually, versus ~1 work day of investigator time and less than one full day of total time using Sleep-Deep-Learner!
This is, to our knowledge, the first method to (1) create a new bespoke model for each file, via transfer learning on that file’s signal and the end-user’s scores, so that the model adopts the scoring style of that end-user without bias from prior familiarity with signals and scores from the developers’ lab(s); (2) flag unreliably scored files in the absence of F1 values, which cannot be computed for a new dataset that has not been scored manually; (3) validate reliable performance when scoring hypnotic-drug-induced sleep compared with non-induced sleep; and (4) validate reliable performance when scoring LFP data. Given the recent boom in machine learning applications for automating sleep scoring in mouse models, the advantages and disadvantages of our method are discussed in the context of the other methods we know to be available.
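To make point (1) concrete, the sketch below shows the general shape of a per-record transfer-learning step. Sleep-Deep-Learner itself is built on MATLAB’s pretrained GoogLeNet [25, 30]; this PyTorch rendering is a hedged illustration under our own assumptions (image renderings of epochs as inputs, a generic fine-tuning loop), not the tool’s actual implementation.

```python
# Hedged PyTorch illustration of per-record transfer learning; nothing below
# is taken from Sleep-Deep-Learner's source. Inputs are assumed to be image
# renderings of epochs (e.g. time-frequency plots) paired with the
# end-user's labels for the manually scored subset of one recording.
import torch
import torch.nn as nn
from torchvision import models

def make_bespoke_model(n_states=3):
    # Start from GoogLeNet pretrained on ImageNet, then replace the 1000-way
    # classifier head with a wake/NREM/REM head for this one recording.
    model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, n_states)
    return model

def fine_tune(model, loader, n_passes=5, lr=1e-4):
    # Fine-tune on the small, user-scored subset; the remainder of the
    # record is then scored by the resulting bespoke model.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(n_passes):
        for images, labels in loader:
            opt.zero_grad()
            out = model(images)
            # GoogLeNet may return a namedtuple with auxiliary logits
            logits = out.logits if hasattr(out, "logits") else out
            loss_fn(logits, labels).backward()
            opt.step()
    return model
```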
In summary, we provide the basic and translational sleep research community with an automated sleep–wake scoring algorithm [38] that adopts the individual investigator’s manual scoring style, while flagging the files it could not score reliably. We hope this will reduce the bottleneck in the workflow from data collection to novel findings from mouse models used to advance sleep research.
Supplementary Material
Contributor Information
Fumi Katsuki, Department of Psychiatry, VA Boston Healthcare System and Harvard Medical School, West Roxbury, MA, USA.
Tristan J Spratt, Department of Psychiatry, VA Boston Healthcare System and Harvard Medical School, West Roxbury, MA, USA.
Ritchie E Brown, Department of Psychiatry, VA Boston Healthcare System and Harvard Medical School, West Roxbury, MA, USA.
Radhika Basheer, Department of Psychiatry, VA Boston Healthcare System and Harvard Medical School, West Roxbury, MA, USA.
David S Uygun, Department of Psychiatry, VA Boston Healthcare System and Harvard Medical School, West Roxbury, MA, USA.
Funding
This work was supported by VA Biomedical Laboratory Research and Development Service Career Development Award IK2 BX004905 (D.S.U.) and Merit Awards I01 BX001404 and I01 BX006105 (R.B.) and I01 BX004673 (R.E.B.), and by NIH grants K01 AG068366 (F.K.), R01 NS119227 (R.B.), and R01 MH039683 (R.E.B.). F.K., D.S.U., R.E.B., and R.B. are Research Health Scientists at VA Boston Healthcare System, West Roxbury, MA. The contents of this work do not represent the views of the US Department of Veterans Affairs or the United States Government.
Disclosure Statement
Financial disclosure: none. Nonfinancial disclosure: F.K., R.E.B., R.B., and D.S.U. are Research Health Scientists at VA Boston Healthcare System, West Roxbury, MA. The contents of this work do not represent the views of the US Department of Veterans Affairs or the United States Government. Publicly available data from a previous publication by our group were used in this study (Uygun et al. 2022) and cited in the main text. Panels A–C of Figure 7, which were redrawn from our records, were published previously (Uygun et al. 2022), as cited in the figure legend and main text. We determined express permission was not required after consulting the Nature Communications webpage “editorial policies > self-archiving and license to publish.” Competing interests: the authors declare no competing interests.
Author Contributions
Fumi Katsuki (Conceptualization [Equal], Data curation [Equal], Formal analysis [Equal], Funding acquisition [Equal], Investigation [Equal], Methodology [Equal], Project administration [Equal], Resources [Equal], Software [Equal], Validation [Equal], Visualization [Equal], Writing—original draft [Equal], Writing—review & editing [Equal]), Tristan Spratt (Conceptualization [Supporting], Writing—original draft [Supporting], Writing—review & editing [Supporting]), Ritchie Brown (Conceptualization [Equal], Funding acquisition [Equal], Project administration [Equal], Resources [Equal], Supervision [Equal], Writing—original draft [Equal], Writing—review & editing [Equal]), Radhika Basheer (Conceptualization [Equal], Funding acquisition [Equal], Project administration [Equal], Resources [Equal], Supervision [Equal], Writing—original draft [Equal], Writing—review & editing [Equal]), and David Uygun (Conceptualization [Lead], Data curation [Equal], Formal analysis [Equal], Funding acquisition [Equal], Investigation [Equal], Methodology [Lead], Project administration [Equal], Resources [Equal], Software [Equal], Supervision [Lead], Validation [Equal], Visualization [Equal], Writing—original draft [Equal], Writing—review & editing [Equal]).
Data availability
We welcome investigators to contact us directly so we can share Sleep-Deep-Learner and assist with setting up and using it. We can be reached at Fumi_Katsuki@hms.harvard.edu and David_Uygun@hms.harvard.edu. Sleep-Deep-Learner and its user manual can also be downloaded at https://osf.io/gmv8h/ [38]. Twenty-four-hour EEG datasets are available at https://osf.io/uf2ca/.
References
- 1. Dement W, Kleitman N. Cyclic variations in EEG during sleep and their relation to eye movements, body motility, and dreaming. Electroencephalogr Clin Neurophysiol. 1957;9(4):673–690. doi: 10.1016/0013-4694(57)90088-3
- 2. Brown RE, Basheer R, McKenna JT, Strecker RE, McCarley RW. Control of sleep and wakefulness. Physiol Rev. 2012;92(3):1087–1187. doi: 10.1152/physrev.00032.2011
- 3. Tobler I, Deboer T, Fischer M. Sleep and sleep regulation in normal and prion protein-deficient mice. J Neurosci. 1997;17(5):1869–1879. doi: 10.1523/JNEUROSCI.17-05-01869.1997
- 4. Ghoshal A, Uygun DS, Yang L, et al. Effects of a patient-derived de novo coding alteration of CACNA1I in mice connect a schizophrenia risk gene with sleep spindle deficits. Transl Psychiatry. 2020;10(1):1–12. doi: 10.1038/s41398-020-0685-1
- 5. Mylonas D, Sjøgård M, Shi Z, et al. A novel approach to estimating the cortical sources of sleep spindles using simultaneous EEG/MEG. Front Neurol. 2022;13:871166. doi: 10.3389/fneur.2022.871166
- 6. Iber C, Ancoli-Israel S, Chesson AL, Quan SF. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications. Westchester, IL: American Academy of Sleep Medicine; 2007.
- 7. Rechtschaffen A, Kales A. A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects. Washington, DC: Public Health Service, US Government Printing Office; 1968.
- 8. Rayan A, Szabo AB, Genzel L. The pros and cons of using automated sleep scoring in sleep research: comparative analysis of automated sleep scoring in human and rodents: advantages and limitations. Sleep. 2024;47(1). doi: 10.1093/sleep/zsad275
- 9. Akada K, Yagi T, Miura Y, Beuckmann CT, Koyama N, Aoshima K. A deep learning algorithm for sleep stage scoring in mice based on a multimodal network with fine-tuning technique. Neurosci Res. 2021;173:99–105. doi: 10.1016/j.neures.2021.07.003
- 10. Barger Z, Frye CG, Liu D, Dan Y, Bouchard KE. Robust, automated sleep scoring by a compact neural network with distributional shift correction. PLoS One. 2019;14(12):e0224642. doi: 10.1371/journal.pone.0224642
- 11. Fraigne JJ, Wang J, Lee H, Luke R, Pintwala SK, Peever JH. A novel machine learning system for identifying sleep–wake states in mice. Sleep. 2023;46(6). doi: 10.1093/sleep/zsad101
- 12. Kam K, Rapoport DM, Parekh A, Ayappa I, Varga AW. WaveSleepNet: an interpretable deep convolutional neural network for the continuous classification of mouse sleep and wake. J Neurosci Methods. 2021;360:109224. doi: 10.1016/j.jneumeth.2021.109224
- 13. Miladinović D, Muheim C, Bauer S, et al. SPINDLE: end-to-end learning from EEG/EMG to extrapolate animal sleep scoring across experimental settings, labs and species. PLoS Comput Biol. 2019;15(4):e1006968. doi: 10.1371/journal.pcbi.1006968
- 14. Svetnik V, Wang TC, Xu Y, Hansen BJ, Fox SV. A deep learning approach for automated sleep-wake scoring in pre-clinical animal models. J Neurosci Methods. 2020;337:108668. doi: 10.1016/j.jneumeth.2020.108668
- 15. Wang LA, Kern R, Yu E, Choi S, Pan JQ. IntelliSleepScorer, a software package with a graphic user interface for automated sleep stage scoring in mice based on a light gradient boosting machine algorithm. Sci Rep. 2023;13:4275. doi: 10.1038/s41598-023-31288-2
- 16. Yamabe M, Horie K, Shiokawa H, Funato H, Yanagisawa M, Kitagawa H. MC-SleepNet: large-scale sleep stage scoring in mice by deep neural networks. Sci Rep. 2019;9:15793. doi: 10.1038/s41598-019-51269-8
- 17. Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010;22(10):1345–1359. doi: 10.1109/tkde.2009.191
- 18. Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA; 2015. doi: 10.48550/arXiv.1409.4842
- 19. Andreotti F, Phan H, Cooray N, Lo C, Hu MTM, De Vos M. Multichannel sleep stage classification and transfer learning using convolutional neural networks. In: 2018 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI; 2018:171–174. doi: 10.1109/EMBC.2018.8512214
- 20. Waters SH, Clifford GD. Comparison of deep transfer learning algorithms and transferability measures for wearable sleep staging. Biomed Eng Online. 2022;21(1):66. doi: 10.1186/s12938-022-01033-3
- 21. Uygun DS, Yang C, Tilli ER, et al. Knockdown of GABAA alpha3 subunits on thalamic reticular neurons enhances deep sleep in mice. Nat Commun. 2022;13(1):2246. doi: 10.1038/s41467-022-29852-x
- 22. Uygun DS, Yang C, Tilli ER, et al. In vivo data sets of: knockdown of GABAA alpha3 subunits on thalamic reticular neurons enhances deep sleep in mice. 2022. doi: 10.17605/OSF.IO/UF2CA
- 23. Uygun DS, Ye Z, Zecharia AY, et al. Bottom-up versus top-down induction of sleep by zolpidem acting on histaminergic and neocortex neurons. J Neurosci. 2016;36(44):11171–11184. doi: 10.1523/JNEUROSCI.3714-15.2016
- 24. The MathWorks Inc. Classify Time Series Using Wavelet Analysis and Deep Learning. 2019. https://www.mathworks.com/help/wavelet/ug/classify-time-series-using-wavelet-analysis-and-deep-learning.html. Accessed December 13, 2023.
- 25. The MathWorks Inc. googlenet. 2017. https://www.mathworks.com/help/deeplearning/ref/googlenet.html#mw_7cbd0577-4371-4eb3-829c-a9447220d89d_sep_mw_6dc28e13-2f10-44a4-9632-9b8d43b376fe. Accessed December 15, 2023.
- 26. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(56):1929–1958.
- 27. Fujiki N, Cheng T, Yoshino F, Nishino S. Specificity of direct transition from wake to REM sleep in orexin/ataxin-3 transgenic narcoleptic mice. Exp Neurol. 2009;217(1):46–54. doi: 10.1016/j.expneurol.2009.01.015
- 28. Mieda M, Willie JT, Hara J, Sinton CM, Sakurai T, Yanagisawa M. Orexin peptides prevent cataplexy and improve wakefulness in an orexin neuron-ablated model of narcolepsy in mice. Proc Natl Acad Sci USA. 2004;101(13):4649–4654. doi: 10.1073/pnas.0400590101
- 29. Prerau MJ, Brown RE, Bianchi MT, Ellenbogen JM, Purdon PL. Sleep neurophysiological dynamics through the lens of multitaper spectral analysis. Physiology. 2017;32:60–92. doi: 10.1152/physiol.00062.2015
- 30. The MathWorks Inc. Transfer Learning Using Pretrained Network. 2023. https://www.mathworks.com/help/deeplearning/ug/transfer-learning-using-pretrained-network.html. Accessed January 21, 2024.
- 31. Mochizuki T, Crocker A, McCormack S, Yanagisawa M, Sakurai T, Scammell TE. Behavioral state instability in orexin knock-out mice. J Neurosci. 2004;24(28):6291–6300. doi: 10.1523/JNEUROSCI.0586-04.2004
- 32. Uygun DS, Basheer R. Circuits and components of delta wave regulation. Brain Res Bull. 2022;188:223–232. doi: 10.1016/j.brainresbull.2022.06.006
- 33. Jankowsky JL, Slunt HH, Ratovitski T, Jenkins NA, Copeland NG, Borchelt DR. Co-expression of multiple transgenes in mouse CNS: a comparison of strategies. Biomol Eng. 2001;17(6):157–165. doi: 10.1016/s1389-0344(01)00067-3
- 34. Jankowsky JL, Fadale DJ, Anderson J, et al. Mutant presenilins specifically elevate the levels of the 42 residue beta-amyloid peptide in vivo: evidence for augmentation of a 42-specific gamma secretase. Hum Mol Genet. 2004;13(2):159–170. doi: 10.1093/hmg/ddh019
- 35. Reiserer RS, Harrison FE, Syverud DC, McDonald MP. Impaired spatial learning in the APPSwe + PSEN1DeltaE9 bigenic mouse model of Alzheimer’s disease. Genes Brain Behav. 2007;6(1):54–65. doi: 10.1111/j.1601-183X.2006.00221.x
- 36. Zhao Q, Maci M, Miller MR, et al. Sleep restoration by optogenetic targeting of GABAergic neurons reprograms microglia and ameliorates pathological phenotypes in an Alzheimer’s disease model. Mol Neurodegener. 2023;18(1):93. doi: 10.1186/s13024-023-00682-9
- 37. Allocca G, Ma S, Martelli D, et al. Validation of ‘Somnivore’, a machine learning algorithm for automated scoring and analysis of polysomnography data. Front Neurosci. 2019;13:207. doi: 10.3389/fnins.2019.00207
- 38. Katsuki F, Uygun DS. Sleep-Deep-Learner. 2024. doi: 10.17605/OSF.IO/GMV8H