J Clin Sleep Med. 2021 Feb 1;17(2):159–166. doi: 10.5664/jcsm.8820

Large-scale assessment of consistency in sleep stage scoring rules among multiple sleep centers using an interpretable machine learning algorithm

Gi-Ren Liu 1, Ting-Yu Lin 2, Hau-Tieng Wu 3, Yuan-Chung Sheu 4,5, Ching-Lung Liu 6, Wen-Te Liu 7, Mei-Chen Yang 8, Yung-Lun Ni 9, Kun-Ta Chou 10, Chao-Hsien Chen 6, Dean Wu 7, Chou-Chin Lan 8, Kuo-Liang Chiu 9,11, Hwa-Yen Chiu 10, Yu-Lun Lo 2,

Abstract

Study Objectives:

Polysomnography is the gold standard for identifying sleep stages; however, technicians apply the scoring standards inconsistently. Because organizing meetings to evaluate this discrepancy and/or reach a consensus among multiple sleep centers is time-consuming, we developed an artificial intelligence system to efficiently evaluate the reliability and consistency of sleep scoring and hence sleep center quality.

Methods:

An interpretable machine learning algorithm was used to evaluate the interrater reliability (IRR) of sleep stage annotation among sleep centers. The artificial intelligence system was trained to learn the scoring rules of raters from 1 hospital and was then applied to patients from the same or other hospitals. The predictions were compared with the experts’ annotations to determine IRR. Intracenter and intercenter assessments were conducted on 679 patients without sleep apnea from 6 sleep centers in Taiwan. Centers with potential quality issues were identified by the estimated IRR.

Results:

In the intracenter assessment, the median accuracy ranged from 80.3%–83.3%, with the exception of 1 hospital, where it was 72.3%. In the intercenter assessment, the median accuracy ranged from 75.7%–83.3% when that hospital was excluded from testing and training. The performance of the proposed method was higher for the N2, awake, and REM sleep stages than for the N1 and N3 stages. The significant IRR discrepancy at that hospital suggested a quality issue, which was confirmed by the physicians in charge of its sleep center.

Conclusions:

The proposed artificial intelligence system proved effective in assessing IRR and hence the sleep center quality.

Citation:

Liu G-R, Lin T-Y, Wu H-T, et al. Large-scale assessment of consistency in sleep stage scoring rules among multiple sleep centers using an interpretable machine learning algorithm. J Clin Sleep Med. 2021;17(2):159–166.

Keywords: interrater reliability, intracenter assessments, intercenter assessments, machine learning, sleep stage scoring


BRIEF SUMMARY

Current Knowledge/Study Rationale: We searched MEDLINE for articles using the keywords “sleep stage(s)” OR “sleep staging” OR “sleep scoring” AND “automated” OR “artificial intelligence” OR “learning” OR “neural networks”. The search returned 1,347 results. By further imposing limitations (“intrarater” OR “reliability” OR “inter”), we retrieved 120 final results. Most of them evaluated the performance of proposed methods on a specific database. Most IRR-related results relied on consensus meetings, and few of them relied on machine learning.

Study Impact: Machine learning algorithms can be applied to efficiently study IRR across multiple centers and to evaluate sleep center quality. The proposed algorithm can be deployed to expand the scope of such analyses and thereby enhance quality control in sleep studies.

INTRODUCTION

Sleep is a critical factor contributing to physical and mental health, particularly in fatigue recovery and memory consolidation.1 Sleep stages are usually characterized using a polysomnography (PSG) test, during which sensors are applied to various parts of the body to monitor brain activity (electroencephalography [EEG]), eye movements (electrooculography [EOG]), and muscle activity (electromyography [EMG]). The PSG recordings taken throughout the night are then visually scored by technicians based on standards outlined by the American Academy of Sleep Medicine (AASM)2,3 or the Rechtschaffen and Kales criteria.4

Interrater reliability (IRR) is an important issue from the perspective of clinical applicability, including quality control, personnel training, and scientific research. The true sleep stage cannot be determined conclusively or consistently, even when scorers have fulfilled accreditation standards and follow the same AASM criteria. In a study of 72 patients from the SIESTA sleep database (scored using AASM and Rechtschaffen and Kales criteria4), the overall agreement among 7 experienced scorers was only 82.0% (AASM standards) and 80.6% (Rechtschaffen and Kales standards4).5 In the same SIESTA database, scoring was manually performed by 2 experts (manual scoring 1 and manual scoring 2) and automatically performed by the AASM version of the Somnolyzer system,6 and the epoch-by-epoch agreement rate between the fully automated Somnolyzer system and the manual scoring was 79%. Further, the automated scoring of each patient’s sleep stages was edited by 2 experts who independently performed a structured quality control test, resulting in semiautomated scorings 1 and 2. Only 4.3% and 3.5% of the epochs were corrected by these 2 experts. The agreement rate between manual scoring 1 (resp. 2) and semiautomated scoring 1 (resp. 2) was 82.2% (resp. 81.1%), and the agreement rate between semiautomated scorings 1 and 2 was 99%. Manual review of the automated scorings changed the classification accuracy of sleep stages N2 and N3 only marginally but improved the classification of the awake, N1, and REM stages.6

In Punjabi et al,7 the performance of the Somnolyzer system was evaluated using a different database comprising 97 patients (> 95% of whom had OSA). Each PSG was manually scored by technologists from 4 sleep laboratories. The study showed that the Somnolyzer-scored percentages of sleep stages N1 and N3 were consistently higher than the manually scored values, even when the results of automated scoring were edited by experts. In contrast, no differences were observed in the percentage of REM sleep between manual and automated scoring. No epoch-by-epoch agreement analysis was reported in that study.7

In the AASM Interscorer Reliability program (1,800 epochs scored by > 2,500 scorers), the overall agreement was 82.6%,8 with an agreement of only 67.4% for stage N3 and 63% for stage N1. Scorers had particular difficulty with 3 types of epochs: the last epoch of the awake stage before sleep onset, the first epoch of N2 after N1, and the first epoch of REM after N2. Clearly, scorers had considerable difficulty distinguishing between stages N2 and N3. Although several studies have examined the issue of IRR, working with multiple raters in multiple sleep centers remains a daunting problem: it is labor-intensive, time-consuming, and expensive. Because of these limitations, it is not easy to systematically apply IRR examination results to clinics on a large scale.

In this work, we hypothesized that a trained artificial intelligence (AI) system based on a state-of-the-art interpretable machine learning algorithm9 can be applied to efficiently evaluate the IRR of sleep stage annotation on a large scale and hence explore sleep center quality. Under the assumption that all raters follow the same AASM criteria without ambiguity, the agreement rate between the human-annotated sleep stages and those predicted by the AI system should be high and not depend on which hospital is used for training the AI system. Thus, an agreement rate discrepancy in any sleep center suggests a potential quality issue in that sleep center. The proposed AI system was applied to evaluate IRR in 6 sleep centers in Taiwan using intracenter and intercenter assessments. Intracenter assessment focused on consistency in sleep stage scoring by examining PSG recordings collected in a common sleep center. Intercenter assessment used 1 of 6 databases as the training set and the remaining databases as the testing set.

METHODS

Data source

EEG signals (C3-A2 and O2-A1), left and right EOG signals (LEOG and REOG, respectively), and EMG signals were retrospectively collected from databases in 6 sleep centers in Taiwan. The names of the sleep centers were substituted using single letters from A to F. We obtained data from 679 patients seeking sleep consultation for suspected sleep breathing disorders. Only patients with an AHI < 5 events/h were included in the study. Demographic data are listed in Table 1.

Table 1.

Demographic data of patients and other basic information retrieved from databases.

Sleep Center | n | Age (y) | AHI (events/h) | Sampling Rate (Hz) | Filter Range (Hz) | PSG Device
A | 150 | 41.5 ± 14.9 | 2.7 ± 1.5 | 200 | 0.3–35 | Alice
B | 182 | 45.5 ± 22.1 | 2.6 ± 1.4 | 512 | 0.3–30 | Compumedics
C | 60 | 44.4 ± 15.1 | 3.0 ± 1.3 | 200 | 0.3–35 | Embla
D | 93 | 42.4 ± 15.4 | 2.2 ± 1.4 | 200/500 | 0.3/1–35 | Embla
E | 111 | 41.6 ± 13.5 | 2.0 ± 1.4 | 200 | 0.3–35 | Embla
F | 83 | 47.7 ± 15.5 | 2.1 ± 1.5 | 200 | 0.3–35 | Embla

PSG = polysomnography.

The 6 sleep centers use different PSG systems. Therefore, all raw EEG data were standardized via resampling at 200 Hz followed by band-pass filtering for the removal of high-frequency (> 35 Hz) and low-frequency (< 0.3 Hz) components. Band-pass filtering was also applied to raw EOG data (0.3–30/35 Hz) and raw EMG data (10–90/100 Hz) before being used by experts to visually score sleep stages. Note that the PSG recordings in each database had previously undergone visual scoring based on AASM standards3 by experienced scorers in the original sleep centers. A total of 26 certified scorers from the 6 sleep centers participated in this project. Table 2 lists data related to the number of scorers in each sleep center and their work experience in years since being licensed. It also lists the distribution of sleep stages. Note that the proportions of REM and N3 were lower than those observed in healthy populations because these patients were suspected of having sleep breathing disorders.
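As a rough illustration of this standardization step (not the authors’ MATLAB pipeline), the sketch below resamples one raw EEG channel to 200 Hz and applies a 0.3–35 Hz band-pass filter using SciPy; the function name, filter type, and filter order are our own choices.

```python
import numpy as np
from math import gcd
from scipy.signal import butter, filtfilt, resample_poly

def standardize_eeg(raw, fs_in, fs_out=200, band=(0.3, 35.0), order=4):
    """Resample a raw EEG channel to fs_out Hz and band-pass filter it.

    Illustrative sketch only; the study used MATLAB and may have used
    different filter types and orders.
    """
    # Polyphase resampling (e.g., 512 Hz -> 200 Hz uses up=25, down=64).
    g = gcd(int(fs_in), int(fs_out))
    x = resample_poly(raw, int(fs_out) // g, int(fs_in) // g)

    # Zero-phase Butterworth band-pass (0.3-35 Hz for EEG).
    nyq = fs_out / 2.0
    b, a = butter(order, [band[0] / nyq, band[1] / nyq], btype="bandpass")
    return filtfilt(b, a, x)

# Example: a 512 Hz recording (as in center B) standardized to 200 Hz.
raw = np.random.randn(512 * 30)            # one 30-second epoch of fake data
eeg_200hz = standardize_eeg(raw, fs_in=512)
```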

Table 2.

Number of scorers in each sleep center, their work experience since being licensed (units: years), and distribution of sleep stages for each sleep center’s database.

Sleep Center | Number of Scorers | Range of Seniority (y) | Awake (%) | REM (%) | N1 (%) | N2 (%) | N3 (%)
A | 2 | 6–7 | 17 | 13 | 11 | 50 | 10
B | 6 | 0–7 | 27 | 12 | 8 | 42 | 10
C | 2 | 0–3 | 20 | 16 | 4 | 54 | 6
D | 7 | 0–10 | 26 | 9 | 9 | 51 | 5
E | 4 | 4–13 | 15 | 14 | 9 | 44 | 18
F | 5 | 2–13 | 23 | 12 | 9 | 49 | 7

With the exception of age and AHI, all personal information was deidentified from the raw signal files. The study protocol was approved by the institutional review board (IRB) of each hospital (Chang Gung Memorial Hospital IRB number 201800609B0; MacKay Memorial Hospital IRB number 18MMHIS142e; Shuang-Ho Hospital IRB numbers N201911007, N201903142; Taipei Tzu Chi Hospital IRB number 07-XD-083; Taichung Tzu Chi Hospital IRB number REC107-37; Taipei Veterans General Hospital IRB number 2018-12-009AC).

Automated annotation system

We adopted a state-of-the-art automated annotation algorithm. The algorithm comprises a feature extraction module and a learning module. Feature extraction uses 2 modern signal processing tools: the scattering transform10,11 for extracting features from the 2 EEG channels, LEOG, and REOG, and canonical correlation analysis for reducing the dimensionality of the extracted features. Based on AASM standards, sleep is scored in 30-second epochs; therefore, we applied the scattering transform9–11 to extract features from 30-second EEG and EOG epochs. The scattering transform was inspired by the convolutional neural network. In addition, we extracted time-domain features from the 30-second EOG signals, including the mean and variance of the amplitudes, the kurtosis,12 and the Hjorth activity and mobility parameters.13,14 For each 30-second EMG signal, the 95% quantile of the envelope amplitudes was used as a feature because EMG activity decreases during the REM stage compared with the activity while awake.15 Furthermore, we applied the short-time Fourier transform to compute the spectral energy of the EMG signal and used it as a feature.16 The features extracted from the EEG, EOG, and EMG signals were then combined using canonical correlation analysis,17 with the aim of removing weakly correlated factors from each pair of feature sets.
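The scattering-transform features come from the cited toolbox, so they are not reproduced here; the simpler time-domain and spectral quantities named above can, however, be sketched directly. The following Python version is illustrative only (the study used MATLAB), and the helper names, Hilbert-envelope choice, and STFT window length are assumptions.

```python
import numpy as np
from scipy.signal import hilbert, stft
from scipy.stats import kurtosis

def eog_time_features(eog_epoch):
    """Mean, variance, kurtosis, and Hjorth activity/mobility of a 30-s EOG epoch."""
    x = np.asarray(eog_epoch, dtype=float)
    dx = np.diff(x)
    return {
        "mean": np.mean(x),
        "variance": np.var(x),
        "kurtosis": kurtosis(x),
        "hjorth_activity": np.var(x),                      # Hjorth activity = variance
        "hjorth_mobility": np.sqrt(np.var(dx) / np.var(x)),  # Hjorth mobility
    }

def emg_features(emg_epoch, fs=200):
    """95% quantile of the EMG envelope plus short-time Fourier spectral energy."""
    x = np.asarray(emg_epoch, dtype=float)
    envelope = np.abs(hilbert(x))                          # amplitude envelope
    q95 = np.quantile(envelope, 0.95)
    _, _, Z = stft(x, fs=fs, nperseg=fs)                   # 1-second STFT windows
    spectral_energy = np.sum(np.abs(Z) ** 2)
    return {"envelope_q95": q95, "stft_energy": spectral_energy}

# Example usage on one fake 30-second epoch sampled at 200 Hz.
epoch = np.random.randn(200 * 30)
print(eog_time_features(epoch))
print(emg_features(epoch))
```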

Figure 1 shows the flowchart of the sensor fusion designed for the binary classification of REM vs N1 and the binary classification of N2 vs N3. For technical details, please see the supplemental material. Unlike supervised learning algorithms such as convolutional neural networks, label information is not used during feature extraction, so this step of the algorithm is unsupervised in nature. The learning module of the algorithm is based on a kernel support vector machine.18 Expert-determined sleep stages and the fused features are used to find hyperplanes within the feature space that separate the features corresponding to different sleep stages.19
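A minimal sketch of the fusion-and-learning step, assuming precomputed per-epoch feature blocks: canonical correlation analysis projects a pair of feature blocks onto their most correlated directions, and a kernel support vector machine is then fit on the fused features against expert labels. This outlines the idea with scikit-learn rather than reproducing the authors’ MATLAB implementation, and all dimensions and data below are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder feature blocks: one row per 30-s epoch (e.g., scattering features
# from EEG and from EOG); labels are expert-scored stages.
n_epochs = 800
eeg_feats = rng.normal(size=(n_epochs, 40))
eog_feats = rng.normal(size=(n_epochs, 30))
stages = rng.integers(0, 5, size=n_epochs)   # 0=awake, 1=REM, 2=N1, 3=N2, 4=N3

# CCA keeps the directions along which the two feature blocks are most
# correlated, discarding weakly correlated components (dimension reduction).
cca = CCA(n_components=10)
eeg_c, eog_c = cca.fit_transform(eeg_feats, eog_feats)
fused = np.hstack([eeg_c, eog_c])

# A kernel SVM then finds separating hyperplanes in the fused feature space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(fused[:600], stages[:600])
print("held-out accuracy:", clf.score(fused[600:], stages[600:]))
```

Note that the CCA step never looks at the stage labels, which is what keeps the feature extraction unsupervised; only the support vector machine uses the expert annotations.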

Figure 1. Flowchart of the sensor fusion designed for the binary classification of REM vs N1 and the binary classification of N2 vs N3.


For the explanation of this flowchart, see the supplemental material. CCA = canonical correlation analysis, EEG = electroencephalography, EMG = electromyography, EOG = electrooculography, ST = scattering transform, STFT = short-time Fourier transform.

All experiments were performed using MATLAB R2016a (MathWorks, Natick, Massachusetts) on a desktop PC with an Intel Core i7-7820 3.6 GHz CPU and 64 GB of RAM. The implementation of the scattering transform can be downloaded from https://www.di.ens.fr/data/software/scatnet/download. Other packages for feature extraction and sensor fusion can be downloaded from https://github.com/TIDIScode/intra_center_assessment.

Intracenter and intercenter assessment

In intracenter and intercenter assessments, the IRR was evaluated from PSG recordings by comparing the sleep stages annotated by experts and those predicted by the trained AI system under the assumption that the raters followed precisely the same AASM standards.3

In the intracenter assessment, 1 patient from each database was selected at random to form a testing set, and the remaining patients in the same database formed the training set. This method is referred to as the leave-one-subject-out cross-validation (LOSOCV) scheme, which has previously proven effective in preventing overfitting when constructing the prediction model.
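A possible way to express this scheme is sketched below with scikit-learn’s LeaveOneGroupOut, treating each patient’s epochs as one group; this is an assumption about tooling (the study used MATLAB), not the published code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

def intracenter_losocv(features, stages, patient_ids):
    """Leave-one-subject-out cross-validation within a single center.

    features:    (n_epochs, n_features) fused features
    stages:      (n_epochs,) expert-scored stages
    patient_ids: (n_epochs,) patient identifier of each epoch
    Returns per-patient epoch-by-epoch agreement (accuracy) with the experts.
    """
    accuracies = []
    logo = LeaveOneGroupOut()
    for train_idx, test_idx in logo.split(features, stages, groups=patient_ids):
        clf = SVC(kernel="rbf", gamma="scale")
        clf.fit(features[train_idx], stages[train_idx])
        accuracies.append(clf.score(features[test_idx], stages[test_idx]))
    return np.array(accuracies)

# Hypothetical usage:
# acc = intracenter_losocv(fused_features, expert_stages, patient_ids)
# print(np.median(acc), np.percentile(acc, 75) - np.percentile(acc, 25))
```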

In the intercenter assessment, as depicted in Figure 2, the database from 1 sleep center (selected at random, referred to as database/hospital E) was used to train the model, which was then used to predict sleep stages in the remaining 5 databases. Unlike the intracenter assessment, the intercenter assessment involved pronounced variations in the PSG machines, operating parameters, and raters. Discrepancies in the PSG devices and setup were handled by standardizing the sampling rate and spectral range.
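Analogously, the intercenter protocol amounts to fitting one model on all patients of a training center and reporting per-patient accuracy on every other center; the sketch below uses hypothetical data structures (a dict of per-center feature, label, and patient-ID arrays) and is not the published code.

```python
import numpy as np
from sklearn.svm import SVC

def intercenter_assessment(centers, train_center):
    """Train on one center's database and test on every other center.

    centers: dict mapping center name -> (features, stages, patient_ids)
    Returns {test_center: array of per-patient accuracies}.
    """
    X_tr, y_tr, _ = centers[train_center]
    clf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)

    results = {}
    for name, (X, y, pids) in centers.items():
        if name == train_center:
            continue
        accs = [clf.score(X[pids == p], y[pids == p]) for p in np.unique(pids)]
        results[name] = np.array(accs)
    return results

# Hypothetical usage:
# results = intercenter_assessment(centers, train_center="E")
# for name, accs in results.items():
#     print(name, np.median(accs))
```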

Figure 2. Scenario of the intercenter assessment.


AI = artificial intelligence, PSG = polysomnography.

Performance measurement

IRR evaluations were expressed using a range of common performance metrics, including the median and interquartile range20,21 of accuracies, sensitivity, precision, and F1 score. All scores ranged from 0 (low performance) to 1 (high performance). See the supplemental material for details pertaining to these quantities.
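For reference, the per-stage sensitivity, precision, and F1 score follow from the epoch-by-epoch confusion of predicted and expert stages, and the accuracy summary is the median and interquartile range over patients. The sketch below computes these with scikit-learn under assumed input structures (lists of per-patient label arrays); it is an illustration, not the authors’ evaluation code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize_performance(per_patient_true, per_patient_pred,
                          stages=("W", "REM", "N1", "N2", "N3")):
    """Median/IQR of per-patient accuracy plus per-stage sensitivity, precision, F1."""
    accs = np.array([accuracy_score(t, p)
                     for t, p in zip(per_patient_true, per_patient_pred)])
    median_acc = np.median(accs)
    iqr_acc = np.percentile(accs, 75) - np.percentile(accs, 25)

    # Pool epochs across patients for the per-stage metrics.
    y_true = np.concatenate(per_patient_true)
    y_pred = np.concatenate(per_patient_pred)
    precision, sensitivity, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(stages), zero_division=0
    )
    return median_acc, iqr_acc, dict(zip(stages, zip(sensitivity, precision, f1)))
```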

RESULTS

Intracenter assessment

Table 3 summarizes the median and interquartile range of the accuracy values. The diagonal entries, ie, (Training Set, Testing Set) = (A, A), (B, B), …, (F, F), represent the results of the intracenter assessment. The median accuracy ranged from 80.3%–83.3% in all databases except for database E, where it was 72.3%. Figure 3 uses radar charts to present the sensitivity, precision, and F1 score for each sleep stage. The proposed algorithm was more effective in identifying the N2, awake, and REM stages than in identifying the N1 and N3 stages. The difference in the performance of the algorithm when applied to database E can be attributed to the fact that the sensitivity and precision were relatively high for N3 and relatively low for N2. This issue is discussed in the next section (see the explanation of Figure 4).

Table 3.

Median and IQR of accuracy (units: %).

Training Set \ Testing Set | A | B | C | D | E | F
A | 83.3 (9.7) | 75.7 (14.0) | 79.7 (14.6) | 78.6 (15.3) | 64.4 (18.5) | 80.6 (8.1)
B | 77.8 (12.6) | 80.4 (11.5) | 82.1 (11.4) | 77.1 (12.1) | 67.2 (17.9) | 79.6 (8.8)
C | 79.0 (11.6) | 78.9 (13.9) | 83.1 (12.1) | 78.3 (14.1) | 66.4 (17.5) | 79.7 (10.0)
D | 81.5 (11.4) | 78.2 (15.3) | 82.8 (8.0) | 80.3 (13.0) | 65.7 (20.3) | 80.5 (10.4)
E | 74.3 (11.7) | 67.1 (17.6) | 69.8 (17.1) | 67.9 (13.3) | 72.3 (12.7) | 71.5 (15.0)
F | 82.9 (10.2) | 78.4 (13.7) | 82.8 (14.3) | 79.2 (12.1) | 68.5 (20.2) | 81.7 (7.7)

The diagonal part indicates the results for the intracenter assessment, whereas the nondiagonal part indicates the results for the intercenter assessment. The values inside the parentheses represent the IQR. IQR = interquartile range.

Figure 3. Radar charts obtained from leave-one-subject-out cross-validation.


The sensitivity, precision, and F1 score (units: %) for each stage are marked using □, ○, and Δ, respectively.

Intercenter assessment

Differences in PSG scoring among hospitals were compared by using the model trained on 1 database to predict sleep stages in the other databases. There were a total of 30 pairing choices, depending on the selection of databases used for training and testing. The off-diagonal entries of Table 3, ie, (Training Set, Testing Set) = (A, B), (A, C), …, (A, F), (B, A), (B, C), …, (B, F), etc., represent the median and interquartile range of the accuracy for these pairings. For example, when database A was used for training, the median accuracy of predictions ranged from 75.7%–80.6% for all remaining databases except for that of hospital E, for which it was 64.4%. We observed a drop in the median accuracy to < 74% when database E was used for testing or training. The radar charts in Figure 4 indicate the sensitivity, precision, and F1 score for the model trained using database A. When applied to databases B, C, D, and F, the performance of this model was higher in detecting awake, N2, and REM epochs than in detecting N1 and N3 epochs.

Figure 4. Radar charts used for intercenter assessment.


The prediction model trained using database A was tested on databases B, C, D, E, and F. The sensitivity, precision, and F1 score (units: %) for each stage are marked with □, ○, and Δ, respectively. The notation X→Y indicates that the model was trained using database X and tested on database Y.

When database E was used for testing, the shape of the corresponding radar chart in Figure 4 has important implications. The low sensitivity for N3 (red line) implies that many of the epochs scored as N3 by the technicians in sleep center E were assigned to other sleep stages by the model trained using database A. The high precision in the identification of N3 epochs (green line) implies that epochs predicted as N3 were likely also manually scored as N3 by the technicians in sleep center E. Together, the results in Table 3 and Figure 4 indicate quality issues in sleep stage scoring in sleep center E.

The average time required to train each database was 4.3 ± 3.7 hours, depending on the size. The average time required to test the recordings of each patient was 105 ± 3 seconds, depending on the duration.

Our final analysis of prediction accuracy was performed using 2 prediction models trained on 2 different databases. At the top of Figure 5, "X vs Y" indicates that databases X and Y were used to train the 2 models, which were then tested on database Z, where X, Y, Z ∈ {A, B, C, D, E, F}. LOSOCV was used when X = Z or Y = Z. For each comparison, we applied a 1-way analysis of variance to test the null hypothesis that the mean accuracies of the models trained using different databases were the same against the alternative hypothesis that the mean accuracies were different. The P values were adjusted using the Bonferroni correction to account for the multiple-testing issue; P values < .05/C(6,2) = .05/15 ≈ .0033 are marked with an asterisk (*). Figure 5 shows that regardless of which database was used for testing, most of the P values in the 5 columns under "A vs E," "B vs E," "C vs E," "D vs E," and "E vs F" were < .05. Overall, these results show that the performance of the model trained using database E differed significantly from that of the models trained using databases A, B, C, D, or F.
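In outline, this comparison is a one-way ANOVA on the per-patient accuracies of two models evaluated on a common testing database, with the significance threshold tightened to .05/15 because there are C(6,2) = 15 training-set pairs. The sketch below uses SciPy with placeholder accuracy values.

```python
import numpy as np
from math import comb
from scipy.stats import f_oneway

# Per-patient accuracies on a common testing database Z for models trained
# on databases X and Y (placeholder values, not study data).
acc_model_X = np.array([0.82, 0.79, 0.85, 0.78, 0.81])
acc_model_Y = np.array([0.66, 0.71, 0.64, 0.69, 0.72])

# One-way ANOVA: are the mean accuracies of the two models equal?
stat, p_value = f_oneway(acc_model_X, acc_model_Y)

# Bonferroni-corrected threshold over the C(6,2) = 15 training-set pairings.
alpha = 0.05 / comb(6, 2)
print(f"F = {stat:.2f}, p = {p_value:.4f}, significant: {p_value < alpha}")
```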

Figure 5. The P values used for indicating significant differences between database E and the other databases.


ANOVA = analysis of variance.

The above analysis results suggest that hospital E has a quality control problem. To further confirm whether the proposed IRR evaluation AI system does help identify hospitals with quality control issues, we consulted the physicians in charge of the sleep center of hospital E. An affirmative answer was obtained, confirming that some technicians have very different sleep stage interpretations than others.

DISCUSSION

This study has shown the efficacy of using AI to evaluate IRR across multiple hospitals and to identify sleep centers with potential quality issues. In the intracenter analysis, the agreement between manual scoring and predicted values ranged from 80.3%–83.3%, except for hospital E. In the intercenter analysis, the training and testing data were obtained from different PSG devices and the labeling was performed by different technicians; therefore, the agreement between manual scoring and AI-predicted stages ranged from 75.7%–83.3%, as long as the results from hospital E were excluded. This finding suggests that hospital E has quality issues, which coincides with clinical feedback. We believe that if a sleep center cannot properly score the sleep stages of patients without sleep apnea, then its other annotations are also likely to be problematic. In other words, our AI system is designed to help identify the worst-case scenario. A systematic examination of hospital E, including automatically identifying particular technicians and the subsequent retraining needed to restore quality, will be presented in future work.

Individual raters differ in their interpretation of sleep stages, and the raw signals from different hospitals were recorded using different types of PSG devices and different operational parameters. Therefore, even when the sampling rate and frequency range were standardized before analysis, the agreement rates dropped by approximately 4%–5% from the intracenter assessment to the intercenter assessment. In comparison with previous studies related to IRR, the intercenter assessment discussed in this work helps sleep researchers and clinicians understand what may happen if the scoring rule learned from the database of one sleep center is applied to a new database. The proposed models proved far more effective in detecting awake, N2, and REM epochs than in detecting N1 and N3 epochs, possibly because the identification of REM and N2 relies on objective markers from the EMG, the EOG, the K-complex, and the sleep spindle. This finding may also result from variations in the analysis of alpha waves appearing during the transition from wake to sleep. Under AASM standards,3 the N3 stage is characterized by slow brain waves (ie, delta waves) that account for ≥ 20% of a 30-second epoch. If the proportion of slow waves within a 30-second epoch is < 20%, then the epoch is scored as N2. Note that it can be very difficult to follow this guideline precisely, with the result that N2 and N3 epochs are commonly misclassified.
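To make the 20% rule concrete, the following rough sketch estimates the slow-wave occupancy of a 30-second epoch, assuming the AASM definition of slow waves (0.5–2 Hz activity with peak-to-peak amplitude > 75 µV); the zero-crossing detection logic is a simplification of ours, not part of any scoring software.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def slow_wave_fraction(eeg_epoch_uv, fs=200):
    """Approximate fraction of a 30-s epoch occupied by slow waves (0.5-2 Hz, >75 uV p-p)."""
    nyq = fs / 2.0
    b, a = butter(3, [0.5 / nyq, 2.0 / nyq], btype="bandpass")
    x = filtfilt(b, a, np.asarray(eeg_epoch_uv, dtype=float))

    # Split the filtered signal into half-waves at zero crossings.
    crossings = np.where(np.diff(np.signbit(x)))[0]
    slow_wave_samples = 0
    for start, end in zip(crossings[:-1], crossings[1:]):
        half_period = (end - start) / fs                    # 0.25-1.0 s ~ 0.5-2 Hz full wave
        peak_to_peak = 2 * np.max(np.abs(x[start:end + 1]))  # crude p-p estimate from one half-wave
        if 0.25 <= half_period <= 1.0 and peak_to_peak > 75:
            slow_wave_samples += (end - start)
    return slow_wave_samples / len(x)

def n2_or_n3(eeg_epoch_uv, fs=200):
    """Score N3 if slow waves occupy at least 20% of the epoch, otherwise N2."""
    return "N3" if slow_wave_fraction(eeg_epoch_uv, fs) >= 0.20 else "N2"
```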

It is important to carefully consider what accuracy means within the context of AI. An AI system of outstanding accuracy may well be prone to overfitting, particularly when it is employed for scientific research. Thus, an ideal AI system should not only emphasize its high accuracy but also show how it can help human experts, acknowledge the limitations of human knowledge, and leave room for human experts to expand that knowledge. In the current study, our focus was not on pushing the accuracy of our AI system to the limit. Rather, we designed it to enable the determination of IRR in an efficient manner. Under the assumption that all raters follow the same protocol, the proposed AI system also serves to warn administrators of quality issues. When scoring abnormalities are detected (notable discrepancies between a scorer and the hypnogram generated by the algorithm), scorers could be advised to refresh their skills using the AASM Interscorer Reliability program or other courses. As a result, although the quality issue addressed in this work concerns sleep stage annotation, we believe that flagging it is sufficient to alert the particular sleep center and thereby increase overall sleep examination quality.

Scoring overnight PSG recordings is time-consuming and prone to error because of the enormous signal load.22 To resolve this problem, researchers have sought to establish AI systems that automate sleep stage annotation. Despite considerable advances in the accuracy of AI systems, their reliability is still limited. It is difficult for sleep experts to unambiguously assign labels, regardless of whether they are dealing with healthy patients or patients with sleep breathing disorders.23 Most AI systems are based on the assumption that labels are unambiguous; however, this is simply not the case in many medical situations. An ideal solution would be to have multiple sleep experts label all patients in a common database and then seek a consensus annotation with which to build labels of sufficient reliability to improve the accuracy of AI systems; however, the substantial draw on human resources renders this approach impractical. Our AI system has the potential to help in this regard. The ability to determine IRR should make it possible to formulate labels of increasing consistency over time, or human scorers could give the AI system feedback in cases of remarkable disagreement. Both options, in turn, would allow the AI system to evolve and would enhance the accuracy and reliability of AI systems.20,21,23–27 We believe that this iterative process should eventually lead to improvements in the agreement rates and the label quality and hence boost the accuracy of the AI system.

The proposed AI system can assist with several quality control problems. Evaluating the reliability of sleep centers in the scoring of sleep stages is important when conducting multicenter studies and when implementing programs involving international cooperation. This issue is of particular importance when 1 of the sleep centers has recently been established.5,22,27–32 One study compared IRR in scoring sleep stages between laboratories in Germany and China.32 Unfortunately, this type of study is labor-intensive and poses numerous difficulties in overcoming differences in laboratory setup. Moreover, achieving a consensus on sleep stage scoring within and between sleep centers usually requires consensus meetings, which are time-consuming, expensive, and inefficient. Our results show that the proposed AI system can greatly assist in this mission. In our follow-up to this work, we will examine the discrepancies between our predictions and the sleep stages scored in sleep center E. We will seek to determine the proportion of non-N3 epochs that were falsely scored as N3 and develop a quality control improvement scheme.

This study had a number of limitations. First, the present work was limited to Asian patients suspected of having sleep breathing disorders; therefore, its performance in dealing with other sleep disorders (eg, insomnia) or individuals of different ethnicities is unknown. Second, the present algorithm is based on 2-channel EEG signals, EOG, and EMG, instead of conventional PSG with at least 9 channels for sleep stage interpretation. This setup may help explain the variations in accuracy observed between hospitals. Third, in this work we evaluated the IRR only for patients without sleep apnea. It is possible that sleep experts perform well with patients without sleep apnea but perform poorly with patients with sleep apnea or with periodic limb movement disorder. Our system is therefore limited in its ability to identify sleep centers with problems examining patients with sleep apnea or periodic limb movement disorder. Note that our aim in the present study was to determine the feasibility of using AI to assess IRR using a broad PSG database comprising patients from multiple hospitals. In a future study, we will improve the automatic annotation system by incorporating automatic sleep apnea event annotation through the inclusion of additional channels.

In conclusion, the proposed AI system can be used to identify discrepancies in the scores assigned by technicians in different sleep centers and thereby identify potential quality issues. Thus, the proposed algorithm could be used as a tool to reduce interrater discrepancies, at least in sleep stage scoring. It could be applied in a single institution to ensure consistency among raters or to facilitate the assessment of results from multiple institutions. Expanding the size of databases with a solid consensus regarding the correctness of annotations will ultimately lead to scoring systems of higher accuracy and reliability.

DISCLOSURE STATEMENT

All authors have seen this manuscript and approved its submission. This study was funded by grants from the Ministry of Science and Technology, Taiwan (MOST-109-2119-M-002-014), and the Chang Gung Medical Research Program (CMRPG3K0201). The authors report no conflicts of interest.

ACKNOWLEDGMENTS

All the authors contributed to the conception and design of the study. Gi-Ren Liu, Ting-Yu Lin, Hau-Tieng Wu, Yuan-Chung Sheu, and Yu-Lun Lo were in charge of design, data analysis, and result interpretation. Ching-Lung Liu, Wen-Te Liu, Mei-Chen Yang, Yung-Lun Ni, Kun-Ta Chou, Chao-Hsien Chen, Dean Wu, Chou-Chin Lan, Kuo-Liang Chiu, Hwa-Yen Chiu, and Yu-Lun Lo provided study materials and selected the patients.

ABBREVIATIONS

AASM

American Academy of Sleep Medicine

AI

artificial intelligence

EEG

electroencephalography

EMG

electromyography

EOG

electrooculography

IRB

Institutional Review Board

IRR

interrater reliability

PSG

polysomnography

REFERENCES

1. Rasch B, Born J. About sleep’s role in memory. Physiol Rev. 2013;93(2):681–766. doi: 10.1152/physrev.00032.2012
2. Berry RB, Brooks R, Gamaldo CE, et al.; for the American Academy of Sleep Medicine. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications. Version 2.0. Darien, IL: American Academy of Sleep Medicine; 2012.
3. Iber C, Ancoli-Israel S, Chesson AL Jr, Quan SF; for the American Academy of Sleep Medicine. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications. 1st ed. Westchester, IL: American Academy of Sleep Medicine; 2007.
4. Rechtschaffen A, Kales A, eds. A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects. Washington, DC: US Government Printing Office; 1968.
5. Danker-Hopfe H, Anderer P, Zeitlhofer J, et al. Interrater reliability for sleep scoring according to the Rechtschaffen & Kales and the new AASM standard. J Sleep Res. 2009;18(1):74–84. doi: 10.1111/j.1365-2869.2008.00700.x
6. Anderer P, Moreau A, Woertz M, et al. Computer-assisted sleep classification according to the standard of the American Academy of Sleep Medicine: validation study of the AASM version of the Somnolyzer 24 × 7. Neuropsychobiology. 2010;62(4):250–264. doi: 10.1159/000320864
7. Punjabi NM, Shifa N, Dorffner G, Patil S, Pien G, Aurora RN. Computer-assisted automated scoring of polysomnograms using the Somnolyzer system. Sleep. 2015;38(10):1555–1566. doi: 10.5665/sleep.5046
8. Rosenberg RS, Van Hout S. The American Academy of Sleep Medicine inter-scorer reliability program: sleep stage scoring. J Clin Sleep Med. 2013;9(1):81–87. doi: 10.5664/jcsm.2350
9. Liu G-R, Lo Y-L, Malik J, Sheu Y-C, Wu H-T. Diffuse to fuse EEG spectra–intrinsic geometry of sleep dynamics for classification. Biomed Signal Process Control. 2020;55:101576. doi: 10.1016/j.bspc.2019.101576
10. Mallat S. Group invariant scattering. Commun Pure Appl Math. 2012;65(10):1331–1398. doi: 10.1002/cpa.21413
11. Andén J, Mallat S. Deep scattering spectrum. IEEE Trans Signal Process. 2014;62(16):4114–4128. doi: 10.1109/TSP.2014.2326991
12. Zoubek L, Charbonnier S, Lesecq S, Buguet A, Chapotot F. Feature selection for sleep/wake stages classification using data driven methods. Biomed Signal Process Control. 2007;2(3):171–179. doi: 10.1016/j.bspc.2007.05.005
13. Lu Y, Zhang C, Zhou B-Y, Gao X-P, Lv Z. A dual model approach to EOG-based human activity recognition. Biomed Signal Process Control. 2018;45:50–57. doi: 10.1016/j.bspc.2018.05.011
14. Akbar IA, Rumagit AM, Utsunomiya M, Morie T, Igasaki T. Three drowsiness categories assessment by electroencephalogram in driving simulator environment. Annu Int Conf IEEE Eng Med Biol Soc. 2017;2017:2904–2907. doi: 10.1109/EMBC.2017.8037464
15. Pan S-T, Kuo C-E, Zeng J-H, Liang S-F. A transition-constrained discrete hidden Markov model for automatic sleep staging. Biomed Eng Online. 2012;11(1):52. doi: 10.1186/1475-925X-11-52
16. Phinyomark A, Phukpattaranont P, Limsakul C. Feature reduction and selection for EMG signal classification. Expert Syst Appl. 2012;39(8):7420–7431. doi: 10.1016/j.eswa.2012.01.102
17. Hardoon DR, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 2004;16(12):2639–2664. doi: 10.1162/0899766042321814
18. Schölkopf B, Smola A. Support vector machines and kernel algorithms. In: Encyclopedia of Biostatistics. 2002:5328–5335.
19. Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2(3):27. doi: 10.1145/1961189.1961199
20. Memar P, Faradji F. A novel multi-class EEG-based sleep stage classification system. IEEE Trans Neural Syst Rehabil Eng. 2018;26(1):84–95. doi: 10.1109/TNSRE.2017.2776149
21. Tsinalis O, Matthews PM, Guo Y. Automatic sleep stage scoring using time-frequency analysis and stacked sparse autoencoders. Ann Biomed Eng. 2016;44(5):1587–1597. doi: 10.1007/s10439-015-1444-y
22. Norman RG, Pal I, Stewart C, Walsleben JA, Rapoport DM. Interobserver agreement among sleep scorers from different centers in a large dataset. Sleep. 2000;23(7):901–908. doi: 10.1093/sleep/23.7.1e
23. Boostani R, Karimzadeh F, Nami M. A comparative review on sleep stage classification methods in patients and healthy individuals. Comput Methods Programs Biomed. 2017;140:77–91. doi: 10.1016/j.cmpb.2016.12.004
24. Gudmundsson S, Runarsson T, Sigurdsson S. Automatic sleep staging using support vector machines with posterior probability estimates. Paper presented at: International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC’06); November 28–30, 2005; Vienna, Austria. doi: 10.1109/CIMCA.2005.1631496
25. Supratak A, Dong H, Wu C, Guo Y. DeepSleepNet: a model for automatic sleep stage scoring based on raw single-channel EEG. IEEE Trans Neural Syst Rehabil Eng. 2017;25(11):1998–2008. doi: 10.1109/TNSRE.2017.2721116
26. Vilamala A, Madsen K, Hansen L. Deep convolutional neural networks for interpretable analysis of EEG sleep stage scoring. Paper presented at: IEEE International Workshop on Machine Learning for Signal Processing; September 25–28, 2017; Tokyo, Japan.
27. Chen L, Ho CK, Lam VK, et al. Interrater and intrarater reliability in multiple sleep latency test. J Clin Neurophysiol. 2008;25(4):218–221. doi: 10.1097/WNP.0b013e31817f36a6
28. Danker-Hopfe H, Kunz D, Gruber G, et al. Interrater reliability between scorers from eight European sleep laboratories in subjects with different sleep disorders. J Sleep Res. 2004;13(1):63–69. doi: 10.1046/j.1365-2869.2003.00375.x
29. Kim Y, Kurachi M, Horita M, Matsuura K, Kamikawa Y. Agreement of visual scoring of sleep stages among many laboratories in Japan: effect of a supplementary definition of slow wave on scoring of slow wave sleep. Jpn J Psychiatry Neurol. 1993;47(1):91–97.
30. Wendt SL, Welinder P, Sorensen HB, et al. Inter-expert and intra-expert reliability in sleep spindle scoring. Clin Neurophysiol. 2015;126(8):1548–1556. doi: 10.1016/j.clinph.2014.10.158
31. Younes M, Kuna ST, Pack AI, et al. Reliability of the American Academy of Sleep Medicine rules for assessing sleep depth in clinical practice. J Clin Sleep Med. 2018;14(2):205–213. doi: 10.5664/jcsm.6934
32. Zhang X, Dong X, Kantelhardt JW, et al. Process and outcome for international reliability in sleep scoring. Sleep Breath. 2015;19(1):191–195. doi: 10.1007/s11325-014-0990-0
