Abstract
Study Objectives:
The American Academy of Sleep Medicine (AASM) guidelines for polysomnography (PSG) scoring are increasingly being adopted worldwide, but the agreement among international centers in scoring respiratory events and sleep stages using these guidelines is unknown. We sought to determine the interrater agreement of PSG scoring among international sleep centers.
Design:
Prospective study of interrater agreement of PSG scoring.
Setting:
Nine center-members of the Sleep Apnea Genetics International Consortium (SAGIC).
Measurements and Results:
Fifteen previously recorded deidentified PSGs, in European Data Format, were scored by an experienced technologist at each site after they were imported into the locally used analysis software. Each 30-sec epoch was manually scored for sleep stage, arousals, apneas, and hypopneas using the AASM recommended criteria. The computer-derived oxygen desaturation index (ODI) was also recorded. The primary outcome for analysis was the intraclass correlation coefficient (ICC) of the apnea-hypopnea index (AHI). The ICCs of the respiratory variables were: AHI = 0.95 (95% confidence interval: 0.91-0.98), total apneas = 0.77 (0.56-0.87), total hypopneas = 0.80 (0.66-0.91), and ODI = 0.97 (0.93-0.99). The kappa statistics for sleep stages were: wake = 0.78 (0.77-0.79), nonrapid eye movement = 0.77 (0.76-0.78), N1 = 0.31 (0.30-0.32), N2 = 0.60 (0.59-0.61), N3 = 0.67 (0.65-0.69), and rapid eye movement = 0.78 (0.77-0.79). The ICC of the arousal index was 0.68 (0.50-0.85).
Conclusion:
There is strong agreement in the scoring of respiratory events among the SAGIC centers. There is also substantial epoch-by-epoch agreement in scoring sleep variables. Our results suggest that centralized scoring of PSGs may not be necessary in future research collaboration among international sites where experienced, well-trained scorers are involved.
Citation:
Magalang UJ; Chen NH; Cistulli PA; Fedson AC; Gíslason T; Hillman D; Penzel T; Tamisier R; Tufik S; Phillips G; Pack AI. Agreement in the scoring of respiratory events and sleep among international sleep centers. SLEEP 2013;36(4):591-596.
Keywords: General, polysomnography, scoring
INTRODUCTION
Polysomnography (PSG) is the current gold standard for defining the presence and severity of obstructive sleep apnea (OSA).1,2 In 2007, the American Academy of Sleep Medicine (AASM) published revised guidelines for the scoring of PSG, including the definitions of respiratory events and sleep stages.3 Since their publication, these guidelines have been increasingly adopted internationally for scoring PSGs.
Using the AASM guidelines, a previous retrospective study found substantial interrater agreement in the scoring of sleep stages among six scorers from three sleep centers.4 A major revision within the new guidelines is a recommended definition for a hypopnea that incorporates a ≥ 4% oxyhemoglobin desaturation associated with the respiratory event.3 A previous multi-center study showed that the interrater agreement for scoring respiratory events by three sleep technologists at one centralized scoring center was excellent when respiratory indices during sleep were defined by association with an oxyhemoglobin desaturation rather than by presence of associated electroencephalographic (EEG) arousals.5
Although centralized scoring of PSGs in multicenter studies has been shown to be an excellent approach in standardizing their analysis,5 it is also more resource-intensive in terms of infrastructure (sharing of PSGs and server space) and adequate scoring personnel, particularly in research involving patients who are to have PSGs for clinical reasons. The need for these additional resources is likely to be magnified where the multicenter studies include different countries in various continents. Such studies will be increasingly needed with the proliferation of large multicenter clinical trials and studies of the genetics of sleep disorders. One alternative approach would be local scoring of the sleep studies by experienced technologists in the sleep centers where the PSGs were acquired, as is done in routine clinical practice. However, the interrater agreement of scoring respiratory events and sleep stages using the AASM guidelines among international sleep centers is unknown.
We sought to determine the agreement of scoring respiratory events and sleep stages in the nine center-members of the Sleep Apnea Genetics International Consortium (SAGIC) [http://www.med.upenn.edu/sleepctr/SAGIC.shtml]. Our primary hypothesis was that there would be good agreement in the scoring of the apnea-hypopnea index (AHI) among experienced scorers in the SAGIC centers using the new AASM definitions.
METHODS
The SAGIC centers are located in the following cities: Perth, Australia; Sydney, Australia; São Paulo, Brazil; Grenoble, France; Berlin, Germany; Reykjavik, Iceland; Taipei, Taiwan; and Columbus, Ohio and Philadelphia, Pennsylvania in the United States of America. The study was approved by the Institutional Review Board of The Ohio State University Medical Center. Informed consent was waived for this study because the PSGs were previously recorded and deidentified.
Polysomnography
Fifteen previously recorded attended, in-laboratory PSGs at the Columbus, Ohio site were chosen by a SAGIC investigator (UM). The studies were randomly selected from a database of clinical PSGs recorded during one quarter, in order to represent a wide spectrum of OSA severity based on AHI: AHI 0-20/h: n = 5; AHI 21-30/h: n = 5; AHI > 30/h: n = 5. Once the required number of studies in each level of OSA severity had been selected, no further studies in the specified AHI range were included. Exclusion criteria included split-night and titration studies, those done while on oxygen supplementation, and less than 4 h of recording time. Once the 15th sleep study was selected, no further sleep studies were evaluated. During this process, there were no studies excluded on the basis of poor signal quality.
All PSGs were originally recorded using an N-7000 amplifier (Embla Systems, LLC, Ontario, Canada). The sampling rates of the channels were: EEG, electrooculogram (EOG), electromyogram (EMG), electrocardiogram (ECG), snore, thermocouple airflow, and nasal pressure: 200 Hz; chest and abdominal movement: 50 Hz; and pulse oximetry: 10 Hz. The pulse oximeter (Nonin Medical, Inc, Plymouth, MN) used in acquiring the PSGs used an averaging algorithm of 3 sec or faster for pulse rates of 60 beats per min or greater. The PSGs were deidentified and then converted into European Data Format (EDF). EDF conversion removed all previous scoring of sleep stages, respiratory events, and sleep technologist notations because the format contains only the biosignals and calibration information.
Scoring
At each SAGIC site, the EDF studies were imported into the local software used for scoring. The scoring software included Remlogic (Embla Systems, LLC, Ontario, Canada) [Grenoble, Berlin, and São Paulo, Brazil sites]; Compumedics (Compumedics Limited, Victoria, Australia) [Perth, Sydney, and Taipei, Taiwan sites]; Sandman (Embla Systems, LLC, Ontario, Canada) [Philadelphia, PA site]; Rembrandt (Embla Systems, LLC, Ontario, Canada) [Columbus, OH site]; and Somnologica (Embla Systems, LLC, Ontario, Canada) [Reykjavik, Iceland site]. A protocol including a review of the AASM guidelines on scoring of sleep and respiratory events (in Microsoft PowerPoint) was provided to one experienced scorer from each SAGIC site, who performed the scoring of the imported PSGs. All scorers had at least 5 yr of experience in scoring clinical PSGs and were designated by the investigator at each site. The recording start time, recording end time, lights-off time, and lights-on time were provided for each study. The channel names of the different signals as well as their derivations were also provided: EEG: F4, C4, O2 all referenced to M1 and F3, C3, O1 all referenced to M2; EOG: E1-M2, E2-M2; chin EMG; snore channels; nasal pressure; thermocouple oronasal airflow; chest and abdominal movements; oxyhemoglobin saturation and plethysmographic signal by pulse oximetry; electrocardiogram; and leg EMG. Each 30-sec epoch was manually scored for sleep stage, apneas, hypopneas, and arousals using the AASM recommended criteria.3 Briefly, an apnea is scored when all of the following criteria are met: a drop in thermal sensor excursion by ≥ 90% of baseline, the event lasts at least 10 sec, and at least 90% of the event's duration meets the amplitude reduction criteria.
An apnea is classified as obstructive if it is associated with continued or increased respiratory effort throughout the entire period of absent airflow, central if it is associated with absent respiratory effort throughout the entire period of absent airflow, and mixed if it is associated with absent respiratory effort in the initial portion of the event, followed by resumption of respiratory effort in the second portion of the event. A hypopnea is scored if the following criteria are met: the nasal pressure signal excursions drop by ≥ 30% of baseline, there is a ≥ 4% desaturation from pre-event baseline, and at least 90% of the event's duration meets the amplitude reduction criteria. The computer-derived oxygen desaturation index (ODI), defined as the number of oxygen desaturations ≥ 4% per h of sleep, was also obtained. All scorers recorded the following variables for each PSG: total sleep time (TST); minutes of Stages N1, N2, N3, nonrapid eye movement (NREM), and Stage R (rapid eye movement, REM); arousal index; AHI; number of apneas; number of obstructive, central, and mixed apneas; number of hypopneas; and ODI. Scoring of sleep stages and arousals followed the new AASM guidelines.3 Using the local scoring software available at each site, the epoch-by-epoch sleep stage scoring of each PSG was exported into a text file and then compiled into a Microsoft Excel spreadsheet by a sleep scoring quality assurance program (QSleep, Victoria, Australia).
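As an illustration only, the scoring rules summarized above and the derived indices can be sketched in Python. The class, field, and function names and the effort-pattern labels are hypothetical and are not part of the study protocol or of any scoring software. The hypopnea check mirrors only the criteria as listed in this section (the AASM manual remains the authority for the full rule), treating apneas and hypopneas as mutually exclusive is an assumption, and the AHI is taken here as apneas plus hypopneas per hour of sleep.

```python
from dataclasses import dataclass


@dataclass
class CandidateEvent:
    """One candidate respiratory event (hypothetical structure, for illustration)."""
    thermal_drop_pct: float         # drop in thermal sensor excursion vs. baseline
    nasal_pressure_drop_pct: float  # drop in nasal pressure excursion vs. baseline
    duration_sec: float
    fraction_meeting_drop: float    # share of the event meeting the amplitude criterion
    desaturation_pct: float         # fall in saturation from pre-event baseline
    effort: str = "continued"       # "continued", "absent", or "absent_then_resumed"


def is_apnea(ev: CandidateEvent) -> bool:
    # All three apnea criteria summarized in the text must hold.
    return (ev.thermal_drop_pct >= 90
            and ev.duration_sec >= 10
            and ev.fraction_meeting_drop >= 0.90)


def is_hypopnea(ev: CandidateEvent) -> bool:
    # Checks only the criteria listed in this section; an event already
    # scored as an apnea is not also counted as a hypopnea (assumption).
    return (not is_apnea(ev)
            and ev.nasal_pressure_drop_pct >= 30
            and ev.desaturation_pct >= 4
            and ev.fraction_meeting_drop >= 0.90)


def apnea_type(ev: CandidateEvent) -> str:
    # Maps the respiratory effort pattern during absent airflow to the apnea class.
    return {"continued": "obstructive",
            "absent": "central",
            "absent_then_resumed": "mixed"}[ev.effort]


def events_per_hour(n_events: int, tst_min: float) -> float:
    # Generic index: events per hour of total sleep time (TST given in minutes).
    if tst_min <= 0:
        raise ValueError("total sleep time must be positive")
    return n_events / (tst_min / 60.0)


def ahi(n_apneas: int, n_hypopneas: int, tst_min: float) -> float:
    return events_per_hour(n_apneas + n_hypopneas, tst_min)
```

The same `events_per_hour` helper would give an ODI when passed the number of ≥ 4% desaturations, although in the study the ODI was derived by each site's software rather than by hand.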
Sample Size
The primary outcome for the interrater agreement analysis was the intraclass correlation coefficient (ICC) of the AHI. Given nine sleep scorers (one at each site), the 15 PSGs had a power of 83% to detect an ICC of at least 0.90, assuming a null hypothesis of ICC = 0.70.
Statistical Analysis
The interrater reliability measures used to examine the agreement among the nine different scorers were the ICC for continuous variables (respiratory indices, duration of sleep stages, and arousal index) and the kappa (κ) statistic for multiple raters for the categorical variables (sleep stages). The levels of agreement using the ICCs of respiratory indices, duration of sleep stages, and arousal index were classified as follows: 0.00-0.25 = little, 0.26-0.49 = low, 0.50-0.69 = moderate, 0.70-0.89 = strong, 0.90-1.00 = very strong.6–8 For the κ statistic for multiple raters, the levels of agreement were classified as follows: < 0 = no agreement, 0-0.20 = slight agreement, 0.21-0.40 = fair agreement, 0.41-0.60 = moderate agreement, 0.61-0.80 = substantial agreement, 0.81-1.0 = almost perfect agreement.9 Data analyses were performed using Stata software version 12 (StataCorp, LP, College Station, TX).
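The following Python sketch shows one way an ICC and the agreement bands above could be computed. The text does not state which ICC form was used in Stata, so the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), is an assumption made for illustration, as are all function names. Rounding to two decimals before banding and placing negative ICCs in the lowest band are also choices of this sketch, not of the paper.

```python
def icc_2_1(scores):
    """ICC(2,1) from scores[i][j]: value for recording i given by scorer j."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: recordings (rows), scorers (columns), residual.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def icc_level(icc):
    """Agreement band for an ICC, using the cut points quoted in the text."""
    v = round(icc, 2)
    if v < 0.26:
        return "little"   # negative values also land here (assumption)
    if v < 0.50:
        return "low"
    if v < 0.70:
        return "moderate"
    if v < 0.90:
        return "strong"
    return "very strong"


def kappa_level(kappa):
    """Agreement band for a kappa statistic, using the cut points quoted in the text."""
    v = round(kappa, 2)
    if v < 0:
        return "no agreement"
    if v <= 0.20:
        return "slight agreement"
    if v <= 0.40:
        return "fair agreement"
    if v <= 0.60:
        return "moderate agreement"
    if v <= 0.80:
        return "substantial agreement"
    return "almost perfect agreement"
```

For the study design, `scores` would be a 15 × 9 table (one row per PSG, one column per scorer) of a continuous variable such as the AHI.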
RESULTS
Respiratory Event Scoring
The scored AHI ranged from 0 to 70.9 events/h of sleep. The interscorer agreement of AHI scoring (Table 1) among the SAGIC centers was very strong, with an ICC of 0.95 (95% confidence interval [CI] 0.91-0.98). There was strong agreement in the scoring of the total number of apneas as well as the total number of hypopneas. Table 1 displays the means and the standard deviations (SD) of the scoring of respiratory events in all 15 PSGs. One site (sleep center number six) tended to score the respiratory events as hypopneas, whereas another (sleep center number eight) tended to score the respiratory events as apneas (Table 1). The agreement of ODI scoring was also very strong with an ICC of 0.97 (95% CI 0.93-0.99).
Table 1.
Figure 1 displays all 15 PSG respiratory event scoring results at the different sites. As expected from the ICC values, there were more variations in the scoring of the total apneas and hypopneas compared with the AHI.
The AHI scores were highly correlated with the ODI scores (r = 0.96, P < 0.0001). Using the Bland-Altman method, the mean difference between AHI and ODI scores was +2.1 events/h (limits of agreement: +12.3 and -8.1 events/h).
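The reported limits of agreement are symmetric about the mean difference (2.1 ± 10.2 events/h), which is what the usual Bland-Altman calculation of mean difference ± 1.96 standard deviations would give. A minimal Python sketch of that calculation follows; the function name and the 1.96 multiplier are assumptions for illustration, since the text does not give the exact computation used.

```python
from statistics import mean, stdev


def bland_altman(a, b, z=1.96):
    """Return (bias, lower limit, upper limit) for paired measurements a and b.

    bias is the mean of the paired differences a - b; the limits of
    agreement are bias -/+ z times the sample standard deviation of
    those differences.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = z * stdev(diffs)
    return bias, bias - spread, bias + spread
```

Here `a` would hold the scored AHI values and `b` the corresponding ODI values, one pair per scored study.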
Sleep Stage Scoring
Table 2 shows the epoch-by-epoch (n = 12,712) agreement in scoring Stages W, NREM, and REM sleep among the nine scorers. There was substantial agreement in the scoring of the major sleep-wake states with κ = 0.78 (95% CI 0.77-0.78). In Table 3, the κ values are shown with NREM sleep separated into different stages. The scoring of Stages N1 and N2 showed only fair (κ = 0.31) and moderate (κ = 0.60) agreements respectively, although there was still substantial agreement in the scoring of all sleep stages with κ = 0.63 (95% CI 0.62-0.63).
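To make the two analyses above concrete, the sketch below computes a multi-rater kappa on epoch labels, once with all five stages and once with N1, N2, and N3 collapsed into NREM. It uses Fleiss' formulation of kappa for multiple raters, which is an assumption: the text says only that a kappa statistic for multiple raters was computed in Stata. The function names, stage labels, and collapse map are hypothetical.

```python
from collections import Counter

# Hypothetical mapping used to merge N1-N3 into a single NREM category.
NREM_COLLAPSE = {"W": "W", "N1": "NREM", "N2": "NREM", "N3": "NREM", "R": "REM"}


def stage_counts(epoch_labels, categories, mapping=None):
    """Turn per-epoch scorer labels into per-epoch category counts.

    epoch_labels[i] is the list of stages assigned to epoch i, one per scorer.
    """
    rows = []
    for labels in epoch_labels:
        if mapping is not None:
            labels = [mapping[label] for label in labels]
        tally = Counter(labels)
        rows.append([tally[c] for c in categories])
    return rows


def fleiss_kappa(counts):
    """Fleiss' kappa from counts[i][j]: scorers assigning epoch i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every epoch must be scored by the same number of raters")

    # Observed agreement: proportion of agreeing scorer pairs, averaged over epochs.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the overall share of assignments in each category.
    n_cats = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)
```

Calling `stage_counts` with `mapping=NREM_COLLAPSE` and categories `["W", "NREM", "REM"]` corresponds to the coarse wake/NREM/REM comparison, while calling it without a mapping and with all five stages corresponds to the finer comparison, where disagreements among N1, N2, and N3 lower the statistic.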
Table 2.
Table 3.
Table 4 shows the agreement of the sleep stage duration (min) and arousal index scoring. There was strong agreement in the scoring of the duration of TST, NREM, and REM sleep, but little to low agreement in the scoring of the duration of Stage N1 and N2. There was moderate agreement in the scoring of the arousal index with an ICC of 0.68 (95% CI: 0.49-0.85).
Table 4.
DISCUSSION
This study investigated the agreement of scoring respiratory events and sleep stages among experienced scorers using the AASM definitions in nine international sites (members of SAGIC). The major findings of this study are: (1) there is very strong agreement in the scoring of the AHI among experienced sleep technologists using the AASM recommended definitions of respiratory events; (2) there is strong agreement in the scoring of the total number of apneas and hypopneas; and (3) there is substantial epoch-by-epoch agreement in scoring wake, NREM, and REM sleep. The computer-assisted scoring of the ODI also showed very strong agreement despite the use of different scoring software to derive this index. To our knowledge, this is the first study examining the interscorer agreement of PSG scoring of respiratory events and sleep among international sleep centers based on the 2007 AASM guidelines. The issue of agreement is important not only to researchers who are planning multicenter collaborative efforts but also to clinicians, because such information is essential for evaluating and interpreting findings. The use of different scoring software across the nine centers is a strength of this study because it poses a greater challenge to good agreement and reflects real-world practice.
A previous study reported by Whitney and colleagues5 showed that the interrater agreement of scoring respiratory indices by three sleep technologists at one centralized scoring center was excellent when respiratory indices were defined in association with an oxyhemoglobin desaturation rather than the presence of associated EEG arousals. Our study extends this finding to international sleep centers and showed that using the 2007 AASM recommended scoring guidelines (which base the hypopnea definition on the occurrence of a ≥ 4% oxyhemoglobin desaturation), there is very strong agreement in AHI scoring (ICC = 0.95; 95% CI 0.91-0.98). This is similar to the ICC of 0.99 (95% CI not reported) in the prior study of three scorers using the same hypopnea definition.5 In our current study, the agreement in the scoring of central and mixed apneas was lower than that of obstructive apneas, with ICCs of 0.46 and 0.42, respectively. The reasons for these findings are unclear, but they suggest that this is an area that needs to be emphasized in ongoing quality improvement of PSG scoring among the SAGIC centers.
Our level of agreement in sleep stage scoring was also similar to that of Whitney et al.,5 who reported a κ of 0.81-0.83 between two scorers, compared with the κ of 0.78 observed among the nine scorers in our study. It should be pointed out that this previous study used the Rechtschaffen and Kales (R & K) criteria in scoring sleep stages because it was conducted prior to the publication of the 2007 AASM scoring guidelines. With regard to sleep staging, our results indicated that the lowest agreement is in the scoring of Stage N1, a finding also reported in the study by Whitney et al. The interrater agreement of sleep stage scoring was also previously examined by Danker-Hopfe and colleagues4 in 12 PSGs among six scorers from three sleep centers located in Austria and Germany. These investigators found substantial agreement in sleep stage scoring using the AASM guidelines (0.61 < κ < 0.80) and their study suggested that the interrater agreement was higher when using the AASM guidelines compared with the R & K criteria. In addition, they also found the lowest agreement in Stage N1 scoring. Unlike our study, however, these investigators also found a low interrater agreement in Stage N3 scoring.
Our study has some limitations. First, we only included one experienced scorer at each site. Additional research would be needed to determine PSG scoring agreement if multiple scorers from each center were to be involved in SAGIC research projects. Second, we did not determine the intrascorer reliability of respiratory event and PSG scoring. Nonetheless, the previous study by Whitney and colleagues5 suggests that intrascorer agreement of respiratory indices and sleep staging was also excellent. Third, the 15 PSGs used in this study were attended, in-laboratory sleep studies performed in one member center. However, data acquisition in these PSGs used standard techniques that would not be expected to be different given the equipment used at the nine SAGIC sites. It is not clear whether our findings can be generalized to unattended PSGs. Fourth, we did not examine the scoring of periodic limb movements because our aim was primarily to assess the interrater agreement of scoring of respiratory indices and sleep stages. Finally, our current findings should be viewed within the context of the new AASM respiratory scoring rules, soon to be published, which call for a change in hypopnea definition and stipulate the following: a drop of peak signal excursion by ≥ 30% of pre-event baseline using the nasal pressure signal for a duration of ≥ 10 sec with a ≥ 3% oxygen desaturation from pre-event baseline or an associated arousal. The effect of the upcoming change in hypopnea definition (which incorporates the presence of arousals) on the interrater reliability of scoring respiratory events among international centers is yet to be examined.
The results of this study have several implications for research collaboration. Although there was very strong agreement in the scoring of the AHI, one site (sleep center number six) tended to score the respiratory events as hypopneas, whereas another (sleep center number eight) tended to score the respiratory events as apneas (Table 1). We also found lower levels of agreement in the scoring of specific NREM sleep stages, particularly stage N1 (Table 3). Findings from this initial project will help in the ongoing quality improvement of PSG scoring among the SAGIC centers because they suggest that efforts in scoring education among the sleep technologists should be centered on the differentiation between apneas and hypopneas, identification of central apneas, and improving recognition of sleep stage N1.
OSA is a highly prevalent condition not only in Western countries but also worldwide.10–12 The condition is associated with a variety of adverse consequences including hypertension, stroke, cardiovascular disease, and motor vehicle crashes.13–16 There is a growing interest in performing multicenter studies involving patients with OSA at the international level.17 Centralized scoring of PSGs in multicenter studies involving different countries in various continents will require a greater magnitude of resources that may not always be available to investigators. An alternative approach would be scoring of the sleep studies locally in the sleep centers where the PSGs are acquired. Our results suggest that this is feasible in future research collaboration among national and international groups such as SAGIC, particularly in studies where the AHI, ODI, and duration of sleep are the primary variables of interest. Additional steps required to ensure standardization of data collection would include ensuring similar procedures (such as sampling rates and filters) and comparable pulse oximeters18 are used at each center, and technically adequate signals are obtained during the data acquisition of PSGs.
In summary, our findings show substantial agreement in the scoring of respiratory events as well as sleep stages among international sleep centers working in the SAGIC consortium using the 2007 AASM guidelines where experienced sleep technologists are involved. Our results suggest an alternative approach to centralized scoring of PSGs in research collaboration among national and international centers, particularly if the AHI, TST, and ODI are the primary variables of interest. This approach could be used in multicenter studies provided that a similar scoring validation is conducted.
DISCLOSURE STATEMENT
This was not an industry-supported study. The authors have indicated no financial conflicts of interest.
ACKNOWLEDGMENTS
The authors thank the following individuals who helped in this project: Mohammad Ahmadi, Alexander Blau, Silverio Garbuio, Dennis Hoffman, Su-Lan Liu, Neill Madeira, Beth Staley, Magdalena Ósk Sigurgunnarsdóttir, Gavin Sturdy, and James Waddell. The work was supported by NHLBI award P01 HL094307 (to Dr. Pack), HL093463 and the Tzagournis Medical Research Endowment Funds of The Ohio State University (to Dr. Magalang).
Footnotes
A commentary on this article appears in this issue on page 465.
REFERENCES
1. Kushida CA, Littner MR, Morgenthaler T, et al. Practice parameters for the indications for polysomnography and related procedures: an update for 2005. Sleep. 2005;28:499–521. doi: 10.1093/sleep/28.4.499.
2. Sleep-related breathing disorders in adults: recommendations for syndrome definition and measurement techniques in clinical research. The Report of an American Academy of Sleep Medicine Task Force. Sleep. 1999;22:667–89.
3. Iber C, Ancoli-Israel S, Chesson A, Quan SF, editors. The AASM manual for the scoring of sleep and associated events: rules, terminology and technical specifications. 1st ed. Westchester, IL: American Academy of Sleep Medicine; 2007.
4. Danker-Hopfe H, Anderer P, Zeitlhofer J, et al. Interrater reliability for sleep scoring according to the Rechtschaffen & Kales and the new AASM standard. J Sleep Res. 2009;18:74–84. doi: 10.1111/j.1365-2869.2008.00700.x.
5. Whitney CW, Gottlieb DJ, Redline S, et al. Reliability of scoring respiratory disturbance indices and sleep staging. Sleep. 1998;21:749–57. doi: 10.1093/sleep/21.7.749.
6. Munro BH. Statistical methods for health care research. 5th ed. Philadelphia: Lippincott Williams & Wilkins; 2005. pp. 248–9.
7. Cheng JW, Tsai WC, Yu TY, Huang KY. Reproducibility of sonographic measurement of thickness and echogenicity of the plantar fascia. J Clin Ultrasound. 2012;40:14–9. doi: 10.1002/jcu.20903.
8. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. New Jersey: Prentice Hall Inc.; 2000. pp. 560–7.
9. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–74.
10. Young T, Palta M, Dempsey J, Skatrud J, Weber S, Badr S. The occurrence of sleep-disordered breathing among middle-aged adults. N Engl J Med. 1993;328:1230–5. doi: 10.1056/NEJM199304293281704.
11. Davies RJ, Stradling JR. The epidemiology of sleep apnoea. Thorax. 1996;51:S65–70. doi: 10.1136/thx.51.suppl_2.s65.
12. Lam B, Lam DC, Ip MS. Obstructive sleep apnoea in Asia. Int J Tuberc Lung Dis. 2007;11:2–11.
13. Young T, Peppard P, Palta M, et al. Population-based study of sleep-disordered breathing as a risk factor for hypertension. Arch Intern Med. 1997;157:1746–52.
14. Yaggi HK, Concato J, Kernan WN, Lichtman JH, Brass LM, Mohsenin V. Obstructive sleep apnea as a risk factor for stroke and death. N Engl J Med. 2005;353:2034–41. doi: 10.1056/NEJMoa043104.
15. Somers VK. Sleep—a new cardiovascular frontier. N Engl J Med. 2005;353:2070–3. doi: 10.1056/NEJMe058229.
16. George CF. Reduction in motor vehicle collisions following treatment of sleep apnoea with nasal CPAP. Thorax. 2001;56:508–12. doi: 10.1136/thorax.56.7.508.
17. McEvoy RD, Anderson CS, Antic NA, et al. The sleep apnea cardiovascular endpoints (SAVE) trial: Rationale and start-up phase. J Thorac Dis. 2010;2:138–43. doi: 10.3978/j.issn.2072-1439.2010.02.03.5.
18. Zafar S, Ayappa I, Norman RG, Krieger AC, Walsleben JA, Rapoport DM. Choice of oximeter affects apnea-hypopnea index. Chest. 2005;127:80–8. doi: 10.1378/chest.127.1.80.