Sleep. 2024 Apr 30;47(7):zsae088. doi: 10.1093/sleep/zsae088

The National Sleep Research Resource: making data findable, accessible, interoperable, reusable and promoting sleep science

Ying Zhang 1,b, Matthew Kim 2,b, Michael Prerau 3, Daniel Mobley 4, Michael Rueschman 5, Kathryn Sparks 6, Meg Tully 7, Shaun Purcell 8, Susan Redline 9
PMCID: PMC11236948  PMID: 38688470

Abstract

This paper presents a comprehensive overview of the National Sleep Research Resource (NSRR), a National Heart Lung and Blood Institute-supported repository developed to share data from clinical studies focused on the evaluation of sleep disorders. The NSRR addresses challenges presented by the heterogeneity of sleep-related data, leveraging innovative strategies to optimize the quality and accessibility of available datasets. It provides authorized users with secure centralized access to a large quantity of sleep-related data including polysomnography, actigraphy, demographics, patient-reported outcomes, and other data. In developing the NSRR, we have implemented data processing protocols that ensure de-identification and compliance with FAIR (Findable, Accessible, Interoperable, Reusable) principles. Heterogeneity stemming from intrinsic variation in the collection, annotation, definition, and interpretation of data has proven to be one of the primary obstacles to efficient sharing of datasets. Approaches employed by the NSRR to address this heterogeneity include (1) development of standardized sleep terminologies utilizing a compositional coding scheme, (2) specification of comprehensive metadata, (3) harmonization of commonly used variables, and (4) computational tools developed to standardize signal processing. We have also leveraged external resources to engineer a domain-specific approach to data harmonization. We describe the scope of data within the NSRR, its role in promoting sleep and circadian research through data sharing, and harmonization of large datasets and analytical tools. Finally, we identify opportunities for the field of sleep medicine to further support data standardization and sharing.

Keywords: data-sharing, data reuse, data repository, FAIR, metadata, harmonization, informatics

Graphical Abstract


Statement of Significance.

This manuscript introduces the National Sleep Research Resource (NSRR), a pioneering repository providing access to a large quantity of diverse sleep-related data, crucial for understanding sleep disorders and their systemic impacts on health and health disparities. Adhering to FAIR principles, the NSRR addresses data heterogeneity in sleep data by standardizing sleep terminologies, specifying comprehensive metadata standards, harmonizing commonly used data, and developing computational tools to standardize signal processing. This platform is vital for bridging knowledge gaps in sleep research, promoting innovative data analysis, and enabling translational research. Its approach can inform data sharing, metadata, and Common Data Element development in other domains, significantly enhancing scientific discovery and productivity, statistical power, rigor, and reproducibility in sleep and circadian science.

The generation of massive volumes of biomedical data from multiple sources in combination with a need for greater rigor and reproducibility of scientific research findings has spurred efforts to promote data sharing and standardization to create “big data” resources. Over the last 25 years, the National Institutes of Health (NIH) invested in various initiatives to support these goals, including, but not limited to: (1) the creation of over 130 domain-specific biomedical data repositories and knowledgebases; (2) the Big Data to Knowledge (BD2K) initiative that supported the development of tools and training in big data analytics [1, 2]; (3) cloud-based “ecosystems” to store, access, and analyze data, such as BioData Catalyst [3]; and most recently, (4) the NIH Data Management and Sharing requirement for NIH grantees to propose formal plans for standardizing and sharing newly generated research data [4]. Much effort was focused on areas such as cancer and imaging where opportunities were identified to apply machine learning for improved diagnostic and prognostic tools; genomics, which requires extremely large sample sizes to detect typically small effects; and electronic health records, which are continuously generated for tens of millions of people, resulting in huge amounts of “real world” clinical data that are highly under-utilized.

Sleep and circadian data also present unique big data opportunities due to the fundamental role of sleep and circadian rhythms in nearly all physiological systems, as well as the richness of sleep and circadian datasets that include data on multiple physiological systems measured in temporally precise patterns over hours, if not days. The value of repositories and tools for accessing and analyzing physiological signals such as those obtained by electrocardiography and electroencephalography was recognized as early as 1999 when the NIH invested in the PhysioNet Research Resource for Complex Physiologic Signals. Its aim was to create “archives of digital recordings of a wide variety of physiologic signals and related data and associated tools from healthy subjects and patients with a variety of conditions.” However, until 2013 when the National Sleep Research Resource (NSRR; sleepdata.org) was launched, there were no repositories that specifically focused on sleep-related data and the needs of the sleep and circadian research communities. Over its 10-year history, while continually ingesting new datasets, the NSRR has iteratively developed approaches for improving data representation and processes for improving the accessibility and quality of annotated sleep-related summary and raw signal data. Some approaches address problems that are readily applicable to all data types, while others reflect unique aspects of sleep data. In this paper, we: (1) summarize the potential for sleep and circadian data to accelerate scientific discovery and general challenges; (2) provide an overview of the NSRR and a sample of its data; (3) describe specific challenges that impact data standardization and harmonization; (4) describe approaches for developing study-specific and variable-specific metadata and the use of signal processing tools to address FAIR principles and facilitate data harmonization; and (5) propose future directions. 
We hope that this paper will increase awareness of the value of sleep data repositories generally, as well as improve the understanding of the organization and content of the NSRR specifically, inform future data collection and annotation procedures to facilitate data harmonization, and better prepare sleep researchers to meet current NIH data sharing requirements.

Untapped potential of sleep and circadian data: motivations and goals of the NSRR

Opportunities.

Robust sleep and circadian data repositories could propel multiple scientific discoveries, enhancing the understanding of numerous complex physiological systems while filling critical knowledge gaps related to sleep disorders and their underlying population distributions, risk factors, etiological mechanisms, and impact on health and health disparities. Sleep disorders are prevalent, widely under-recognized and under-treated, and associated with significant morbidity and mortality patterns that are incompletely understood [5]. There are therefore numerous research questions that require access to sleep data from large, well-characterized, and diverse samples connected to clinical and outcome data. Notably, sleep research provides opportunities to understand multiple physiological processes, disease mechanisms, and health outcomes. For example, sleep traits are genetically correlated with multiple cardiovascular, metabolic, and hematological traits [6], providing opportunities to study shared genetic mechanisms to uncover potentially novel etiological pathways and inter-relationships underlying common chronic diseases. Sleep and circadian data provide a unique window into the dynamics and interactions of multiple physiological processes and systems. For example, the neurophysiological manifestations of sleep as measured by sleep macro-architecture (e.g. stages) and sleep micro-architecture (e.g. sleep spindle activity) change dynamically over very short time scales and provide windows into multiple brain-peripheral physiological interactions [7]. Additionally, sleep-related physiological events such as apneas, arousals, cardiac arrhythmias, periodic limb movements, and seizures occur in temporally complex and informative patterns, reflecting the influences of variations in sleep state, body position, circadian phase, autonomic function, and prior physiological events [8–10]. 
Analyses of streams of diverse data provide opportunities to discern the “cross-talk” across multiple physiological systems and to develop temporally-based interventions that anticipate and potentially prevent adverse physiological events. High dimensional sleep data are ripe for using artificial intelligence and machine learning for developing algorithms that could transform the clinical management of patients with sleep disorders, but require interrogation of large and diverse datasets [11].

General challenges.

A major barrier to pursuing the many exciting research opportunities of sleep and circadian science relates to the limitations of individual datasets that often lack diversity (socio-economic, race and ethnicity, age, health conditions, exposures, etc.) and are often limited by ascertainment biases, precluding assessments of effect moderation and limiting generalizability. Individual datasets with small or modest sample sizes reduce statistical power and increase the likelihood of spurious inferences.

In the absence of very large, single-source, and richly phenotyped sleep datasets, there is a need to make multiple relevant datasets centrally accessible, and to define and represent those data so that they can be readily combined. For any data type, heterogeneity in data collection procedures, annotations, and labeling reduces the efficiency of accessing, combining, and analyzing such data. These issues are especially pertinent for sleep data, large volumes of which are routinely collected for clinical purposes by thousands of sleep laboratories per year and by numerous research programs, but using protocols that are largely not standardized with respect to collection procedures (both device-based and patient-reported) and labeling of data elements [12]. Therefore, a major need for a sleep data repository is to ensure that data ingested from diverse sources are well-curated, clearly annotated, and harmonized at various semantic and signal processing levels, ideally using standards that support the needs of the sleep as well as informatics communities. Providing access to well-annotated data from multiple sources also provides scientific opportunities to understand sources of variation due to technical (due to sensors, scorers, algorithms; as described [13]) and non-technical (socio-demographic, environmental, and genetic) factors [14]. This information can guide the interpretation of data from various sources, inform best practices in data collection, and identify important population sources of variation in biological processes.

NSRR: Content and Access

The NSRR provides the scientific community with centralized and secure access to growing numbers of datasets that include objective and/or self-reported measurements of sleep and/or circadian rhythm, including data from polysomnography, actigraphy, and patient-reported questionnaires. Data include raw physiological signal data, summary sleep data, and annotations and associated metadata, with ongoing work to generate and share the results of advanced signal analyses that quantify neurophysiological, electrocardiographic, and respiratory-related metrics. As available (for each dataset), demographic, anthropometry, medical history, laboratory, and clinical outcome data are included. Data are ingested using a process that includes documentation of ethical review and any limitations to data sharing, ascertainment that data are de-identified and do not include Protected Health Information, and review of the integrity of the incoming data. The NSRR is supported by a contract from the National Heart Lung and Blood Institute (NHLBI) with Brigham and Women’s Hospital (BWH), Boston MA; regulatory procedures are compliant with BWH’s institutional policies.

Data are made available to the community through a secure on-line data use agreement and tools for efficiently downloading large files [15, 16]. Data distribution and use are governed by each dataset’s original data use limitations. Investigators who request access to specific datasets (Supplemental Figure S1) and consent to required data access and use agreements (with BWH) can directly download files that may include polysomnogram recordings encoded as European Data Format (EDF) files, polysomnogram annotation files (e.g. containing scored “events” or labeled epochs), demographic information with linked variables, forms used for data collection, and study documentation. To date, 7,820 data requests from 11,373 registered users were submitted and 4,989 were granted access to datasets hosted by the NSRR (unapproved data requests mostly were due to requests that were inconsistent with dataset-specific participant consent, such as requests by a commercial entity to use data unapproved for commercial use). In total, 1.36 petabytes of information have been downloaded from the repository with an average download rate of 25–35 terabytes per month. As a result of this activity and the subsequent use of downloaded information in secondary research and analysis, the NSRR has been cited as a principal resource in approximately 400 indexed publications [17].

The data within the NSRR were initially seeded by data collected under the auspices of the Sleep Reading Centers (directed by SR), with later data contributed by investigators responding to journal or sponsor requirements to share data, or as a result of an NSRR-driven data sharing campaign. As of November 2023, a total of 27 datasets had been incorporated into the NSRR, including data from 16 cohort or observational studies, 6 clinical trials, 1 experimental database, 3 clinical data banks, and 1 animal study [18–33]. These include datasets from a number of landmark studies in the field of sleep research conducted from 1995 to the present (Table 1). Collectively, sources include de-identified data from 46 214 subjects including (1) polysomnogram recordings with overnight multi-channel neurophysiological, cardiac, and respiratory data, (2) actigraphy recordings capturing multi-day 24-hour sleep-wake patterns, (3) responses to surveys asking questions about sleep habits, sleep quality, and the adverse effects of disrupted sleep, and (4) demographic information, anthropometric measurements, biochemical parameters, lifestyle behaviors, and data pertaining to comorbid medical conditions, outcomes, and events. Several “at-a-glance” matrices provide researchers with the ability to quickly identify datasets that include datatypes most relevant to their needs. The broad range of data within the NSRR is organized into conceptual domains with nested subdomains, as summarized (Figure 1).

Table 1.

National Sleep Research Resource (NSRR) datasets (as of March 2024)

| Dataset | Subjects | Age range | Time frame | PSG/HSAT count | Actigraphy count | Variable count | Sleep test type | Average actigraphy days | On dbGaP |
|---|---|---|---|---|---|---|---|---|---|
| Sleep Heart Health Study | 5804 | 40–89 | 1995–2010 | 8444 | 0 | 1896 | II | 0 | Yes |
| Honolulu-Asia Aging Study of sleep apnea | 718 | 79–97 | 1999–2000 | 717 | 0 | 11 | II | 0 | No |
| Wisconsin sleep cohort | 1123 | 37–85 | 2000–2015 | 3671 | 0 | 360 | I | 0 | No |
| Cleveland Family Study | 735 | 6–88 | 2001–2006 | 730 | 0 | 2657 | I | 0 | Yes |
| Study of osteoporotic fractures | 461 | 65–89 | 2002–2003 | 453 | 0 | 1146 | II | 0 | Yes |
| Apnea Positive Pressure Long-term Efficacy Study | 1516 | 18–84 | 2003–2008 | 1104 | 0 | 353 | I | 0 | No |
| Outcomes of Sleep Disorders in Older Men (MrOS Sleep Study) | 2911 | 65–89 | 2003–2012 | 3933 | 0 | 649 | II | 0 | Yes |
| Cleveland Children’s Sleep and Health Study | 517 | 16–19 | 2006–2010 | 515 | 0 | 257 | I | 0 | No |
| Childhood adenotonsillectomy trial | 1243 | 5–9 | 2007–2012 | 1639 | 0 | 2901 | I | 0 | No |
| Home positive airway pressure | 373 | 20–80 | 2008–2010 | 414 | 0 | 120 | I/III* | 0 | No |
| Hispanic community health study/study of Latinos | 16415 | 18–76 | 2009–2013 | 12088 | 1887 | 1032 | III | 7 | Yes |
| Heart biomarker evaluation in apnea treatment | 318 | 45–75 | 2010–2012 | 591 | 0 | 790 | III | 0 | No |
| Multi-ethnic study of atherosclerosis | 2237 | 54–95 | 2010–2013 | 2056 | 2159 | 627 | II | 7 | Yes |
| Nulliparous pregnancy outcomes study monitoring mothers-to-be | 3012 | 14–44 | 2011–2013 | 5341 | 0 | 392 | III | 0 | Yes |
| Best apnea interventions in research | 169 | 46–76 | 2011–2014 | 518 | 0 | 205 | III | 0 | No |
| Apnea, bariatric surgery, and CPAP study | 49 | 26–64 | 2011–2014 | 132 | 0 | 108 | I | 0 | No |
| One year of actigraphy | 1 | 62 | 2016–2017 | 0 | 1 | 0 | n/a | 0 | No |
| The economic consequences of increasing sleep among the urban poor | 597 | 25–55 | 2017–2019 | 0 | 597 | 0 | n/a | 28 | No |
| Forced desynchrony with and without chronic sleep restriction | 28 | 20–34 | 2000–2016 | 1000 | 28 | 32 | I | 25 | No |
| Nationwide Children’s Hospital Sleep DataBank | 3673 | 0–58 | 2017–2019 | 3984 | 0 | 31 | I | 0 | No |
| Maternal sleep in pregnancy and the fetus | 106 | 18–42 | 2015–2019 | 106 | 0 | 37 | I | 0 | No |
| Assessing nocturnal sleep/wake effects on risk of suicide | 971 | 18–52 | 2020–2021 | 0 | 0 | 301 | n/a | 0 | No |
| Efficacy assessment of NOP agonists in non-human primates | 5 | 14–19 | 2019 | 10 | 0 | 0 | I | 0 | No |
| Mignot nature communications | 3000 | 18–91 | Varies | 1438 | 0 | 0 | I | 0 | No |
| Stanford technology analytics and genomics in sleep | 1881 | 13–84 | 2018–2019 | 2055 | 2055 | 441 | I | 7 | No |
| Cox and Fell (2020) sleep medicine reviews | 5 | 0–100 | | 3 | 0 | 0 | I | 0 | No |
| Sleep health in infancy and early childhood | 433 | 0–2 | 2016–2020 | 0 | 1257 | 319 | n/a | 7 | No |
| Sleep disordered breathing, ApoE and lipid metabolism | 712 | 13–90 | 2003–2007 | 712 | 0 | 67 | I | 0 | No |

PSG: polysomnography; HSAT: home sleep apnea test.

Sleep test type: Type I: attended studies that minimally include the following channels: EEG, EOG, ECG/heart rate, chin EMG, limb EMG, respiratory effort at thorax and abdomen, oxygen saturation, and airflow from a nasal cannula or thermistor. Type II: full polysomnograms (as in Type I) but performed in an unattended setting. Type III: home sleep test (HST), performed in an unattended setting with a minimum of 4 channels, minimally including 2 respiratory movement/airflow, 1 ECG/heart rate, and 1 oxygen saturation channel. Type IV: home sleep test (HST), performed in an unattended setting with a minimum of 3 channels that allows calculation of an AHI or RDI as the result of measuring airflow or thoracoabdominal movement.

*Home positive airway pressure: Type I in baseline and Type III in follow up visit.

Figure 1.

(a) and (b) Measures and Instruments across the National Sleep Research Resource (NSRR) Sleep Questionnaires and Polysomnography Domain. Bar lengths represent the number of variables in each domain or subdomain aggregated across the full range of datasets. In (a), the colored bar represents variables from specific survey instruments, while the grey bar represents standalone variables. MEQ: Horne-Ostberg Morningness Eveningness Questionnaire; SDS Checklist-25: Sleep Disorders Symptom Checklist-25; DDNSI: Disturbing Dream and Nightmare Severity Index; Calgary SAQLI: Calgary Sleep Apnea Quality of Life Index; OSA-18: Obstructive Sleep Apnea Quality of Life Questionnaire; PSQ: Pediatric Sleep Questionnaire; SEMA: Self-Efficacy Measure of Sleep Apnea; SAQLI: Sleep Apnea Quality of Life Index; ESS: Epworth Sleepiness Scale; BRISC: The Brief Index of Sleep Control; PSQI: Pittsburgh Sleep Quality Index; FOSQ: Functional Outcomes of Sleep Questionnaire; ISI: Insomnia Severity Index; PROMIS SD: PROMIS Sleep Disturbance; PROMIS SRI: PROMIS Sleep Related Impairment; WHIIRS: Women’s Health Initiative Insomnia Rating Scale. * The full list of NSRR domains and subdomains is provided in Supplemental Figure S3.

Adherence to FAIR Principles

Adherence to FAIR (findable, accessible, interoperable, and reusable) principles [34] is a central tenet for modern data management and is included in recent federal data sharing requirements. At each stage of development of the NSRR, efforts were focused on designing and implementing a system that adheres to these principles. Prior to its initial release, the combined input of computer scientists, data scientists, sleep experts, and informaticians fostered the iterative development of a system targeted to make hosted data accessible by incorporating (1) a streamlined registration process that enables users to submit requests for access to multiple datasets under a unified data access and use agreement, and (2) a secure mechanism for the reliable transfer of downloadable EDF files. During the ingestion process, the NSRR team collaborates with the contributor to further: (3) develop consistent study documentation, (4) provide standardized metadata for key variables, (5) map selected variables to standardized terms and concept tags, and (6) conduct signal processing to generate canonical sets of harmonized sleep signals with standardized labels and sampling rates. These data ingestion procedures were codified and modified to reflect NIH’s 2023 Data Management and Sharing Plan requirements, including specification of data formats and metadata standards, and have been publicized through interactions with professional societies, social media, and NSRR-sponsored webinars.
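As an illustration of step (6), the following minimal Python sketch maps raw channel labels to canonical labels and flags channels whose sampling rate differs from a target rate. The alias table, canonical names, and target rates are hypothetical placeholders chosen for this example; they are not the NSRR's actual vocabulary or processing settings.

```python
# Hypothetical alias table: canonical label -> raw labels seen in source files.
# These names are illustrative assumptions, not the NSRR's actual vocabulary.
CANONICAL_ALIASES = {
    "C3_M2": ["C3-M2", "C3-A2", "EEG C3-M2", "EEG C3-A2"],
    "SpO2": ["SpO2", "SaO2", "OSAT", "Oxygen Saturation"],
    "ECG": ["ECG", "EKG", "ECG1"],
}

# Assumed target sampling rates (Hz) for the canonical signals.
TARGET_RATES_HZ = {"C3_M2": 128, "SpO2": 1, "ECG": 128}


def normalize(label):
    """Upper-case a label and collapse whitespace so aliases compare reliably."""
    return " ".join(label.strip().upper().split())


# Reverse lookup: normalized raw label -> canonical label.
LOOKUP = {
    normalize(alias): canonical
    for canonical, aliases in CANONICAL_ALIASES.items()
    for alias in aliases
}


def harmonize_channels(channels):
    """Map (raw_label, rate_hz) pairs to canonical labels.

    Returns (mapped, unmapped): mapped is a list of dicts that record whether
    the channel would need resampling; unmapped lists labels with no alias.
    """
    mapped, unmapped = [], []
    for raw_label, rate_hz in channels:
        canonical = LOOKUP.get(normalize(raw_label))
        if canonical is None:
            unmapped.append(raw_label)
            continue
        mapped.append({
            "raw": raw_label,
            "canonical": canonical,
            "rate_hz": rate_hz,
            "needs_resampling": rate_hz != TARGET_RATES_HZ[canonical],
        })
    return mapped, unmapped


mapped, unmapped = harmonize_channels([("  sao2 ", 1), ("EKG", 256), ("Thor RES", 10)])
for row in mapped:
    print(row)
print("unmapped:", unmapped)
```

Keeping unmapped labels in a separate list, rather than dropping them, lets a curator review them and extend the alias table.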

Specific Challenges: Heterogeneity in Data

There are multiple sources of heterogeneity in sleep data that impact the ability to easily find, combine, and reuse data, some of which reflect challenges faced by any federated data repository, while other variations are more specific to sleep research. As discussed in other publications [12, 35], and detailed below, the variability of sleep data reflects variations in collection procedures, annotations, and formatting (Figure 2), each of which requires specific approaches for improving the usability of the data.

Figure 2.

Concept map of selected sources of heterogeneity in sleep data. This diagram summarizes the sources of heterogeneity in sleep and circadian data discussed in this paper.

  • Variability of polysomnography data collection—The American Academy of Sleep Medicine (AASM) publishes guidelines for polysomnography that include minimal requirements related to channels recorded, sensors used, and filter and sampling rates. However, these guidelines allow broad latitude in how data are collected (e.g. several sensor types are permissible); in annotation procedures (e.g. use of “hot keys” for annotating events such as arousals or apneas is not prohibited); in procedures for achieving consistent polarity of recorded signals; and in how data are labeled (e.g. there is no widely adopted and comprehensive standardized nomenclature for labeling the physiological channels or event annotations). Further variability in analysis of EEG, EMG, EOG, and ECG data may result from variations in choice of reference electrodes and electrode derivations, which can markedly impact the amplitude and content (e.g. features) of the measured signals and often are not well documented. Variations in equipment hardware can output signals that are filtered in poorly documented ways, impacting secondary analysis of measures such as airflow limitation. Heterogeneity also arises from the expanded use of diagnostic devices other than the overnight in-laboratory polysomnogram (Type I device). In fact, approximately 70% of clinical sleep studies currently utilize home-based devices that collect a limited number and variable types of physiological data [36]. These devices, which are increasing in numbers and diversity, are categorized as Types II, III, and IV, with only Types I and II including EEG data collection. Moreover, the Type III and IV devices include a wide variety of sensors, some of which are not routinely used in the “gold-standard” Type I studies (e.g. peripheral arterial tonometry), and often do not include sensors traditionally considered to be core for defining event subtypes (e.g. nasal flow). 
This variability in data types is reflected in the data ingested into NSRR. Some datasets utilized protocols for collecting comprehensive polysomnography studies monitored by sleep technicians in a laboratory setting, while others relied on a variety of home sleep study devices. As a result, across studies (and sometimes within studies when data were aggregated across different sleep laboratories), different recording approaches were used to measure airflow, oxygenation, respiratory effort, muscle tone, limb movement, eye movement, and brain activity. Such differences not only affect the availability of core parameters within and across studies, but also the ability to define events and create uniform metadata, and the overall precision and accuracy of measurements.

  • Event Definitions—The AASM publishes criteria for scoring specific events within the sleep study (respiratory events, stages, leg movements, etc.) [37]. However, the criteria for scoring events (particularly hypopneas) have changed multiple times over the last 15 years. These changes may result in large differences in disease classification [38, 39]. In addition, many key terms used to annotate events and define sleep disorders have evolved [40]. The original metric used to classify sleep disordered breathing focused on quantifying the number of apneic events per hour to calculate an apnea index [41]. Subsequent definitions expanded criteria to include hypopneas characterized by reductions in airflow with decreased oxygen saturation to calculate a broader index, which initially was labeled a respiratory disturbance index, and later an apnea-hypopnea index (AHI) [18]. As the AHI became accepted as a standard measure of obstructive sleep apnea, thresholds were proposed to classify disease as mild, moderate, or severe [42, 43]. However, the AHI was shown to be widely variable depending on which definitions were applied to define hypopneas (with variable criteria for defining critical changes in breathing amplitude and/or inclusion of desaturation and/or arousal). Due to lack of consensus, the AASM even published two definitions characterized as “recommended” and “alternative.” Subsequent revisions proposed more unified “recommended” and “acceptable” hypopnea definitions that still vary with respect to criteria related to associated oxygen saturation and/or arousal [44, 45]. Additional measures used to characterize event subtypes include detection of increased respiratory effort with cortical arousal classified as a respiratory effort-related arousal (RERA) and a summation of the AHI and RERA index designated as the respiratory disturbance index (RDI) [37, 46].

  • Data formatting—Sleep data are routinely collected and saved using a variety of proprietary software dictated by the specific equipment used. To avoid the need to access multiple software tools for data analysis and to standardize the presentation of data, the NSRR requests data contributors to transfer polysomnography data as EDF files (https://edfplus.info/), a standardized format developed to promote sleep data exchange [47, 48]. However, many laboratories do not routinely save data in EDF, requiring support for exporting and de-identifying data for data-sharing. While the NSRR can assist contributors with these tasks (e.g. by providing de-identification tools or guidance on best practices for exporting data), these procedures are generally not automated, and their implementation can delay data sharing efforts. In addition to sometimes containing subtle corruptions, e.g. due to truncations of data transfers, EDF files themselves can vary in content and format: for example, files may (1) contain continuous or discontinuous (gapped) recordings, (2) lack physical unit or transducer header fields, (3) specify inappropriate dynamic ranges or incorrect units, (4) include annotations encoded within the EDF, or (5) split single-night data across multiple EDF files. Further, annotation files can occasionally be temporally misaligned with respect to the underlying signal data.

Many users are interested in training algorithms to automatically score events within the polysomnogram, or to extract novel metrics based on scored events. Those goals require access to the annotation files that provide tabular scored events, delineated by their duration, inter-event intervals, and associated features (e.g. desaturation). However, such files are encoded in a range of different formats, and the labeling, encoding, and directory structures of associated data files vary.

  • Actigraphy—Actigraphy data are also saved in a variety of formats and lack a single “standard.” There are scant published recommendations that guide data collection, with variability in what data are saved (counts, accelerometry motion), sampling rates, and auxiliary data (light, event markers, etc.).

  • Patient-reported outcomes—There is not a standard set of Common Data Elements recommended for sleep or circadian research. Accordingly, the datasets within the NSRR include a variety of sleep questionnaires such as the Epworth Sleepiness Scale, Women’s Health Initiative Insomnia Rating Scale, Pediatric Sleep Questionnaire, Pittsburgh Sleep Quality Index, and Functional Outcomes of Sleep Questionnaire. Some patient-reported data are based on single items, subsets of items, or paraphrased questions abstracted from one or more instruments, with response categories and/or rating scales that are different from the validated survey instruments. Moreover, reference periods (e.g. “in the past two weeks”) vary across studies and are often not preserved in the data dictionary submitted by data owners, which poses challenges to metadata documentation and chronicity assessment of certain sleep disorders. Many items within questionnaires overlap different domains (e.g. insomnia vs sleep quality; sleepiness vs functional impairment), which makes mapping those items to specific domains challenging.
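The dependence of the AHI on the hypopnea definition, described under Event Definitions above, can be made concrete with a small Python sketch. The two rule sets below (one counting a hypopnea if it has a desaturation of at least 3% or an arousal, the other requiring a desaturation of at least 4%) are illustrative assumptions chosen for this example, as are the event counts; they are not drawn from any NSRR dataset. The RDI is computed as the sum of the AHI and the RERA index, as described in the text.

```python
from dataclasses import dataclass


@dataclass
class RespEvent:
    kind: str          # "apnea" or "hypopnea"
    desat_pct: float   # associated oxygen desaturation, in percent
    arousal: bool      # whether the event was associated with an arousal


def counts_toward_ahi(event, min_desat_pct, arousal_qualifies):
    """Apneas always count; hypopneas count only if they meet the chosen rule."""
    if event.kind == "apnea":
        return True
    if event.desat_pct >= min_desat_pct:
        return True
    return arousal_qualifies and event.arousal


def ahi(events, sleep_hours, min_desat_pct, arousal_qualifies):
    """Apnea-hypopnea index: qualifying events per hour of sleep."""
    n = sum(counts_toward_ahi(e, min_desat_pct, arousal_qualifies) for e in events)
    return n / sleep_hours


def rdi(ahi_value, rera_count, sleep_hours):
    """Respiratory disturbance index: AHI plus the RERA index."""
    return ahi_value + rera_count / sleep_hours


# Made-up night: 6 hours of sleep and four groups of scored events.
events = (
    [RespEvent("apnea", 5.0, False)] * 10
    + [RespEvent("hypopnea", 4.0, False)] * 20
    + [RespEvent("hypopnea", 3.0, False)] * 15
    + [RespEvent("hypopnea", 1.0, True)] * 15
)
sleep_hours = 6.0

ahi_3pct_or_arousal = ahi(events, sleep_hours, min_desat_pct=3.0, arousal_qualifies=True)
ahi_4pct_only = ahi(events, sleep_hours, min_desat_pct=4.0, arousal_qualifies=False)
print(ahi_3pct_or_arousal, ahi_4pct_only)
print(rdi(ahi_3pct_or_arousal, rera_count=12, sleep_hours=sleep_hours))
```

In this made-up example the same recording yields an AHI twice as large under the more inclusive rule, which is why the hypopnea definition needs to travel with the variable as metadata.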

Approaches for Addressing Sleep Data Heterogeneity, Developing Metadata, and Data Harmonization

To minimize the effects of heterogeneity while providing opportunities to assess and learn from sources of heterogeneity, data are ingested using a well-defined process that captures critical metadata at the study and variable level. Innovative approaches that NSRR has employed to address data heterogeneity have stemmed from integrated initiatives that include (1) specification of study-level and variable-level metadata, including use of compositional terminology and mapping of terms to a common standard where possible, (2) standardization of sleep-wake period information, (3) post-processing polysomnography data to standardize and annotate the data and channel labels, and (4) integration and extension of harmonized variables. Figure 3 outlines the data ingestion and metadata generation processes.
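Before polysomnography files are post-processed, several of the EDF header problems noted earlier (discontinuous recordings, missing unit or transducer fields, implausible physical ranges, embedded annotations) can be screened automatically. The Python sketch below reads the fixed-width EDF header layout and reports such issues. It is a minimal illustration under stated assumptions, not the NSRR's tooling: `make_header` is a hypothetical helper that builds a synthetic header for the demonstration, and the checks shown are a small subset of what a full validator would cover.

```python
# Per-signal header fields in EDF order: (name, width in bytes).
SIGNAL_FIELDS = [
    ("label", 16), ("transducer", 80), ("physical_dimension", 8),
    ("physical_min", 8), ("physical_max", 8), ("digital_min", 8),
    ("digital_max", 8), ("prefiltering", 80), ("samples_per_record", 8),
    ("reserved", 32),
]


def _text(raw):
    return raw.decode("ascii", errors="replace").strip()


def parse_edf_header(raw):
    """Parse the 256-byte fixed header and the per-signal header block."""
    n_signals = int(_text(raw[252:256]))
    header = {
        "reserved": _text(raw[192:236]),  # begins with EDF+C or EDF+D in EDF+ files
        "signals": [dict() for _ in range(n_signals)],
    }
    offset = 256
    # EDF stores each field for all signals before moving to the next field.
    for name, width in SIGNAL_FIELDS:
        for i in range(n_signals):
            header["signals"][i][name] = _text(raw[offset:offset + width])
            offset += width
    return header


def screen_header(header):
    """Return a list of human-readable issues found in a parsed header."""
    issues = []
    if header["reserved"].startswith("EDF+D"):
        issues.append("recording is discontinuous (EDF+D)")
    for sig in header["signals"]:
        label = sig["label"]
        if label == "EDF Annotations":
            issues.append("annotations are embedded in the EDF")
            continue
        if not sig["physical_dimension"]:
            issues.append(f"{label}: missing physical unit")
        if not sig["transducer"]:
            issues.append(f"{label}: missing transducer type")
        try:
            lo, hi = float(sig["physical_min"]), float(sig["physical_max"])
        except ValueError:
            issues.append(f"{label}: unreadable physical range")
            continue
        if lo == hi:
            issues.append(f"{label}: zero-width physical range")
    return issues


def make_header(signals, reserved=""):
    """Hypothetical helper: build a synthetic header for the demo only."""
    pad = lambda value, width: str(value).ljust(width)[:width]
    n = len(signals)
    fixed = "".join([
        pad("0", 8), pad("X", 80), pad("X", 80), pad("01.01.01", 8),
        pad("00.00.00", 8), pad(256 * (n + 1), 8), pad(reserved, 44),
        pad(-1, 8), pad(1, 8), pad(n, 4),
    ])
    per_signal = "".join(
        pad(sig.get(name, ""), width)
        for name, width in SIGNAL_FIELDS for sig in signals
    )
    return (fixed + per_signal).encode("ascii")


demo = make_header(
    [
        {"label": "EEG C3-M2", "transducer": "AgAgCl electrode",
         "physical_dimension": "uV", "physical_min": -500, "physical_max": 500},
        {"label": "SaO2", "physical_min": 0, "physical_max": 0},
        {"label": "EDF Annotations"},
    ],
    reserved="EDF+D",
)
issues = screen_header(parse_edf_header(demo))
for issue in issues:
    print(issue)
```

For a real file, the same functions could be applied to the leading bytes of the file read in binary mode. Checks that require the signal data itself, such as truncation or annotation misalignment, are outside the scope of this sketch.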

Figure 3.

Outline of data ingestion and metadata generation processes in NSRR. This diagram illustrates the workflow required to generate and curate enhanced metadata in the NSRR. The original metadata consists of PDF files of study manuals, forms, and data dictionaries. Information extracted from these resources is categorized as study-level, file-level, or variable-level metadata during the data ingestion process. This structured metadata is reviewed and extended to generate several output products including: (1) semi-structured metadata in the form of a README file that serves as a dataset introduction page on the NSRR website; (2) a version-controlled standardized data dictionary that incorporates standard conceptual domains/subdomains and enhanced variable-level metadata including relevant study-level metadata, provenance information, hyperlinks to data collection forms, and standardized tags; (3) summary statistics for each variable stratified by common demographic groupings; (4) harmonized data for selected groups of variables that are comparable across datasets; (5) enhanced search results using standardized NSRR tags; and (6) an at-a-glance matrix showing the availability of data by category and PSG channel. The right panel shows a screenshot of variable-level metadata for the “ahi” variable integrated into the NSRR after review and curation.

Specification of study-level metadata.

To generate a template for the specification of study-level metadata we adopted a reporting format based on checklists promoted by the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) initiative [49]. Using foundational NSRR datasets as models, we traced the data collection, processing, and analysis approaches involved in different types of studies to identify sources of heterogeneity that could be specified as metadata elements. We compiled a set of key-value pairs for each of these elements that we used to generate a metadata intake form incorporating (1) a study overview section providing information about the investigator(s), support, study design, eligibility and exclusion criteria, exposures, interventions, outcomes, access restrictions, and a list of validated survey instruments for collecting patient-reported outcomes, (2) an actigraphy data section providing information about data collection and processing including recording devices, software, sampling rates, annotation methods, and definitions of specific times and periods of interest, and (3) a polysomnography data section providing information about data collection and processing including equipment, montages, sampling rates, data formats, scoring methods, and definitions of thresholds used to identify hypopnea events. This form has been deployed as a spreadsheet that incorporates selectable and extensible options for each element and has been integrated into the NSRR data deposition process. Information abstracted from completed forms has been used to generate a matrix that provides a sortable filtered overview of the studies included in the NSRR with direct links to available datasets (19).

Specification of variable-level metadata

In the early stages of development of the NSRR we recognized that there was significant variation in the range and depth of metadata available for variables in many datasets. While we knew that advanced cross-cohort search capabilities might make it possible to retrieve similarly labelled variables from different datasets [50, 51], we were also aware that inadequate characterization of the provenance of these variables would make it difficult to determine if they were comparable.

The original plan for the NSRR called for the definition of a sleep research ontology that would serve as the basis for a structured vocabulary to characterize dataset variables. While this approach was conceptually appealing, in practice the development of an extensible ontology proved to be a cumbersome process that led to significant delays in the deployment of usable resources. One challenge was the ability to readily expand upon and integrate with existing ontologies (SNOMED and LOINC) due to their limited coverage of sleep terms. As a workable alternative, we compiled a set of canonical terms abstracted from variables included in the larger foundational datasets. These terms were edited for clarity to provide precise definitions of origins, thresholds, and states before they were added to a curated data dictionary. We were subsequently able to link variables in each dataset to terms in this data dictionary, enabling them to serve as points of connection for cross-cohort queries. When it proved to be feasible, additional metadata elements were appended to linked variables in the form of tags providing information about the source, timing, equipment, and methodology used to collect data. We also linked, when able, terms to sleep-related terms in the National Library of Medicine’s (NLM) Common Data Element library. However, coverage of sleep data within the NLM is currently limited.

This parallel approach to the specification of study-level and variable-level metadata has streamlined the workflow required to integrate datasets submitted for inclusion in the NSRR. Minimizing the ambiguity of metadata mapped at both levels effectively serves to improve the accuracy of cross-cohort queries conducted to identify comparable variables retrieved from disparate datasets.

Compositional terminology.

A special problem in specifying metadata and adopting uniform terminology relates to the marked variation in definitions of apneas and hypopneas, both over time and across datasets. To address this problem, we developed a compositional terminology configured to generate compound labels that can be parsed to provide fully qualified metadata pertaining to specific variables. To accommodate the full range of variable-level metadata pertaining to indices of sleep disordered breathing, we modeled this terminology after the post-coordination approach utilized by the SNOMED CT system to define complex concepts [52]. This flexible scheme can be used to generate a compound label for each variable composed of a root component and qualifying suffix components. The root component includes linked abbreviations that designate the type of event measured (Event), the measurement recorded (Data type), and any qualifiers used to characterize the measurement (Data qualifier) (Supplementary Table S1). A suffix component separated by an underscore can be appended to designate a sleep stage and body position, and additional suffix components can be added to designate a data source and level of oxygen saturation or desaturation. This scheme also incorporates precompiled suffix components that correspond to specific criteria used to identify hypopneas based on varying definitions. The labels generated using this compositional terminology can be parsed by algorithms to enable large scale mapping and harmonization of variables. Those variables that are mapped to labels can be automatically converted between wide and long data formats. When converted to a long data format, the information encoded in each label can be extracted to generate a profile of semantic terms. This approach has proven to work well with polysomnography and actigraphy data, which tend to have many permutations of similar measures. We are also assessing the utility of applying a compositional terminology to other types of sleep data such as self-reported questionnaire items.
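As a rough illustration of how such compound labels might be parsed, the sketch below splits a label on underscores and looks up each suffix. The label strings are taken from the Figure 6 caption, but the lookup tables, the meaning attached to each suffix, and the function and class names are assumptions made for this example; the actual component vocabulary is the one given in Supplementary Table S1.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical lookup tables, inferred from the labels in the Figure 6 caption.
# They are illustrative only and do not reproduce the NSRR vocabulary.
HYPOPNEA_CRITERIA = {
    "hp3r": {"desaturation_pct": 3, "arousal_counts": True},
    "hp4r": {"desaturation_pct": 4, "arousal_counts": True},
    "hp4u": {"desaturation_pct": 4, "arousal_counts": False},
}
GUIDELINES = {"aasm15": "AASM 2015", "chicago1999": "AASM Chicago 1999"}


@dataclass
class ParsedLabel:
    root: str                               # e.g. "ahi"
    hypopnea_criterion: Optional[dict] = None
    guideline: Optional[str] = None
    unparsed: List[str] = field(default_factory=list)


def parse_label(label: str) -> ParsedLabel:
    """Split a compound label into a root and recognized suffix components."""
    parts = label.lower().split("_")
    if parts and parts[0] == "nsrr":        # drop the harmonized-variable prefix
        parts = parts[1:]
    if not parts or not parts[0]:
        raise ValueError(f"label has no root component: {label!r}")
    parsed = ParsedLabel(root=parts[0])
    for suffix in parts[1:]:
        if suffix in HYPOPNEA_CRITERIA:
            parsed.hypopnea_criterion = HYPOPNEA_CRITERIA[suffix]
        elif suffix in GUIDELINES:
            parsed.guideline = GUIDELINES[suffix]
        else:
            parsed.unparsed.append(suffix)  # keep unknown suffixes for review
    return parsed


def to_long_row(label: str, value: float) -> dict:
    """Turn one wide-format column (label, value) into a long-format record."""
    p = parse_label(label)
    return {"root": p.root, "guideline": p.guideline,
            "criterion": p.hypopnea_criterion, "value": value}
```

Keeping unrecognized suffixes in a separate list, rather than raising an error, is a design choice for this sketch: it lets a curator review labels that the tables do not yet cover.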

Defining core sleep-wake information.

One of the initial challenges in sleep data standardization relates to the inconsistency in the terminology used to specify time points and intervals describing sleep and wake periods. Depending on the context and usage, “time” might refer to a specific point in time or to an interval between two time points. While “duration” and “period” could both be taken to correspond to intervals, they were often used interchangeably in protocol descriptions and data dictionaries without any indication of whether they referred to intervals between designated time points or to specific intervals when subjects were determined to be awake or asleep.

Recognizing the need to develop internally consistent terms to distinguish time points and intervals prompted us to compile a list of key concepts used to define sleep-wake intervals in available NSRR study protocols. These included specific time points, intervals between time points, and states within intervals. Review of study documentation and research publications helped to identify commonly used terms that could be mapped to specific concepts. This in turn enabled us to designate standardized terms that incorporate precise definitions of “time,” “period,” and “duration” that can be used to delineate distinct intervals, and to visualize their inter-relationships graphically (Figure 4). In the terminology developed based on these definitions, “time” refers to a specific point in time that is either recorded as a clock time or marked by when an event starts or ends. “Period” refers to a continuous interval between two specific time points defined a priori, while “duration” refers to the sum of the lengths of multiple intervals describing a specific state or condition, wherein the state can be further specified as sleep stages. Adoption of this standardized terminology allows for unambiguous demarcation of the intervals and sleep-wake states used to characterize the state-specificity of respiratory, cardiac, electroencephalographic, and movement-related events. Use of structured definitions also allowed inconsistencies in data calculations to be identified. For example, in one instance, applying standardized nomenclature identified that a summary respiratory index was calculated erroneously to include events in the interval between recording start and end times rather than only during the time asleep.
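The distinction between a period and a duration can be made concrete with a short calculation. The sketch below assumes a hypnogram scored in fixed-length epochs that starts at lights-off and ends at lights-on; the stage labels, the 30-second default, and the function name are assumptions for illustration, not NSRR code.

```python
from typing import Dict, Sequence

SLEEP_STAGES = {"N1", "N2", "N3", "R"}  # assumed stage labels; "W" is wake


def summarize_sleep(stages: Sequence[str], epoch_sec: int = 30) -> Dict[str, int]:
    """Derive periods (continuous intervals) and durations (summed states)."""
    asleep = [i for i, s in enumerate(stages) if s in SLEEP_STAGES]
    if not asleep:
        raise ValueError("no sleep epochs found")
    onset = asleep[0]          # index of the first sleep epoch (sleep onset)
    offset = asleep[-1] + 1    # index just past the last sleep epoch (sleep offset)
    sleep_period = (offset - onset) * epoch_sec   # continuous interval
    sleep_duration = len(asleep) * epoch_sec      # sum of asleep epochs only
    return {
        "in_bed_period_sec": len(stages) * epoch_sec,
        "sleep_onset_latency_sec": onset * epoch_sec,
        "sleep_period_sec": sleep_period,
        "sleep_duration_sec": sleep_duration,
        # WASO is the part of the sleep period that was not spent asleep
        "waso_sec": sleep_period - sleep_duration,
    }
```

Under these definitions an event index expressed per hour of sleep would use `sleep_duration_sec` as its denominator, not the recording or in-bed period, which is the kind of inconsistency described in the paragraph above.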

Figure 4:

Sleep–wake terminology schema. To disambiguate notations that refer to different time points, sleep intervals, periods, and durations, the NSRR utilizes a visual schema that identifies (a) time points (going to bed, falling asleep, waking up during sleep, waking up after sleep, getting out of bed), (b) intervals (recording, in-bed), and (c) states (awake, asleep). Terminologies based on these designations are organized in reference to (d) clock times (recording start time, lights-off/in-bed time, sleep onset, sleep offset, lights-on/out-bed time, recording end time), (e) periods (recording period, in-bed period, sleep period, sleep onset latency), and (f) durations (sleep duration = sleep period − wake after sleep onset (WASO); wake after sleep onset = sleep period − sum of sleep durations within the sleep period). Note that other iterations could further distinguish stages (N1, N2, NREM, and REM) within states.

Standardization of data represented within the polysomnogram, including channel labels.

By definition, the principal indices used to classify sleep-related physiological disturbances rely on the identification and quantification of events annotated from the polysomnography recordings. As described above, the variation in data collection, annotation, and scoring approaches introduces considerable heterogeneity. During its initial phase, NSRR’s computer scientists and biomedical engineers developed several signal processing tools tailored to working with NSRR data. Tool development was informed by the needs of the local team as well as feedback from the User Community solicited during community outreach events. The NSRR team has further developed a robust signal processing pipeline for sleep data that can be applied both to existing and new NSRR datasets, as well as users’ own sleep studies uploaded to the cloud. Details of these tools will be reported in a subsequent publication, but include:

  • EDF Annotation Translator: this provides the framework for reading annotations stored in multiple file formats such as XML, CSV, and text files, and transforms them to a standard XML file format with Sleep Resource Ontology concepts for defining the events.

  • Altamira: an EDF viewer that allows the display of signals and standardized annotations.

  • Luna (http://zzz.bwh.harvard.edu/luna/): a C/C++ toolset and R extension library for the manipulation and analysis of large numbers of EDFs, designed with both parallelization and NSRR annotation data in mind; it can also be deployed as a Docker container to facilitate migration to a cloud computing environment.

These tools support an NSRR analytical pipeline (NAP) that identifies primary signals and annotations; re-labels polysomnograms using canonical labels; and provides a standardized “NSRR” version of data that has been re-referenced and re-sampled to a common standard. The pipeline includes a series of semi-automated checks on incoming data and outputs a more technically uniform set of signal and annotation files. For example, we employ steps to (1) identify and potentially fix technical issues with EDFs, (2) flag noisy, flat or duplicate signals, (3) check EEG polarities, (4) check the consistency and alignment of stage annotations with the signal data, and potentially fix misaligned staging data, (5) automatically relabel channels and annotations, potentially re-referencing, resampling or rescaling signals as needed, and dropping redundant or undocumented channels, and (6) generate a battery of statistics summarizing sleep macro- and micro-architecture, with a focus on the EEG.

A challenge in analyzing sleep signal data relates to a lack of standards or requirements that could be used to indicate data are of sufficient quality for supporting specific, or a set of broad, applications. The NSRR team prioritizes data modifications aimed at enhancing usability—such as making physical units, sampling rates, file formats, or channel nomenclatures similar between and within studies. This approach deliberately avoids altering specific information content to achieve a particular minimum standard, recognizing that the appropriateness of such standards varies according to the specific research question and analytical methods employed. For example, standards that flag a given recording suitable for one analysis (e.g. examining spectral properties of the stable NREM EEG) may not apply to others (e.g. studying sleep onset or the relationship between sleep and circadian factors). Future work may include developing an array of diagnostic metrics and annotating these for their relative applicability for different purposes. However, ultimately, decisions related to data quality need to be made by the researchers who best understand their specific research questions.

Harmonization Steps

The process of data harmonization focuses on the specification of homogenized phenotypes that can be used to identify and characterize potentially comparable variables abstracted from different datasets, as exemplified by the work of the Trans-Omics for Precision Medicine (TOPMed) initiative [53]. While we were able to utilize resources provided by the TOPMed and BioData Catalyst projects to harmonize a range of non-sleep variables in NSRR datasets (including age, sex, race, ethnicity, smoking status, body mass index, and blood pressure), we recognized that the inherent complexity of device-based sleep data would make it difficult to develop integrated functions capable of accurately harmonizing sleep research phenotypes [54, 55]. Towards that end, we engineered a unique approach to the harmonization of polysomnography and polygraphy variables that leveraged the degree of specificity afforded by our compositional terminology. This approach progressed through the following iterative stages, as exemplified by harmonization efforts for sleep-disordered breathing variables:

  1. Specification of target phenotypes—Candidate phenotypes were reviewed to identify commonly used terms (e.g. the AHI), as supported by their citation in published guidelines and use in the research literature.

  2. Characterization of heterogeneity—Data generation and acquisition processes were reviewed to determine which study-level and variable-level metadata elements contributed most significantly to the heterogeneity between related but distinct target phenotypes. Potential sources of heterogeneity included sleep acquisition procedures, and for hypopnea and apnea terms, included (1) airflow reduction thresholds, (2) oxygen desaturation thresholds, and (3) the presence or absence of arousal(s).

  3. Refinement of target phenotypes—Practical considerations prompted us to limit the degree of granularity required to specify target phenotypes. For example, although we considered basing definitions on the four level AASM classification of sleep apnea monitoring devices, we categorized the sleep device types into those that include or do not include EEG data. We limited definitions of thresholds of flow reduction to levels that could be mapped to specific AASM guidelines. By doing so we were able to identify 13 permutations of sleep-disordered breathing events by combining study types, flow reduction thresholds, and event definitions at 3% and 4% oxygen desaturation thresholds that we were able to consolidate to generate 7 AHI phenotypes and 3 REI phenotypes.

  4. Mapping compositional tags to target phenotypes—We used our compositional terminology scheme to assign metadata tags to each phenotype to generate harmonized terms. Checks were conducted to confirm that each mapped compositional tag corresponded to a mutually exclusive AHI or REI phenotype (Supplementary Figure S2).

  5. Identification of candidate variables—Queries were conducted using harmonized terms to identify candidate variables in each dataset based on mapped compositional tags originally assigned during the ingestion and curation of each dataset added to the NSRR.

  6. Generation of harmonized variables—Each retrieved candidate variable was further evaluated to determine the degree to which it matched the specification of a harmonized term. When a candidate variable was deemed to be an appropriate match, a new version marked as a harmonized variable was added to the dataset with a link to the harmonized term that could be used to trigger a cross-cohort query to identify similarly harmonized variables present in other datasets.

We conduct this review on a regular basis in the course of processing and curating datasets added to the NSRR. External investigators may also follow this approach for determining if selected variables may be candidates for harmonization.
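Stages 4 to 6 above can be sketched as a tag-matching query followed by a mutual-exclusivity check. In the example below the harmonized term names come from the Figure 6 caption, while the tag sets, dataset names, variable names, and function names are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Variable:
    dataset: str
    name: str
    tags: FrozenSet[str]   # compositional tags assigned during curation


# Hypothetical mapping: harmonized term -> tags a candidate must carry
HARMONIZED_TERMS: Dict[str, FrozenSet[str]] = {
    "nsrr_ahi_hp3r_aasm15": frozenset({"ahi", "hp3r", "aasm15"}),
    "nsrr_ahi_hp4u_aasm15": frozenset({"ahi", "hp4u", "aasm15"}),
}


def find_candidates(variables: Iterable[Variable], term: str) -> List[Variable]:
    """Stage 5: variables whose tags include every tag required by the term."""
    required = HARMONIZED_TERMS[term]
    return [v for v in variables if required <= v.tags]


def ambiguous_variables(variables: Iterable[Variable]) -> List[Variable]:
    """Stage 4 check: variables matching more than one harmonized term."""
    return [
        v for v in variables
        if sum(required <= v.tags for required in HARMONIZED_TERMS.values()) > 1
    ]


# Hypothetical catalog entries
catalog = [
    Variable("study_a", "ahi_3pct_arousal", frozenset({"ahi", "hp3r", "aasm15"})),
    Variable("study_b", "ahi_4pct", frozenset({"ahi", "hp4u", "aasm15"})),
    Variable("study_b", "sleep_minutes", frozenset({"sleep_duration"})),
]
```

A match found this way is only a candidate. As stage 6 describes, each retrieved variable still needs to be evaluated against the specification of the harmonized term before a harmonized version is added to the dataset.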

Examples of Data Harmonization in NSRR

The following examples show the results of harmonization efforts to describe variation in sleep metrics across age and gender. Overnight EEG data from a total of 25 678 studies (14 618 male, 11 060 female), ages 2.5–90 years, were reprocessed using the Luna pipeline, harmonizing channel labels and polarity, removing artifacts, and resampling at standard rates. Figures 5a and b show the clear reduction in N3 sleep density and increase in sleep fragmentation index across age, and evident gender differences.

Figure 5.

(a) and (b) Sleep architecture across the lifespan and by gender in NSRR. (a) Stage N3 density (minutes of N3 sleep divided by total sleep period time, SPT); (b) sleep fragmentation index. Data from 26 673 individuals selected from the NSRR with polysomnography data, aged 2.5–90 years (57% male), from 13 cohorts (APPLES, CCSHS, CFS, CHAT, MESA, MNC, MrOS, MSP, NCHSDB, SHHS, SOF, STAGES, and WSC). Blue: male. Red: female.

In contrast, Figure 6 shows the results of efforts to harmonize the AHI. Data mapping efforts allowed unambiguous assignment of specific definitions across key variations in AHI values, demonstrating that at any age, AHI values are highest when the 1999 Chicago criteria are applied (including hypopneas with a 50% reduction in amplitude with a 3% desaturation or arousal), and lowest for the AASM 2015 definition (which requires a 30% amplitude reduction and 4% desaturation to accompany hypopneas). Unlike the approach to analysis of the EEG data where the raw data were directly reprocessed, the AHI metrics were based on events that were annotated by the data contributors. In many datasets, events were annotated with a restricted number of features (e.g. hypopneas only identified for a fixed desaturation), limiting the ability to generate alternative metrics that mapped to a single common definition, or to key definitions used over time. Future harmonization will benefit from directly reprocessing the raw respiratory signals and applying standardized automated algorithms to all datasets.

Figure 6.

Variations of alternative mapped apnea hypopnea indices, by age, n = 18 287. Mean AHI values by age group, according to four mapped alternative definitions of the AHI, using data from 13 cohorts (ABC, APPLES, BestAIR, CFS, CHAT, HomePAP, MESA, MrOS, MSP, NuMom2b, SHHS, SOF, and WSC): nsrr_ahi_chicago1999: apnea-hypopnea index (all apneas + hypopneas with > 50% flow reduction or discernible flow reduction with >=3% desat or arousal) per hour of sleep. Harmonized by the NSRR team. The definition of hypopnea events is consistent with the following clinical guidelines: American Academy of Sleep Medicine (AASM) Chicago 1999 standard. nsrr_ahi_hp3r_aasm15: Apnea-Hypopnea Index (all apneas + hypopneas with >=30% nasal cannula [or alternative sensor] reduction and >= 3% oxygen desaturation or with arousal) per hour of sleep. Harmonized by the NSRR team. The definition for hypopneas is consistent with the following clinical guidelines: (1) American Academy of Sleep Medicine (AASM) 2007 Manual (2012 update) (recommended), and (2) American Academy of Sleep Medicine (AASM) 2015 (recommended). nsrr_ahi_hp4r: Apnea-Hypopnea Index (all apneas + hypopneas with >= 4% oxygen desaturation or with arousal) per hour of sleep. Harmonized by the NSRR team. nsrr_ahi_hp4u_aasm15: Apnea-Hypopnea Index (all apneas + hypopneas with >=30% nasal cannula [or alternative sensor] reduction with >= 4% oxygen desaturation) per hour of sleep. Harmonized by the NSRR team. The definition of hypopnea events is consistent with the following clinical guidelines: (1) AASM 2012 update (alternative) and (2) AASM 2015 (acceptable).
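To show how the same scored events can yield different AHI values, the sketch below applies three of the hypopnea rules described in the caption above to a list of annotated events. The event structure, field names, and function names are assumptions for illustration; the thresholds follow the caption text, and the Chicago 1999 rule is left out because its wording does not reduce to a single threshold.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class RespEvent:
    kind: str                        # "apnea" or "hypopnea"
    flow_reduction_pct: float = 0.0  # reduction in the flow signal
    desat_pct: float = 0.0           # associated oxygen desaturation
    arousal: bool = False            # whether an arousal accompanied the event


# Hypopnea inclusion rules, keyed by the label suffix used in the caption.
HYPOPNEA_RULES: Dict[str, Callable[[RespEvent], bool]] = {
    "hp3r_aasm15": lambda e: e.flow_reduction_pct >= 30
    and (e.desat_pct >= 3 or e.arousal),
    "hp4u_aasm15": lambda e: e.flow_reduction_pct >= 30 and e.desat_pct >= 4,
    "hp4r": lambda e: e.desat_pct >= 4 or e.arousal,
}


def ahi(events: Iterable[RespEvent], sleep_hours: float, definition: str) -> float:
    """All apneas plus qualifying hypopneas, per hour of sleep."""
    if sleep_hours <= 0:
        raise ValueError("sleep_hours must be positive")
    rule = HYPOPNEA_RULES[definition]
    count = sum(
        1 for e in events
        if e.kind == "apnea" or (e.kind == "hypopnea" and rule(e))
    )
    return count / sleep_hours
```

This only works when each event carries all of the features the rules test. As noted above, many contributed datasets annotate hypopneas against a single fixed desaturation threshold, in which case alternative definitions cannot be recomputed from the annotations alone.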

Needs and Future Directions

The impact of the data, tools, and outreach efforts of the NSRR on sleep and circadian research is evident from its support of thousands of researchers across a wide spectrum of backgrounds and from around the world, its contributions to hundreds of manuscripts, its role in the development of numerous novel sleep scoring algorithms and scientific discovery of novel sleep predictors, and its support of early as well as later stage investigators who have accessed NSRR data to generate preliminary data for grant applications or have used NSRR data as primary data for academic purposes. There are, however, several areas that can be enhanced:

  • Education—Informatics and data sharing policies are each rapidly evolving. Ongoing education and support of a range of stakeholders is needed to ensure there is understanding of the value and approaches for collecting, archiving, labeling, sharing, and analyzing the rich sleep and circadian data increasingly generated by sleep laboratories and research studies.

  • Data coverage—While the NSRR continues to ingest data and is expanding to include data from animal and circadian study designs, there are data gaps. Notably, the data within NSRR are limited to those elements that contributors are willing to share. Some data are not shared due to proprietary concerns by contributors; in other cases, data other than the sleep data had been made available to other repositories (e.g. BioLINCC) and are only accessible by securing permissions for those repositories to cross-link with the data within the NSRR. There are continued challenges in harmonizing sleep-related data that are collected without strong standardization. The harmonization work can be labor intensive (e.g. relying on vetting of potentially harmonizable variables to identify common terms or using compositional terminology components to generate accurate and matchable labels). Implementation of NIH’s Data Management and Sharing Policy, which requires the scope and format of data sharing to be described as part of the grant submission process, should both facilitate and incentivize the sharing of larger amounts of data. The NSRR is in the process of depositing data into NHLBI’s BioData Catalyst repository, which will make cross-linking with other data, including genetic data, in overlapping cohorts easier. Ultimately, the constraints on big data opportunities for the sleep and circadian field relate to availability of sleep and circadian data linked to the broad range of social determinants of health, clinical, outcome, and molecular data needed to drive transformative science. NIH and other sponsors need to periodically assess the availability of such data and invest in prospective data collection to fill critical gaps.

  • Lack of standards/burden of data harmonization and mapping procedures—This paper discussed multiple sources of heterogeneity inherent in current clinical and research data collection protocols. While the NSRR developed innovative approaches to reducing or addressing this heterogeneity, the sleep and circadian fields need to push for more rigorous up-front standardization of data collection and archival procedures, event and disorder definitions, and metadata, that in aggregate will simplify data sharing efforts and improve the quality of harmonized datasets. Adoption and dissemination of core sleep and circadian Common Data Elements will require collaboration of domain experts, informaticians, and clinicians in the development of standards, with ongoing work to ensure that such standards are updated and used appropriately. Adoption of standards needs to expand beyond event definitions and disease definitions to include standardization across multiple levels of data collection, including nomenclature, pre-processing steps, numerical standards and data, and metadata formats. Professional societies may recommend that research data utilize a set of standards that includes machine readable formats. Societies representing clinical sleep medicine can make sleep laboratory accreditation dependent on use of data standards to allow data to be readily queried as well as shared (after appropriate de-identification).

  • Open-source tools—reproducible research rests on a pillar of shared, documented, and robust computational tools. There needs to be clarity regarding how to balance issues related to intellectual property, commercialization, and scientific rigor. Requiring investigators to share code (or predictive models) will improve both the rigor and transparency of research. However, there is wide variability in how code is documented and updated. Often there is no ongoing support to ensure the developers can respond to user questions or identified bugs.

  • Emergence of “profit-based” or restricted data repositories—Finally, much of the rapid rise in web-based commercial entities (Google, Apple, etc.) is based on the commercial value of aggregating and leveraging individual-level data. Sleep data have attracted commercial interest due to the potential for those data to inform: (1) the targeted development of products aimed at a $40 billion “sleep-health/wellness” market; (2) development of commercial algorithms for improved quantification of sleep-related parameters and sleep-related devices; and (3) development of sleep-focused interventions. In addition to the ethical and privacy concerns related to the commercial use of individual-level data, these commercial interests may drive the development of restricted sleep data repositories that may compete with more generally accessible repositories and limit the community’s ability to engage in open discovery and competition. Non-commercial but restricted access to aggregated data also occurs when academic groups prevent data sharing to protect their own intellectual property, and similarly constrains the potential of “big” sleep data analyses as an open, community knowledge source.

Summary

While there are many challenges for sleep and circadian data sharing and harmonization, work by the NSRR suggests the utility of several novel approaches and demonstrates that heterogeneous and valuable data can be readily shared to support a wide range of research and algorithmic development. Moreover, much of the work related to data harmonization can inform data sharing, metadata, and Common Data Element development in other domains. With further data sharing and standardization, the field will move closer to its vision of utilizing large datasets and powerful tools including machine learning to enhance scientific discovery and productivity, statistical power, rigor, and reproducibility, to ensure that the discoveries for sleep and circadian science are applicable to diverse populations. While community-oriented efforts such as those pioneered by the NSRR progress, there also will be a need to carefully consider the roles of restricted commercial and non-commercial efforts in complementing or competing with “open” data sharing efforts, including NIH’s roles in supporting these efforts, and the types of permissions and safeguards needed to ensure ethical, privacy and intellectual property needs are appropriately addressed.

Supplementary Material

zsae088_suppl_Supplementary_Figures_S1-S48_Tables_S1

Acknowledgments

This work was supported by National Heart Lung and Blood Institute grants NHLBI R24 HL114473 and contract 75N92019C00011 to SR.

Contributor Information

Ying Zhang, Division of Sleep Medicine and Circadian Disorders, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA.

Matthew Kim, Division of Endocrinology, Diabetes and Hypertension, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA.

Michael Prerau, Division of Sleep Medicine and Circadian Disorders, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA.

Daniel Mobley, Division of Sleep Medicine and Circadian Disorders, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA.

Michael Rueschman, Division of Sleep Medicine and Circadian Disorders, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA.

Kathryn Sparks, Division of Sleep Medicine and Circadian Disorders, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA.

Meg Tully, Division of Sleep Medicine and Circadian Disorders, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA.

Shaun Purcell, Department of Psychiatry, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA.

Susan Redline, Division of Sleep Medicine and Circadian Disorders, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA.

Funding

Financial Disclosure: SR reports consulting fees from Jazz Pharmaceuticals, Eli Lilly, and ApniMed Inc unrelated to this manuscript, and grants from NIH that supported this work.

Non-financial Disclosure: none.

Data Availability

The data underlying this article are available in the NSRR, at sleepdata.org. The individual-level data are available through application at the NSRR.

References

  • 1. Margolis R, Derr L, Dunn M, et al. The National Institutes of Health’s Big Data to Knowledge (BD2K) initiative: capitalizing on biomedical big data. J Am Med Inform Assoc. 2014;21(6):957–958. doi: 10.1136/amiajnl-2014-002974 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Paten B, Diekhans M, Druker BJ, et al. The NIH BD2K center for big data in translational genomics. J Am Med Inform Assoc. 2015;22(6):1143–1147. doi: 10.1093/jamia/ocv047 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Ahalt S, Avillach P, Boyles R, et al. Building a collaborative cloud platform to accelerate heart, lung, blood, and sleep research. J Am Med Inform Assoc. 2023;30(7):1293–1300. doi: 10.1093/jamia/ocad048 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. National Institutes of Health. Final NIH Policy for Data Management and Sharing. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html. Published October 29, 2020. Accessed January 31, 2024. [Google Scholar]
  • 5. Hale L, Troxel W, Buysse DJ.. Sleep health: an opportunity for public health to address health equity. Annu Rev Public Health. 2020;41:81–99. doi: 10.1146/annurev-publhealth-040119-094412 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Elgart M, Goodman MO, Isasi C, et al.; Trans-Omics for Precision Medicine (TOPMed) Consortium. Correlations between complex human phenotypes vary by genetic background, gender, and environment. Cell Rep Med. 2022;3(12):100844. doi: 10.1016/j.xcrm.2022.100844 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Prerau MJ, Brown RE, Bianchi MT, Ellenbogen JM, Purdon PL.. Sleep neurophysiological dynamics through the lens of multitaper spectral analysis. Physiology (Bethesda). 2017;32(1):60–92. doi: 10.1152/physiol.00062.2015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Chen S, Redline S, Eden UT, Prerau MJ.. Dynamic models of obstructive sleep apnea provide robust prediction of respiratory event timing and a statistical framework for phenotype exploration. Sleep. 2022;45. doi: 10.1093/sleep/zsac189 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Mehra R, Tjurmina OA, Ajijola OA, et al. Research opportunities in autonomic neural mechanisms of cardiopulmonary regulation: a report from the national heart, lung, and blood institute and the national institutes of health office of the director workshop. JACC Basic Transl Sci. 2022;7(3):265–293. doi: 10.1016/j.jacbts.2021.11.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. May AM, May RD, Bena J, et al.; Osteoporotic Fractures in Men (MrOS) Study Group. Individual periodic limb movements with arousal are temporally associated with nonsustained ventricular tachycardia: a case-crossover analysis. Sleep. 2019;42(11). doi: 10.1093/sleep/zsz165 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Bandyopadhyay A, Goldstein C.. Clinical applications of artificial intelligence in sleep medicine: a sleep clinician’s perspective. Sleep Breath. 2023;27(1):39–55. doi: 10.1007/s11325-022-02592-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Mazzotti DR. Landscape of biomedical informatics standards and terminologies for clinical sleep medicine research: A systematic review. Sleep Med Rev. 2021;60:101529. doi: 10.1016/j.smrv.2021.101529 [DOI] [PubMed] [Google Scholar]
  • 13. Kozhemiako N, Mylonas D, Pan JQ, Prerau MJ, Redline S, Purcell SM.. Sources of variation in the spectral slope of the sleep EEG. eNeuro. 2022;9(5):ENEURO.0094–ENEU22.2022. doi: 10.1523/ENEURO.0094-22.2022 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Grandner MA, Fernandez F-X.. The translational neuroscience of sleep: a contextual framework. Science. 2021;374(6567):568–573. doi: 10.1126/science.abj8188 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Zhang G-Q, Cui L, Mueller R, et al. The national sleep research resource: towards a sleep data commons. J Am Med Inform Assoc. 2018;25(10):1351–1358. doi: 10.1093/jamia/ocy064 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. National Sleep Research Resource—NSRR. https://sleepdata.org/. Accessed January 30, 2023. [Google Scholar]
  • 17. Zotero. List of publications cited the NSRR as a principal resource. https://www.zotero.org/groups/4698159/nsrr_pub/items/WGFGKV3D/library. Accessed January 31, 2024. [Google Scholar]
  • 18. Young T, Palta M, Dempsey J, Skatrud J, Weber S, Badr S.. The occurrence of sleep-disordered breathing among middle-aged adults. N Engl J Med. 1993;328(17):1230–1235. doi: 10.1056/NEJM199304293281704 [DOI] [PubMed] [Google Scholar]
  • 19. Quan SF, Howard BV, Iber C, et al. The sleep heart health study: design, rationale, and methods. Sleep. 1997;20(12):1077–1085. doi: 10.1093/sleep/20.12.1077 [DOI] [PubMed] [Google Scholar]
  • 20. Foley DJ, Monjan AA, Masaki KH, Enright PL, Quan SF, White LR.. Associations of symptoms of sleep apnea with cardiovascular disease, cognitive impairment, and mortality among older Japanese-American men. J Am Geriatr Soc. 1999;47(5):524–528. doi: 10.1111/j.1532-5415.1999.tb02564.x [DOI] [PubMed] [Google Scholar]
  • 21. Rosen CL, Palermo TM, Larkin EK, Redline S.. Health-related quality of life and sleep-disordered breathing in children. Sleep. 2002;25(6):657–666. [PubMed] [Google Scholar]
  • 22. Tishler PV, Larkin EK, Schluchter MD, Redline S.. Incidence of sleep-disordered breathing in an urban adult population: the relative importance of risk factors in the development of sleep-disordered breathing. JAMA. 2003;289(17):2230–2237. doi: 10.1001/jama.289.17.2230 [DOI] [PubMed] [Google Scholar]
  • 23. Kezirian EJ, Harrison SL, Ancoli-Israel S, et al.; Study of Osteoporotic Fractures Research Group. Behavioral correlates of sleep-disordered breathing in older women. Sleep. 2007;30(9):1181–1188. doi: 10.1093/sleep/30.9.1181 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Rosen CL, Auckley D, Benca R, et al. A multisite randomized trial of portable sleep studies and positive airway pressure autotitration versus laboratory-based polysomnography for the diagnosis and treatment of obstructive sleep apnea: the HomePAP study. Sleep. 2012;35(6):757–767. doi: 10.5665/sleep.1870 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Marcus CL, Moore RH, Rosen CL, et al.; Childhood Adenotonsillectomy Trial (CHAT). A randomized trial of adenotonsillectomy for childhood sleep apnea. N Engl J Med. 2013;368(25):2366–2376. doi: 10.1056/NEJMoa1215881 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Redline S, Sotres-Alvarez D, Loredo J, et al. Sleep-disordered breathing in Hispanic/Latino individuals of diverse backgrounds. The Hispanic Community Health Study/Study of Latinos. Am J Respir Crit Care Med. 2014;189(3):335–344. doi: 10.1164/rccm.201309-1735OC [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Gottlieb DJ, Punjabi NM, Mehra R, et al. CPAP versus oxygen in obstructive sleep apnea. N Engl J Med. 2014;370(24):2276–2285. doi: 10.1056/NEJMoa1306766 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Song Y, Blackwell T, Yaffe K, Ancoli-Israel S, Redline S, Stone KL; Osteoporotic Fractures in Men (MrOS) Study Group. Relationships between sleep stages and changes in cognitive function in older men: the MrOS Sleep Study. Sleep. 2015;38(3):411–421. doi: 10.5665/sleep.4500 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Chen X, Wang R, Zee P, et al. Racial/ethnic differences in sleep disturbances: the Multi-Ethnic Study of Atherosclerosis (MESA). Sleep. 2015;38(6):877–888. doi: 10.5665/sleep.4732 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Facco FL, Parker CB, Reddy UM, et al. NuMoM2b sleep-disordered breathing study: objectives and methods. Am J Obstet Gynecol. 2015;212(4):542.e1–542127. doi: 10.1016/j.ajog.2015.01.021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Zhao YY, Wang R, Gleason KJ, et al.; on behalf of the BestAIR Investigators. Effect of continuous positive airway pressure treatment on health-related quality of life and sleepiness in high cardiovascular risk individuals with sleep apnea: best apnea interventions for Research (BestAIR) trial. Sleep. 2017;40(4). doi: 10.1093/sleep/zsx040 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Bakker JP, Tavakkoli A, Rueschman M, et al. Gastric banding surgery versus continuous positive airway pressure for obstructive sleep apnea: a randomized controlled trial. Am J Respir Crit Care Med. 2018;197(8):1080–1083. doi: 10.1164/rccm.201708-1637LE [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Stephansen JB, Olesen AN, Olsen M, et al. Neural network analysis of sleep stages enables efficient diagnosis of narcolepsy. Nat Commun. 2018;9(1):5229. doi: 10.1038/s41467-018-07229-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi: 10.1038/sdata.2016.18 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Mazzotti DR, Haendel MA, McMurry JA, et al. Sleep and circadian informatics data harmonization: a workshop report from the Sleep Research Society and Sleep Research Network. Sleep. 2022;45(6). doi: 10.1093/sleep/zsac002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Caples SM, Anderson WM, Calero K, Howell M, Hashmi SD.. Use of polysomnography and home sleep apnea tests for the longitudinal management of obstructive sleep apnea in adults: an American Academy of Sleep Medicine clinical guidance statement. J Clin Sleep Med. 2021;17(6):1287–1293. doi: 10.5664/jcsm.9240 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Berry RB, Budhiraja R, Gottlieb DJ, et al.; American Academy of Sleep Medicine. Rules for scoring respiratory events in sleep: update of the 2007 AASM Manual for the scoring of sleep and associated events. deliberations of the sleep apnea definitions task force of the American academy of sleep medicine. J Clin Sleep Med. 2012;8(5):597–619. doi: 10.5664/jcsm.2172 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Ho V, Crainiceanu CM, Punjabi NM, Redline S, Gottlieb DJ.. Calibration model for apnea-hypopnea indices: impact of alternative criteria for hypopneas. Sleep. 2015;38(12):1887–1892. doi: 10.5665/sleep.5234 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Malhotra A, Ayappa I, Ayas N, et al. Metrics of sleep apnea severity: beyond the apnea-hypopnea index. Sleep. 2021;44(7). doi: 10.1093/sleep/zsab030 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Pevernagie DA, Gnidovec-Strazisar B, Grote L, et al. On the rise and fall of the apnea-hypopnea index: A historical review and critical appraisal. J Sleep Res. 2020;29(4):e13066. doi: 10.1111/jsr.13066 [DOI] [PubMed] [Google Scholar]
  • 41. Guilleminault C, Stoohs R, Clerk A, Cetel M, Maistros P.. A cause of excessive daytime sleepiness. The upper airway resistance syndrome. Chest. 1993;104(3):781–787. doi: 10.1378/chest.104.3.781 [DOI] [PubMed] [Google Scholar]
  • 42. Young T, Peppard P, Palta M, et al. Population-based study of sleep-disordered breathing as a risk factor for hypertension. Arch Intern Med. 1997;157(15):1746–1752. doi: 10.1001/archinte.1997.00440360178019 [DOI] [PubMed] [Google Scholar]
  • 43. AASM. Sleep-related breathing disorders in adults: recommendations for syndrome definition and measurement techniques in clinical research. The Report of an American Academy of Sleep Medicine Task Force. Sleep. 1999;22(5):667–689. [PubMed] [Google Scholar]
  • 44. American Academy of Sleep Medicine, ed. The International Classification of Sleep Disorders: Diagnostic and Coding Manual. 2nd ed. American Academy of Sleep Medicine; 2005. [Google Scholar]
  • 45. American Academy of Sleep Medicine, ed. International Classification of Sleep Disorders. 3rd ed. American Academy of Sleep Medicine; 2014. [Google Scholar]
  • 46. Guilleminault C, Tilkian A, Dement WC.. The sleep apnea syndromes. Annu Rev Med. 1976;27:465–484. doi: 10.1146/annurev.me.27.020176.002341 [DOI] [PubMed] [Google Scholar]
  • 47. Kemp B, Värri A, Rosa AC, Nielsen KD, Gade J.. A simple format for exchange of digitized polygraphic recordings. Electroencephalogr Clin Neurophysiol. 1992;82(5):391–393. doi: 10.1016/0013-4694(92)90009-7 [DOI] [PubMed] [Google Scholar]
  • 48. Kemp B, Olivan J.. European data format “plus” (EDF+), an EDF alike standard format for the exchange of physiological data. Clin Neurophysiol. 2003;114(9):1755–1761. doi: 10.1016/s1388-2457(03)00123-8 [DOI] [PubMed] [Google Scholar]
  • 49. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statement: guidelines for reporting observational studies. Int J Surg. 2014;12(12):1495–1499. doi: 10.1016/j.ijsu.2014.07.013 [DOI] [PubMed] [Google Scholar]
  • 50. Cui L, Zeng N, Kim M, et al. X-search: an open access interface for cross-cohort exploration of the National Sleep Research Resource. BMC Med Inform Decis Mak. 2018;18(1):99. doi: 10.1186/s12911-018-0682-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Tran V-A, Johnson N, Redline S, Zhang G-Q.. OnWARD: ontology-driven web-based framework for multi-center clinical studies. J Biomed Inform. 2011;44(Suppl 1):S48–S53. doi: 10.1016/j.jbi.2011.08.019 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Bhattacharyya SB. Introduction to SNOMED CT. 1st ed. 2016. Singapore: Springer; 2015:250. [Google Scholar]
  • 53. Stilp AM, Emery LS, Broome JG, et al. A system for phenotype harmonization in the NHLBI Trans-Omics for Precision Medicine (TOPMed) program. Am J Epidemiol. 2021;190:1977–1992. doi: 10.1093/aje/kwab115 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. TOPMed. NHLBI Trans-Omics for Precision Medicine. https://topmed.nhlbi.nih.gov/. Accessed January 30, 2023. [Google Scholar]
  • 55. BDC. NHLBI BioData Catalyst. https://biodatacatalyst.nhlbi.nih.gov/. Accessed January 30, 2023. [Google Scholar]

Associated Data

Supplementary Materials

zsae088_suppl_Supplementary_Figures_S1-S48_Tables_S1
