Sleep Advances: A Journal of the Sleep Research Society. 2026 Feb 13;7(1):zpag014. doi: 10.1093/sleepadvances/zpag014

Development of a rule-based natural language processing algorithm to extract sleep information in pediatric primary care patients with a sleep diagnosis

Joseph W Sirrianni, Ariana Calloway, Syed-Amad Hussain, Hongfang Liu, Christopher W Bartlett, Mattina A Davenport
PMCID: PMC12920604  PMID: 41725982

Abstract

Study Objectives

The current study employed natural language processing (NLP) to capture multidimensional and transdiagnostic information in pediatric clinical notes. We present a novel, low-resource sleep vocabulary that can be applied to notes to identify pediatric sleep-related mentions automatically.

Methods

Using a combination of existing medical sleep ontologies, interviews with clinicians, and examination of clinical note narratives, we develop a novel vocabulary of pediatric sleep-related terms and phrases that covers technical terms, abbreviations, and colloquial keywords used in describing pediatric sleep health. We compare our vocabulary against a set of manually annotated clinical notes to determine its effectiveness for identifying notes with pediatric sleep-related mentions.

Results

Our vocabulary correctly identified clinical notes with pediatric sleep-related mentions with a recall of 0.992 and a precision of 0.852. Most false positives occurred in notes that either explicitly stated no sleep issues or contained text unrelated to patient sleep health (e.g. medication side effects). Among the text spans annotated as sleep-related mentions, 77.2% include at least one keyword from our vocabulary.

Conclusions

Our vocabulary showed excellent performance for identifying pediatric sleep-related mentions at the clinical note level and decent performance for identifying the specific text containing patient mentions. Our low-resource vocabulary, which can be deployed in almost any compute environment, can serve as a first pass over clinical notes to flag which notes or note sections should be further processed by more advanced models or manual annotation review to identify narrower mentions.

Keywords: pediatrics, artificial intelligence, natural language processing, children, adolescents


Statement of Significance.

Extracting multidimensional and transdiagnostic sleep-related information in clinical notes is an essential next step to improve pediatric learning health systems’ cohort identification and harmonization. Without this step, efforts toward automated surveillance of subthreshold symptoms, monitoring sleep disparities in detection and care among pediatric populations, and developing clinical decision support and treatment platforms are limited. Natural language processing (NLP) has emerged as a tool to capture sleep condition information (e.g. insomnia) among adults, but its implementation in pediatrics is only beginning to emerge. Our preliminary work shows that challenges with noise and identification remain before a rule-based approach can be embedded in the NLP pipeline.

Introduction

It is established that sleep is a critical factor in youth development; however, many youth across the United States report getting inadequate sleep [1, 2]. Although pediatric primary care (PPC) providers gather sleep-related information during clinical encounters, the time-limited context of these encounters means that providers rely heavily on explicit patient/parent complaints and/or specialty sleep care services to confirm clinical pathways for sleep problems and disorders [3]. The current reliance on patient/parent report and time-intensive evaluations (e.g. polysomnography) results in many patients with subthreshold sleep symptoms (e.g. insufficient sleep durations) being missed or under-detected in PPC [4, 5].

An automated surveillance system could alleviate this issue. By identifying sleep-related mentions contained in clinical notes from patient encounters, a system could be developed to recognize patients who would otherwise be overlooked in the siloed and fragmented systemwide pediatric sleep care continuum (e.g. prevention to specialty services) [6]. This could lead to both improved care for individual patients and an enhanced ability to identify specific sleep cohorts for further study. The first step in developing such a system would be to create a component that could automatically identify and collect pediatric sleep-related mentions in their notes for further examination [7–9].

However, identifying pediatric sleep-related mentions in clinical notes is a complex task for two reasons [10, 11]. First, based upon an initial exploration, pediatric sleep-related mentions can occur in a variety of note types across several siloed departments [5]. In addition, sleep screening protocols vary within a learning health system [3, 4]. Therefore, pediatric sleep-related mentions can appear in almost any clinical context. Second, mentions can be expressed in notes using formal clinical terminology (e.g. “obstructive sleep apnea”), informal clinical terminology such as acronyms (e.g. “osa”), and layperson terminology directly quoting a patient (e.g. “trouble breathing while asleep”) [5, 6]. Therefore, the ideal algorithm to identify mentions needs to (1) address the variability of language in mentions and (2) be able to process a large quantity of documentation by note and department type. Deep learning-based natural language processing (NLP) methods have shown strong performance in extracting clinical information from electronic health records, but they remain difficult to scale in low-resource settings [12, 13]. These approaches typically require large annotated datasets, substantial computational resources, and specialized hardware, all of which may be unavailable to under-resourced health systems [7, 10, 11]. Even if the resources can be procured through external vendors, applying such models to large volumes of clinical text can be expensive and time intensive. Consequently, low-resource methods are required for large-scale screening and initial data extraction tasks.

Alternatively, rule-based approaches are computationally efficient and can be applied over millions of notes without requiring specialized hardware or model fine-tuning [5, 6]. Their transparency and ease of use make them more applicable and adaptable across institutional contexts. As such, many expert-crafted rule-based NLP systems remain widely used [14–16]. Prior work has used rule-based systems to identify adult sleep-related information in clinical notes. For instance, a rule-based text mining algorithm was developed for identifying adult sleep-related information in primary care notes [17]. Past work used a combination of ten sleep-related keywords and structured data rules to identify patients with insomnia [18]. However, to our knowledge, there is only one publicly available vocabulary to identify adult sleep-related mentions across multiple concepts (e.g. snoring, sleep quality, and daytime sleepiness) [18].

In this exploratory analysis, we propose a rule-based approach to identify clinical notes containing pediatric sleep-related mentions. Our keyword bank incorporates terms across 30 pediatric sleep concepts (e.g. nocturnal enuresis and bedtime struggles). These terms were derived from the Peds B-SATED framework, clinical ontologies, our hospital’s clinical notes, and qualitative feedback from providers [19, 20]. We apply our keyword bank to a set of annotated clinical notes to identify which notes contain pediatric sleep-related mentions and compare our results with an adult-focused keyword bank [21].

Methods

We manually annotated 300 well-child visit notes. These were randomly sampled from a cohort of patients ages 2–18 years with at least one sleep diagnosis who received PPC between January 2018 and December 2023. This cohort also received care in at least one of the following frontline departments: school-based health, behavioral health, and/or healthy weight clinic. Within our institution’s PPC department, PPC providers are mandated to screen for sleep during well-child visits. In addition, we defined a sleep-diagnosed patient as a patient with at least one of the ICD-10 diagnoses listed in Supplementary Table S1. This cohort was selected to ensure our dataset would have a sizable number of pediatric sleep-related mentions, since subthreshold sleep symptoms may be documented inconsistently across the institution’s sleep care continuum. Patient demographic information is reported in Table 1.

Table 1.

Demographic information of patients in annotation cohort

Category n %
Patients 297 100
Race
Hispanic/Latino 39 13
Non-Hispanic Black 95 32
Non-Hispanic White 128 43
Non-Hispanic Multiracial 23 8
Non-Hispanic Other 12 4
Insurance
Private only 48 16
Public only 196 66
Public and private 51 17
Other 2 1
Notes 300 100
Department
Behavioral health 106 35
Primary care 174 58
Healthy weight 14 5
School based 6 2
Age at encounter (years) [mean (std)] 10.44 (4.50)

We developed an annotated dataset of clinical notes that contained various types of pediatric sleep mentions. Two annotators annotated the 300 notes for six dimensions of sleep and three related clinical concepts, for a total of nine mention types. The dimensions and concepts were: (1) Sleep Behavior Dimension, (2) Sleep Satisfaction Dimension, (3) Alertness and Daytime Sleepiness Dimension, (4) Sleep Timing Dimension, (5) Sleep Efficiency Dimension, (6) Sleep Duration Dimension, (7) Sleep Medication Mentions, (8) Sleep Disorder Mentions, and (9) Sleep Intervention Mentions. The annotations were performed at the text span level, meaning that specific words and sentences were annotated. Annotator disagreements were resolved by consensus between annotators. Their overall agreement across all dimensions was 0.6, using Cohen’s kappa, which indicates moderate agreement [22]. Agreement for the individual mention classes varied between 0.5 and 0.74 (see Supplementary Table S2 for a full breakdown).
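As a reminder of how the agreement statistic above is defined, the following is a minimal sketch of Cohen's kappa for two annotators. It is not the authors' code: the function name `cohens_kappa` and the ten example labels are hypothetical, and the sketch assumes one categorical label per item, whereas the study annotated text spans and may have aligned them differently.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten spans (1 = sleep mention, 0 = no mention).
annotator_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
annotator_2 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)  # observed 0.8, chance 0.5
```

In this toy example the annotators agree on 8 of 10 items while chance agreement is 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5), or about 0.6.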

For each note, we assigned an overall label of positive if the note contained at least one pediatric sleep mention and negative otherwise. This work was approved by the NCH IRB.

We developed a novel keyword bank, called Davenport and Sirrianni Expanded (DSE), which builds upon our prior vocabulary by leveraging four different sources [5]. First, we used the Peds B-SATED framework and past pediatric sleep literature as the foundation for determining our initial pediatric sleep-related terminology. Second, we consulted several clinical ontologies, including the Medical Subject Headings (MeSH) thesaurus, UMLS, SNOMED-CT, and LOINC, for technical terms related to sleep health. Third, we consulted with clinicians at our hospital about terms they typically use in their documentation, along with any common abbreviations. Lastly, we utilized our in-house clinical note search engine, named DeepSuggest, to discover terms similar to those identified from the ontologies and clinicians, based on word-embedding similarities derived from our historic clinical notes [5].

During the keyword development process, we assigned the keywords to 37 distinct high-level categories. Some keyword bank concepts often appeared in clinical note text outside of the intended context (i.e. pediatric sleep). For example, seven concepts (e.g. wheezing) often appeared for patients with asthma outside the context of pediatric sleep. We created a second-tier category in the DSE vocabulary that included these seven concepts. Thus, our vocabulary had two tiers: tier 1 concepts, which were used in our note identification rule set, and tier 2 concepts, which were not used for identification due to their ambiguity. The DSE keyword bank, its associated high-level categories, and the category tier assignments are shown in Table 2. We compared two keyword banks in our analysis: the Sivarajkumar et al. (SEA) keyword bank (27 words) and our DSE keyword bank (359 words) [5, 6]. We evaluated the predictions from two keyword prediction models, one using the SEA keyword bank and one using our DSE keyword bank, and compared these predictions to our ground truth annotations in Table 3.
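The two-tier rule described above can be illustrated with a short sketch. This is an illustration under stated assumptions, not the released implementation: it uses only the standard-library `re` module, a handful of entries excerpted from the keyword bank, and hypothetical names (`KEYWORD_BANK`, `matched_categories`, `is_sleep_note`). Word-boundary matching and case-insensitivity are also assumptions of the sketch.

```python
import re

# Hypothetical excerpt of a two-tier keyword bank (a few entries only).
KEYWORD_BANK = {
    "Snoring": {"tier": 1, "keywords": ["snore", "snores", "snoring"]},
    "Napping": {"tier": 1, "keywords": ["nap", "naps", "napping"]},
    "Obstructive sleep apnea": {"tier": 1, "keywords": ["sleep apnea", "osa"]},
    "Wheezing": {"tier": 2, "keywords": ["wheeze", "wheezing", "wheezes"]},
}


def compile_category(keywords):
    # Longest-first alternation with word boundaries, so "nap" does not hit "napkin".
    ordered = sorted(keywords, key=len, reverse=True)
    body = "|".join(re.escape(keyword) for keyword in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


PATTERNS = {
    name: (entry["tier"], compile_category(entry["keywords"]))
    for name, entry in KEYWORD_BANK.items()
}


def matched_categories(note_text):
    """Return {category: tier} for every category with at least one keyword hit."""
    return {
        name: tier
        for name, (tier, pattern) in PATTERNS.items()
        if pattern.search(note_text)
    }


def is_sleep_note(note_text):
    """Flag a note only when at least one tier 1 category matches."""
    return any(tier == 1 for tier in matched_categories(note_text).values())
```

With these entries, a note such as "Wheezing with exercise; albuterol as needed." matches only a tier 2 category and is therefore not flagged, while "Mom reports loud snoring and daytime naps." is flagged through its tier 1 matches.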

Table 2.

Davenport and Sirrianni Expanded (DSE) keyword bank

High-level category Keywords Regular expressions
Tier 1 concepts
Sleep “sleep,” “sleeping,” “sleeps,” “slept”
Insomnia “insomnia,” “difficulty falling or staying asleep,” “trouble falling or staying asleep”
Restless leg syndrome “restless leg,” “leg jerks during sleep”
Periodic limb movement disorder “periodic limb movement,” “periodic limb movements,” “leg movement during sleep,” “leg movements during sleep,” “arm movement during sleep,” “arm movements during sleep,” “limb movement,” “limb movements”
Obstructive sleep apnea “obstructive sleep apnea,” “sleep apnea,” “sleep disordered breathing,” “apnea,” “osa,” “stops breathing at night,” “gasps at night,” “gasping at night,” “short of breath at night,” “unusual breathing patterns at night,” “sdb,” “breathing pauses”
Nocturnal enuresis “nocturnal enuresis,” “bedwetting,” “nighttime urinary incontinence”
Narcolepsy “narcolepsy,” “cataplexy,” “paroxysmal sleep,” “narcoleptic,” “gelineau’s syndrome”
Hypersomnia “hypersomnia,” “sleeps too much,” “slept too much,” “excessive sleep,” “excessive sleeping,” “hypersomnolence,” “long sleep,” “sleeps a lot,” “slept a lot,” “sleeping a lot,” “sleeps more,” “sleeping more,” “oversleep,” “oversleeps,” “oversleeping”; regular expressions: “up to \d + hours,” “more than \d + hours”
Parasomnia “parasomnia,” “sleep paralysis,” “night terrors,” “confusional arousals,” “sleep terror,” “sleep terrors”
Sleep-related movement disorder “sleepwalking,” “sleepwalk,” “sleepwalks,” “sleepwalked,” “acting out dreams,” “acting out dream,” “sleep arousal disorder,” “sleep wake transition disorder,” “sleep talking,” “sleep talk,” “sleep talks,” “sleep head banging,” “sleep related movement disorder”
Bruxism “bruxism,” “childhood sleep bruxism,” “nocturnal bruxism,” “sleep bruxism,” “nocturnal teeth grinding disorder,” “teeth grinding at night,” “teeth grinding while sleep,” “teeth grinding,” “grinds teeth,” “grinding teeth”
Circadian rhythm disorder “circadian rhythm,” “circadian rhythm disorder,” “delayed sleep phase syndrome,” “delayed sleep,” “sleeps late,” “sleeping late,” “slept late,” “sleeps early,” “sleeping early,” “slept early,” “delayed bedtime,” “bedtime delayed,” “delayed sleep phase,” “advanced sleep phase,” “sleep wake schedule disorder,” “shift worker sleep disorder,” “nonorganic sleep wake cycle disorder,” “non 24 hour sleep wake disorder”
Sleep locations “sleeps on couch,” “sleeps on a couch,” “sleeping on a couch,” “sleep in bed,” “sleeps in bed,” “sleeping in a bed,” “sleeps in bedroom,” “sleeping in a bedroom,” “sleeps on bus,” “sleeping on the bus,” “sleeps in car,” “sleeping in the car,” “sleeps in class,” “sleeping in class,” “naps on bus,” “naps on the bus,” “naps at school,” “naps in class,” “naps in car”
Sleepiness “sleepiness,” “sleepy,” “sleepier,” “drowsy,” “drowsier,” “drowsiness,” “somnolence,” “excessive sleepiness during the day,” “sleeps during the day,” “doze,” “dozes,” “dozing,” “dozed,” “drowsiness,” “falls asleep in class,” “staying awake”
Fatigue “tired,” “fatigue,” “fatigued,” “low energy,” “low-energy,” “no energy”
Sleep schedule “bedtime,” “waketime,” “bedtime routine,” “nighttime routine,” “morning routine,” “inconsistent bedtime,” “sleep starts,” “goes to bed,” “wake,” “wakes,” “waking,” “wakes up at,” “school starts at,” “sleeps in,” “gets on bus at,” “bedtime schedule,” “overslept”
Falling asleep “falling asleep,” “sleep latency,” “difficulty getting to sleep,” “sleep onset latency,” “difficulty falling asleep,” “trouble falling asleep,” “up at night,” “staying up,” “stays up”
Difficulty waking “difficulty waking,” “inability to wake,” “wakefulness”
Awakenings “early morning waking,” “early morning awakening,” “early waking,” “wakes early,” “wakes up early,” “waking up early,” “difficulty staying asleep,” “trouble staying asleep,” “awakening,” “awakenings,” “nighttime awakening,” “nighttime awakenings,” “broken sleep,” “waking up,” “wakes up,” “night wake,” “often awake,” “awakening early,” “up during night,” “waking up in the middle”
Sleep duration “sleep deprivation,” “insufficient sleep,” “inadequate sleep,” “sleep insufficiency,” “sleep insufficiencies,” “sleep debt,” “sleep duration,” “short sleep,” “short sleeping,” “lack of adequate sleep,” “not getting enough sleep,” “sleep deficit,” “getting enough sleep,” “sleep quantity,” “total sleep time,” “sleepless,” “sleeplessness,” “no sleep,” “inability to sleep,” “unable to sleep”; regular expressions: “less than \d + hours,” “sleep \d + hours”
Sleep quality “poor sleep,” “poor sleep pattern,” “sleep quality,” “sleep disorder,” “sleep problem,” “sleeping problems,” “trouble sleep,” “trouble sleeping,” “problem sleeping,” “sleep issue,” “sleep issues,” “sleep difficulties,” “sleep difficulty,” “difficulty sleeping,” “difficulties sleep,” “difficulties sleeping,” “problems with sleeping,” “impaired sleep,” “fair sleep quality,” “bad sleep quality,” “tosses and turns in sleep”
Restless sleep “restless sleep,” “sleep restless,” “restless sleeping”
Snoring “snore,” “snores,” “snoring,” “snoring symptoms”
Use of medication or supplements to aid sleep “sleep aid,” “sleeping aids,” “melatonin,” “taking for sleep,” “sleeping pills,” “sleep supplements,” “hypnotics,” “bendryl,” “tylenol pm,” “nyquil,” “chamomile tea,” “lavender,” “valerian,” “atarax,” “tenex,” “clonidine,” “diazepam,” “clonazepam,” “chloral hydrate,” “ambien,” “sonata,” “tricyclics,” “ssris,” “trazadone,” “remeron,” “phenobarbital,” “risperdal,” “topamax”
Sleep disturbances “waking during night,” “wakes up at night,” “waking up a night,” “sleep disturbance,” “sleep disturbances,” “disturbance in sleep,” “disturbances in sleep,” “disturbed sleep,” “interrupted sleep,” “interrupting sleep,” “sleep disturbed,” “sleep pattern disturbance,” “sleep disturbance,” “sleep disturbances,” “disturbance in sleep,” “can’t sleep,” “can not sleep,” “can not sleep at all,” “can’t sleep at all,” “sleep fragmentation,” “fragmented sleep,” “trouble sleeping,” “nocturnal agitation”; regular expressions: “\w + keeping \w + up”
Napping “nap,” “naps,” “napping”
Sleep hygiene “sleep hygiene,” “sleep habit,” “sleep habits,” “sleeping habit,” “sleeping habits,” “excessive screentime,” “uses screens at night,” “using screens at night,” “eats late,” “eating late,” “uses electronics at night,” “using electronics at night,” “conflict at bedtime,” “plays at night,” “caffeine”; regular expressions: “doing \w + at night”
Dreams “nightmare,” “nightmares,” “dream,” “dreams,” “dreaming,” “bad dreams,” “bad dream,” “vivid dreams,” “vivid dreaming,” “vivid dream”
Bedtime struggles “bedtime struggle,” “bedtime struggles,” “bedtime resistance,” “bedtime battle”
Surgery “tonsillectomy,” “adenoidectomy”
Tier 2 concepts
Wheezing “wheeze,” “wheezing,” “wheezes”
Hyperactive “hyper,” “hyperactive,” “hyperactive behavior,” “hyperactive behaviors”
Dizziness “dizziness”
Daytime mood “irritable,” “grouchy,” “irritability,” “frustrated,” “agitated”
Tense “tense sleep,” “tense sleeping,” “trouble winding down at night”
Inattention “inattentive,” “troubling focusing,” “poor focus,” “difficulty focusing,” “alertness,” “staying alert”
Nighttime anxiety “worries at night,” “worry at night,” “worrying at night,” “anxious at night,” “overthinks at night,” “thoughts at night,” “thinking at night,” “thoughts at bedtime,” “thinking at bedtime,” “anxiety at bedtime”
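
A few rows of Table 2 list regular expressions in addition to literal keywords. The sketch below shows one way such entries could be applied with Python's `re` package. It is illustrative only: the table prints the patterns with a space before the quantifier (e.g. “\d + hours”), and the sketch assumes the intended form is `\d+` so that multi-digit durations match; the names `REGEX_ENTRIES` and `regex_hits` are hypothetical.

```python
import re

# Assumption: "\d+" (no space before "+") is the intended quantifier; the table
# prints these patterns as "\d + hours".
REGEX_ENTRIES = {
    "Sleep duration": [r"less than \d+ hours", r"sleep \d+ hours"],
    "Hypersomnia": [r"up to \d+ hours", r"more than \d+ hours"],
}

COMPILED = {
    category: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for category, patterns in REGEX_ENTRIES.items()
}


def regex_hits(note_text):
    """Return (category, matched text) pairs for every regular-expression hit."""
    hits = []
    for category, patterns in COMPILED.items():
        for pattern in patterns:
            for match in pattern.finditer(note_text):
                hits.append((category, match.group(0)))
    return hits
```

For example, `regex_hits("Patient gets less than 6 hours on school nights.")` returns one hit in the sleep duration category, with the matched text "less than 6 hours".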

Table 3.

Confusion matrix for the DSE and SEA vocabularies and their precision, recall, and F1-scores

SEA vocabulary
                   Predicted positive   Predicted negative   Total
Actual positive    209                  35                   244
Actual negative    35                   21                   56
Total              244                  56                   300

DSE vocabulary
                   Predicted positive   Predicted negative   Total
Actual positive    242                  2                    244
Actual negative    42                   14                   56
Total              284                  16                   300

Vocabulary   Precision   Recall   F1-score
SEA          0.857       0.857    0.857
DSE          0.852       0.992    0.917

The software was written in Python and ran in a Linux environment. The code utilized the FlashText library, which implements a variation on the Aho-Corasick algorithm, for string searching and the re package for regular expressions [23]. The DSE vocabulary and code are available at https://github.com/jsirrianni-NCH/rule-based-nlp-sleep-info-pediatric. For our results, we report precision (i.e. the proportion of true positives identified by the keywords out of all instances containing a keyword), recall (i.e. the proportion of true positives identified by the keywords out of all the true positives in the dataset), and F1-score (i.e. the harmonic mean of precision and recall).
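The metric definitions above can be checked against the counts in Table 3 with a few lines of Python. This is a sketch for verification rather than the authors' evaluation code; the function name `precision_recall_f1` is hypothetical, and the counts are read directly from the two confusion matrices.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from note-level counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Counts from Table 3: true positives, false positives, false negatives.
dse = precision_recall_f1(tp=242, fp=42, fn=2)
sea = precision_recall_f1(tp=209, fp=35, fn=35)

dse_rounded = tuple(round(value, 3) for value in dse)  # (0.852, 0.992, 0.917)
sea_rounded = tuple(round(value, 3) for value in sea)  # (0.857, 0.857, 0.857)
```

The rounded values reproduce the precision, recall, and F1-score rows of Table 3.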

Results

Overall, there were 244 clinical notes containing pediatric sleep-related mentions and 56 notes containing no mentions. The DSE model had a total of two false negatives and 42 false positives, while SEA had 35 false negatives and 35 false positives. The DSE model had a much higher recall (0.992 vs 0.857) while having a comparable precision (0.852 vs 0.857). This resulted in a 0.06 difference in F1-score, driven by the increased recall. Table 4 shows the percentage of tagged pediatric sleep-related category spans in the notes that contain at least one keyword from DSE. Across all tags, 77.2% contained at least one keyword from DSE. For each individual tag category, the keyword occurrence ranged from 88.7% (sleep satisfaction) to 54.2% (sleep behavior). “Sleep” was the most commonly occurring keyword, appearing in 80.7% of all notes with a mention; however, it was also the most common keyword in the false positives, appearing in 71.4% of all false positives. The other keywords had much lower false positive rates but occurred in fewer of the notes. A breakdown of the top 10 keywords across the dataset and in false positives is given in Supplementary Tables S3 and S4.

Table 4.

Tagged spans containing a DSE keyword (tier 1) by dimension type

Tag name Total tags Tags containing keyword Tags containing keyword (%) Most frequent keyword (number of tags)
ALL TAGS 1156 892 77.2 Melatonin (101)
Alertness/Daytime sleepiness 143 102 71.3 Fatigue (41)
Sleep satisfaction 124 110 88.7 Sleep (38)
Sleep medications 199 162 81.4 Melatonin (101)
Sleep disorder 204 169 82.8 Insomnia (44)
Sleep intervention 99 83 83.8 Sleep (53)
Sleep timing 87 50 57.5 Bedtime (13)
Sleep behavior 96 52 54.2 Sleep (14)
Sleep efficiency 164 131 79.9 Sleep disturbance (37)
Sleep duration 40 33 82.5 Sleep (18)
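
The span-level percentages in Table 4 are the share of annotated spans that contain at least one tier 1 keyword. A minimal sketch of that calculation follows; the function name `span_coverage`, the example spans, and the three-keyword list are hypothetical, and word-boundary, case-insensitive matching is an assumption of the sketch rather than a documented detail of the study.

```python
import re


def span_coverage(spans, keywords):
    """Percentage of annotated spans that contain at least one keyword."""
    if not spans:
        return 0.0
    ordered = sorted(keywords, key=len, reverse=True)
    body = "|".join(re.escape(keyword) for keyword in ordered)
    pattern = re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)
    covered = sum(1 for span in spans if pattern.search(span))
    return 100.0 * covered / len(spans)


# Hypothetical annotated spans and a tiny keyword list.
example_spans = [
    "takes melatonin nightly",
    "bedtime is around 9 pm",
    "on phone until 2 am",
    "snores loudly",
]
example_keywords = ["melatonin", "bedtime", "snores"]
coverage = span_coverage(example_spans, example_keywords)  # 3 of 4 spans
```

Three of the four example spans contain a keyword, giving 75.0%. The third span illustrates the kind of sleep timing mention a keyword list can miss. Applied to the totals in Table 4, the same ratio is 892 / 1156, or 77.2%.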

Discussion

Rule-based NLP models are easily deployable but lack the contextual understanding needed to determine whether a keyword is used in a clinically valid way [10, 11, 19]. This limitation is evident with terms like “sleep,” which appear in both true positives and false positives due to partial negation (e.g. “patient doesn’t snore often”), speculation (e.g. “patient thinks they might snore”), non-patient references (e.g. “snoring runs in the family”), and other contextual variability, making it difficult to achieve both high recall and high precision. Although large language models can better resolve contextual validity, they are computationally expensive and slow [12, 13]. Therefore, we developed an NLP pipeline that uses a rule-based model as an initial low-resource filter to identify clinical notes likely to contain pediatric sleep-related mentions.
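
The first-pass role described above can be sketched as a two-stage pipeline. Everything in the sketch is an assumption made for illustration: the short `FIRST_PASS` pattern stands in for the full tier 1 vocabulary, and `second_stage` is a hypothetical placeholder for whatever downstream model or manual review would follow.

```python
import re

# Stand-in for the tier 1 vocabulary: a handful of keywords only.
FIRST_PASS = re.compile(
    r"\b(?:sleep|sleeping|sleeps|slept|snore|snores|snoring)\b", re.IGNORECASE
)


def first_pass_filter(notes):
    """Split (note_id, text) pairs into keyword-flagged and skipped notes."""
    flagged, skipped = [], []
    for note_id, text in notes:
        target = flagged if FIRST_PASS.search(text) else skipped
        target.append((note_id, text))
    return flagged, skipped


def run_pipeline(notes, second_stage):
    """Send only flagged notes to the (hypothetical) second-stage reviewer."""
    flagged, _ = first_pass_filter(notes)
    return {note_id: second_stage(text) for note_id, text in flagged}


example_notes = [
    ("n1", "Patient doesn't snore often."),
    ("n2", "Snoring runs in the family."),
    ("n3", "Follow-up for asthma; no new concerns."),
]
flagged, skipped = first_pass_filter(example_notes)
results = run_pipeline(example_notes, second_stage=lambda text: "needs review")
```

Notes n1 and n2 are both flagged even though one is a partial negation and the other a non-patient reference, which is the precision cost discussed above. Only n3 is dropped before the more expensive second stage.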

Several limitations of this study should be noted. First, our dataset was derived from a single pediatric institution, which may bias the vocabulary toward the specific health system level characteristics, documentation styles, templates, and jargon used by providers at this site. Second, by sampling patients with an existing sleep diagnosis, our cohort likely has a higher prevalence of pediatric sleep-related mentions than a general primary care population. Consequently, the performance metrics reported here may not fully generalize to undiagnosed populations or other screening contexts (e.g. dental and specialty clinics) where sleep mentions are sparser. Future research is needed to validate this vocabulary systemwide and across multi-institutional datasets, with different ratios of specialty to general clinics and within general pediatric cohorts to ensure its robustness and broader applicability in diverse clinical settings.

In conclusion, we present a preliminary keyword bank for pediatric sleep-related mentions and demonstrate its use within an efficient note classification system. When compared with prior work, our vocabulary increases recall by 13.5 percentage points (0.992 vs 0.857) while maintaining precision, supporting its role as an initial step toward accurately and efficiently identifying pediatric sleep-related content in clinical notes at scale [21]. Beyond its immediate research utility, the DSE vocabulary has clear potential applications in clinical operations and quality improvement (QI). For example, in QI initiatives, the algorithm could be used to audit well-child visit documentation and quantify how often sleep-related concerns are recorded, thereby identifying gaps in screening and documentation practices [24]. For cohort identification, the tool could surface patients with sleep-related concerns who lack formal ICD-10 diagnoses, enabling more comprehensive registries for population-level sleep health management [25]. Finally, the low-compute nature of the vocabulary makes it well suited for real-time integration into the electronic health record as a first-pass filter that flags charts containing sleep mentions and triggers downstream best practice advisories [26]. Future work will focus on piloting and evaluating these workflows to ensure they are clinically actionable and do not add to provider burden.

Supplementary Material

piag007_Supplemental_Files

Acknowledgments

We would like to acknowledge Tes Abraha for his work on wrangling and pulling the patient data from our institutional electronic health record (EHR).

Contributor Information

Joseph W Sirrianni, Abigail Wexner Research Institute, Nationwide Children’s Hospital, Columbus, OH, United States.

Ariana Calloway, Abigail Wexner Research Institute, Nationwide Children’s Hospital, Columbus, OH, United States.

Syed-Amad Hussain, Abigail Wexner Research Institute, Nationwide Children’s Hospital, Columbus, OH, United States; Department of Pediatrics, The Ohio State University, Columbus, OH, United States.

Hongfang Liu, Department of Health Data Science and Artificial Intelligence, University of Texas Health Science Center, Houston, TX, United States.

Christopher W Bartlett, Abigail Wexner Research Institute, Nationwide Children’s Hospital, Columbus, OH, United States; Department of Pediatrics, The Ohio State University, Columbus, OH, United States.

Mattina A Davenport, Abigail Wexner Research Institute, Nationwide Children’s Hospital, Columbus, OH, United States; Department of Pediatrics, The Ohio State University, Columbus, OH, United States.

Author contributions

Joseph W. Sirrianni (Conceptualization, Formal analysis, Methodology, Writing—original draft [equal]), Ariana Calloway (Formal analysis, Methodology, Project administration, Validation, Writing—original draft [equal]), Syed-Amad Hussain (Methodology, Writing—review & editing [equal]), Hongfang Liu (Conceptualization, Methodology, Supervision, Writing—review & editing [equal]), Christopher Bartlett (Conceptualization, Methodology, Supervision, Writing—review & editing [equal]), and Mattina Davenport (Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing—original draft, Writing—review & editing [equal])

Disclosure statement

Financial disclosure: At the time of this study, M.D.’s time was funded by the National Heart, Lung, and Blood Institute (NHLBI) award number: 1K01HL169493-01.

Non-financial disclosure: None.

Data availability

Patient data are not available to be shared. Vocabulary set and code available at: https://github.com/jsirrianni-NCH/rule-based-nlp-sleep-info-pediatric.

Preprint repositories

This manuscript was previously made available as a preprint in medRxiv (DOI: https://doi.org/10.1101/2025.05.31.25328640).

References

1. El-Sheikh M, Gillis BT, Saini EK, Erath SA, Buckhalt JA. Sleep and disparities in child and adolescent development. Child Dev Perspect. 2022;16(4):200–207. doi: 10.1111/cdep.12465
2. Jesmin SS, Amin I. Addressing the sleep deprivation epidemic in adolescents: findings from the youth risk behavior survey 2021. Am J Health Educ. 2025;56(2):142–151. doi: 10.1080/19325037.2024.2366463
3. Davenport MA, Berkley S, Phillips SR, et al. Association of exposure to interpersonal racism and racial disparities in inadequate sleep risk. J Pediatr. 2025;276:114378. doi: 10.1016/j.jpeds.2024.114378
4. Meltzer LJ, Johnson C, Crosette J, Ramos M, Mindell JA. Prevalence of diagnosed sleep disorders in pediatric primary care practices. Pediatrics. 2010;125(6):e1410–e1418. doi: 10.1542/peds.2009-2725
5. Honaker SM, Meltzer LJ. Sleep in pediatric primary care: a review of the literature. Sleep Med Rev. 2016;25:31–39. doi: 10.1016/j.smrv.2015.01.004
6. Calloway A, Kalra M, Davenport M. 1327 awakening the sleeping giant: a process-based model for sleep care coordination at a pediatric healthcare system. Sleep. 2025;48(Suppl 1). doi: 10.1093/sleep/zsaf090.1327
7. Williamson AA, Powell M, Luberti A, et al. Implementing an electronic health record–integrated pediatric primary care sleep screener. JAMA Netw Open. 2025;8(8):e2525346. doi: 10.1001/jamanetworkopen.2025.25346
8. McQuillan ME, Chernyak Y, Honaker SM. Evidence-based detection, prevention, and behavioral intervention for sleep disorders in integrated care. In: O’Donohue W, Zimmermann M, eds. Handbook of Evidence-Based Prevention of Behavioral Disorders in Integrated Care. Springer, Nature Switzerland; 2021. doi: 10.1007/978-3-030-83469-2_17
9. Minami Y, Kishi A, Ueda HR. Preventive circadian medicine: improving health with sleep checkups. npj Biol Timing Sleep. 2025;2:31. doi: 10.1038/s44323-025-00047-z
10. Mazzotti DR, Haendel MA, McMurry JA, et al. Sleep and circadian informatics data harmonization: a workshop report from the Sleep Research Society and sleep research network. Sleep. 2022;45(6). doi: 10.1093/sleep/zsac002
11. Mazzotti DR. Landscape of biomedical informatics standards and terminologies for clinical sleep medicine research: a systematic review. Sleep Med Rev. 2021;60:101529. doi: 10.1016/j.smrv.2021.101529
12. Sciannameo V, Pagliari DJ, Urru S, et al. Information extraction from medical case reports using OpenAI InstructGPT. Comput Methods Programs Biomed. 2024;255:108326.
13. Samsi S, Zhao D, McDonald J, et al. From words to watts: benchmarking the energy costs of large language model inference. In: Proceedings of the 2023 IEEE High Performance Extreme Computing Conference (HPEC), Boston, MA, USA, 2023, pp. 1–9. doi: 10.1109/HPEC58863.2023.10363447
14. Moosavinasab S, Sezgin E, Sun H, Hoffman J, Huang Y, Lin S. DeepSuggest: using neural networks to suggest related keywords for a comprehensive search of clinical notes. ACI Open. 2021;5(1):e1–e12. doi: 10.1055/s-0041-1729982
15. Zhang Y, Kim M, Prerau M, et al. The National Sleep Research Resource: making data findable, accessible, interoperable, reusable and promoting sleep science. Sleep. 2024;47(7). doi: 10.1093/sleep/zsae088
16. Patra BG, Sharma MM, Vekaria V, et al. Extracting social determinants of health from electronic health records using natural language processing: a systematic review. J Am Med Inform Assoc. 2021;28(12):2716–2727. doi: 10.1093/jamia/ocab170
17. Horner M, Abul-el-Rub N, Mays M, Mazzotti D. Development of a rule-based text mining algorithm to identify sleep complaints in primary care progress notes. Sleep. 2022;45(Supplement_1):A267–A268. doi: 10.1093/sleep/zsac079.607
18. Kartoun U, Aggarwal R, Beam AL, et al. Development of an algorithm to identify patients with physician-documented insomnia. Sci Rep. 2018;8(1):7862. doi: 10.1038/s41598-018-25312-z
19. Davenport MA, Sirrianni JW, Chisolm DJ. Machine learning data sources in pediatric sleep research: assessing racial/ethnic differences in electronic health record–based clinical notes prior to model training. Front Sleep. 2024;3:3. doi: 10.3389/frsle.2024.1271167
20. Meltzer LJ, Williamson AA, Mindell JA. Pediatric sleep health: it matters, and so does how we define it. Sleep Med Rev. 2021;57:101425. doi: 10.1016/j.smrv.2021.101425
21. Sivarajkumar S, Tam TY, Mohammad HA, et al. Extraction of sleep information from clinical notes of Alzheimer’s disease patients using natural language processing. J Am Med Inform Assoc. 2024;31(10):2217–2227. doi: 10.1093/jamia/ocae177
22. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med. 2012;22(3):276–282.
23. Aho AV, Corasick MJ. Efficient string matching: an aid to bibliographic search. Commun ACM. 1975;18(6):333–340.
24. Tan H, Osterman TJ. SmokeBERT and beyond: bridging clinical narratives and structured smoking data to improve lung cancer screening. JCO Clin Cancer Inform. 2025;9(9):e2500350. doi: 10.1200/cci-25-00350
25. Bamikole AM, Fadare RI, Afolalu AS, Alade MI. Applying natural language processing (NLP) to clinical notes for automated detection of PPD risk factors. NIPES J Sci Technol Res. 2025;7(2):4011–4019.
26. Everson J, Nong P, Richwine C. Uptake of generative AI integrated with electronic health records in US hospitals. JAMA Netw Open. 2025;8(12):e2549463. doi: 10.1001/jamanetworkopen.2025.49463


Articles from Sleep Advances: A Journal of the Sleep Research Society are provided here courtesy of Oxford University Press
