Abstract
Objective
To measure pediatrician adherence to evidence-based guidelines in the treatment of young children with attention-deficit/hyperactivity disorder (ADHD) in a diverse healthcare system using natural language processing (NLP) techniques.
Materials and Methods
We extracted structured and free-text data from electronic health records (EHRs) of all office visits (2015-2019) of children aged 4-6 years in a community-based primary healthcare network in California who had ≥1 visit with an ICD-10 diagnosis of ADHD. Two pediatricians annotated clinical notes of the first ADHD visit for 423 patients. Inter-annotator agreement (IAA) was assessed for recommendations of the first-line behavioral treatment (F-measure = 0.89). Four pre-trained language models, including BioClinical Bidirectional Encoder Representations from Transformers (BioClinicalBERT), were used to identify behavioral treatment recommendations using a 70/30 train/test split. For temporal validation, we deployed BioClinicalBERT on 1,020 unannotated notes from other ADHD visits and well-care visits; all positively classified notes (n = 53) and 5% of negatively classified notes (n = 50) were manually reviewed.
Results
Of 423 patients, 313 (74%) were male; 298 (70%) were privately insured; 138 (33%) were White; 61 (14%) were Hispanic. The BioClinicalBERT model trained on the first ADHD visits achieved F1 = 0.76, precision = 0.81, recall = 0.72, and AUC = 0.81 [0.72-0.89]. Temporal validation achieved F1 = 0.77, precision = 0.68, and recall = 0.88. Fairness analysis revealed low model performance in publicly insured patients (F1 = 0.53).
Conclusion
Deploying pre-trained language models on a variable set of clinical notes accurately captured pediatrician adherence to guidelines in the treatment of children with ADHD. Validating this approach in other patient populations is needed to achieve equitable measurement of quality of care at scale and improve clinical care for mental health conditions.
Keywords: natural language processing, electronic health record, attention-deficit/hyperactivity disorder, health services research, health care quality
Introduction
Attention-deficit/hyperactivity disorder (ADHD) is the most common child neurobehavioral disorder, estimated to affect 8-10% of US children.1,2 Most children with ADHD are diagnosed in the preschool or elementary-school years and are at high risk for academic failure.3–6 The primary care pediatrician (PCP) is most often the professional who manages the treatment of ADHD in young children.7,8 The American Academy of Pediatrics (AAP) published and updated evidence-based clinical practice guidelines for primary-care management of ADHD in 2001, 2011, and 2019.9–11 For 4-5-year-old preschoolers with ADHD or ADHD symptoms, guidelines emphasize parent training in behavior management (PTBM) as the first-line treatment, based on stronger evidence for PTBM than for medication in preschoolers with or at risk for ADHD.5 For 6-11-year-old school-aged children with ADHD, guidelines recommend PTBM combined with medication, based on evidence that combined therapy in this age group is superior to either therapy alone in improving child and family functioning.11–13 Although clear guidelines are in place, the literature has shown differences in PCP recommendations based on insurance coverage, as well as disparities in care across racial and ethnic groups.14–17
The National Institute for Children’s Health Quality has recognized that quality-of-care measures for child mental health disorders, including ADHD, are lacking.18,19 The current national quality-of-care measures used for the assessment of ADHD care include the Healthcare Effectiveness Data and Information Set (HEDIS) measures that capture the timing of follow-up care for children prescribed with ADHD medications.20 These claims-based measures are readily available and easily calculated, but they only address a narrow aspect of care in a subset of patients with ADHD. Furthermore, these measures have received a “poor” or “not-rated” strength of evidence grade.19 Recognizing the significant limitations of the HEDIS measures, many healthcare organizations supplement these crude measures with labor-intensive and expensive manual chart reviews of the electronic health record (EHR) of a sample of ADHD patients per clinician.21,22
The few EHR-based studies that objectively assessed ADHD care by PCPs, through manual review of medical records, found quality gaps reflecting low adherence to several components of published guidelines, including low rates of recommendations for PTBM, use of validated rating scales, and appropriate follow-up.14,23,24 The lack of scalable access to essential components of clinical care that are documented as free text in the EHR is the major barrier preventing the development of process-of-care quality indicators for primary-care management of ADHD that rely on evidence-based clinical practice guidelines. Machine-learning techniques of natural language processing (NLP) offer a unique opportunity to analyze, at large scale, free-text information in the EHR that captures the entire set of recommendations given to families.25,26
Therefore, in this study, we aimed to measure PCP adherence to evidence-based guidelines in the treatment of young children with ADHD by using and comparing several transformer-based language models, including BioClinicalBERT, a model pre-trained on clinical notes from EHRs, to analyze unstructured (free-text) data from the EHR of a large community-based pediatric primary-care network. We focused on young children aged 4-6 years who presented with ADHD or related symptoms, for whom PTBM could be especially beneficial. We assessed the rate of PCP recommendations/referrals and in-office counseling regarding PTBM in these children and assessed the biases and fairness of the tools developed.
Methods
We present the study in accordance with the MINimum Information for Medical AI Reporting (MINIMAR) framework.27 This study was approved by the Stanford University School of Medicine institutional review board.
Setting and population
Packard Children’s Health Alliance (PCHA) is a community-based pediatric healthcare network in the San Francisco Bay Area, affiliated with Stanford Medicine Children’s Health and Lucile Packard Children’s Hospital. PCHA has 24 pediatrics primary-care offices, grouped into 10 practices. We confirmed that visit diagnosis codes are clinician entered in all practices.
Study design, data sources, and cohort selection
This was a retrospective observational cohort study. We reviewed EHRs for a cohort of all pediatric patients seen by PCHA PCPs from October 1, 2015 (date coinciding with the adoption of ICD-10 codes) to December 31, 2019. We extracted de-identified structured and free-text data from all office encounters. We identified a cohort of children aged 4-6 years who had at least 2 visits during the examined period, and who had at least one visit in the study period with a disorder (ADHD) or symptom-level ICD-10 diagnosis code (eg, hyperactivity and distractibility), as done in our previous study and other EHR-based studies that used diagnostic codes for ADHD cohort selection (n = 423).28,29 We excluded patients with a diagnosis of autism in the study period because the recommended evidence-based treatment for these patients is applied behavioral analysis (ABA), which is different from PTBM. Figure 1 shows the study cohort selection flowchart. Table S1 includes ICD codes used for inclusion and exclusion criteria.
Figure 1.
Study cohort selection flowchart.
Structured EHR data variables
For the identified cohort of patients that had an ADHD diagnosis at 4-6 years, we defined an “ADHD-related visit” as a visit that occurred when the patient was 3-6 years of age with a visit diagnosis of ADHD (disorder or symptom-level, see Table S1). We defined a “well-care visit” as a visit with a code or descriptor representing a well-care visit in at least one of the following structured EHR fields: Visit diagnosis code, Current Procedural Terminology (CPT) code, and Visit type descriptor (see Table S2). We used patient structured data in the EHR to describe the following patient characteristics: patient age (at visit of interest), sex, race/ethnicity (non-Hispanic White/non-Hispanic Asian/non-Hispanic Black/Hispanic/non-Hispanic Other/Unknown), and medical insurance at the first ADHD-related visit (private/public/unknown).
Manual chart review and annotation of clinical notes
The primary outcome of interest was the rate of PCP recommendation of PTBM as part of the documented treatment recommendations in the clinical note. PTBM recommendations included a referral to therapists who provide PTBM and/or counseling the family in the office regarding behavioral management (eg, providing a handout). Recognizing that both referrals for PTBM (offered in the community or online) and counseling about PTBM were not captured as discrete structured EHR data (eg, electronic referrals), we focused our manual chart review on the “assessment and plan” section of the visit note, where clinicians detail the recommendations given to the family. We focused on PCP documentation in the first ADHD-related visit for each patient because clinical practice guidelines recommend PTBM as the first-line treatment in young children.
To create a “gold standard” for the mention of PTBM by PCPs, we developed annotation guidelines to be used by two pediatricians (Y.B. and S.T.), who performed independent chart review and annotation of the clinical note of the first ADHD-related visit for each child in the study cohort.30 IAA was assessed for mention of PTBM using the F-measure, which focuses on the agreement of positive cases.31 In step 1, the two annotators jointly annotated 5 randomly selected notes to facilitate discussion and finalize the annotation process and guidelines. Then, each annotator independently annotated 60 randomly selected notes, achieving IAA = 0.90. Annotation guidelines were further refined when reviewing disagreements. In step 2, the annotators independently annotated another 150 randomly selected notes, achieving IAA = 0.89. All disagreements were reviewed and discussed. In cases where agreement could not be reached, a third pediatrician (L.H.) adjudicated. In step 3, after confirming a sufficiently high IAA, the annotators divided the annotation of the remaining notes (n = 202).
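The pairwise F-measure used here for IAA can be reproduced from counts of agreement and disagreement on positive labels. A minimal sketch (the example label vectors are illustrative, not study data):

```python
def iaa_f_measure(ann_a, ann_b):
    """Pairwise F-measure between two annotators' binary labels.

    Treats annotator A as the reference; because F1 is symmetric in
    false positives and false negatives, the choice of reference does
    not change the result.
    """
    tp = sum(1 for a, b in zip(ann_a, ann_b) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(ann_a, ann_b) if a == 0 and b == 1)
    fn = sum(1 for a, b in zip(ann_a, ann_b) if a == 1 and b == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Unlike simple percent agreement, this statistic ignores notes both annotators labeled negative, so it is not inflated by the large majority class.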
Data preparation and model train/test split
Figure 2 illustrates the model training and deployment workflow. We initially focused on the 423 first ADHD-related visit notes because we considered these visits to have the highest likelihood of PTBM recommendations. For the same patient cohort, we identified an additional 1020 notes from (1) all subsequent (follow-up) ADHD-related visits and (2) all well-care visits between the ages of 3 and 6 years, which we defined as “non-first ADHD visits”. We selected these visits because we anticipated that clinicians discussed ADHD management or engaged in parent counseling, including recommendations for PTBM, in these visits. We set these 1020 notes aside for temporal validation. The annotated set of first ADHD-related visit notes was randomly split into train (n = 296) and test (n = 127) sets (70/30). The train set was used for model development and hyperparameter tuning, while the test set was set aside to evaluate model performance.
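The 70/30 split described above can be sketched as follows; the use of scikit-learn, the stratification by label, and the variable names are illustrative assumptions (the text specifies only a random split):

```python
from sklearn.model_selection import train_test_split

# Placeholder note texts and labels (1 = PTBM recommended); the
# positive count of 129 mirrors the cohort-level rate reported in
# the Results, purely for illustration.
notes = [f"note {i}" for i in range(423)]
labels = [1] * 129 + [0] * 294

train_notes, test_notes, train_y, test_y = train_test_split(
    notes, labels, test_size=0.30, stratify=labels, random_state=42
)
print(len(train_notes), len(test_notes))  # 296 127
```

With 423 notes, a 30% test fraction yields the 296/127 split reported above (scikit-learn rounds the test-set size up).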
Figure 2.
Model training and validation workflow.
Note pre-processing comprised data cleaning and section extraction. All notes were processed through a basic cleaning pipeline that stripped punctuation, extra whitespace, and digits. A second pipeline extracted the sections of interest from each note. The notes were organized in the Subjective/Objective/Assessment/Plan (SOAP) structure. To focus on treatment recommendations by clinicians, the chart review and the model input were limited to the Assessment and Plan sections, where clinicians are expected to document their impressions, clinical reasoning, and treatment recommendations.
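A minimal sketch of such a cleaning and section-extraction pipeline, assuming the SOAP section headers appear as plain words in the note text (the exact header spellings and regular expressions are assumptions, not the study's code):

```python
import re

def clean_text(text):
    """Basic cleaning: lowercase, then strip punctuation, digits,
    and extra whitespace, as described for the cleaning pipeline."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> space
    text = re.sub(r"\d+", " ", text)      # digits -> space
    return re.sub(r"\s+", " ", text).strip()

def extract_assessment_and_plan(note):
    """Return text from the 'Assessment' header onward in a
    SOAP-structured note; falls back to the full note if the
    header is absent."""
    match = re.search(r"(assessment.*)", note, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else note
```

In practice the extraction step would run before cleaning, since cleaning lowercases and strips the punctuation that helps locate section headers.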
NLP model development
We developed a binary classification pipeline based on several language models to classify notes as containing recommendations for PTBM or not. Four transformer models were used for this task: BERT uncased, RoBERTa, XLNet, and BioClinicalBERT. BERT uncased is a variant of the original BERT model trained only on lowercased text from general-purpose corpora. Robustly Optimized BERT Approach (RoBERTa) was trained on diverse general-purpose corpora with an optimized pretraining procedure. XLNet is a generalized autoregressive model that, unlike BERT, learns dependencies between words by maximizing the likelihood over all permutations of the factorization order of a sequence. BioClinicalBERT was initialized from BioBERT, which was trained on biomedical corpora, and was further pre-trained on MIMIC-III EHR notes.
For each transformer model, a classification layer was added that converts the model output into class probabilities via softmax. The classification layer took the extracted note text and two binary patient variables, diagnosis type (disorder/symptom-level) and age group (3-5/6 years), as input. We chose these two structured data variables because they differed across patients with and without a PTBM recommendation. Following the fairness analysis, insurance status was also explored as a structured data variable for model input. Hyperparameter tuning was done on the train set, and performance was evaluated on the test set using several metrics: F1-score, precision, recall, and area under the receiver operating characteristic curve (AUROC). We report 95% confidence intervals (CIs) for AUROC, which captures the discrimination power of the model and is not influenced by threshold selection. Model thresholds were selected on the precision–recall curve to maximize recall and minimize the false-negative rate on the train set. Because this work aims to assess clinician adherence to guidelines, we sought to avoid wrongly penalizing clinicians by minimizing the false-negative rate. The top-performing model on the test set was chosen for temporal validation. The code is available via GitHub at https://github.com/ybannett/NLP_ADHD.
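The classification head described above, which combines the transformer's pooled text representation with the two binary structured variables, can be illustrated with an untrained stand-in; the 768-dimensional embedding, random weights, and variable encodings below are placeholders, not the fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over a vector of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Pooled [CLS]-style embedding from the transformer (768-dim for
# BERT-base models); random values stand in for a real encoder output.
cls_embedding = rng.normal(size=768)

# Two binary structured variables appended to the text representation:
# diagnosis type (1 = disorder, 0 = symptom-level) and age group
# (1 = 6 years, 0 = 3-5 years). These encodings are assumptions.
structured = np.array([1.0, 0.0])
features = np.concatenate([cls_embedding, structured])  # 770-dim input

# Untrained linear classification head producing two class logits
W = rng.normal(scale=0.02, size=(2, features.size))
b = np.zeros(2)
probs = softmax(W @ features + b)  # [P(no PTBM), P(PTBM)], sums to 1
```

In the fitted pipeline, the head's weights are learned jointly with the transformer during fine-tuning rather than drawn at random.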
Temporal validation
Temporal validation was conducted by deploying the best model on unannotated patient notes from all “non-first ADHD visits” for the same patient cohort (n = 1020 notes). All notes classified as positive by the model (n = 53) and a random subset of 50 notes (5%) classified as negative were manually reviewed and annotated to assess model performance.
Fairness analysis
In our fairness analysis, we assessed classification parity to ensure model outcomes are roughly equal across patient subgroups.32 We focused our analysis on patient insurance type (private vs. public) because of the strong influence that insurance coverage has on the availability of community-based PTBM, and based on our previous finding of low rates of PTBM recommendations in publicly insured patients in this network.14 We also completed a fairness analysis based on patient age at first diagnosis (3-5 years vs. 6 years). We did not examine race/ethnicity data due to a large percentage of missing race/ethnicity data, small sample sizes for some minority groups, and co-linearity between insurance type and race/ethnicity.33
Error analysis and other post-hoc analysis
The misclassified notes in temporal validation and for publicly insured patients (in the fairness analysis) were analyzed to understand model errors and potential reasons for misclassification. To further investigate the fairness analysis results, we implemented post-hoc analyses, including the Kolmogorov–Smirnov test (KS test) to compare subgroup distributions, and assessed variation in several characteristics across patient subgroups (eg, note length). Finally, we attempted to mitigate model bias by adding a protected attribute (public insurance) as discrete data in the model input, an approach named “awareness-based fairness”.34
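As one illustration of such a post-hoc comparison, the two-sample KS test can be applied to the model's predicted probabilities for positive- versus negative-labeled notes within a subgroup; a larger statistic indicates better separation of the classes for that subgroup. The beta-distributed scores below are synthetic, and whether this is the exact pairing of distributions compared in the study is an assumption:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)

# Hypothetical predicted probabilities within one insurance subgroup,
# split by true label; sample sizes are illustrative.
scores_pos = rng.beta(5, 2, size=30)  # notes with a true PTBM mention
scores_neg = rng.beta(2, 5, size=70)  # notes without

stat, p_value = ks_2samp(scores_pos, scores_neg)
```

The KS statistic is the maximum vertical distance between the two empirical distribution functions, so it lies in [0, 1] regardless of sample size.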
Results
Cohort characteristics
This study included 423 patients aged 4-6 years with an ADHD diagnosis. Based on manual review, PTBM was recommended for 30.5% of patients (n = 129) at their first ADHD-related visit. Cohort characteristics by PTBM recommendation (yes/no) are displayed in Table 1. Of 423 patients, 230 (54.4%) were 6 years old at their first ADHD-related visit; 74% were male. The cohort primarily consisted of non-Hispanic White patients (n = 138, 32.6%) and patients with unknown race/ethnicity (n = 132, 31.2%). Most patients were privately insured (n = 298, 70.4%). Within race/ethnicity subgroups, 31.2% (43/138) of non-Hispanic White patients received behavioral treatment recommendations compared to 16.7% (4/24) of non-Hispanic Black patients. When examining insurance type, 29.5% (88/298) of patients with private insurance received behavioral treatment recommendations compared to 25.8% (32/124) of patients with public insurance.
Table 1.
Patient cohort characteristics (n = 423).
| Characteristics | Behavioral treatment not recommendeda (n = 303) | Behavioral treatment recommendeda (n = 120) | Standardized mean difference (SMD)d |
|---|---|---|---|
| Sex, N (%) | | | 0.1 |
| Male | 221 (72.9) | 92 (76.7) | |
| Female | 82 (27.1) | 28 (23.3) | |
| Age (years)b, N (%) | | | 0.3 |
| 3-5 | 126 (41.6) | 67 (55.8) | |
| 6 | 177 (58.4) | 53 (44.2) | |
| ADHD diagnosis typec | | | 0.3 |
| Symptom level | 139 (45.9) | 74 (61.7) | |
| Disorder | 164 (54.1) | 46 (38.3) | |
| Race/ethnicity, N (%) | | | 0.2 |
| Non-Hispanic White | 95 (31.4) | 43 (35.8) | |
| Non-Hispanic Asian | 24 (7.9) | 13 (10.8) | |
| Non-Hispanic Black | 20 (6.6) | 4 (3.3) | |
| Hispanic | 47 (15.5) | 14 (11.7) | |
| Non-Hispanic Other | 20 (6.6) | 11 (9.2) | |
| Unknown | 97 (32.0) | 35 (29.2) | |
| Insurance, N (%) | | | 0.2 |
| Private | 210 (69.3) | 88 (73.3) | |
| Public | 92 (30.4) | 32 (26.7) | |
| Unknown | 1 (0.3) | | |
a Based on manual chart review (recommended at first ADHD-related visit).
b Age at first ADHD-related visit (includes looking back before age 4); in the 3-5 age group, 4 patients had their first ADHD-related visit at age 3 and the remaining 189 at ages 4-5 years.
c Symptom level = all ADHD-related visits had symptom-level diagnostic codes; disorder = at least one ADHD-related visit had an ADHD disorder diagnostic code (see Table S1).
d Standardized mean difference values of 0.2, 0.5, and 0.8 correspond to small, moderate, and large differences, respectively.
Classification results on test set
Four models were compared to classify clinical notes as including a PTBM recommendation or not. Model performance was evaluated on the test set (n = 127) (Table 2). The highest performing model was BioClinicalBERT (F1 = 0.76) followed by XLNet (F1 = 0.72). BioClinicalBERT classified 91 samples as negative (not including a PTBM recommendation), and 36 samples (28%) as positive.
Table 2.
Model performance classifying clinical notes in the test set (n = 127).
| Model | Precision | Recall | F1 Score | AUROC [95% CI] |
|---|---|---|---|---|
| BioClinicalBERT | 0.81 | 0.72 | 0.76 | 0.81 [0.72-0.89] |
| BERT uncased | 0.81 | 0.50 | 0.62 | 0.73 [0.64-0.81] |
| RoBERTa | 0.63 | 0.56 | 0.59 | 0.71 [0.62-0.80] |
| XLNet | 0.77 | 0.67 | 0.72 | 0.79 [0.71-0.88] |
AUROC = Area under the receiver operating characteristic curve.
BioClinicalBERT was selected as the final model for temporal validation because it had reasonable performance across all metrics, including recall. The area under the precision–recall curve (AUPRC) for BioClinicalBERT was 0.85, and a prediction probability threshold of 0.001152 was selected after maximizing recall to minimize the false-negative rate (Figure 3).
Figure 3.
BioClinicalBERT precision–recall curve for threshold selection using the training set.
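Recall-oriented threshold selection on the training-set precision–recall curve can be sketched as follows: choose the largest threshold that still meets a recall target, which keeps false negatives below the tolerated rate while retaining as much precision as possible. The scores are synthetic and the recall target of 0.95 is an assumption for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(1)

# Synthetic training-set labels and predicted probabilities
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(y_true * 0.4 + rng.uniform(0, 0.6, size=200), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

target_recall = 0.95
# recall[:-1] aligns with thresholds and is non-increasing as the
# threshold rises; pick the largest threshold still meeting the target
ok = recall[:-1] >= target_recall
threshold = thresholds[ok].max() if ok.any() else thresholds.min()
```

The very low threshold reported above (0.001152) is consistent with this strategy: pushing recall toward its maximum drives the operating point far into the low-threshold region of the curve.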
Temporal validation results
Temporal validation was conducted with the best-performing model, BioClinicalBERT, on all clinical notes from “non-first ADHD visits” for the same patient cohort (n = 1020 notes). The model classified 967 samples as negative (not including a PTBM recommendation), and 53 samples as positive. All 53 positively classified notes and a random sample of 50 negatively classified notes were manually reviewed to determine model performance. BioClinicalBERT demonstrated a similar performance in temporal validation (Table 3), with an improved recall of 0.88, representing a low rate of false-negative classifications (12%) and revealing that pediatricians recommended behavioral treatment in only 5% of non-first-ADHD visits.
Table 3.
Best model (BioClinicalBERT) performance classifying clinical notes (full annotation of first ADHD-related visits and sample annotation of non-first ADHD-related visits).
| Included notes | n | Recall | Precision | F1 Score |
|---|---|---|---|---|
| First ADHD-related visits | 127 | 0.72 | 0.81 | 0.76 |
| Non-first ADHD-related visits | 1020 | 0.88 | 0.68 | 0.77 |
Fairness analysis
The model exhibited disparities in performance metrics between patients with different insurance types (Figure 4A). For patients with private insurance, the model achieved a precision of 0.87 and recall of 0.74 (F1 = 0.80). In contrast, for patients with public insurance, performance was significantly lower with precision = 0.67 and recall = 0.44 (F1 = 0.53). Similarly, the model exhibited disparities in performance between patients from different age groups (Figure 4B). For patients aged 3-5 years, the model achieved a precision of 0.94 and a recall of 0.70 (F1 = 0.80). For patients aged 6 years, performance was lower with precision = 0.67 and recall = 0.62 (F1 = 0.64).
Figure 4.
Fairness analysis: BioClinicalBERT performance on the test set (A) by insurance type and (B) by age group.
Error analysis
We performed an error analysis to investigate potential reasons for model misclassifications in temporal validation and for publicly insured patients (in the fairness analysis). We found similar types of errors in both analyses. Most misclassified notes had borderline prediction probabilities. False-negative classifications mostly missed mentions of in-office counseling on behavior management and mentions of names of therapists. A common false-positive classification was a recommendation for another type of behavioral treatment for a co-occurring condition (eg, cognitive behavioral therapy (CBT) for anxiety).
Table 4 illustrates misclassification examples from both analyses, where note 1 is a false negative and note 2 is a false positive. In both analyses, in note 1, the clinician documented counseling the family about behavior modification using phrases that were likely missed by the model. For the temporal validation note 2, a referral to developmental–behavioral pediatrics for consultation may have caused the false-positive classification. For the fairness analysis note 2, the false-positive prediction was likely caused by a recommendation for a different type of behavioral therapy (CBT).
Table 4.
Example notes with labels and predictions. Possible reasons for misclassification are bold, and true labels are also underlined.
| # | Example notes | Label | Prediction |
|---|---|---|---|
| Temporal validation | |||
| 1 | well child care year old male icd cm. encounter for routine child health examination without abnormal findings z. plan hyperactivity likely adhd since present at school and at home. discussed setting rules and creating a schedule try to limit dyes and added sugars vanderbilt forms follow up with iep evaluation healthcare maintenance anticipatory guidance handout for age given. | 1 | 0 |
| 2 | female with the following diagnoses addressed today. screening for iron deficiency anemia poct hemoglobin. behavior hyperactive external referral to development behavioral pediatrics plan hgb is normal at.. encourage variety of foods. mom has scheduled a conference with teacher but meanwhile i will initiate a referral for developmental behavior peds to address concerns of inattentiveness and hyperactivity. f u as needed. | 0 | 1 |
| Publicly insured patients (fairness analysis) | |||
| 1 | discussed safety issues. orders placed this encounter procedures vision test hearing screening test pure tone air only medications prescribed dextroamphetamine amphetamine adderall mg tabletd w iep mom will ask teacher f u vanderbilt plan to recheckfocus on home plan boundaries rtc next mo return in about year around. | 1 | 0 |
| 2 | given that mom is hesitant about medications we discussed behavioral interventions in detail cbt in addition to other interventions for support iep vs plan. discussed options for medications including stimulants. | 0 | 1 |
Other post-hoc analysis
The KS test, comparing subgroup distributions, produced a statistic of 0.40 (P = 1.91e-01) for publicly insured patients and 0.81 (P = 2.93e-13) for privately insured patients. It produced a statistic of 0.76 (P = 1.07e-08) for children 3-5 years of age and 0.62 (P = 3.10e-04) for children 6 years of age.
We completed four descriptive analyses of factors that could influence model results, including variation in (1) rate of clinician documentation of PTBM recommendation and (2) length of notes across privately and publicly insured patients, number of patients seen by more than one clinician, and distribution of privately and publicly insured patients across the train and test sets. Fifty-six clinicians (61%) recommended PTBM at least once in the first ADHD-related visits. The average note lengths in the test set for privately and publicly insured patient notes were 1001 and 1022 characters, respectively. Of 423 patients, 372 (88%) were seen by more than one clinician for ADHD-related visits. Distribution of patients was similar across the train/test sets—31% of patients (93/296) were publicly insured in the train set; 24% of patients (31/127) were publicly insured in the test set.
To mitigate model bias, we added insurance status as a structured data variable to BioClinicalBERT; however, the model results did not improve (precision = 0.65, recall = 0.72).
Discussion
In this study, we have shown that it is possible to effectively evaluate the adherence of pediatricians to recommended guidelines for treating young children with ADHD, a condition affecting 8-10% of the pediatric population in the United States. This single-site study serves as a proof-of-concept for applying an NLP algorithm to clinical notes to measure quality of care at a scale not feasible with manual chart review. This application is particularly significant for mental health disorders, where treatment recommendations are often embedded in the narrative sections of the EHR rather than in easily quantifiable data points. Our approach, which requires replication in other healthcare systems, carries broad implications for quality improvement in healthcare, particularly in the management of mental and behavioral health disorders. By leveraging this informatics approach, clinicians and healthcare organizations can receive near real-time feedback on treatment adherence, enabling immediate corrective actions and ultimately improving patient outcomes.
There are no studies, to our knowledge, that used pre-trained large language models (eg, BERT) in the evaluation of quality-of-care for ADHD patients. An advantage to using BERT for classification is that models have already been trained on large datasets, meaning a large amount of data is not required to train models de novo. Despite the limited sample size in our study, we were able to fine-tune BERT and demonstrate reasonable performance in extracting adherence to guideline recommendations. Previous studies have examined similar methods for information extraction for various use cases (eg, medication management, patient monitoring)35; however, the application of these methods to assess the quality-of-care and improve healthcare delivery is largely unexplored. In our work, we demonstrated the potential of using these methods for quality-of-care measurement.
To address growing concerns regarding AI bias and health equity outcomes,36,37 we incorporated a fairness analysis into our study. Prior literature has illustrated differences in pediatric care based on patient insurance type,14 and our fairness analysis also demonstrated lower model performance for patients with public insurance than those with private insurance. We also examined age as a protected attribute, and our fairness analysis showed that the model performed reasonably well across age groups, though performance was relatively low for 6-year-old children.
Based on post-hoc analyses, lower model performance on notes of patients with public insurance could be related to the smaller sample size, which limited model training. Alternatively, extrinsic bias may exist due to differences in language used for documentation in specific patient populations, as found in other research on bias detection tasks.38,39 The implications of this disparity are important. A poorer-performing algorithm for patients on public insurance raises concerns about health equity and could perpetuate existing healthcare inequalities, especially as these patients often come from marginalized communities. From a practical standpoint, health organizations relying on such algorithms might inadvertently misallocate resources or deliver ineffective interventions to these patient groups. Therefore, our findings underline the urgent need for further research and refinements to ensure equitable algorithmic performance across all patient demographics by implementing mitigation strategies, such as weighted sampling in train/test splits to assure sufficient representation of under-represented groups (eg, publicly insured patients).
The long-term goal of this work is to construct a performance dashboard for clinicians and quality officers that would allow clinicians to understand how they are performing relative to evidence-based guidelines and support quality officers in auditing charts for quality-of-care assessments. Quality assurance (QA) is a critical component of healthcare, as it helps ensure that patients receive high-quality care. Manual chart review is a time-intensive and painstaking process that can delay QA. Currently, at the examined healthcare network, standard QA processes are conducted twice per year on a sample of patient notes, which does not facilitate actionable feedback to clinicians. Furthermore, the time-consuming task of manual chart review leads to a limited assessment of care, focusing on narrow and accessible information in the chart (eg, obtaining vitals in children prescribed ADHD medications). Such sub-optimal quality-of-care measurement, which represents the current status quo, is often discounted by healthcare organizations and clinicians as lacking in accuracy and clinical meaning.40 By automating the assessment of clinician adherence to established guidelines, the process can become more robust (assessing all aspects of evidence-based care for the entire population of children with ADHD rather than a narrow slice of care in a small sample), more efficient (reducing cost and time), more accurate (reducing human error and variability in the review process), and more actionable (allowing frequent reviews, eg, monthly or quarterly, that yield timely feedback to clinicians and healthcare organizations). An accurate and meaningful assessment of quality-of-care can lead to improved patient outcomes.
Limitations
We chose to rely on ICD-10 diagnostic codes for cohort identification because they facilitate standardized implementation in any healthcare system and have been previously validated in children with ADHD.29 However, this approach introduces some level of misclassification error. Our study focused on the management of ADHD by PCPs within the examined study period, and we did not have information on the receipt of recommended therapies in the community nor on medical care provided to children outside of the examined network (eg, by child psychiatrists). Although the study was conducted in a large network of primary-care practices, which included a diverse population (including 14% Hispanic, similar to the US census), the sample size was limited, and the dataset was imbalanced with relatively few positive cases. To mitigate the imbalance, we leveraged BERT and evaluated model performance with metrics that are less sensitive to imbalanced classes such as precision and recall. In terms of generalizability, though the study was conducted in a single community-based pediatric network, the different geographic locations of the practices (n = 24) with varying available community services support the generalizability of the study. To further analyze generalizability, we plan to deploy the model on EHR data from other healthcare networks.
Conclusion
Given the high prevalence of ADHD in the United States and the far-reaching ramifications of ineffective treatment provided to large populations of children with ADHD, the implications of this study are significant. Traditional methods of quality measurement in healthcare—largely reliant on restrictive claims-based metrics or time-consuming manual chart reviews—fall short of providing a comprehensive view of patient care. Our innovative use of an NLP pipeline, built around a BERT model, addresses this gap by offering a more efficient and thorough way to assess the quality of ADHD treatment recommendations. This novel approach reduces data collection and reporting burden, enabling a comprehensive assessment of care quality for an entire population of patients, not just a select sample. The pipeline’s capacity for nuanced data analysis extends beyond limited, traditional quality metrics, offering clinicians and healthcare organizations actionable insights in near real-time. By facilitating a data-driven, standardized approach to ADHD care, our model can play a crucial role in addressing healthcare disparities and enhancing adherence to evidence-based treatment guidelines, ultimately improving patient outcomes.
Acknowledgments
We thank Packard Children’s Health Alliance and Stanford Research Information Technology for their support and assistance in data acquisition and extraction. We thank Dr That Nam Tran (Sony) Ton, MD, for assistance in the completion of the chart review and annotation.
Contributor Information
Malvika Pillai, Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, United States.
Jose Posada, Computer Science Department, University of the North, Barranquilla 080020, Colombia.
Rebecca M Gardner, Department of Epidemiology and Population Health, Stanford University School of Medicine, Stanford, CA 94305, United States.
Tina Hernandez-Boussard, Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, United States.
Yair Bannett, Division of Developmental-Behavioral Pediatrics, Department of Pediatrics, Stanford University School of Medicine, Stanford, CA 94304, United States.
Author contributions
Dr Pillai participated in the study design, carried out the data analyses and model training, drafted the manuscript, and reviewed and revised the manuscript.
Dr Posada participated in the study design, conceptualization, and planning of chart review and data analysis, and critically reviewed and revised the manuscript.
Ms Gardner participated in the study design, extensively reformatted the data for analysis, performed statistical data analysis, and critically reviewed and revised the manuscript.
Dr Hernandez-Boussard participated in the conceptualization of the study, interpretation of the data, and critically reviewed and revised the manuscript.
Dr Bannett conceptualized and designed the study, defined and coordinated data extraction, participated in chart reviews and annotation, participated in data analyses and drafting the manuscript, and reviewed and revised the manuscript.
Drs Pillai and Bannett had full access to all the data in the study and took responsibility for the integrity of the data and the accuracy of the data analysis. All authors approved the final manuscript as submitted and are responsible for all aspects of the work.
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
This work was supported by the Stanford Maternal and Child Health Research Institute and by the National Institute of Mental Health of the National Institutes of Health under Grant number K23MH128455 (Dr Bannett). Ms Gardner's effort was supported by the T32 Training in Advanced Data and Analytics for Behavioral and Social Sciences Research (TADA-BSSR) training grant from the NIH National Heart, Lung, and Blood Institute (NHLBI, Grant number 1T32HL151323).
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Funders did not have any part in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Conflict of interest
The authors have no competing interests to declare.
Data availability
The entire code for the NLP pipeline and transformer models trained, which can be used to reproduce our study in other settings, is available in the GitHub repository at https://github.com/ybannett/NLP_ADHD. The datasets generated and analyzed in the current study contain protected patient health information and are therefore not publicly available; the data will be shared on reasonable request to the corresponding author.
References
- 1. Sclar DA, Robison LM, Bowen KA, et al. Attention-deficit/hyperactivity disorder among children and adolescents in the United States: trend in diagnosis and use of pharmacotherapy by gender. Clin Pediatr (Phila). 2012;51(6):584-589. 10.1177/0009922812439621
- 2. Danielson ML, Bitsko RH, Ghandour RM, et al. Prevalence of parent-reported ADHD diagnosis and associated treatment among U.S. children and adolescents, 2016. J Clin Child Adolesc Psychol. 2018;47(2):199-212. 10.1080/15374416.2017.1417860
- 3. Visser SN, Zablotsky B, Holbrook JR, et al. Diagnostic experiences of children with attention-deficit/hyperactivity disorder. Natl Health Stat Rep. 2015;(81):1-7.
- 4. Loe IM, Feldman HM. Academic and educational outcomes of children with ADHD. Ambul Pediatr. 2007;7(suppl 1):82-90. 10.1016/j.ambp.2006.05.005
- 5. Charach A, Carson P, Fox S, et al. Interventions for preschool children at high risk for ADHD: a comparative effectiveness review. Pediatrics. 2013;131(5):e1584-e1604. 10.1542/peds.2012-0974
- 6. Perrin HT, Heller NA, Loe IM. School readiness in preschoolers with symptoms of attention-deficit/hyperactivity disorder. Pediatrics. 2019;144(2). 10.1542/peds.2019-0038
- 7. Visser SN, Bitsko RH, Danielson ML, et al. Treatment of attention deficit/hyperactivity disorder among children with special health care needs. J Pediatr. 2015;166(6):1423-1430.e1-2. 10.1016/j.jpeds.2015.02.018
- 8. Albert M, Rui P, Ashman JJ. Physician office visits for attention-deficit/hyperactivity disorder in children and adolescents aged 4-17 years: United States, 2012-2013. NCHS Data Brief. 2017;(269):1-8.
- 9. Perrin JM, Stein MT, Amler RW, et al. Clinical practice guideline: treatment of the school-aged child with attention-deficit/hyperactivity disorder. Pediatrics. 2001;108(4):1033-1044.
- 10. Wolraich M, Brown L, Brown RT, et al.; Steering Committee on Quality Improvement and Management. ADHD: clinical practice guideline for the diagnosis, evaluation, and treatment of attention-deficit/hyperactivity disorder in children and adolescents. Pediatrics. 2011;128(5):1007-1022. 10.1542/peds.2011-2654
- 11. Wolraich ML, Hagan JF Jr, Allan C, et al.; Subcommittee on Children and Adolescents with Attention-Deficit/Hyperactivity Disorder. Clinical practice guideline for the diagnosis, evaluation, and treatment of attention-deficit/hyperactivity disorder in children and adolescents. Pediatrics. 2019;144(4). 10.1542/peds.2019-2528
- 12. Pelham WE Jr, Fabiano GA. Evidence-based psychosocial treatments for attention-deficit/hyperactivity disorder. J Clin Child Adolesc Psychol. 2008;37(1):184-214. 10.1080/15374410701818681
- 13. Pelham WE Jr, Fabiano GA, Waxmonsky JG, et al. Treatment sequencing for childhood ADHD: a multiple-randomization study of adaptive medication and behavioral interventions. J Clin Child Adolesc Psychol. 2016;45(4):396-415. 10.1080/15374416.2015.1105138
- 14. Bannett Y, Gardner RM, Posada J, et al. Rate of pediatrician recommendations for behavioral treatment for preschoolers with attention-deficit/hyperactivity disorder diagnosis or related symptoms. JAMA Pediatr. 2021;176(1):92-94. 10.1001/jamapediatrics.2021.4093
- 15. Morgan PL, Hillemeier MM, Farkas G, et al. Racial/ethnic disparities in ADHD diagnosis by kindergarten entry. J Child Psychol Psychiatry. 2014;55(8):905-913. 10.1111/jcpp.12204
- 16. Kamimura-Nishimura KI, Epstein JN, Froehlich TE, et al. Factors associated with attention deficit hyperactivity disorder medication use in community care settings. J Pediatr. 2019;213:155-162.e1. 10.1016/j.jpeds.2019.06.025
- 17. Walls M, Allen CG, Cabral H, Kazis LE, et al. Receipt of medication and behavioral therapy among a national sample of school-age children diagnosed with attention-deficit/hyperactivity disorder. Acad Pediatr. 2018;18(3):256-265. 10.1016/j.acap.2017.10.003
- 18. Zima BT, Mangione-Smith R. Gaps in quality measures for child mental health care: an opportunity for a collaborative agenda. J Am Acad Child Adolesc Psychiatry. 2011;50(8):735-737. 10.1016/j.jaac.2011.05.006
- 19. Zima BT, Murphy JM, Scholle SH, et al. National quality measures for child mental health care: background, progress, and next steps. Pediatrics. 2013;131(suppl 1):S38-S49. 10.1542/peds.2012-1427e
- 20. National Committee for Quality Assurance. Follow-up care for children prescribed ADHD medication. Accessed October 24, 2019. http://www.ncqa.org.
- 21. Casalino LP, Gans D, Weber R, et al. US physician practices spend more than $15.4 billion annually to report quality measures. Health Aff (Millwood). 2016;35(3):401-406. 10.1377/hlthaff.2015.1258
- 22. Schuster MA, Onorato SE, Meltzer DO. Measuring the cost of quality measurement: a missing link in quality strategy. JAMA. 2017;318(13):1219-1220. 10.1001/jama.2017.11525
- 23. Epstein JN, Kelleher KJ, Baum R, et al. Variability in ADHD care in community-based pediatrics. Pediatrics. 2014;134(6):1136-1143. 10.1542/peds.2014-1500
- 24. Fiks AG, Mayne SL, Michel JJ, et al. Distance-learning, ADHD quality improvement in primary care: a cluster-randomized trial. J Dev Behav Pediatr. 2017;38(8):573-583. 10.1097/dbp.0000000000000490
- 25. Tamang SR, Hernandez-Boussard T, Ross EG, et al. Enhanced quality measurement event detection: an application to physician reporting. EGEMS (Wash DC). 2017;5(1):5. 10.13063/2327-9214.1270
- 26. Hernandez-Boussard T, Blayney DW, Brooks JD. Leveraging digital data to inform and improve quality cancer care. Cancer Epidemiol Biomarkers Prev. 2020;29(4):816-822. 10.1158/1055-9965.Epi-19-0873
- 27. Hernandez-Boussard T, Bozkurt S, Ioannidis JPA, et al. MINIMAR (MINimum Information for Medical AI Reporting): developing reporting standards for artificial intelligence in health care. J Am Med Inform Assoc. 2020;27(12):2011-2015. 10.1093/jamia/ocaa088
- 28. Bannett Y, Feldman HM, Gardner RM, et al. Attention-deficit/hyperactivity disorder in 2- to 5-year-olds: a primary care network experience. Acad Pediatr. 2020;21(2):280-287. 10.1016/j.acap.2020.04.009
- 29. Gruschow SM, Yerys BE, Power TJ, et al. Validation of the use of electronic health records for classification of ADHD status. J Atten Disord. 2016;23(13):1647-1655. 10.1177/1087054716672337
- 30. Soysal E, Wang J, Jiang M, et al. CLAMP—a toolkit for efficiently building customized clinical natural language processing pipelines. J Am Med Inform Assoc. 2017;25(3):331-336. 10.1093/jamia/ocx132
- 31. Hripcsak G, Rothschild AS. Agreement, the f-measure, and reliability in information retrieval. J Am Med Inform Assoc. 2005;12(3):296-298. 10.1197/jamia.M1733
- 32. Röösli E, Bozkurt S, Hernandez-Boussard T. Peeking into a black box, the fairness and generalizability of a MIMIC-III benchmarking model. Sci Data. 2022;9(1):24. 10.1038/s41597-021-01110-7
- 33. Bannett Y, Gardner RM, Huffman LC, et al. Continuity of care in primary care for young children with chronic conditions. Acad Pediatr. 2023;23(2):314-321.
- 34. Wang X, Zhang Y, Zhu R. A brief review on algorithmic fairness. MSE. 2022;1(1):7. 10.1007/s44176-022-00006-z
- 35. Jeddi Z, Bohr A. Chapter 9 - Remote patient monitoring using artificial intelligence. In: Bohr A, Memarzadeh K, eds. Artificial Intelligence in Healthcare. Academic Press; 2020:203-234.
- 36. Czarnowska P, Vyas Y, Shah K. Quantifying social biases in NLP: a generalization and empirical comparison of extrinsic fairness metrics. Trans Assoc Comput Linguist. 2021;9:1249-1267. 10.1162/tacl_a_00425
- 37. Röösli E, Rice B, Hernandez-Boussard T. Bias at warp speed: how AI may contribute to the disparities gap in the time of COVID-19. J Am Med Inform Assoc. 2020;28(1):190-192. 10.1093/jamia/ocaa210
- 38. Orgad H, Belinkov Y. Choose your lenses: flaws in gender bias evaluation. arXiv, arXiv:221011471, 2022, preprint: not peer reviewed.
- 39. Dev S, Sheng E, Zhao J, et al. On measurements of bias and fairness in NLP. 2022. Accessed September 15, 2023. https://research.google/pubs/on-measurements-of-bias-and-fairness-in-nlp/
- 40. Romano PS, Rainwater JA, Antonius D. Grading the graders: how hospitals in California and New York perceive and interpret their report cards. Med Care. 1999;37(3):295-305. 10.1097/00005650-199903000-00009