Abstract
Observational COVID-19 studies often rely on diagnostic codes, but their accuracy and potential for differential misclassification across patient subgroups are unclear. In this proof-of-concept study, we examined age, race, and ethnicity as predictors of differential misclassification by comparing the classification accuracy of diagnostic codes to classifiers based on natural language processing (NLP) of clinical notes. We assessed differential misclassification in two primary care-based samples from the American Family Cohort: first, a cohort of 5000 patients with COVID-19 status assessed by physicians based on notes; and second, 21,659 patients (out of 1,560,564) who received COVID-specific antivirals. Using annotated note data, we trained and tested three NLP classifiers (tree-based, recurrent neural network, and transformer-based). Approximately 63% of likely COVID-19 patients in the two samples had a documented ICD-10 code for COVID-19. Sensitivity was highest among younger patients (68.6% for those under 18 years versus 60.6% for those 75 and older) and among Hispanic patients (68.0% versus 59.5% for Black/African American patients). The tree-based classifier had the highest area under the ROC curve (0.92), although it was less accurate among older patients. NLP performance worsened drastically when predicting on data collected after the training period. While NLP may improve cohort identification, frequent retraining is likely needed to capture changing documentation.
Introduction
Research into COVID-19 has been facilitated by observational studies using administrative and clinical databases.1,2 While these data sources have often provided a comprehensive view into patient health before and after episodes of COVID-19, time-varying trends such as the scarcity of diagnostics early in the pandemic and a shift to self-testing and at-home disease management have created challenges around the use of diagnostic codes to identify patients with COVID-19.3-5 Selection bias is particularly concerning when cohort identification methods vary in reliability across groups,6 leading to non-representative sampling and jeopardizing generalizability.
Manual chart review is often seen as the gold standard for studies seeking to validate cohort identification strategies. However, advances in natural language processing (NLP) may allow more efficient and accurate parsing of clinical text.7,8 At the same time, concern around machine learning’s tendency towards amplifying existing biases encourages a cautious approach towards its application.9
In this study, we explored the utility of NLP classifiers for the identification of COVID-19 cohorts in clinical notes and compared them to the use of diagnostic codes within a large database of primary care electronic health records (EHRs) from practices across the United States. Our focus was on evaluating the rate of potential differential outcome misclassification in the cohorts identified by either NLP or diagnostic codes. This research did not aim to produce a reusable tool that can be employed across many settings; that would not be possible given the restricted nature of our data and our uncertainty about applicability elsewhere. Instead, we explored the potential use of NLP classifiers trained on a single dataset to improve the representative capture of patient cohorts with COVID-19, with the hope that other researchers could use similar methods with their own data.
Specifically, we used NLP to identify cohorts of patients with COVID-19 from two different populations within a large dataset of primary care encounters. Within a cohort of 5000 patients with COVID-19 symptoms, we trained three NLP classifiers based on expert-annotated clinical notes. Then, in a cohort of patients who had received COVID-19 antivirals, we assessed the performance of our classification model. Within this group that received antivirals and was presumably infected with SARS-CoV-2, we next assessed the presence of differential outcome misclassification by comparing the sensitivity of NLP versus diagnostic codes for subgroups defined by race, ethnicity, and age. We also assessed NLP performance on a cohort of patients who received a COVID-19 antiviral after the end of training data collection. Finally, we examined the accuracy of NLP classification when trained with different dataset sizes to provide guidance to future researchers on the trade-off between performance and burden on annotators.
Methods
Data Source
We used the American Family Cohort (AFC) as the data source for all portions of this study. The AFC is a collection of EHRs from primary care clinics across the United States.10 It includes data prospectively collected starting in 2017 and is currently the largest clinical registry for primary care in the US. The patients within the AFC are of highly diverse ages, races, and ethnicities, and reside in all 50 U.S. states. Race and ethnicity data were imputed for individuals with missing data using extensively validated random forest methods (Supplemental Material, Appendix S1).11
Training Data for NLP Classifiers
To generate the sample of clinical notes for training our NLP classifiers, we began by identifying, based on the recorded clinical encounter reasons, all patient visits from April 1, 2020 (when the ICD-10 code for COVID-19 became available) to January 9, 2023. The AFC contained 12,752,546 real-time visits (either in-person or via telehealth) conducted by 5228 providers at 699 practices with 2,317,670 unique patients in this period. However, not all practices provided clinical notes to AFC in addition to structured data on diagnoses, prescriptions, and so on. There were 10,085,239 visits during this period with notes. These visits were conducted by 3439 providers at 681 practices with 1,987,094 unique patients.
We next excluded notes with fewer than 100 characters. After this exclusion, we were left with 8,733,202 notes (from 1,560,564 unique patients). This allowed us to discard most superfluous data (e.g., records of no-shows) in the clinical notes section. From this dataset, we selected only notes linked to at least one visit carrying a SNOMED, ICD-9, or ICD-10 diagnostic code present in the "COVID-19 Potential Signs and Symptoms" National Library of Medicine value set (object identifier: 2.16.840.1.113762.1.4.1223.22).
Lastly, we randomly selected 5000 clinical notes (Supplemental Material, Figure S1). Our choice of sample size was informed by a similarly structured study that demonstrated commendable model performance.12 We used Python's regular expression-based functions to clean the documents of formatting marks that would hinder readability for annotators and potentially guide models towards identifying high-prevalence settings by note formatting rather than meaningful text.13
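A minimal sketch of this cleaning step using Python's `re` module follows; the specific patterns shown are illustrative assumptions, not the exact expressions used in the study.

```python
import re

def clean_note(text: str) -> str:
    """Remove formatting artifacts from a clinical note.

    The exact patterns removed in the study are not published; these are
    illustrative examples of common EHR formatting marks.
    """
    text = re.sub(r"<[^>]+>", " ", text)      # stray HTML/XML tags
    text = re.sub(r"[_*=~-]{3,}", " ", text)  # divider runs like "-----"
    text = re.sub(r"&[a-z]+;", " ", text)     # HTML entities, e.g. &nbsp;
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()

print(clean_note("CC: cough &nbsp; ----- <b>fever</b>   x3 days"))
# → "CC: cough fever x3 days"
```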
Dataset Annotation
Two family physicians (MT and GW) each manually reviewed 2750 of the selected notes to identify patients with COVID-19. The physician annotators did not have access to diagnostic codes or other structured data from the patients’ charts, since our NLP classifier would also only have access to notes. A 10% subset of the 5000 notes was annotated by both physicians to allow us to calculate interrater reliability. A third reviewer (NH) adjudicated disagreements in classification within the doubly-annotated subset. In the annotated sample, 827 of the 5000 notes were deemed to be related to COVID-19.
Development of NLP classifiers
We were interested in evaluating the importance of model complexity and domain-specificity. As a research-ready NLP classifier may need to operate on very large datasets, we also wanted to identify the most parsimonious classifier that offered adequate accuracy. To these ends, we trained three models of increasing complexity: first, a sparse, tree-based classifier using regularized term frequency; next, a recurrent neural network (RNN); and finally, a transformer model fine-tuned on clinical text.
Our simplest model, a tree-based classifier, used eXtreme Gradient Boosting (XGBoost) on term frequency-inverse document frequency (TF-IDF) features of single words (unigrams) from the training corpus.14 The tabular data generated via regularized bag-of-words text models is well suited to tree-based methods such as XGBoost, which operates by creating myriad decision trees and outputting an ensemble prediction.15 We further experimented with the XGBoost model by testing its performance when stopwords (i.e., articles, conjunctions, and other common words), numbers, or both stopwords and numbers were removed, then selecting the model with the best performance according to maximum F1 score for further assessments. The F1 score is a commonly used metric in machine learning that rewards balance between positive predictive value and sensitivity. It ranges from 0, which would indicate failure to identify any positive cases, to 1, which would indicate perfect classification accuracy.
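The pipeline described above can be sketched as follows. To keep the example self-contained, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, and the six toy notes and labels are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Toy corpus standing in for annotated clinical notes (1 = COVID-19).
notes = [
    "patient reports fever cough and loss of taste, covid test positive",
    "covid exposure last week, now fever and fatigue",
    "annual wellness visit, no acute complaints",
    "knee pain after fall, no fever or cough",
    "positive covid antigen test, mild cough",
    "routine diabetes follow up, stable",
]
labels = [1, 1, 0, 0, 1, 0]

# Unigram TF-IDF features, as in the study's simplest model.
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(notes)

# GradientBoostingClassifier stands in for XGBoost here.
clf = GradientBoostingClassifier(random_state=0).fit(X.toarray(), labels)

# F1 = 2 * PPV * sensitivity / (PPV + sensitivity), evaluated in-sample.
preds = clf.predict(X.toarray())
f1 = f1_score(labels, preds)
print(f"training-set F1: {f1:.2f}")
```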
Next, we employed an RNN as our simplest deep learning model. Unlike the tree-based model, RNNs are able to integrate information about the sequence of words into their predictions. Specifically, we used a long short-term memory (LSTM) RNN, which allows for more distant textual dependencies than standard RNN models.16
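A minimal LSTM note classifier of this kind can be sketched in PyTorch; the vocabulary, embedding, and hidden sizes below are illustrative assumptions rather than the study's configuration:

```python
import torch
import torch.nn as nn

class NoteLSTM(nn.Module):
    """Embed token ids, run an LSTM over the sequence, and classify
    from the final hidden state. All sizes are illustrative."""
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):                 # (batch, seq_len)
        emb = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)              # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h_n[-1]))  # (batch, 1) probabilities

model = NoteLSTM()
batch = torch.randint(0, 1000, (4, 50))  # 4 notes of 50 token ids each
probs = model(batch)
print(probs.shape)  # torch.Size([4, 1])
```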
Our final classifier was a transformer model fine-tuned on a corpus of deidentified clinical notes. Transformer models are generally seen as state-of-the-art in NLP. Clinical text tends to use words in ways that do not always match their everyday usage. As such, we deemed it important to use a model that accounted for these specialized meanings. We decided to use the BioClinicalBERT model developed by Alsentzer et al., which we accessed through HuggingFace Transformers.17,18
We trained each of the three models on an identical 80% sample of the annotated clinical notes and assigned the remaining 20% to the test dataset. To avoid data leakage, we ensured that no individual patient appeared in both the training and test datasets. We also stratified the training and test datasets to contain the same proportion of positively labeled cases. We optimized hyperparameters in the RNN and transformer models over 50 trials using evaluation loss as the cost function.19
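The patient-level split can be sketched with scikit-learn's `GroupShuffleSplit`, which keeps all notes from a patient on one side of the split (the study's exact tooling is not specified; `StratifiedGroupKFold` is an alternative when label stratification is also needed):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_notes = 200
patient_ids = rng.integers(0, 80, size=n_notes)  # some patients have several notes
labels = rng.integers(0, 2, size=n_notes)
X = np.arange(n_notes)                           # note indices stand in for text

# One 80/20 split where every note from a patient lands on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, labels, groups=patient_ids))

# No patient leaks across the train/test boundary.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
print(len(train_idx), len(test_idx))
```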
To evaluate the accessibility of NLP-based strategies for cohort identification, we tested different training set sizes for our most performant model. Our combined training and test dataset comprised 5000 labeled clinical notes. Our goal was to evaluate whether a smaller training dataset can produce similar performance in the test dataset. We compared training dataset samples of 500, 1000, 2000, and 3000 labeled notes to our primary analysis that used a training set size of 4000. We compared the mean areas under the ROC curves (AUCs) and their 95% credible intervals, as estimated with 100 bootstrap samples of complete training and evaluation runs, assessed on the 1000 test notes.
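The interval machinery can be illustrated as follows. For brevity, this sketch bootstraps the AUC of fixed scores rather than re-running complete training and evaluation per replicate as in the study; the labels and scores are simulated:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, y_score, n_boot=100, seed=0):
    """Percentile bootstrap interval for the AUC of fixed scores."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample drew a single class; AUC undefined
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.mean(aucs), np.percentile(aucs, 2.5), np.percentile(aucs, 97.5)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
scores = y * 0.5 + rng.random(500) * 0.8  # informative but noisy scores
mean_auc, lo, hi = bootstrap_auc(y, scores)
print(f"AUC {mean_auc:.3f} (95% interval {lo:.3f} to {hi:.3f})")
```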
COVID-19 Positive Cohort
We identified a cohort that was likely to have COVID-19 for the purpose of comparing NLP classifiers to diagnostic codes. This cohort included all patients in the AFC who had received at least one of the following COVID-19 antiviral therapies between April 1, 2020 and January 9, 2023: bamlanivimab, bebtelovimab, casirivimab-imdevimab, nirmatrelvir-ritonavir, molnupiravir, remdesivir, and sotrovimab. We identified prescriptions of these drugs through both string searches and RXNORM codes. Because of these drugs’ potential adverse effects and the lack of authorization for pre-exposure prophylaxis, we deemed it likely that patients in this cohort were infected with SARS-CoV-2.
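The string-search half of the prescription identification can be sketched with pandas; the rows below are hypothetical, and a production version would also match on RXNORM codes:

```python
import pandas as pd

# Hypothetical prescription rows; real data would also carry RXNORM codes.
rx = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "drug_name": ["Paxlovid (nirmatrelvir-ritonavir)", "molnupiravir 200mg",
                  "lisinopril 10mg", "REMDESIVIR inj"],
})

# Case-insensitive string search over the antivirals named in the study.
patterns = ("nirmatrelvir|ritonavir|molnupiravir|remdesivir|sotrovimab|"
            "bamlanivimab|bebtelovimab|casirivimab|imdevimab")
hits = rx[rx["drug_name"].str.contains(patterns, case=False, regex=True)]
print(sorted(hits["patient_id"]))  # [1, 2, 4]
```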
The data contained 35,317 unique prescriptions among 34,936 patients for antivirals specific to COVID-19. The most common of these antivirals was nirmatrelvir-ritonavir, which accounted for 83.6% of the prescriptions. Molnupiravir was the second most common and accounted for 11.5% of prescriptions. Of these prescriptions, 27,634 (78.2%) were from practices that sent clinical notes to AFC. We further excluded 5975 (21.6%) notes that had a total length of fewer than 100 characters. This left us with an analytic cohort of 21,659 notes from encounters with likely COVID-positive patients.
Comparative Evaluation
Since diagnostic codes are regularly used for cohort identification, they served as the comparator to the NLP classifiers. The COVID-19 diagnostic codes we used to identify cohorts were U07.1 (ICD-10) and 840539006 (SNOMED). In the COVID-19 positive cohort, these diagnostic codes were required to have been recorded within seven days of the visit at which the patient was prescribed a COVID-19-specific antiviral.
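The seven-day window can be sketched in pandas with hypothetical visit dates:

```python
import pandas as pd

# Hypothetical visit-level data: antiviral prescription dates and any
# COVID-19 diagnosis code dates for the same patient.
rx = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "rx_date": pd.to_datetime(["2022-01-10", "2022-03-05", "2022-06-01"]),
})
dx = pd.DataFrame({
    "patient_id": [1, 2],
    "dx_date": pd.to_datetime(["2022-01-08", "2022-04-20"]),
})

# A code "counts" only if recorded within seven days of the antiviral visit.
merged = rx.merge(dx, on="patient_id", how="left")
merged["code_within_7d"] = (
    (merged["dx_date"] - merged["rx_date"]).abs() <= pd.Timedelta(days=7)
)
flags = merged.groupby("patient_id")["code_within_7d"].any().to_dict()
print(flags)  # {1: True, 2: False, 3: False}
```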
We performed both overall and group-level comparisons between the diagnostic code and the top-performing model, with subgroups defined by age, race, and ethnicity. Our primary measure of interest was sensitivity within each group, since specificity could not be evaluated in this cohort. While the presence of a diagnostic code is a binary measure, the models output a continuous prediction. To convert this continuous prediction into a binary label, we defined a threshold as the point on the ROC curve with the maximum F1 score.20 Thus, we calculated sensitivity as the percent of individuals in the COVID-19 positive cohort with a model prediction over the threshold, and we estimated 95% credible intervals around this measure using 1000-sample bootstraps.
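Threshold selection and the sensitivity calculation can be sketched as follows, using simulated labels and scores; `precision_recall_curve` provides the per-threshold values needed to locate the maximum F1:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def max_f1_threshold(y_true, y_score):
    """Pick the score cutoff that maximizes F1 on labeled data."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return thresholds[np.argmax(f1[:-1])]  # last P/R point has no threshold

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
scores = np.clip(y * 0.4 + rng.random(300) * 0.7, 0, 1)  # simulated model output

thr = max_f1_threshold(y, scores)

# In the antiviral cohort (all presumed positive), sensitivity is simply
# the share of patients whose prediction clears the threshold.
cohort_scores = scores[y == 1]
sensitivity = float(np.mean(cohort_scores >= thr))
print(f"threshold {thr:.2f}, sensitivity {sensitivity:.1%}")
```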
Diagnosis and management of COVID-19 has changed substantially since the beginning of the pandemic. We were therefore interested in assessing the generalizability of our classifier to cases treated with a COVID-19 antiviral after the training data was collected in the first 32 months of the pandemic. As such, we repeated the above evaluation procedure on 25,019 visits where a COVID-19 antiviral was prescribed and a note was documented. To understand potential differences in performance, we compared the frequency of highly relevant terms identified by the tree-based classifier between antiviral-associated notes collected at the same time as the training data and those collected after.
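The study does not state the exact test used for this term-frequency comparison; a chi-square test on a 2×2 contingency table of term counts is one reasonable choice, sketched here with hypothetical counts for a single term:

```python
from scipy.stats import chi2_contingency

def per_million(count, total_words):
    """Express a raw term count as frequency per million words."""
    return 1e6 * count / total_words

# Hypothetical counts for one term in the two corpora.
train_count, train_words = 480, 4_000_000   # training-period notes
post_count, post_words = 1450, 2_500_000    # post-training notes

table = [[train_count, train_words - train_count],
         [post_count, post_words - post_count]]
chi2, p, _, _ = chi2_contingency(table)
print(f"{per_million(train_count, train_words):.0f} vs "
      f"{per_million(post_count, post_words):.0f} per million, p = {p:.2e}")
```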
Software
Descriptive analyses were conducted in R, version 4.2, while modeling was conducted in Python 3.7. We used the scikit-learn and xgboost packages for the TF-IDF analyses; Optuna for hyperparameter optimization; PyTorch for the RNN; and HuggingFace Transformers for the transformer.
Research Ethics
This research was approved by the Stanford University Institutional Review Board.
Supplementary Data
Supplementary data are available at the end of this article.
Data Availability
Please see https://americanfamilycohort.org for information on data access.
Results
Natural Language Classifier Performance
Within the 500 notes that were annotated by both family physicians, Cohen’s Kappa for agreement between the two annotators was 0.7, indicating moderate agreement beyond chance. Disagreements largely stemmed from different willingness to diagnose COVID-19 from symptoms alone as well as differences in the understanding of what those qualifying symptoms might be. In the 20% held-out test set of the annotated dataset, the performance of the tree-based (i.e., XGBoost) and transformer (i.e., BioClinicalBERT) models was nearly identical (Supplemental Material, Table S1; Figure 1). The RNN classifier was significantly less accurate. The difference in accuracy between these models was largely due to differences in sensitivity: the two top-performing models had sensitivities of 72.0% (95% credible interval [CI]: 64.7 – 78.3%) and 79.2% (95% CI: 73.4 – 84.1%) for XGBoost and transformer models, respectively, while the RNN model’s sensitivity was just 56.0% (95% CI: 48.7 – 63.2%). Specificity values were more similar and ranged from 89.2 to 92.9%. By comparison, the use of diagnostic codes had a sensitivity of 62.9% and a specificity of 69.2%.
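Cohen’s Kappa compares observed agreement to the agreement expected by chance given each annotator’s label frequencies. A sketch with illustrative labels:

```python
from sklearn.metrics import cohen_kappa_score

# Toy labels from two annotators on a doubly-annotated subset
# (1 = note indicates COVID-19). Values are illustrative only.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(kappa)  # 0.5: 75% raw agreement, 50% expected by chance
```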
Figure 1:

Receiver operating characteristic curve for the three NLP classifiers. AUC = Area Under Curve, RNN = Recurrent Neural Network.
The removal of stopwords and/or numbers from the training data did not meaningfully alter XGBoost performance (Supplemental Material, Figure S2). The most important words in the XGBoost algorithm were “COVID”, “cough”, and “fever” (Supplemental Material, Figure S3). Note that a simple search for the word “COVID” produced high specificity in the test set (0.94) but low sensitivity (0.36). Thus, the implementation of more sophisticated methods was necessary for the development of a usable classifier. Because the XGBoost model offered equivalent accuracy to the transformer model and was much faster, we used it for all subsequent comparisons.
Comparative Performance in the COVID-19 Antiviral Cohort
Sensitivity was higher for the XGBoost-based NLP classifier than identification based on diagnostic code (Table 1). Sensitivity for the diagnostic code was 63.6% (95% CI: 62.9 - 64.3%) overall and ranged between 59.5% (95% CI: 56.5 - 62.4%) for Black/African American patients and 68.0% (95% CI: 66.3 - 69.7%) for Hispanic/Latino patients. The use of NLP significantly increased sensitivity across most patient groups. Both NLP and diagnostic codes, though, showed lower overall sensitivity among patients aged 65 and older.
Table 1:
Sensitivity (with 95% credible interval) of the diagnostic code (Dx) classifier versus XGBoost-based natural language classifier (NLP). Dashed cells represent subgroups containing fewer than 10 individuals.
| Age (years) | All (n = 21,659) Dx | All NLP | AIAN (n = 59) Dx | AIAN NLP | Asian (n = 533) Dx | Asian NLP | B/AA (n = 1175) Dx | B/AA NLP | H/L (n = 2725) Dx | H/L NLP | White (n = 16,985) Dx | White NLP | Multi / Other (n = 182) Dx | Multi / Other NLP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All (n = 21,659) | 63.6% (62.9 - 64.3%) | 76.9% (76.3 - 77.5%) | 66.1% (54.2 - 78.0%) | 74.6% (64.4 - 84.8%) | 58.2% (54.0 - 62.5%) | 73.7% (70.0 - 77.3%) | 59.5% (56.5 - 62.4%) | 76.5% (74.0 - 78.8%) | 68.0% (66.3 - 69.7%) | 77.0% (75.3 - 78.6%) | 63.4% (62.6 - 64.1%) | 77.0% (76.4 - 77.6%) | 62.6% (55.5 - 69.8%) | 77.0% (71.4 - 83.5%) |
| < 18 (n = 102) | 68.6% (58.8 - 77.5%) | 84.3% (77.5 - 91.2%) | – | – | – | – | – | – | 73.5% (55.9 - 88.2%) | 88.0% (76.5 - 97.1%) | 67.9% (54.7 - 79.2%) | 84.9% (75.5 - 94.3%) | – | – |
| 18 to 64 (n = 11,154) | 66.0% (65.1 - 66.8%) | 78.8% (78.1 - 79.6%) | 64.0% (44.0 - 84.0%) | 64.0% (44.0 - 84.0%) | 59.3% (53.6 - 64.7%) | 78.9% (74.4 - 83.3%) | 63.8% (60.0 - 67.3%) | 77.6% (74.2 - 80.6%) | 69.4% (67.2 - 71.6%) | 78.8% (76.9 - 80.8%) | 65.8% (64.8 - 66.8%) | 79.0% (78.1 - 79.8%) | 59.0% (49.0 - 69.0%) | 74.0% (65.0 - 82.0%) |
| 65 to 74 (n = 5814) | 61.3% (60.0 - 62.5%) | 75.9% (74.8 - 77.0%) | 76.0% (60.0 - 92.0%) | 84.0% (68.0 - 96.0%) | 53.7% (45.6 - 61.8%) | 65.4% (56.6 - 73.5%) | 56.8% (51.6 - 62.1%) | 77.0% (72.7 - 81.7%) | 65.0% (61.1 - 68.8%) | 72.8% (69.1 - 76.4%) | 61.2% (59.8 - 62.6%) | 76.3% (75.1 - 77.5%) | 64.4% (51.1 - 80.0%) | 84.4% (71.1 - 93.3%) |
| 75 and above (n = 4400) | 60.6% (59.1 - 62.0%) | 73.5% (72.2 - 74.8%) | – | – | 63.4% (52.1 - 74.6%) | 69.0% (57.7 - 78.9%) | 50.0% (42.8 - 56.7%) | 71.1% (64.4 - 77.3%) | 65.6% (60.4 - 70.9%) | 73.3% (68.1 - 77.9%) | 60.6% (59.1 - 62.3%) | 73.7% (72.4 - 75.1%) | 67.7% (48.4 - 83.9%) | 74.2% (58.1 - 90.3%) |
AIAN = American Indian or Alaska Native, B/AA = Black or African American, H/L = Hispanic or Latino. Note that the “Native Hawaiian or Other Pacific Islander” group was included in the “Multi / Other” category due to low numbers in our sample.
Among this COVID-19 positive cohort, 80.8% of patients identified through diagnostic codes were also identified through the NLP-based classifier. On the other hand, 66.9% of patients identified as having COVID-19 by the NLP-based classifier also had a diagnostic code present. Neither diagnostic codes nor NLP identified 10.9% of the patients who received COVID-19 antiviral therapy as having COVID-19.
Training Evaluation
The XGBoost classifier lost relatively little accuracy with smaller training dataset sizes (Figure 2). For the reduced training dataset sizes, the mean AUCs and their 95% credible intervals derived from the bootstrap analysis were as follows: 500 samples, 0.888 (0.886 to 0.890); 1000 samples, 0.900 (0.898 to 0.901); 2000 samples, 0.908 (0.907 to 0.910); and 3000 samples, 0.912 (0.911 to 0.913). Even the 3000-sample training runs produced a mean AUC below the main analysis’s AUC of 0.922 with 4000 samples. Adding training data may therefore further improve classifier accuracy, though our findings suggest that its marginal benefit decreases relatively quickly.
Figure 2:

ROC curves of the XGBoost classifier based on training datasets of different sizes. Training data of 3000 samples or fewer uses 100 bootstrap samples to determine the mean ROC curve (confidence intervals omitted for clarity, but available in text).
Generalizability Assessment
The sensitivity of both identification methods fell when applied to the cohort prescribed a COVID-19 antiviral from January 10, 2023 to December 31, 2023 (Supplemental Material, Table S2). The sensitivity of diagnostic codes was 44.3% (43.7 – 45.0%), while the sensitivity of the XGBoost classifier was just 27.4% (26.9 – 28.0%). Of 18 highly relevant terms, differences in frequency per million words between the training corpus and the post-training corpus were highly significant (p < 0.0001) for all but a single term (Supplemental Material, Table S3 and Figure S4).
Discussion
This proof of concept study of cohort identification methods for patients with COVID-19 in a nationwide dataset of primary care EHRs suggests that reliance upon diagnostic codes alone for cohort identification may introduce the potential for bias and unrepresentative cohort selection, as assessed in cohorts of patients selected either through manual chart review or through their receipt of an antiviral prescription. More than one-third of patients in this study who received a COVID-19 antiviral did not have an associated diagnostic code for COVID-19, with documentation rates particularly low for Asian patients, Black/African American patients, and patients aged 65 and older – populations among those most likely to experience worse outcomes with COVID-19.21,22 This makes their representational inclusion in observational studies especially important for the accurate ascertainment of COVID-19’s impacts.
Our work points to the challenge of reliable cohort identification, particularly for short-term exposures such as COVID-19 for which there is no gold standard in EHRs: diagnostic codes, lab results, antiviral prescriptions, and manual chart review all suffer from incomplete capture. While many researchers may rely on diagnostic codes, we found that differential absence of documentation across ages, races, and ethnicities may lead to non-representative samples. In our study, we used two reference groups: COVID-19 patients identified by the physician annotators and patients who received antiviral prescriptions. The sensitivity of diagnostic codes was similar in both cases (62.9% with physician annotation, 63.6% with antiviral prescriptions), just as the sensitivity of the XGBoost model was similar across reference groups (72.0% and 76.9%, respectively). This suggests broad generalizability of these methods, though it may be because the flaws of these reference groups place a ceiling on classifier performance. For instance, antiviral prescriptions were likely very sensitive while selecting for people with risk factors for severe illness. Other potential reference groups, identified for example through positive lab results, would likely have biases similar to those of the population they identify.
Although NLP classification is limited by the imperfect data that goes into it, we found that it improves upon the use of diagnostic codes alone for most examined subgroups. Furthermore, we demonstrated that reasonable performance on an XGBoost model can be achieved with as few as 500 training samples – just one-eighth the total used in our main analysis. This highlights how accessible this approach to cohort identification is, even in settings with limited computational resources. At the same time, the dramatic drop in the NLP classifier’s performance on notes after the end of training data collection points to the need to assess model performance and update model training frequently. The costs of these frequent assessments and updates should be taken into account when considering the use of similar methods, and researchers should try different longitudinal cuts of data to assess how changing the beginning and end dates of data collection may affect performance.
This study contributes to the broader literature about the risk of misclassification in epidemiological studies using EHR data. It is well-known that, while documentation of a given diagnostic code is usually a reliable indicator of disease, lack of documentation cannot be assumed to indicate lack of disease with the same level of certainty.23 This can introduce bias into epidemiological studies when a specific code is correlated with access to care, which is referred to as “informative presence bias.”24 Studies’ inclusion criteria around number and type of observations can also unintentionally introduce bias, since sicker patients are more likely to have more documented care.25 Bayesian methods, among others, may be used as a statistical approach to correcting for differential misclassification,26 although reducing misclassification is acknowledged as superior to statistical corrections, when possible.27
Using NLP either alone or in combination with structured data such as diagnosis codes is a common strategy for overcoming some of the challenges of cohort identification in EHRs, although its application to COVID-19 has been limited. Mermin-Bunnel, et al., used NLP to triage patients with COVID-19 based on written messages to their providers and achieved an F1 of 96% for the COVID-positive class.28 Researchers have used NLP more broadly for other conditions, generally noting a substantial increase in performance versus structured data. For example, Cunningham, et al., achieved an AUC of 0.93 using text data to identify heart failure patients, while structured data identification had an AUC of 0.50.29 Weiner, et al., found similar gains when identifying patients with chronic cough: their NLP model had a sensitivity of 77% versus just 15% for their structured data model.30 Similarly, the NLP model created by Liu, et al., had a sensitivity of 73.8% for lung cancer screening eligibility (versus 22.2% with structured data only), and the authors noted a particularly substantial increase in identification of Black/African American patients who were eligible for lung cancer screening.31 Of course, not all researchers find such remarkable performance improvements from adding text data: Zhao, et al., used structured data to achieve an AUC of up to 0.87 for identification of axial spondyloarthritis patients, which increased to 0.93 when text data were added.32
Future research could adopt several strategies to address the issues raised by our work. One of these may be to directly query generative transformer models like Qwen,33 rather than using classification models built on encoder-based transformer models. Another useful line of research could be directed towards better understanding how often NLP-based classification models need to be retrained in order to maintain performance given changes to documentation. Our study focused on COVID-19, documentation of which likely changed dramatically as a result of improved understanding of the disease and the dissemination of home-based testing and care. As a result, the model’s performance became very poor when tested on data from outside the training period. Other diseases are subject to these changes in documentation style as well, which may cause classification models to lose performance. For instance, the rollout of virtual scribes has been shown to change note composition style.34
Our study had several limitations. Agreement between expert annotators was only moderate, which likely limited the potential performance of the NLP models. Many of the notes included in this study were from early in the COVID-19 pandemic, when diagnostic criteria and understanding of the disease were still evolving. As such, annotation was difficult in part because of the challenge of applying current diagnostic and treatment standards retrospectively. Approximately 21% of patients in the COVID-19 positive cohort were diagnosed at clinics that did not contribute clinical notes to the dataset, meaning that our cohort identification method could not be tested in all patients who received a COVID-19 antiviral medication. Another limitation stemming from missing data was our use of imputed race and ethnicity data: while our imputation method had high accuracy (AUC of 0.93), we cannot verify race or ethnicity for any specific patient among the approximately 20% with missing data. We should also caution that our model was trained on data from roughly the first three years of the COVID-19 pandemic and became dramatically less accurate when used to predict COVID-19 in patients seen after the end of training data collection. Finally, our equation of antiviral prescriptions with SARS-CoV-2 infection rests on the assumption that physicians did not prescribe these medications prophylactically. We based this assumption on the fact that many COVID-19 antivirals have potential adverse effects and drug interactions that would create barriers to prescription, and that organizations such as the Infectious Diseases Society of America have recommended the included antivirals only for post-exposure use.35 Still, we could not rule out prophylactic use.
Conclusion
This research has highlighted the challenges to cohort identification posed by the underuse of diagnostic codes for COVID-19. Our observation of differential misclassification should particularly cause us to reevaluate published observational studies for potential lack of generalizability. For researchers working with EHR data, NLP offers somewhat more reliable and equitable cohort identification of COVID-19 patients. Furthermore, our work suggests that relatively small training datasets can be used with minimal loss of accuracy, making NLP more accessible than many believe, although we also find that performance of these models may slip dramatically when applied to out-of-sample populations. The methods highlighted here also have wide-ranging implications for observational studies conducted on populations with any type of novel disease for which documentation standards may evolve and be implemented only gradually.
Contributor Information
Nathaniel Hendrix, Center for Professionalism and Value in Health Care, American Board of Family Medicine 1016 16th St NW Ste 800 Washington, DC 20036 United States of America.
Rishi V. Parikh, Department of Epidemiology and Population Health, Stanford School of Medicine Palo Alto, CA, USA
Madeline Taskier, Center for Professionalism and Value in Health Care, American Board of Family Medicine Washington, DC, USA.
Grace Walter, Robert Graham Center, American Academy of Family Physicians Washington, DC, USA.
Robert L. Phillips, Center for Professionalism and Value in Health Care, American Board of Family Medicine Washington, DC, USA
David H. Rehkopf, Department of Epidemiology and Population Health, Stanford School of Medicine Palo Alto, CA, USA
References
- 1. Pfaff ER, Girvin AT, Bennett TD, et al. Identifying who has long COVID in the USA: a machine learning approach using N3C data. Lancet Digit Health. 2022;4(7):e532–e541. doi: 10.1016/S2589-7500(22)00048-6
- 2. Wong HL, Hu M, Zhou CK, et al. Risk of myocarditis and pericarditis after the COVID-19 mRNA vaccination in the USA: a cohort study in claims databases. The Lancet. 2022;399(10342):2191–2199. doi: 10.1016/S0140-6736(22)00791-7
- 3. Bhatt AS, McElrath EE, Claggett BL, et al. Accuracy of ICD-10 Diagnostic Codes to Identify COVID-19 Among Hospitalized Patients. J Gen Intern Med. 2021;36(8):2532–2535. doi: 10.1007/s11606-021-06936-w
- 4. Lynch KE, Viernes B, Gatsby E, et al. Positive Predictive Value of COVID-19 ICD-10 Diagnosis Codes Across Calendar Time and Clinical Setting. Clin Epidemiol. 2021;13:1011–1018. doi: 10.2147/CLEP.S335621
- 5. Moll K, Hobbi S, Zhou CK, et al. Assessment of performance characteristics of COVID-19 ICD-10-CM diagnosis code U07.1 using SARS-CoV-2 nucleic acid amplification test results. PLOS ONE. 2022;17(8):e0273196. doi: 10.1371/journal.pone.0273196
- 6. Gilbert R, Martin RM, Donovan J, et al. Misclassification of outcome in case–control studies: Methods for sensitivity analysis. Stat Methods Med Res. 2016;25(5):2377–2393. doi: 10.1177/0962280214523192
- 7. Chen Y, Hao L, Zou VZ, Hollander Z, Ng RT, Isaac KV. Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system. BMC Med Res Methodol. 2022;22(1):136. doi: 10.1186/s12874-022-01583-z
- 8. Yang R, Zhu D, Howard LE, et al. Identification of Patients With Metastatic Prostate Cancer With Natural Language Processing and Machine Learning. JCO Clin Cancer Inform. 2022;(6):e2100071. doi: 10.1200/CCI.21.00071
- 9. Zhang H, Lu AX, Abdalla M, McDermott M, Ghassemi M. Hurtful words: quantifying biases in clinical contextual word embeddings. In: Proceedings of the ACM Conference on Health, Inference, and Learning. ACM; 2020:110–120. doi: 10.1145/3368555.3384448
- 10. Vala A, Hao S, Chu I, Phillips RL, Rehkopf D. The American Family Cohort (V12.5). Redivis; 2023.
- 11. Cheng L, Gallegos IO, Ouyang D, Goldin J, Ho D. How Redundant are Redundant Encodings? Blindness in the Wild and Racial Disparity when Race is Unobserved. In: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. FAccT '23. Association for Computing Machinery; 2023:667–686. doi: 10.1145/3593013.3594034
- 12. Fu S, Thorsteinsdottir B, Zhang X, et al. A hybrid model to identify fall occurrence from electronic health records. Int J Med Inf. 2022;162:104736. doi: 10.1016/j.ijmedinf.2022.104736
- 13. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Med. 2018;15(11):e1002683. doi: 10.1371/journal.pmed.1002683
- 14. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16. Association for Computing Machinery; 2016:785–794. doi: 10.1145/2939672.2939785
- 15. Shwartz-Ziv R, Armon A. Tabular data: Deep learning is not all you need. Inf Fusion. 2022;81:84–90. doi: 10.1016/j.inffus.2021.11.011
- 16. Staudemeyer RC, Morris ER. Understanding LSTM -- a tutorial into Long Short-Term Memory Recurrent Neural Networks. Published online September 12, 2019. doi: 10.48550/arXiv.1909.09586
- 17. Alsentzer E, Murphy JR, Boag W, et al. Publicly Available Clinical BERT Embeddings. Published online June 20, 2019. doi: 10.48550/arXiv.1904.03323
- 18. Wolf T, Debut L, Sanh V, et al. HuggingFace's Transformers: State-of-the-Art Natural Language Processing. arXiv; 2020. doi: 10.48550/arXiv.1910.03771
- 19. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-generation Hyperparameter Optimization Framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD '19. Association for Computing Machinery; 2019:2623–2631. doi: 10.1145/3292500.3330701
- 20. Lipton ZC, Elkan C, Naryanaswamy B. Optimal Thresholding of Classifiers to Maximize F1 Measure. In: Calders T, Esposito F, Hüllermeier E, Meo R, eds. Machine Learning and Knowledge Discovery in Databases. Springer; 2014:225–239. doi: 10.1007/978-3-662-44851-9_15
- 21. Magesh S, John D, Li WT, et al. Disparities in COVID-19 Outcomes by Race, Ethnicity, and Socioeconomic Status: A Systematic Review and Meta-analysis. JAMA Netw Open. 2021;4(11):e2134147. doi: 10.1001/jamanetworkopen.2021.34147
- 22. Isath A, Malik AH, Goel A, Gupta R, Shrivastav R, Bandyopadhyay D. Nationwide Analysis of the Outcomes and Mortality of Hospitalized COVID-19 Patients. Curr Probl Cardiol. 2023;48(2):101440. doi: 10.1016/j.cpcardiol.2022.101440
- 23. Gianfrancesco MA, Goldstein ND. A narrative review on the validity of electronic health record-based research in epidemiology. BMC Med Res Methodol. 2021;21(1):234. doi: 10.1186/s12874-021-01416-5
- 24. Harton J, Mitra N, Hubbard RA. Informative presence bias in analyses of electronic health records-derived data: a cautionary note. J Am Med Inform Assoc. 2022;29(7):1191–1199. doi: 10.1093/jamia/ocac050
- 25. Rusanov A, Weiskopf NG, Wang S, Weng C. Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Med Inform Decis Mak. 2014;14(1):51. doi: 10.1186/1472-6947-14-51
- 26. Chu R, Gustafson P, Le N. Bayesian adjustment for exposure misclassification in case–control studies. Stat Med. 2010;29(9):994–1003. doi: 10.1002/sim.3829
- 27. Brooks DR, Getz KD, Brennan AT, Pollack AZ, Fox MP. The Impact of Joint Misclassification of Exposures and Outcomes on the Results of Epidemiologic Research. Curr Epidemiol Rep. 2018;5(2):166–174. doi: 10.1007/s40471-018-0147-y
- 28. Mermin-Bunnell K, Zhu Y, Hornback A, et al. Use of Natural Language Processing of Patient-Initiated Electronic Health Record Messages to Identify Patients With COVID-19 Infection. JAMA Netw Open. 2023;6(7):e2322299. doi: 10.1001/jamanetworkopen.2023.22299
- 29. Cunningham JW, Singh P, Reeder C, et al. Natural Language Processing for Adjudication of Heart Failure in the Electronic Health Record. JACC Heart Fail. 2023;11(7):852–854. doi: 10.1016/j.jchf.2023.02.012
- 30. Weiner M, Dexter PR, Heithoff K, et al. Identifying and Characterizing a Chronic Cough Cohort Through Electronic Health Records. Chest. 2021;159(6):2346–2355. doi: 10.1016/j.chest.2020.12.011
- 31. Liu S, McCoy AB, Aldrich MC, et al. Leveraging natural language processing to identify eligible lung cancer screening patients with the electronic health record. Int J Med Inf. 2023;177:105136. doi: 10.1016/j.ijmedinf.2023.105136
- 32. Zhao SS, Hong C, Cai T, et al. Incorporating natural language processing to improve classification of axial spondyloarthritis using electronic health records. Rheumatology. 2020;59(5):1059–1065. doi: 10.1093/rheumatology/kez375
- 33. Yang A, Yu B, Li C, et al. Qwen2.5-1M Technical Report. Published online January 26, 2025. doi: 10.48550/arXiv.2501.15383
- 34. Ong SY, Moore Jeffery M, Williams B, O'Connell RT, Goldstein R, Melnick ER. How a Virtual Scribe Program Improves Physicians' EHR Experience, Documentation Time, and Note Quality. NEJM Catal. 2021;2(12). doi: 10.1056/CAT.21.0294
- 35. Bhimraj A, Morgan R, Shumaker A, Baden L, Cheng V, Edwards K. Infectious Diseases Society of America Guidelines on the Treatment and Management of Patients with COVID-19, Version 11.0.0. Infectious Diseases Society of America. Accessed May 30, 2024. https://www.idsociety.org/practice-guideline/covid-19-guideline-treatment-and-management/
Associated Data
Data Availability Statement
Please see https://americanfamilycohort.org for information on data access.
