Author manuscript; available in PMC: 2023 May 24.
Published in final edited form as: Nat Med. 2023 May;29(5):1040–1043. doi: 10.1038/s41591-023-02274-y

Potential pitfalls in the use of real world data to study Long COVID

Harrison G Zhang 1,2,5, Jacqueline P Honerlaw 2, Monika Maripuri 2, Malarkodi Jebathilagam Samayamuthu 3, Brendin R Beaulieu-Jones 1, Huma S Baig 4, Sehi L’Yi 1, Yuk-Lam Ho 2, Michele Morris 3, Vidul Ayakulangara Panickan 1, Xuan Wang 1, Griffin M Weber 1, Katherine P Liao 2,5, Shyam Visweswaran 3, Bryce WQ Tan 6, William Yuan 1, Nils Gehlenborg 1, Sumitra Muralidhar 7, Rachel B Ramoni 7; The Consortium for Clinical Characterization of COVID-19 by EHR (4CE)1, Isaac S Kohane 1,*, Zongqi Xia 8,*, Kelly Cho 2,*, Tianxi Cai 1,2,*, Gabriel A Brat 1,*
PMCID: PMC10205658  NIHMSID: NIHMS1877723  PMID: 37055567

Large-scale real-world data, such as that from electronic health records (EHRs), has been used to establish vaccine efficacy, elucidate the genetic etiologies of diseases, and advance epidemiological research.13 Real-world data also has the potential to capture the wide spectrum of clinical features attributed to post-acute sequelae of SARS-CoV-2, also called long COVID, in diverse patient populations.4

We are an international consortium that has operationalized definitions of long COVID using health agency guidelines and established a chart review procedure based on these definitions.5 During this process, we identified three major challenges in using real-world data to study long COVID: ambiguity and heterogeneity in clinical coding of long COVID; inadequacy of diagnostic codes in capturing the constellation of symptoms; and biases in EHR data arising from variability in the number and kind of contacts with the healthcare system. These challenges warrant special attention if the clinical community wishes to arrive at a robust understanding of long COVID using evidence derived from real-world data.

We performed a manual medical record review of 300 randomly sampled patients who were infected with SARS-CoV-2 and assigned the International Classification of Diseases (ICD)-10 code for long COVID (U09.9) at the Beth Israel Deaconess Medical Center, the University of Pittsburgh Medical Center, and the national U.S. Veterans Health Administration.5 These three health systems collectively serve over 15 million patients each year.

We evaluated the extent to which patients with the ICD-10 code for this condition met our operationalized definitions, which were based on guidelines from the World Health Organization (WHO) and the US Centers for Disease Control and Prevention (CDC).57 Our long COVID definition based on WHO guidelines required that a patient present with at least two new-onset persistent symptoms lasting for 60 days after infection, whereas our definition based on CDC guidelines required that a patient present with at least one new-onset persistent symptom lasting for 30 days.5
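The two operationalized definitions can be read as a simple rule over per-patient symptom records. The following is a minimal sketch, not the consortium's actual implementation: the `Symptom` record, its field names, and the helper functions are hypothetical, and the sketch assumes that new-onset status and persistence (in days after infection) have already been established by chart review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symptom:
    """Hypothetical chart-review record for one symptom."""
    name: str
    new_onset: bool       # absent before the SARS-CoV-2 infection
    days_persisting: int  # days after infection the symptom was still present


def meets_definition(symptoms, min_symptoms, min_days):
    """True if enough distinct new-onset symptoms persist for at least min_days."""
    qualifying = {
        s.name
        for s in symptoms
        if s.new_onset and s.days_persisting >= min_days
    }
    return len(qualifying) >= min_symptoms


def meets_who(symptoms):
    # WHO-based definition described in the text: at least two symptoms, 60 days.
    return meets_definition(symptoms, min_symptoms=2, min_days=60)


def meets_cdc(symptoms):
    # CDC-based definition described in the text: at least one symptom, 30 days.
    return meets_definition(symptoms, min_symptoms=1, min_days=30)


# Illustrative patient (invented values, not study data).
patient = [
    Symptom("fatigue", new_onset=True, days_persisting=75),
    Symptom("brain fog", new_onset=True, days_persisting=40),
    Symptom("pain", new_onset=False, days_persisting=200),  # pre-existing
]
print(meets_who(patient), meets_cdc(patient))  # prints: False True
```

In this invented example the patient satisfies the CDC-based rule but not the WHO-based rule, because only one new-onset symptom persists to 60 days and the pre-existing symptom is excluded from both.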

A comparison of real-world EHR and administrative data with manually extracted clinical information (obtained through chart review of patients with the U09.9 code) found that functional definitions of long COVID varied widely by provider, which led to inconsistencies in coding practice and adherence to clinical definitions. Among patients assigned the U09.9 code, an average of 40.2% met the more stringent WHO definition, 58.3% had a single symptom that met the WHO definition, and 65.4% met the less stringent CDC definition.5 This shows that the ICD-10 code is an unreliable surrogate for long COVID disease status in research. Research and policy efforts are needed to converge on a definition that will standardize coding practices and improve the reliability of the ICD-10 code.

Coding is further obfuscated by the potential for misclassification of long COVID with long-lasting complications from acute hospitalization, which are not specific to COVID-19.8 We found that an average of 42.3% of patients assigned the U09.9 code were hospitalized after infection and an average of 12.3% received intensive care, both of which can produce long-lasting symptoms that overlap with long COVID.5 Physical and physiological effects of hospitalization or critical care are important patient-level factors that should not be misattributed to SARS-CoV-2 infection.

Capturing long COVID symptomatology using diagnosis codes is difficult, as the syndrome encompasses a constellation of non-specific symptoms, including pain, fatigue, and brain fog, that are not well represented by coding schemes such as ICD-10.47 Leveraging textual data from EHRs may improve the ability to capture symptomatology. When we compared symptom capture by ICD-10 codes with symptom capture by natural language processing of clinical narratives, such as clinician notes and discharge summaries, we found that incorporating narrative data significantly improved identification of symptoms compared with using diagnosis codes alone.5 This shows the potential of natural language processing techniques to provide a more complete representation of a patient’s health.

A further challenge is the definition of long COVID patient cohorts, as the syndrome is defined by a time to presentation and therefore requires an index date from which to observe clinical outcomes. The index date is usually an initial infection date, which may become increasingly difficult to ascertain with the use of at-home testing, the results of which are inconsistently reported in EHRs. Researchers should therefore perform routine quality controls (such as checking the time period between initial infection and input of the ICD-10 code for long COVID), to better understand biases present in the data. Researchers should also allow for some flexibility in defining index dates, such as considering an infection time period rather than a single date, which helps to account for delays in billing or data processing.
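The quality controls suggested above can be sketched as a small set of checks on each record. This is an illustrative sketch only: the function names, the status labels, the 30-day minimum gap, and the 14-day slack are assumptions chosen for the example and are not values reported in the source.

```python
from datetime import date, timedelta


def days_from_infection_to_code(infection_date, code_date):
    """Days between the recorded infection and entry of the U09.9 code."""
    return (code_date - infection_date).days


def flag_record(infection_date, code_date, min_gap_days=30):
    """Classify one record for quality control.

    min_gap_days is an illustrative threshold, not a value from the source.
    """
    if infection_date is None:
        return "missing_index_date"  # e.g. an unreported at-home test
    gap = days_from_infection_to_code(infection_date, code_date)
    if gap < 0:
        return "code_before_infection"
    if gap < min_gap_days:
        return "code_too_early"
    return "ok"


def index_window(recorded_date, slack_days=14):
    """Treat the index as a time period around the recorded date."""
    slack = timedelta(days=slack_days)
    return recorded_date - slack, recorded_date + slack


# Invented dates for illustration.
print(flag_record(date(2022, 1, 10), date(2022, 1, 20)))  # code_too_early
print(flag_record(None, date(2022, 4, 1)))                # missing_index_date
print(index_window(date(2022, 1, 15)))
```

Tabulating these flags across a cohort gives a quick view of how often the index date is missing or implausible before any outcome analysis is run.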

Researchers must remain cognizant of potential patient selection bias in real-world data; using visits to a long COVID clinic as a proxy for true disease status is problematic. We found that, on average, only 24.0% of patients assigned the U09.9 code visited a long COVID clinic, suggesting that the majority of patients sampled were being coded by physicians who do not work at these clinics.5 Among patients who met the WHO definition of long COVID, only an average of 35.6% visited a long COVID clinic, suggesting that many patients are not being seen at these specialty care facilities.5

Studies should also account for differential data density and healthcare utilization. We found that patients who visited a long COVID clinic were on average annotated with more new-onset conditions when compared to patients who never visited a long COVID clinic.5 Physicians working at long COVID clinics could be more experienced with the syndrome and therefore document the disease more thoroughly. This contributes to a difference in data density and granularity, which can confound findings if not properly addressed.
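One simple way to see how data density can distort a comparison is to report new-onset conditions both as raw counts and per healthcare encounter. The sketch below is an assumption-laden illustration, not the method used in the study: the dictionary keys and the function name are hypothetical, the values are invented, and per-encounter normalization is only one of several possible adjustments.

```python
from statistics import mean


def raw_and_normalized_means(patients):
    """Mean new-onset conditions per group, raw and per encounter.

    Each patient is a dict with hypothetical keys:
    'clinic' (bool), 'conditions' (int), 'encounters' (int, > 0).
    """
    summary = {}
    for clinic in (True, False):
        group = [p for p in patients if p["clinic"] == clinic]
        summary[clinic] = {
            "raw": mean(p["conditions"] for p in group),
            "per_encounter": mean(
                p["conditions"] / p["encounters"] for p in group
            ),
        }
    return summary


# Toy values for illustration only; not data from the study.
toy = [
    {"clinic": True, "conditions": 6, "encounters": 12},
    {"clinic": True, "conditions": 4, "encounters": 8},
    {"clinic": False, "conditions": 2, "encounters": 4},
    {"clinic": False, "conditions": 1, "encounters": 2},
]
result = raw_and_normalized_means(toy)
print(result)
```

In the toy data, clinic patients have more recorded conditions in raw terms, but the per-encounter rate is identical in both groups, which is the kind of pattern that would point to documentation density rather than a true difference in disease burden.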

Studies of long COVID using real-world data must be based on robust and comprehensive clinical datasets. The incorporation of narrative data obtained using natural language processing techniques should better capture symptoms, and researchers should exercise caution when using only the ICD-10 code or a visit to a long COVID clinic as a surrogate for disease status.

Computational phenotypes (where data elements are combined using machine learning algorithms to describe a particular disorder) have the potential to account for the longitudinal persistence of symptoms while avoiding the misattribution of conditions that existed prior to initial infection.9 Semi-supervised machine learning algorithms are resistant to some of these challenges, and so may be powerful tools to capture complex underlying temporal patterns in the data using a small number of manually curated labels. Rule-based algorithms may be less suited for the inherent complexity of long COVID.10

Real-world data has an important role in supporting long COVID research, but these pitfalls should be considered so that the most equitable clinical and policy decisions can be informed by population-level studies.

Footnotes

Competing interests

The authors declare no competing interests.

REFERENCES