[Preprint]. 2020 Nov 10:2020.11.06.20227165. [Version 1] doi: 10.1101/2020.11.06.20227165

ConceptWAS: a high-throughput method for early identification of COVID-19 presenting symptoms

Juan Zhao 1, Monika E Grabowska 2, Vern Eric Kerchberger 3,1, Joshua C Smith 1, H Nur Eken 4, QiPing Feng 5, Josh F Peterson 1, S Trent Rosenbloom 1, Kevin B Johnson 1, Wei-Qi Wei 1,*
PMCID: PMC7668764  PMID: 33200151

Abstract

Objective:

Identifying symptoms highly specific to COVID-19 would improve the clinical and public health response to infectious outbreaks. Here, we describe a high-throughput approach – Concept-Wide Association Study (ConceptWAS) that systematically scans a disease’s clinical manifestations from clinical notes. We used this method to identify symptoms specific to COVID-19 early in the course of the pandemic.

Methods:

Using the Vanderbilt University Medical Center (VUMC) EHR, we parsed clinical notes through a natural language processing pipeline to extract clinical concepts. We examined the difference in concepts derived from the notes of COVID-19-positive and COVID-19-negative patients on the PCR testing date. We performed ConceptWAS on cumulative data every two weeks to identify specific COVID-19 symptoms early in the pandemic.

Results:

We processed 87,753 notes from 19,692 patients (1,483 COVID-19-positive) subjected to COVID-19 PCR testing between March 8, 2020, and May 27, 2020. We found 68 clinical concepts significantly associated with COVID-19. We identified symptoms associated with increased risk of COVID-19, including “absent sense of smell” (odds ratio [OR] = 4.97, 95% confidence interval [CI] = 3.21–7.50), “fever” (OR = 1.43, 95% CI = 1.28–1.59), “with cough fever” (OR = 2.29, 95% CI = 1.75–2.96), and “ageusia” (OR = 5.18, 95% CI = 3.02–8.58). Using ConceptWAS, we detected loss of sense of smell or taste three weeks before these symptoms were added to the list of COVID-19 symptoms by the Centers for Disease Control and Prevention (CDC).

Conclusion:

ConceptWAS is a high-throughput approach for exploring specific symptoms of a disease like COVID-19, with promise for enabling EHR-powered early identification of disease manifestations.

Keywords: COVID-19, EHR, Natural language processing

Introduction

As of October 14, 2020, over 7.7 million people in the United States (U.S.) and 37 million worldwide were infected with coronavirus SARS-CoV-2, the agent responsible for COVID-19 [1]. The virus’s high transmissibility, lack of native immunity, and the dearth of effective treatments make managing COVID-19 uniquely challenging. Hence, early recognition of specific COVID-19 symptoms plays an essential role in the clinical and public health response, enabling rapid symptom screening, diagnostic testing, and contact tracing.

Early in the pandemic, physicians observed fever, cough, and shortness of breath as presenting symptoms of COVID-19; however, these symptoms are common to many viral and bacterial illnesses [2]. Subsequently, as new symptoms were reported, health departments and ministries updated the list of COVID-19 symptoms; for example, the U.S. CDC and the Department of Health and Social Care in the UK added loss of smell or taste, a highly indicative symptom [3], to the list in late April and mid-May, respectively [4,5]. Therefore, methods for identifying disease-specific symptoms early in a pandemic are needed; such methods are crucial for informing the public when to present for testing and could reduce the ultimate size of an outbreak, lowering overall morbidity and mortality.

Recent efforts to track COVID-19 symptoms have used methods such as scanning publications or Twitter [6,7], deploying questionnaires [8], or releasing apps for self-reporting symptoms [9]. However, results from publications and questionnaires may be delayed, and data from social media or self-reporting apps do not always include proper controls and lack the physiological assessments needed to determine COVID-19 status. Electronic Health Record (EHR) data have also been used to characterize COVID-19, given the availability of routinely collected medical data. However, existing studies were limited to structured data (e.g., coded diagnoses, procedures, or lab tests) [10,11] and lacked a portable and high-throughput approach [12].

Here, we present a high-throughput approach (ConceptWAS) for early identification of clinical manifestations of COVID-19 using natural language processing (NLP) on EHR clinical notes. By examining EHR-derived concepts related to patients’ signs or symptoms, ConceptWAS assesses whether any of the concepts are associated with the presence or absence of a disease (e.g., COVID-19). Using this approach, we identified symptoms specific to COVID-19. In particular, we performed ConceptWAS on cumulative data every two weeks to demonstrate the timeline of emerging symptoms. We also conducted a chart review to validate the significant associations.

Methods

Study setting

The study was performed at Vanderbilt University Medical Center (VUMC), one of the largest primary care and referral health systems serving over one million patients annually from middle Tennessee and the Southeast United States. We used data from patients represented in the VUMC EHR aged ≥18 years. The study was approved by the VUMC Institutional Review Board (IRB #200512).

Cohort definition

We first identified patients who received at least one SARS-CoV-2 polymerase chain reaction (PCR) test between March 8 (when the first COVID-19 case emerged at VUMC) and May 27, 2020 (Figure A.1). COVID-19 status was determined by the PCR test result. The case group (COVID-19-positive) comprised patients with ≥1 positive PCR result, and the control group (COVID-19-negative) consisted of patients with only negative PCR tests. We excluded patients who had no clinical notes on the day the PCR test was ordered.
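As an illustration only (not the authors' actual implementation), the case/control assignment could be sketched in Python as below, assuming a hypothetical `pcr_tests.csv` table with `patient_id`, `test_date`, and `result` columns.

```python
import pandas as pd

# Hypothetical PCR results table: patient_id, test_date, result ("positive"/"negative").
pcr = pd.read_csv("pcr_tests.csv", parse_dates=["test_date"])

# Restrict to the study window (March 8 - May 27, 2020).
pcr = pcr[(pcr["test_date"] >= "2020-03-08") & (pcr["test_date"] <= "2020-05-27")]

# Cases: patients with >=1 positive PCR result. Controls: patients with only negative results.
positive_ids = set(pcr.loc[pcr["result"] == "positive", "patient_id"])
cohort = pcr.groupby("patient_id", as_index=False).agg(first_test=("test_date", "min"))
cohort["covid_positive"] = cohort["patient_id"].isin(positive_ids).astype(int)
```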

Data collection

We extracted clinical notes from the 24 hours prior to the PCR testing date (day0) for COVID-19-positive and COVID-19-negative patients (>86% of patients had at least one note within this window, see Figure B.1). If a patient first tested negative and subsequently tested positive, or tested positive more than once, we used the date of the first positive PCR test as day0. We also segmented the study period into two-week intervals and used the cumulative data every two weeks to perform a temporal analysis.
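A minimal sketch of selecting each patient's index date (day0) and the notes within the 24-hour lookback, reusing the hypothetical `pcr` table above; the `notes.csv` file and its column names are likewise illustrative assumptions.

```python
import pandas as pd

# day0: first positive PCR date for cases; otherwise the first test date for controls (assumption).
pos = pcr[pcr["result"] == "positive"].groupby("patient_id")["test_date"].min()
neg = pcr[~pcr["patient_id"].isin(pos.index)].groupby("patient_id")["test_date"].min()
day0 = pd.concat([pos, neg]).rename("day0").reset_index()

# Keep notes written within the 24 hours preceding day0 (hypothetical notes table).
notes = pd.read_csv("notes.csv", parse_dates=["note_datetime"])  # patient_id, note_datetime, text
notes = notes.merge(day0, on="patient_id")
in_window = (notes["note_datetime"] >= notes["day0"] - pd.Timedelta(hours=24)) & (
    notes["note_datetime"] <= notes["day0"]
)
study_notes = notes[in_window]
```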

Concept extraction

We used the KnowledgeMap Concept Indexer (KMCI [13]) to preprocess the clinical notes and extract concepts (Figure C.1). KMCI is a local NLP pipeline developed at VUMC for processing medical notes and recognizing entities, and it has been used in several clinical and genomic studies [13–15]. Preprocessing includes sentence boundary detection, tokenization, part-of-speech tagging, and section header identification. The concepts were represented as Unified Medical Language System concept unique identifiers (UMLS CUIs). Since we focused on capturing clinical manifestations of COVID-19, we restricted the concepts to SNOMED Clinical Terms and specific semantic types, e.g., finding, sign or symptom, disease or syndrome, individual behavior, or mental process (see full list in Table C.1).
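A sketch of restricting extracted concepts to the semantic types of interest, assuming the pipeline emits one record per concept mention; the semantic-type list here is abbreviated and illustrative (the full list is in Table C.1).

```python
# Semantic types of interest (abbreviated, illustrative; see Table C.1 for the full list).
SEMANTIC_TYPES_OF_INTEREST = {
    "Finding",
    "Sign or Symptom",
    "Disease or Syndrome",
    "Individual Behavior",
    "Mental Process",
}

def filter_concepts(concepts):
    """Keep only concept records whose semantic type is of interest.

    `concepts` is an iterable of dicts such as:
    {"cui": "C0015967", "name": "Fever", "semantic_type": "Sign or Symptom", "negated": False}
    """
    return [c for c in concepts if c["semantic_type"] in SEMANTIC_TYPES_OF_INTEREST]
```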

Assertion and negation detection

A main challenge in clinical NLP is accurately detecting assertion modifiers of clinical entities, such as negated, uncertain, and hypothetical (e.g., a description of a future possibility or an instruction for patients). We took the following steps to post-process the KMCI output and remove CUIs that were uncertain or hypothetical. We first excluded any concepts that arose from family history sections. Next, we removed any sentences in the future tense or subjunctive mood (e.g., “should”, “could”, or “if”) that describe a hypothetical situation or an instruction for patients. We also excluded inquiry sentences that served as template questions without a simple confirmed answer (e.g., “Yes”, “No”, or “None”). For recognition of negated concepts (e.g., “patient denies having any fever”), KMCI implements NegEx, a widely used negation detection algorithm. However, NegEx may miss post-negation triggers such as “Cough: No”. We added regular expression rules based on our local note templates to enhance negation detection.
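A simplified sketch of the kinds of post-processing rules described above: dropping hypothetical/subjunctive sentences, dropping unanswered template questions, and catching post-negation patterns such as "Cough: No" that NegEx can miss. The trigger words and regular expressions are illustrative, not the authors' exact rules.

```python
import re

# Words signaling hypothetical, future, or instructional statements (illustrative list).
HYPOTHETICAL_TRIGGERS = re.compile(r"\b(should|could|would|if|unless)\b", re.IGNORECASE)

# Post-negation template pattern, e.g. "Cough: No" or "Fever - none".
POST_NEGATION = re.compile(
    r"\b(?P<symptom>[A-Za-z ]+?)\s*[:\-]\s*(no|none|denies|negative)\b", re.IGNORECASE
)

def is_hypothetical(sentence: str) -> bool:
    """Flag sentences written in a hypothetical or instructional mood."""
    return bool(HYPOTHETICAL_TRIGGERS.search(sentence))

def is_unanswered_template_question(sentence: str) -> bool:
    """Flag template questions that have no simple confirmed answer."""
    return sentence.rstrip().endswith("?") and not re.search(
        r"\b(yes|no|none)\b", sentence, re.IGNORECASE
    )

def post_negated_terms(sentence: str):
    """Return symptom strings negated by a trailing 'No'/'none' template pattern."""
    return [m.group("symptom").strip() for m in POST_NEGATION.finditer(sentence)]
```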

The extended processing modules were implemented in Python 3.6. After processing, the extracted CUIs served as the input for the subsequent ConceptWAS analysis.

ConceptWAS analysis

Analogous to genome-wide association studies (GWAS) and phenome-wide association studies (PheWAS), which scan genomic and phenomic data for associations with a given disease or genetic variant [16,17], ConceptWAS examines the clinical concepts retrieved from clinical notes to determine whether any concept is associated with a disease. In this study, we applied ConceptWAS to identify associations between symptom-related concepts and the presence of COVID-19.

We applied Firth’s logistic regression to examine the association for each concept, adjusting for age, gender, and race. We chose Firth’s logistic regression because it has become a standard approach for analyzing binary outcomes with small samples [18]. Negated and non-negated CUIs were treated separately, and a CUI with a given negation flag (yes/no) was counted at most once per patient. Firth’s logistic regression was implemented using R version 3.4.3 and the logistf package. Because we tested multiple hypotheses, we applied a Bonferroni correction to the significance level. We report the odds ratio (OR), p-value, and prevalence in the case and control groups for each CUI. We used a volcano plot to show p-values and odds ratios for all CUIs, and a forest plot to show the significant concepts relevant to signs and symptoms.
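The per-concept scan can be sketched as below. The authors fit Firth's logistic regression with R's logistf package; this Python stand-in uses ordinary maximum-likelihood logistic regression from statsmodels purely to illustrate the scan structure (one model per CUI, covariate adjustment, Bonferroni threshold), not to reproduce the Firth penalization. The `patients` and `cui_matrix` inputs are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def conceptwas(patients: pd.DataFrame, cui_matrix: pd.DataFrame, alpha: float = 0.05):
    """Scan every CUI for association with COVID-19 status (illustrative sketch).

    patients:   one row per patient with columns covid_positive, age, male, white (0/1 indicators).
    cui_matrix: patient-by-CUI indicator matrix (1 if the concept appears in the
                patient's notes, 0 otherwise), aligned to `patients` by index.
    """
    covariates = sm.add_constant(patients[["age", "male", "white"]].astype(float))
    y = patients["covid_positive"].astype(float)
    results = []
    for cui in cui_matrix.columns:
        X = covariates.assign(concept=cui_matrix[cui].astype(float))
        try:
            fit = sm.Logit(y, X).fit(disp=0)
        except Exception:  # e.g. complete separation; Firth's method handles such cases better
            continue
        results.append({
            "cui": cui,
            "odds_ratio": float(np.exp(fit.params["concept"])),
            "p_value": float(fit.pvalues["concept"]),
        })
    out = pd.DataFrame(results)
    out["significant"] = out["p_value"] < alpha / len(cui_matrix.columns)  # Bonferroni threshold
    return out.sort_values("p_value")
```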

Chart review

We performed a manual chart review to evaluate the clinical plausibility of the identified signals. We reviewed a CUI if 1) its p-value met the Bonferroni-corrected significance threshold, and 2) it was clinically meaningful (e.g., we excluded CUIs such as “finding [CUI C00000243095]” in a sentence like “Findings are nonspecific.”). We randomly selected notes from which the CUI was identified. Two authors (M.E.G. and H.N.E.) with clinical backgrounds ascertained whether the identified CUI was a true signal or a false positive.

Results

We identified 19,692 patients with COVID-19 PCR test results during the study period (Figure A.1). Patients’ mean age was 45 (44.6 ± 16.9) years. A total of 1,483 (7.5%) patients tested positive for COVID-19. The COVID-19-positive group was younger (41.5 ± 16.2 vs. 44.9 ± 16.9), more often male (48.0% vs. 41.7%), less often white (49.6% vs. 66.7%), and newer to VUMC (EHR length 7.3 years ± 8.1 vs. 9.2 ± 8.5) compared to COVID-19-negative patients (Table 1).

Table 1.

Patient characteristics of the study cohort

Attribute | Cases: COVID-19-positive (n=1,483) | Controls: COVID-19-negative (n=18,209) | P-value
Age (mean years ± stddev) | 41.5 (16.2) | 44.9 (16.9) | <0.0001
Gender (% male) | 48.0% | 41.7% | <0.0001*
Race (% White) | 49.6% | 66.7% | <0.0001*
Average EHR length (years ± stddev) | 7.3 (8.1) | 9.2 (8.5) | <0.0001
Average CUIs (± stddev) | 46.1 (61.1) | 71.9 (96.3) | <0.0001

* A two-proportion z-test was performed. For age, EHR length, and average CUIs (reported as mean ± standard deviation), a t-test was performed.

Comparison of EHR-derived concepts between COVID-19 positive and negative patients

We extracted 87,753 clinical notes for the 19,692 patients. After processing the notes with the NLP pipeline, we recognized 19,595 CUIs (including negated status) with semantic types of interest (Table B.1). Using ConceptWAS to compare EHR-derived concepts between COVID-19-positive and COVID-19-negative patients, we identified 68 CUIs after adjusting for multiple testing (Bonferroni-corrected significance, P < 2.55E-06) (Figure 1, Table E.1). The top signals included “depression” (OR = 0.34, 95% CI = 0.24–0.47), “edema” (OR = 0.40, 95% CI = 0.29–0.53), “fever (negated)” (OR = 0.63, 95% CI = 0.55–0.72), and “reaction anxiety” (OR = 0.39, 95% CI = 0.28–0.52). Symptom concepts associated with COVID-19-positive patients included “absent sense of smell” (OR = 4.97, 95% CI = 3.21–7.50), “fever” (OR = 1.43, 95% CI = 1.28–1.59), “with cough fever” (OR = 2.29, 95% CI = 1.75–2.96), and “ageusia” (OR = 5.18, 95% CI = 3.02–8.58) (Figure 2).

Figure 1.

Volcano plot of a ConceptWAS scan for 19,692 patients comprising the COVID-19-positive group (cases) and the COVID-19-negative group (controls). The points are colored by the semantic type of the concepts. Selected associations related to signs, symptoms, or diseases/syndromes are labeled. The plot shows −log10(p-value) for association (y-axis) against log2(fold change) (x-axis). The dashed line represents the significance level using a Bonferroni correction.

Figure 2.

Forest plot comparing individual concepts between COVID-19-positive (case) and COVID-19-negative (control) patients. Selected associations include the significant signals related to semantic types of symptoms that met Bonferroni-corrected significance (p-value < 2.55E-06). The odds ratio has been adjusted for age, gender, and race. The concepts are ordered by p-value.

Concepts related to smoking status, such as “current some day smoker”, “former smoker”, and “smoking monitoring status”, were more frequently reported in the COVID-19-negative group than in the COVID-19-positive group (OR < 1, P < 2.55E-06), suggesting more smokers in the control group. To ascertain whether this signal was true or a false positive due to incorrect assertion detection by the NLP pipeline, we performed a chart review of 80 patients’ notes that contained smoking-related CUIs. We found that 79 of 80 patients had an affirmative smoking status (see Chart review below).

Temporal analysis

We performed ConceptWAS using the cumulative data every two weeks within the study period (Figure 3, Figure D.1). By week 4 (by April 5, 2020), “absence of smell” (OR = 10.24; 95% CI = 5.18–20.06) and “ageusia” (loss of taste, OR = 11.79; 95% CI = 5.55–25.2) became significantly associated with increased risk of COVID-19 infection. These two signals remained significant through the subsequent weeks (Supplementary Data). “Fever (negated)” became significant (OR = 0.55; 95% CI = 0.43–0.71) at week 2 (between March 8 and 22, 2020), and “with cough fever” became significant (OR = 2.09; 95% CI = 1.60–2.70) from week 8 (between March 8 and May 3, 2020). Depression and anxiety became significant starting from week 4 (by April 5, 2020).
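A sketch of the bi-weekly cumulative analysis: rerun the same scan on all patients tested up to each two-week cut point and track when each concept first crosses the Bonferroni threshold. It reuses the hypothetical `conceptwas` function, `patients` table (with an assumed `day0` column), and `cui_matrix` sketched above; the cut points are illustrative.

```python
import pandas as pd

# Illustrative bi-weekly cumulative cut-off dates within the study period.
cut_points = pd.date_range("2020-03-22", "2020-05-27", freq="14D")

temporal_results = {}
for cut in cut_points:
    idx = patients["day0"] <= cut                      # cumulative cohort up to this date
    scan = conceptwas(patients[idx], cui_matrix.loc[idx])
    temporal_results[cut.date()] = scan.set_index("cui")["p_value"]

# First cut point at which each concept becomes Bonferroni-significant (NaN if never).
pvals = pd.DataFrame(temporal_results)
threshold = 0.05 / len(cui_matrix.columns)
hits = pvals < threshold
first_significant = hits.idxmax(axis=1).where(hits.any(axis=1))
```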

Figure 3.

Temporal ConceptWAS using bi-weekly cumulative data. For the significant signals (related to signs and symptoms) identified using all data (labeled in Figure 2), the plot shows their −log10(p-value) for association (y-axis) against the cumulative data from March 8, 2020 through week n (x-axis). The dashed line indicates the significance threshold using a Bonferroni correction.

Chart review

To validate the signals, we reviewed patients’ charts for significant concepts with high clinical relevance. We randomly selected 20 notes for each concept and reviewed whether the notes mentioned the symptom with the expected attribute (e.g., affirmative or negated) (Table 2). The significant concepts from the ConceptWAS comparing COVID-19-positive and COVID-19-negative patients, such as “absent sense of smell”, “ageusia”, “mental depression”, and concepts related to smoking status (e.g., “current some day smoker”, “former smoker”, and “smoking monitoring status”), were consistent with the expected attribute based on chart review.

Table 2.

Results of chart reviews.

Concept | Reviewed samples | True signals | True signal percentage (%) | Examples of false positives
Absent sense of smell | 20 | 19 | 95.00% | “(−) altered/loss of smell” was wrongly recognized as an affirmative/positive attribute.
Ageusia | 20 | 19 | 95.00% | “Symptoms, n/v, fever, cough, loss of taste or smell or around anyone + for Covid 19.”
Mental depression | 20 | 18 | 90.00% | One was recognized from a medical history title without any answer; the other came from a recommendation for further psychosocial assessment.
Current some day smoker | 20 | 20 | 100.00% |
Smoking monitoring status | 20 | 19 | 95.00% | One is uncertain: “Smoking Status Not on file”.
Fever | 20 | 17 | 85.00% | Template issue: “The following ROS were reviewed and are negative, unless otherwise stated as +positive: 1 Constitutional: Fever; malaise”
Pericardial fluid (negated) | 20 | 20 | 100.00% |
Hydrocephalus (negated) | 20 | 20 | 100.00% |
Hydronephroses | 20 | 20 | 100.00% |
Blood group AB Rh(D) negative (finding) | 20 | 0 | 0.00% | From blood typing tests. This signal was not specific to blood type AB+, but was generated by other ABO blood types and Rh-positive patients.
Allergy test positive (finding) | 20 | 5 | 25.00% | The false positives were wrongly mapped from sentences like “He/She has been exposed to covid, family member or friends have tested positive.”
Laurin-Sandrow syndrome | 20 | 20 | 100.00% |
Cough nonproductive | 20 | 20 | 100.00% |
In total | 260 | 217 | 83.46% |

Although “smoking monitoring status” was generated by an inquiry term used in a chart template, after we post-processed the KMCI output to remove irrelevant concepts and refine negation, a smoking monitoring status followed by a negated answer was recognized as a negated attribute. We reviewed 20 notes that mentioned “smoking monitoring status” with an affirmative/positive attribute; 19 of the patients were either current or former smokers.

We also found false-positive concepts, mostly due to NLP entity recognition errors. For example, “additional information” was recognized as “adequate knowledge”. The concept “fever” with a positive attribute had three false positives, mainly due to a few specific chart templates used to denote negation, which were not captured by the NLP pipeline.

Discussion

Our work describes a high-throughput and reproducible approach (ConceptWAS) that uses EHR notes to identify pandemic disease symptoms early and to investigate clinical manifestations for further hypothesis-driven study. We applied ConceptWAS to a cohort of patients who underwent COVID-19 testing. We replicated several well-known symptoms of COVID-19, such as fever, loss of smell/taste, and cough with fever [19–21]. Using ConceptWAS, we detected the signal of loss of smell and taste as early as April 5, 2020, nearly three weeks before these symptoms were listed as COVID-19 symptoms by the CDC [3]. Our results demonstrate the feasibility of using ConceptWAS for early detection of symptoms of an unknown disease.

We also observed several signals enriched in the COVID-19-negative group. For example, depression and anxiety had a higher prevalence among patients who tested negative. These signals first became significant starting from April 5, 2020, corresponding to the date when the Governor of Tennessee issued a “safer at home” Executive Order and a “stay at home” order. This may reflect the mental health burden that shutdown and quarantine policies place on the population [22,23]. We also found a higher percentage of smoking status concepts in the COVID-19-negative group. Earlier epidemiological studies reported fewer smokers among COVID-19 patients and hospitalized COVID-19 patients [21,24], consistent with our finding of a negative correlation between smoking and COVID-19. One explanation could be the impact of nicotine on ACE-2, as nicotine has been suggested to play a protective role against COVID-19 [25]. It is also possible that smokers take greater social precautions because of a perceived higher risk of respiratory complications from COVID-19, thus reducing their risk of contracting the virus. Although these findings suggest that smoking may be a protective factor, the lack of evidence and the known adverse effects of smoking argue against continued smoking as a protective measure against COVID-19.

ConceptWAS is open-source, portable, and reproducible. Researchers can choose other NLP pipelines (e.g., MetaMap, CLAMP, cTAKES) [26] for concept extraction and use the derived concepts as input to ConceptWAS.

Through this proof-of-concept study applying NLP techniques to identify COVID-19 symptoms, we summarize lessons learned that may help others apply this method.

  1. A high-throughput, lightweight, and reproducible method is important during a pandemic. ConceptWAS enables a rapid scan of symptoms using clinical notes. These symptoms provide an initial hypothesis for further investigation and could alert clinicians to pay attention to patients presenting with specific symptoms. Researchers can run ConceptWAS regularly (e.g., on weekly or bi-weekly cumulative data) to track symptom changes during a pandemic.

  2. Users of ConceptWAS should be mindful of the distribution of clinical note types. Different note types differ from one another due to their specific clinical usage and may have variable templates and inconsistent lengths. Therefore, we recommend that researchers check the distribution of document types between cases and controls to avoid sampling bias.

  3. Although NLP has been used in various medical fields to improve information processing and practice [26–29], recognition of negated and uncertain concepts remains a challenge. We enhanced the detection of uncertain assertions and negated concepts by developing rule-based methods as wrappers around the entity-recognition output. Still, our manual chart review suggests that the result is not perfect. For example, some notes mention negated concepts like “the following ROS were reviewed and are negative, unless otherwise stated as +positive: Constitutional: Fever; malaise.” Such scenarios are difficult for NLP tools to identify. A combination of machine learning and rule-based approaches may improve detection.

  4. Our study has several limitations. First, the study was performed at a single institution with a limited number of COVID-19 patients. As the pandemic evolves and more patients are tested for SARS-CoV-2 in our healthcare system, our ability to detect associations between COVID-19 and clinical concepts will continue to improve. Second, this study used data from a limited period (before May 27, 2020). In the future, we will extract notes prior to the test date to study the progression of symptoms and analyze them across different periods. Lastly, as the performance of an NLP system may vary across institutions and databases [26,30], further studies are necessary to assess the generalizability of our findings.

Conclusion

In this study, we described a high-throughput approach (ConceptWAS) that systematically scans a disease’s clinical manifestations from clinical notes. By applying ConceptWAS to EHR clinical notes from patients subjected to COVID-19 testing, we detected loss of smell/taste three weeks before these symptoms were added to the CDC’s list of COVID-19 symptoms. The study demonstrates the potential of EHR-based methods to enable early recognition of specific COVID-19 symptoms and improve our clinical and public health response to the pandemic.

Supplementary Material

Supplement 1
media-1.docx (220.2KB, docx)
Supplement 2
media-2.xlsx (34.7KB, xlsx)

Acknowledgments

We thank Dr. Vivian Siegel from the Department of Biology at MIT and the Department of Medicine at Vanderbilt University for helpful suggestions on the study design and manuscript drafting.

Funding

The project was supported by the National Institutes of Health (NIH), National Institute of General Medical Sciences (P50 GM115305), National Heart, Lung, and Blood Institute (R01 HL133786), National Library of Medicine (T15 LM007450, R01 LM010685), and the American Heart Association (18AMTG34280063), as well as the Vanderbilt Biomedical Informatics Training Program, Vanderbilt Faculty Research Scholar Fund, and the Vanderbilt Medical Scientist Training Program. The datasets used for the analyses described were obtained from Vanderbilt University Medical Center’s resources and the Synthetic Derivative, which are supported by institutional funding and by the Vanderbilt National Center for Advancing Translational Science grant (UL1 TR000445) from NCATS/NIH. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Footnotes

Code availability

Up-to-date developments of ConceptWAS are available in GitHub (https://github.com/zhaojuanwendy/ConceptWAS).

Competing interests

The authors have no competing interests to declare.

References

  • [1].WHO Coronavirus Disease (COVID-19) Dashboard, (n.d.). https://covid19.who.int/ (accessed May 26, 2020).
  • [2].Guan W., Ni Z., Hu Y., Liang W., Ou C., He J., Liu L., Shan H., Lei C., Hui D.S.C., Du B., Li L., Zeng G., Yuen K.-Y., Chen R., Tang C., Wang T., Chen P., Xiang J., Li S., Wang J., Liang Z., Peng Y., Wei L., Liu Y., Hu Y., Peng P., Wang J., Liu J., Chen Z., Li G., Zheng Z., Qiu S., Luo J., Ye C., Zhu S., Zhong N., Clinical Characteristics of Coronavirus Disease 2019 in China, New England Journal of Medicine. (2020). 10.1056/NEJMoa2002032. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Makaronidis J., Mok J., Balogun N., Magee C.G., Omar R.Z., Carnemolla A., Batterham R.L., Seroprevalence of SARS-CoV-2 antibodies in people with an acute loss in their sense of smell and/or taste in a community-based population in London, UK: An observational cohort study, PLOS Medicine. 17 (2020) e1003358. 10.1371/journal.pmed.1003358. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Fritz A., Brice-Saddler M., Judkis M., CDC confirms six coronavirus symptoms showing up in patients over and over, Washington Post. (n.d.). https://www.washingtonpost.com/health/2020/04/27/six-new-coronavirus-symptoms/ (accessed September 25, 2020). [Google Scholar]
  • [5].Statement from the UK Chief Medical Officers on an update to coronavirus symptoms: 18 May 2020, GOV.UK. (n.d.). https://www.gov.uk/government/news/statement-from-the-uk-chief-medical-officers-on-an-update-to-coronavirus-symptoms-18-may-2020 (accessed June 5, 2020).
  • [6].Awasthi R., Pal R., Singh P., Nagori A., Reddy S., Gulati A., Kumaraguru P., Sethi T., CovidNLP: A Web Application for Distilling Systemic Implications of COVID-19 Pandemic with Natural Language Processing, MedRxiv. (2020) 2020.04.25.20079129. 10.1101/2020.04.25.20079129. [DOI] [Google Scholar]
  • [7].Mackey T., Purushothaman V., Li J., Shah N., Nali M., Bardier C., Liang B., Cai M., Cuomo R., Machine Learning to Detect Self-Reporting of Symptoms, Testing Access, and Recovery Associated With COVID-19 on Twitter: Retrospective Big Data Infoveillance Study, JMIR Public Health Surveill. 6 (2020). 10.2196/19509. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Burke R.M., Symptom Profiles of a Convenience Sample of Patients with COVID-19 — United States, January–April 2020, MMWR Morb Mortal Wkly Rep. 69 (2020). 10.15585/mmwr.mm6928a2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Menni C., Valdes A.M., Freidin M.B., Sudre C.H., Nguyen L.H., Drew D.A., Ganesh S., Varsavsky T., Cardoso M.J., El-Sayed Moustafa J.S., Visconti A., Hysi P., Bowyer R.C.E., Mangino M., Falchi M., Wolf J., Ourselin S., Chan A.T., Steves C.J., Spector T.D., Real-time tracking of self-reported symptoms to predict potential COVID-19, Nature Medicine. 26 (2020) 1037–1040. 10.1038/s41591-020-0916-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Richardson S., Hirsch J.S., Narasimhan M., Crawford J.M., McGinn T., Davidson K.W., Barnaby D.P., Becker L.B., Chelico J.D., Cohen S.L., Cookingham J., Coppa K., Diefenbach M.A., Dominello A.J., Duer-Hefele J., Falzon L., Gitlin J., Hajizadeh N., Harvin T.G., Hirschwerk D.A., Kim E.J., Kozel Z.M., Marrast L.M., Mogavero J.N., Osorio G.A., Qiu M., Zanos T.P., Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area, JAMA. 323 (2020) 2052–2059. 10.1001/jama.2020.6775. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [11].Brat G.A., Weber G.M., Gehlenborg N., Avillach P., Palmer N.P., Chiovato L., Cimino J., Waitman L.R., Omenn G.S., Malovini A., Moore J.H., Beaulieu-Jones B.K., Tibollo V., Murphy S.N., L’Yi S., Keller M.S., Bellazzi R., Hanauer D.A., Serret-Larmande A., Gutierrez-Sacristan A., Holmes J.H., Bell D.S., Mandl K.D., Follett R.W., Klann J.G., Murad D.A., Scudeller L., Bucalo M., Kirchoff K., Craig J., Obeid J., Jouhet V., Griffier R., Cossin S., Moal B., Patel L.P., Bellasi A., Prokosch H.U., Kraska D., Sliz P., Tan A.L., Ngiam K.Y., Zambelli A., Mowery D.L., Schiver E., Devkota B., Bradford R.L., Daniar M., APHP/Universities/INSERM COVID-19 research collaboration, Daniel C., Benoit V., Bey R., Paris N., Jannot A.S., Serre P., Orlova N., Dubiel J., Hilka M., Jannot A.S., Breant S., Leblanc J., Griffon N., Burgun A., Bernaux M., Sandrin A., Salamanca E., Ganslandt T., Gradinger T., Champ J., Boeker M., Martel P., Gramfort A., Grisel O., Leprovost D., Moreau T., Varoquaux G., Vie J.-J., Wassermann D., Mensch A., Caucheteux C., Haverkamp C., Lemaitre G., Krantz I.D., Cormont S., South A., The Consortium for Clinical Characterization of COVID-19 by EHR (4CE), Cai T., Kohane I.S., International Electronic Health Record-Derived COVID-19 Clinical Course Profiles: The 4CE Consortium, Infectious Diseases (except HIV/AIDS), 2020. 10.1101/2020.04.13.20059691. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Wagner T., Shweta F., Murugadoss K., Awasthi S., Venkatakrishnan A., Bade S., Puranik A., Kang M., Pickering B.W., O’Horo J.C., Bauer P.R., Razonable R.R., Vergidis P., Temesgen Z., Rizza S., Mahmood M., Wilson W.R., Challener D., Anand P., Liebers M., Doctor Z., Silvert E., Solomon H., Anand A., Barve R., Gores G., Williams A.W., Morice II W.G., Halamka J., Badley A., Soundararajan V., Augmented curation of clinical notes from a massive EHR system reveals symptoms of impending COVID-19 diagnosis, ELife. 9 (2020) e58227. 10.7554/eLife.58227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Denny J.C., Spickard A., Miller R.A., Schildcrout J., Darbar D., Rosenbloom S.T., Peterson J.F., Identifying UMLS concepts from ECG Impressions using KnowledgeMap, AMIA Annu Symp Proc. (2005) 196–200. [PMC free article] [PubMed] [Google Scholar]
  • [14].Denny J.C., Irani P.R., Wehbe F.H., Smithers J.D., Spickard A., The KnowledgeMap Project: Development of a Concept-Based Medical School Curriculum Database, AMIA Annu Symp Proc. 2003 (2003) 195–199. [PMC free article] [PubMed] [Google Scholar]
  • [15].Denny J.C., Peterson J.F., Choma N.N., Xu H., Miller R.A., Bastarache L., Peterson N.B., Extracting timing and status descriptors for colonoscopy testing from electronic medical records, J Am Med Inform Assoc. 17 (2010) 383–388. 10.1136/jamia.2010.004804. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [16].Hindorff L.A., Sethupathy P., Junkins H.A., Ramos E.M., Mehta J.P., Collins F.S., Manolio T.A., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits, PNAS. 106 (2009) 9362–9367. 10.1073/pnas.0903103106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [17].Denny J.C., Ritchie M.D., Basford M.A., Pulley J.M., Bastarache L., Brown-Gentry K., Wang D., Masys D.R., Roden D.M., Crawford D.C., PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations, Bioinformatics. 26 (2010) 1205–1210. 10.1093/bioinformatics/btq126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Firth’s logistic regression with rare events: accurate effect estimates and predictions? - Puhr - 2017 - Statistics in Medicine - Wiley Online Library, (n.d.). https://onlinelibrary.wiley.com/doi/full/10.1002/sim.7273 (accessed June 7, 2020). [DOI] [PubMed]
  • [19].COVID-19 Patients’ Clinical Characteristics, Discharge Rate, and Fatality Rate of Meta-Analysis - PubMed, (n.d.). https://pubmed.ncbi.nlm.nih.gov/32162702/ (accessed June 30, 2020). [DOI] [PMC free article] [PubMed]
  • [20].Vaira L.A., Salzano G., Deiana G., De Riu G., Anosmia and Ageusia: Common Findings in COVID-19 Patients, Laryngoscope. 130 (2020) 1787. 10.1002/lary.28692. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [21].Moein S.T., Hashemian S.M., Mansourafshar B., Khorram Tousi A., Tabarsi P., Doty R.L., Smell dysfunction: a biomarker for COVID-19, International Forum of Allergy & Rhinology. n/a (n.d.). 10.1002/alr.22587. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [22].Pfefferbaum B., North C.S., Mental Health and the Covid-19 Pandemic, New England Journal of Medicine. 383 (2020) 510–512. 10.1056/NEJMp2008017. [DOI] [PubMed] [Google Scholar]
  • [23].Sturges W., Gov. Bill Lee issues stay-at-home order through April 14, Impact. (2020). https://communityimpact.com/nashville/franklin-brentwood/coronavirus/2020/03/30/gov-bill-lee-issues-statewide-stay-at-home-order-for-tennesseans/ (accessed October 7, 2020). [Google Scholar]
  • [24].Emami A., Javanmardi F., Pirbonyeh N., Akbari A., Prevalence of Underlying Diseases in Hospitalized Patients with COVID-19: a Systematic Review and Meta-Analysis, Arch Acad Emerg Med. 8 (2020). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7096724/ (accessed July 31, 2020). [PMC free article] [PubMed] [Google Scholar]
  • [25].Farsalinos K., Niaura R., Le Houezec J., Barbouni A., Tsatsakis A., Kouretas D., Vantarakis A., Poulas K., Editorial: Nicotine and SARS-CoV-2: COVID-19 may be a disease of the nicotinic cholinergic system, Toxicol Rep. 7 (2020) 658–663. 10.1016/j.toxrep.2020.04.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [26].Soysal E., Wang J., Jiang M., Wu Y., Pakhomov S., Liu H., Xu H., CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines, J Am Med Inform Assoc. 25 (2018) 331–336. 10.1093/jamia/ocx132. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [27].Yetisgen-Yildiz M., Gunn M.L., Xia F., Payne T.H., Automatic identification of critical follow-up recommendation sentences in radiology reports, AMIA Annu Symp Proc. 2011 (2011) 1593–1602. [PMC free article] [PubMed] [Google Scholar]
  • [28].Davis M.F., Sriram S., Bush W.S., Denny J.C., Haines J.L., Automated extraction of clinical traits of multiple sclerosis in electronic medical records, Journal of the American Medical Informatics Association: JAMIA. 20 (2013). 10.1136/amiajnl-2013-001999. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [29].Yim W.-W., Yetisgen M., Harris W.P., Kwan S.W., Natural Language Processing in Oncology: A Review, JAMA Oncology. 2 (2016). 10.1001/jamaoncol.2016.0213. [DOI] [PubMed] [Google Scholar]
  • [30].Negation’s Not Solved: Generalizability Versus Optimizability in Clinical Natural Language Processing, (n.d.). https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112774 (accessed August 18, 2020). [DOI] [PMC free article] [PubMed]
