Healthcare. 2022 Nov 12;10(11):2270. doi: 10.3390/healthcare10112270

Table 1.

Comparison of studies that apply NLP to various applications related to COVID-19.

| Study | Application | Employed NLP Model | Data Source |
|---|---|---|---|
| [30] | EHRs | BERT; specific COVID-19 phenotypes extracted with a list of 60 regular expressions (NLP RegExp); all signs, symptoms, and comorbidities extracted with the QuickUMLS algorithm [97] (NLP UMLS) | Multi-center study involving data from 39 hospitals |
| [98] | EHRs | Keyword-extraction NLP using an unsupervised ML approach (clustering) | 450,114 comprehensive patient CT reports gathered from 1 January to October 2020 |
| [99] | EHRs | Word frequency for text analytics; CNN trained with Word2vec embeddings as the classification model | Data collected through telehealth visits from 6813 patients, of whom 498 tested positive and 6315 tested negative |
| [32] | EHRs | NLP model (medical named-entity recognition) | Audio or video recordings of clinic visits |
| [38] | EHRs | Multi-class logistic regression model trained on n-gram features | Cohort of 1737 adult COVID-19 patients discharged from two hospitals in Boston, Massachusetts, between 10 March and 30 June 2022 |
| [39] | EHRs | Rule-based NLP pipeline | Clinical data from the VA Corporate Data Warehouse (CDW) between 1 January and 15 June 2020 |
| [33] | EHRs | Random forest trained on n-grams | 32,555 radiology reports from brain CTs and MRIs from a comprehensive stroke center |
| [34] | EHRs | Rule-based NLP pipeline | 6250 patients (5664 negative and 586 positive; 46,138 non-severe and 125 severe) |
| [36] | EHRs | BERT and Bi-LSTM with attention | 1472 annotated clinical notes distinguishing COVID-19 diagnoses, testing, and symptoms |
| [35] | EHRs | Rule-based NLP pipeline | Validated on several datasets; the main COVID-19 dataset contains 50 posts (1162 sentences) of related dialogues |
| [44] | Mental health | Supervised text classification with a stochastic gradient descent linear classifier (L1 penalty) on TF-IDF n-grams; principal component analysis with k-NN for unsupervised clustering; LDA for topic modeling | Social media: Reddit Mental Health Dataset, with posts from 826,961 unique users |
| [45] | Mental health | BERT (fine-tuned) | Social media: 1000 English tweets for training the model and 1 million tweets included in the analysis |
| [46] | Mental health | Sentiment analysis system called CrystalFeel | Social media: over 20 million COVID-19 tweets between 28 January and April 2020 |
| [47] | Mental health | Key-phrase extraction and sentiment scoring using a lexicon-based technique | Social media: 47 million COVID-19-related comments extracted from Twitter, Facebook, and YouTube |
| [100] | Mental health | Bi-directional LSTM with a self-attention layer | Social media: approximately 900,000 tweets in the diagnosed group and approximately 14 million tweets in the control group, both from several countries |
| [48] | Mental health | Sentence-BERT (SBERT) | 9090 English free-form texts from 1451 students between 1 February and 30 April 2020 |
| [52] | Health behaviors | BERT | 1.1 million COVID-19-related tweets from 181 counties in the US |
| [54] | Health behaviors | t-distributed stochastic neighbor embedding (t-SNE); DistilBART; VADER for sentiment analysis; Google's Universal Sentence Encoder | 189,958,459 English COVID-19-related tweets between 17 March and 27 July 2020 |
| [55] | Health behaviors | SVM, XGBoost, and LSTM | 771,268 tweets from the US between January and October 2020 |
| [56] | Health behaviors | LDA for topic modeling and aspect-based sentiment analysis | English COVID-19 tweets: 25,595 for Canada and 293,929 for the US |
| [57] | Health behaviors | BERT | 2,349,659 tweets related to COVID-19 vaccination in the month after the first vaccine announcement |
| [71] | Misinformation detection | SAFE system developed in [53] | 2029 news articles on COVID-19 (between January and May 2020) and 140,820 tweets that disclose how these articles circulated on Twitter |
| [76] | Misinformation detection | NLP and network-analysis methods | 4573 annotated tweets from 3629 users |
| [73] | Misinformation detection | SVM | 10,700 social media posts and articles of real and fake COVID-19 news |
| [101] | Misinformation detection | Sentence-BERT and BERTScore | 4800 expert-annotated social media posts |
| [77] | Misinformation detection | BERT and ALBERT | 5500 claim–explanation pairs |
| [90] | COVID QA systems | BERT and LDA | COVID-19 scientific publications: CORD-19 dataset |
| [83] | COVID QA systems | T5 | COVID-19 scientific publications: CORD-19 dataset |
| [102] | COVID QA systems | Ensemble of two QA models (HLTC-MRQA and BioBERT); BART [88] for abstractive summarization; ALBERT [89] in the extractive summarization block | COVID-19 scientific publications: CORD-19 dataset |
| [91] | COVID QA systems | BioBERT | COVID-19 scientific publications: CORD-19 dataset, with 111 additional QA pairs annotated for testing |
| [92] | COVID QA systems | Synthetically generated QA examples to optimize QA performance on closed domains; machine reading comprehension uses the RoBERTa model | COVID-19 scientific publications: CORD-19 dataset |
| [95] | Knowledge transfer | XLM-R Large | M-CID dataset containing 5271 utterances across English, Spanish, French, and Spanglish |
| [96] | Knowledge transfer | Multilingual Universal Sentence Encoder [103] | 4,683,226 geo-referenced tweets in 60 languages located in Europe |
| [94] | Knowledge transfer | Variant of the Transformer-big architecture | Trained on more than 350 million sentences translated from French, Spanish, German, Italian, and Korean into English |
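Several rows above (e.g., [38] and [33]) pair n-gram features with a classical classifier. As a minimal sketch of that general setup, not the code of any cited study, the following trains a multi-class logistic regression on TF-IDF-weighted unigrams and bigrams; the four toy documents and their labels are invented for illustration.

```python
# Illustrative sketch: n-gram features + multi-class logistic regression,
# the general approach used in studies such as [38]. Toy data is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "patient reports fever cough and loss of smell",
    "no respiratory symptoms routine follow up visit",
    "severe shortness of breath admitted to icu",
    "mild sore throat resolved without treatment",
]
labels = ["positive", "negative", "severe", "mild"]

# Unigrams and bigrams, TF-IDF weighted, fed to a logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(docs, labels)

# Predict the class of an unseen note; heavy n-gram overlap with the
# first document should pull the prediction toward "positive".
pred = model.predict(["fever and cough with loss of smell"])[0]
print(pred)
```

In the cited EHR studies the same pipeline shape would be fit on thousands of clinical notes rather than four sentences; the `ngram_range` and regularization settings here are arbitrary defaults, not values taken from any of the papers.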