Public Health. 2021 Dec 7;203:23–30. doi: 10.1016/j.puhe.2021.11.022

ANTi-Vax: a novel Twitter dataset for COVID-19 vaccine misinformation detection

K Hayawi, S Shahriar, MA Serhani, I Taleb, SS Mathew
PMCID: PMC8648668  PMID: 35016072

Abstract

Objectives

The COVID-19 (SARS-CoV-2) pandemic has infected hundreds of millions of people and caused millions of deaths around the globe. Fortunately, the introduction of COVID-19 vaccines provided a glimmer of hope and a pathway to recovery. However, owing to misinformation spread on social media and other platforms, vaccine hesitancy has risen, which can negatively affect vaccine uptake in the population. The goal of this research is to introduce a novel machine learning–based COVID-19 vaccine misinformation detection framework.

Study design

We collected and annotated COVID-19 vaccine tweets and trained machine learning algorithms to classify vaccine misinformation.

Methods

More than 15,000 tweets were annotated as misinformation or general vaccine tweets using reliable sources and validated by medical experts. The classification models explored were XGBoost, LSTM, and the BERT transformer model.

Results

The best classification performance was obtained using BERT, resulting in an F1-score of 0.98 on the test set. The precision and recall scores were 0.97 and 0.98, respectively.

Conclusion

Machine learning–based models are effective in detecting misinformation regarding COVID-19 vaccines on social media platforms.

Keywords: COVID-19, Vaccines, Text classification, Misinformation detection, Deep learning, Natural language processing


Introduction

As of July 26, 2021, more than 194 million infections and more than 4 million deaths had been attributed to SARS-CoV-2, the virus responsible for the COVID-19 pandemic.1 Since the outbreak emerged in Wuhan, Hubei province, China, and spread worldwide, lockdown measures and social distancing have been introduced in most parts of the globe. The impacts were significant across sectors, including the economy,2 education,3 and the mental health of the population.4 The development of safe and effective vaccines5 offered a path to increasing population immunity and an effective method to control the outbreak. Authorization and distribution of most vaccines began in December 2020.6

Despite the introduction of vaccines, increasing hesitancy toward vaccine uptake can be observed among significant parts of the population in various countries.7 This hesitancy can be explained in part by misinformation about vaccines spread in person.8 However, with wide social media access and usage, the spread of vaccine misinformation can increase significantly, potentially leading to a further decline in vaccine uptake. Misinformation can be spread on social media by human users as well as by social bots,9, 10 which are programmed to spread false information automatically while posing as genuine accounts. It is therefore essential for algorithms to detect misinformation from its content, regardless of whether the source is a human or a social bot. More specifically, the focus of this research is on Twitter and on detecting misinformation in vaccine-related tweets. To the best of our knowledge, no existing dataset targets vaccine misinformation tweets, and this is the first proposed approach for detecting COVID-19 vaccine misinformation.

Machine learning–based algorithms have been widely and effectively used for various COVID-19–related applications, including screening, contact tracing, and forecasting.11 The CoAID dataset introduced by Cui and Lee12 contains misinformation related to COVID-19. The authors used several machine learning models to classify fake news, with the best performance of 0.58 F1-score obtained using a hierarchical attention network–based model. A COVID-19 vaccine misinformation tweets dataset was introduced by Memon and Carley.13 This dataset characterizes both users who actively post misinformation and those who call out misinformation or spread true information. It was concluded that informed users tend to use more narratives in their tweets than misinformed ones. The ReCOVery dataset proposed by Zhou et al.14 contains more than 2000 news articles and their credibility. It also includes more than 140,000 tweets that reveal how these news articles spread on Twitter. F1-scores of 0.83 for predicting reliable news and 0.67 for predicting unreliable news were obtained using a neural network model. A billion-scale COVID-19 Twitter dataset covering 268 countries and more than 100 languages was collected by Abdul-Mageed et al.15 Two predictive models were proposed: one for classifying whether a tweet was related to the pandemic (COVID relevance) and one for detecting whether a tweet was COVID-19 misinformation. The misinformation detection models were trained using the aforementioned CoAID and ReCOVery datasets, and combining them yielded the best F1-score of 0.92 using a bidirectional encoder representations from transformers (BERT)-based model. Abdelminaam et al.16 combined four existing datasets, including CoAID, and used several machine learning algorithms to classify COVID-19 misinformation. The best F1-score of 0.985 was obtained using a two-layer long short-term memory (LSTM) network. The ArCOV19-Rumors dataset was presented by Haouari et al.17 to detect COVID-19 misinformation in Arabic tweets. Two Arabic BERT-based models were used for classification, with the highest F1-score of 0.74. A bilingual Arabic and English dataset for detecting misleading COVID-19 tweets was presented by Elhadad et al.18 Several machine learning models were used to annotate the unlabeled tweets; however, the authors did not quantify the evaluation of the predictive models. Finally, a Chinese microblogging dataset for detecting COVID-19 fake news was presented by Yang et al.19 Various deep learning models were explored, and the best F1-score of 0.94 was obtained using the TextCNN model.

More recently, several research works have focused on analyzing tweets related to COVID-19 vaccines. Muric et al.20 presented a dataset containing tweets that indicate a strong anti-vaccine stance, along with a descriptive analysis of the tweets and their geographical distribution across the United States (US). Similarly, Sharma et al.21 used tweets to investigate hidden coordinated efforts promoting misinformation about vaccines and to obtain insights into conspiracy communities. A dataset called CoVaxxy,22 containing one week of vaccine tweets, was introduced to perform a statistical analysis of COVID-19 vaccine tweets; the authors also introduced a dashboard for visualizing the relationship between vaccine adoption and US geolocated posts. Malagoli et al.23 focused on vaccine sentiment on Twitter by analyzing vaccine-related tweets collected between December 2020 and January 2021, including the usage of emojis and the psycholinguistic properties of these tweets. Finally, Hu et al.24 examined the public sentiment of COVID-19 vaccines in the US by investigating the spatiotemporal patterns of public perception and emotion at national and state levels. None of these existing works introduced predictive models in the context of the COVID-19 vaccine; therefore, to the best of our knowledge, the proposed work is the first to perform vaccine misinformation detection. Table 1 summarizes the existing works in COVID-19 misinformation detection and COVID-19 vaccine–related tweet datasets.

Table 1.

Existing works in COVID-19 misinformation and COVID-19 vaccine tweets.

Source | Application | Dataset | Prediction results
12 | Misinformation dataset, analysis, and classification | Social media and website misinformation regarding COVID-19 | F1-score: 0.58 using hierarchical attention network–based model
13 | Misinformation dataset and analysis | Annotated COVID-19 misinformation tweets | N/A
14 | Reliable and unreliable news dataset, analysis, and prediction | News articles and their credibility level as well as tweets related to their spread | F1-scores: 0.83 and 0.67 for reliable and unreliable news detection, respectively, using neural networks
15 | Large COVID-19 tweets dataset, analysis, and classification | Tweets related to COVID-19 in more than 100 languages from 268 countries | F1-score: 0.98 for COVID-relevant tweets using the transformer-based masked language model; F1-score: 0.92 for detecting misinformation tweets using BERT-based model
16 | COVID-19 misinformation detection | Combination of various existing tweets datasets related to COVID-19, disasters, news, and gossip | F1-score: 0.985 using LSTM
17 | COVID-19 misinformation detection in Arabic | Arabic tweets related to COVID-19 | F1-score: 0.74 using MARABERT
18 | COVID-19 misinformation detection in English and Arabic | English and Arabic tweets related to COVID-19 | Not presented
19 | COVID-19 fake news detection in Chinese | Chinese microblog posts from Weibo | F1-score: 0.94 using TextCNN
20 | COVID-19 anti-vaccine tweets dataset and analysis | Tweets exhibiting anti-vaccine stance collected using keywords | N/A
21 | COVID-19 anti-vaccine tweets analysis | COVID-19 vaccine tweets collected using keywords | N/A
22 | COVID-19 vaccine tweets analysis | COVID-19 vaccine tweets collected using keywords | N/A
23 | COVID-19 vaccine tweets sentiment analysis | COVID-19 vaccine tweets collected using keywords | N/A
24 | COVID-19 vaccine tweets sentiment analysis | COVID-19 vaccine tweets collected using keywords | N/A

Methods

This section describes the methodology of the proposed application. The details of the implementation are presented next chronologically.

Dataset collection

Twitter is one of the most popular social media platforms, with 353 million active users and more than 500 million tweets posted every day.25 The Twitter API allows the extraction of public tweets, including the tweet text, user information, retweets, and mentions, in JSON format. A Python library called Twarc was used to access the Twitter API.

To obtain relevant tweets about COVID-19 vaccines, we followed the approach of several existing works in the literature and collected tweets using keywords. The following keywords (case insensitive) were used: ‘vaccine,’ ‘pfizer,’ ‘moderna,’ ‘astrazeneca,’ ‘sputnik,’ and ‘sinopharm.’ Additionally, only tweets in the English language were considered, and replies, retweets, and quote tweets were excluded. Overall, vaccine-related tweets from December 1, 2020, until July 31, 2021, were collected, totaling 15,465,687 tweets.
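As an illustration only, the collection step could be performed roughly as in the sketch below, which assumes the Twarc library (v2 client) and the Twitter API v2 full-archive search; the query string, bearer token placeholder, and output handling are our assumptions rather than the authors' code.

```python
from datetime import datetime, timezone

from twarc.client2 import Twarc2

# Hypothetical bearer token; full-archive search requires academic-track API access.
client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")

# Keyword query mirroring the paper's filters: English only, no retweets, replies, or quote tweets.
query = (
    "(vaccine OR pfizer OR moderna OR astrazeneca OR sputnik OR sinopharm) "
    "lang:en -is:retweet -is:reply -is:quote"
)

start = datetime(2020, 12, 1, tzinfo=timezone.utc)
end = datetime(2021, 7, 31, tzinfo=timezone.utc)

tweets = []
for page in client.search_all(query=query, start_time=start, end_time=end):
    for tweet in page.get("data", []):
        tweets.append({"id": tweet["id"], "text": tweet["text"]})
```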

Fig. 1 illustrates the total number of tweets per month from December 2020 until July 2021. As vaccines started gaining approval for administration in December 2020, we observe a high volume of tweets, with people sharing their initial sentiments regarding the vaccines. In the next couple of months, there is a natural decline as the topic becomes less novel. However, the volume of tweets rises again from March 2021 and peaks in April 2021. During this time, the rate of vaccination was increasing, particularly in the UK and the US, where a large percentage of Twitter users are located, leading many to express their feelings after receiving their vaccines.

Fig. 1. Number of vaccine tweets by month.

Data annotation

In supervised learning, a labeled dataset is required before model training. Because no existing labeled dataset is available for vaccine misinformation, manual annotation of tweets was performed. Unlike the single-verification approach of many existing works, we added a validation step by medical experts. To label the misinformation, common myths regarding the COVID-19 vaccines were obtained from reliable sources, including Public Health,26 Healthline,27 the Centers for Disease Control and Prevention (CDC),28 and the University of Missouri Health Care.29 This approach is similar to several existing works in misinformation detection, including the studies by Cui and Lee12 and Elhadad et al.18 Some of the common myths and misinformation include ‘The vaccine can alter DNA,’ ‘The vaccine can cause infertility,’ ‘The vaccine contains dangerous toxins,’ and ‘The vaccine contains a tracking device.’ Tweets containing this common misinformation were manually read and labeled. This ensured that the context of the tweets was considered and that sarcastic or humorous tweets were not labeled as misinformation. Tweets outside these common myths were considered non-misinformation and included general opinions regarding the vaccine, official news, and appointment details of vaccination centers. Finally, once the dataset was annotated using verified sources, medical experts in public health were invited to validate the annotations, helping ensure that the manual annotation was accurate and the dataset was of high quality.

Consequently, a total of 15,073 tweets were labeled, of which 5751 were misinformation and 9322 were general vaccine-related tweets. Word clouds are a simple but effective tool for text visualization: words from a corpus are presented in different sizes, and the larger and bolder a word appears, the more frequent and relevant it is in the corpus. Fig. 2, Fig. 3 illustrate the word clouds for misinformation and general tweets, respectively. The vaccine misinformation tweets include several conspiracy terms such as ‘gene therapy,’ ‘untested vaccine,’ and ‘depopulation,’ whereas the general vaccine tweets include terms related to people sharing their vaccine experience, including ‘first dose’ and ‘grateful.’
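For reference, a word cloud like those in Fig. 2, Fig. 3 can be generated with the open-source wordcloud package; the toy corpus below is illustrative only and is not the authors' plotting code.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Tiny illustrative corpus; in practice this would be the full set of misinformation tweets.
misinformation_texts = [
    "the vaccine is gene therapy",
    "untested vaccine pushed for depopulation",
]
corpus = " ".join(misinformation_texts)

cloud = WordCloud(width=800, height=400, background_color="white").generate(corpus)

plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")  # larger, bolder words are more frequent in the corpus
plt.axis("off")
plt.show()
```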

Fig. 2. Word cloud visualization for vaccine misinformation tweets.

Fig. 3. Word cloud visualization for general vaccine-related tweets.

Data preprocessing

Preprocessing the contents of the tweets is important for efficient model training. First, external links, punctuation, and text in brackets were removed, and all text was converted to lower case. Common words such as ‘the,’ ‘and,’ ‘in,’ and ‘for’ are referred to as stop words; removing these low-information words, which provide little contextual information, reduces the complexity of training. The NLTK30 library in Python was used for this step. Stemming is a common preprocessing step that reduces derivationally related forms of a word to a common base form; for example, both ‘walking’ and ‘walked’ are converted to the stem ‘walk.’ The Snowball31 stemmer from the NLTK library was used.
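A minimal sketch of this preprocessing pipeline, with the exact regular expressions being our assumptions rather than the authors' code, could look as follows.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")


def preprocess(text: str) -> str:
    text = text.lower()                                   # lower-case everything
    text = re.sub(r"http\S+|www\.\S+", " ", text)         # remove external links
    text = re.sub(r"\[.*?\]|\(.*?\)", " ", text)          # remove text in brackets
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]         # remove stop words
    return " ".join(STEMMER.stem(t) for t in tokens)      # stem the remaining tokens


print(preprocess("Walking to get my first dose of the vaccine! https://example.com"))
```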

Models architecture and implementation

Machine learning enables computer systems to learn from data without explicit programming. For traditional models, feature extraction is required to identify relevant features before training, a labor-intensive process whose quality largely determines predictive performance. Deep learning32 models, on the other hand, can automatically learn and optimize the necessary input features; however, their computational complexity is higher, and consequently, a much longer training time is required. To provide a comparative evaluation, three models from different categories were explored: XGBoost from traditional machine learning, LSTM from deep learning, and BERT from transformer models. A description of the models and their implementation is presented next.

XGBoost33 is one of the most competitive and frequently used traditional machine learning models. It is an ensemble learning model that uses multiple decision trees, which reduces overfitting while keeping complexity under control. Term frequency-inverse document frequency (tf-idf) was used to identify the most relevant features. Tf-idf scores each word in the corpus by the frequency of the word in a specific document, weighted inversely by the proportion of documents in which the word appears.34 The XGBoost library in Python was used for this implementation.
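A hedged sketch of this pipeline using scikit-learn's TfidfVectorizer together with the xgboost package is shown below; the toy data, vocabulary cap, and hyperparameters are illustrative assumptions, not the settings reported in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

# Toy stand-ins for the preprocessed tweets (1 = misinformation, 0 = general vaccine tweet).
train_texts = ["vaccin alter dna", "got first dose today grate"]
train_labels = [1, 0]
test_texts = ["vaccin contain track devic"]

vectorizer = TfidfVectorizer(max_features=10_000)   # assumed vocabulary cap
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = XGBClassifier(n_estimators=300, eval_metric="logloss")  # assumed hyperparameters
model.fit(X_train, train_labels)
print(model.predict(X_test))
```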

LSTM35 is a popular deep learning architecture for text and sequential data. These networks contain cyclic connections as well as specialized memory cells for storing the temporal state of the network.36 GloVe,37 a popular unsupervised approach for obtaining vector representations of words, was used with the LSTM network. The word embeddings obtained using GloVe capture the semantic similarity between words in a corpus by mapping words into an n-dimensional space. After the embedding layer, a bidirectional LSTM layer with 45 units was used, followed by a GlobalMaxPool1D layer. Next, two dense layers of 128 and 32 units, respectively, with ReLU activation38 were used. A dropout layer39 with a rate of 0.5 was applied after each of the three preceding layers. Finally, the classification layer consisted of a sigmoid activation, and the model was optimized using the Adam optimizer40 with binary cross-entropy loss. The implementation was done in Python using Keras.
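The description above corresponds roughly to the following Keras sketch; the vocabulary size, sequence length, and embedding dimension are assumptions (the paper does not report them), and the GloVe weights are omitted for brevity.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20_000   # assumed
SEQ_LEN = 64          # assumed maximum tweet length in tokens
EMBED_DIM = 100       # assumed GloVe dimension

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    # In practice the Embedding weights would be initialized from pretrained GloVe vectors.
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(45, return_sequences=True)),
    layers.GlobalMaxPool1D(),
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```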

The last approach used the transformer-based BERT model. BERT's training approach, which looks at a text sequence from both directions, provides a comprehensive sense of language context. BERT is pretrained on a large corpus of English text from Wikipedia and BookCorpus. In this work, the bert-large-uncased version was used; it consists of 24 layers (1024 hidden dimensions), 16 attention heads, and a total of 340M parameters.41 The Transformers42 library in Python was used to implement this approach.
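The sketch below outlines how bert-large-uncased could be fine-tuned for binary classification with the Hugging Face Transformers library; the toy data, maximum sequence length, batch size, and learning rate are our assumptions and not the authors' reported settings.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased", num_labels=2)


class TweetDataset(torch.utils.data.Dataset):
    """Wraps tokenized tweets and labels for the Trainer (assumed helper class)."""

    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=64)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


# Toy stand-ins for the annotated tweets (1 = misinformation, 0 = general vaccine tweet).
train_texts, train_labels = ["the vaccine alters your dna", "just got my first dose"], [1, 0]
val_texts, val_labels = ["vaccines contain tracking chips"], [1]

args = TrainingArguments(output_dir="antivax-bert", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=TweetDataset(train_texts, train_labels),
                  eval_dataset=TweetDataset(val_texts, val_labels))
trainer.train()
```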

Overfitting is a major obstacle in training machine learning algorithms. When a model performs outstandingly well during the training phase, by relying on unnecessary input features, but fails to make generalized predictions on the test set, it is ‘overfitting’ to the training dataset. To avoid overfitting in the two deep learning models, the dropout technique was used, and the training and validation accuracy curves were monitored to ensure no overfitting occurred during training.

The research framework for COVID-19 vaccine misinformation classification is summarized in Fig. 4 . The COVID-19 vaccine–related tweets were first collected and then annotated for misinformation or regular tweets using reliable sources. After necessary preprocessing and feature extraction, machine learning and deep learning models were trained to classify vaccine misinformation. Finally, the performance of the models was evaluated on the test set.

Fig. 4. Research framework. tf-idf, term frequency-inverse document frequency.

Classification algorithms can be evaluated using several metrics, including accuracy, precision, recall, and F1-score, as defined in Equations (1), (2), (3), and (4),43 where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (1)
Precision = TP / (TP + FP)  (2)
Recall (TPR) = TP / (TP + FN)  (3)
F1-score = 2 × Precision × TPR / (Precision + TPR)  (4)
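These metrics can be computed directly from model predictions with scikit-learn, as in the generic sketch below (the toy labels and predictions are illustrative only).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy stand-ins for the ground-truth test labels and a model's predictions.
test_labels = [1, 0, 1, 0, 1]
predictions = [1, 0, 0, 0, 1]

print("Accuracy :", accuracy_score(test_labels, predictions))
print("Precision:", precision_score(test_labels, predictions))
print("Recall   :", recall_score(test_labels, predictions))
print("F1-score :", f1_score(test_labels, predictions))
print(confusion_matrix(test_labels, predictions))  # rows: true classes, columns: predicted classes
```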

Results

The results from the XGBoost model as well as the two deep learning models are presented next. All models were first trained and validated on 75% of the dataset and then evaluated on the remaining 25%.
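The 75/25 split described above could be produced along these lines; the toy data and the fixed random seed are assumptions.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 15,073 annotated tweets and their labels.
texts = ["vaccine alters dna", "got my first dose", "vaccine causes infertility", "booked my appointment"]
labels = [1, 0, 1, 0]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)
```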

Performance comparison

The training time for XGBoost, as expected, was much shorter than for the two deep learning models. The training accuracy obtained was 96.9%, and the accuracy on the test set was 95.6%. The precision, recall, and F1-score on the test set were 0.96, 0.95, and 0.95, respectively. Fig. 5 presents the confusion matrix on the test set using XGBoost. The majority of the errors (84%) resulted from misinformation tweets being misclassified as non-misinformation, whereas very few non-misinformation tweets were wrongly classified.

Fig. 5. Confusion matrix on the test set using XGBoost.

The LSTM model was trained for six iterations, with 20% of the data from the training set used for validation. Fig. 6 displays the training and validation accuracy curves. Because the two curves remain very close to each other, there is no indication of overfitting.
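Continuing the earlier Keras sketch, the training run and the accuracy curves of Fig. 6 could be reproduced roughly as follows; the epoch count and validation fraction come from the text, while the batch size and the randomly generated stand-in data are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-ins for the padded token sequences and labels of the training split.
X_train_seq = np.random.randint(0, VOCAB_SIZE, size=(200, SEQ_LEN))
y_train = np.random.randint(0, 2, size=(200,))

history = model.fit(X_train_seq, y_train, epochs=6, validation_split=0.2, batch_size=64)

plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```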

Fig. 6. Training and validation accuracies using LSTM.

The maximum training accuracy using LSTM was 99%, and the accuracy on the test set was 96%. The precision, recall, and F1-score on the test set were 0.97, 0.96, and 0.96, respectively, a slight improvement over XGBoost. The confusion matrix on the test set using this approach is presented in Fig. 7. Compared with XGBoost, there was a decrease in misinformation being misclassified (68%); however, more non-misinformation tweets were classified as misinformation.

Fig. 7. Confusion matrix on the test set using LSTM.

Finally, we used the pretrained BERT transformer model for classification. It was trained for three iterations, with a 20% validation set taken from the training set. The training and validation accuracy curves are plotted in Fig. 8. No overfitting is apparent in this approach either.

Fig. 8. Training and validation accuracies using BERT.

The maximum training accuracy using BERT was 99%, and the accuracy on the test set was 98%. The precision, recall, and F1-score on the test set were 0.97, 0.98, and 0.98, respectively, making BERT the best-performing of the three models. Fig. 9 displays the confusion matrix on the test set using BERT. Compared with the previous two models, BERT provides the lowest error rate (43%) in misclassifying misinformation tweets but a higher error rate in misclassifying non-misinformation tweets.

Fig. 9. Confusion matrix on the test set using BERT.

Discussion

In the previous section, the effectiveness of all the models in vaccine misinformation detection was presented. Consistent with the literature, the deep learning models outperformed XGBoost given the relatively large training set. BERT is recommended for this application because it was able to detect most of the misinformation.

Table 2 presents a performance comparison between the existing works in COVID-19 misinformation classification and the proposed work. The results reported in this study are consistent with those reported in the previous literature. Unlike the existing works, which focused on general COVID-19 misinformation, this work focused specifically on classifying vaccine-related misinformation. By making the dataset used in the proposed work publicly available, we encourage the research community to experiment with other models and approaches.

Table 2.

Performance comparison with related COVID-19 misinformation detection works.

Source | Classification | F1-score
12 | COVID-19 misinformation | 0.58
14 | COVID-19 news reliability detection | 0.83 and 0.67
15 | COVID-19 misinformation | 0.92
16 | COVID-19 misinformation | 0.985
17 | Arabic COVID-19 misinformation | 0.74
19 | Chinese COVID-19 misinformation | 0.94
ANTi-Vax (Ours) | COVID-19 vaccine misinformation | 0.98

The proposed application has several implications, including but not limited to the following: 1) the dataset and models presented in this work can be used by social media sites to limit the spread of misinformation; 2) they can facilitate the detection of social bots spreading vaccine misinformation; 3) the dataset can be thoroughly analyzed to identify patterns of misinformation and their spread over time; and 4) this study will raise awareness regarding vaccine misinformation on social media and trigger further research in this area. A limitation of this study is that no statistical analysis was presented; as the focus was on detecting misinformation, an in-depth analysis of the vaccine misinformation tweets was not performed.

As future work, it would be interesting to experiment with combining general COVID-19 misinformation and vaccine misinformation. Moreover, further performance improvement may be possible by using tweet-level features such as the number of capital letters, links, and emojis. Similarly, account-level features such as follower count, tweet count, and retweet count could provide useful information to the models. Furthermore, sentiment analysis of English-language tweets on COVID-19 vaccines could be performed using the large COVID-19 vaccine dataset, which would reveal the public perception of vaccines and how it evolved over the months. The focus of this study was on English tweets, but researchers are encouraged to extend this study to multilingual tweets related to COVID-19 vaccines. The use of hashtags can provide insights into the general behavior of social media users,44 and this could be utilized in future research. Finally, it is also worth investigating vaccine-related misinformation on social media platforms other than Twitter, as well as on blog posts.

Author statements

Acknowledgements

The authors would like to thank the medical experts for volunteering their time and effort in validating the annotated dataset.

Ethical approval

None sought.

Funding

This work was supported by Zayed University under the research grant RIF R20132.

Competing interests

None declared.

Availability of data and materials

Both the COVID-19 vaccine tweets dataset and the annotated misinformation dataset are available publicly for researchers. Complying with Twitter's Terms of service, the dataset has been anonymized and contains only the tweet IDs. The dataset can be accessed using the following link: https://github.com/SakibShahriar95/ANTiVax.
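Because only tweet IDs are shared, researchers need to ‘hydrate’ the IDs back into full tweet objects; a hedged sketch using Twarc's lookup endpoint is shown below (the input file name and output handling are hypothetical).

```python
from twarc.client2 import Twarc2

client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")

# Hypothetical file of tweet IDs downloaded from the ANTiVax GitHub repository.
with open("antivax_tweet_ids.txt") as f:
    tweet_ids = [line.strip() for line in f if line.strip()]

for page in client.tweet_lookup(tweet_ids):
    for tweet in page.get("data", []):
        print(tweet["id"], tweet["text"][:80])
```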

References

1. Dong E., Du H., Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 2020;20(5):533–534. doi: 10.1016/S1473-3099(20)30120-1.
2. McKibbin W., Fernando R. The global macroeconomic impacts of COVID-19: seven scenarios. Asian Econ Pap. 2021 May 15;20(2):1–30.
3. Aristovnik A., Keržič D., Ravšelj D., Tomaževič N., Umek L. Impacts of the COVID-19 pandemic on life of higher education students: a global perspective. Sustainability. 2020;12(20). doi: 10.1016/j.dib.2021.107659.
4. Usher K., Durkin J., Bhullar N. The COVID-19 pandemic and mental health impacts. Int J Ment Health Nurs. 2020 Jun;29(3):315–318. doi: 10.1111/inm.12726.
5. Polack F.P., Thomas S.J., Kitchin N., Absalon J., Gurtman A., Lockhart S., et al. Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. N Engl J Med. 2020 Dec 31;383(27):2603–2615. doi: 10.1056/NEJMoa2034577.
6. Roberts M. Covid-19: Pfizer/BioNTech vaccine judged safe for use in UK. BBC News; 2020 Dec 2 [cited 2021 Jul 26]. Available from: https://www.bbc.com/news/health-55145696.
7. Sallam M. COVID-19 vaccine hesitancy worldwide: a concise systematic review of vaccine acceptance rates. Vaccines. 2021;9(2). doi: 10.3390/vaccines9020160.
8. Lockyer B., Islam S., Rahman A., Dickerson J., Pickett K., Sheldon T., et al. Understanding COVID-19 misinformation and vaccine hesitancy in context: findings from a qualitative study involving citizens in Bradford, UK. Health Expect. 2021 May 4 [cited 2021 Jul 25].
9. Wang P., Angarita R., Renna I. Is this the era of misinformation yet: combining social bots and fake news to deceive the masses. In: Companion Proceedings of The Web Conference 2018 (WWW '18). Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee; 2018. p. 1557–1561.
10. Trabelsi Z., Hayawi K., Al Braiki A., Mathew S.S. Network attacks and defenses: a hands-on approach. CRC Press; 2012.
11. Lalmuanawma S., Hussain J., Chhakchhuak L. Applications of machine learning and artificial intelligence for Covid-19 (SARS-CoV-2) pandemic: a review. Chaos Solit Fractals. 2020;139:110059. doi: 10.1016/j.chaos.2020.110059.
12. Cui L., Lee D. CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885; 2020.
13. Memon S.A., Carley K.M. Characterizing COVID-19 misinformation communities using a novel Twitter dataset. arXiv preprint arXiv:2008.00791; 2020.
14. Zhou X., Mulay A., Ferrara E., Zafarani R. ReCOVery: a multimodal repository for COVID-19 news credibility research. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management. New York, NY, USA: Association for Computing Machinery; 2020. p. 3205–3212.
15. Abdul-Mageed M., Elmadany A., Nagoudi E.M.B., Pabbi D., Verma K., Lin R. Mega-COV: a billion-scale dataset of 100+ languages for COVID-19. arXiv preprint arXiv:2005.06012; 2020.
16. Abdelminaam D.S., Ismail F.H., Taha M., Taha A., Houssein E.H., Nabil A. CoAID-DEEP: an optimized intelligent framework for automated detecting COVID-19 misleading information on Twitter. IEEE Access. 2021;9:27840–27867. doi: 10.1109/ACCESS.2021.3058066.
17. Haouari F., Hasanain M., Suwaileh R., Elsayed T. ArCOV19-Rumors: Arabic COVID-19 Twitter dataset for misinformation detection. arXiv preprint arXiv:2010.08768; 2020.
18. Elhadad M.K., Li K.F., Gebali F. COVID-19-FAKES: a Twitter (Arabic/English) dataset for detecting misleading information on COVID-19. In: Barolli L., Li K.F., Miwa H., editors. Advances in intelligent networking and collaborative systems. Cham: Springer International Publishing; 2021. p. 256–268.
19. Yang C., Zhou X., Zafarani R. CHECKED: Chinese COVID-19 fake news dataset. Soc Netw Anal Min. 2021 Jun 22;11(1):58. doi: 10.1007/s13278-021-00766-8.
20. Muric G., Wu Y., Ferrara E. COVID-19 vaccine hesitancy on social media: building a public Twitter dataset of anti-vaccine content, vaccine misinformation and conspiracies. arXiv preprint arXiv:2105.05134; 2021.
21. Sharma K., Zhang Y., Liu Y. COVID-19 vaccines: characterizing misinformation campaigns and vaccine hesitancy on Twitter. arXiv preprint arXiv:2106.08423; 2021.
22. DeVerna M.R., Pierri F., Truong B.T., Bollenbacher J., Axelrod D., Loynes N., et al. CoVaxxy: a collection of English-language Twitter posts about COVID-19 vaccines. arXiv preprint arXiv:2101.07694; 2021.
23. Malagoli L.G., Stancioli J., Ferreira C.H.G., Vasconcelos M., Couto da Silva A.P., Almeida J.M. A look into COVID-19 vaccination debate on Twitter. In: 13th ACM Web Science Conference 2021 (WebSci '21). New York, NY, USA: Association for Computing Machinery; 2021. p. 225–233.
24. Hu T., Wang S., Luo W., Zhang M., Huang X., Yan Y., et al. Revealing public opinion towards COVID-19 vaccines with Twitter data in the United States: a spatiotemporal perspective. J Med Internet Res. 2021;23(9):e30854. doi: 10.2196/30854.
25. Devgan S. 100 social media statistics for 2021 [+Infographic] | Statusbrew. Statusbrew Blog; 2021 [cited 2021 Jul 28]. Available from: https://statusbrew.com/insights/social-media-statistics/.
26. Vaccine myths debunked. PublicHealth.org; 2021 [cited 2021 Jul 28]. Available from: https://www.publichealth.org/public-awareness/understanding-vaccines/vaccine-myths-debunked/.
27. Cassata C. Doctors debunk popular COVID-19 vaccine myths and conspiracy theories. Healthline; 2021 [cited 2021 Jul 28]. Available from: https://www.healthline.com/health-news/doctors-debunk-9-popular-covid-19-vaccine-myths-and-conspiracy-theories.
28. COVID-19 vaccine facts. Centers for Disease Control and Prevention; 2021 [cited 2021 Jul 28]. Available from: https://www.cdc.gov/coronavirus/2019-ncov/vaccines/facts.html.
29. The COVID-19 vaccine: myths versus facts [cited 2021 Jul 28]. Available from: https://www.muhealth.org/our-stories/covid-19-vaccine-myths-vs-facts.
30. Loper E., Bird S. NLTK: the Natural Language Toolkit. arXiv preprint cs/0205028; 2002.
31. Porter M.F., Boulton R., Macfarlane A. The English (Porter2) stemming algorithm. 2002;18:2011.
32. Goodfellow I., Bengio Y., Courville A. Deep learning. MIT Press; 2016.
33. Chen T., Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). New York, NY, USA: Association for Computing Machinery; 2016. p. 785–794.
34. Ramos J., et al. Using tf-idf to determine word relevance in document queries. In: Proceedings of the First Instructional Conference on Machine Learning. Citeseer; 2003. p. 29–48.
35. Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–1780. doi: 10.1162/neco.1997.9.8.1735.
36. Shahriar S., Tariq U. Classifying maqams of Qur'anic recitations using deep learning. IEEE Access. 2021.
37. Pennington J., Socher R., Manning C.D. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014. p. 1532–1543.
38. Nair V., Hinton G.E. Rectified linear units improve restricted Boltzmann machines. In: ICML; 2010.
39. Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–1958.
40. Kingma D.P., Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980; 2014.
41. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805; 2018.
42. Wolf T., Chaumond J., Debut L., Sanh V., Delangue C., Moi A., et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; 2020. p. 38–45.
43. Shahriar S., Al-Ali A.R., Osman A.H., Dhou S., Nijim M. Machine learning approaches for EV charging behavior: a review. IEEE Access. 2020;8:168980–168993.
44. Alothali E., Hayawi K., Alashwal H. Characteristics of similar-context trending hashtags in Twitter: a case study. In: Ku W.-S., Kanemasa Y., Serhani M.A., Zhang L.-J., editors. Web Services – ICWS 2020. Cham: Springer International Publishing; 2020. p. 150–163.
