Opinion classification at subtopic level from COVID vaccination-related tweets

Mrinmoy Sadhukhan; Pramita Bhattacherjee; Tamal Mondal; Sudakshina Dasgupta; Indrajit Bhattacharya

doi:10.1007/s11334-022-00516-9

2022 Dec 9:1–12. Online ahead of print. doi: 10.1007/s11334-022-00516-9


PMCID: PMC9734573 PMID: 36531967

Abstract

Coronavirus disease 2019 (COVID-19) is a contagious disease that has affected a large share of the population across the globe, with a high mortality rate. One of the prime measures for containing the spread of COVID-19 was to vaccinate people as widely as possible. People across the globe hold diverse opinions regarding the vaccination process, its side effects and its effectiveness. Such opinions are posted on different micro-blogging sites, including Twitter. Opinion mining, which analyzes the public sentiment expressed in such micro-blogs, is a common method for detecting public responses. This paper focuses on classifying the public opinions expressed about COVID-19 vaccination at the subtopic level. The procedure first identifies keywords for positive, negative and neutral sentences. From those keywords, related query sets are constructed for positive, negative and neutral sentiments using the Rocchio query expansion algorithm. The extended query sets are then used to form subtopics with the LDA algorithm in order to identify the nature of the tweets. The proposed LDA model achieved a coherence score of 0.56 with twenty subtopics, which is adequate for classifying the tweets into different classes. The trained model is finally used with the Apache Kafka framework to classify tweets in real time into subtopics under positive, negative or neutral sentiment.

Keywords: LDA, Rocchio-expansion, TF-IDF, Tweepy, COVID-19, Indexing, Kafka

Introduction

Twitter is a micro-blogging platform that lets its community share information and opinions concisely on any topic. These short blocks of information, or tweets, convey public opinion on a wide range of subjects. Over the past two years, because of the spread of COVID-19, people remained confined to their homes, and social media platforms became the main way for them to express their opinions [1, 2]. Twitter, being a popular platform, has seen a major rise in usage and has provided a rich source of data for opinion mining [3]. As soon as vaccines rolled out across the globe, we witnessed an array of reactions among people. To get a clear understanding of how the general public thinks about vaccination, such reactions should be mined in real time at the subtopic level. In the past literature [4], various models and tools have been adopted [5] for analyzing sentiment in tweets. However, those works classify tweets only as positive, negative or neutral. COVID-19, being a disease similar to influenza and many other flu-like diseases, has been perceived in different ways. Some people thought of it as a mere flu and dismissed the importance of vaccines, while others took the disease seriously and were more intent on taking the vaccine. The constant propagation of positive news about vaccine administration by health authorities helped to achieve high vaccine uptake [6]. After the uptake, several side effects were reported, which again caused a wave of changes in people’s attitudes toward the vaccine [6]. Some feared the disease, some the vaccine. To increase vaccine acceptance, several notable users vouched for the safety of the vaccines through tweets [6]. The prime concern should be to evaluate the attitude of the community toward accepting the steps needed to recover from the COVID-19 pandemic.
For that, effective policies should be adopted to spread awareness by reaching out to the community either electronically or at ground level. However, in order to formulate such policies, one of the prime tasks is to mine the opinions of the community at progressively deeper levels so that their thinking can be understood periodically. Such an understanding would assist in implementing policies for safety drives against COVID-19 and any other epidemic or pandemic that might arrive in the near future. In this work, we have developed a classification model that categorizes public opinions at the subtopic level in real time. Our contributions are threefold. Firstly, we have classified the tweets related to COVID-19 into positive, negative and neutral sentiments. Secondly, topic modeling has been performed to extract the hidden topics or vectors from each class, which then act as the subtopics for the respective classes. Thirdly, a classification model is trained to classify real-world tweets related to vaccination at the subtopic level based on their inner meaning and people’s opinion.

To accomplish this challenging task, we divide the paper into the following sections. The problem is introduced in Sect. 1. Previous work on this problem is discussed in Sect. 2. Data collection is presented in Sect. 3. The proposed technique is discussed in Sect. 4 and its subsections. Sect. 4.1 outlines the proposed methodology. Sect. 4.2 discusses the TF-IDF [7]-based feature extraction technique used to select the words with the highest weight. Sect. 4.3 describes in detail how the query vector is formed using the Rocchio algorithm [8]. Sect. 4.4 describes how the LDA [9] algorithm is used to form subtopics related to the different sentiments. Sect. 4.5 briefly describes how Apache Kafka [10] is utilized to build a real-time subtopic classification system. Sect. 5 presents the results and discussion of our work, and Sect. 6 gives the conclusion and future directions.

Literature survey

A number of works in the field of natural language processing have been proposed to identify the sentiment of tweets collected from micro-blogging sites such as Twitter. A number of researchers have used machine learning- and deep learning-based techniques for sentiment classification.

In machine learning-based sentiment classification, different state-of-the-art models are used, such as the Random Forest (RF) classifier, K-nearest neighbor (KNN), Support Vector Machine (SVM), Gaussian Naïve Bayes (G-NB), etc. Xiongwei Zhang et al. [10] proposed a real-time sentiment analysis system using Apache Kafka and Spark Streaming. The authors trained different machine learning algorithms (SVM, RF, KNN) on a pre-processed labeled dataset using Apache Spark, chose the best model and used it to predict the sentiment of tweets arriving in real time from Twitter. For pre-processing, they performed noise removal, tokenization, normalization and stemming to convert the tweets into a machine-interpretable format. Bania [11] proposed a TF-IDF and inductive learning-based approach. Here the author extracts features from pre-processed tweets using TF-IDF with uni-grams, bi-grams and tri-grams. These feature sets are then used to train different machine learning models, namely Gaussian Naïve Bayes (G-NB), Bernoulli Naïve Bayes (BNB), Random Forest (RF) and Support Vector Machine (SVM), and the model used for prediction is chosen based on accuracy. Yuvraj Jain et al. [12] proposed a sentiment analysis technique using Vader (Valence Aware Dictionary and Sentiment Reasoner). For pre-processing, they performed stemming, lemmatization, stop word removal and text cleaning by removing hashtags, @user mentions and HTTP links. After pre-processing, the tweets are passed to the Vader algorithm, which returns a decimal polarity score for each tweet. The closer the value is to 1, the more positive the tweet is considered, and vice versa. Koyel Chakraborty et al. [13] proposed a fuzzy rule-based sentiment analysis technique. They implemented fuzzy inference systems based on a Gaussian membership function and on a triangular membership function. The authors chose the best-performing model for future prediction to correctly identify sentiments from tweets.
Priya Iyer et al. [14] proposed a Naïve Bayes (NB)-based sentiment analysis technique. For pre-processing, they removed hashtags, white space, hyperlinks and URLs, HTML, special entities, usernames, stop words and Unicode strings from the tweets. After that, they used TF-IDF and bi-gram-based feature extraction to extract the most important features. These tweets are then used to train the NB model. They obtained about 67% accuracy at prediction time.

In deep learning-based sentiment analysis, the LSTM (Long Short-Term Memory), RNN (Recurrent Neural Network) and BERT (Bidirectional Encoder Representations from Transformers) [15] models are commonly used. Nalini Chintalapudi et al. [15] used a pre-trained BERT model for sentiment analysis. They fine-tuned the pre-trained BERT model on a Twitter dataset labeled with sentiment. The BERT model can accurately classify tweets by sentiments such as sad, joy, fear and anger. Rabindra Lamsal [16] proposed a large-scale tweet data collection technique based on geo-location and popular hashtags. To analyze the dataset, the author used bi-grams to draw a network between different tweets. For sentiment analysis, an LSTM-based deep network was used to calculate sentiment scores. Later the author replaced the LSTM with TextBlob [17] to get better accuracy. Harleen Kaur et al. [18] designed an algorithm called Hybrid Heterogeneous Support Vector Machine (H-SVM) for sentiment classification. They collected Twitter data based on different hashtags such as COVID-19, coronavirus, deaths, new case, recovered, etc. The proposed algorithm classifies tweets into positive, negative and neutral sentiment, with the respective sentiment scores generated as output. They compared the performance of the proposed model with RNN and SVM models in terms of accuracy. Soham Poddar et al. [19] developed a unique way to investigate specific user groups who posted tweets in pre-COVID and COVID times. To classify the users from the tweet set, they used CT-BERT++, a deep learning-based model, to accurately detect Anti-Vax or Pro-Vax tweets and then identify the users who posted them. The authors used the LDA algorithm to identify distinct topics and investigate how the exposure changes from pre-COVID time to COVID time.

Although prior work shows that different authors have used machine learning- and deep learning-based techniques for sentiment analysis, to our knowledge none of these papers implements a method to analyze sentiment at the subtopic level. In this work, we provide a real-time method that can accurately classify each tweet with the proper sentiment and then categorize it under the proper subtopic.

Data collection

Various researchers have accumulated tweets on COVID-19-related vaccination from January 2020 to the present and stored them as tweet ids in different data repositories (footnote 1). For our work, we used a data store of openICPSR (footnote 2) to obtain the tweet ids for collecting tweets. This organization collected tweets from 28 January 2020 to 1 September 2021. Besides this, we also collected tweets from October 2021 to March 2022 using popular COVID-19-related hashtags, i.e. corona, covid19, coronavirus, quarantine, safety, covidcase, lockdown, sarscov2, pandemic, wearamask, socialdistancing, stayathome, stayhome, PfizerBioNTech, Covaxin and many more, and labeled them with sentiments using the Vader [12] sentiment analyzer, with an accuracy of 0.96, for further research. The two datasets are merged into one dataset for training the model. Table 1 gives some examples of tweets collected from Twitter and of tweets re-hydrated from collected tweet ids.

Table 1.

Samples of collected tweets

Description	Sentiment
There currently are no vaccines, pills, potions, lotions, lozenges or other prescription or over-the-counter products available to treat/cure #coronavirus (#COVID19). Coronavirus-related ad claims will be subject to exacting scrutiny. More on the biz blog: https://t.co/U4jdCwy9AJ https://t.co/u6HiUMuXN8	Positive
We desperately need a #coronavirus vaccine. For #BigPharma that means raise prices for private profit. Please sign this @GlobalJusticeUK petition demanding that public research money goes with the condition that any vaccine is cheap enough for all of us https://t.co/0NESauWJa8	Positive
We’ve seen excessive purchasing. This can negatively affect others. #COVID19 has us feeling anxious, but important to prepare, not panic. Guidelines suggest we have food/supplies for 2 weeks. Read abt effects of excessive #stockpiling from @AgriLifeTODAY. https://t.co/EzQfFoIVRi https://t.co/6zAi3WQ7xQ	Positive
Urgent travel warnings, supermarket chaos, and the hunt for a COVID-19 vaccine. Take a look at the latest coronavirus news. #9Today https://t.co/OdcA5ZyIw0	Negative
Today, Jennifer Haller, a healthy mother of two, became the first person in history to test a potential vaccine for COVID-19. We owe her and 44 other people stepping up for human trials a debt of gratitude - may their bravery save many lives. https://t.co/eF2StcxHlQ	Positive
There is currently no vaccine to prevent #Coronavirus. The best way to prevent illness is to avoid being exposed to this virus. Here are some measures, Cimas recommends to reduce the risk. #CoronavirusAlert1?? #TogetherWeMakeADifference https://t.co/7JZ45Xz0Rf	Negative
Stabilitech’s COVID-19 Vaccine Intended to Be Delivered in a Disruptive Thermally Stable Capsule safe, efficacious, self-administered vaccine capsules that’s inexpensive to produce, developed in weeks, thermally stable and can be posted direct to consumer. https://t.co/MvyZpxhvVO	Positive


Tweets collected from Twitter are not in a format that can be used directly for model training and prediction. Hence, we have to apply some pre-processing techniques to clean them. When a tweet is re-hydrated from its tweet id using the Twitter API, the API returns an HTTP response in JSON format with multiple labels and sub-labels, only some of which are useful for us. We therefore first convert this HTTP response into a Python dictionary and fetch the necessary fields from it into a table for simplicity. Table 1 gives a glimpse of the HTTP response and the collected tweets in tabular format. The collected tweets contain hashtags, @user mentions, HTTP links, numbers, etc., which cannot be processed by natural language libraries, so we have to pre-process them to make them useful for training. Table 2 presents the regex expressions used in the pre-processing algorithm.

Table 2.

Steps of tweet pre-processing

Regex expression used for preprocessing
text -> str(text)
text -> regex(r'@\w+', '', text)
text -> regex(r'#\w+', '', text)
text -> regex(r'RT[\s]+', '', text)
text -> regex(r'https?:\/\/\S+', '', text)
text -> regex(r'[^a-zA-Z #]+', '', text)
text -> text.lower()
Checking of length of each word which is greater than 2.


Table 3 shows step by step how the pre-processing algorithm works and how the cleaned tweets are made ready for further use in different algorithms.

Table 3.

Step-by-step process to clean the tweets

Regex expression used for pre-processing	Tweets after applying pre-processing rules
Raw tweets	There currently are no vaccines, pills, potions, lotions, lozenges or other prescription or over-the-counter products available to treat/cure #coronavirus (#COVID19). Coronavirus-related ad claims will be subject to exacting scrutiny. More on the biz blog: https://t.co/U4jdCwy9AJ https://t.co/u6HiUMuXN8
text -> str(text)	There currently are no vaccines, pills, potions, lotions, lozenges or other prescription or over-the-counter products available to treat/cure #coronavirus (#COVID19). Coronavirus-related ad claims will be subject to exacting scrutiny. More on the biz blog: https://t.co/U4jdCwy9AJ https://t.co/u6HiUMuXN8
text -> regex(r'@\w+', '', text)	There currently are no vaccines, pills, potions, lotions, lozenges or other prescription or over-the-counter products available to treat/cure #coronavirus (#COVID19). Coronavirus-related ad claims will be subject to exacting scrutiny. More on the biz blog: https://t.co/U4jdCwy9AJ https://t.co/u6HiUMuXN8
text -> regex(r'#\w+', '', text)	There currently are no vaccines, pills, potions, lotions, lozenges or other prescription or over-the-counter products available to treat/cure (). Coronavirus-related ad claims will be subject to exacting scrutiny. More on the biz blog: https://t.co/U4jdCwy9AJ https://t.co/u6HiUMuXN8
text -> regex(r'RT[\s]+', '', text)	There currently are no vaccines, pills, potions, lotions, lozenges or other prescription or over-the-counter products available to treat/cure (). Coronavirus-related ad claims will be subject to exacting scrutiny. More on the biz blog: https://t.co/U4jdCwy9AJ https://t.co/u6HiUMuXN8
text -> regex(r'https?:\/\/\S+', '', text)	There currently are no vaccines, pills, potions, lotions, lozenges or other prescription or over-the-counter products available to treat/cure (). Coronavirus-related ad claims will be subject to exacting scrutiny. More on the biz blog:
text -> regex(r'[^a-zA-Z #]+', '', text)	There currently are no vaccines pills potions lotions lozenges or other prescription or over the counter products available to treat cure Coronavirus related ad claims will be subject to exacting scrutiny More on the biz blog
text -> text.lower()	there currently are no vaccines pills potions lotions lozenges or other prescription or over the counter products available to treat cure coronavirus related ad claims will be subject to exacting scrutiny more on the biz blog
Checking of length of each word which is greater than 2.	there currently are vaccines pills potions lotions lozenges other prescription over the counter products available treat cure coronavirus related claims will subject exacting scrutiny more the biz blog

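
The cleaning steps of Tables 2 and 3 can be sketched in Python with the standard `re` module. This is a minimal sketch, not the authors' code: the function name `clean_tweet` is hypothetical, and the non-letter rule is assumed to substitute a space rather than an empty string, because the output shown in Table 3 keeps word boundaries (for example "over the counter").

```python
import re


def clean_tweet(text):
    """Apply the Table 2 rules in order and return the cleaned tweet."""
    text = str(text)
    text = re.sub(r"@\w+", "", text)            # drop @user mentions
    text = re.sub(r"#\w+", "", text)            # drop hashtags
    text = re.sub(r"RT[\s]+", "", text)         # drop the retweet marker
    text = re.sub(r"https?:\/\/\S+", "", text)  # drop links
    # Assumption: replace with a space so that neighbouring words stay apart.
    text = re.sub(r"[^a-zA-Z #]+", " ", text)
    text = text.lower()
    # Keep only words whose length is greater than 2.
    return " ".join(word for word in text.split() if len(word) > 2)


if __name__ == "__main__":
    print(clean_tweet("RT @user: We desperately need a #coronavirus vaccine! "
                      "Sign here https://t.co/abc123"))
```

The order of the rules matters: hashtags and links are removed before the non-letter rule, otherwise their letters would survive as ordinary words.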

After cleaning the tweets, stemming is applied to reduce words to their root forms so that different forms of the same word do not create separate entities while building the model. Table 4 presents some tweets after the stemming operation.

Table 4.

A snapshot of pre-processed and stemmed tweets

Description	Sentiment
There currently are no vaccines pills potions lotions lozenges other prescription over the counter products available treat cure coronavirus related claims will subject exacting scrutiny more the biz blog	Positive
Desperately need vaccine for that means raise prices for private profit please sign the petition demanding that public research money goes with the condition that any vaccine cheap enough for all	Positive
Seen reports from colleagues across the country doorstep rogues claiming from the nhs providing covid vaccine scammers will take advantage the situation extort money gain access your home report any cold callers via	Positive
Urgent travel warnings supermarket chaos and the hunt for covid vaccine take look the latest coronavirus news	Negative
Today jennifer haller healthy mother two became the first person history test potential vaccine for covid owe her and other people stepping for human trials debt gratitude may their bravery save many lives	Positive
There currently no vaccine prevent the best way prevent illness avoid being exposed this virus here are some measures cimas recommends reduce the risk	Negative
Stabilitech covid vaccine intended delivered disruptive thermally stable capsule safe efficacious self-administered vaccine capsules that inexpensive produce developed weeks thermally stable and can posted direct consumer	Positive


Proposed technique

Outline of proposed methodology

The proposed working model is depicted in Fig. 1. In this architecture (footnote 3), pre-processed tweets re-hydrated from tweet ids are supplied as input to train the LDA [9] model. The trained model predicts the subtopic of a tweet in order to classify the user’s opinion. Fig. 1 formalizes the problem as a block diagram from which the different implementation steps can be easily understood.

Feature extraction using TF-IDF

Feature extraction is an important part of Natural Language Processing. For feature extraction, we use the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm [7]. It is one of the most important techniques in information retrieval for representing the importance of a specific word or phrase in a given document. To calculate TF-IDF [7] for any word, we use a count vectorizer to convert the tweets into a bag of words, which records the occurrences of each word in a document. The count vectorizer also maps each unique word to a numeric value, which is used in the calculation of TF-IDF [7]. The TF-IDF value increases in proportion to the number of times a word appears in the document. TF-IDF [7] combines two statistics: term frequency (TF) and inverse document frequency (IDF). Term frequency is the number of times a given word appears in a document divided by the total number of words in that document; it measures the weight of the word within the document. Inverse document frequency measures how much information the word provides, that is, how common or rare it is across all documents. TF-IDF [7] is computed as tf * idf. We sort the words by weight from high to low and take the first 30 words as features, which are later used as the foundation for generating the extended query set in Rocchio [8] query expansion. Table 5 presents the top 14 of the 30 words in the feature set.

Table 5.

Top 14 words among 30 words Feature set calculated by TF-IDF

ID	Word	Count	Weight
0	Covid	4667	243.906914
1	Store	3340	220.305497
2	Grocery	2942	207.898252
3	Supermarket	2886	193.077264
4	Prices	2964	189.122269
5	Food	2717	177.935706
6	Grocery store	2296	176.767438
7	Amp	2656	168.936069
8	Hand	1807	158.525416
9	People	2301	157.517270
10	Sanitizer	1658	152.005650
11	Consumer	2039	146.073079
12	Online	1765	139.648347
13	Like	1695	135.985668

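
A ranking like the one in Table 5 can be sketched with a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `top_tfidf_terms` is hypothetical, documents are assumed to be already cleaned and whitespace-tokenized, IDF uses the common variant ln(N/df) + 1, and a word's overall weight is taken as the sum of its tf * idf values over all documents (the paper does not state how per-document scores are aggregated).

```python
import math
from collections import Counter


def top_tfidf_terms(docs, k=30):
    """Return the k terms with the highest summed tf*idf weight."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)                       # share of the document
            idf = math.log(n_docs / doc_freq[term]) + 1.0  # assumed IDF variant
            weights[term] += tf * idf

    # Sort by weight (high to low); ties are broken alphabetically.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


if __name__ == "__main__":
    toy_docs = ["covid vaccine covid", "vaccine price", "store price price"]
    for term, weight in top_tfidf_terms(toy_docs, k=3):
        print(term, round(weight, 4))
```

With `k=30`, the returned list plays the role of the 30-word feature set that seeds the query expansion step.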

Query vector formation

Here we use the classical Rocchio [8] relevance feedback algorithm for query expansion, which is based on the vector space model. In this algorithm, we provide a set of relevant tweets, which are related to the input query vector, and a set of non-relevant tweets. We then form the relevant and non-relevant document vectors as input to the Rocchio algorithm. After that, we apply the query expansion formula given below.

$$\vec{q}_{m} = α \vec{q}_{0} + β \frac{1}{|D_{r}|} \sum_{\vec{d}_{j} \in D_{r}} \vec{d}_{j} - γ \frac{1}{|D_{nr}|} \sum_{\vec{d}_{j} \in D_{nr}} \vec{d}_{j}$$

Here $α$, $β$ and $γ$ are the weights attached to the three terms, $q_{0}$ is the original query vector, and $D_{r}$ and $D_{nr}$ are the sets of known relevant and non-relevant documents, respectively. If we have many judged documents, we set the values of $β$ and $γ$ a little higher. In our paper, we give more weight to relevant documents than to non-relevant ones, so we set $γ < β$. Here we set $α = 1.0$, $β = 0.75$ and $γ = 0.15$. In our use of the Rocchio [8] algorithm, when we provide one word or word pair as input, it returns 30 words as the extended query set. Table 6 gives an example of how the Rocchio [8] query expansion algorithm generates an extended query set. This extended query set is fed to the LDA [9] algorithm as input for subtopic analysis. From Table 6, we can observe that if we provide the word covid as a query, Rocchio returns covid, amp, easter, youth, commission, destroy, disast, sanit, bacteria and many more as the extended query set.

Table 6.

Extended query set

Initial query	Extended query set
Store	Store, brix, furlough, wore, pennysylvania, ...
Price	Price, crude, slash, opec, exorbit, slump, ...
Food	Food, insecur, beverag, destroy, combin, immigr, migrant, ...
Covid	Covid, amp, easter, youth, commission, destroy, disast, sanit, bacteria, ...

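
The Rocchio update with the weights stated in the text (α = 1.0, β = 0.75, γ = 0.15) can be sketched as follows. This is a minimal sketch under stated assumptions: vectors are represented as plain dictionaries mapping a term to its weight, the names `centroid` and `rocchio_expand` are hypothetical, and terms whose updated weight is not positive are dropped, which is a common convention rather than something the paper specifies.

```python
from collections import defaultdict


def centroid(doc_vectors):
    """Average a list of sparse term -> weight dictionaries."""
    if not doc_vectors:
        return {}
    total = defaultdict(float)
    for vector in doc_vectors:
        for term, weight in vector.items():
            total[term] += weight
    return {term: weight / len(doc_vectors) for term, weight in total.items()}


def rocchio_expand(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15, k=30):
    """Return the k highest-weighted terms of the modified query vector."""
    modified = defaultdict(float)
    for term, weight in query.items():
        modified[term] += alpha * weight
    for term, weight in centroid(relevant).items():
        modified[term] += beta * weight
    for term, weight in centroid(non_relevant).items():
        modified[term] -= gamma * weight
    # Assumption: keep only terms with a positive weight.
    positive = [(t, w) for t, w in modified.items() if w > 0]
    positive.sort(key=lambda item: (-item[1], item[0]))
    return positive[:k]


if __name__ == "__main__":
    expanded = rocchio_expand(
        {"covid": 1.0},
        relevant=[{"covid": 1.0, "sanit": 1.0}, {"covid": 1.0, "amp": 2.0}],
        non_relevant=[{"price": 1.0, "amp": 1.0}],
    )
    print(expanded)
```

In this toy run the query word keeps the highest weight, terms that co-occur with it in relevant documents are pulled in, and a term seen only in the non-relevant set is excluded.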

Generation of subtopic using LDA

Topic modeling is a method for automatically organizing documents based on hidden themes. A document can belong to multiple topics at the same time, with a different contribution score per topic, and we choose the dominant topic for the document; this is a kind of soft clustering. To analyze topics (or themes), we used a state-of-the-art machine learning algorithm known as LDA (Latent Dirichlet Allocation) [9], which is an unsupervised document classification algorithm. In this paper, we used LDA for topic modeling to extract the inner themes of tweets. To construct the LDA model, we used the extended query vector provided by the Rocchio [8] algorithm as input. The model outputs groups of words organized into subtopics. Fig. 2 illustrates how the LDA model generated subtopics based on inner themes. For each type of sentiment, we trained a separate LDA model on its own extended query set and stored it for further use.

Fig. 2 — Clustering of different words based on inner themes
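
Choosing the dominant topic from a soft assignment amounts to taking the topic with the largest contribution score. The sketch below assumes the LDA model has already produced (topic id, contribution) pairs for a tweet; the function name `dominant_topic` and the example scores are hypothetical.

```python
def dominant_topic(topic_scores):
    """Return the (topic_id, contribution) pair with the highest contribution.

    topic_scores: list of (topic_id, contribution) pairs for one document.
    """
    if not topic_scores:
        raise ValueError("the document has no topic assignment")
    return max(topic_scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical contributions of three subtopics to a single tweet.
    print(dominant_topic([(1, 0.10), (4, 0.70), (6, 0.20)]))
```

The remaining scores are discarded in this step, which is what turns the soft clustering into a single subtopic label per tweet.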

To assess the quality of the LDA model, we observe its coherence score. In our work, we set the number of topics to 20, for which we observed a coherence score of 0.56, which is good enough for an LDA model. The number of topics can also be increased or decreased based on the word set and document size. Figs. 3, 4 and 5 present scatter plots of the number of topics versus coherence score for the positive, negative and neutral sentiments.

Fig. 3 — A coherence versus number of topic graph for positive sentiment

Fig. 4 — A coherence versus number of topic graph for negative sentiment

Fig. 5 — A coherence vs number of topic graph for neutral sentiment

From Figs. 3, 4 and 5, we can observe that beyond 20 subtopics the coherence score remains nearly constant at about 0.56; hence, we chose 20 as the number of subtopics. Table 7 shows some topics with their keywords as grouped by the LDA model [9]. From Table 7, it can be observed that subtopic 1 contains keywords such as covid, price, consum, demand, oil and stop, which are related to each other.

Table 7.

Subtopic related to negative sentiment

Subtopic name	Keywords
1	Covid, Price, consum, demand, oil, coronaviru, stop, insight, survey, gold, webniar
2	Store, groceri, supermarket, worker, work, one, go, easter, sunday, piece, mail
3	Amp, peopl, pandem, get, time, crisi, connect, walmart, dump, distilleri, greater
4	Food, need, buy, suppli, stock, panic, inventori, snap, code
5	Shop, onlin, led, gift, bid, portal, institut, commit


Utilization of Apache Kafka and Apache Spark to build the classification model at subtopic level

Event streaming is the digital equivalent of the human body’s central nervous system. It is the practice of capturing data in real time from event sources such as databases, sensors, mobile devices, cloud services and software applications in the form of streams of events, and of storing these event streams durably for later retrieval. Apache Kafka [10] is one of the leading event streaming architectures. To accomplish real-time subtopic-level classification of tweets, we used Apache Kafka and Apache Spark. In this step, we used the Twitter Streaming API to stream Twitter data filtered by COVID-19-related hashtags, i.e. corona, covid19, coronavirus, quarantine, safety, covidcase, lockdown, sarscov2, pandemic, wearamask, socialdistancing, stayathome, stayhome, PfizerBioNTech, covaxin and many more, and Apache Kafka [10] to ingest the data from Twitter. The Twitter Streaming API retrieves tweets about coronavirus as they are produced so that they can be classified at the subtopic level. To connect to the API and retrieve Twitter data, we used a Python library called Tweepy. After the connection to the Twitter Streaming API is established, streaming data are ingested from Twitter into a Kafka topic. Spark [10, 20] streaming and machine learning capabilities are then utilized to process the streaming tweets and apply the LDA model for subtopic-level analysis. In particular, Apache Kafka [10] streaming pre-processes the collected coronavirus-related tweets on the fly and categorizes them by sentiment using the Vader [12] sentiment analyzer. After that, Apache Spark [10] classifies the tweets into subtopics using the pre-trained LDA [9] model and stores them for future use.
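
The flow of messages through this pipeline can be illustrated without a running cluster. The sketch below is only a stand-in: an in-memory `queue.Queue` takes the place of the Kafka topic, plain functions take the place of the Kafka producer and the Spark consumer, and `clean_fn`, `sentiment_fn` and the per-sentiment `subtopic_models` are hypothetical callables supplied by the caller. It shows the division of work described above, not how Kafka or Spark are actually invoked.

```python
import queue


def produce(topic, raw_tweets, clean_fn, sentiment_fn):
    """Producer side: clean each tweet, label its sentiment, publish it."""
    for raw in raw_tweets:
        text = clean_fn(raw)
        topic.put({"text": text, "sentiment": sentiment_fn(text)})


def consume(topic, subtopic_models):
    """Consumer side: read pending messages and attach a subtopic to each.

    subtopic_models maps a sentiment label to a callable that returns the
    subtopic of a text (one trained model per sentiment, as in the paper).
    """
    results = []
    while True:
        try:
            message = topic.get_nowait()
        except queue.Empty:
            break
        model = subtopic_models.get(message["sentiment"])
        message["subtopic"] = model(message["text"]) if model else None
        results.append(message)
    return results


if __name__ == "__main__":
    topic = queue.Queue()  # stand-in for a Kafka topic
    produce(topic, ["Online SCAM alert", "Keep social distance"],
            clean_fn=str.lower,
            sentiment_fn=lambda t: "negative" if "scam" in t else "positive")
    # Toy models that always return one fixed subtopic number.
    print(consume(topic, {"negative": lambda t: 20, "positive": lambda t: 18}))
```

Keeping the sentiment label inside the message lets the consumer pick the matching per-sentiment model without recomputing the sentiment.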

Results and discussion

The main challenges of our work are to classify tweets with the proper sentiment and to extract subtopics in real time. For this purpose, we used the Vader [12] sentiment analyzer. Vader (Valence Aware Dictionary and Sentiment Reasoner) is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is used for sentiment analysis of text that carries both polarities, i.e. positive and negative. With Vader [12], we calculate the intensity of each tweet: if the sentiment score is greater than 0.05, the tweet is marked as positive; if the score is less than -0.05, it is treated as negative; otherwise it is classified as neutral. Here we mainly focus on the tweets with positive and negative sentiment. After sentiment analysis, we apply the trained LDA [9] model that corresponds to the tweet’s sentiment to classify the tweet into a subtopic category based on the underlying theme of its word grouping. To implement this in real time, we first integrate Apache Kafka [10] with the Twitter API to collect tweets based on hashtags related to COVID-19. On receiving the tweets, Vader [12] classifies them and Kafka saves them as a message stream in a topic. Apache Spark [10] then reads the messages by subscribing to the topic and classifies them according to their subtopic with the help of the trained LDA model. In this way, we achieve the speed needed to classify tweets in real time. Table 8 presents the distribution of different topics related to positive and negative sentiment.
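
The thresholding rule stated above can be written as a small function. This is a minimal sketch: the name `label_sentiment` is hypothetical, and the input is assumed to be a single sentiment score per tweet in the range -1 to 1, such as Vader's compound score (the text says only "sentiment score").

```python
def label_sentiment(score, threshold=0.05):
    """Map a sentiment score to a label using the paper's cut-offs."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for value in (0.62, -0.40, 0.01):
        print(value, label_sentiment(value))
```

Scores that fall exactly on 0.05 or -0.05 are labeled neutral here, because the text uses strict inequalities.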

Table 8.

Distribution of different topics related to positive and negative sentiment

Sentiment	Topic number	Topic name	Topic concept	Top words	Tweet
Negative	6	Medicine Price Rise	Concerns about Medicine price hike due to covid 19	Covid, price, pandem, contract, youth, opec, ayurved, grip, medicine, hike	Advice talk neighbors family exchange phone numbers create contact list phone numbers neighbors schools employer chemist set online shopping accounts poss adequate supplies regular meds order
Negative	6	Medicine Price Rise	Concerns about Medicine price hike due to covid 19	Covid, price, pandem, contract, youth, opec, ayurved, grip, medicine, hike	Reminder price gouging illegal californians protected experience illegal price gouging housing gas food essentials submit complaint office call
Negative	02	Unavailability of vaccine to prevent covid	Concerns about unavailability of covid-19 vaccination	Vaccine, prevent, illness, virus, exposed	There currently no vaccine prevent the best way prevent illness avoid being exposed this virus here are some measures cimas recommends reduce the risk
Negative	11	Food Shortage	Concerns about food shortage and price hike in supermarket	Pleas, supermarket, hazard, safe, food, stock, feb, egg, amp, panic	Ready supermarket outbreak paranoid food stock litteraly empty serious thing please panic causes shortage
Positive	02	Human trials	Concerns about the human trials of covid 19 vaccine	Healthy, history, vaccine, human, trails, lives, save	Today jennifer haller healthy mother two became the first person history test potential vaccine for covid owe her and other people stepping for human trials debt gratitude may their bravery save many lives
Positive	16	Increase demand	Due to covid 19 demand became high for food products	Food, increase, demand, stock, full, groceries	Due to the Covid-19 situation, we have increased demand for all food products. The wait time may be longer for all online orders, particularly beef share and freezer packs. We thank you for your patience during this time
Positive	18	Social distancing	People try to maintain their social distancing to stop covid spread	Social, distance, covid, supermarket, spread, stop	I would love to practice social distancing but my occupation doesn’t allow it I monitor the selfservice area in a retail store and to do that I am required to remain in that area and assist customers when needed My greatest fear is getting Covid 19 unknowingly and 1 4
Negative	20	Online scam	Concerns related to online scam cases due to online payment at store or online shopping app	Consumer, online, shopping, workers, scam, needs, thanks, groceries, foods, safety, pricing	Corona prevention stop buy things cash use online payment methods corona spread notes also prefer online shopping home time fight covid


From Table 8, we can observe that across different time frames people tweet different opinions about what is happening around them, in their country or in the world. People frequently complain about medicine price hikes, supermarkets charging extra for food due to high demand and low availability, fraud cases, and money deducted from bank accounts due to scams. These subjects are highly negative in sentiment. After sentiment analysis, we group these types of tweets into different subtopics. Table 8 shows such topics, for example Topic 6 [Medicine Price Rise], Topic 11 [Food Shortage] and Topic 20 [Online scam]. Amid this negativity there are also some positive aspects, such as small shopkeepers tweeting that they are getting huge demand from retailers, and people maintaining social distancing. For positive tweets, some of the topics are Topic 18 [Social distancing] and Topic 16 [Increase demand]. Table 9 presents a comparison of the LDA model with the machine learning-based topic modeling techniques NMF (Non-negative Matrix Factorization) [21] and LSA (Latent Semantic Analysis) [21] and with a deep learning-based topic modeling technique (BERT).
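The grouping step, in which a tweet is first routed by its sentiment and then assigned to a subtopic, can be illustrated with a toy sketch. This is not LDA inference: it uses plain keyword overlap with the top words listed in Table 8 as a simplified stand-in for the trained model, and the names `TOP_WORDS` and `assign_subtopic` are hypothetical.

```python
# Top words per (sentiment, topic number), transcribed from Table 8 in lower case.
TOP_WORDS = {
    "negative": {
        6: {"covid", "price", "pandem", "contract", "youth", "opec",
            "ayurved", "grip", "medicine", "hike"},
        2: {"vaccine", "prevent", "illness", "virus", "exposed"},
        11: {"pleas", "supermarket", "hazard", "safe", "food", "stock",
             "feb", "egg", "amp", "panic"},
        20: {"consumer", "online", "shopping", "workers", "scam", "needs",
             "thanks", "groceries", "foods", "safety", "pricing"},
    },
    "positive": {
        2: {"healthy", "history", "vaccine", "human", "trails", "lives", "save"},
        16: {"food", "increase", "demand", "stock", "full", "groceries"},
        18: {"social", "distance", "covid", "supermarket", "spread", "stop"},
    },
}


def assign_subtopic(tweet: str, sentiment: str) -> int:
    """Return the topic whose top words overlap most with the tweet's tokens.

    Only the topics of the given sentiment are considered, mirroring the
    use of a sentiment-specific model. Ties go to the first topic listed.
    """
    tokens = set(tweet.lower().split())
    scores = {topic: len(tokens & words)
              for topic, words in TOP_WORDS[sentiment].items()}
    return max(scores, key=scores.get)


# Tweets taken (the second one shortened) from the Tweet column of Table 8.
food_topic = assign_subtopic(
    "Ready supermarket outbreak paranoid food stock litteraly empty "
    "serious thing please panic causes shortage", "negative")
distancing_topic = assign_subtopic(
    "I would love to practice social distancing", "positive")
```

A real LDA model would instead infer a probability distribution over topics for each tweet and pick the most probable one; the overlap count is used here only to keep the sketch dependency-free.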

Table 9.

Performance comparison of different topic modeling techniques

LDA		NMF		LSA		BERT
Topic 1	Topic 2	Topic 1	Topic 2	Topic 1	Topic 2	Topic 1	Topic 2
Covid	Store	Covid	Store	Covid	Worker	Covid	Groceri
Covid	Groceri	Price	Supermarket	Consum	Groceri	Demand	Store
Consum	Supermarket	Oil	Work	Demand	Store	Price	Work
Demand	Worker	Coronavirus	One	Oil	Panic	Consum	One
Oil	Work	Stop	Go	Coronavirus	Shop	Coronavirus	Go
Coronavirus	One	Insight	Sunday	Stop	Online	Stop	Easter
Stop	Go		Price	Survey	Gift	Insight	Sunday
Insight	Easter				Mall	Survey	Price
Survey	Sunday					Gold	Inventori
Gold	Price						Code


From Table 9, it can be observed that, for the same numbered subtopic, the LSA and NMF models cluster fewer words than the LDA and BERT models. It can also be observed that the performance of the LDA and BERT models is nearly comparable, as they generate the same number of subtopics with respect to a topic. The BERT [15] model is not chosen in our case because there is always a token limit when training the model, and it achieves its highest accuracy only when trained on billions of data samples with a huge number of parameters. In contrast, the machine learning approach based on the LDA model is an unsupervised probabilistic clustering algorithm that does not require such a large amount of training data and has no token limit. A snapshot describing the classification of tweets in real time is given in Fig. 6.
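The word-count observation about Table 9 can be checked with a short tally. The sketch below simply transcribes the columns of Table 9 and counts the entries per model and topic; the variable names are hypothetical, and the count is of listed entries, not of distinct words.

```python
# Columns of Table 9, read top to bottom; blank cells are omitted.
TABLE_9 = {
    ("LDA", 1): ["Covid", "Covid", "Consum", "Demand", "Oil", "Coronavirus",
                 "Stop", "Insight", "Survey", "Gold"],  # "Covid" is listed twice
    ("LDA", 2): ["Store", "Groceri", "Supermarket", "Worker", "Work", "One",
                 "Go", "Easter", "Sunday", "Price"],
    ("NMF", 1): ["Covid", "Price", "Oil", "Coronavirus", "Stop", "Insight"],
    ("NMF", 2): ["Store", "Supermarket", "Work", "One", "Go", "Sunday", "Price"],
    ("LSA", 1): ["Covid", "Consum", "Demand", "Oil", "Coronavirus", "Stop",
                 "Survey"],
    ("LSA", 2): ["Worker", "Groceri", "Store", "Panic", "Shop", "Online",
                 "Gift", "Mall"],
    ("BERT", 1): ["Covid", "Demand", "Price", "Consum", "Coronavirus", "Stop",
                  "Insight", "Survey", "Gold"],
    ("BERT", 2): ["Groceri", "Store", "Work", "One", "Go", "Easter", "Sunday",
                  "Price", "Inventori", "Code"],
}

# Number of listed words per (model, topic).
word_counts = {key: len(words) for key, words in TABLE_9.items()}

# Total words listed per model across both topics.
totals = {}
for (model, _topic), count in word_counts.items():
    totals[model] = totals.get(model, 0) + count
```

Under this tally, LDA lists 20 words and BERT 19 across the two topics, against 13 for NMF and 15 for LSA, which is consistent with the observation above.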

From Fig. 6, we can observe that value contains the real-time pre-processed tweets, timestamp tells us when the tweets were generated or posted on Twitter, and sentiment gives the sentiment perceived in the tweets, which is calculated within the Apache Kafka [10] framework. Subtopic represents the group to which each incoming tweet should belong, which is evaluated by the Apache Spark framework [10].

A relation between the methodology and tools required for various activities is presented in Fig. 7.

From Fig. 7, we can observe that for sentiment analysis we have used the VADER [12] sentiment analyzer, and for query vector formation we have used the Rocchio [8] algorithm. For subtopic creation we have used the LDA [9] algorithm, and for real-time classification of tweets we have used Apache Kafka and Apache Spark [10].
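The flow of a tweet through these stages can be sketched in plain Python. This is only an in-memory illustration under stated assumptions: a `queue.Queue` stands in for the Kafka topic, a generator stands in for the Spark consumer, and the two classifier functions are trivial placeholders for VADER and the trained LDA model. All names are hypothetical; the record fields follow the description of Fig. 6 (value, timestamp, sentiment, subtopic).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from queue import Empty, Queue
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class ClassifiedTweet:
    value: str           # pre-processed tweet text
    timestamp: datetime  # when the tweet was posted
    sentiment: str       # label attached on the producer side
    subtopic: int        # label attached on the consumer side


def produce(topic: Queue, tweets: Iterable[Tuple[str, datetime]],
            label_sentiment: Callable[[str], str]) -> None:
    """Producer side: label each tweet's sentiment and publish it."""
    for text, posted_at in tweets:
        topic.put({"value": text, "timestamp": posted_at,
                   "sentiment": label_sentiment(text)})


def consume(topic: Queue,
            assign_subtopic: Callable[[str, str], int]) -> Iterator[ClassifiedTweet]:
    """Consumer side: read messages until the queue is empty and add a subtopic."""
    while True:
        try:
            msg = topic.get_nowait()
        except Empty:
            return
        yield ClassifiedTweet(
            subtopic=assign_subtopic(msg["value"], msg["sentiment"]), **msg)


# Placeholder classifiers, NOT the real VADER or LDA models.
def stub_sentiment(text: str) -> str:
    return "negative" if "scam" in text.split() else "positive"


def stub_subtopic(text: str, sentiment: str) -> int:
    return 20 if sentiment == "negative" else 18


posted = datetime(2022, 1, 1, tzinfo=timezone.utc)  # arbitrary example time
topic = Queue()
produce(topic, [("online shopping scam reported", posted),
                ("practice social distancing", posted)], stub_sentiment)
records = list(consume(topic, stub_subtopic))
```

In a deployed system the queue would be replaced by a Kafka topic and the consumer by a Spark streaming job; the sketch is meant only to show which stage attaches which field.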

Conclusion

This work systematically analyzes how tweets are classified into different subtopics based on word grouping. Analyzing the subtopic names might help to understand the underlying themes of tweets posted regarding COVID-19. Essentially, the proposed work provides a framework for using live tweet data related to COVID-19. The framework automatically processes each incoming tweet and categorizes it by sentiment; the tweet is then fed to the previously trained model, which maps it to the appropriate subtopic. This work would help decision makers to identify public sentiment related to COVID-19 vaccination at the subtopic level and to make the necessary decisions regarding the vaccination process. In future work, we would like to build a more advanced model that visualizes changes in tweet sentiment in a spatio-temporal way, which might help to reveal the thought process of the general public and their reaction to the COVID-19 vaccination process. Another direction for future work is to design a model that can accurately track a subtopic and detect whether the number of tweets related to it, for each sentiment, is increasing or decreasing.

Data availability

The tweet data were collected using the Twitter API, and some tweet IDs were collected from openICPSR [22].

Declarations

Conflict of interest

The proposed work is self-funded and uses the Google Colab free tier for training purposes. There is no conflict of interest with other proposed state-of-the-art models.

Footnotes

https://github.com/lopezbec/COVID19_Tweets_Dataset

openICPSR: ’https://www.openicpsr.org/openicpsr/project/120321/version/V6/view?path=/openicpsr/120321/fcr:versions/V6/Twitter-COVID-dataset---Jan-2021’

The GitHub link of the project is given. GitHub: ’https://github.com/mrinmoy-sadhukhan/LDA-topic-Kafka’

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Pramita Bhattacherjee, Tamal Mondal, Sudakshina Dasgupta and Indrajit Bhattacharya have contributed equally to this work.

Contributor Information

Mrinmoy Sadhukhan, Email: mrinmoy.sadhukhan1996@gmail.com.

Pramita Bhattacherjee, Email: pramita.code@gmail.com.

Tamal Mondal, Email: tamalkalyanigov@gmail.com.

Sudakshina Dasgupta, Email: sudakshinadasgupta@yahoo.com.

Indrajit Bhattacharya, Email: indra51276@gmail.com.

References

1. Wong A, Ho S, Olusanya O, Antonini MV, Lyness D (2020) The use of social media and online communications in times of pandemic COVID-19. J Intensive Care Soc. 10.1177/1751143720966280
2. Website. https://www.statista.com/topics/7863/social-media-use-during-coronaviruscovid-19-worldwide/topicHeader__wrapper last visited: 17-April-2022
3. Wicke P, Bolognesi MM (2021) Covid-19 Discourse on twitter: how the topics, sentiments, subjectivity, and figurative frames changed over time. In: Frontiers in communication, year:2021. 10.3389/fcomm.2021.651997
4. Kharde VA, Sonawane SS (2016) Sentiment analysis of twitter data: a survey of techniques. In: Int J Comput Appl 139(11)
5. Blog. https://monkeylearn.com/blog/sentiment-analysis-of-twitter/ Last Visited On: 11-(April-2022)
6. Dai H, Saccardo S, Han MA, Roh L, Raja N, Vangala S, Modi H, Pandya S, Sloyan M, Croymans DM (2021) Behavioural nudges increase COVID-19 vaccinations. In: Nature, 02
7. Kim SW, Gil JM (2019) Research paper classification systems based on TF-IDF and LDA schemes. Hum Cent Comput Inf Sci 9:30. 10.1186/s13673-019-0192-7
8. Ye Z, He B, Huang X, Lin H (2010) Revisiting Rocchio’s relevance feedback algorithm for probabilistic models. In: Information retrieval technology—6th Asia information retrieval societies conference, year, pp 151–161. 10.1007/978-3-642-17187-1_14
9. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res
10. Zhang X, Saleh H, Younis EM, Sahal R, Ali AA (2020) Predicting coronavirus pandemic in real-time using machine learning and big data streaming system. In: Hindawi, Complexity, vol. 2020, Article ID 6688912, 10 p. 10.1155/2020/6688912
11. Bania RK (2020) COVID-19 public tweets sentiment analysis using TF-IDF and inductive learning models. In: INFOCOMP, vol 19(2), pp 23–41
12. Jain Y, Tirth V (2020) Sentiment analysis of tweets and texts using python on stocks and COVID-19. Int J Comput Intell Res 16(2), 87–104. 10.37622/IJCIR/16.2.2020.87-104
13. Chakraborty K, Bhatia S, Bhattacharyya S, Platos J, Bag R, Hassaniene AE (2020) Sentiment analysis of covid-19 tweets by deep learning classifiers—a study to show how popularity is affecting accuracy in social media. Appl Soft Comput. 10.1016/j.asoc.2020.106754
14. Iyer KBP, Kumaresh S (2020) Twitter sentiment analysis on coronavirus outbreak using machine learning algorithms. In: EJMCM, Volume 207, Issue 203, pp 202663–2676
15. Chintalapudi N, Battineni G, Amenta F (2021) Sentimental analysis of COVID-19 tweets using deep learning models. In: PMC, Apr 1. 10.3390/idr13020032
16. Lamsal R (2020) Design and analysis of a large-scale COVID-19 tweets dataset. In: Springer Nature. 10.1007/s10489-020-02029-z
17. TextBlob. https://textblob.readthedocs.io/en/dev/, last accessed on (May-2022)
18. Kaur H, Ahsaan SU, Alankar B, Chang V (2021) A proposed sentiment analysis deep learning algorithm for analyzing COVID-19 tweets. In: PMC, 2021
19. Poddar S, Mondal M, Misra J, Ganguly N, Ghosh S (2021) Winds of change: impact of COVID-19 on vaccine-related opinions of twitter users. In: arxiv, Sat, 20. 10.48550/arXiv.2111.10667
20. Apache Spark. https://spark.apache.org/, last accessed on (May-2022)
21. Kherwa P, Bansal P (2019) Topic modeling: a comprehensive review. In: EAI endorsed transactions on scalable information systems 24. 10.4108/eai.13-7-2018.159623
22. openICPSR. https://www.openicpsr.org/openicpsr/project/120321/version/V6/view?path=/openicpsr/120321/fcr:versions/V6/Twitter-COVID-dataset-Jan-2021, last accessed on (May-2022)
