COVID-19 outbreak: An ensemble pre-trained deep learning model for detecting informative tweets

SreeJagadeesh Malla; Alphonse PJA

doi:10.1016/j.asoc.2021.107495

. 2021 May 21;107:107495. doi: 10.1016/j.asoc.2021.107495

COVID-19 outbreak: An ensemble pre-trained deep learning model for detecting informative tweets

SreeJagadeesh Malla ^1,^⁎, Alphonse PJA ¹

PMCID: PMC9761198 PMID: 36568257

Abstract

On 11 March 2020, the (WHO) World Health Organization declared COVID-19 (CoronaVirus Disease 2019) as a pandemic. A further crisis has manifested mass fear and panic, driven by lack of information, or sometimes outright misinformation, alongside the coronavirus pandemic. Twitter is one of the prominent and trusted social media in this current outbreak. Over time, boundless COVID-19 headlines and vast awareness have been spreading, with tweets, updates, videos, and explosive posts. Few studies have been performed on the pandemic to detect and interrelate various disease types, including current coronavirus. However, it is pretty tricky to discriminate and detect a specific category. This work is motivated by the need to inform society about limiting irrelevant information and avoiding spreading negative emotions. In this context, the current work focuses on informative tweet detection in the pandemic to provide relevant information to the government, medical organizations, victims services, etc. This paper used a Majority Voting technique-based Ensemble Deep Learning (MVEDL) model. This MVEDL model is used to identify COVID-19 related (INFORMATIVE) tweets. The state-of-art deep learning models RoBERTa, BERTweet, and CT-BERT are used for best performance with the MVEDL model. The “COVID-19 English labeled tweets” dataset is used for training and testing the MVEDL model. The MVEDL model has shown 91.75 percent accuracy, 91.14 percent F1-score and outperforms the traditional machine learning and deep learning models. We also investigate how to use the MVEDL model for sentiment analysis on 226668 unlabeled COVID-19 tweets and their informative tweets. The application section discussed a comprehensive analysis of both actual and informative tweets. According to our knowledge, this is the first work on COVID-19 sentiment analysis using a deep learning ensemble model.

Keywords: COVID-19, Informative tweets, Deep learning, RoBERTa, CT-BERT, BERTweet, Majority voting, Health emergency, Sentiment analysis

1. Introduction

SARS-CoV-2 is a new viral disease that first appeared in 2019. Later it was named COVID-19. In late December 2019, COVID-19 was detected first in Wuhan, China, and quickly spread worldwide. The World Health Organization (WHO), which is relentlessly trying to control the spread of the COVID-19 outbreak, declared the pandemic on 30 January 2020. COVID-19 is an infectious disease transmitted by contacts and small droplets when people cough, sneeze, or talk—by Quarantine, Limiting activities, and separating suspects from others who are not ill to be unable to spread the infection or contamination. Most countries have been locked down for strict quarantine implementation. Precautions such as clean, safe distance, wear the mask, do not touch the eyes, nose, or mouth, etc., are proposed. Due to the COVID-19 lockdown and other precautions during the COVID-19 pandemic, the global economic crisis began. The automotive industry, Tourism, Restaurants, Retail, Transportation, and Energy, etc., are the top-rated service sectors are affected by the COVID-19 recession. The vaccination process has been started in many countries to prevent people from becoming seriously ill with COVID-19.

In this pandemic situation, social media sites such as Instagram, Facebook, WhatsApp, Twitter, etc., help gather insightful messages allied to COVID-19 disease. During this pandemic condition, the situations correlate with specific social media messages. The content includes epidemic signs, communities affected by disease outbreaks and other medical services. Today, most NLP researchers focus on social media text classification. This paper has discussed the messages of the most popular social media site Twitter.

Twitter is one of the famous social media quotes most widely used for sharing short messages. These short messages are called tweets with a length of up to 280 characters. The Twitter API supports open access to Twitter in an advanced and exclusive way. Active Twitter user tweets multiple types of information in a large amount of data at a tremendous pace in a health emergency, consisting of tweets related to both disease and non-disease. Informative tweets provide information about suspected, confirmed, recovered, and death cases and the location or travel history of the patients and contain symptoms of illness like cold, fever, headache, running nose, body pains, etc. The COVID-19 related tweets [1] are not following the ”INFORMATIVE” annotation guidelines, are labeled with the ”UNINFORMATIVE”. Another type of tweets is ”Fake” messages. These fake news are purposefully created messages to mislead the social media society, and uninformative messages are not fake, but not related to the COVID-19 pandemic as shown in Table 1. Precautions and prevention for various diseases are to be noticed by various social and government organizations or departments’. For this purpose they needs resources like social media for creating awareness and providing medical kits, medicines. The tweet data is classified with unique resources to help the suffering and suspected users to know their status in a health emergency . Tweets relate to various diseases should be classified for health events to enable the authorities to provide healthcare facilities to prevent the public from developing the disease, leading to the final phase of breathing. To diagnose these kinds of circumstances, we need a common automated framework to gather the above situational tweeting. The NLP researchers analyze the Twitter text by using traditional methods as well as advanced Artificial Intelligence techniques. During this decade, the use of AI techniques increases compared to conventional techniques with better performance and speed. Many of the deep learning strategies are capable of generating better results. Deep learning models’ performance depends on the defined problem. Therefore, it has steadily increased the need for robust techniques to detect informative tweets in the disease-relevant corpus.

Table 1.

Types of COVID-19 Twitter tweets.

COVID-19 Informative Tweet (Real Information and also related to COVID-19 pandemic)

• Over 200,000 in the US now infected. Over 5000 in the US now dead. Coronavirus infections due to increase drastically while the death toll is projected to reach 100k to 240k in the next weeks. My question: why hasna€^TMt

COVID-19 Uninformative Tweet (Real Information ,but not related to COVID-19 pandemic)

• Stop and Shop Donates Meal For COVID-19 Healthcare Workers * Stop and Shop announced it will be providing 5000 meals daily for frontline healthcare workers battling novel .

COVID-19 Fake Tweet

• Obama Calls Trump’s Coronavirus Response A Chaotic Disaster https://t.co/DeDqZEhAsB.

COVID-19 Real Tweet

• Schools are struggling to cope with a lack of #COVID19 tests - with new infections increasing since it became compulsory for pupils to return. But when should you get your child tested for the virus? Here’s our explainer.

Paper	Year	Important discussed topic	Advantage/Limitations
[6]	2021	Identify damage assessment tweets during a disaster	The model used linear regression, SVR and random forest technology
[7]	2020	Identify multi-modal informative disaster tweets	Model based on BERT and DenseNet
[8]	2020	Dense classification with contextual representation	ELMo embedded classifier is applied.
[9]	2020	Detecting Informative Tweets	ensemble model of CNN, ANN, fine-tuned VGG-16 architecture
[10]	2020	Reduces dynamic routing computational complexity	Capsule networks have several advantages over CNN
[11]	2019	Model Combination of capsules encoded features and capsule networks	benefit of simplistic capsule networks compared to existing HMC methods.
[12]	2018	Sentiment analysis decision support systems	DSocial model was used for automating the processing of social network information
[13]	2017	Voting Algorithm for sentiment classification	Model Used SVM,Naive bayes, Bagging.
[14]	2009	Ensemble method for sentiment classification	Rule-based classification, supervised learning and machine learning

Paper	Year	Important discussed topic	Advantages/Limitations
[27]	2021	COVID-19 dataset from Feb 2020 to March 2020 was used for classification of sentiments	BiLSTM,CNN, distilBERT,BERT,XLNET and ALBERT ware used.
[28]	2021	Negative emotions of COVID-19 pandemic ware discussed	The keywords can be used to remove content related to COVID-19 from some relevant tweets.
[29]	2021	Detecting fake news from COVID-19	Modified-LSTM and Modified GRU used for improve accuracy.
[30]	2021	An automatic lung segmentation of CT-images of COVID-19 patients	A new fully connected (FC) layer of paralleling quantum-installed self-controlled network (PQIS-Net) gives better results.
[31]	2020	Sentiment analysis on latest COVID-19 dataset with 226668 tweets	Implementing a fuzzy rule base for Gaussian membership for analysis.
[32]	2020	Deep sentiment classification on COVID-19 comments	LSTM Recurrent Neural Network achieved higher accuracy than other machine-learning algorithms for COVID-19 — sentiment classification .
[33]	2020	The classification of the positions of critical patients	Consider different classifiers of Bayesian, linear and support vector machine (SVM).
[34]	2020	Classify data incredible or non-trustworthy.	Ensemble learning model(SVM and Random Forest) had better performance than individual models.
[35]	2020	comparative analysis of quantum backpropagation multilayer perceptron (QBMLP) and continuous variable quantum neural networks	Promising results on convoluted and sporadic data.
[36]	2020	Analysis on the largest English Twitter depression dataset(COVID 19)	Pre-trained transformer classification models BERT, RoBERTa and XLNet ware used.
[37]	2020	Auto-assign sentences for COVID-19 press briefings corpus	CNN + BERT (combined) outperforms CNN combined with other embeddings (Word2Vec, Glove, ELMo)
[38]	2019	Automatically sense the public opinion on vaccination from tweets	bag-of-words(n-grams as tokens), and for classification(SVM) is used.
[39]	2021	Automatically sense the public opinion on vaccination from tweets	bag-of-words (n-grams as tokens), and for classification(SVM) is used.

Paper	Year	Author	Model	Accuracy	F1-Score
[40]	2020	(Kumar and Singh, 2020)	CT-BERT +RoBERTa	91.50	90.96
[41]	2020	(Møller et al. 2020)	CT-BERT	91.40	90.96
[42]	2020	(Maveli, 2020)	RoBERTa + XLNet +BERTweet	90.40	90.11
[43]	2020	(Bao et al. 2020)	RoBERTa +MLP	90.30	90.05
[44]	2020	(Nguyen, 2020)	Majority Voting	90.15	90.08
[3]	2020	(Jagadeesh and Alphonse, 2020)	RoBERTa	89.35	89.14
[45]	2020	(Babu and Eswari, 2020)	CT-BERT+RoBERTa+ SVM(TFIDF)	89.35	88.87

Batch size	Learning rate	TP	FP	FN	TN	Accuracy	F1-Score	Recall	Precision
	1e−5	890	54	143	913	90.14	90.26	94.41	86.45
	2e−5	860	84	123	933	89.64	90.26	91.74	88.35
8	3e−5	864	80	121	935	89.95	90.29	92.11	88.54
	4e−5	838	106	111	945	89.14	89.70	89.91	89.48
	5e−5	832	112	114	942	88.7	89.28	89.37	89.20

	1e−5	864	80	133	923	89.35	89.65	92.02	87.40
	2e−5	867	77	126	930	89.85	90.15	92.35	88.06
16	3e−5	858	86	135	921	88.94	89.28	91.45	87.21
	4e−5	852	92	102	954	90.30	90.77	91.20	90.34
	5e−5	849	95	119	937	89.30	89.75	90.79	88.73

	1e−5	873	71	138	918	89.55	89.77	92.82	86.93
	2e−5	866	78	117	939	90.25	90.59	92.33	88.92
32	3e−5	864	80	126	930	89.7	90.02	92.07	88.06
	4e−5	868	76	141	915	89.14	89.39	92.33	86.64
	5e−5	850	94	119	937	89.35	89.79	90.88	88.73

Tweets	True label	Predicted label
@USER @USER Absolutely! They’ve been blaming NHS structure, public, NHS staff 4 changing PPE 2 often & now claim those poor NHS staff who’ve died looking after Covid patients probably got it outside work! They’re completely devoid of any decency or respect for the dead or living! #ToryScum	INFORMATIVE	UNINFORMATIVE

I’ll pull a Kurt Kloss and ask peeps here if their companies have a written COVID policy on what happens when a coworker tests positive post the mandated work from home period. Curious to know if anyone’s read published solutions to the new workplace normal.	INFORMATIVE	UNINFORMATIVE

@USER In authoritarian countries, officials tend to sanitize ugly truths to please their big boss. Signs of that when sec. duque said that confirmed covid 19 in a Pinoy without travel history doesnt mean local transmission. Putok sa buho yung virus?	INFORMATIVE	UNINFORMATIVE

I have a sore throat, dry cough, headache & im feeling weak.Yesterday on @USER permanet secretary for ministry of health said if anyone suspects they have #COVID2019 they must stay at home & call a number there’s a team deployed to assist from home. I need the contact	INFORMATIVE	UNINFORMATIVE

Batch size	Learning rate	TP	FP	FN	TN	Accuracy	F1-Score	Recall	Precision
	1e−5	855	89	89	967	91.10	91.57	91.57	91.57
	2e−5	846	98	86	970	90.80	91.33	90.82	91.85
8	3e−5	853	91	93	963	90.8	91.27	91.36	91.19
	4e−5	863	81	110	946	90.45	90.83	92.11	89.58
	5e−5	875	69	125	931	90.3	90.56	93.10	88.16

	1e−5	860	84	111	945	90.25	90.64	91.83	89.48
	2e−5	874	70	126	930	90.2	90.46	93.00	88.06
16	3e−5	880	64	119	937	90.85	90.73	93.60	88.73
	4e−5	854	90	122	944	89.45	89.90	91.29	88.55
	5e−5	875	69	120	936	90.55	90.82	93.13	88.63

Tweets	True label	Predicted label
@USER @USER 2 weeks later: Artist Torey Lanez is the first celebrity to pass away from CoVid19. ?? He really gonna be roasting with that fever of 105.	UNINFORMATIVE	INFORMATIVE

@USER I don’t listen to any news since COVID19 killed the first 50 patients. I read news, but I don’t have to listen to the ??????? .	UNINFORMATIVE	INFORMATIVE

In light of the confirmation of a #COVID19 case in Uganda, President Museveni will today,at 4pm, address the country on what further steps to take in a bid to curb the possible spread of the pandemic. #MonitorUpdates #CoronavirusPandemic HTTPURL	UNINFORMATIVE	INFORMATIVE

I did notice from a recent Folha article that 9 out of the first 10 Coronavirus deaths in Brazil were all people who had died in private hospitals. This could be for several reasons but it’s definitely something to watch out for.	UNINFORMATIVE	INFORMATIVE

Batch size	Learning rate	TP	FP	FN	TN	Accuracy	F1-Score	Recall	Precision
	1e−5	854	90	127	929	89.14	89.71	91.16	87.97
	2e−5	862	82	122	934	89.80	90.15	91.92	88.44
8	3e−5	861	83	132	924	89.25	89.57	91.75	87.50
	4e−5	856	88	155	901	87.85	88.11	91.10	85.32
	5e−5	837	107	144	912	87.45	87.90	89.49	86.36

	1e−5	853	91	131	925	88.90	89.06	91.04	87.59
	2e−5	859	85	119	937	89.80	89.28	91.68	88.73
16	3e−5	839	105	118	938	88.85	90.18	89.93	88.82
	4e−5	847	97	131	925	88.6	89.37	90.50	87.59
	5e−5	869	75	136	920	89.45	89.02	92.46	87.12

	1e−5	867	77	150	906	88.64	88.86	92.16	85.79
	2e−5	864	80	151	905	88.44	88.68	91.87	85.70
32	3e−5	846	98	133	923	88.44	88.87	90.40	87.40
	4e−5	863	81	135	921	89.20	89.50	91.91	87.21
	5e−5	864	80	144	912	88.80	89.06	91.93	86.36

COVID-19 dataset	UNINFORMATIVE	INFORMATIVE
Training Tweets	3697	3303
Validation Tweets	528	472
Test Tweets	1056	944

Model	F1-score	Accuracy
Random Forest Classifier	81.58	82.06
SVM Classifier	82.71	82.04
CNN	83.73	83.06
BERT	88.97	89.10
DistilBERT	89.08	88.40
RoBERTa [3]	89.18	89.50
RoBERTa + MLP [43]	90.05	90.30
CT-BERT model [41]	90.96	91.40
CT-BERT + RoBERTa [40]	90.96	91.50
Proposed Model(MVEDL)	91.14	91.75

Tweet type	TOTAL TWEETS		INFORMATIVE TWEETS
	TEXTBLOB	Percentage	TEXTBLOB	Percentage
Total tweets in the dataset	226 668	100	19877	100
Total tweets with sentiment	226 668	100	19877	100
No. of positive tweets	98 504	43.46	10975	55.21
No. of negative tweets	53 566	23.63	4398	22.13
No. of neutral tweets	74 598	32.91	4504	22.66

Negative		Positive		Neutral
Words	Count	Words	Count	Words	Count
covid19	18605	covid19	40593	covid19	27976
covid	17236	covid	26675	coronavirus	20074
coronavirus	11362	coronavirus	22514	covid	15004
corona	5437	new	11848	corona	7407
virus	3518	people	11401	pandemic	3612
deaths	3002	cases	8300	people	3539
trump	2997	corona	8202	virus	2784
pandemic	2499	deaths	7292	deaths	2681
sick	2101	corona	6895	trump	2496
home	2016	virus	5190	cases	2255
death	1922	pandemic	5125	lockdown	1848
bad	1910	positive	3999	health	1819
new	1898	trump	3970	death	1785
cases	1894	great	3598	news	1763
patients	1796	health	3512	world	1672
ill	1755	death	3442	home	1510
work	1606	patients	3103	patients	1493
health	1525	news	3018	hydroxy
				chloroquine	1389
china	1519	safe	2707	china	1357
dead	1515	lockdown	2635	vaccine	1313

Positive		Negative		Neutral
Words	Count	Words	Count	Words	Count
covid19	5916	covid19	2068	covid19	2095
cases	5666	covid	1519	coronavirus	1481
new	4469	deaths	1002	deaths	1118
deaths	3447	coronavirus	966	cases	1092
coronavirus	3372	cases	852	covid	952
positive	2293	people	708	died	489
covid	2134	died	576	death	434
confirmed	1779	hospital	547	people	351
total	1656	dead	511	hospital	329
people	1109	100000	486	total	311
tested	1102	new	410	home	309
reported	955	tested	384	100000	256
died	926	virus	340	ewing	220
death	816	home	321	corona	208
100000	680	americans	312	dies	200
reports	664	sick	309	symptoms	188
county	656	death	301	trump	187
recovered	640	symptoms	298	reported	160
health	613	negative	290	2020	157
hospital	609	positive	269	county	152

PERMALINK

COVID-19 outbreak: An ensemble pre-trained deep learning model for detecting informative tweets

SreeJagadeesh Malla

Alphonse PJA

Abstract

1. Introduction

Table 1.

2. Related work

Table 2.

Table 3.

Table 4.

3. Framework methodology

Fig. 1.

3.1. Tweets collection and data preprocessing

3.2. Data preprocessing

3.3. RoBERTa

Table 6.

Table 5.

Table 7.

3.4. CT-BERT

Table 9.

Table 8.

Table 10.

3.5. BERTweet

Table 12.

Table 11.

Table 13.

3.6. Majority voting

4. Experimental results and analysis

4.1. Dataset

Table 14.

4.2. Experiment setup

4.3. Performance measures

4.3.1. Confusion matrix

Fig. 2.

4.4. Performance analysis

4.4.1. Performance metrics evaluation among state-of-art machine learning models

Fig. 3.

4.4.2. Performance metrics evaluation among state-of-art deep learning models

Fig. 4.

Fig. 6.

4.4.3. Performance metrics evaluation among ensemble deep learning models

Fig. 5.

4.5. Proposed model comparison with ensemble deep learning techniques

Fig. 7.

Table 15.

Fig. 8.

Table 16.

5. Real-time application

Fig. 9.

Table 17.

Fig. 10.

Table 18.

Table 19.

6. Conclusion

Declaration of Competing Interest

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases