IEEE Access. 2020 Nov 18;8:209127–209137. doi: 10.1109/ACCESS.2020.3039168

Unlink the Link Between COVID-19 and 5G Networks: An NLP and SNA Based Approach

Mohammed Bahja, Ghazanfar Ali Safdar
PMCID: PMC8545258  PMID: 34812369

Abstract

Social media facilitates the rapid dissemination of both factual and fictional information. The spread of non-scientific information through social media platforms such as Twitter has the potential to cause damaging consequences. Situations such as the COVID-19 pandemic provide a favourable environment for misinformation to thrive. The upcoming 5G technology is one of the recent victims of misinformation and fake news, particularly about the effects of its radiation. During the COVID-19 pandemic, conspiracy theories linking the cause of the pandemic to 5G technology have resonated with a section of people, leading to outcomes such as destructive attacks on 5G towers. Analysis of social network data can help to understand the nature of the information being spread and identify its commonly occurring themes. Natural language processing (NLP) and statistical analysis of social network data can empower policymakers to understand the misinformation being spread and develop targeted strategies to counter it. In this paper, an NLP based analysis of tweets linking COVID-19 to 5G is presented. Models including Latent Dirichlet allocation (LDA), sentiment analysis (SA) and social network analysis (SNA) were applied for the analysis of the tweets and identification of topics. An understanding of the topic frequencies, the inter-relationships between topics and the geographical occurrence of the tweets allows agencies and patterns in the spread of misinformation to be identified and equips policymakers with knowledge to devise counter-strategies.

Keywords: 5G conspiracy, corona-5G link, COVID-19, radiation scare, topic modelling, tweet analysis

I. Introduction

Information can be a boon when it is reliable, authorised, and validated. Conversely, it can be a curse if it is misused or overloaded. Information is a key resource in handling serious issues such as pandemics, which require effective dissemination of reliable, high-quality information across all stakeholders, including the public. However, the increased use of social media technologies such as Facebook, WhatsApp and YouTube has empowered individuals to disseminate information reflecting their personal thoughts and perceptions about an issue, and sometimes even to spread hatred, myths or conspiracies. Even a small amount of misinformation disseminated on online platforms can lead to serious issues that affect not only individuals but also governments and society as a whole [1].

The recent attacks on 5G towers/masts signify the importance of understanding the nature and attitudes of the public in sharing misinformation, and how myths related to 5G developed and spread so rapidly. The inconsistencies between trusted and non-trusted sources have to be investigated and reasoned about in order to prevent the spread of misinformation and change people's attitudes towards unauthorised information sources and wrong information. By adopting technologies such as Artificial Intelligence (AI), Machine Learning (ML), and NLP, online information, its sources, and the pattern and process of its sharing can be analysed. A recent study by Groza [2] used Description Logics for detecting inconsistencies between trusted medical sources and non-trusted ones. The study identified that non-trusted information comes in natural language, while trusted information comes in a more formal language. Therefore, applying semantic reasoning and NLP techniques can identify the relationships between the types of information and how they were shared by the public.

The work presented in this paper investigates the factors that led to violent attacks on 5G infrastructure by reviewing and analysing tweets related to 5G and COVID-19 in the UK using NLP and text-mining techniques. Investigating these factors may contribute to information management and awareness creation during the COVID-19 pandemic, and can also be used to develop strategies to prevent information misuse in similar situations in the future.

Statistical analysis of the tweets can unearth vital information related to the geographical spread of misinformation, and the frequently occurring terms and themes in the misinformation. This knowledge can be used to develop customized approaches to counter misinformation. For instance, the geographic location of the tweets and their occurrence frequency (discussed in V-D) can identify the hotspots or regions that are more prone to consumption of misinformation. Granular data about the nature of the information from a geographical perspective can help local governmental agencies to create strategic awareness programs to counter the spread of misinformation and potentially reduce the further spread of fake news and conspiracy theories. The objective of the paper is to apply NLP techniques to the tweet dataset related to COVID-19 and 5G in order to perform statistical analysis and identify themes and topics from the tweets, and thereby understand the spread of misinformation.

The rest of the paper is organised as follows. Section II discusses the recent spread of misinformation related to the COVID-19 pandemic and its link to the 5G technology, i.e. the Myth. Section III presents the well-known models adopted in our experiments and subsequent analysis. The experiments conducted are presented in Section IV, and the results and analysis are outlined in Section V. Finally, Section VI discusses our work and the limitations of our study, followed by the conclusion in Section VII.

II. The Myth

Various myths and misinformation have been circulating on online platforms in relation to the recent COVID-19 outbreak, and these have resulted in severe losses. For instance, rumours in Iran that drinking raw alcohol cures COVID-19 have resulted in many deaths [3]; similarly, the conspiracy theory linking 5G with COVID-19 has resulted in more than 20 attacks on masts in the UK [4]. In this context, Singh et al. [5] identified that a meaningful spatio-temporal relationship exists between myths, and that myths are linked to poor quality information in Twitter discussions. Therefore, there is an immediate need to contain the spread of misinformation on online platforms and to increase public awareness through various channels using an evidence-based approach.

Jelnov [6] worked on delinking the myth and reported that the virus is not very dangerous, based on correlating the log of tests with reported cases, as well as reported cases with deaths per capita. That work suggested a mortality rate of 0.4% from COVID-19 in a cross-country comparison. However, Constantinou et al. [7] argued that science has been failing to convince people about COVID-19 findings and suggested that measures need to be taken. They identified that myths and conspiracy theories were believed even by highly educated individuals and that such beliefs could be predictors of health-related risky behaviour, such as refusing social distancing, pushing for mass gatherings for demonstrations, and refusing future vaccinations.

In a different context, Laato et al. [8] investigated why people share misinformation during the COVID-19 pandemic, and found that a person's trust in online information and perceived information overload are strong predictors of unverified information sharing. In addition, these factors, along with a person's perceived COVID-19 severity and vulnerability, influence cyberchondria [8]. Similarly, in a study conducted by Allington and Dhavan [9], the UK public exhibited strong acceptance of the conspiracy belief that 'the symptoms of COVID-19 seem to be connected to 5G mobile network radiation', in contrast to other conspiracy beliefs such as 'the virus was created in a lab' and 'COVID-19 pandemic was planned by pharmaceutical companies'. Cushion et al. likewise identified that the UK public were more involved in identifying and circulating fake news than in identifying important information about the UK death toll and the impact of COVID-19 on the UK population [10].

III. NLP Models

A. LDA Models

Latent Dirichlet Allocation (LDA) [11] is a generative topic model that is widely used in the literature and has shown good performance in analysing large, noisy datasets [12]. LDA is an unsupervised approach and can identify themes and topics from a dataset without requiring the dataset to be annotated [13]. It assumes that each topic is a distribution over words and that each document has a certain distribution over topics. Variations of LDA modelling are distinguished by the number of words used to define a topic and are referred to as n-gram modelling [14]. For instance, unigram models identify topics from a distribution of single words, and bigram models identify topics from a distribution of pairs of words. LDA models identify the topics present in each document; grouping these observations across documents reveals the recurring topics in the corpus.
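The unigram and bigram inputs described above can be illustrated with a minimal sketch. It only shows how a tokenised document is turned into n-gram counts before topic modelling; it is not the authors' pipeline, and the example tweets and the function names `ngrams` and `bag_of_ngrams` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-token tuples of a tokenised document."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(documents, n):
    """Map each tokenised document to a Counter of its n-grams."""
    return [Counter(ngrams(doc, n)) for doc in documents]


# Hypothetical, already-cleaned tweets (not taken from the dataset).
docs = [
    ["radiation", "cause", "coronavirus"],
    ["conspiracy", "theory", "cause", "coronavirus"],
]
unigram_bags = bag_of_ngrams(docs, 1)  # input for a unigram model
bigram_bags = bag_of_ngrams(docs, 2)   # input for a bigram model
```

A topic model trained on `bigram_bags` would describe each topic as a distribution over word pairs such as ("cause", "coronavirus") rather than over single words.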

B. Sentiment Analysis

Sentiment analysis (SA) is a natural language processing method to identify the sentiment or opinion contained in a given piece of data [15]. As opinion is subjective, sentiment analysis extracts the subjectivity in the given text [16]. SA classifies the opinion or the identified subjectivity in the text into different classes, most frequently binary ones such as positive or negative sentiment. SA enables a computational study of people's opinions, sentiments, attitudes, and emotions towards an entity. The entity can be another individual or a public figure, a product such as a film or an electronic device, or a service provider such as a restaurant or a hospital. Recent advancements in machine learning have been well explored for SA, and several studies are available based on techniques such as support vector machines [17], [18], naive Bayes [19], [20], strength of association [21], and advanced deep learning approaches [22]–[24]. SA has been explored for applications such as user reviews [25], feedback forum analysis [26], patient experience [27], social media data analysis [28], market intelligence [29], and public mood observation.

C. Social Network Analysis

Social network analysis (SNA) provides methods to determine relationships between entities (e.g., people or groups) [30]. In our study, SNA is performed using a centrality-based method for network analysis [31] and co-occurrence analysis [32] of words to visualize the network. The co-occurrence analysis identifies the frequency of keywords that belong to similar themes and topics and describes the relationships among the keywords. Further, a co-occurrence network for noun bigrams is constructed to visualize the relationships between the different terms in the network.

IV. Experiments

A. Dataset Collection and Pre-Processing

The dataset presented in [33] is used in our experiments. It is the first publicly available coronavirus related multi-lingual Twitter dataset. The tweets were collected from 28 January 2020 onwards using the application programming interface (API) provided by Twitter. Using various COVID-19 pandemic related keywords, tweets from as early as 21 January 2020 were captured in the dataset. Over 50 million COVID-19 related tweets are indexed by the dataset. For more details on the dataset, please refer to [34].

Our work requires COVID-19 tweets in the context of 5G technology, so the pre-processing step involved filtering the tweets. The keywords presented in Table 1 were used to select the COVID-19 tweets pertaining to 5G. The tweets identified are dated from 21 January 2020 until 18 April 2020. Further pre-processing included the removal of duplicates, both by tweet ID and by duplicated content. After filtering and removal of duplicates, a total of 82,043 tweets were available for analysis. The other standard data cleaning steps listed in Table 2 were also applied during the pre-processing stage.

TABLE 1. Keywords to Filter the Tweets Corpus.

Keywords
fifth generation
Wireless communications
Towers radiation
Radiation poisoning
5G radiation
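The filtering and de-duplication step can be sketched as follows. The keyword list is taken from Table 1, but everything else is an assumption made for illustration: the paper does not state how keywords were matched (case-insensitive substring matching is assumed here), and the `id`/`text` record layout and the function names are hypothetical.

```python
# Keywords from Table 1, lower-cased for case-insensitive matching (assumption).
KEYWORDS = [
    "fifth generation",
    "wireless communications",
    "towers radiation",
    "radiation poisoning",
    "5g radiation",
]


def matches_keywords(text, keywords=KEYWORDS):
    """True if any keyword occurs as a substring of the lower-cased text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def filter_and_deduplicate(tweets):
    """Keep keyword-matching tweets, dropping repeats by ID or by content.

    `tweets` is an iterable of dicts with hypothetical keys 'id' and 'text'.
    """
    seen_ids, seen_texts, kept = set(), set(), []
    for tweet in tweets:
        if not matches_keywords(tweet["text"]):
            continue
        # Normalise case and whitespace so trivially different copies collide.
        normalised = " ".join(tweet["text"].lower().split())
        if tweet["id"] in seen_ids or normalised in seen_texts:
            continue
        seen_ids.add(tweet["id"])
        seen_texts.add(normalised)
        kept.append(tweet)
    return kept
```

Substring matching is the simplest choice; a stricter matcher (for example on token boundaries) would select fewer tweets, so the count of 82,043 reported in the text should not be expected from this sketch.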

TABLE 2. Data Cleaning Prior to Analysis.

Pre-processing step | Function
Fix abbreviations | Replace short words with full words
Remove irrelevant characters | Remove redundant characters including links, email IDs
Fix word lengthening | Removal of additional, repeated characters
Stopword removal | Removal of words including “the”, “an” and customized stopwords
SpellCheck | Fix wrong spelling
Punctuation removal | Remove punctuations
Lemmatization | Group words with similar meanings to a single item
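Several of the cleaning steps in Table 2 can be sketched with the standard library alone. This is an illustrative sketch, not the authors' implementation: the stopword set and abbreviation map are tiny hypothetical stand-ins, collapsing repeated characters to two is an assumed heuristic, and spell checking and lemmatization are omitted because they need external language resources.

```python
import re
import string

# Hypothetical stand-ins; the paper's actual lists are not given.
STOPWORDS = {"the", "an", "a", "is", "of"}
ABBREVIATIONS = {"govt": "government", "ppl": "people"}


def clean_tweet(text):
    """Apply a subset of the Table 2 steps and return the remaining tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove links
    text = re.sub(r"\S+@\S+", " ", text)                  # remove email IDs
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)            # "sooooo" -> "soo"
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return [token for token in tokens if token not in STOPWORDS]
```

Links and email addresses are removed before punctuation so that their delimiters are still present when the patterns are applied.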

B. Optimal Topic Number

The first stage of our analysis concerned finding the optimal number of topics to represent the contents of the evaluated dataset, and focused on topics consisting of nouns only. Multiple LDA models, each with a different topic number N drawn from a range of candidate values, were evaluated using the coherence score as the metric. The coherence score assesses the quality of the learned topics by measuring the relative distances between the words in a topic [35]. A high coherence score indicates a high probability that the words belong to a particular topic.

From the evaluation of our LDA models, it was observed that the highest coherence score of 0.52 is obtained for the topic number N=35. Therefore, the optimal number of topics to represent the dataset under evaluation is N=35. Figure 1 shows the coherence score for each evaluated value of N.
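The selection procedure can be sketched as below. The paper does not say which coherence measure was used, so this sketch substitutes the simple UMass measure, which scores a topic from document co-occurrence counts; its values are not comparable to the 0.52 reported above. The function names and all scores other than the pair N=35, 0.52 are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents, eps=1.0):
    """Mean UMass coherence of one topic.

    `top_words` is ordered from most to least probable; each pair contributes
    log((D(w_i, w_j) + eps) / D(w_i)), with w_i the higher-ranked word and
    D(.) the number of documents containing all the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        denominator = doc_freq(top_words[i])
        if denominator == 0:
            continue  # word never observed; pair carries no evidence
        joint = doc_freq(top_words[i], top_words[j])
        scores.append(math.log((joint + eps) / denominator))
    return sum(scores) / len(scores) if scores else 0.0


def best_topic_number(coherence_by_n):
    """Return the topic number N with the highest coherence score."""
    return max(coherence_by_n, key=coherence_by_n.get)


# Only the pair (35, 0.52) comes from the text; the rest are made up.
example_scores = {10: 0.41, 20: 0.47, 35: 0.52, 50: 0.49}
chosen_n = best_topic_number(example_scores)
```

In practice one coherence value per candidate N would be obtained by training an LDA model with N topics and averaging the per-topic coherence.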

FIGURE 1.

Coherence score for different number of topics.

Finally, after 150 iterations of LDA analysis at N=35, the LDA model for topics with nouns only was built. An intertopic distance map, as presented in [36], allows the distance between the identified topics to be visualized. The distance map reflects how similar or distinct the topics are from each other, as well as the relative size of the topics. Figure 2 visualizes the intertopic distance map of the 35 topics recognized by the LDA model.

FIGURE 2.

Intertopic distance map for N=35 number of topics.

V. Results and Analysis

A. LDA Analysis

The LDA analysis via unigram and bigram modelling was applied to the dataset under eight different study conditions. Each variation attempts to identify topics from a different selection of the text: the full text, nouns, adjectives, verbs and adverbs. Although the optimal number of topics was identified to be N=35, our LDA implementation was restricted to 20 topics, because beyond approximately 20 topics the word distributions tend to become repetitive and do not provide further meaningful insights. Each of the eight LDA studies identified 20 topics, and each topic is represented by a distribution of 20 words. Once the topics were identified in each LDA attempt, the word distribution of each topic was manually analyzed and labelled. Table 3 shows an example labelling of the word distribution of an identified topic.

TABLE 3. Manual Labelling of a Topic for a Word Distribution Identified by the LDA Model.

Words belonging to a topic | Topic label
coronavirus pastor covid conspiracy medium chris radiation comment cause video china world holmes virus technology corona eamonn news effect use | 5G conspiracy
china huawei network security coronavirus decision technology trump boris trade canada johnson supply deal role risk britain threat contract government | 5G threat

The entire analysis, comprising eight LDA studies, is presented in Table 4: unigram plain text; unigram nouns; unigram nouns and adjectives; unigram nouns, adjectives, verbs and adverbs; unigram nouns and 5G keywords; bigram plain text; bigram nouns; and bigram nouns and adjectives. It can be observed from Table 4 that several topics are repeated and closely related. The reason for the similar and repeated topics is the homogeneity of the dataset under evaluation. Topic modelling performs better on diverse datasets that contain heterogeneous categories (e.g. news articles) with little to no restriction of target areas. In our study, we restrict the data to 5G related COVID-19 tweets only, which limits the heterogeneity of the dataset and leads to clustering and duplication of topics. For instance, Figure 3 shows the inter-topic distance maps of four of the eight LDA studies. The distribution of topics varies considerably between studies; however, it can also be observed that the intertopic distance is relatively low and the topics tend to cluster. The inter-topic distance map is based on the LDAvis method of topic visualization presented by Sievert and Shirley [36].

TABLE 4. Topics Identified for Different Attempts.

Topics Identified
Uni-gram topics Bi-gram topics
Clean-text Nouns Nouns-adjectives Noun-Adj Verb-Adverb Noun-5G Keyword Clean-text Nouns Nouns-adjectives
5G rollout 5G conspiracy corona spread coronavirus spread 5G radiation spread 5G conspiracy theory conspiracy theory dismiss conspiracy
5G tower 5G radiation 5G rollout radiation effects 5G rollout threat Huawei threat radiation effects 5G spreads corona
5G network spread Chinese companies threat Huawei threat 5G tower corona vaccination 5G conspiracy huawei links conspiracy theory
5G phone usage 5G secretive lockdown causes Chinese companies 5G effects 5G tower war speculation 5g tower conspiracy
Chinese companies China control radiation effects 5G conspiracy Huawei politics radiation effects huawei global threat radiation global threat
Chinese product security Debunk conspiracy coronavirus effects Radiation effects Huawei threat Huawei threat corona diagnosis huawei threat
5G causes 5G towers 5G tower huawei threat Huawei global dominance 5G threat huawei threat corona epidemic effects
5G Conspiracy Diagnosis treatment radiation effects world technology during corona technology during corona 5G conspiracy 5G tower 5g tower threats
5G radiation effects Dismiss conspiracy chinese mobile companies 5G conspiracy corona spread corona spread radiation fears conspiracy theory
5G links 5G causes technologies during pandemic 5G conspiracy Blame for virus 5G Conspiracy huawei link 5G threat UK
Investigate 5G 5G towers 5G effects 5G rollout 5G effects health 5G threats conspiracy theory corona awareness
5G spread Radiation effects chinese companies Tower effects 5G global threat 5G conspiracy 5G causes 5G links
5G causes Radiation causes 5G conspiracy huawei threat 5G tower 5G conspiracy theory huawei global threat conspiracy theory
5G radiation 5G threat virus spread corona effects 5G threat Huawei links corona news discrimination corona
Chinese companies Huawei threat 5G conspiracy radiation effects 5G threat conspiracy theory 5G links huawei links
5G links radiation effects 5G radiation lockdown war huawei link corona awareness radiation threats
5G spread tower conspiracy 5G conspiracy corona effects Huawei threat corona spread Huawei global threat conspiracy theory
Chinese companies Huawei threat 5G tower effects Huawei threat Technology effects Huawei global risk 5G radiation radiation threats
5G conspiracy 5G conspiracy 5G conspiracy war Huawei threat conspiracy theory corona news china conspiracy
5G radiation effects corona effects Chinese companies corona spread conspiracy 5G links huawei conspiracy theory huawei threat

FIGURE 3.

Intertopic distance map of topics identified of four LDA analysis studies: (a) Unigram Nouns (b) Unigram Noun-adjective pair (c) Unigram noun-5G words (d) Bigram nouns.

The intertopic distance map shown in Figure 3 plots the topics on two axes, principal component 1 (PC1) and principal component 2 (PC2), which correspond to the principal components of the topic space. In the context of the presented research, the x-axis can be seen as representing the subject “coronavirus” and the y-axis the subject “5G”. Each quadrant can be interpreted as indicating the relevance of topics to these subjects. Topics in the first quadrant imply that the tweets are related to both coronavirus and 5G. Similarly, topics in the second quadrant are more related to 5G, topics in the third quadrant are not significantly related to either subject, and topics in the fourth quadrant imply that the tweets are more related to coronavirus and less related to 5G. The topic centres are determined by computing the Jensen-Shannon divergence [37] between topics, which measures the similarity between two probability distributions; the inter-topic distances are then projected onto two dimensions with multi-dimensional scaling. The area of each circle is proportional to the prevalence of the topic in the corpus.
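The Jensen-Shannon divergence used for the inter-topic distances can be written in a few lines. The sketch below assumes that each topic is given as a probability vector over the same vocabulary and uses base-2 logarithms, so the result lies between 0 (identical distributions) and 1 (no shared support); the function names are illustrative and the projection step via multi-dimensional scaling is not shown.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 vanish."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence: the mean KL divergence of p and q from their midpoint."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


# Two hypothetical topics over a three-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]
distance = jensen_shannon_divergence(topic_a, topic_b)
```

Unlike the plain KL divergence, this quantity is symmetric and always finite, which is why it is suitable as an input to a distance-based projection.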

B. Sentiment Analysis

In the second part of our work, we performed sentiment analysis (SA) on the topics identified by the LDA models. SA on the LDA topics classifies whether a topic carries a positive or negative sentiment. Identifying the overall sentiment score of the topics provides insights into the ‘emotions’ carried by the topics. A negative score implies that the tweets carry negative emotions such as unhappiness, anger or fear. The Valence Aware Dictionary and sEntiment Reasoner (VADER) model of SA presented in [38] was used to identify the sentiment of the topics. VADER is a popular SA model and has demonstrated good performance in various studies. Identifying the general sentiment of a specific topic helps in determining the general emotion behind the tweets. For each topic, a mean sentiment score is calculated by averaging the sentiment scores of all tweets belonging to that topic.
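The per-topic averaging can be sketched as follows. To keep the example self-contained, the tweet-level scorer is a toy word lexicon and not VADER; the lexicon values, topic labels and function names are all hypothetical, and only the averaging step mirrors the description above.

```python
from collections import defaultdict

# Toy lexicon standing in for a real sentiment model such as VADER.
TOY_LEXICON = {"safe": 1.0, "good": 0.8, "fear": -0.8, "kill": -1.0, "threat": -0.7}


def tweet_score(tokens, lexicon=TOY_LEXICON):
    """Mean valence of the lexicon words in a tweet; 0.0 if none are present."""
    hits = [lexicon[token] for token in tokens if token in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def mean_topic_sentiment(labelled_tweets):
    """Average the tweet scores per topic.

    `labelled_tweets` is an iterable of (topic_label, tokens) pairs.
    """
    scores_by_topic = defaultdict(list)
    for topic, tokens in labelled_tweets:
        scores_by_topic[topic].append(tweet_score(tokens))
    return {topic: sum(s) / len(s) for topic, s in scores_by_topic.items()}
```

A topic whose mean score is below zero would fall in the negative region of a plot like Figure 4.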

Figure 4 illustrates the sentiment scores of all the topics identified in our eight-step analysis. Topics with positive sentiment have a score above zero and are highlighted in the figure with a green background. It can be observed that the majority of the topics have negative scores, indicating that the 5G related COVID-19 tweets have largely carried negative sentiment, which may include emotions such as anger, hatred or fear.

FIGURE 4.

Sentiment score distribution of the topics identified. Positive sentiment topics are represented above the zero axis (red line).

C. Social Network Analysis (SNA)

A co-occurrence network consists of nodes and edges. Nodes are objects or agents, and edges define the connections between them. To identify co-occurring words, bigrams are extracted from the tweets: each word is a node, and each bigram is an edge describing the co-occurrence of its two words.
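A minimal sketch of building such a network from tokenised tweets is given below. It assumes that bigrams are pairs of adjacent words, that the network is undirected, and that a word paired with itself does not form an edge; these are illustrative choices rather than details stated in the paper, and the function names are hypothetical.

```python
from collections import Counter


def adjacent_bigrams(tokens):
    """Pairs of neighbouring words in one tokenised tweet."""
    return list(zip(tokens, tokens[1:]))


def build_cooccurrence_network(documents):
    """Return (bigram_counts, edge_weights, nodes) for a list of tweets.

    Edges are undirected, so (a, b) and (b, a) share one edge whose weight is
    the sum of both counts; self-pairs such as (a, a) are skipped.
    """
    bigram_counts = Counter()
    for tokens in documents:
        bigram_counts.update(adjacent_bigrams(tokens))

    edge_weights = Counter()
    for (first, second), count in bigram_counts.items():
        if first != second:
            edge_weights[frozenset((first, second))] += count

    nodes = {word for edge in edge_weights for word in edge}
    return bigram_counts, edge_weights, nodes
```

Under these assumptions the number of edges can be smaller than the number of unique ordered bigrams, since reversed pairs merge and self-pairs are dropped.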

Prior to the network analysis, several pre-processing steps were performed on the dataset for the network construction. These steps included filtering out tweets prior to January 2020, eliminating inconsistencies in the naming of the term “coronavirus”, removal of one-letter words, lemmatizing nouns, and other steps such as eliminating set entities, combining/trimming named entities, and space removal. The processed dataset is analyzed to identify the bigrams and build the co-occurrence network. Table 5 provides the statistics of the network analysis. It can be observed that the number of nodes is lower than the number of edges. The connections between the nodes (i.e. edges) correspond to the identified bigrams, of which 130,381 are unique (Table 5). In Table 6, the top 20 bigrams and their occurrence counts are listed.

TABLE 5. Descriptive Statistics of the Nodes co-Occurrence.

Measure | Statistic
N bigrams | 205562
N unique bigrams | 130381
Mean bigram frequency | 1.57663
Number of nodes | 19102
Number of edges | 120014
Average degree | 12.5656
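The two derived rows of Table 5 follow from its raw counts, as the short check below shows. It assumes the usual definitions: mean bigram frequency is total bigrams divided by unique bigrams, and the average degree of an undirected graph is twice the number of edges divided by the number of nodes. The variable names are illustrative.

```python
# Raw counts copied from Table 5.
n_bigrams = 205562
n_unique_bigrams = 130381
n_nodes = 19102
n_edges = 120014

# Each unique bigram occurs this many times on average.
mean_bigram_frequency = n_bigrams / n_unique_bigrams

# Every undirected edge contributes to the degree of two nodes.
average_degree = 2 * n_edges / n_nodes

print(f"mean bigram frequency: {mean_bigram_frequency:.5f}")
print(f"average degree: {average_degree:.4f}")
```

Both values agree with the table to the precision reported there.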

TABLE 6. Top 20 Bigrams and Their Occurrence Frequency.

Bigram | Count
(cause, coronavirus) | 1268
(conspiracy, theory) | 1012
(coronavirus, conspiracy) | 514
(coronavirus, coronavirus) | 497
(coronavirus, cause) | 268
(spread, coronavirus) | 268
(link, coronavirus) | 255
(china, huawei) | 227
(immune, system) | 223
(coronavirus, outbreak) | 212
(theory, coronavirus) | 204
(base, station) | 179
(kill, people) | 171
(people, coronavirus) | 170
(people, cause) | 170
(huawei, network) | 157
(coronavirus, people) | 141
(connection, coronavirus) | 139
(coronavirus, virus) | 134
(china, coronavirus) | 132

Further analysis of the network is performed by determining the importance of each node. Our analysis is based on centrality measures and is applied to determine the influential keywords in the dataset. Two measures are used: degree centrality and betweenness centrality [39]. The degree centrality of a node is the number of edges incident to it, normalized by dividing by the maximum possible degree in a simple graph, n-1, where n is the number of nodes. The degree centrality indicates the importance or influence of a particular node in the network. The betweenness centrality of a node is the fraction of all shortest paths between pairs of other nodes that pass through it, i.e. the more frequently a node acts as a “bridge” connecting other nodes to each other, the higher its centrality. Table 7 shows the degree centrality and betweenness centrality of the top-20 nodes.
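Both measures can be computed on a small graph with the sketch below. It assumes an unweighted, undirected graph stored as a dictionary from each node to the set of its neighbours, and it uses Brandes' algorithm for betweenness; the paper does not say which implementation was used, and the function names are illustrative.

```python
from collections import deque


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree, n - 1."""
    n = len(adj)
    if n < 2:
        return {node: 0.0 for node in adj}
    return {node: len(neighbours) / (n - 1) for node, neighbours in adj.items()}


def betweenness_centrality(adj):
    """Normalised betweenness via Brandes' algorithm (unweighted, undirected)."""
    betweenness = dict.fromkeys(adj, 0.0)
    for source in adj:
        # Breadth-first search counting shortest paths from `source`.
        order = []
        predecessors = {node: [] for node in adj}
        path_count = dict.fromkeys(adj, 0)
        path_count[source] = 1
        distance = dict.fromkeys(adj, -1)
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        dependency = dict.fromkeys(adj, 0.0)
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                betweenness[w] += dependency[w]
    n = len(adj)
    # Each unordered pair is visited from both endpoints, and there are
    # (n - 1)(n - 2) / 2 pairs of other nodes, hence this scale factor.
    scale = 1 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {node: value * scale for node, value in betweenness.items()}
```

On a star-shaped graph the hub obtains a value of 1.0 for both measures, which matches the "bridge" intuition given above.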

TABLE 7. Top 20 Nodes With Centrality Measures.

Node (by betweenness) | Betweenness centrality | Node (by degree) | Degree centrality
coronavirus | 0.209812 | coronavirus | 0.198995
china | 0.126235 | china | 0.149940
people | 0.043556 | people | 0.082771
network | 0.036148 | network | 0.071096
virus | 0.031834 | virus | 0.068164
cause | 0.031083 | cause | 0.059526
huawei | 0.026467 | huawei | 0.059159
technology | 0.026042 | technology | 0.056437
world | 0.024576 | world | 0.055966
use | 0.021242 | use | 0.050835
conspiracy | 0.018418 | conspiracy | 0.046071
country | 0.017610 | country | 0.045600
radiation | 0.016119 | time | 0.043191
time | 0.015268 | radiation | 0.041935
phone | 0.013227 | government | 0.037380
news | 0.012584 | tower | 0.035757
vaccine | 0.012084 | phone | 0.035653
tower | 0.011107 | system | 0.034763
government | 0.010763 | vaccine | 0.033558
system | 0.010098 | thing | 0.032773

Figure 6 shows the network visualization of the nodes within the network. The nodes are connected through edges (i.e. lines). The thicker the line, the stronger the relationship between the two nodes. The network reflects the bigram-noun model, and the nodes in the diagram are similar to the words identified in the noun-bigram version of the topic modelling. It can be noted that the node “coronavirus” has the highest number of edges in the graph and has a stronger relationship with nodes such as “telecom”.

FIGURE 6.

Co-occurrence network indicating the relationship between the nodes and the strength between each of the nodes.

D. Statistical Analysis

The most commonly occurring words in 5G related COVID-19 tweets were analyzed from the dataset and from the LDA model outcomes.

1). Tweet Frequency – Geographical Occurrence

The occurrence of 5G related COVID-19 tweets by country was analyzed. Table 8 lists the countries with the highest number of tweets on the 5G-COVID subject. The highest numbers of tweets were observed in the USA, the UK and Canada. The geographical occurrence of the tweets is significant because it correlates with the spread of misinformation and its damaging consequences. For instance, in the UK, where a high number of tweets occurred, a significant number of attacks on 5G masts were reported [40], [41]. The number of tweets, their geolocation and the speed of their spread can be a vital tool for agencies to counter misinformation and to focus awareness creation on target areas.

TABLE 8. Top Three Countries With Highest Number of 5G-COVID19 Tweets.

Country | Number of tweets
USA | ~7500
UK | ~5000
Canada | ~2000

2). Word Frequency - Dataset

The first analysis identified the most frequently occurring words in the dataset. Figure 5 is a WordCloud visualization of these words. It can be observed that, apart from the obvious ‘5G’ and ‘coronavirus’, the most frequent words include ‘China’, ‘Huawei’, ‘network’, ‘technology’, ‘radiation’ and ‘tower’.
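The counts behind such a word cloud can be obtained with a simple frequency table. The sketch below is illustrative only: the example tweets are hypothetical, and the `exclude` parameter is an assumed convenience for leaving out the obvious query terms.

```python
from collections import Counter


def top_words(documents, k=10, exclude=()):
    """Return the k most frequent tokens across all tokenised documents."""
    counts = Counter(
        token for doc in documents for token in doc if token not in exclude
    )
    return counts.most_common(k)


# Hypothetical cleaned tweets (not taken from the dataset).
docs = [
    ["5g", "china", "huawei"],
    ["china", "radiation"],
    ["china", "huawei", "coronavirus"],
]
frequent = top_words(docs, k=2, exclude={"5g", "coronavirus"})
```

A word cloud then simply scales the font size of each word by its count.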

FIGURE 5.

WordCloud of the frequently occurring terms.

3). Word Frequency – LDA Models

Similarly, word frequency analysis was performed on the topics identified by each of the LDA models. Figure 7 displays the word frequency counts for four of the eight LDA model variations evaluated. It can be noted again that, apart from the 5G and COVID-19 related terms, some of the most frequently occurring words are 5G technology related words such as ‘Huawei’, ‘network’, ‘technology’, ‘radiation’ and ‘conspiracy’.

FIGURE 7.

Frequently occurring terms identified during LDA analysis for different attempts: (a) Noun (b) Noun-adjective pair (c) Bigram word (d) Bigram adjectives.

The most frequent words indicate that technology related words are frequently propagated in the tweets. Further, the occurrence of terms like ‘conspiracy’ and of frequent nouns such as ‘China’, ‘Huawei’, ‘Trump’ and ‘USA’ indicates that political clashes manifest themselves in conspiracies and the spread of misinformation.

VI. Discussion - Limitation of Study

A pandemic situation provides a conducive environment for the spread of false information. Non-scientific claims such as those about 5G radiation effects received a significant boost during the COVID-19 crisis. Social media platforms such as Twitter provide miscreants with an effective tool to accelerate the spread of misinformation. In this paper, we analysed the tweets related to the 5G and COVID-19 scare with the goal of understanding the tweet trends. The LDA analysis of the tweets enabled us to identify several topics. An overview of these topics indicates that the majority concern the conspiracy behind the COVID-19 pandemic, as evidenced by the large corpus of tweets expressing the belief that 5G technology causes COVID-19.

Our analysis observed that China and Huawei were frequently discussed in the tweets. Other frequently occurring terms and topics include 5G towers, radiation effects and networks. The majority of topics are related to 5G radiation and tower effects, and to conspiracy theories against China and Huawei.

Our study satisfies the evaluation metrics proposed in the four dimensions of social network analysis by Camacho et al. [42]. The evaluation metrics are: (a) Pattern and knowledge discovery: our study identifies the themes and topics from the tweet corpus; (b) Scalability: the presented approaches can work on larger datasets and potentially allow application of deep learning based techniques; (c) Information fusion and integration: text data from different social media platforms can be included for further analysis, which is potential future work; (d) Visualization: the LDAvis approach and the other data visualizations presented in our study provide an insightful representation of the themes and topics of the tweet data.

We believe that an understanding of the themes and trends in the tweets is crucial for policymakers to counter the misinformation with correct, targeted information. Further, identifying the geographical location of the tweets and the themes propagated in each region can be useful for agencies designing awareness programs specific to the target area and population.

The results presented in this study have limitations, mainly due to the homogeneous data used in the analysis. Because the tweets are narrowed down to the topic of 5G and COVID-19, the LDA analysis identifies themes that cluster together and overlap. Nevertheless, the analysis provides vital information about the themes recurring across the tweets. Further, the sentiment analysis tools provide additional information about the overall emotion associated with the topics.
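
One way to quantify how strongly two topics overlap is the Jensen-Shannon divergence between their word distributions, which is also the measure LDAvis [36] uses for inter-topic distances. The sketch below is only an illustration: the three toy distributions over a five-word vocabulary are assumptions, not topics estimated from the study's data.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits, skipping terms where p is zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    # The mixture m is non-zero wherever p or q is, so KL is well defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy topic-word distributions over the hypothetical vocabulary
# ["5g", "radiation", "tower", "china", "huawei"].
TOPIC_A = [0.40, 0.30, 0.20, 0.05, 0.05]  # radiation and tower theme
TOPIC_B = [0.35, 0.30, 0.25, 0.05, 0.05]  # near-duplicate of TOPIC_A
TOPIC_C = [0.30, 0.05, 0.05, 0.30, 0.30]  # China and Huawei theme

if __name__ == "__main__":
    print("A vs B:", round(js_divergence(TOPIC_A, TOPIC_B), 4))
    print("A vs C:", round(js_divergence(TOPIC_A, TOPIC_C), 4))
```

A small divergence (as between the first two toy topics) indicates heavily overlapping themes of the kind expected from a homogeneous corpus.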

Another limitation of the study is the relatively small number of tweets available for analysis. The corpus is small because the tweets were restricted to the topic of COVID-19 and 5G. A larger tweet corpus could support a more robust and insightful analysis.

The study could be extended by applying machine learning and deep learning techniques. Techniques such as word2vec models [43] are to be explored for more detailed analysis. However, because the dataset is homogeneous and focuses on 5G, advanced deep learning based approaches might not give robust results on it. Including conspiracy topics other than 5G would improve the variety and veracity of the dataset and make it better suited to deep learning based NLP methods.

VII. Conclusion

NLP based analysis of social media data provides opportunities to understand the nature and spread of misinformation. The COVID-19 tweets linking the pandemic to 5G were analysed to identify the recurring themes and topics within the tweets. Models including LDA, sentiment analysis and social network analysis were applied for the analysis of the tweets and identification of topics. An understanding of the topic frequencies, the inter-relationships between topics and the geographical occurrence of the tweets makes it possible to detect agencies and patterns in the spread of misinformation and equips policymakers with the knowledge to devise counter-strategies. This research could be improved further through more granular analysis of the data and longitudinal analysis of how the information spreads.
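
As a rough illustration of the network view of inter-relationships, the sketch below builds a co-occurrence graph from hashtag sets and computes degree centrality (see [39]) as the fraction of other nodes each node is linked to. The hashtag sets and the function name `degree_centrality` are hypothetical, and the graph construction is an assumption rather than the procedure used in this study.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical hashtag sets, one per tweet.
TWEET_HASHTAGS = [
    {"5g", "covid19", "radiation"},
    {"5g", "huawei"},
    {"5g", "covid19"},
    {"huawei", "china"},
]


def degree_centrality(tag_sets):
    """Degree centrality of each tag in the undirected co-occurrence graph."""
    neighbours = defaultdict(set)
    for tags in tag_sets:
        # Link every pair of tags that appear in the same tweet.
        # A tweet with a single tag contributes no edge and no node.
        for a, b in combinations(sorted(tags), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    # Normalise by the maximum possible degree, n - 1.
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


if __name__ == "__main__":
    for tag, score in sorted(degree_centrality(TWEET_HASHTAGS).items()):
        print(tag, score)
```

Nodes with high degree centrality are the tags through which most other themes are connected, which is one simple indicator of where counter-messaging might be focused.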

Biographies

Mohammed Bahja is currently a Lecturer in computer science with the University of Birmingham. He has participated in a variety of multidisciplinary projects, including the EU Funded Projects of Policy Compass, the MINICHIP Decision Support System, and the Green Datacentre. He has strong connections with the ICT industry in the public and private spheres. His research interests include applied data science, natural language processing, such as patient experience, crime detection, and horizon scanning, distributed software engineering with intersection to e-government and e-health systems, mixed reality, cloud, services, microservices, the IoT and blockchain, software architectures, security software engineering, sustainability software engineering, and computational intelligence in software engineering.

Ghazanfar Ali Safdar received the B.Sc. degree (Hons.) in electrical engineering from the University of Engineering and Technology, Pakistan, the M.Eng. degree in computer science and telecommunications from ENSIMAG, INPG, France, and the Ph.D. degree from Queen’s University Belfast, U.K., in 2005. He was a Research Fellow on a project related to wireless network security funded by EPSRC with Queen’s University Belfast. He was a Research and Development Engineer with the Carrier Telephone Industries (SIEMENS), Pakistan, and Schlumberger, France. He is currently a Senior Lecturer in computer networking with the University of Bedfordshire, U.K. He also holds an associate position with the Higher Education Academy, U.K. He has authored/coauthored four books, 11 book chapters, and around 80 research articles in leading journals and peer-reviewed conferences. His main research interests include cognitive radio networks, energy saving MAC protocols, security protocols for wireless networks, LTE networks, interference mitigation, device to device communications, network modeling, and performance analysis. He has served on the technical committees of several international conferences and as a session chair. He received several awards, including the Ph.D. degree from Queen’s University Belfast, in 2005, for his work in power-saving MAC protocols from the IEEE 802.11 Family of Wireless LANs, the Vice Chancellor Best Course of the Year Award, 100% National Students Satisfaction, and the Best Teacher of the Year Nomination Awards. He serves as the Editor-in-Chief for EAI Endorsed Transactions on Energy Web and Information Technology, the Area Editor for Wireless Networks (Springer), and the Topic Area Editor for Journal of Sensors and Actuator Networks (MDPI). He also serves as a Regular Reviewer for several esteemed journals, book proposals, and conference papers.

References

  • [1] Banaji S., Bhat R., Agarwal A., Passanha N., and Pravin M. S., “WhatsApp vigilantes: An exploration of citizen reception and circulation of WhatsApp misinformation linked to mob violence in India,” Dept. Media Commun., London School Econ. Political Sci., London, U.K., Tech. Rep., 2019.
  • [2] Groza A., “Detecting fake news for the new coronavirus by reasoning on the covid-19 ontology,” 2020, arXiv:2004.12330. [Online]. Available: http://arxiv.org/abs/2004.12330
  • [3] Tanne J. H., Hayasaki E., Zastrow M., Pulla P., Smith P., and Rada A. G., “Covid-19: How doctors and healthcare systems are tackling coronavirus worldwide,” BMJ, vol. 368, p. 368, Mar. 2020, doi: 10.1136/bmj.m1090.
  • [4] Coronavirus: Scientists Brand 5G Claims ‘Complete Rubbish’. Accessed: May 7, 2020. [Online]. Available: https://www.bbc.com/news/52168096
  • [5] Singh L., Bansal S., Bode L., Budak C., Chi G., Kawintiranon K., Padden C., Vanarsdall R., Vraga E., and Wang Y., “A first look at COVID-19 information and misinformation sharing on Twitter,” 2020, arXiv:2003.13907. [Online]. Available: http://arxiv.org/abs/2003.13907
  • [6] Jelnov P., “Confronting Covid-19 myths: Morbidity and mortality,” GLO Discuss. Paper Series, Global Labor Org., Tech. Rep. 516, 2020.
  • [7] Constantinou M., Kagialis A., and Karekla M., “Is science failing to pass its message to people? Reasons and risks behind conspiracy theories and myths regarding COVID-19,” Reasons Risks Behind Conspiracy Theories and Myths Regarding COVID-19, Rochester, NY, USA, Tech. Rep., 2020, doi: 10.2139/ssrn.3577662.
  • [8] Laato S., Najmul Islam A. K. M., Nazrul Islam M., and Whelan E., “Why do people share misinformation during the COVID-19 pandemic?” 2020, arXiv:2004.09600. [Online]. Available: http://arxiv.org/abs/2004.09600
  • [9] Allington D. and Dhavan N., “The relationship between conspiracy beliefs and compliance with public health guidance with regard to COVID-19,” Kings College London, London, U.K., Tech. Rep., 2020.
  • [10] Cushion S., Soo N., Kyriakidou M., and Morani M., “Research suggests UK public can spot fake news about COVID-19, but don’t realise the UK’s death toll is far higher than in many other countries,” LSE COVID-19 Blog, London School Econ. Political Sci., London, U.K., Tech. Rep., 2020. [Online]. Available: https://blogs.lse.ac.uk/covid19/2020/04/28/research-suggests-uk-public-can-spot-fake-news-about-covid-19-but-dont-realise-the-uks-death-toll-is-far-higher-than-in-many-other-countries/
  • [11] Blei D. M., Ng A. Y., and Jordan M. I., “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
  • [12] Jelodar H., Wang Y., Yuan C., Feng X., Jiang X., Li Y., and Zhao L., “Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey,” Multimedia Tools Appl., vol. 78, no. 11, pp. 15169–15211, Jun. 2019.
  • [13] Arun R., Suresh V., Madhavan C. V., and Murthy M. N., “On finding the natural number of topics with latent Dirichlet allocation: Some observations,” in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining. Berlin, Germany: Springer, 2010, pp. 391–402.
  • [14] Wang X., McCallum A., and Wei X., “Topical N-grams: Phrase and topic discovery, with an application to information retrieval,” in Proc. 7th IEEE Int. Conf. Data Mining (ICDM), Oct. 2007, pp. 697–702.
  • [15] Liu B., Sentiment Analysis and Opinion Mining (Synthesis Lectures on Human Language Technologies), vol. 5, no. 1. San Rafael, CA, USA: Morgan & Claypool, 2012.
  • [16] Liu B., “Sentiment analysis and subjectivity,” Handbook Natural Lang. Process., vol. 2, pp. 627–666, Feb. 2010.
  • [17] Manek A. S., Shenoy P. D., Mohan M. C., and R V. K., “Aspect term extraction for sentiment analysis in large movie reviews using gini index feature selection method and SVM classifier,” World Wide Web, vol. 20, no. 2, pp. 135–154, Mar. 2017.
  • [18] Luo F., Li C., and Cao Z., “Affective-feature-based sentiment analysis using SVM classifier,” in Proc. IEEE 20th Int. Conf. Comput. Supported Cooperat. Work Design (CSCWD), May 2016, pp. 276–281.
  • [19] Troussas C., Virvou M., Espinosa K. J., Llaguno K., and Caro J., “Sentiment analysis of facebook statuses using naive bayes classifier for language learning,” in Proc. IISA, Jul. 2013, pp. 1–6, doi: 10.1109/IISA.2013.6623713.
  • [20] Kang H., Yoo S. J., and Han D., “Senti-lexicon and improved Naïve Bayes algorithms for sentiment analysis of restaurant reviews,” Expert Syst. Appl., vol. 39, no. 5, pp. 6000–6010, 2012.
  • [21] Montejo-Ráez A., Díaz-Galiano M. C., Martínez-Santiago F., and Ureña-López L. A., “Crowd explicit sentiment analysis,” Knowl.-Based Syst., vol. 69, pp. 134–139, Oct. 2014.
  • [22] Yang L., Li Y., Wang J., and Sherratt R. S., “Sentiment analysis for E-commerce product reviews in Chinese based on sentiment lexicon and deep learning,” IEEE Access, vol. 8, pp. 23522–23530, 2020.
  • [23] Rosa R. L., Schwartz G. M., Ruggiero W. V., and Rodriguez D. Z., “A knowledge-based recommendation system that includes sentiment analysis and deep learning,” IEEE Trans. Ind. Informat., vol. 15, no. 4, pp. 2124–2135, Apr. 2019.
  • [24] Feldman R., “Techniques and applications for sentiment analysis,” Commun. ACM, vol. 56, no. 4, pp. 82–89, Apr. 2013.
  • [25] Gezici B., Bolucu N., Tarhan A., and Can B., “Neural sentiment analysis of user reviews to predict user ratings,” in Proc. 4th Int. Conf. Comput. Sci. Eng. (UBMK), Sep. 2019, pp. 629–634.
  • [26] Carrillo-de-Albornoz J., Rodríguez Vidal J., and Plaza L., “Feature engineering for sentiment analysis in e-health forums,” PLoS ONE, vol. 13, no. 11, Nov. 2018, Art. no. e0207996, doi: 10.1371/journal.pone.0207996.
  • [27] Bahja M. and Lycett M., “Identifying patient experience from online resources via sentiment analysis and topic modelling,” in Proc. 3rd IEEE/ACM Int. Conf. Big Data Comput., Appl. Technol. (BDCAT), 2016, pp. 94–99.
  • [28] Yue L., Chen W., Li X., Zuo W., and Yin M., “A survey of sentiment analysis in social media,” Knowl. Inf. Syst., vol. 60, pp. 617–663, Aug. 2019, doi: 10.1007/s10115-018-1236-4.
  • [29] Rambocas M. and Pacheco B. G., “Online sentiment analysis in marketing research: A review,” J. Res. Interact. Marketing, vol. 12, no. 2, pp. 146–163, Jun. 2018.
  • [30] Kumari R., Jeong J. Y., Lee B.-H., Choi K.-N., and Choi K., “Topic modelling and social network analysis of publications and patents in humanoid robot technology,” J. Inf. Sci., Dec. 2019, Art. no. 016555151988787.
  • [31] Wang Y., “The panorama of the last decade’s theoretical groundings of educational leadership research: A concept co-occurrence network analysis,” Educ. Admin. Quart., vol. 54, no. 3, pp. 327–365, Aug. 2018.
  • [32] Stroele V., Campos F., David J. M. N., Braga R., Abdalla A., Lancellotta P. I., Zimbrao G., and Souza J., “Data abstraction and centrality measures to scientific social network analysis,” in Proc. IEEE 21st Int. Conf. Comput. Supported Cooperat. Work Design (CSCWD), Apr. 2017, pp. 281–286.
  • [33] Chen E., Lerman K., and Ferrara E., “Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set,” JMIR Public Health Surveill., vol. 6, no. 2, May 2020, Art. no. e19273.
  • [34] COVID-19-TweetIDs. Accessed: Feb. 10, 2020. [Online]. Available: https://github.com/echen102/COVID-19-TweetIDs
  • [35] Newman D., Lau J. H., Grieser K., and Baldwin T., “Automatic evaluation of topic coherence,” in Proc. Annu. Conf. North Amer. (ACL). Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 100–108.
  • [36] Sievert C. and Shirley K., “LDAvis: A method for visualizing and interpreting topics,” in Proc. Workshop Interact. Lang. Learn., Vis. Interfaces, 2014, pp. 63–70.
  • [37] Lamberti P. W., Majtey A. P., Borras A., Casas M., and Plastino A., “Metric character of the quantum Jensen-Shannon divergence,” Phys. Rev. A, Gen. Phys., vol. 77, no. 5, May 2008.
  • [38] Hutto C. J. and Gilbert E., “Vader: A parsimonious rule-based model for sentiment analysis of social media text,” in Proc. 8th Int. Conf. Weblogs Social Media. Palo Alto, CA, USA: AAAI Press, 2014, p. 82.
  • [39] Zhang J. and Luo Y., “Degree centrality, betweenness centrality, and closeness centrality in social network,” in Proc. 2nd Int. Conf. Modeling, Simulation Appl. Math. (MSAM). Bangkok, Thailand: Atlantis Press, 2017, pp. 1–4.
  • [40] 77 Cell Phone Towers Have Been Set on Fire so Far Due to a Weird Coronavirus 5G Conspiracy Theory. Accessed: May 2020. [Online]. Available: https://www.businessinsider.com/77-phone-masts-fire-coronavirus-5g-conspiracy-theory-2020-5?r=US&IR=T
  • [41] Coronavirus: Man Jailed for 5G Phone Mast Arson Attack. Accessed: Jun. 2020. [Online]. Available: https://www.bbc.co.uk/news/uk-england-merseyside-52966950
  • [42] Camacho D., Panizo-LLedot Á., Bello-Orgaz G., Gonzalez-Pardo A., and Cambria E., “The four dimensions of social network analysis: An overview of research methods, applications, and software tools,” Inf. Fusion, vol. 63, pp. 88–120, Nov. 2020.
  • [43] Jang B., Kim I., and Kim J. W., “Word2vec convolutional neural networks for classification of news articles and tweets,” PLoS ONE, vol. 14, no. 8, Aug. 2019, Art. no. e0220976, doi: 10.1371/journal.pone.0220976.

Articles from IEEE Access are provided here courtesy of the Institute of Electrical and Electronics Engineers
