Abstract
The world is experiencing a pandemic crisis due to the spread of COVID-19, a novel coronavirus disease. The infection rate and the number of deaths are increasing rapidly. At the same time, people no longer rely on traditional news channels to learn about the epidemic situation. Instead, citizens of smart cities rely more on Social Network Services (SNS) to follow the latest news and information regarding the outbreak, share their opinions, and express their feelings and symptoms. In this paper, we propose an SNS Big Data Analysis Framework for COVID-19 Outbreak Prediction in a Smart Sustainable Healthy City, adopting Twitter as the platform. Over 10000 tweets were collected over two months. Of the users, 38% were aged between 18 and 29 and 26% between 30 and 49; 56% were male and 44% female. The geospatial location is the USA, and the language used is English. Natural Language Processing (NLP) is deployed to filter the tweets. The results showed an outbreak cluster predicted seven days earlier than the confirmed cases, with an indicator of 0.989. Analyzing data from SNS platforms enabled the prediction of future outbreaks several days in advance, which can help reduce the infection rate in a smart sustainable healthy city environment.
Keywords: COVID-19, Smart Healthy City, Big Data Analysis, SNS, NLP
1. Introduction
By the end of 2019, the Wuhan Municipal Health Commission in China reported a cluster of patients with pneumonia symptoms [1]. A few days later, the World Health Organization (WHO) recorded the first virus outbreak in a flagship technical report, which named the novel virus COVID-19 (Corona Virus Disease 2019). Since then, global media and healthcare organizations have raced to publish and deliver information and guidance to the public. On January 13, 2020, Thailand officially confirmed one COVID-19 case, the first infection outside China. Other countries, such as South Korea and Japan, reported infections as well. On March 11, 2020, considering the alarming levels of spread and severity, WHO declared a COVID-19 pandemic, calling for a public quarantine, raising the virus countermeasures to the highest levels, and shutting down several international airports to control the outbreak.
With the continuous spread of the virus across the globe, the concept of Smart Healthy Cities has come under scrutiny. A Smart Sustainable City (SSC) leverages advances in Information and Communication Technology (ICT) and other innovative means to serve and improve the Quality of Life (QoL) of its citizens [2], while ensuring the availability of its services for current and future generations. Enhancing the QoL is only possible with advanced transportation, governance, finance, education, and health services. In the same vein, the United for Smart Sustainable Cities (U4SSC) initiative [3] reported the healthcare system as one of the main Key Performance Indicators (KPI) for a successful SSC. The noticeable progress of healthcare services and technologies, named Smart Healthcare, contributes directly to the improvement of smart cities in general: the better the services provided by smart healthcare, the better life in a smart city becomes. Accordingly, smart sustainable cities are developing into Sustainable Smart Healthy Cities (SSHC). Researchers and SSHC citizens were looking to see which measures would be taken to cope with the sudden virus outbreak and how it would affect the sustainability and development of their smart cities. Singapore, as an example of an SSHC, deployed state-led technologies such as the TraceTogether application and the SafeEntry system for digital surveillance during the outbreak [4], which seemingly contributed to controlling the fast propagation of the virus. Nevertheless, the COVID-19 epidemic has affected some characteristics of SSHC and exposed some of its weaknesses [5]. Due to the large population of SSHC, human proximity limits the effectiveness of social distancing measures.
Moreover, SSHC also faced the problem of micro-mobility spaces [[6], [7], [8]]; in order to reduce the infection rate and to keep up with the sustainability demand, smart cities need to provide more spaces for walking and diminish the public transportation usage.
On the other hand, following the advancement of smart cities in the area of 5G networks, people are more connected via their smart devices, including smartphones, tablets, computers, and smart TVs. The study in [9] found that an adult spends an average of 232 minutes per day on the smartphone. Similarly, 90% of adolescents in Norway were found to use SNS platforms such as Facebook, Twitter, and Instagram daily [10]. People are exposed to a vast amount of information and news regarding their interests, making SNS platforms the main source of news, particularly during a pandemic, owing to social distancing measures and quarantine [11]. Multiple Social Networking Services reported an increase in usage. As one of the most used social media platforms, Facebook reported a 50% increase in overall messaging when WHO announced the pandemic status [12]. WhatsApp, too, disclosed a 40% increase in the same period. Overall, social media usage increased by more than 9% during the COVID-19 pandemic [13], corresponding to 321 million new users and a worldwide total of 3.80 billion users. People no longer rely on classic media sources such as government news or local healthcare providers to inform themselves during a pandemic such as COVID-19. Instead, they tend to trust social media platforms and messenger services more, where they can express their opinions and follow the current status in real time. For instance, online headlines and hashtags during the month of January mostly dealt with fear and prejudice against Chinese people [14].
Online platforms were the main source of information, as people usually do not watch television news or read newspapers, leading to a COVID-19 infodemic. The term infodemic was defined by Zarocostas et al. [15] as the high demand for timely, trustworthy information about a novel virus. Fig. 1 charts the number of mentions of the hashtag (a type of metadata tag used on social networks) “#coronavirus” on the Twitter platform from January (the beginning of the outbreak) until March (the peak of the outbreak). We noticed that the search rate for the keyword “coronavirus” was even higher than the number of hashtag mentions. This suggests that Twitter users were actively reading the news and viewing content regarding COVID-19 to keep themselves informed.
Fig. 1.
Coronavirus hashtag’s report on Twitter.
SNS platforms such as Twitter contain valuable data and information. Users, for instance, may disclose their symptoms online and request advice from other users. They may also mention their locations and share various types of recent information and news regarding the virus outbreak. SNS users thus produce extensive spatial and temporal data, which are the main pillar for infodemiology scientists. The term infodemiology refers to the emerging discipline that combines information science and epidemiology to address pressing concerns for public health and policy decisions [[16], [17], [18]]. With an increasing number of people turning to social media for information and to express their sentiments, fears, and opinions during this epidemic [19], SNS platforms function as a convenient source of information [43], yet only a few systematic studies have been conducted using data generated by social media.
A smart and fully sustainable healthy city commits to utilizing ICT and similar technologies in order to improve the QoL of its citizens and enhance its services while maintaining the required sustainability. Thus, predicting future clusters of the virus outbreak and preventing fake news from propagating further are critical actions to reduce the burdens faced by SSHC. To this end, and to exploit the openly shared SNS data, we propose an SNS big data analysis framework for COVID-19 that uses the Twitter API as a platform, together with Python and NLP methods, to predict potential future cases and virus outbreak hotspots based on the users' openly shared data, including location and symptoms. SNS users provide regular updates about their health status and concerns regarding the virus, including their locations such as home, work, or school, and frequently visited places. Moreover, we analyze the information shared on SNS to detect false information and suppress its spread, as it is considered one of the main direct causes of increased virus transmission. Fake news is drowning out the voices of official healthcare providers and government advisers on COVID-19, making it extremely difficult to deliver the right instructions [20]. Researchers from the University of East Anglia modeled the effect of fake news and misinformation on the propagation of infectious diseases, and their results showed that reducing misleading advice and misinformation by just 10% noticeably reduced the risky behaviors that directly cause the virus to spread. Other misleading information, such as news of fake lockdowns, contributes directly to public panic. We have witnessed several cases around the world where people rushed to buy food supplies following fake news of a lockdown, breaking social distancing rules and leading to further virus propagation [45].
Very little research has been conducted on SNS data for crises such as COVID-19. Moreover, existing studies focus on only one of sentiment analysis, fake news detection, or virus outbreak prediction. In contrast, we strongly believe that big data analysis of SNS information is a crucial application for SSHC during an epidemic in order to control the situation, manage the circulated and shared information, prevent misinformation from spreading further, and, above all, predict potential future outbreak locations, hotspots, and patients.
The following are the main contributions of our study:
• We propose an SNS big data analysis framework based on users' openly shared information, in which we analyze the COVID-19-related keywords and sentences tweeted openly by users from public accounts. This investigation helps governments and healthcare providers better perceive people's opinions as well as their tendency to keep to social distancing and follow the rules given by the World Health Organization.
• We predict COVID-19 outbreaks and detect potential future patient clusters by analyzing the tweets against a keyword database. These keywords include basic COVID-19 symptoms such as dry cough and high body temperature. SNS users tend to share their symptoms as well as those of their companions to get advice from other users or simply to express themselves. We accumulated a database of public tweets over two months to predict the whereabouts of a possible outbreak.
• We detect fake news and manage the infodemic by comparing the tweets to fact-checking websites in order to detect the circulated fake news and calculate the propagation coefficient. Fake news contributes directly to the virus spread. Thus, we believe that infodemic management is mandatory to control the epidemic in SSHC.
The rest of this paper is organized as follows: in the second section, we present a summary of other existing studies; Section 3 presents our proposed framework overview and explains in detail our flow diagram; the fourth section contains the numerical results and analysis based on the Twitter case study; finally, we conclude this work in the fifth section.
2. Related works
The recent advancement of smart cities and 5G technologies contributes directly to the increase in SNS and mobile users. Since the Quality of Experience guaranteed by the 5G network is relatively high, users in the SSHC rely more and more on their phones, instead of classic news channels, to stay informed daily. Taking the Twitter platform as an example, Monthly Active Users (MAU) in Japan numbered over 21 million, and they communicate and share information via Twitter [21]. An average of 6000 tweets per second was recorded in May 2020, and 200 billion tweets per year [22]. People rely more on SNS platforms to inform themselves about the pandemic. Nonetheless, very few studies have investigated and analyzed the news and information broadcast on social media to understand the propagation of the virus. In this section, we discuss some of the works that have explored social networks to derive the outbreak patterns and people's willingness to contribute to the virus control [23].
2.1. Seminal contribution
To the best of our knowledge, only five studies have considered analyzing social network data and news to understand the propagation of misinformation and its correlation with the virus outbreak. For instance, Yoo et al. [24] investigated the effects of SNS communication during an epidemic such as Middle East Respiratory Syndrome (MERS) and argued that it can predict preventive behavioral intentions in South Korea. The authors examined the theoretical expression and reception effect model using data collected from a nationally representative panel survey. The results showed that public health organizations need to adopt SNS monitoring tools to better understand the opinions and requests of diverse social groups and to adjust their messages accordingly. This research proposed the management of SNS platforms such as Twitter for outbreak cluster control; however, the authors did not consider the propagation of fake news. Moreover, the proposed method cannot make causal inferences regarding the relationship among key variables. Shahi et al. [25] presented a multilingual cross-domain dataset of 5182 fact-checked news articles for COVID-19. Their proposal included a classifier for automatic fake news detection. The authors collected articles and news related to COVID-19 from two sources - “Poynter” and “Snopes” - and performed explanatory analysis on them. The collected articles are linked to social media accounts including Twitter, Facebook, Reddit, YouTube, and Instagram. The proposed machine learning classifier recorded an F1-score of 0.76. This result helps in the initial screening of the propagation of fake news and misinformation during the pandemic period. Nonetheless, this work did not provide a method for managing the Twitter infodemic and considered human-annotated categories for only three languages.
Following the same concept of detecting fake news on social media, and in order to gain early insight into public opinion during a global pandemic, Shahi et al. [26] conducted an exploratory study on the propagation and authors of misinformation on Twitter. They divided the false claims they found into two categories - false news and partially false news - and found that verified accounts, including celebrities and organizations, are also involved in creating misinformation or spreading it by retweeting. The results of this mapping study revealed a large gap in the current scientific coverage of the topic. The study covered the most shared fake news, yet the dataset used excluded less viral misinformation. Similarly, Massaad et al. [27] designed social media data analytics in order to describe and analyze the volume, content, and geospatial distribution of tweets associated with telehealth during the COVID-19 pandemic. They queried Twitter public data to access tweets and analyzed them using Natural Language Processing (NLP) and unsupervised learning methods. The study was conducted in one country, and its results showed that such a platform should be used to evaluate the needs of societies and to support the healthcare response during a pandemic. Yet, fake news was not taken into consideration, and the database used for analyzing the results included all the information from Twitter. Likewise, Pui et al. [28] discussed how the content of tweets about COVID-19 can help in understanding the virus outbreak based on sentiment analysis; however, this work lacks developed results. Jia et al. [29], in turn, emphasized the usefulness of geographic information systems in fighting this epidemic. The authors discussed how a national intelligent syndromic surveillance system could identify early risk by detecting patients' symptoms and locations.
Other researchers focused on the big data analysis of social media for various motivations but did not deal with the COVID-19 topic. Nonetheless, we included their works in the related work section as they helped build the main pillar of data collection and analysis on social media platforms. For instance, Htet et al. [30] conducted social media data analysis using a maximum entropy classifier on a big data processing framework. They considered Twitter as the platform and retrieved health condition, education status, and business status using data mining techniques; a maximum entropy classifier was then used to perform sentiment analysis on the tweets. However, the Hadoop MR method used has some drawbacks in the case of the transaction between input and output. Following the same concept, Barbosa et al. [31] presented a robust sentiment detection method for Twitter that learns from biased and noisy data. The authors leveraged sources of noisy labels as training data. The results showed lower error rates, as an effective polarity classifier was produced even when only a small amount of data was provided; nonetheless, sentences with antagonistic sentiments cannot be analyzed.
2.2. Key consideration for COVID-19 prediction model
In this section, we discuss similar contributions and related work, compare related research studies (Table 1), and present the key considerations for our proposed SNS-based COVID-19 outbreak prediction in a smart healthy city. Based on the abovementioned related works, we can say that authors focused more on fake news detection and public sentiment determination. In our approach, however, we aim to detect and predict future outbreak hotspots based on openly published posts on Twitter. We also consider sentiment detection, understanding of people’s fear, and fake news detection, as well as other considerations. Our main key considerations can be described as follows:
• Sustainability: In an SSHC environment, authorities can make use of the available connectivity and online data to control the outbreak of any infectious disease such as COVID-19. Users of SNS tend to share valuable data and information online; such data, when correctly filtered and analyzed, aid in better understanding the virus propagation patterns and help contain the virus. We use this approach to propose a sustainable way to predict the virus outbreak so that the authorities can make accurate and meaningful decisions. Detecting the virus outbreak cluster will directly impact the management of the pandemic in SSHC, thus maintaining the required level of QoL while improving the healthcare service.
• Security and privacy: Other approaches do not consider data security and the privacy of users, as they use the raw data provided by the SNS platforms’ API without filtering it. In order to comply with data protection agreements in different regions, our proposed approach analyzes only the data openly shared by users on Twitter. Moreover, we protect the privacy of users by adding a filter functionality using NLP methods before forwarding the data to be processed; in the filter phase, we omit the user name and SNS identification. This step allows us to analyze the data anonymously and retrieve knowledge without violating the users’ privacy.
• Availability: This approach is based on the data collected from openly shared tweets on Twitter. Using the API provided by Twitter, we can collect data from various users in multiple locations and from diverse backgrounds, which yields a large database of information.
• Integrity: Data integrity refers to the assurance and maintenance of the accuracy of data over its life cycle [[46], [47]]. In this approach, we filtered the data carefully before proceeding to the analysis phase, which supports accurate and meticulous data for our prediction model.
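As an illustration of the filter step described under Security and privacy, the following minimal Python sketch drops identifying fields from a tweet record before it is forwarded for analysis. The field names (`user_name`, `user_id`, `screen_name`) and the sample record are assumptions made for this example only; they are not taken from the Twitter API schema, and the sketch is not the exact filter used in the framework.

```python
from typing import Any, Dict, List

# Hypothetical identifying fields; a real payload uses the platform's own schema.
IDENTIFYING_FIELDS = {"user_name", "user_id", "screen_name"}


def anonymize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record without the identifying fields."""
    return {key: value for key, value in record.items()
            if key not in IDENTIFYING_FIELDS}


# Made-up sample record, for illustration only.
raw: List[Dict[str, Any]] = [
    {"user_name": "alice", "user_id": "42",
     "text": "dry cough since monday", "location": "Boston"},
]
anonymized = [anonymize(record) for record in raw]
```

Because `anonymize` returns a new dictionary, the raw record is left untouched and can be discarded separately once the anonymized copy has been forwarded.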
Table 1.
Related work comparison
| Research work | Year | Twitter | Other Platform | Method/Software/Hardware | SNS Monitoring | Sentiment Detection | Fake News Detection | Limitations |
|---|---|---|---|---|---|---|---|---|
| Yoo et al. [24] | 2016 | Yes | No | Structural equation modeling. MPlus V6.1 | Yes | No | No | Cannot make causal inferences regarding the relationship among the key variables. |
| Shahi et al. [25] | 2020 | Yes | No | NLP, Python | No | No | Yes | Human-annotated categories only for three languages. |
| Shahi et al. [26] | 2020 | Yes | No | Python | No | No | Yes | The dataset excludes less viral misinformation. |
| Massaad et al. [27] | 2020 | Yes | No | Google Colab, Python | No | Yes | No | The data were collected in only one country. |
| Pui et al. [28] | 2020 | Yes | No | Google Colab, Python | No | Yes | No | The results were not developed. |
| Htet et al. [30] | 2018 | Yes | No | HBase, Raspberry Pi | No | Yes | No | The Hadoop MR method used has some drawbacks in the case of the transaction between input and output. |
| Barbosa et al. [31] | 2010 | Yes | No | Support vector machines | No | Yes | No | Sentences with antagonistic sentiments cannot be analyzed. |
| Our Contribution | 2020 | Yes | No | Google Colab, Python | Yes | Yes | Yes | Tested on only one SNS platform. |
3. Proposed SNS big data analysis framework
Very few infodemiology studies have applied network analyses as a method of pandemic management. To this end, we proposed a Big Data analysis framework based on information shared openly on the SNS platform to track, understand, and predict future virus outbreaks as well as manage the infodemic and prevent fake news from spreading. This framework includes four layers: 1) User Layer, 2) Fog Layer, 3) Cloud Layer, and 4) Application Layer. The overview of the framework is depicted in Fig. 2 below.
Fig. 2.
SNS Big Data Analysis Framework Overview.
3.1. General overview
The proposed framework contains mainly four layers, and it is designed to analyze data from SNS platforms based on the information shared by the users:
User layer: The first layer of our proposed framework is the user’s layer. SNS users in this layer use their smartphones, computers, and/or tablets to surf the platforms. Users can consume data - which means being a mere receiver of information - participate in the creation of new data, and/or share existing data. Each user utilizes one or multiple SNS platforms such as Facebook, Instagram, Twitter, YouTube, and so on to:
• Receive information regarding the virus: Users utilize the platform to receive, search, and read news and information regarding COVID-19. The data could reach the user in the form of posts, messages, videos, and news articles shared by other users. Users are capable of searching on SNS platforms using keywords or hashtags to retrieve their desired information. Taking Twitter as an example, in 2019, the platform recorded 330 million monthly active users, 40% of whom use the service daily. Note, however, that Twitter announced that over 500 million people access its platform without logging into an account [32]. These numbers suggest that people are using Twitter as receivers, and that they are capable of getting the information they need without participating.
• Share information regarding the virus: Users can share any post or article they want on their profile. This action requires having an active account and helps in understanding the user’s opinion later in the analysis phase.
• Create content and information regarding the virus: SNS platforms give users the power to express themselves and their thoughts virtually. Users contribute to creating new information and data directly by posting their viewpoints, sharing their knowledge and experiences, and expressing their feelings regarding the epidemic situation. Such data, after being analyzed, strengthen our understanding of how people react during a serious pandemic, including their willingness to follow the authorities’ instructions in order to control the virus outbreak. Moreover, checking whether the shared information is misleading is critical before people start to believe and reshare it.
• Update their health status and concerns: SNS users tend to share their concerns about their health status and conditions online. The authors in [42] surveyed 1,040 US adults and found that roughly 33% use SNS platforms such as Twitter to share medical questions, obtain health information, and track their symptoms. 80% of participants aged 18 to 24 said that they are more likely to share information about their health condition and symptoms on social media. Analyzing these data can help healthcare providers and authorities predict potential future COVID-19 outbreak hotspots and make fast decisions to stop the virus from spreading further. We will explain in the next section how this functionality is crucial for virus control.
• Share their locations and visited places: Another valuable piece of information that users tend to share in their SNS accounts is their location. When they post a picture or a text post, they are likely to mention the place where they are. Facebook, Instagram, and Snapchat typically suggest locations when people upload content to their accounts. Moreover, users also tag the other people they are with at the time of posting. This information can directly help in tracing the virus spread, as we will discuss in the following section.
• Express their opinion and anger: Opinion, sentiment, and fear can be retrieved from the contextual posts of users as well as from their likes of other posts and articles.
• Socialize with other people: During the quarantine imposed by most countries around the world, schools and workplaces have shifted online and no longer require physical presence, leading people to spend more time at home. They turn to social media in order to socialize and keep in touch with their loved ones and friends.
All of this information is constantly being uploaded to the fog layer where social networking platforms run their databases.
Fog Layer: The fog layer is where social media platforms such as Facebook, Instagram, Twitter, and YouTube, as well as chat applications such as WhatsApp and Facebook Messenger, reside and run their databases. The data uploaded by the users are stored in fog databases. Some of the SNS platforms offer an open API to access their database content for research purposes, whereas for others the data need to be collected manually in order to be studied. Thus, in this paper, we work with the open Twitter API and make use of its database to retrieve tweets related to COVID-19 and analyze them.
Cloud Layer: Two main functionalities run at the cloud layer: filtering and analyzing. In the filtering phase, the data are cleaned; this includes removing hashtags and links, lowercasing all sentences, and other functionalities that we will explain in detail in the next section. Subsequently, the cleaned data are forwarded to the analyzing phase, where they are processed and analyzed in order to retrieve knowledge and valuable information. Details about this process are discussed in the next section. The results are forwarded to the Application layer.
Application Layer: In the last layer, the results of the analyzing phase and the knowledge gleaned are transmitted to the authorities in order to make use of the results retrieved from the cloud layer [[37], [38], [39], [40], [41]]. Based on the results received, authorities such as public healthcare providers and governments should be able to make an accurate decision regarding the virus outbreak and infodemiology. The results help the authorities predict future outbreak waves based on the user’s shared symptoms and locations, control the information shared in SNS platforms in order to track fake news and stop it from spreading further, and understand people’s fear and opinion regarding the epidemic in order to maintain a sustainable, healthy smart city.
The proposed framework and solution are not only suitable for the COVID-19 epidemic but can also be used to monitor future virus outbreaks and manage the situation and public fear before it is too late.
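The filtering phase of the cloud layer described above (removing links and hashtags, then lowercasing) can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the regular expressions and the function name `clean_tweet` are hypothetical and are not the exact rules applied in the framework.

```python
import re

# Assumed patterns: http(s) links and hashtags made of word characters.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text: str) -> str:
    """Remove links and hashtags, lowercase the text, and collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = text.lower()
    return " ".join(text.split())


cleaned = clean_tweet("Stay HOME! #lockdown https://t.co/abc")
```

Links are removed before hashtags so that a `#fragment` inside a URL is not treated as a separate hashtag.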
3.2. Structural design with methodological flow
Fig. 3 below presents the methodological flow of our proposed framework. We took Twitter as the SNS platform, as it provides researchers with a free API to access openly shared tweets. The flow starts with a user who decides to share a piece of information on Twitter. This information could be a simple opinion, a location (dinner at a certain place or at home with a certain person), symptoms (users who share online that their symptoms are similar to COVID-19 symptoms before getting checked), and/or general information about the virus that could be created by the user or shared from another source such as web news pages or other tweets. All of these data are crawled, collected, and shared with the cloud server, where two main functions are performed: preprocessing and postprocessing.
Fig. 3.
Process Flow of the Proposed Framework.
Preprocessing: Preprocessing consists of filtering the data communicated from the fog layer; this step consumes roughly 85% of the knowledge discovery time [33]. We followed these steps:
a) Keywords extraction: Since the platform we worked on is Twitter, we selected a set of the most used hashtags regarding the COVID-19 topic and collected our tweets based on it. These hashtags include #coronavirus, #wuhanvirus, #COVID-19, #ncov, #lockdown, #2019ncov, and #corona. Algorithm 1 shows the process of selecting tweets based on hashtags: we connected to the Twitter open API, created a table containing all the hashtags, and retrieved all the tweets in which those hashtags are mentioned.
b) Tokenization: Tokenization is the process wherein words are transformed from normal strings into meaningful terms in order to facilitate the analysis phase [34]. The tokenization action removes any special characters, numbers, and symbols from the text. The remaining terms are called tokens. Algorithm 1 depicts the tokenization process. After establishing the connection and collecting the tweets based on the given hashtags, we created a table of the special characters to remove, such as the “RT” mention, which refers to a retweeted tweet, the “@” symbol, which is used to tag a user, and the “#” symbol, which is used to highlight keywords, as well as numbers. The function Replace is responsible for putting a space in place of those special characters.
c) Lemmatization and clustering: In linguistics, lemmatization is the process of grouping together the inflected forms of a word so that they can be analyzed as a single item, identified by the word's lemma, or dictionary form. In data analysis, this grouping can be referred to as clustering. Many tweets are similar, which may create noise in the data. Thus, clustering tweets based on their similarities saves time in the post-analysis phase. The clustering algorithm is based on a machine learning concept called Term Frequency - Inverse Document Frequency (TF-IDF). TF-IDF is computed as a matrix in which a tweet of the Tokenized Tweets table resulting from Algorithm 1 is referred to as $t$ and a word within it is referred to as $w$. The term frequency is the ratio of the number of occurrences of the current word to the total number of words. For the frequency of word $w$, $n_{w,t}$ is the number of occurrences of $w$ in $t$, and the sum $\sum_{k} n_{k,t}$ is the total number of all words in $t$. Equation 1 explains this step:
$$\mathrm{TF}(w,t) = \frac{n_{w,t}}{\sum_{k} n_{k,t}} \tag{1}$$
Inverse Document Frequency is the logarithm of the ratio of the number of all tweets in the Tokenized Tweets table, $N$, to the number of tweets that contain the term $w$. Equation 2 depicts this phase:
$$\mathrm{IDF}(w) = \log \frac{N}{\left|\{t : w \in t\}\right|} \tag{2}$$
TF-IDF is the product of $\mathrm{TF}(w,t)$ and $\mathrm{IDF}(w)$, as shown in Equation 3:
$$\text{TF-IDF}(w,t) = \mathrm{TF}(w,t) \times \mathrm{IDF}(w) \tag{3}$$
Clustering and organizing all the tweets is a necessary step to expedite the post-analysis process.
-
d)
Feature extraction: After the keyword extraction, tokenization, and lemmatization phases, the remaining necessary step is feature extraction, in which the main characteristics of each tweet are extracted. The results feed the post-processing stage. These features include the user location, the symptoms mentioned in the tweets, subjectivity, and polarity. Starting with the user location, we used the location provided by the user in the location field of the tweet metadata. Aside from the city or country, we focused on more precise locations such as a restaurant name, workplace, school name, and so on. Algorithm 2 presents the pseudo-code used to retrieve the user location. We used the Tokenization Table obtained from Algorithm 1, fetched all the qualified tweets from it, and checked whether the location field is provided. If so, we added the user and the respective location to a table of all such users. This table serves to predict possible future outbreak cases. Subjectivity and polarity analysis is a method that provides a clear understanding of the user’s attitude, opinion, and sentiment using NLP techniques. As part of sentiment analysis using NLP, the subjectivity analysis classifies texts as opinionated or non-opinionated [35]. Adjectives, adverbs, and certain verbs and nouns were used as indicators of a subjective opinion. Polarity analysis was performed after the subjectivity analysis in order to determine whether the opinion is positive or negative.
As with the subjectivity analysis, we used the Python TextBlob library for NLP to extract the polarity result. The results vary between −1 and 1, where negative scores represent a negative opinion and positive scores denote a positive opinion. A score of 0 suggests a neutral point of view (Table 2).
Table 2.
Subjectivity and Polarity Metric
| | Subjectivity | | Polarity | |
|---|---|---|---|---|
| Scale | 0 | 1 | −1 | 1 |
| Explanation | Objective | Subjective | Negative | Positive |
Algorithm 3 presents the pseudo-code of the abovementioned steps, whereas Fig. 4 shows an example of 100 tweets processed using the previously mentioned methods. This method supports detecting fake news as well. Fake news items mainly share the same characteristics: they tend to be more negative and show a very high subjectivity score. Based on that, we retrieved the suspected tweets and compared them with fake news databases provided by multiple researchers. If the similarity is high, the tweets are deleted from our database, and tweets from the same user are not used in the future. Fig. 5 presents the detailed methodological flowchart of the proposed framework.
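As an illustration of how the scales of Table 2 can be turned into labels and into a fake-news pre-filter, the following minimal Python sketch may help. It is not Algorithm 3 itself. The scores are passed in directly, in the form that `TextBlob(text).sentiment` returns them (polarity, subjectivity), so that the sketch has no third-party dependency. The three cutoff constants and the names `SentimentLabel` and `label_sentiment` are assumptions introduced here for illustration; the thresholds actually used are not stated in this paper.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, chosen only for illustration.
SUBJECTIVE_CUTOFF = 0.5      # midpoint of the 0..1 subjectivity scale
SUSPECT_POLARITY = -0.5      # "strongly negative"
SUSPECT_SUBJECTIVITY = 0.8   # "very subjective"


@dataclass
class SentimentLabel:
    opinion: str   # "positive", "negative" or "neutral"
    stance: str    # "subjective" or "objective"
    suspect: bool  # candidate for comparison with a fake-news database


def label_sentiment(polarity: float, subjectivity: float) -> SentimentLabel:
    """Map a (polarity, subjectivity) pair onto the scales of Table 2."""
    if not -1.0 <= polarity <= 1.0 or not 0.0 <= subjectivity <= 1.0:
        raise ValueError("scores outside the Table 2 scales")
    if polarity > 0:
        opinion = "positive"
    elif polarity < 0:
        opinion = "negative"
    else:
        opinion = "neutral"
    stance = "subjective" if subjectivity >= SUBJECTIVE_CUTOFF else "objective"
    # Fake news tends to be both negative and highly subjective.
    suspect = (polarity <= SUSPECT_POLARITY
               and subjectivity >= SUSPECT_SUBJECTIVITY)
    return SentimentLabel(opinion, stance, suspect)
```

A tweet flagged as `suspect` is only a candidate: in the framework it would still be compared with an external fake-news database before being deleted.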
Fig. 4.
Example of Subjectivity and Polarity results.
Fig. 5.
Detailed Methodology Flowchart.
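The preprocessing stages summarized in the flowchart (hashtag-based selection, tokenization, and TF-IDF weighting) can be sketched in a few lines of standard-library Python. This is a simplified sketch under stated assumptions, not the authors' Algorithm 1: tweets are assumed to be already available as plain strings, so no Twitter API call is shown; the function names are hypothetical; and the term frequency is computed per tweet, which is the common TF-IDF variant, rather than over the whole Tokenized Tweets table.

```python
import math
import re
from collections import Counter

# Hashtags listed in the keyword-extraction step (compared in lower case).
HASHTAGS = {"#coronavirus", "#wuhanvirus", "#covid-19", "#ncov",
            "#lockdown", "#2019ncov", "#corona"}


def has_tracked_hashtag(text):
    """Keep a tweet if it mentions at least one tracked hashtag."""
    tags = {tag.lower() for tag in re.findall(r"#[\w-]+", text)}
    return bool(tags & HASHTAGS)


def tokenize(text):
    """Replace the RT marker, @, #, digits and symbols with spaces, then split."""
    text = re.sub(r"\bRT\b", " ", text)        # retweet marker
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # @, #, numbers, punctuation
    return text.lower().split()


def tf_idf(tokenized_tweets):
    """Return one {word: TF * IDF} dictionary per tokenized tweet."""
    n_tweets = len(tokenized_tweets)
    # Number of tweets in which each word appears at least once.
    doc_freq = Counter(w for tokens in tokenized_tweets for w in set(tokens))
    scores = []
    for tokens in tokenized_tweets:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({w: (c / total) * math.log(n_tweets / doc_freq[w])
                       for w, c in counts.items()})
    return scores


# Made-up example tweets, for illustration only.
tweets = ["RT @who: Stay home #lockdown 2020",
          "Sunny day today",
          "Fever and cough #COVID-19"]
kept = [t for t in tweets if has_tracked_hashtag(t)]
tokens = [tokenize(t) for t in kept]
weights = tf_idf(tokens)
```

Words that occur in every tweet receive a weight of zero, which is what makes the weighting useful for grouping near-duplicate tweets by their distinctive terms.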
4. Analysis and discussion
After gathering the data from Twitter, filtering it, and organizing it based on sentiment analysis, in this section we apply deep learning to the data in order to extract knowledge and build a prediction of the virus outbreak based on people's locations and shared symptoms and questions. This method gives us an overview of potential future COVID-19 cases and allows fast action to prevent further spread.
4.1. Numerical results
In order to test the feasibility of our proposed model, we simulated it on a 64-bit Intel Core i7 computer. The data for this prediction simulation were collected over a two-month period, from August 1, 2020 to September 30, 2020. Data collection was done using Python for Twitter on Google Colab. To simplify the task, we took the United States of America as the location for which we retrieved and analyzed the data. The total number of tweets collected is 10000. 38% of the users are between 18 and 29 years old, while 26% are between 30 and 49 years old; 56% are male and 44% are female. We used these data to feed the deep learning algorithm in order to extract a prediction, and afterward compared the results with the actual COVID-19 cases in the USA. For this purpose, we used the SIR (Susceptible, Infected, Recovered) model, wherein we divided the data collected and filtered from Twitter into three categories based on the symptoms and confirmations shared by the users. These categories are shown in Fig. 6.
Fig. 6.
Categories of USA population in COVID-19 case.
The model is based on the assumption that the COVID-19 virus is transmitted through direct or indirect contact from people in the infected category to people in the susceptible category [36]. In our proposition, the data set is the filtered information retrieved from Twitter accounts based in the USA. Infected cases who previously shared their locations and the people they have met are potential virus transmitters to susceptible cases. After dividing the data into three categories, we analyzed it based on the movement and contacts of the infected cases, removed the death and recovered cases from the calculation, and excluded newborn cases from this simulation as the period is relatively short.
In order to calculate the progress of individuals in the susceptible category as a time series, we used the following differential equation:
$$\frac{dS}{dt} = -\alpha S I \tag{4}$$
$S$ and $I$ refer to the Susceptible and Infected cases, respectively. $\alpha$ is the reproduction rate per day of the differential equation, used to regulate the susceptible-infectious contact. In the early stages of the outbreak, the value of $I$ is negligible, and the value of $S$ is approximately equal to 1. As the outbreak progresses, the value of $I$ becomes larger, whereas the value of $S$ gradually declines. Thus, the increase becomes linear, and the infected category can be calculated as follows:
$$\frac{dI}{dt} = \alpha S I - \beta I \tag{5}$$
where $\beta$ represents the daily rate regulator of new infections based on the quantification of the number of infected cases in the transmission. Moreover, the removed cases $R$, which represent the category of people who have been cured of the virus or who died, can be calculated as follows:
$$\frac{dR}{dt} = \beta I \tag{6}$$
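To make the dynamics of the three SIR categories concrete, the sketch below integrates the standard SIR equations $dS/dt = -\alpha S I$, $dI/dt = \alpha S I - \beta I$, and $dR/dt = \beta I$ with a simple forward Euler scheme. The symbols $\alpha$ (daily contact or reproduction rate) and $\beta$ (daily removal rate) are the names assumed here; the function name, the step size, and the parameter values in the usage lines are hypothetical and are not fitted to the Twitter data of this study.

```python
def simulate_sir(alpha, beta, s_init, i_init, r_init, days, steps_per_day=10):
    """Integrate the SIR equations with forward Euler.

    S, I and R are population fractions. Returns a list with one
    (S, I, R) tuple per day, starting with the initial state.
    """
    dt = 1.0 / steps_per_day
    s, i, r = s_init, i_init, r_init
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infected = alpha * s * i * dt   # flow from S to I
            new_removed = beta * i * dt         # flow from I to R
            s -= new_infected
            i += new_infected - new_removed
            r += new_removed
        history.append((s, i, r))
    return history


# Hypothetical parameters: 0.1% of the population initially infected.
history = simulate_sir(alpha=0.3, beta=0.1,
                       s_init=0.999, i_init=0.001, r_init=0.0, days=30)
```

Because every flow leaving one category enters another, the sum S + I + R stays constant over the simulation, which is a quick sanity check on any implementation of the model.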
Based on the equations above, the prediction of future potential outbreak using Twitter data can be calculated as follows:
| (7) |
We ran a simulation on the data previously retrieved and filtered from Twitter and compared our predicted results with the actual confirmed cases in the USA between August 1, 2020 and August 31, 2020. The regression model shows an accuracy of 0.81. Fig. 7 shows the prediction results.
Fig. 7.
Prediction results compared with confirmed results.
In order to evaluate the accuracy of the proposed model, we calculated the median prediction factor. Results showed that the factor varies between 0.8 and 1.2, as depicted in Fig. 8, indicating a fair model for predicting future outbreak cases.
Fig. 8.
Median Prediction Factor.
The median factor is calculated from the predicted cases and the cases confirmed by the authorities as follows:
| (8) |
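Since the factor is reported to vary around 1, one plausible reading is a daily ratio of predicted to confirmed cases, summarized by its median. The sketch below follows that reading; it is an assumption made for illustration and not necessarily the exact form of Equation 8. The function names and the numbers in the usage lines are hypothetical.

```python
from statistics import median


def prediction_factors(predicted, confirmed):
    """Daily ratio of predicted to confirmed cases (days with 0 confirmed are skipped)."""
    if len(predicted) != len(confirmed):
        raise ValueError("series must have the same length")
    return [p / c for p, c in zip(predicted, confirmed) if c > 0]


def median_prediction_factor(predicted, confirmed):
    """Median of the daily predicted/confirmed ratios."""
    return median(prediction_factors(predicted, confirmed))


# Made-up counts, for illustration only.
factor = median_prediction_factor([90, 110, 120], [100, 100, 100])
```

Under this reading, a factor of 1 means the prediction matches the confirmed count, values below 1 indicate under-prediction, and values above 1 indicate over-prediction.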
Big data SNS analysis can provide us with a strong tool to manage and control a pandemic such as COVID-19. People in smart city and IoT environments rely more and more on their smartphones and connected devices to inform themselves about everything, including a serious global pandemic. The data shared on SNS platforms is very meaningful, as it can inform us about the user’s condition, help us understand the user’s sentiments, and predict future outbreak hotspots based on the user’s movements, locations, and symptoms. Moreover, based on the sentiment analysis performed on the SNS platform, we can detect fake news and information, as they mainly share the same vocabulary and grammatical characteristics. Detecting the suspected fake information and comparing it with the fake news databases provided by various researchers supports our filtering system in deleting those data from the analysis phase.
Although we provided only one example of SNS big data analysis, based on a single SNS platform, we strongly believe that such analysis is a main pillar for controlling the current COVID-19 outbreak as well as future pandemics in smart cities. Researchers should make use of the openly shared data on SNS platforms and focus on improving SNS big data analysis methods, both for a better understanding of users’ fear and sentiments and to provide oversight and management of serious pandemics. The focal point of our future research is an automated fake news and information detection tool for SNS platforms, which will support the proposed model and provide a better data set to analyze.
4.2. Comparative analysis and discussion
Based on the above-mentioned methods and the results of the median prediction factor, we retrieved the confirmed COVID-19 cases of the last 7 days and compared them with the predicted cases of the last 14 days. The results depicted in Fig. 9 show that our proposed method can predict the next COVID-19 outbreak and hotspots approximately 7 days earlier than the government authorities. The dots represent the outbreak cluster. By following the SNS big data analysis method, authorities can benefit from an earlier prediction of outbreaks and hotspots in COVID-19 and similar crises, which supports virus control and provides a better understanding of people's opinions and their tendency to follow government guidance.
Fig. 9.
Predicted and Confirmed COVID-19 Cases Based on Days.
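One way to quantify how many days a predicted curve runs ahead of the confirmed curve is to shift one series against the other and keep the shift with the highest correlation. The sketch below illustrates this idea with a lagged Pearson correlation. It is a technique swapped in here for illustration, not the comparison procedure used in this study, and the function names and the synthetic curve are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lead_days(predicted, confirmed, max_lead=14, min_overlap=3):
    """Find the lead k (in days) at which predicted[t] best tracks confirmed[t + k]."""
    best_k, best_r = 0, float("-inf")
    for k in range(max_lead + 1):
        overlap = min(len(predicted), len(confirmed) - k)
        if overlap < min_overlap:
            break
        r = pearson(predicted[:overlap], confirmed[k:k + overlap])
        if r > best_r:
            best_k, best_r = k, r
    return best_k, best_r


# Synthetic check: the "confirmed" curve is the "predicted" curve delayed by 7 days.
curve = [math.sin(t / 5.0) + 0.01 * t for t in range(67)]
predicted = curve[7:]    # predicted[t] = curve[t + 7]
confirmed = curve[:60]   # confirmed[t] = curve[t]
lead, corr = best_lead_days(predicted, confirmed)
```

On real case counts the correlation peak is usually much flatter than in this synthetic example, so the estimated lead should be read as approximate.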
We also compared our results with other state-of-the-art outcomes. Although those papers do not use SNS information as a database to calculate and predict the virus outbreak, they use other methods based on machine learning models fitted by the Grey Wolf Optimization (GWO) method, including linear, logarithmic, quadratic, cubic, compound, power, exponential, and logistic models, as depicted and explained in [35]. The authors calculated the prediction results of these machine learning models for USA COVID-19 cases fitted by GWO. We compare those results with our model in Table 3 and depict them visually in Fig. 10.
Table 3.
Comparative result analysis based on different models.
| Model Name | Model Description | Prediction Results |
|---|---|---|
| Logistic | | 0.999 |
| Linear | | 0.557 |
| Logarithmic | | 0.289 |
| Quadratic | | 0.88 |
| Cubic | | 0.982 |
| Compound | | 0.977 |
| Power | | 0.702 |
| Exponential | | 0.977 |
| SNS Analysis | | 0.989 |
Fig. 10.
Comparative result analysis.
Based on the above-mentioned results, our model achieves a prediction result of 0.989 (Table 3), second only to the logistic model (0.999) among the compared machine learning methods, which makes it suitable for COVID-19 outbreak prediction [44] and control by the authorities. Moreover, we compared the previously mentioned related works based on four key considerations: sustainability, security, availability, and integrity. Unlike other works that used Twitter as a database for outbreak prediction in COVID-19 and similar crises, our model covers all four key considerations, as depicted in Table 4.
Table 4.
Comparison based on key considerations.
| Research work | Year | | Method/Software/Hardware | Sustainability | Security | Availability | Integrity |
|---|---|---|---|---|---|---|---|
| [24] | 2016 | Yes | Structural equation modeling. MPlus V6.1 | No | No | Yes | Yes |
| [25] | 2020 | Yes | NLP, Python. | Yes | No | Yes | Yes |
| [26] | 2020 | Yes | Python | No | Yes | Yes | No |
| [27] | 2020 | Yes | Google Colab. Python. | No | No | Yes | Yes |
| [28] | 2020 | Yes | Google Colab Python | Yes | Yes | No | No |
| [30] | 2018 | Yes | HBase. Raspberry pi | Yes | No | Yes | Yes |
| [31] | 2010 | Yes | Support vector machines | Yes | Yes | No | Yes |
| Our Contribution | 2020 | Yes | Google Colab Python | Yes | Yes | Yes | Yes |
SNS big data analysis is a critical solution that calls for further development in order to be applied to the SSHC. Using the proposed framework, we can predict potential future COVID-19 cases and outbreak clusters seven days before the confirmed cases are announced, allowing governments to apply a national or regional lockdown a few days in advance and control the virus propagation. As far as we know at the time of writing, and compared with other studies, ours is the only work that combines sentiment analysis, fake news detection, the deletion of fake news as well as the corresponding users’ identities from the analyzed database, and the prediction of the virus outbreak before the confirmed cases. The results of this research proved to be adaptable to the SSHC environment, as the framework achieves a prediction result of 0.989, which contributes directly to the maintenance and sustainability of the services provided by smart cities to their citizens, thus improving the QoL and KPI of the SSHC. However, we have only simulated our proposed scheme on Twitter API data. Moreover, we have analyzed tweets from one geospatial location and in one language (English). In our future work, we aim to address these limitations by analyzing data from multiple SNS platforms such as Facebook and YouTube and by including more linguistic and geospatial options.
5. Conclusion
To avoid the calamity of COVID-19 during the global pandemic outbreak, an SSHC calls for fast and accurate SNS big data analysis and management. Motivated by the advancement of these technologies and the rapid development of Social Network Services (SNS), this paper proposed an infodemiology study to predict the epidemic outbreak and track its spread across a globally shared framework. We considered Twitter as a case study for our framework, as it provides researchers with an API to collect varied and openly shared tweets. We collected tweets from USA users during a two-month period and analyzed them using preprocessing methods including keyword extraction, redundancy removal, tokenization, lemmatization, and feature extraction. The proposed framework predicted the outbreak seven days earlier than the confirmed cases, with an indicator of 0.989, thus enhancing the control of the pandemic situation, helping to reduce infection cases by predicting potential virus carriers several days earlier, and supporting the understanding of users’ fear and sentiment and their tendency to follow authorities’ regulations. SNS big data are a tool that should be used in future research, as they provide an oversight of the situation, help control a serious pandemic such as COVID-19, and help maintain the KPI in the SSHC by improving the healthcare services provided to its citizens. Future research will cover the limitations of this work and address the challenge of analyzing multiple SNS platforms with different linguistic and geospatial options.
Declaration of Competing Interest
The authors report no declarations of interest.
Acknowledgments
This work was supported by the National Research Foundation of Korea (NRF) with a grant funded by the Korea government (NRF-2019R1A2B5B01070416).
References
- 1. World Health Organization. 2020. Archived: WHO Timeline - COVID-19. https://www.who.int/news-room/detail/27-04-2020-who-timeline---covid-19 (Last Accessed 22 February 2021)
- 2. Park J.H., Rathore S., Singh S.K., Salim M.M., Azzaoui A.E.L., Kim T.W., Pan Y., Park J.H. A Comprehensive Survey on Core Technologies and Services for 5G Security: Taxonomies, Issues, and Solutions. Human-centric Computing and Information Sciences. 2021;11(3). doi: 10.1109/HCIS.2021.11.003.
- 3. Smiciklas John, Prokop Gundula, Stano Pawel, Sang Ziqin. Collection Methodology for Key Performance Indicators for Smart Sustainable Cities, CBD, ECLAC, FAO, ITU, UNDP, UNECA, UNECE, UNESCO, UN Environment, UNEP-FI, UNFCCC, UN-Habitat, UNIDO, UNU-EGOV, UN-Women and WMO. 2021. United for Smart Sustainable Cities (U4SSC) initiative report.
- 4. Das D., Zhang J.J. Pandemic in a smart city: Singapore’s COVID-19 management through technology & society. Urban Geography. 2020:1–9. doi: 10.1080/02723638.2020.1807168.
- 5. Megahed N.A., Ghoneim E.M. Antivirus-built environment: Lessons learned from Covid-19 pandemic. Sustainable Cities and Society. 2020;61. doi: 10.1016/j.scs.2020.102350.
- 6. Badshah A., Ghani A., Qureshi M.A., Shamshirband S. Smart security framework for educational institutions using internet of things (IoT). Computers, Materials & Continua. 2019;61(1):81–101. doi: 10.32604/cmc.2019.06288.
- 7. Singh S.K., Pan Y., Park J.H. OTS Scheme Based Secure Architecture for Energy-Efficient IoT in Edge Infrastructure. Computers, Materials & Continua. 2021;66(3):2905–2922.
- 8. Jha S., Nkenyereye L., Joshi G.P., Yang E. Mitigating and Monitoring Smart City Using Internet of Things. Computers, Materials & Continua. 2020;65(2):1059–1079. doi: 10.32604/cmc.2020.011754.
- 9. Ellis D.A., Davidson B.I., Shaw H., Geyer K. Do smartphone usage scales predict behavior? International Journal of Human-Computer Studies. 2019;130:86–92. doi: 10.1016/j.ijhcs.2019.05.004.
- 10. Brunborg G.S., Andreas J.B. Increase in time spent on social media is associated with modest increase in depression, conduct problems, and episodic heavy drinking. Journal of Adolescence. 2019;74:201–209. doi: 10.1016/j.adolescence.2019.06.013.
- 11. Impact of Covid-19 pandemic on social media. https://en.wikipedia.org/wiki/Impact_of_the_COVID19_pandemic_on_social_media#Increase_in_usage (Last Accessed 2 March 2021)
- 12. Taylor D. Victoria News. 2020. COVID-19: Social media use goes up as country stays indoors. https://www.vicnews.com/news/covid-19-social-media-use-goes-up-as-country-stays-indoors/ (Last Accessed 28 February 2021)
- 13. Kemp S. 2020. Digital 2020: 3.8 Billion people use social media. https://wearesocial.com/digital-2020 (Last Accessed 28 February 2021)
- 14. Chung R.Y.N., Li M.M. Anti-Chinese sentiment during the 2019-nCoV outbreak. The Lancet. 2020;395(10225):686–687. doi: 10.1016/S0140-6736(20)30358-5.
- 15. Zarocostas J. How to fight an infodemic. The Lancet. 2020;395(10225):676. doi: 10.1016/S0140-6736(20)30461-X.
- 16. Eysenbach G. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. Journal of Medical Internet Research. 2009;11(1):e11. doi: 10.2196/jmir.1157.
- 17. Horvitz E., Mulligan D. Data, privacy, and the greater good. Science. 2015;349(6245):253–255. doi: 10.1126/science.aac4520.
- 18. Hu Z., Yang Z., Li Q., Zhang A., Huang Y. 2020. Infodemiological study on COVID-19 epidemic and COVID-19 infodemic.
- 19. Park H.W., Park S., Chong M. Conversations and medical news frames on Twitter: Infodemiological study on COVID-19 in South Korea. Journal of Medical Internet Research. 2020;22(5). doi: 10.2196/18897.
- 20. Oxford Analytica. Emerald Expert Briefings. 2020. Misinformation will undermine coronavirus responses. (oxan-db)
- 21. Sasaki Y., Kawai D., Kitamura S. The anatomy of tweet overload: How number of tweets received, number of friends, and egocentric network density affect perceived information overload. Telematics and Informatics. 2015;32(4):853–861. doi: 10.1016/j.tele.2015.04.008.
- 22. Sayce D. 2020. The Number of Tweets per day in 2020. https://www.dsayce.com/social-media/tweets-day/ (Last Accessed 28 February 2021)
- 23. Sun C., Zhai Z. The efficacy of social distance and ventilation effectiveness in preventing COVID-19 transmission. Sustainable Cities and Society. 2020;62. doi: 10.1016/j.scs.2020.102390.
- 24. Yoo W.H., Choi D.H., Park K.H. The effects of SNS communication: How expressing and receiving information predict MERS-preventive behavioral intentions in South Korea. Computers in Human Behavior. 2016;62:34–43. doi: 10.1016/j.chb.2016.03.058. ISSN 0747-5632.
- 25. Shahi G.K., Nandini D. FakeCovid--A Multilingual Cross-domain Fact Check News Dataset for COVID-19. arXiv preprint arXiv:2006.11343. 2020. doi: 10.36190/2020.14.
- 26. Shahi G.K., Dirkson A., Majchrzak T.A. An exploratory study of covid-19 misinformation on twitter. arXiv preprint arXiv:2005.05710. 2020. doi: 10.36190/2020.14.
- 27. Massaad E., Cherfan P. Social media data analytics on telehealth during the COVID-19 pandemic. Cureus. 2020;12(4). doi: 10.7759/cureus.7838.
- 28. Arianto D., Pui N. 2020. Social media analysis: Utilization of social media data for research on COVID-19.
- 29. Jia P., Yang S. Publisher Correction: China needs a national intelligent syndromic surveillance system. Nature Medicine. 2020;26(7):1149. doi: 10.1038/s41591-020-0977-2.
- 30. Htet H., Myint Y. Social Media (Twitter) Data Analysis using Maximum Entropy Classifier on Big Data Processing Framework (Case Study: Analysis of Health Condition, Education Status, State of Business). Journal of Pharmacognosy and Phytochemistry. 2018;7:695–700.
- 31. Barbosa L., Feng J. Coling 2010: Posters. 2010. Robust sentiment detection on twitter from biased and noisy data; pp. 36–44. August.
- 32. Lin Y. 2021. Twitter Statistics: 10 Twitter Statistics You Need to Know in 2021. https://www.oberlo.com/blog/twitter-statistics (Last Accessed 26 February 2021)
- 33. Al-Khafaji H.K., Habeeb A.T. Efficient algorithms for preprocessing and stemming of tweets in a sentiment analysis system. Journal of Computer Engineering. 2017;19(3):44–50. doi: 10.9790/0661-1903024450.
- 34. Singh V., Saini B. Department of Computer Engineering, National Institute of Technology Kurukshetra; Haryana, India: 2014. An Effective tokenization algorithm for information retrieval systems.
- 35. Kharde V., Sonawane P. Sentiment analysis of twitter data: a survey of techniques. arXiv preprint arXiv:1601.06971. 2016. doi: 10.5120/ijca2016908625.
- 36. Ardabili S.F., Mosavi A., Ghamisi P., Ferdinand F., Varkonyi-Koczy A.R., Reuter U., Rabczuk T., Atkinson P.M. medRxiv. 2020. doi: 10.1101/2020.04.17.20070094.
- 37. Lee Y., Rathore S., Park J.H., Park J.H. A blockchain-based smart home gateway architecture for preventing data forgery. Human-centric Computing and Information Sciences. 2020;10(1):1–14. doi: 10.1186/s13673-020-0214-5.
- 38. Jo J.H., Sharma P.K., Sicato J.C.S., Park J.H. Emerging technologies for sustainable smart city network security: Issues, challenges, and countermeasures. Journal of Information Processing Systems. 2019;15(4):765–784. doi: 10.3745/JIPS.03.0124.
- 39. Singh S.K., Jeong Y.S., Park J.H. A deep learning-based IoT-oriented infrastructure for secure smart city. Sustainable Cities and Society. 2020;60. doi: 10.1016/j.scs.2020.102252.
- 40. Park J.S., Park J.H. Future Trends of IoT, 5G Mobile Networks, and AI: Challenges, Opportunities, and Solutions. Journal of Information Processing Systems. 2020;16(4):743–749. doi: 10.3745/JIPS.03.0146.
- 41. Park J.H., Salim M.M., Jo J.H., Sicato J.C.S., Rathore S., Park J.H. CIoT-Net: a scalable cognitive IoT based smart city network architecture. Human-centric Computing and Information Sciences. 2019;9(1):1–20. doi: 10.1186/s13673-019-0190-9.
- 42. Anderson K., Smita L., Garrett D. PwC Health Research Institute; 2012. Social Media “Likes” Healthcare: From Marketing to Social Business; pp. 1–40. https://adindex.ru/files2/access/2013_06/99606_tpc-health-care-social-media-report.pdf (Last Accessed 26 February 2021)
- 43. Roser M., Ritchie H., Ortiz-Ospina E., Hasell J. Coronavirus pandemic (COVID-19). Our World in Data. 2020. https://ourworldindata.org/coronavirus [Online Resource]
- 44. Ardabili S.F., Mosavi A., Ghamisi P., Ferdinand F., Varkonyi-Koczy A.R., Reuter U.…Atkinson P.M. Covid-19 outbreak prediction with machine learning. Algorithms. 2020;13(10):249. doi: 10.1101/2020.04.17.20070094.
- 45. How fake news has exploited COVID-19. https://www.pwc.co.uk/issues/crisis-and-resilience/covid-19/how-fake-news-has-exploited-covid19-cyber.html
- 46. Cha J., Singh S.K., Kim T.W., Park J.H. Blockchain-empowered cloud architecture based on secret sharing for smart city. Journal of Information Security and Applications. 57:102686. doi: 10.1016/j.jisa.2020.102686.
- 47. Salim M.M., Rathore S., Park J.H. Distributed denial of service attacks and its defenses in IoT: a survey. The Journal of Supercomputing. 2019:1–44. doi: 10.1007/s11227-019-02945-z.













