. 2021 Mar 17;8(4):1030–1041. doi: 10.1109/TCSS.2021.3063820

Ranking of Importance Measures of Tweet Communities: Application to Keyword Extraction From COVID-19 Tweets in Japan

Ryosuke Harakawa 1, Masahiro Iwahashi 1
PMCID: PMC8545007  PMID: 35783148

Abstract

This article presents a method that detects tweet communities with similar topics and ranks the communities by importance measures. By identifying the tweet communities that have high importance measures, it is possible for users to easily find important information about the coronavirus disease (COVID-19). Specifically, we first construct a community network, whose nodes are tweet communities obtained by applying a community detection method to a tweet network. The community network is constructed based on textual similarities between tweet communities and sizes of tweet communities. Second, we apply algorithms for calculating centrality to the community network. Because the obtained centrality is based on tweet community sizes as well, we call it the importance measure in distinction to conventional centrality. The importance measure can simultaneously evaluate the importance of topics in the entire data set and occupancy (or dominance) of tweet communities in the network structure. We conducted experiments by collecting Japanese tweets about COVID-19 from March 1, 2020 to May 15, 2020. The results show that the proposed method is able to extract keywords that have a high correlation with the number of people infected with COVID-19 in Japan. Because users can browse the keywords from a small number of central tweet communities, quick and easy understanding of important information becomes feasible.

Keywords: Community detection, coronavirus, coronavirus disease (COVID-19), network analysis, network centrality, semantic understanding

I. Introduction

The outbreak of coronavirus disease (COVID-19) has seriously affected human health and economic activity around the world. During the COVID-19 epidemic, users search social media networks, such as Twitter,1 Weibo,2 and YouTube,3 as well as traditional media, such as television and radio, for information. In particular, Twitter is a very popular social media network [1], [2] and has become an important source of information [3]. Therefore, we use Twitter as the platform for this research. On Twitter, various tweets (i.e., short text messages), including information (misinformation, in some cases), have been disseminated widely [4]–[6]. This makes it difficult for users to understand the situation and acquire the relevant knowledge about COVID-19.

One of the effective solutions to this problem is the visualization of an overview of a large amount of content [7]–[12]. Qian et al. [7] explained that it is very time-consuming for users who are not familiar with a topic to browse large amounts of content and quickly gain a general understanding of it; therefore, it is important to automatically mine multiview opinions on the target topic. Our recent study [9] proposed a method for extracting tweet communities4 from a tweet network that represents the similarity between tweets. In this article, we define a set of tweets with similar topics as a tweet community. The obtained tweet communities enable us to gain a general understanding of many tweets. However, there remains the problem that it is difficult for users to browse all tweet communities as the number of tweet communities increases.

To solve this problem, we propose a method that detects tweet communities with similar topics from a tweet network and ranks the communities by importance measures. By identifying the tweet communities that have high importance measures, it is possible for users to easily find important information about COVID-19. Inspired by reports that network representation is useful for multimedia content analysis, including clustering [13] and tweet community extraction [9], we also employ a network-based approach. We intend the importance measure to represent the importance of each tweet community. The importance measure should simultaneously evaluate: i) importance of topics in the entire data set and ii) occupancy (or dominance) of tweet communities in the network structure. If the centrality of a tweet community is large but its size is small, the tweet community may be trivial because of oversplitting. On the other hand, a tweet community whose size is large but whose centrality is small is not guaranteed to include important topics in the entire data set. As the importance measure, we develop a new centrality that takes the size of a tweet community into account, because the centrality and the size satisfy i) and ii), respectively.

Specifically, the core algorithms and the novelty are as follows.

  • 1) We construct a community network, whose nodes are the tweet communities. Each node in the community network has a topic, which represents the meaning of the tweets in the corresponding tweet community. In the community network, a node that has edges with high weights includes a central topic in the entire data set, and the node is dominant in the network structure.

  • 2) We calculate the centrality of each node in the community network. Because the centrality is calculated using tweet community sizes as well, we call it the importance measure in distinction to conventional centrality.

  • 3) The novelty of our work lies in its calculation of the importance measure of a community network, rather than a tweet network. This helps us hierarchize the tweet communities to find important information about COVID-19, even as the number of tweet communities increases.

We conducted experiments by collecting 76000 Japanese tweets about COVID-19 dated between March 1, 2020 and May 15, 2020. In these experiments, we defined words that have a strong correlation with the number of people infected with COVID-19 as keywords. The results show that keywords extracted from only the central tweet communities that were detected by our method were similar to those extracted from all collected tweets. This implies that users can gain a general understanding of many tweets about COVID-19 by browsing only a small number of central tweet communities. This is useful for users to understand the situation and acquire the relevant knowledge about COVID-19 even when there is a flood of information (including misinformation, in some cases).

The rest of this article is organized as follows. Section II describes related work, comprising existing information technologies (in particular, data mining methods) that target COVID-19. In Section III, the proposed method for extracting tweet communities and ranking them by importance measures is explained. Section IV presents the experimental results for real tweets about COVID-19 in Japan and discusses the effectiveness of our method for extracting keywords that are related to the number of infected people. Finally, conclusions and future work are discussed in Section V.

II. Related Work

This section describes information technologies for COVID-19, in particular pioneering studies that utilize social media networks, to clarify the contribution of our study. Researchers have raised concerns about misinformation, myths, and conspiracies related to COVID-19 [4]–[6], [14], [15]. The term infodemic means the phenomenon characterized by a flood of information and misinformation. Cinelli et al. [4], Shahi et al. [5], and Medford et al. [6] reported that COVID-19 has caused an infodemic on social media networks, such as Twitter, Instagram,5 YouTube, and Reddit.6 Shahi et al. [5] highlighted the necessity of proposing actions for authorities to counter misinformation and hints for social media users on how to stop the spread of misinformation. Singh et al. [14] found myths in Twitter by manually defining myths, according to their frequency of appearance in different websites using the search phrase “Coronavirus Common Myths,” and defining how dangerous they were. Ferrara [15] reported that accounts that automatically post tweets (namely, bots) are used to promote conspiracy theories in the United States, in stark contrast with human users, who focus on public health and welfare.

Tracking and predicting events about COVID-19 on social media networks have been studied [16]–[19]. Hamzah et al. [16] proposed a Web platform called CoronaTracker. CoronaTracker provides a predictive model to forecast COVID-19 outbreaks within and outside China, based on daily observations. Furthermore, it can classify the news related to COVID-19 into negative and positive sentiments, to understand the influence of the news on people’s behavior, both politically and economically. Zhong et al. [17] proposed a susceptible–infected–removed (SIR) model-based method [20] for predicting the number of infected cases in China. Zheng et al. [18] also predicted the trend of COVID-19 in China. They combined an improved susceptible–infected model and a long short-term memory (LSTM) network [21] with news information extracted via natural language processing (NLP), to estimate the number of infected cases. Dynamic topic modeling [19] was proposed for analyzing the COVID-19 Twitter narrative among U.S. governors and presidential cabinet members, to track the evolution of subtopics related to risk, testing, and treatment.

Some researchers have constructed COVID-19-related data sets [22], [23]. Chen et al. [22] constructed a multilingual Twitter data set for stimulating the research community. In [23], an Arabic data set of tweets on COVID-19 since January 1, 2020 was presented.

Research on inferring or classifying the topics behind Twitter or Weibo posts has been conducted [24], [25]. Wicke and Bolognesi [24] analyzed the discourse around COVID-19 by applying latent Dirichlet allocation [26], a well-known topic modeling method, to a corpus of tweets sent during March and April 2020. Li et al. [25] classified Weibo posts about COVID-19, according to seven types of situational information, to find specific features for predicting the reposted amount of each type of information.

In addition, some review articles have been published [27], [28], discussing information technologies, including artificial intelligence [27] and data science [28], for tackling the COVID-19 epidemic.

Our work is the first attempt to clarify topics (in particular, keywords that have a high correlation with the number of people infected with COVID-19 in Japan) on the basis of complex network analysis with NLP. As described in Section I, the technical novelty of our method is that we hierarchize the tweet communities by calculating the importance measures of each tweet community, rather than those of each tweet.

III. Ranking of Tweet Communities

To gain a general understanding of many tweets about COVID-19, we present a method that detects tweet communities with similar topics and ranks these communities by importance measures. In Section III-A, our method for Twitter data acquisition is described. The proposed method consists of two phases: construction of a community network (Section III-B) and ranking of tweet communities (Section III-C) [see Fig. 1].

Fig. 1.

Overview of Sections III-B and III-C. In Section III-B, two types of networks are constructed. First, we construct a tweet network whose nodes are tweets, which represents similarities between the tweets. Second, we construct a community network whose nodes are tweet communities, which represents similarities between the tweet communities. In Section III-C, importance measures of each tweet community are calculated. We display tweet communities in descending order of the importance measures. This helps users judge which of the many communities they should read.

A. Data Acquisition

From March 1, 2020 to May 15, 2020, we collected 1000 Japanese tweets per day. In Japan, a state of emergency was declared by the government on April 7. Therefore, people were under heightened tension, especially during the above period. By using the query “a novel coronavirus” (in Japanese), we collected tweets by a keyword search, via an open-source Twitter tool.7 Moreover, because personal communication is not relevant to the task of tweet community detection, we removed reply tweets, as in our previous study [9]. Furthermore, we removed URL strings beginning with an “http” or “pic” prefix. (In Twitter, an attached image is represented as a shortened URL that starts with “pic.”) In this way, we constructed 76 data sets for the experiment (one for each day).
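The URL removal step can be sketched as follows. This is a minimal illustration, not the tool used in the study: the function name `strip_urls` and the exact patterns (links starting with `http://` or `https://`, and image links starting with `pic.twitter.com/`) are assumptions made for this example.

```python
import re

# Assumed patterns: web links ("http://", "https://") and Twitter image links
# ("pic.twitter.com/..."). Each match runs until the next whitespace character.
URL_LIKE = re.compile(r"(?:https?://|pic\.twitter\.com/)\S+")


def strip_urls(text: str) -> str:
    """Remove URL-like strings and collapse the whitespace left behind."""
    without_urls = URL_LIKE.sub(" ", text)
    return " ".join(without_urls.split())


cleaned = strip_urls("新型コロナ https://t.co/abc pic.twitter.com/xyz 速報")
print(cleaned)
```

Because the match runs to the next whitespace, a URL that is directly followed by text without a space would take that text with it; a stricter pattern may be needed for such tweets.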

B. Construction of Community Network

As in our previous work on tweet community detection [9], we employ a network-based approach. In the experiment presented in Section IV, we performed the subsequent processing on each of the 76 data sets separately.

First, for each data set, we represented each tweet as $t_i$ ($i = 1, 2, \ldots, N$, where $N$ is the number of tweets in one data set). Because we collected only Japanese tweets, we performed the following processing. Using a natural language processing tool called Janome (https://mocobeta.github.io/janome/en/), we performed morphological analysis and extracted only nouns. Then, we removed the stop words defined in https://www.kaggle.com/lazon282/japanese-stop-words. We also removed words that consist of only one character because they are likely to be trivial symbols or numbers. Note that Japanese nouns do not inflect (for example, singular and plural nouns are not distinguished). Thus, we do not perform lemmatization.
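The filtering described above can be sketched as follows. To keep the example self-contained, it works on (surface, part-of-speech) pairs rather than calling a tokenizer; the function name `select_nouns`, the sample tokens, and the one-word stop list are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def select_nouns(tokens: Iterable[Tuple[str, str]], stop_words: Set[str]) -> List[str]:
    """Keep noun surfaces that are not stop words and are longer than one character.

    Each token is a (surface, part_of_speech) pair. The part-of-speech string is
    assumed to start with its top-level category, e.g. "名詞,一般,*,*" for a noun.
    """
    kept = []
    for surface, part_of_speech in tokens:
        if part_of_speech.split(",")[0] != "名詞":  # keep nouns only
            continue
        if surface in stop_words:                   # drop stop words
            continue
        if len(surface) <= 1:                       # drop one-character words
            continue
        kept.append(surface)
    return kept


# With Janome installed, the pairs could be produced like this (not run here):
#   from janome.tokenizer import Tokenizer
#   pairs = [(t.surface, t.part_of_speech) for t in Tokenizer().tokenize(text)]
example = [
    ("感染", "名詞,サ変接続,*,*"),
    ("が", "助詞,格助詞,一般,*"),
    ("拡大", "名詞,サ変接続,*,*"),
    ("こと", "名詞,非自立,一般,*"),
    ("人", "名詞,接尾,一般,*"),
]
print(select_nouns(example, stop_words={"こと"}))
```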

We then extracted textual features $f_i$ that represent the semantics of each tweet $t_i$. Because some tweets have poor grammar and context, features that consider only word frequencies are expected to be more suitable than embedding-based features that consider the word order. In fact, the article [29] reports that the term frequency–inverse document frequency (TF–IDF) features [30] have more discriminative power than Doc2Vec [31]. Motivated by this fact, we use TF–IDF features as $f_i$.

Following the report that a $k$-nearest neighbors ($k$-NN) network is usually suitable for adapting to data set properties [13], we constructed a $k$-NN network using the features $f_i$. Specifically, for each tweet $t_i$, we calculated cosine similarities between $f_i$ (the TF–IDF features of $t_i$) and $f_j$ (the TF–IDF features of the other tweets $t_j$, $j \neq i$). From the tweets $t_j$, we selected the $k$ tweets with the highest similarities. By connecting $t_i$ and each of the selected $t_j$ with unweighted edges, we constructed the $k$-NN network. The obtained $k$-NN network represents the relationships between tweet semantics. A $k$-NN network based on TF–IDF features was also used for tweet community extraction in our recent study [9]. In this article, we define the obtained $k$-NN network as the tweet network $G_t$.
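The construction above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the smoothed IDF formula is one common TF–IDF variant and may differ from the one used in the study, ties in similarity are broken by tweet index, and all function names are chosen for this example.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors; the smoothed IDF used here is an assumed variant."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, count in term_freq.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(x * v.get(word, 0.0) for word, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_edges(vectors: List[Dict[str, float]], k: int) -> Set[Tuple[int, int]]:
    """Undirected, unweighted edges linking each tweet to its k most similar tweets."""
    edges = set()
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))  # highest similarity first
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


docs = [["感染", "拡大"], ["感染", "拡大", "防止"], ["マスク", "配布"], ["マスク", "政府"]]
edges = knn_edges(tfidf_vectors(docs), k=1)
print(sorted(edges))
```

Storing each edge as a sorted index pair makes the network undirected, so an edge chosen from either endpoint is kept only once.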

Using the tweet network $G_t$, we detect tweet communities with similar topics. Following the reports that the Louvain method [32] works well for multimedia content clustering [9], [11], [33], [34], we apply the Louvain method [32] to $G_t$. The Louvain method is based on a quality measure of community detection results called modularity [35]. The modularity $Q$ is defined as

$$Q = \frac{1}{2|E|} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ A_{ij} - \frac{d_i d_j}{2|E|} \right] \delta(t_i, t_j) \tag{1}$$

Here

$$|E| = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} A_{ij}, \qquad d_i = \sum_{j=1}^{N} A_{ij}$$

where $\delta(t_i, t_j)$ is 1 if the tweets $t_i$ and $t_j$ belong to the same tweet community and 0 otherwise. Also, $A_{ij}$ denotes the existence of an edge between $t_i$ and $t_j$; thus, $A_{ij}$ is 1 if an edge between $t_i$ and $t_j$ exists in the tweet network and 0 otherwise. By recursively maximizing $Q$, we obtain tweet communities $c_m$ ($m = 1, 2, \ldots, M$, where $M$ is the number of communities) containing semantically similar tweets. The details of the algorithm are shown in Algorithm 1.
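As a concrete check of the modularity that the Louvain method maximizes, the sketch below evaluates the standard modularity of a given partition of an unweighted, undirected graph. The function name and the toy graph (two triangles joined by one edge) are illustrative assumptions, and the sketch evaluates a partition only; it does not perform the Louvain optimization itself.

```python
from typing import Dict, Set, Tuple


def modularity(edges: Set[Tuple[int, int]], community: Dict[int, int]) -> float:
    """Modularity Q of a partition of an unweighted, undirected graph without self-loops.

    `edges` holds each undirected edge once; `community` maps node -> community label.
    """
    n_edges = len(edges)
    if n_edges == 0:
        return 0.0
    degree = {node: 0 for node in community}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Sum of A_ij over ordered pairs in the same community: each such edge counts twice.
    intra = sum(2 for u, v in edges if community[u] == community[v])
    # Sum of d_i * d_j over ordered pairs in the same community equals the sum,
    # over communities, of the squared total degree of the community.
    total_degree: Dict[int, int] = {}
    for node, label in community.items():
        total_degree[label] = total_degree.get(label, 0) + degree[node]
    expected = sum(t * t for t in total_degree.values()) / (2 * n_edges)
    return (intra - expected) / (2 * n_edges)


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
toy_edges = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)}
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
merged = {node: 0 for node in range(6)}
print(modularity(toy_edges, split), modularity(toy_edges, merged))
```

Splitting the toy graph into its two triangles gives a positive modularity, whereas placing all nodes in one community gives zero, which is why the split is preferred.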

Algorithm 1 Detection of Tweet Communities by the Louvain Method [32]

  • Input: Tweet network $G_t$ whose nodes are tweets $t_i$ ($i = 1, 2, \ldots, N$).

  • Output: Tweet communities $c_m$ ($m = 1, 2, \ldots, M$).

  • 1: Assign each node $t_i$ to its own tweet community.

  • 2: while an improvement of the modularity $Q$ (in Eq. (1)) of $G_t$ is obtained do

  • 3:     while an improvement of $Q$ of $G_t$ is obtained do

  • 4:         /* Local maximization of modularity */

  • 5:         for each node of $G_t$ do

  • 6:             Evaluate the gain of $Q$ obtained by moving the node to each tweet community that contains one of its neighboring nodes.

  • 7:             Reassign the node to the tweet community that gives the maximum positive gain of $Q$.

  • 8:         end for

  • 9:         Calculate $Q$ of $G_t$.

  • 10:     end while

  • 11:     Update the obtained tweet communities as $c_m$ ($m = 1, 2, \ldots, M$).

  • 12:     /* Updating a new network */

  • 13:     Update $G_t$ to a new network, with self-loops, whose nodes are the obtained tweet communities and in which each edge weight is the sum of the corresponding edge weights of the original network.

  • 14: end while

  • 15: Return the tweet communities $c_m$ ($m = 1, 2, \ldots, M$).

Finally, we construct a community network $G_c$, in which tweet communities with central topics (where topics are the meanings that represent tweets in the community network) are densely linked to other communities. Each node of $G_c$ is one of the obtained tweet communities; therefore, we can write the nodes as $c_m$ ($m = 1, 2, \ldots, M$). The edge weight $w_{mn}$, from node $c_m$ to node $c_n$, is defined as follows:

[Equation (2)]

where $|c_m|$ is the number of tweets contained in $c_m$ and $t_i^{m}$ denotes a tweet in the tweet community $c_m$. We do not place an edge between $c_m$ and $c_n$ if none of the tweets in $c_m$ and $c_n$ are connected by edges in the tweet network. Equation (2) can simultaneously evaluate: i) importance of topics in the entire data set and ii) occupancy (or dominance) in the network structure. Specifically, the numerator shows i). Although the denominator is a normalization term, the logarithm function reduces the overnormalization when the community sizes are large. Therefore, the denominator shows ii).

C. Ranking of Tweet Communities

Having obtained the community network $G_c$, we hierarchize the tweet communities $c_m$ ($m = 1, 2, \ldots, M$). The input of our algorithm is $G_c$, and the output is the result of sorting the tweet communities $c_m$ in descending order of their importance measures $I_m$. As described above, performing the importance measure calculation on a community network, rather than a tweet network, is the novelty of this study.

Specifically, we calculate the centrality, i.e., degree centrality, closeness centrality, betweenness centrality, and hyperlink-induced topic search (HITS) centrality [36], of each node $c_m$ in the community network $G_c$. We call the obtained centrality values importance measures and denote them by $I_m$. They are calculated as follows.

Degree Centrality: The degree centrality is the most primitive centrality measure and is defined as

$$I_m = d(c_m)$$

where $d(c_m)$ is the weighted degree of the node $c_m$ in the community network $G_c$. A node $c_m$ with a high degree centrality is similar to its neighbor nodes.

Closeness Centrality: The closeness centrality is defined as

$$I_m = \frac{1}{\sum_{n \neq m} \mathrm{dist}(c_m, c_n)}$$

where $\mathrm{dist}(c_m, c_n)$ is the shortest path distance from $c_m$ to $c_n$. Thus, the closeness centrality represents the accessibility of each node in $G_c$.

Betweenness Centrality: The betweenness centrality is defined as

$$I_m = \sum_{p \neq m \neq q} \frac{\sigma_{pq}(c_m)}{\sigma_{pq}}$$

where $\sigma_{pq}$ denotes the number of shortest paths from $c_p$ to $c_q$ and $\sigma_{pq}(c_m)$ denotes the number of such paths that pass through $c_m$. In this article, we calculate the shortest paths considering edge weights in $G_c$. Thus, $I_m$ represents the importance of $c_m$ in information propagation in $G_c$.

HITS Centrality: The HITS algorithm is equivalent to principal component analysis (PCA) of the network structure [37]. First, we represent the community network $G_c$ in the form of an adjacency matrix $W$. The elements of $W$ are the edge weights of $G_c$. The HITS algorithm calculates the principal eigenvector $v$ of $W^{\top} W$. The $m$th element of $v$ becomes the importance measure $I_m$. Note that eigenvector centrality is another well-known centrality measure; it is given by the principal eigenvector of $W$. In this study, $G_c$ is an undirected graph; thus, $W$ is a symmetric matrix. According to basic linear algebra, the eigenvectors of $W^{\top} W$ and $W$ are then the same. Therefore, in this study, HITS centrality is equivalent to eigenvector centrality.
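The equivalence stated above can be verified in one line. The following is a sketch in which $W$ denotes the symmetric adjacency matrix of the community network and $(\lambda, v)$ an eigenpair of $W$; these symbols are chosen here for illustration.

```latex
W^{\top} = W
\;\Longrightarrow\;
W^{\top} W = W^{2},
\qquad
W v = \lambda v
\;\Longrightarrow\;
W^{\top} W v = W (W v) = \lambda^{2} v .
```

Hence every eigenvector of $W$ is an eigenvector of $W^{\top} W$ with eigenvalue $\lambda^{2}$. The principal eigenvectors coincide when the eigenvalue of largest magnitude of $W$ is unique; if $-\lambda_{\max}$ is also an eigenvalue, as in a bipartite network, the leading eigenspace of $W^{\top} W$ is two-dimensional and the correspondence is no longer one-to-one.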

In the community network $G_c$, a node that has edges with high weights includes a central topic (where topics are meanings that represent tweets in the community network) in the entire data set, and it is dominant in the network structure. Therefore, displaying the tweet communities $c_m$ in descending order of their importance measures $I_m$ enables users to easily find important information about COVID-19, even if many tweet communities are obtained.
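The ranking step can be illustrated with a small sketch that computes the weighted degree and, by power iteration, the HITS authority scores and the eigenvector centrality of a symmetric weight matrix. The 4-node matrix, the function names, and the fixed iteration count are assumptions made for this example; the matrix is not taken from the experiments.

```python
from typing import List

Matrix = List[List[float]]


def weighted_degree(W: Matrix) -> List[float]:
    """Weighted degree of each node: the sum of its edge weights."""
    return [sum(row) for row in W]


def _matvec(W: Matrix, v: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def _normalize(v: List[float]) -> List[float]:
    total = sum(v)  # entries stay nonnegative for a nonnegative matrix
    return [x / total for x in v]


def eigenvector_centrality(W: Matrix, iterations: int = 1000) -> List[float]:
    """Principal eigenvector of W, approximated by power iteration."""
    v = [1.0] * len(W)
    for _ in range(iterations):
        v = _normalize(_matvec(W, v))
    return v


def hits_authority(W: Matrix, iterations: int = 1000) -> List[float]:
    """Principal eigenvector of W^T W, approximated by power iteration."""
    W_t = [list(col) for col in zip(*W)]
    v = [1.0] * len(W)
    for _ in range(iterations):
        v = _normalize(_matvec(W_t, _matvec(W, v)))
    return v


def rank(scores: List[float]) -> List[int]:
    """Node indices sorted in descending order of score."""
    return sorted(range(len(scores)), key=lambda m: -scores[m])


# Hypothetical symmetric community network: a weighted triangle (0, 1, 2) plus node 3 tied to node 2.
W = [
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 3.0],
    [0.0, 0.0, 3.0, 0.0],
]
hits = hits_authority(W)
eig = eigenvector_centrality(W)
print(rank(hits), rank(weighted_degree(W)))
```

On this symmetric example the two power iterations converge to the same vector, in line with the equivalence of HITS and eigenvector centrality for undirected networks.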

IV. Experimental Results and Discussion

In this section, experimental results for real Twitter data are presented and discussed to verify the effectiveness of the proposed method.

A. Quantitative Discussion

We attempt to quantitatively discuss the point that our method enables users to easily find important information about COVID-19 from many tweets. To do this, we evaluate the accuracy of the extraction of keywords about COVID-19, as explained next.

1). Ground Truth:

We define the keywords about COVID-19 by focusing on their correlation with the number of infected people. First, we collected the number of new COVID-19 infections per day in Japan from March 1, 2020 to May 15, 2020. The number of new COVID-19 infections is published by Google based on the Wikipedia statistics.8 Reports from local health centers to the Ministry of Health, Labour and Welfare are sometimes delayed because the health centers are closed on holidays. As a result, the number of new COVID-19 infections fluctuates depending on the day of the week. To remove the influence of this fluctuation on the subsequent analysis, we calculated three-day moving averages [see Fig. 2].
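A three-day moving average can be sketched as follows. The text does not state how the window is aligned or how the first and last days are handled, so the centered window and the shortened windows at both ends are assumptions of this example, chosen so that the output keeps one value per day.

```python
from typing import List


def moving_average_3(values: List[float]) -> List[float]:
    """Centered three-day moving average.

    Assumption: the first and last days are averaged over the two days that
    are available, so the output has the same length as the input.
    """
    smoothed = []
    n = len(values)
    for i in range(n):
        window = values[max(0, i - 1): min(n, i + 2)]
        smoothed.append(sum(window) / len(window))
    return smoothed


print(moving_average_3([3, 6, 9, 0, 3]))
```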

Fig. 2.

Number of new COVID-19 infections per day in Japan from March 1, 2020 to May 15, 2020. (a) Raw data. (b) Three-day moving averages.

Here, $x$ denotes the 76-D vector that contains the number of infections (after the moving average) for each day. Second, for each day, we counted the number of times that each word appeared in tweets. If a word appeared multiple times in one tweet, we counted it only once to reduce the influence of tweets in which the same word is repeated many times. Also, we ignored the query words used in data acquisition because they appeared in all tweets. Thus, for each word $w$, we obtained a 76-D vector $y_w$ that contains the number of times that the word appeared in tweets on each day.

Furthermore, we calculated the Pearson correlation coefficient (CC) between the infection vector $x$ and the vector $y_w$ of each word. The article [38] reports that $|\mathrm{CC}| \geq 0.4$ (where $|\mathrm{CC}|$ is the absolute value of CC) shows substantial correlation. In the medicine field, $|\mathrm{CC}| = 0.4$ can be interpreted as “Fair” correlation among “None,” “Poor,” “Fair,” “Moderate,” “Very Strong,” and “Perfect” correlations. In the psychology field, we can interpret $|\mathrm{CC}| = 0.4$ as “Moderate” correlation among “Zero,” “Weak,” “Moderate,” “Strong,” and “Perfect” correlations. In the politics field, $|\mathrm{CC}| = 0.4$ can be interpreted as “Strong” correlation among “None,” “Negligible,” “Weak,” “Moderate,” “Strong,” “Very Strong,” and “Perfect” correlations. According to this report, we defined words with $|\mathrm{CC}| \geq 0.4$ as the ground truth of the keywords. Hereafter, this set of keywords is denoted by $K_{\mathrm{GT}}$. We defined the keywords in this way because we considered that words with a high correlation with the number of infections contained semantics relevant to the surrounding situation and necessary knowledge, such as the countermeasures.
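The keyword definition can be sketched as follows: count, for each day, the tweets that contain each word (once per tweet), then keep the words whose daily counts correlate with the infection series. The function names, the three-day toy data, and the threshold value used in the call are illustrative assumptions; the threshold is passed in as a parameter rather than fixed.

```python
import math
from typing import Dict, List, Set


def daily_word_counts(tweets_per_day: List[List[List[str]]]) -> Dict[str, List[int]]:
    """For each word, the number of tweets containing it on each day (once per tweet)."""
    n_days = len(tweets_per_day)
    counts: Dict[str, List[int]] = {}
    for day, tweets in enumerate(tweets_per_day):
        for words in tweets:
            for word in set(words):  # a repeated word counts once per tweet
                counts.setdefault(word, [0] * n_days)[day] += 1
    return counts


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_keywords(counts: Dict[str, List[int]], infections: List[float],
                    threshold: float) -> Set[str]:
    """Words whose absolute correlation with the infection series reaches the threshold."""
    return {w for w, y in counts.items() if abs(pearson(infections, y)) >= threshold}


# Toy data: three days, each day a list of tokenized tweets.
toy_days = [
    [["a", "a", "b"]],
    [["a"], ["a", "b"]],
    [["a"], ["a"], ["a"]],
]
toy_counts = daily_word_counts(toy_days)
toy_infections = [10.0, 20.0, 30.0]
print(select_keywords(toy_counts, toy_infections, threshold=0.9))
```

Because the absolute value is used, words that become rarer as infections rise are selected as well as words that become more frequent.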

2). Comparative Methods:

In this experiment, we compared the following ten cases.

Cases 1–4: The cases in which tweet communities are displayed in descending order of the proposed importance measures. Cases 1–4 use degree centrality, closeness centrality, betweenness centrality, and HITS centrality, respectively.

Cases 5–8: The cases in which tweet communities are displayed in descending order of comparative measures. The comparative measures are calculated from (2) with its size-dependent normalization term replaced. Thus, these cases only consider the importance of topics in the entire data set. Cases 5–8 use degree centrality, closeness centrality, betweenness centrality, and HITS centrality, respectively.

Case 9: The case in which tweet communities are displayed in descending order of tweet community sizes. Thus, this case only considers the occupancy of tweet communities in the network structure.

Case 10: The case in which tweet communities are displayed in a random order.

For each case, we extracted as keywords the words that appeared in tweets contained in the displayed tweet communities, in the same manner as the extraction of the ground truth $K_{\mathrm{GT}}$. Furthermore, we denote the keywords obtained in cases 1–10 by $K_1, K_2, \ldots, K_{10}$, respectively.

3). Evaluations:

For the quantitative discussion, we use the Jaccard index, recall, and precision. The Jaccard index is a frequently used metric that represents the overlap between two sets. The recall represents the comprehensiveness. The precision represents the ratio of correct keywords (i.e., keywords included in the ground truth) to keywords that are displayed to users. They are defined as follows:

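$$\mathrm{Jaccard\ index} = \frac{|K_{\mathrm{GT}} \cap K|}{|K_{\mathrm{GT}} \cup K|}, \qquad \mathrm{Recall} = \frac{|K_{\mathrm{GT}} \cap K|}{|K_{\mathrm{GT}}|}, \qquad \mathrm{Precision} = \frac{|K_{\mathrm{GT}} \cap K|}{|K|}$$

where $K_{\mathrm{GT}}$ is the ground-truth keyword set and $K$ is the set of keywords obtained in the case under evaluation.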
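The three metrics are simple set operations, as the following sketch shows. The function names and the two toy keyword sets are chosen for illustration and are not results from the experiments.

```python
from typing import Set


def jaccard(truth: Set[str], found: Set[str]) -> float:
    """Overlap between the two sets: |intersection| / |union|."""
    union = truth | found
    return len(truth & found) / len(union) if union else 0.0


def recall(truth: Set[str], found: Set[str]) -> float:
    """Share of ground-truth keywords that were found."""
    return len(truth & found) / len(truth) if truth else 0.0


def precision(truth: Set[str], found: Set[str]) -> float:
    """Share of displayed keywords that are in the ground truth."""
    return len(truth & found) / len(found) if found else 0.0


toy_truth = {"emergency", "declaration", "hospital"}
toy_found = {"emergency", "declaration", "society"}
print(jaccard(toy_truth, toy_found), recall(toy_truth, toy_found), precision(toy_truth, toy_found))
```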

Following the principle of tenfold cross validation, we randomly extracted 900 tweets from each data set, calculated the above metrics for the extracted tweets, and repeated this ten times. By calculating the mean and standard deviation of the ten validations, we attempt to assess the effectiveness of the proposed method accurately.

Next, we show the Jaccard index, recall, and precision for cases 1–10. Note that the size-based method (case 9) displays large tweet communities; therefore, the number of displayed keywords is likely to be larger than in the other methods. This may make the comparison unfair. Based on the report on human short-term memory [39], we assume that users memorize only the seven most frequent keywords in each tweet community. To perform a fair comparison based on this practical assumption, Fig. 3 shows the evaluation results for cases 1–4, 9, and 10. For the $k$-NN network construction in Section III-B, we should avoid a value of $k$ that causes tweet communities to be detected erroneously. A large $k$ is not suitable because it produces a network that is too dense to reveal the community structure. For this reason, we set $k$ to 3 here. From Fig. 3, we can observe that the proposed method (cases 1–4) achieves better results than the size-based method (case 9) and the random method (case 10), for every metric. In particular, the superiority of the proposed method to the random method is statistically significant. For the Jaccard index, the p-values of Welch’s t-test when comparing cases 1–4 with case 10 are 0.001, 0.003, 0.002, and 0.002, respectively. For the recall, they are 0.002, 0.004, 0.002, and 0.002, respectively. For the precision, they are 0.024, 0.017, 0.003, and 0.007, respectively. Furthermore, Fig. 4 shows the evaluation results for cases 5–10. The comparative variants of the proposed method (cases 5–8) perform worse than the proposed method (cases 1–4 in Fig. 3). Although cases 5–8 outperform the random method (case 10), they perform worse than the size-based method (case 9). This shows the necessity of simultaneously evaluating the importance of topics in the entire data set and occupancy of tweet communities in the network structure. Thus, the validity of the proposed method can be confirmed.
We can also observe that the results with degree centrality, closeness centrality, betweenness centrality, and HITS centrality are almost the same. This may be because the global structure and the local structure are similar owing to the small size of the community network. If the community network becomes large and dense, HITS centrality may become more effective because it is equivalent to PCA and can exploit the global structure even for a large and dense network.

Fig. 3.

(a) Jaccard index. (b) Recall. (c) Precision. The horizontal axis shows the number of tweet communities displayed to users. The data points and error bars show the mean and standard deviation of the results of ten validations. The means for each case are shown in parentheses. Seven most frequent keywords in each tweet community are displayed to users. We show results when $k$ was set to 3 in the $k$-NN network construction in Section III-B.

Fig. 4.

(a) Jaccard index. (b) Recall. (c) Precision. The notation of these figures is the same as in Fig. 3. Seven most frequent keywords in each tweet community are displayed to users. We show the results when $k$ was set to 3 in the $k$-NN network construction in Section III-B.

Table I shows examples of the keywords extracted as the ground truth, those identified by the proposed method, and those found by random selection. Our method extracted keywords about the declaration of a state of emergency (“official announcement,” “declaration,” and “emergency”). As explained above, a state of emergency was declared by the government in Japan on April 7. “Business suspension” may appear because of the request for business suspension by the government, to prevent the spread of COVID-19 infection. Conversely, our method and the random method incorrectly detected “society” and “digital,” respectively, as keywords. These words seem to be too general to capture COVID-19-related topics. We notice that the random method did not detect any correct keywords. From this fact, we can confirm the superiority of our method for ranking tweet communities in descending order of the importance measures.

TABLE II. Correspondence Between the English Translation and the Original Japanese for the Extracted Keywords in Fig. 7.

Keywords shown in Fig. 7(b) (English translation): infection; expansion; cancellation; prevention; schedule; influence; notification

Keywords shown in Fig. 7(c) (English translation): pneumonia; infection; misinformation; information; countermeasure; welfare; influence

Keywords shown in Fig. 7(d) (English translation): infection; Hyogo (a prefecture in Japan); confirmation; man; Nishinomiya (a city in Hyogo Prefecture); within the prefecture; Osaka

TABLE III. Correspondence Between the English Translation and the Original Japanese for the Extracted Keywords in Fig. 8.

Keywords shown in Fig. 8(b) (English translation): mask; infection; countermeasure; distribution; prevention; Abe (name of the Prime Minister in Japan at that time); government

Keywords shown in Fig. 8(c) (English translation): mask; Abe (name of the Prime Minister in Japan at that time); government; infection; household; Prime Minister; countermeasure

Keywords shown in Fig. 8(d) (English translation): infection; confirmation; news; announcement; within the prefecture; NHK (the abbreviation of Japan Broadcasting Corporation); man

TABLE IV. Correspondence Between the English Translation and the Original Japanese for the Extracted Keywords in Fig. 9.
Keywords shown in Fig. 9(b)
English translation Original Japanese
infection
deceased
patient
Tokyo
hospital
occurrence
announcement
Keywords shown in Fig. 9(c)
English translation Original Japanese
infection
news
NHK (the abbreviation of Japan Broadcasting Corporation)
confirmation
deceased
Tokyo
Hokkaido
Keywords shown in Fig. 9(d)
English translation Original Japanese
infection
information
video
relevance
self-restraint
publication
countermeasure
TABLE I. Examples of Keywords Extracted as the Ground Truth ( Inline graphic), Keywords Identified by the Proposed Method ( Inline graphic), and Keywords Found by Random Selection ( Inline graphic). The Number of Displayed Tweet Communities, the Number of Displayed Keywords From Each Tweet Community, and the Value of Inline graphic Are 15, 7, and 3, Respectively.
Ground truth
English translation of keywords Keywords in Japanese CC (correlation coefficient)
business suspension 0.67
official announcement 0.61
hand sanitizer gel 0.61
hospital 0.57
emergency 0.46
declaration 0.46
state 0.46
China −0.58
Italy −0.48
cruise −0.48
princess −0.44
diamond −0.41
Inline graphic: The asterisk shows incorrectly detected keywords.
English translation of keywords Keywords in Japanese CC (correlation coefficient)
business suspension 0.52
emergency 0.44
declaration 0.43
official announcement 0.41
* society * * 0.40
influence −0.40
Inline graphic: The asterisk shows incorrectly detected keywords.
English translation of keywords Keywords in Japanese CC (correlation coefficient)
* digital * * 0.44
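The CC column in Table I can be illustrated with a short sketch. It assumes that the CC is Pearson's correlation coefficient between a keyword's frequency series and the series of infected people over the same days; the function name `pearson_cc` and both toy series are hypothetical and are not taken from the experimental data.

```python
import math


def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy series (assumption: made-up daily values for illustration).
keyword_counts = [1, 2, 3, 4, 5]    # daily occurrences of one keyword
infected_counts = [2, 4, 5, 4, 5]   # daily number of infected people

cc = pearson_cc(keyword_counts, infected_counts)
print(f"CC = {cc:.2f}")
```

A keyword whose frequency rises and falls with the infection counts yields a positive CC, whereas a keyword that fades as infections grow yields a negative one, as with the negative entries in Table I.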

4). Verification Using Another k Value:

To test another value of k, Fig. 5 shows the evaluation results where k was set to 6. Even in this setting, we can observe the effectiveness of the proposed method (cases 1–4). In particular, we can confirm that the differences between the proposed method and the random method (case 10) are statistically significant. For the Jaccard index, the p-values of Welch’s t-test when comparing cases 1–4 with case 10 are 0.000, 0.000, 0.005, and 0.000, respectively. For the recall, those are 0.000, 0.000, 0.005, and 0.000, respectively. For the precision, those are all 0.000.
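For readers unfamiliar with Welch’s t-test, the following minimal sketch computes the test statistic and the Welch–Satterthwaite degrees of freedom for two samples with possibly unequal variances. The function name `welch_t` and the two toy samples are hypothetical and do not reproduce the experimental scores. The p-value additionally requires the t-distribution; one option is `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors, using the unbiased sample variance.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Toy samples (assumption: made-up scores for illustration).
proposed_scores = [0.30, 0.35, 0.32, 0.38, 0.33]
random_scores = [0.10, 0.12, 0.08, 0.11, 0.09]

t_stat, dof = welch_t(proposed_scores, random_scores)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.2f}")
```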

Fig. 5.


(a) Jaccard index. (b) Recall. (c) Precision. The notation of these figures is the same as in Fig. 3. The seven most frequent keywords in each tweet community are displayed to users. We show the results when k was set to 6 in the k-NN network construction in Section III-B.

In general, it is difficult to find the best k for the topic extraction. To overcome this difficulty, we previously proposed a method [9] that collaboratively integrates community detection results obtained with multiple k values. Our future work includes the investigation of suitable k values.

5). Performance Limitation in the Proposed Method:

In Sections IV-A3 and IV-A4, evaluations were performed with only the seven most frequent keywords in each tweet community displayed to users. Here, we perform evaluations with all keywords in each tweet community displayed to users. This condition is not practical because users would need a long time to read so many keywords. Therefore, the evaluations here aim to verify the performance limitation of the proposed method. Fig. 6 shows the evaluation results (where k was set to 3). We can observe that the performance of the proposed method (cases 1–4) is almost the same as that of the size-based method (case 9). As described in Section IV-A3, the size-based method displays more keywords than the proposed method. Thus, the correct keywords are likely to be included among the many displayed keywords, which makes its performance comparable with that of the proposed method. In summary, the proposed method is especially effective in the practical condition where users can read only a limited number of keywords. In the case where users can read all keywords, the size-based method is substantially effective as well.

Fig. 6.


(a) Jaccard index. (b) Recall. (c) Precision. The notation of these figures is the same as in Fig. 3. All keywords in each tweet community are displayed to users. We show the results when k was set to 3 in the k-NN network construction in Section III-B.

B. Examples of Displayed Tweet Communities

In this section, we show examples of the tweet communities that are displayed to users. Figs. 7–9 show three tweet communities, in descending order of importance measures, for March 1, April 1, and May 1, respectively. Here, we show the results obtained by the importance measures with HITS centrality (case 4).
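To make the ranking step concrete, the following sketch runs HITS by power iteration on a small weighted community network and sorts the communities by authority score. The weight matrix `toy_weights` is hypothetical; it merely stands in for the community network of Section III, whose edge weights reflect textual similarities and community sizes. The function name `hits_authority` is likewise an assumption and not the implementation used in the experiments.

```python
import math


def hits_authority(weights, iterations=100):
    """Authority scores from HITS power iteration on a weighted adjacency matrix."""
    n = len(weights)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority update: weighted sum of hub scores over incoming edges.
        auth = [sum(weights[j][i] * hub[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in auth)) or 1.0
        auth = [a / norm for a in auth]
        # Hub update: weighted sum of authority scores over outgoing edges.
        hub = [sum(weights[i][j] * auth[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        hub = [h / norm for h in hub]
    return auth


# Toy symmetric community network with four communities (assumption).
toy_weights = [
    [0, 3, 2, 1],
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [1, 0, 0, 0],
]

scores = hits_authority(toy_weights)
# Communities in descending order of the score, as in Figs. 7-9.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking[:3])
```

Because the toy network is undirected, hub and authority scores coincide; on a directed network the two would generally differ.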

Fig. 7.


(a) Visualization of three tweet communities, in descending order of importance measures (case 4), on March 1, 2020. The dots represent tweets, and the colors represent the tweet communities to which the tweets belong. Red, blue, and green colors show the tweet communities with the largest, second largest, and third largest importance measures, respectively. (b) Seven most frequent words in the tweet community with the largest importance measure. (c) Seven most frequent words in the tweet community with the second largest importance measure. (d) Seven most frequent words in the tweet community with the third largest importance measure. The correspondence between the original Japanese and the English translation is shown in Table II.

Fig. 8.


(a) Visualization of three tweet communities, in descending order of importance measures (case 4), on April 1, 2020. The notation in (b)–(d) is the same as in Fig. 7. The correspondence between the original Japanese and the English translation is shown in Table III.

Fig. 9.


(a) Visualization of three tweet communities, in descending order of importance measures (case 4), on May 1, 2020. The notation in (b)–(d) is the same as in Fig. 7. The correspondence between the original Japanese and the English translation is shown in Table IV.

For March 1, Fig. 7(b) shows news about the cancellation of many events that attract large crowds, intended to prevent the spread of COVID-19 infection. Fig. 7(c) shows concerns and warnings about misinformation on COVID-19. In Fig. 7(d), news and concerns about the first COVID-19 infection in Nishinomiya City in Hyogo Prefecture appear.

Around April 1, the shortage of masks was a serious concern in Japan. To deal with this situation, Prime Minister Shinzo Abe declared that the government would distribute two masks per household [see Fig. 8(b) and (c)]. Fig. 8(d) shows news and people’s concern about the spread of infection across the country.

For May 1, Fig. 9(b) and (c) shows news about deaths due to COVID-19. More specifically, in Fig. 9(b), we can observe the report that those in their sixties or older account for about 90% of the total. The curation of COVID-19-related information from various regions, as well as video messages from celebrities, was confirmed [see Fig. 9(d)]. This tweet community includes tweets about countermeasures, including expert opinions on COVID-19. This may show people’s wish, after the long period of self-restraint, to avoid the spread of COVID-19 infection.

From these results, we find that people’s attention changed over time, from concern about the infection to countermeasures and the wish for an end to the spread of COVID-19. In summary, this section confirms that our method is useful for understanding the situation, and acquiring the relevant knowledge, through the ranking of tweet communities by importance measures.

V. Conclusion and Future Work

This article presented a method that detects tweet communities with similar topics and ranks the communities by importance measures. By identifying only the communities with high importance measures, it becomes possible for users to easily find important information about COVID-19. Specifically, we construct a community network whose nodes are tweet communities, obtained by applying a community detection method to a tweet network. We then calculate centrality on the community network as importance measures, to detect the most central tweet communities. We conducted experiments by collecting Japanese tweets about COVID-19 sent between March 1, 2020 and May 15, 2020. The results show that our method can successfully extract keywords, that is, words that are strongly correlated with the number of people infected with COVID-19. Because users can browse the keywords from a small number of central tweet communities, quick and easy understanding of important information becomes feasible.

We now discuss how to use the proposed method for fighting COVID-19. The proposed method will be beneficial for quick and objective decision-making based on public opinions. The small number of tweet communities detected by our method helps individuals and organizations find meaningful keywords such as those shown in Figs. 7–9. This makes it possible to make decisions quickly, without taking a long time to manually search a flood of information. It is also notable that such decision-making is based on objective data. If individuals and organizations read a small number of tweets selected subjectively, they may make wrong decisions contrary to public opinions. Our method helps solve this problem. Moreover, our method will accelerate various data mining research for fighting COVID-19, such as opinion mining and sentiment analysis. In such research, tweets that are irrelevant to COVID-19 may increase computational cost and cause noisy results. Because our method can extract relevant tweets from many tweets, it helps overcome these drawbacks.

Furthermore, we focus on misinformation that is unique to specific regions and/or time periods. For example, the misinformation that 5G networks are the cause of COVID-19 was observed in specific regions such as Europe, America, and the Middle East. The proposed method can be used to handle this type of misinformation on Twitter as follows.

  1) If misinformation is mentioned in many tweets, the proposed method can visualize it as tweet communities. Thus, a user can browse tweet communities, including misinformation.

  2) The user selects a tweet community in which they would like to verify whether misinformation is included or not.

  3) The proposed method is applied to tweets in different regions and/or time periods. Then, tweet communities that are similar to the user’s selected tweet community are extracted.

  4) We display the difference between the extracted tweet communities and the user’s selected one. It will be useful to visualize the difference of most frequent words like Figs. 7, 8, and 9. In reference to the visualized difference, the user judges whether misinformation is included or not.
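The comparison in step 4 can be sketched as a set difference over the most frequent words of two tweet communities. This is only a rough illustration of a system that has not yet been developed: the function names `top_words` and `word_difference`, the pre-tokenized toy tweets, and the choice of `k` are all hypothetical. Tokenizing real Japanese tweets would require a morphological analyzer, which is outside this sketch.

```python
from collections import Counter


def top_words(tweets, k=7):
    """Return the k most frequent words in a community of tokenized tweets."""
    counts = Counter(word for tweet in tweets for word in tweet)
    return [word for word, _ in counts.most_common(k)]


def word_difference(selected, other, k=7):
    """Compare the top-k words of the selected community with another one."""
    a, b = set(top_words(selected, k)), set(top_words(other, k))
    return {
        "only_selected": sorted(a - b),
        "only_other": sorted(b - a),
        "shared": sorted(a & b),
    }


# Toy communities (assumption: already tokenized, made-up content).
selected_community = [["5g", "cause", "virus"], ["5g", "virus", "tower"]]
other_community = [["virus", "mask", "vaccine"], ["virus", "mask"]]

diff = word_difference(selected_community, other_community, k=2)
print(diff)
```

Words that appear only in the selected community (here "5g") are the candidates a user would inspect when judging whether region-specific misinformation is present.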

In the future, we will develop this system and evaluate its effectiveness. Note that this system will be useful only for misinformation that is mentioned in many tweets and is unique to specific regions and/or time periods. Thus, future work includes the detection of other types of misinformation.

The scope of this study is limited to COVID-19. Thus, future work includes the application of the proposed importance measures to topics other than COVID-19. In our previous study [40], we confirmed that the community network is beneficial for efficient grouping of similar Web videos for retrieval. Specifically, we formulated the grouping of similar Web videos as community detection in a network whose nodes are Web videos and whose edges are hyperlinks weighted by video similarities. Then, we constructed a community network in which each node includes multiple Web videos. By applying the community detection method [41] to the community network, we can efficiently group similar Web videos while preserving the accuracy of retrieval. In this way, the versatility of the community network was confirmed. In the future, we will evaluate the proposed importance measures as well as the community network for other topics.

We believe that this study is one of the pioneering works on data mining for tackling the difficulties caused by COVID-19. Nevertheless, we will develop more sophisticated methodologies. Future work includes the improvement of tweet community detection by using multimodal features, such as deep-learning-based text features, sentiment features, and visual features of attached images. After this improvement, we will develop a method for predicting the trend of people’s attention to COVID-19 over time; this is required because the method proposed in this article does not include a time series modeling scheme.

Acknowledgment

The authors would like to thank Edanz Group (https://en-author-services.edanzgroup.com/) for editing a draft of this article.

Biographies


Ryosuke Harakawa (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronics and information engineering from Hokkaido University, Sapporo, Japan, in 2013, 2015, and 2016, respectively.

He is currently an Assistant Professor with the Department of Electrical, Electronics, and Information Engineering, Nagaoka University of Technology, Nagaoka, Japan. His research interests include multimedia information retrieval and Web mining.

Dr. Harakawa is a member of the ACM, the IEICE, and the Institute of Image Information and Television Engineers (ITE).


Masahiro Iwahashi (Senior Member, IEEE) received the B.Eng., M.Eng., and D.Eng. degrees in electrical engineering from Tokyo Metropolitan University, Tokyo, Japan, in 1988, 1990, and 1996, respectively.

In 1990, he joined Nippon Steel Company Ltd. Since 1993, he has been with the Nagaoka University of Technology, Nagaoka, Japan, where he is currently a Professor with the Department of Electrical, Electronics and Information Engineering. His research interests include the areas of digital signal processing, multirate systems, and image compression.

Dr. Iwahashi is a Senior Member of the IEICE and a member of the Asia Pacific Signal and Information Processing Association (APSIPA) and the Institute of Image Information and Television Engineers (ITE).

Funding Statement

This work was supported by the Adaptable and Seamless Technology Transfer Program through Target-Driven Research and Development (A-STEP) from Japan Science and Technology Agency (JST) under Grant JPMJTM20DJ.

Footnotes

4

The terms community and cluster have the same meaning, in general; however, we use the term community in this article to avoid confusion with the term cluster used in information science and the word cluster that means a group of people infected with COVID-19.

Contributor Information

Ryosuke Harakawa, Email: harakawa@vos.nagaokaut.ac.jp.

Masahiro Iwahashi, Email: iwahashi@vos.nagaokaut.ac.jp.

References

  [1] Kwak H., Lee C., Park H., and Moon S., “What is Twitter, a social network or a news media?” in Proc. ACM Int. Conf. World Wide Web (WWW), 2010, pp. 591–600.
  [2] Java A., Song X., Finin T., and Tseng B., “Why we Twitter: Understanding microblogging usage and communities,” in Proc. 9th WebKDD 1st SNA-KDD Workshop Web Mining Social Netw. Anal., 2007, pp. 56–65.
  [3] Alnajran N., Crockett K., McLean D., and Latham A., “Cluster analysis of Twitter data: A review of algorithms,” in Proc. 9th Int. Conf. Agents Artif. Intell., 2017, pp. 1–11.
  [4] Cinelli M. et al., “The COVID-19 social media infodemic,” 2020, arXiv:2003.05004. [Online]. Available: http://arxiv.org/abs/2003.05004
  [5] Kishore Shahi G., Dirkson A., and Majchrzak T. A., “An exploratory study of COVID-19 misinformation on Twitter,” 2020, arXiv:2005.05710. [Online]. Available: http://arxiv.org/abs/2005.05710
  [6] Medford R. J., Saleh S. N., Sumarsono A., Perl T. M., and Lehmann C. U., “An ‘infodemic’: Leveraging high-volume Twitter data to understand early public sentiment for the coronavirus disease 2019 outbreak,” Medrxiv, vol. 7, Oct. 2020, Art. no. ofaa258, doi: 10.1101/2020.04.03.20052936.
  [7] Qian S., Zhang T., and Xu C., “Multi-modal multi-view topic-opinion mining for social event analysis,” in Proc. 24th ACM Int. Conf. Multimedia, Oct. 2016, pp. 2–11.
  [8] Fang Q., Xu C., Sang J., Hossain M. S., and Muhammad G., “Word-of-mouth understanding: Entity-centric multimodal aspect-opinion mining in social media,” IEEE Trans. Multimedia, vol. 17, no. 12, pp. 2281–2296, Dec. 2015.
  [9] Harakawa R., Takimura S., Ogawa T., Haseyama M., and Iwahashi M., “Consensus clustering of tweet networks via semantic and sentiment similarity estimation,” IEEE Access, vol. 7, pp. 116207–116217, 2019.
  [10] Harakawa R., Ogawa T., and Haseyama M., “Extracting hierarchical structure of Web video groups based on sentiment-aware signed network analysis,” IEEE Access, vol. 5, pp. 16963–16973, 2017.
  [11] Harakawa R., Ogawa T., and Haseyama M., “Tracking topic evolution via salient keyword matching with consideration of semantic broadness for Web video discovery,” Multimedia Tools Appl., vol. 77, no. 16, pp. 20297–20324, Aug. 2018.
  [12] Harakawa R., Ogawa T., and Haseyama M., “A Web video retrieval method using hierarchical structure of Web video groups,” Multimedia Tools Appl., vol. 75, no. 24, pp. 17059–17079, Dec. 2016.
  [13] Pitas I., Graph-Based Social Media Analysis. London, U.K.: Chapman & Hall, 2015.
  [14] Singh L. et al., “A first look at COVID-19 information and misinformation sharing on Twitter,” 2020, arXiv:2003.13907. [Online]. Available: http://arxiv.org/abs/2003.13907
  [15] Ferrara E., “What types of COVID-19 conspiracies are populated by Twitter bots?” 1st Monday, vol. 25, no. 6, May 2020, doi: 10.5210/fm.v25i6.10633.
  [16] Hamzah F. B., Lau C., Nazri H., Ligot D. V., and Lee G., “Coronatracker: World-wide COVID-19 outbreak data analysis and prediction,” Bull. World Health Org., vol. 2, pp. 1–31, Dec. 2020.
  [17] Zhong L., Mu L., Li J., Wang J., Yin Z., and Liu D., “Early prediction of the 2019 novel coronavirus outbreak in the mainland China based on simple mathematical model,” IEEE Access, vol. 8, pp. 51761–51769, 2020.
  [18] Zheng N., Du S., Wang J., Zhang H., Cui W., and Kang Z., “Predicting COVID-19 in China using hybrid ai model,” IEEE Trans. Cybern., vol. 50, no. 7, pp. 2891–2904, May 2020.
  [19] Sha H., Al Hasan M., Mohler G., and Brantingham P. J., “Dynamic topic modeling of the COVID-19 Twitter narrative among U.S. Governors and cabinet executives,” 2020, arXiv:2004.11692. [Online]. Available: http://arxiv.org/abs/2004.11692
  [20] Ng T. W., Turinici G., and Danchin A., “A double epidemic model for the SARS propagation,” BMC Infectious Diseases, vol. 3, no. 1, p. 19, Dec. 2003.
  [21] Hochreiter S. and Schmidhuber J., “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
  [22] Chen E., Lerman K., and Ferrara E., “Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set,” JMIR Public Health Surveill., vol. 6, no. 2, May 2020, Art. no. e19273.
  [23] Alqurashi S., Alhindi A., and Alanazi E., “Large arabic Twitter dataset on COVID-19,” 2020, arXiv:2004.04315. [Online]. Available: http://arxiv.org/abs/2004.04315
  [24] Wicke P. and Bolognesi M. M., “Framing COVID-19: How we conceptualize and discuss the pandemic on Twitter,” 2020, arXiv:2004.06986. [Online]. Available: http://arxiv.org/abs/2004.06986
  [25] Li L. et al., “Characterizing the propagation of situational information in social media during COVID-19 epidemic: A case study on Weibo,” IEEE Trans. Comput. Social Syst., vol. 7, no. 2, pp. 556–562, Apr. 2020.
  [26] Blei D. M., Ng A. Y., and Jordan M. I., “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
  [27] Chamola V., Hassija V., Gupta V., and Guizani M., “A comprehensive review of the COVID-19 pandemic and the role of IoT, drones, AI, blockchain, and 5G in managing its impact,” IEEE Access, vol. 8, pp. 90225–90265, 2020.
  [28] Latif S. (Apr. 2020). Leveraging Data Science to Combat COVID-19: A Comprehensive Review. [Online]. Available: https://www.techrxiv.org/articles/Leveraging_Data_Science_To_Combat_COVID-19_A_Comprehensive_Review/12212516
  [29] Almatarneh S., Gamallo P., and Pena F. J. R., “CiTIUS-COLE at SemEval-2019 task 5: Combining linguistic features to identify hate speech against immigrants and women on multilingual tweets,” in Proc. 13th Int. Workshop Semantic Eval., 2019, pp. 387–390.
  [30] Sebastiani F., “Machine learning in automated text categorization,” ACM Comput. Surveys, vol. 34, no. 1, pp. 1–47, Mar. 2002.
  [31] Le Q. V. and Mikolov T., “Distributed representations of sentences and documents,” 2014, arXiv:1405.4053. [Online]. Available: http://arxiv.org/abs/1405.4053
  [32] Blondel V. D., Guillaume J.-L., Lambiotte R., and Lefebvre E., “Fast unfolding of communities in large networks,” J. Stat. Mech., Theory Exp., vol. 2008, no. 10, Oct. 2008, Art. no. P10008.
  [33] Takehara D., Harakawa R., Ogawa T., and Haseyama M., “Extracting hierarchical structure of content groups from different social media platforms using multiple social metadata,” Multimedia Tools Appl., vol. 76, no. 19, pp. 20249–20272, Oct. 2017.
  [34] Harakawa R., Ogawa T., and Haseyama M., “Accurate and efficient extraction of hierarchical structure of Web communities for Web video retrieval,” ITE Trans. Media Technol. Appl., vol. 4, no. 1, pp. 49–59, 2016.
  [35] Arenas A., Duch J., Fernandez A., and Gomez S., “Size reduction of complex networks preserving modularity,” New J. Phys., vol. 9, no. 176, pp. 604–632, 2007.
  [36] Kleinberg J. M., “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, no. 5, pp. 604–632, Sep. 1999.
  [37] Saerens M. and Fouss F., “HITS is principal components analysis,” in Proc. IEEE/WIC/ACM Int. Conf. Web Intell., Sep. 2005, pp. 782–785.
  [38] Akoglu H., “User’s guide to correlation coefficients,” Turkish J. Emergency Med., vol. 18, no. 3, pp. 91–93, Sep. 2018.
  [39] Miller G. A., “The magical number seven, plus or minus two: Some limits on our capacity for processing information,” Psychol. Rev., vol. 63, no. 2, pp. 81–97, 1956.
  [40] Harakawa R., Ogawa T., and Haseyama M., “[Paper] an efficient extraction method of hierarchical structure of Web communities for Web video retrieval,” ITE Trans. Media Technol. Appl., vol. 2, no. 3, pp. 287–297, 2014.
  [41] Newman M. E. J. and Girvan M., “Finding and evaluating community structure in networks,” Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 69, no. 2, Feb. 2004, Art. no. 026113.

Articles from Ieee Transactions on Computational Social Systems are provided here courtesy of Institute of Electrical and Electronics Engineers
