Abstract
This article presents a method that detects tweet communities with similar topics and ranks the communities by importance measures. By identifying the tweet communities that have high importance measures, it is possible for users to easily find important information about the coronavirus disease (COVID-19). Specifically, we first construct a community network, whose nodes are tweet communities obtained by applying a community detection method to a tweet network. The community network is constructed based on textual similarities between tweet communities and sizes of tweet communities. Second, we apply algorithms for calculating centrality to the community network. Because the obtained centrality is based on tweet community sizes as well, we call it the importance measure in distinction to conventional centrality. The importance measure can simultaneously evaluate the importance of topics in the entire data set and occupancy (or dominance) of tweet communities in the network structure. We conducted experiments by collecting Japanese tweets about COVID-19 from March 1, 2020 to May 15, 2020. The results show that the proposed method is able to extract keywords that have a high correlation with the number of people infected with COVID-19 in Japan. Because users can browse the keywords from a small number of central tweet communities, quick and easy understanding of important information becomes feasible.
Keywords: Community detection, coronavirus, coronavirus disease (COVID-19), network analysis, network centrality, semantic understanding
I. Introduction
The outbreak of coronavirus disease (COVID-19) has seriously affected human health and economic activity around the world. During the COVID-19 epidemic, users search social media networks, such as Twitter, Weibo, and YouTube, as well as traditional media, such as television and radio, for information. In particular, Twitter is a very popular social media network [1], [2] and has become an important source of information [3]. Therefore, we use Twitter as the platform for this research. On Twitter, various tweets (i.e., short text messages), including information (misinformation, in some cases), have been disseminated widely [4]–[6]. This makes it difficult for users to understand the situation and acquire the relevant knowledge about COVID-19.
One effective solution to this problem is to visualize an overview of a large amount of content [7]–[12]. Qian et al. [7] explained that it is very time-consuming for users who are not familiar with a topic to browse large amounts of content and quickly gain a general understanding of it; therefore, it is important to automatically mine multiview opinions on the target topic. Our recent study [9] proposed a method for extracting tweet communities from a tweet network that represents the similarity between tweets. In this article, we define a set of tweets with similar topics as a tweet community. The obtained tweet communities enable us to gain a general understanding of many tweets. However, as the number of tweet communities increases, it becomes difficult for users to browse all of them.
To solve this problem, we propose a method that detects tweet communities with similar topics from a tweet network and ranks the communities by importance measures. By identifying the tweet communities that have high importance measures, it is possible for users to easily find important information about COVID-19. Inspired by reports that network representation is useful for multimedia content analysis, including clustering [13] and tweet community extraction [9], we also employ a network-based approach. Our aim is for the importance measure to represent the importance of each tweet community. The importance measure should simultaneously evaluate: i) importance of topics in the entire data set and ii) occupancy (or dominance) of tweet communities in the network structure. If the centrality is large but the size is small, the tweet community may be trivial because of oversplitting. On the other hand, a tweet community whose size is large but whose centrality is small is not guaranteed to include important topics in the entire data set. As the importance measure, we develop a new centrality that takes the size of a tweet community into account, because the centrality and the size satisfy i) and ii), respectively.
Specifically, the core algorithms and the novelty are as follows.
1) We construct a community network, whose nodes are the tweet communities. Each node in the community network has a topic, which represents the meaning of the tweets in the corresponding tweet community. In the community network, a node that has edges with high weights includes a central topic in the entire data set, and the node is dominant in the network structure.
2) We calculate the centrality of each node in the community network. Because the centrality is calculated using tweet community sizes as well, we call it the importance measure in distinction to conventional centrality.
3) The novelty of our work lies in its calculation of the importance measure of a community network, rather than a tweet network. This helps us hierarchize the tweet communities to find important information about COVID-19, even as the number of tweet communities increases.
We conducted experiments by collecting 76000 Japanese tweets about COVID-19 dated between March 1, 2020 and May 15, 2020. In these experiments, we defined words that have a strong correlation with the number of people infected with COVID-19 as keywords. The results show that keywords extracted from only the central tweet communities that were detected by our method were similar to those extracted from all collected tweets. This implies that users can gain a general understanding of many tweets about COVID-19 by browsing only a small number of central tweet communities. This is useful for users to understand the situation and acquire the relevant knowledge about COVID-19 even when there is a flood of information (including misinformation, in some cases).
The rest of this article is organized as follows. Section II describes related work, comprising existing information technologies (in particular, data mining methods) that target COVID-19. In Section III, the proposed method for extracting tweet communities and ranking them by importance measures is explained. Section IV presents the experimental results for real tweets about COVID-19 in Japan and discusses the effectiveness of our method for extracting keywords that are related to the number of infected people. Finally, conclusions and future work are discussed in Section V.
II. Related Work
This section describes information technologies for COVID-19 (in particular, pioneering studies that utilize social media networks) to clarify the contribution of our study. Researchers have raised concerns about misinformation, myths, and conspiracies related to COVID-19 [4]–[6], [14], [15]. The term infodemic refers to a phenomenon characterized by a flood of information and misinformation. Cinelli et al. [4], Shahi et al. [5], and Medford et al. [6] reported that COVID-19 has caused an infodemic on social media networks, such as Twitter, Instagram, YouTube, and Reddit. Shahi et al. [5] highlighted the necessity of proposing actions for authorities to counter misinformation and hints for social media users on how to stop the spread of misinformation. Singh et al. [14] found myths on Twitter by manually defining myths, according to their frequency of appearance in different websites using the search phrase “Coronavirus Common Myths,” and defining how dangerous they were. Ferrara [15] reported that accounts that automatically post tweets (namely, bots) are used to promote conspiracy theories in the United States, in stark contrast with human users, who focus on public health and welfare.
Tracking and predicting events about COVID-19 on social media networks have been studied [16]–[19]. Hamzah et al. [16] proposed a Web platform called CoronaTracker. CoronaTracker provides a predictive model to forecast COVID-19 outbreaks within and outside China, based on daily observations. Furthermore, it can classify the news related to COVID-19 into negative and positive sentiments, to understand the influence of the news on people’s behavior, both politically and economically. Zhong et al. [17] proposed a susceptible–infected–removed (SIR) model-based method [20] for predicting the number of infected cases in China. Zheng et al. [18] also predicted the trend of COVID-19 in China. They combined an improved susceptible–infected model and a long short-term memory (LSTM) network [21] with news information extracted via natural language processing (NLP), to estimate the number of infected cases. Dynamic topic modeling [19] was proposed for analyzing the COVID-19 Twitter narrative among U.S. governors and presidential cabinet members, to track the evolution of subtopics related to risk, testing, and treatment.
Some researchers have constructed COVID-19-related data sets [22], [23]. Chen et al. [22] constructed a multilingual Twitter data set for stimulating the research community. In [23], an Arabic data set of tweets on COVID-19 since January 1, 2020 was presented.
Research on inferring or classifying the topics behind Twitter or Weibo posts has been conducted [24], [25]. Wicke and Bolognesi [24] analyzed the discourse around COVID-19 by applying latent Dirichlet allocation [26], a well-known topic modeling method, to a corpus of tweets sent during March and April 2020. Li et al. [25] classified Weibo posts about COVID-19, according to seven types of situational information, to find specific features for predicting the reposted amount of each type of information.
In addition, some review articles have been published [27], [28], discussing information technologies, including artificial intelligence [27] and data science [28], for tackling the COVID-19 epidemic.
Our work is the first attempt to clarify topics (in particular, keywords that have a high correlation with the number of people infected with COVID-19 in Japan) on the basis of complex network analysis with NLP. As described in Section I, the technical novelty of our method is that we hierarchize the tweet communities by calculating the importance measures of each tweet community, rather than those of each tweet.
III. Ranking of Tweet Communities
To gain a general understanding of many tweets about COVID-19, we present a method that detects tweet communities with similar topics and ranks these communities by importance measures. In Section III-A, our method for Twitter data acquisition is described. The proposed method consists of two phases: construction of a community network (Section III-B) and ranking of tweet communities (Section III-C) [see Fig. 1].
Fig. 1. Overview of Sections III-B and III-C. In Section III-B, two types of networks are constructed. First, we construct a tweet network whose nodes are tweets, which represents similarities between the tweets. Second, we construct a community network whose nodes are tweet communities, which represents similarities between the tweet communities. In Section III-C, importance measures of each tweet community are calculated. We display tweet communities in descending order of the importance measures. This addresses the difficulty that, when there are many communities, users cannot judge which ones to read.
A. Data Acquisition
From March 1, 2020 to May 15, 2020, we collected 1000 Japanese tweets per day. In Japan, a state of emergency was declared by the government on April 7. Public tension was therefore heightened, especially during this period. Using the Japanese query meaning “a novel coronavirus,” we collected tweets by a keyword search via an open-source Twitter tool. Moreover, because personal communication is not relevant to the task of tweet community detection, we removed reply tweets, as in our previous study [9]. Furthermore, we removed URL strings beginning with an “http” or “pic” prefix. (In Twitter, an attached image is represented as a shortened URL that starts with “pic.”) In this way, we constructed 76 data sets for the experiment (one for each day).
B. Construction of Community Network
As in our previous work on tweet community detection [9], we employ a network-based approach. In the experiment presented in Section IV, we performed the subsequent processing on each of the 76 data sets separately.
First, for each data set, we represented each tweet as $t_i$ ($i = 1, \ldots, N$, where $N$ is the number of tweets in one data set). Here, we collected only Japanese tweets and performed the following processing. Using a natural language processing tool called Janome (https://mocobeta.github.io/janome/en/), we performed morphological analysis and extracted only nouns. Then, we removed the stop words defined in https://www.kaggle.com/lazon282/japanese-stop-words. We also removed words that consist of only one character because they are likely to be trivial symbols and numbers. Note that Japanese nouns do not inflect (for example, singular and plural nouns are not distinguished). Thus, we do not perform lemmatization.
We then extracted textual features $\mathbf{x}_i$ that represented the semantics of each tweet $t_i$. Because some tweets have poor grammar and context, features that consider only word frequencies are more suitable than embedding-based features that consider word order. In fact, the article [29] reports that term frequency–inverse document frequency (TF–IDF) features [30] have more discriminative power than Doc2Vec [31]. Motivated by this fact, we use TF–IDF features as $\mathbf{x}_i$.
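The feature extraction step can be illustrated with a minimal sketch. It assumes the tweets have already been tokenized into nouns (for example by Janome) and uses the plain TF × log(N/df) weighting; the article does not state which TF–IDF variant was used, so this weighting, the function name `tfidf_features`, and the toy tweets are all assumptions.

```python
import math
from collections import Counter

def tfidf_features(tokenized_tweets):
    """Return one sparse TF-IDF vector (dict: word -> weight) per tweet.

    Assumed variant: tf = count / tweet length, idf = log(N / df).
    """
    n_docs = len(tokenized_tweets)
    # Document frequency: number of tweets that contain each word.
    df = Counter(word for tokens in tokenized_tweets for word in set(tokens))
    features = []
    for tokens in tokenized_tweets:
        tf = Counter(tokens)
        total = len(tokens)
        features.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return features

# Hypothetical, already tokenized tweets (nouns only, stop words removed).
tweets = [
    ["infection", "mask", "mask"],
    ["infection", "hospital"],
    ["infection", "declaration", "emergency"],
]
features = tfidf_features(tweets)
```

In this toy input, "infection" appears in every tweet, so its weight is zero, while the repeated and rarer word "mask" receives the largest weight in the first tweet.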
Following the report that a $k$-nearest neighbors ($k$-NN) network is usually suitable for adapting to data set properties [13], we constructed a $k$-NN network using $\mathbf{x}_i$. Specifically, for each tweet $t_i$, we calculated cosine similarities between $\mathbf{x}_i$ (TF–IDF features of $t_i$) and $\mathbf{x}_j$ (TF–IDF features of the other tweets $t_j$, $j \neq i$). From the other tweets $t_j$, we selected $k$ tweets in descending order of the similarities. By connecting unweighted edges between $t_i$ and the selected $t_j$, we constructed the $k$-NN network. The obtained $k$-NN network represented the relationships between tweet semantics. The $k$-NN network based on TF–IDF features was also used for tweet community extraction in our recent study [9]. In this article, we define the obtained $k$-NN network as a tweet network $G_t$.
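The construction described above can be sketched as follows. The sparse-dictionary feature format, the tie-breaking rule, and the function names `cosine` and `knn_network` are assumptions made for illustration; the feature values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: word -> weight)."""
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_network(features, k):
    """Return the undirected, unweighted edge set of a k-NN network."""
    edges = set()
    for i, x_i in enumerate(features):
        sims = [(cosine(x_i, x_j), j) for j, x_j in enumerate(features) if j != i]
        # Descending similarity; ties broken by the smaller index (an assumption).
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Hypothetical TF-IDF vectors of four tweets.
features = [
    {"mask": 1.0, "government": 0.5},
    {"mask": 0.9, "government": 0.6},
    {"hospital": 1.0, "patient": 0.4},
    {"hospital": 0.8, "patient": 0.7},
]
edges = knn_network(features, k=1)
```

With `k=1`, the two "mask" tweets and the two "hospital" tweets are linked pairwise, so `edges` contains `(0, 1)` and `(2, 3)`.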
Using $G_t$, we detect tweet communities with similar topics. Following the reports that the Louvain method [32] works well for multimedia content clustering [9], [11], [33], [34], we apply the Louvain method [32] to $G_t$. The Louvain method is based on a quality measure of community detection results called modularity [35]. The modularity $Q$ is defined as

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j) \tag{1}$$

Here

$$k_i = \sum_{j} A_{ij}, \qquad m = \frac{1}{2} \sum_{i,j} A_{ij}$$

where $\delta(c_i, c_j)$ is 1 if $t_i$ and $t_j$ belong to the same tweet community and 0 otherwise. Also, $A_{ij}$ denotes the existence of an edge between $t_i$ and $t_j$; thus, $A_{ij}$ is 1 if an edge between $t_i$ and $t_j$ in $G_t$ exists and 0 otherwise. By recursively maximizing $Q$, we can successfully obtain tweet communities $C_m$ ($m = 1, \ldots, M$, where $M$ is the number of communities) containing semantically similar tweets. The details of the algorithm are shown in Algorithm 1.
Algorithm 1 Detection of Tweet Communities by the Louvain Method [32]
Input: Tweet network $G_t$ whose nodes are tweets $t_i$ ($i = 1, \ldots, N$).
Output: Tweet communities $C_m$ ($m = 1, \ldots, M$).
1: Assign each node $t_i$ to its own tweet community.
2: while Improvement of $Q$ (in Eq. (1)) of $G_t$ is obtained do
3:   while Improvement of $Q$ of $G_t$ is obtained do
4:     /* Local maximization of modularity */
5:     for each node of $G_t$ do
6:       Evaluate the gain of $Q$ when the node is moved to each tweet community that includes one of its neighboring nodes.
7:       Reassign the node to the tweet community for which the positive gain of $Q$ is maximum.
8:     end for
9:     Calculate $Q$ of $G_t$.
10:   end while
11:   Update the obtained tweet communities as $C_m$.
12:   /* Updating a new network */
13:   Update $G_t$ to a network with self-loops whose nodes are the obtained tweet communities, where each edge weight is the sum of the edge weights of the original network.
14: end while
15: Return the tweet communities $C_m$.
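The quantity that Algorithm 1 maximizes can be evaluated directly for a given partition. The following is a minimal sketch for an undirected, unweighted network without self-loops; it uses the standard per-community form of modularity (the fraction of intra-community edges minus the squared degree share of each community). It only evaluates the score and does not run the Louvain method; the function name and the toy network are assumptions.

```python
def modularity(edges, community_of):
    """Modularity of an undirected, unweighted network for a given partition.

    edges: iterable of (i, j) node pairs without self-loops.
    community_of: dict mapping each node to its community label.
    """
    edges = list(edges)
    m = len(edges)  # number of edges
    degree = {}
    for i, j in edges:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    # Observed fraction of edges that fall inside communities.
    inside = sum(1 for i, j in edges if community_of[i] == community_of[j]) / m
    # Expected fraction for a random network with the same degrees.
    degree_sum = {}
    for node, d in degree.items():
        label = community_of[node]
        degree_sum[label] = degree_sum.get(label, 0) + d
    expected = sum((s / (2 * m)) ** 2 for s in degree_sum.values())
    return inside - expected

# Hypothetical tweet network: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the toy network into its two triangles yields a clearly positive score, whereas putting every node into one community yields zero, which is why the local moves in Algorithm 1 favor the split.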
Finally, we construct a community network $G_c$, in which tweet communities with central topics (where topics are the meanings that represent tweets in the community network) are densely linked to other communities. Each node of $G_c$ is one of the obtained tweet communities; therefore, we can write the nodes of $G_c$ as $C_m$ ($m = 1, \ldots, M$). The edge weight $w_{mn}$, from node $C_m$ to node $C_n$, is defined as follows:
![]() |
where $|C_m|$ is the number of tweets contained in $C_m$ and $t_i$ denotes a tweet in the tweet community $C_m$. We do not place an edge between $C_m$ and $C_n$ if none of the tweets in $C_m$ and $C_n$ are connected by edges in the tweet network. Equation (2) can simultaneously evaluate: i) importance of topics in the entire data set and ii) occupancy (or dominance) in the network structure. Specifically, the numerator shows i). Although the denominator is a normalization term, the logarithm function reduces the overnormalization when the community sizes are large. Therefore, the denominator shows ii).
C. Ranking of Tweet Communities
Having obtained the community network $G_c$, we hierarchize the tweet communities $C_m$. The input of our algorithm is $G_c$, and the output is the result of sorting $C_m$ in descending order of their importance measures $I_m$. As described above, performing the importance measure calculation on a community network, rather than a tweet network, is the novelty of this study.
Specifically, we calculate centrality, i.e., degree centrality, closeness centrality, betweenness centrality, and hyperlink-induced topic search (HITS) centrality [36], of $G_c$. We call the obtained centrality importance measures and denote them by $I_m$. They are calculated as follows.
Degree Centrality: The degree centrality is the most primitive centrality measure that is defined as
![]() |
where $d_m$ is the weighted degree of $C_m$ in $G_c$. A node $C_m$ with a high degree centrality is similar to its neighbor nodes.
Closeness Centrality: The closeness centrality is defined as
![]() |
where $\mathrm{dist}(C_m, C_n)$ is the shortest path distance from $C_m$ to $C_n$. Thus, the closeness centrality represents the accessibility of each node in $G_c$.
Betweenness Centrality: The betweenness centrality is defined as
![]() |
where $\sigma_{st}$ denotes the number of shortest paths from $C_s$ to $C_t$ and $\sigma_{st}(C_m)$ denotes the number of such paths that pass through $C_m$. In this article, we calculate the shortest paths considering edge weights in $G_c$. Thus, $I_m$ represents the importance of $C_m$ in information propagation in $G_c$.
HITS Centrality: The HITS algorithm is equivalent to principal component analysis (PCA) of the network structure [37]. First, we represent $G_c$ in the form of an adjacency matrix $\mathbf{W}$. The elements of $\mathbf{W}$ are the edge weights of $G_c$. The HITS algorithm calculates the principal eigenvector $\mathbf{a}$ of $\mathbf{W}^{\top}\mathbf{W}$. The $m$th element of $\mathbf{a}$ becomes $I_m$. Note that eigenvector centrality is another well-known centrality. The eigenvector centrality is equivalent to the principal eigenvector of $\mathbf{W}$. In this study, $G_c$ is an undirected graph; thus, $\mathbf{W}$ is a symmetric matrix. According to basic linear algebra, the eigenvectors of $\mathbf{W}^{\top}\mathbf{W}$ and $\mathbf{W}$ are the same. Therefore, in this study, HITS centrality is equivalent to eigenvector centrality.
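For reference, the four measures can be written in their common textbook forms. This is a sketch under stated assumptions rather than a reproduction of the article's equations: the symbols ($M$ communities $C_m$, symmetric edge weights $w_{mn}$ collected in a matrix $\mathbf{W}$, shortest path distance $\mathrm{dist}$, shortest path counts $\sigma_{st}$) are chosen here for illustration, and the normalization constants used in the article may differ.

```latex
% Textbook forms; symbols and normalization constants are assumptions.
\begin{align}
  I_m^{\mathrm{deg}} &= d_m = \sum_{n \neq m} w_{mn}, \\
  I_m^{\mathrm{clo}} &= \frac{M - 1}{\sum_{n \neq m} \mathrm{dist}(C_m, C_n)}, \\
  I_m^{\mathrm{bet}} &= \sum_{s \neq m \neq t} \frac{\sigma_{st}(C_m)}{\sigma_{st}}, \\
  \mathbf{W}^{\top}\mathbf{W}\,\mathbf{a} &= \lambda_{\max}\,\mathbf{a},
  \qquad I_m^{\mathrm{HITS}} = a_m .
\end{align}
```

For a symmetric $\mathbf{W}$, $\mathbf{W}^{\top}\mathbf{W} = \mathbf{W}^{2}$, so every eigenvector of $\mathbf{W}$ with eigenvalue $\lambda$ is an eigenvector of $\mathbf{W}^{\top}\mathbf{W}$ with eigenvalue $\lambda^{2}$, which is the step behind the equivalence with eigenvector centrality.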
In $G_c$, a node that has edges with high weights includes a central topic (where topics are meanings that represent tweets in the community network) in the entire data set, and it is dominant in the network structure. Therefore, displaying $C_m$ in descending order of $I_m$ enables users to easily find important information about COVID-19, even if many tweet communities are obtained.
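The ranking step can be sketched for two of the measures. The code below computes the weighted degree and a HITS-style score obtained by power iteration on the product of the transposed adjacency matrix and the adjacency matrix, then sorts the communities in descending order. The adjacency values are hypothetical, the function names are assumptions, and the fixed iteration count is a simplification that presumes a connected, non-bipartite network.

```python
def weighted_degree(W):
    """Weighted degree of each node: row sums of the adjacency matrix."""
    return [sum(row) for row in W]

def hits_scores(W, iterations=200):
    """Principal eigenvector of W^T W, estimated by power iteration."""
    n = len(W)
    v = [1.0 / n] * n
    for _ in range(iterations):
        # u = W v, then v_new = W^T u.
        u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        v_new = [sum(W[i][j] * u[i] for i in range(n)) for j in range(n)]
        norm = sum(abs(x) for x in v_new)
        v = [x / norm for x in v_new]
    return v

def rank(scores):
    """Community indices in descending order of importance."""
    return sorted(range(len(scores)), key=lambda m: -scores[m])

# Hypothetical symmetric edge weights among four tweet communities.
W = [
    [0.0, 0.8, 0.5, 0.1],
    [0.8, 0.0, 0.4, 0.0],
    [0.5, 0.4, 0.0, 0.2],
    [0.1, 0.0, 0.2, 0.0],
]
degree_ranking = rank(weighted_degree(W))
hits_ranking = rank(hits_scores(W))
```

For this small example both rankings place community 0 first and community 3 last, which matches the observation that the measures behave similarly on small community networks.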
IV. Experimental Results and Discussion
In this section, experimental results for real Twitter data are presented and discussed to verify the effectiveness of the proposed method.
A. Quantitative Discussion
We attempt to quantitatively discuss the point that our method enables users to easily find important information about COVID-19 from many tweets. To do this, we evaluate the accuracy of the extraction of keywords about COVID-19, as explained next.
1). Ground Truth:
We define the keywords about COVID-19 by focusing on their correlation with the number of infected people. First, we collected the number of new COVID-19 infections per day in Japan from March 1, 2020 to May 15, 2020. The number of new COVID-19 infections is published by Google based on the Wikipedia statistics. Reports from local health centers to the Ministry of Health, Labour and Welfare are sometimes delayed because the health centers are closed on holidays. As a result, the number of new COVID-19 infections fluctuates depending on the day of the week. To remove the influence of this fluctuation on the subsequent analysis, we calculated three-day moving averages [see Fig. 2].
Fig. 2. Number of new COVID-19 infections per day in Japan from March 1, 2020 to May 15, 2020. (a) Raw data. (b) Three-day moving averages.
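The smoothing step can be sketched as follows. The article does not state how the three-day window is aligned or how the first days are handled, so the trailing window and the shortened window at the start are assumptions, as are the function name and the case counts.

```python
def moving_average(values, window=3):
    """Trailing moving average.

    Assumption: each day is averaged with the preceding days; the first
    window-1 days use only the days available so far.
    """
    smoothed = []
    for day in range(len(values)):
        start = max(0, day - window + 1)
        chunk = values[start:day + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

daily_cases = [10, 40, 10, 70, 100]  # hypothetical counts, not the real data
smoothed = moving_average(daily_cases)
```

The day-to-day swings of the raw counts are damped in `smoothed`, which is the effect sought before correlating with word counts.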
Here, $\mathbf{y}$ denotes the 76-D vector that contains the number of infections (after the moving average) for each day. Second, for each day, we counted the number of times that each word appeared in tweets. If a word appeared multiple times in one tweet, we counted it only once to reduce the influence of tweets in which the same word is repeated many times. Also, we ignored the query words used in data acquisition because they appeared in all tweets. Thus, for each word $w$, we obtained a 76-D vector $\mathbf{v}_w$ that contains the number of times that the word appeared in tweets each day.
Furthermore, we calculated the Pearson correlation coefficient (CC) between $\mathbf{y}$ and each $\mathbf{v}_w$. The article [38] reports that $|\mathrm{CC}| \geq 0.4$ (where $|\mathrm{CC}|$ is the absolute value of CC) shows substantial correlation. In the medicine field, $|\mathrm{CC}| = 0.4$ can be interpreted as “Fair” correlation among “None,” “Poor,” “Fair,” “Moderate,” “Very Strong,” and “Perfect” correlations. In the psychology field, we can interpret $|\mathrm{CC}| = 0.4$ as “Moderate” correlation among “Zero,” “Weak,” “Moderate,” “Strong,” and “Perfect” correlations. In the politics field, $|\mathrm{CC}| = 0.4$ can be interpreted as “Strong” correlation among “None,” “Negligible,” “Weak,” “Moderate,” “Strong,” “Very Strong,” and “Perfect” correlations. According to this report, we defined words with $|\mathrm{CC}| \geq 0.4$ as the ground truth of the keywords. Hereafter, this set of keywords is denoted by $K_{\mathrm{gt}}$. We defined the keywords in this way because we considered that words with a high correlation with the number of infections contained semantics relevant to the surrounding situation and necessary knowledge, such as the countermeasures.
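The keyword definition can be sketched end to end: count, for each day, the tweets that contain a word (once per tweet), correlate that daily series with the infection series, and keep the words whose absolute correlation reaches a threshold. The default threshold of 0.4 mirrors the smallest absolute CC values listed in Table I and is an assumption here, as are the function names and the toy data; the exclusion of query words is not modeled.

```python
import math

def daily_presence_counts(daily_tweets, word):
    """For each day, count the tweets that contain `word` (once per tweet)."""
    return [sum(1 for tokens in tweets if word in tokens) for tweets in daily_tweets]

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def select_keywords(daily_tweets, infections, threshold=0.4):
    """Return {word: CC} for words whose |CC| is at least `threshold`."""
    vocabulary = {w for tweets in daily_tweets for tokens in tweets for w in tokens}
    keywords = {}
    for word in vocabulary:
        cc = pearson(daily_presence_counts(daily_tweets, word), infections)
        if abs(cc) >= threshold:
            keywords[word] = cc
    return keywords

# Hypothetical data: four days of tokenized tweets and smoothed infection counts.
daily_tweets = [
    [["cruise", "mask", "news"], ["cruise"]],
    [["cruise", "emergency"], ["mask"]],
    [["emergency"], ["emergency", "mask"]],
    [["emergency", "emergency", "news"], ["emergency", "mask"], ["mask"]],
]
infections = [1.0, 2.0, 4.0, 5.0]
keywords = select_keywords(daily_tweets, infections)
```

In this toy data, "emergency" rises with the infection counts and "cruise" falls, so both are kept with opposite signs, while "news" is uncorrelated and is dropped. Negative correlations are retained, as in the ground truth of Table I.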
2). Comparative Methods:
In this experiment, we compared the following ten cases.
Cases 1–4: The cases in which tweet communities are displayed in descending order of the proposed importance measures. Cases 1–4 use degree centrality, closeness centrality, betweenness centrality, and HITS centrality, respectively.
Cases 5–8: The cases in which tweet communities are displayed in descending order of comparative measures. The comparative measures calculate (2) by replacing the logarithm of the tweet community size with the tweet community size itself. Thus, these cases only consider the importance of topics in the entire data set. Cases 5–8 use degree centrality, closeness centrality, betweenness centrality, and HITS centrality, respectively.
Case 9: The case in which tweet communities are displayed in descending order of tweet community sizes. Thus, this case only considers the occupancy of tweet communities in the network structure.
Case 10: The case in which tweet communities are displayed in a random order.
For each case, we extracted as keywords the words that appeared in tweets contained in the displayed tweet communities, in the same manner as the extraction of $K_{\mathrm{gt}}$. Furthermore, we denote the keywords obtained in cases 1–10 by $K_1$, $K_2$, $K_3$, $K_4$, $K_5$, $K_6$, $K_7$, $K_8$, $K_9$, and $K_{10}$, respectively.
3). Evaluations:
For the quantitative discussion, we use the Jaccard index, recall, and precision. The Jaccard index is a frequently used metric that represents the overlap between two sets. The recall represents the comprehensiveness. The precision represents the ratio of correct keywords (i.e., keywords included in the ground truth) to keywords that are displayed to users. They are defined as follows:
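$$\mathrm{Jaccard} = \frac{|K_{\mathrm{gt}} \cap K_q|}{|K_{\mathrm{gt}} \cup K_q|}, \qquad \mathrm{Recall} = \frac{|K_{\mathrm{gt}} \cap K_q|}{|K_{\mathrm{gt}}|}, \qquad \mathrm{Precision} = \frac{|K_{\mathrm{gt}} \cap K_q|}{|K_q|}$$

where $K_{\mathrm{gt}}$ is the set of ground-truth keywords and $K_q$ ($q = 1, \ldots, 10$) is the set of keywords obtained in case $q$.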
Following the principle of tenfold cross validation, we randomly extracted 900 tweets from each data set, calculated the above metrics for the extracted tweets, and repeated this ten times. By calculating the mean and standard deviation of the ten validations, we attempt to assess the effectiveness of the proposed method accurately.
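The three metrics reduce to set operations, as the following minimal sketch shows. The keyword sets are hypothetical examples loosely based on the words in Table I, and the handling of empty sets is an assumption.

```python
def jaccard(truth, shown):
    """Overlap between the ground-truth and displayed keyword sets."""
    union = truth | shown
    return len(truth & shown) / len(union) if union else 0.0

def recall(truth, shown):
    """Share of ground-truth keywords that are displayed."""
    return len(truth & shown) / len(truth) if truth else 0.0

def precision(truth, shown):
    """Share of displayed keywords that are in the ground truth."""
    return len(truth & shown) / len(shown) if shown else 0.0

# Hypothetical keyword sets.
truth = {"business suspension", "emergency", "declaration", "hospital"}
shown = {"business suspension", "emergency", "society"}
scores = (jaccard(truth, shown), recall(truth, shown), precision(truth, shown))
```

Here two of the three displayed keywords are correct and two of the four ground-truth keywords are found, so precision exceeds recall, while the Jaccard index penalizes both the missed and the spurious keywords.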
Next, we show the Jaccard index, recall, and precision for cases 1–10. Note that the size-based method (case 9) displays large tweet communities; therefore, the number of displayed keywords is likely to be larger than that of the other methods. This may make the comparison unfair. Based on the report on human short-term memory [39], we assume that users memorize only the seven most frequent keywords in each tweet community. To perform a fair comparison based on this practical assumption, Fig. 3 shows the evaluation results for cases 1–4, 9, and 10. For the $k$-NN network construction in Section III-B, we should avoid a $k$ that leads to erroneous detection of tweet communities. A large $k$ is not suitable because it results in a network that is too dense to reveal the community structure. For this reason, we here set $k$ to 3. From Fig. 3, we can observe that the proposed method (cases 1–4) achieves better results than the size-based method (case 9) and the random method (case 10), for every metric. In particular, the superiority of the proposed method over the random method is statistically significant. For the Jaccard index, the p-values of Welch’s t-test when comparing cases 1–4 with case 10 are 0.001, 0.003, 0.002, and 0.002, respectively. For the recall, those are 0.002, 0.004, 0.002, and 0.002, respectively. For the precision, those are 0.024, 0.017, 0.003, and 0.007, respectively. Furthermore, Fig. 4 shows the evaluation results for cases 5–10. The performance of the variants of the proposed method (cases 5–8) is worse than that of the proposed method (cases 1–4 in Fig. 3). Although the performance in cases 5–8 is superior to that of the random method (case 10), it is worse than that of the size-based method (case 9). This shows the necessity of simultaneously evaluating the importance of topics in the entire data set and the occupancy of tweet communities in the network structure. Thus, the validity of the proposed method can be confirmed. Also, we can observe that the results with degree centrality, closeness centrality, betweenness centrality, and HITS centrality are almost the same. This may be because the global structure and the local structure are similar due to the small size of the community network. If the community network becomes large and dense, HITS centrality would become powerful because it is equivalent to PCA and can exploit the global structure even for a large and dense network.
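The significance test can be sketched as follows. The code computes Welch's t statistic and the Welch–Satterthwaite degrees of freedom for two samples with unequal variances; converting them into a p-value additionally requires the cumulative distribution function of Student's t distribution, which is omitted here. The function name and the score lists are hypothetical and are not the article's results.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical Jaccard indices over repeated validations.
proposed = [0.30, 0.32, 0.28, 0.31, 0.29]
random_order = [0.20, 0.25, 0.15, 0.22, 0.18]
t_stat, dof = welch_t(proposed, random_order)
```

Unlike Student's t-test, this statistic does not assume equal variances in the two groups, which suits a comparison in which the random ordering varies more across validations than the ranked ordering.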
Fig. 3. (a) Jaccard index. (b) Recall. (c) Precision. The horizontal axis shows the number of tweet communities displayed to users. The data points and error bars show the mean and standard deviation of the results of ten validations. The means for each case are shown in parentheses. The seven most frequent keywords in each tweet community are displayed to users. We show the results when $k$ was set to 3 in the $k$-NN network construction in Section III-B.
Fig. 4. (a) Jaccard index. (b) Recall. (c) Precision. The notation of these figures is the same as in Fig. 3. The seven most frequent keywords in each tweet community are displayed to users. We show the results when $k$ was set to 3 in the $k$-NN network construction in Section III-B.
Table I shows examples of the keywords extracted as the ground truth, those identified by the proposed method, and those found by random selection. Our method extracted keywords about the declaration of a state of emergency (“official announcement,” “declaration,” and “emergency”). As explained above, a state of emergency was declared by the Japanese government on April 7. “Business suspension” may appear because of the government's request for business suspension to prevent the spread of COVID-19 infection. Conversely, our method and the random method incorrectly detected “society” and “digital,” respectively, as keywords. These words seem to be too general to capture COVID-19-related topics. We notice that the random method cannot detect any correct keywords. From this fact, we can confirm the superiority of our method for ranking tweet communities in descending order of the importance measures.
TABLE II. Correspondence Between the English Translation and the Original Japanese for the Extracted Keywords in Fig. 7.
| Keywords shown in Fig. 7(b) | |
|---|---|
| English translation | Original Japanese |
| infection | |
| expansion | |
| cancellation | |
| prevention | |
| schedule | |
| influence | |
| notification | |
| Keywords shown in Fig. 7(c) | |
| English translation | Original Japanese |
| pneumonia | |
| infection | |
| misinformation | |
| information | |
| countermeasure | |
| welfare | |
| influence | |
| Keywords shown in Fig. 7(d) | |
| English translation | Original Japanese |
| infection | |
| Hyogo (a prefecture in Japan) | |
| confirmation | |
| man | |
| Nishinomiya (a city in Hyogo Prefecture) | |
| within the prefecture | |
| Osaka | |
TABLE III. Correspondence Between the English Translation and the Original Japanese for the Extracted Keywords in Fig. 8.
| Keywords shown in Fig. 8(b) | |
|---|---|
| English translation | Original Japanese |
| mask | |
| infection | |
| countermeasure | |
| distribution | |
| prevention | |
| Abe (name of the Prime Minister in Japan at that time) | |
| government | |
| Keywords shown in Fig. 8(c) | |
| English translation | Original Japanese |
| mask | |
| Abe (name of the Prime Minister in Japan at that time) | |
| government | |
| infection | |
| household | |
| Prime Minister | |
| countermeasure | |
| Keywords shown in Fig. 8(d) | |
| English translation | Original Japanese |
| infection | |
| confirmation | |
| news | |
| announcement | |
| within the prefecture | |
| NHK (the abbreviation of Japan Broadcasting Corporation) | |
| man | |
TABLE IV. Correspondence Between the English Translation and the Original Japanese for the Extracted Keywords in Fig. 9.
| Keywords shown in Fig. 9(b) | |
|---|---|
| English translation | Original Japanese |
| infection | |
| deceased | |
| patient | |
| Tokyo | |
| hospital | |
| occurrence | |
| announcement | |
| Keywords shown in Fig. 9(c) | |
| English translation | Original Japanese |
| infection | |
| news | |
| NHK (the abbreviation of Japan Broadcasting Corporation) | |
| confirmation | |
| deceased | |
| Tokyo | |
| Hokkaido | |
| Keywords shown in Fig. 9(d) | |
| English translation | Original Japanese |
| infection | |
| information | |
| video | |
| relevance | |
| self-restraint | |
| publication | |
| countermeasure | |
TABLE I. Examples of Keywords Extracted as the Ground Truth, Keywords Identified by the Proposed Method, and Keywords Found by Random Selection. The Number of Displayed Tweet Communities, the Number of Displayed Keywords From Each Tweet Community, and the Value of $k$ Are 15, 7, and 3, Respectively.
| Ground truth | | |
|---|---|---|
| English translation of keywords | Keywords in Japanese | CC (correlation coefficient) |
| business suspension | | 0.67 |
| official announcement | | 0.61 |
| hand sanitizer gel | | 0.61 |
| hospital | | 0.57 |
| emergency | | 0.46 |
| declaration | | 0.46 |
| state | | 0.46 |
| China | | −0.58 |
| Italy | | −0.48 |
| cruise | | −0.48 |
| princess | | −0.44 |
| diamond | | −0.41 |
| Proposed method (the asterisk shows incorrectly detected keywords) | | |
| English translation of keywords | Keywords in Japanese | CC (correlation coefficient) |
| business suspension | | 0.52 |
| emergency | | 0.44 |
| declaration | | 0.43 |
| official announcement | | 0.41 |
| * society | * | * 0.40 |
| influence | | −0.40 |
| Random selection (the asterisk shows incorrectly detected keywords) | | |
| English translation of keywords | Keywords in Japanese | CC (correlation coefficient) |
| * digital | * | * 0.44 |
4). Verification Using Another $k$ Value:
To test another value of $k$, Fig. 5 shows the evaluation results where $k$ was set to 6. Even in this setting, we can observe the effectiveness of the proposed method (cases 1–4). In particular, we can confirm the statistically significant superiority of the proposed method over the random method (case 10). For the Jaccard index, the p-values of Welch’s t-test when comparing cases 1–4 with case 10 are 0.000, 0.000, 0.005, and 0.000, respectively. For the recall, those are 0.000, 0.000, 0.005, and 0.000, respectively. For the precision, those are 0.000.
Fig. 5.
(a) Jaccard index. (b) Recall. (c) Precision. The notation of these figures is the same as in Fig. 3. The seven most frequent keywords in each tweet community are displayed to users. We show the results when k was set to 6 in the k-NN network construction in Section III-B.
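The three evaluation measures and the significance test used above can be sketched as follows. This is a minimal illustration, assuming that each measure compares the set of displayed keywords with the set of ground-truth keywords; the function names `set_metrics` and `welch_t` and all sample values are hypothetical. The sketch computes only Welch's t statistic and the Welch-Satterthwaite degrees of freedom; obtaining a p-value additionally requires the t-distribution, which a statistics library would provide.

```python
import math
from statistics import mean, variance


def set_metrics(displayed, ground_truth):
    """Return (Jaccard index, recall, precision) for two keyword collections."""
    displayed, ground_truth = set(displayed), set(ground_truth)
    hits = len(displayed & ground_truth)
    union = len(displayed | ground_truth)
    jaccard = hits / union if union else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    precision = hits / len(displayed) if displayed else 0.0
    return jaccard, recall, precision


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical keyword sets (illustrative only).
jaccard, recall, precision = set_metrics(
    ["emergency", "declaration", "society"],
    ["emergency", "declaration", "hospital", "state"],
)

# Hypothetical per-trial Jaccard indices for two methods (illustrative only).
t_stat, dof = welch_t([0.30, 0.35, 0.40], [0.10, 0.15, 0.20])
```

Welch's variant does not assume equal variances in the two groups, which is why the degrees of freedom are estimated from the data rather than fixed by the sample sizes.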
In general, it is difficult to find the best k for topic extraction. To overcome this difficulty, we previously proposed a method [9] that collaboratively integrates community detection results obtained with multiple k values. Our future work includes the investigation of suitable k values.
5). Performance Limitation in the Proposed Method:
In Sections IV-A3 and IV-A4, evaluations were performed when only the seven most frequent keywords in each tweet community were displayed to users. Here, we perform evaluations when all keywords in each tweet community are displayed to users. This condition is not practical because users would need a long time to read so many keywords. Therefore, the evaluations here aim at verifying the performance limitation of the proposed method. Fig. 6 shows the evaluation results (where k was set to 3). We can observe that the performance of the proposed method (cases 1–4) is almost the same as that of the size-based method (case 9). As described in Section IV-A3, the size-based method displays more keywords than the proposed method. Thus, the correct keywords are likely to be included among the many displayed keywords. This results in performance that is comparable with that of the proposed method. In summary, the proposed method is especially effective in the practical condition where users can read only a limited number of keywords. In the case where users can read all keywords, the size-based method is substantially effective as well.
Fig. 6.
(a) Jaccard index. (b) Recall. (c) Precision. The notation of these figures is the same as in Fig. 3. All keywords in each tweet community are displayed to users. We show the results when k was set to 3 in the k-NN network construction in Section III-B.
B. Examples of Displayed Tweet Communities
In this section, we show examples of the tweet communities that are displayed to users. Figs. 7–9 show three tweet communities, in descending order of importance measures, for March 1, April 1, and May 1, respectively. Here, we show the results obtained by the importance measures with HITS centrality (case 4).
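The display step described here, namely ranking tweet communities by importance measure and listing the most frequent words of the top-ranked ones, can be sketched as follows. This is a minimal illustration, assuming that each community is available as a list of tokenized tweets and that the importance measures have already been computed; the function name `top_communities`, the community identifiers, and all values are hypothetical.

```python
from collections import Counter


def top_communities(communities, importance, n_communities=3, n_words=7):
    """Rank communities by importance and list their most frequent words.

    communities: dict mapping community id -> list of tokenized tweets
    importance:  dict mapping community id -> importance measure
    """
    ranked = sorted(communities, key=lambda cid: importance[cid], reverse=True)
    result = []
    for cid in ranked[:n_communities]:
        # Count every word occurrence over all tweets in the community.
        counts = Counter(word for tweet in communities[cid] for word in tweet)
        result.append((cid, [word for word, _ in counts.most_common(n_words)]))
    return result


# Hypothetical tokenized tweets and importance measures (illustrative only).
communities = {
    "a": [["mask", "shortage"], ["mask", "government"]],
    "b": [["infection", "tokyo"]],
    "c": [["event", "cancel"], ["event"]],
    "d": [["video"]],
}
importance = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.05}

top = top_communities(communities, importance)
```

The defaults of three communities and seven words mirror the setting of Figs. 7–9; both are ordinary parameters and can be changed.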
Fig. 7.

(a) Visualization of three tweet communities, in descending order of importance measures (case 4), on March 1, 2020. The dots represent tweets, and the colors represent the tweet communities to which the tweets belong. Red, blue, and green colors show the tweet communities with the largest, second largest, and third largest importance measures, respectively. (b) Seven most frequent words in the tweet community with the largest importance measure. (c) Seven most frequent words in the tweet community with the second largest importance measure. (d) Seven most frequent words in the tweet community with the third largest importance measure. The correspondence between the original Japanese and the English translation is shown in Table II.
Fig. 8.

(a) Visualization of three tweet communities, in descending order of importance measures (case 4), on April 1, 2020. The notation in (b)–(d) is the same as in Fig. 7. The correspondence between the original Japanese and the English translation is shown in Table III.
Fig. 9.

(a) Visualization of three tweet communities, in descending order of importance measures (case 4), on May 1, 2020. The notation in (b)–(d) is the same as in Fig. 7. The correspondence between the original Japanese and the English translation is shown in Table IV.
On March 1, Fig. 7(b) shows the news about the cancellation of many events that attract large crowds, to prevent the spread of COVID-19 infection. Fig. 7(c) shows concern and warnings about misinformation on COVID-19. In Fig. 7(d), the news and concern about the first COVID-19 infection in Nishinomiya City in Hyogo Prefecture appear.
Around April 1, the shortage of masks was a serious concern in Japan. To deal with this situation, Prime Minister Shinzo Abe declared that the government would issue two masks per household [see Fig. 8(b) and (c)]. Fig. 8(d) shows the news and people’s concern about infection spread all over the country.
On May 1, Fig. 9(b) and (c) show the news about deaths due to COVID-19. More specifically, in Fig. 9(b), we can observe the report that those in their sixties or older account for about 90% of the total. The curation of COVID-19-related information from various regions, and video messages from celebrities, were confirmed [see Fig. 9(d)]. This tweet community includes tweets about countermeasures, including opinions from experts on COVID-19. This may show people’s wish, after the long period of self-restraint, to avoid the spread of COVID-19 infection.
From these results, we find that people’s attention changed over time, from concern about the infection to countermeasures and the wish for an end to the spread of COVID-19. The examples in this section confirm that our method is useful for understanding the situation, and acquiring the relevant knowledge, through the ranking of tweet communities by importance measures.
V. Conclusion and Future Work
This article presented a method that detects tweet communities with similar topics and ranks the communities by importance measures. By identifying only the communities with high importance measures, it becomes possible for users to easily find important information about COVID-19. Specifically, we construct a community network whose nodes are tweet communities, obtained by applying a community detection method to a tweet network. We then calculate centrality on the community network as importance measures to detect the most central tweet communities. We conducted experiments by collecting Japanese tweets about COVID-19 sent between March 1, 2020 and May 15, 2020. The results show that our method can successfully extract keywords that are strongly correlated with the number of people infected with COVID-19. Because users can browse the keywords from a small number of central tweet communities, quick and easy understanding of important information becomes feasible.
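The centrality step summarized above can be illustrated with HITS, one of the centrality algorithms used in the experiments (case 4). The following is a minimal sketch that runs a plain power iteration on a weighted adjacency matrix; it is not the exact formulation of the article. The function name `hits_scores` and the edge weights are hypothetical, and the weights merely stand in for values that would combine textual similarity and community size.

```python
def hits_scores(weights, iterations=100, tol=1e-10):
    """Hub and authority scores by power iteration on a weighted adjacency matrix.

    weights[i][j] is the weight of the edge from node i to node j.
    Scores are normalized to sum to 1.
    """
    n = len(weights)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority update: sum of hub scores over incoming edges.
        new_auths = [sum(weights[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        # Hub update: sum of the new authority scores over outgoing edges.
        new_hubs = [sum(weights[i][j] * new_auths[j] for j in range(n)) for i in range(n)]
        norm_a = sum(new_auths) or 1.0
        norm_h = sum(new_hubs) or 1.0
        new_auths = [a / norm_a for a in new_auths]
        new_hubs = [h / norm_h for h in new_hubs]
        delta = max(abs(x - y) for x, y in zip(new_auths, auths))
        hubs, auths = new_hubs, new_auths
        if delta < tol:
            break
    return hubs, auths


# Hypothetical symmetric community network with four tweet communities.
# The weights are illustrative stand-ins, not values from the article.
community_weights = [
    [0, 3, 2, 1],
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [1, 0, 0, 0],
]

hub_scores, authority_scores = hits_scores(community_weights)
ranking = sorted(range(len(authority_scores)),
                 key=lambda i: authority_scores[i], reverse=True)
```

For an undirected (symmetric) network such as this one, hub and authority scores coincide, so either can serve as the ranking score.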
We now discuss how to use the proposed method for fighting COVID-19. The proposed method will be beneficial for quick and objective decision-making based on public opinions. The small number of tweet communities detected by our method helps individuals and organizations find meaningful keywords such as those in Figs. 7–9. This makes it possible to make decisions quickly, without taking a long time to manually search a flood of information. It is also notable that such decision-making is based on objective data. If individuals and organizations read a small number of tweets selected subjectively, they may make wrong decisions contrary to public opinions. Our method helps solve this problem. Moreover, our method will accelerate various data mining research for fighting COVID-19, such as opinion mining and sentiment analysis. In such research, tweets that are irrelevant to COVID-19 may increase computational cost and may cause noisy results. Because our method can extract relevant tweets from many tweets, it helps overcome these drawbacks.
Furthermore, we focus on misinformation that is unique to specific regions and/or time periods. In fact, misinformation that 5G networks are the cause of COVID-19 was observed in specific regions such as Europe, America, and the Middle East. The proposed method can be used to handle this type of misinformation on Twitter as follows.
1) If misinformation is mentioned in many tweets, the proposed method can visualize it as tweet communities. Thus, a user can browse tweet communities, including misinformation.
2) The user selects a tweet community in which they would like to verify whether misinformation is included or not.
3) The proposed method is applied to tweets in different regions and/or time periods. Then, tweet communities that are similar to the user’s selected tweet community are extracted.
4) We display the difference between the extracted tweet communities and the user’s selected one. It will be useful to visualize the difference of most frequent words like Figs. 7, 8, and 9. In reference to the visualized difference, the user judges whether misinformation is included or not.
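Steps 3) and 4) above can be sketched as follows. Because this system is left as future work, the sketch rests on assumptions: each tweet community is represented by its word-frequency counts, similarity is measured by cosine similarity (one plausible choice; the similarity measure is not fixed here), and the displayed difference is the set of most frequent words that appear in only one of the two communities. All function names, region labels, and counts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-frequency Counters."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(selected, candidates):
    """Step 3: id of the candidate community most similar to the selected one."""
    return max(candidates, key=lambda cid: cosine_similarity(selected, candidates[cid]))


def top_word_difference(counts_a, counts_b, n_words=7):
    """Step 4: most frequent words found in only one of the two communities."""
    top_a = {w for w, _ in counts_a.most_common(n_words)}
    top_b = {w for w, _ in counts_b.most_common(n_words)}
    return sorted(top_a - top_b), sorted(top_b - top_a)


# Hypothetical word counts (illustrative only).
selected = Counter({"5g": 5, "covid": 4, "cause": 3})
candidates = {
    "region_x": Counter({"5g": 4, "covid": 5, "tower": 2}),
    "region_y": Counter({"mask": 6, "covid": 2}),
}

best = most_similar(selected, candidates)
only_selected, only_best = top_word_difference(selected, candidates[best])
```

The two returned word lists are what a user would inspect to judge whether the selected community carries a claim that is absent from otherwise similar communities.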
In the future, we will develop this system and evaluate its effectiveness. Note that this system will be useful only for misinformation that is mentioned in many tweets and is unique to specific regions and/or time periods. Thus, future work includes the detection of other types of misinformation.
The scope of this study is limited to COVID-19. Thus, future work includes the application of the proposed importance measures to topics other than COVID-19. In our previous study [40], we confirmed that the community network is beneficial for efficient grouping of similar Web videos for retrieval. Specifically, we formulated the grouping of similar Web videos as community detection in a network whose nodes are Web videos and whose edges are hyperlinks weighted by video similarities. Then, we constructed a community network in which each node includes multiple Web videos. By applying the community detection method [41] to the community network, we can efficiently group similar Web videos while preserving the accuracy of retrieval. In this way, the versatility of the community network is confirmed. In the future, we will evaluate the proposed importance measures as well as the community network for other topics.
We believe that this study is one of the pioneering works on data mining for tackling the difficulty caused by COVID-19. However, we will develop more sophisticated methodologies. Future work includes the improvement of tweet community detection by using multimodal features, such as deep-learning-based text features, sentiment features, and visual features of attached images. After this improvement, we will develop a method for predicting the trend of people’s attention to COVID-19 over time; this is required because the method proposed in this article does not include a time series modeling scheme.
Acknowledgment
The authors would like to thank Edanz Group (https://en-author-services.edanzgroup.com/) for editing a draft of this article.
Biographies

Ryosuke Harakawa (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronics and information engineering from Hokkaido University, Sapporo, Japan, in 2013, 2015, and 2016, respectively.
He is currently an Assistant Professor with the Department of Electrical, Electronics, and Information Engineering, Nagaoka University of Technology, Nagaoka, Japan. His research interests include multimedia information retrieval and Web mining.
Dr. Harakawa is a member of the ACM, the IEICE, and the Institute of Image Information and Television Engineers (ITE).

Masahiro Iwahashi (Senior Member, IEEE) received the B.Eng., M.Eng., and D.Eng. degrees in electrical engineering from Tokyo Metropolitan University, Tokyo, Japan, in 1988, 1990, and 1996, respectively.
In 1990, he joined Nippon Steel Company Ltd. Since 1993, he has been with the Nagaoka University of Technology, Nagaoka, Japan, where he is currently a Professor with the Department of Electrical, Electronics and Information Engineering. His research interests include the areas of digital signal processing, multirate systems, and image compression.
Dr. Iwahashi is a Senior Member of the IEICE and a member of the Asia Pacific Signal and Information Processing Association (APSIPA) and the Institute of Image Information and Television Engineers (ITE).
Funding Statement
This work was supported by the Adaptable and Seamless Technology Transfer Program through Target-Driven Research and Development (A-STEP) from Japan Science and Technology Agency (JST) under Grant JPMJTM20DJ.
Footnotes
The terms community and cluster have the same meaning, in general; however, we use the term community in this article to avoid confusion with the term cluster used in information science and the word cluster that means a group of people infected with COVID-19.
Contributor Information
Ryosuke Harakawa, Email: harakawa@vos.nagaokaut.ac.jp.
Masahiro Iwahashi, Email: iwahashi@vos.nagaokaut.ac.jp.
References
- [1].Kwak H., Lee C., Park H., and Moon S., “What is Twitter, a social network or a news media?” in Proc. ACM Int. Conf. World Wide Web (WWW), 2010, pp. 591–600.
- [2].Java A., Song X., Finin T., and Tseng B., “Why we Twitter: Understanding microblogging usage and communities,” in Proc. 9th WebKDD 1st SNA-KDD Workshop Web Mining Social Netw. Anal., 2007, pp. 56–65.
- [3].Alnajran N., Crockett K., McLean D., and Latham A., “Cluster analysis of Twitter data: A review of algorithms,” in Proc. 9th Int. Conf. Agents Artif. Intell., 2017, pp. 1–11.
- [4].Cinelli M. et al., “The COVID-19 social media infodemic,” 2020, arXiv:2003.05004. [Online]. Available: http://arxiv.org/abs/2003.05004
- [5].Kishore Shahi G., Dirkson A., and Majchrzak T. A., “An exploratory study of COVID-19 misinformation on Twitter,” 2020, arXiv:2005.05710. [Online]. Available: http://arxiv.org/abs/2005.05710
- [6].Medford R. J., Saleh S. N., Sumarsono A., Perl T. M., and Lehmann C. U., “An ‘infodemic’: Leveraging high-volume Twitter data to understand early public sentiment for the coronavirus disease 2019 outbreak,” Medrxiv, vol. 7, Oct. 2020, Art. no. ofaa258, doi: 10.1101/2020.04.03.20052936.
- [7].Qian S., Zhang T., and Xu C., “Multi-modal multi-view topic-opinion mining for social event analysis,” in Proc. 24th ACM Int. Conf. Multimedia, Oct. 2016, pp. 2–11.
- [8].Fang Q., Xu C., Sang J., Hossain M. S., and Muhammad G., “Word-of-mouth understanding: Entity-centric multimodal aspect-opinion mining in social media,” IEEE Trans. Multimedia, vol. 17, no. 12, pp. 2281–2296, Dec. 2015.
- [9].Harakawa R., Takimura S., Ogawa T., Haseyama M., and Iwahashi M., “Consensus clustering of tweet networks via semantic and sentiment similarity estimation,” IEEE Access, vol. 7, pp. 116207–116217, 2019.
- [10].Harakawa R., Ogawa T., and Haseyama M., “Extracting hierarchical structure of Web video groups based on sentiment-aware signed network analysis,” IEEE Access, vol. 5, pp. 16963–16973, 2017.
- [11].Harakawa R., Ogawa T., and Haseyama M., “Tracking topic evolution via salient keyword matching with consideration of semantic broadness for Web video discovery,” Multimedia Tools Appl., vol. 77, no. 16, pp. 20297–20324, Aug. 2018.
- [12].Harakawa R., Ogawa T., and Haseyama M., “A Web video retrieval method using hierarchical structure of Web video groups,” Multimedia Tools Appl., vol. 75, no. 24, pp. 17059–17079, Dec. 2016.
- [13].Pitas I., Graph-Based Social Media Analysis. London, U.K.: Chapman & Hall, 2015.
- [14].Singh L. et al., “A first look at COVID-19 information and misinformation sharing on Twitter,” 2020, arXiv:2003.13907. [Online]. Available: http://arxiv.org/abs/2003.13907
- [15].Ferrara E., “What types of COVID-19 conspiracies are populated by Twitter bots?” 1st Monday, vol. 25, no. 6, May 2020, doi: 10.5210/fm.v25i6.10633.
- [16].Hamzah F. B., Lau C., Nazri H., Ligot D. V., and Lee G., “Coronatracker: World-wide COVID-19 outbreak data analysis and prediction,” Bull. World Health Org., vol. 2, pp. 1–31, Dec. 2020.
- [17].Zhong L., Mu L., Li J., Wang J., Yin Z., and Liu D., “Early prediction of the 2019 novel coronavirus outbreak in the mainland China based on simple mathematical model,” IEEE Access, vol. 8, pp. 51761–51769, 2020.
- [18].Zheng N., Du S., Wang J., Zhang H., Cui W., and Kang Z., “Predicting COVID-19 in China using hybrid ai model,” IEEE Trans. Cybern., vol. 50, no. 7, pp. 2891–2904, May 2020.
- [19].Sha H., Al Hasan M., Mohler G., and Brantingham P. J., “Dynamic topic modeling of the COVID-19 Twitter narrative among U.S. Governors and cabinet executives,” 2020, arXiv:2004.11692. [Online]. Available: http://arxiv.org/abs/2004.11692
- [20].Ng T. W., Turinici G., and Danchin A., “A double epidemic model for the SARS propagation,” BMC Infectious Diseases, vol. 3, no. 1, p. 19, Dec. 2003.
- [21].Hochreiter S. and Schmidhuber J., “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
- [22].Chen E., Lerman K., and Ferrara E., “Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set,” JMIR Public Health Surveill., vol. 6, no. 2, May 2020, Art. no. e19273.
- [23].Alqurashi S., Alhindi A., and Alanazi E., “Large arabic Twitter dataset on COVID-19,” 2020, arXiv:2004.04315. [Online]. Available: http://arxiv.org/abs/2004.04315
- [24].Wicke P. and Bolognesi M. M., “Framing COVID-19: How we conceptualize and discuss the pandemic on Twitter,” 2020, arXiv:2004.06986. [Online]. Available: http://arxiv.org/abs/2004.06986
- [25].Li L. et al., “Characterizing the propagation of situational information in social media during COVID-19 epidemic: A case study on Weibo,” IEEE Trans. Comput. Social Syst., vol. 7, no. 2, pp. 556–562, Apr. 2020.
- [26].Blei D. M., Ng A. Y., and Jordan M. I., “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
- [27].Chamola V., Hassija V., Gupta V., and Guizani M., “A comprehensive review of the COVID-19 pandemic and the role of IoT, drones, AI, blockchain, and 5G in managing its impact,” IEEE Access, vol. 8, pp. 90225–90265, 2020.
- [28].Latif S. (Apr. 2020). Leveraging Data Science to Combat COVID-19: A Comprehensive Review. [Online]. Available: https://www.techrxiv.org/articles/Leveraging_Data_Science_To_Combat_COVID-19_A_Comprehensive_Review/12212516
- [29].Almatarneh S., Gamallo P., and Pena F. J. R., “CiTIUS-COLE at SemEval-2019 task 5: Combining linguistic features to identify hate speech against immigrants and women on multilingual tweets,” in Proc. 13th Int. Workshop Semantic Eval., 2019, pp. 387–390.
- [30].Sebastiani F., “Machine learning in automated text categorization,” ACM Comput. Surveys, vol. 34, no. 1, pp. 1–47, Mar. 2002.
- [31].Le Q. V. and Mikolov T., “Distributed representations of sentences and documents,” 2014, arXiv:1405.4053. [Online]. Available: http://arxiv.org/abs/1405.4053
- [32].Blondel V. D., Guillaume J.-L., Lambiotte R., and Lefebvre E., “Fast unfolding of communities in large networks,” J. Stat. Mech., Theory Exp., vol. 2008, no. 10, Oct. 2008, Art. no. P10008.
- [33].Takehara D., Harakawa R., Ogawa T., and Haseyama M., “Extracting hierarchical structure of content groups from different social media platforms using multiple social metadata,” Multimedia Tools Appl., vol. 76, no. 19, pp. 20249–20272, Oct. 2017.
- [34].Harakawa R., Ogawa T., and Haseyama M., “Accurate and efficient extraction of hierarchical structure of Web communities for Web video retrieval,” ITE Trans. Media Technol. Appl., vol. 4, no. 1, pp. 49–59, 2016.
- [35].Arenas A., Duch J., Fernandez A., and Gomez S., “Size reduction of complex networks preserving modularity,” New J. Phys., vol. 9, no. 176, pp. 604–632, 2007.
- [36].Kleinberg J. M., “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, no. 5, pp. 604–632, Sep. 1999.
- [37].Saerens M. and Fouss F., “HITS is principal components analysis,” in Proc. IEEE/WIC/ACM Int. Conf. Web Intell., Sep. 2005, pp. 782–785.
- [38].Akoglu H., “User’s guide to correlation coefficients,” Turkish J. Emergency Med., vol. 18, no. 3, pp. 91–93, Sep. 2018.
- [39].Miller G. A., “The magical number seven, plus or minus two: Some limits on our capacity for processing information,” Psychol. Rev., vol. 63, no. 2, pp. 81–97, 1956.
- [40].Harakawa R., Ogawa T., and Haseyama M., “[Paper] an efficient extraction method of hierarchical structure of Web communities for Web video retrieval,” ITE Trans. Media Technol. Appl., vol. 2, no. 3, pp. 287–297, 2014.
- [41].Newman M. E. J. and Girvan M., “Finding and evaluating community structure in networks,” Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 69, no. 2, Feb. 2004, Art. no. 026113.