A generalised significance test for individual communities in networks

Sadamori Kojaku; Naoki Masuda

doi:10.1038/s41598-018-25560-z

2018 May 9;8:7351. doi: 10.1038/s41598-018-25560-z


PMCID: PMC5943579 PMID: 29743534

Abstract

Many empirical networks have community structure, in which nodes are densely interconnected within each community (i.e., a group of nodes) and sparsely across different communities. Like other local and meso-scale structures of networks, communities are generally heterogeneous in various aspects such as the size, density of edges, connectivity to other communities and significance. In the present study, we propose a method to statistically test the significance of individual communities in a given network. Compared to the previous methods, the present algorithm is unique in that it accepts different community-detection algorithms and the corresponding quality function for single communities. The present method requires that the quality of each community can be quantified and that community detection is performed as optimisation of such a quality function summed over the communities. Various community-detection algorithms including modularity maximisation and graph partitioning meet this criterion. Our method estimates a distribution of the quality function for randomised networks to calculate a likelihood of each community in the given network. We illustrate our algorithm using synthetic and empirical networks.

Introduction

Many biological, physical and social systems can be expressed as networks, with nodes representing individual entities within the network and edges representing pairwise relationships between nodes^1,2. Among various structural properties of networks, many empirical networks have community structure such that a network is composed of communities, which are groups of nodes that are densely interconnected with each other while sparsely interconnected with those in other groups^3,4. A community may correspond to the role of nodes. For example, communities may correspond to functional modules of proteins⁵, groups of airports serving the same geographical region⁶ and herds of people sharing an interest⁷.

Many algorithms have been proposed for finding communities in networks^3,4. These algorithms are often equipped with a quality function with which to judge whether or not the detected community structure is significant overall. A much less asked fundamental question is the significance of individual communities. In fact, a network may be composed of a part where community structure is pronounced and another part where community structure is vague or absent. To discuss community structure in such a “chimera” network, one needs methods to assess statistical significance of single communities.

In the present study, we consider the significance of single communities that have been detected by a non-overlapping community-detection algorithm. An algorithm for testing significance of individual communities was previously proposed⁸. In that algorithm, one uses a quality function for individual communities to compare the quality of a community in question, detected in the given network, and that detected in randomised networks. The distribution of the quality function in randomised networks is analytically known. The authors then used the same significance test in OSLOM, which is an algorithm for finding various types of communities⁹. However, OSLOM does not optimise the same quality function as that used in the aforementioned statistical test or its aggregate over the different communities. The same discrepancy exists in a different significance test for single communities¹⁰. In an extreme case, let us suppose one detects communities by optimising a quality function that is very different from the quality function used in the statistical test. Then, the detected communities may have small values of the quality function used in the statistical test and will be judged to be insignificant. However, in terms of the quality function used in the community detection, the detected communities may be sufficiently strong.

This pitfall may be overcome if one uses the same quality function for the community detection and the statistical test. There exist such significance tests for individual communities^11,12. However, these significance tests^11,12 do not consider the possible dependence of the quality function value on the size of community^10,13,14. This practice is problematic for the following reason. Suppose that two communities in the given network have different sizes and bear the same value of the quality function. Then, the significance level (i.e., p-value) in these statistical tests is the same for the two communities. In general, however, the quality function value may be positively correlated with the community size, which is in fact often the case (Methods section). In this case, it is easier for the larger community to attain the observed quality function value than for the smaller community under the null model. Then, the smaller community should be judged to be more significant than the larger community if they yield the same quality function value. An aforementioned statistical test does consider the dependence of the quality function value on the community size¹⁰. However, that method does not use a common quality function between community detection and statistical testing, as discussed already.

Based on these considerations, it will be useful to develop methods to test the significance of individual communities that (i) use a quality function that is consistent with the one used in community detection, and (ii) take into account the dependence of the quality function value on the community size. We will develop a new statistical test for individual communities that meets these criteria. An additional feature of our method is that it allows for general quality functions. Python code for the present significance test is available at https://github.com/skojaku/qstest/.

Methods

Correlation between quality and community size

We consider unweighted networks composed of N nodes. Denote their N × N adjacency matrix by A = (A_ij), where A_ij = 1 if nodes i and j are adjacent and A_ij = 0 otherwise. We assume that the network is undirected (i.e., A_ij = A_ji for all i ≠ j) and does not contain self-loops (i.e., A_ii = 0). Let M be the number of edges in the network. We denote by $d_{i} \equiv \sum_{j = 1}^{N} A_{i j}$ the degree of node i.

One may regard a community as significant if its quality value is significantly larger than that expected for randomised networks. This intuitive approach has a problem. To see this, let us consider a benchmark network generated by the Lancichinetti-Fortunato-Radicchi (LFR) model¹⁵ (Fig. 1(a)). The network has N = 10³ nodes and consists of C non-overlapping communities. Each node i belongs to one of the C = 31 communities. To generate the network, we set the average node’s degree to 10, the maximum node’s degree to 100, the range of the number of nodes in a community c (denoted by n_c) to [10,100] and the power-law exponent for the distributions of d_i and n_c to 2. Let us consider a quality function $q_{c}^{\mod}$ given by^13,14

$$q_{c}^{\mod} \equiv \frac{1}{2 M} \sum_{\substack{1 \leq i, j \leq N \\ i, j \in \text{community } c}} \left( A_{i j} - \frac{d_{i} d_{j}}{2 M} \right). \tag{1}$$

Figure 1. (a) A network with 31 non-overlapping communities generated by the LFR model. The circles represent nodes. The lines between the nodes represent edges. The colour of each node indicates the planted community to which the node belongs. (b–e) Quality of a community (i.e., $q_{c}^{\mod}$, $q_{c}^{int}$, $q_{c}^{\exp}$ and $q_{c}^{cnd}$) plotted against its number of nodes, n_c. The circles indicate the planted communities shown in panel (a). The crosses indicate the communities detected in 500 randomised networks generated by the configuration model. To find communities in the randomised networks, we use the Louvain algorithm²⁶ for $q_{c}^{\mod}$ (panel (b)) and a variant of the Kernighan–Lin algorithm²⁷ for $q_{c}^{int}$, $q_{c}^{\exp}$ and $q_{c}^{cnd}$ (panels (c–e)).

Note that the modularity is the sum of $q_{c}^{\mod}$ over the communities⁷. We find a strong positive correlation between $q_{c}^{\mod}$ and n_c (circles in Fig. 1(b)). This is also true for communities in randomised networks that are generated by the configuration model, i.e., random networks that preserve the expected degree of each node (crosses in Fig. 1(b)). Crucially, large communities detected in the randomised networks have larger $q_{c}^{\mod}$ values than small communities in the original network do. Therefore, we cannot judge the significance of communities solely by the value of $q_{c}^{\mod}$. The results are qualitatively the same for the other quality functions for individual communities introduced in the following section (Fig. 1(c–e)).
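To make the definition of $q_{c}^{\mod}$ concrete, the following is a minimal Python sketch that evaluates it for one community directly from the adjacency matrix. The function name and the list-of-lists representation are illustrative assumptions and are not taken from the authors' code.

```python
def community_modularity(adj, members):
    """Contribution q_c^mod of one community to the modularity.

    adj     : symmetric 0/1 adjacency matrix (list of lists), no self-loops
    members : node indices belonging to community c
    Assumes the network has at least one edge (2M > 0).
    """
    degrees = [sum(row) for row in adj]   # d_i
    two_m = sum(degrees)                  # 2M, twice the number of edges
    members = list(members)
    total = 0.0
    # Sum over all ordered pairs (i, j) inside the community, including i == j.
    for i in members:
        for j in members:
            total += adj[i][j] - degrees[i] * degrees[j] / two_m
    return total / two_m


# Toy network: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

# Each triangle contributes 2.5/14 (about 0.179); their sum is the modularity.
q_left = community_modularity(adj, [0, 1, 2])
q_right = community_modularity(adj, [3, 4, 5])
```

Because the double sum runs over all pairs of nodes in the community, both terms of the summand grow with the community size, which is one way to see why $q_{c}^{\mod}$ tends to correlate with n_c.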

Our statistical test

On the basis of the observations made in the previous section, we construct a statistical test for individual communities as follows. Note that we do not specify the quality function q_c, which may be $q_{c}^{\mod}$ or a different one. Moreover, we do not specify how one measures the size s_c of community c. We refer to the present statistical test based on a quality function q and community size s as the (q, s)–test.

Suppose that we have a community c with quality q_c and size s_c. We judge community c to be significant if its q_c value is larger than those for communities of the same size s_c detected in randomised networks. We compute $P (\tilde{q} \geq q_{c} | s_{c})$, which is the probability that a community of size s_c detected in randomised networks generated by the configuration model has a quality value $\tilde{q}$ larger than or equal to q_c. We numerically estimate $P (\tilde{q} \geq q_{c} | s_{c})$ as follows. First, we generate 500 randomised networks using the configuration model. Then, we detect communities in each randomised network by the algorithm that has been used to detect communities in the original network. Let $\bar{C}$ be the total number of communities detected in the 500 randomised networks. For each community $\bar{c}$ $(1 \leq \bar{c} \leq \bar{C})$ in the randomised networks, we compute the quality $\tilde{q}_{\bar{c}}$ and size $\tilde{s}_{\bar{c}}$. Then, we compute the sample means, i.e., $\langle \tilde{q} \rangle \equiv \sum_{\bar{c} = 1}^{\bar{C}} \tilde{q}_{\bar{c}} / \bar{C}$ and $\langle \tilde{s} \rangle \equiv \sum_{\bar{c} = 1}^{\bar{C}} \tilde{s}_{\bar{c}} / \bar{C}$, and the sample standard deviations based on the unbiased variance estimator, i.e., $\sigma_{\tilde{q}} \equiv \sqrt{\sum_{\bar{c} = 1}^{\bar{C}} (\tilde{q}_{\bar{c}} - \langle \tilde{q} \rangle)^{2} / (\bar{C} - 1)}$ and $\sigma_{\tilde{s}} \equiv \sqrt{\sum_{\bar{c} = 1}^{\bar{C}} (\tilde{s}_{\bar{c}} - \langle \tilde{s} \rangle)^{2} / (\bar{C} - 1)}$. We estimate the joint probability distribution $P (\tilde{q}, \tilde{s})$ using the kernel density estimator¹⁶ as follows:

$$P (\tilde{q}, \tilde{s}) = \frac{1}{\bar{C}} \sum_{\bar{c} = 1}^{\bar{C}} f \left( \frac{\tilde{q} - \tilde{q}_{\bar{c}}}{h \sigma_{\tilde{q}}}, \frac{\tilde{s} - \tilde{s}_{\bar{c}}}{h \sigma_{\tilde{s}}} \right), \tag{2}$$

where h is the width of the kernel. The function f (·, ·) is the bivariate Gaussian kernel (i.e., bivariate standard normal distribution) given by

$$f (x_{1}, x_{2}) \equiv \frac{1}{2 \pi \sqrt{1 - \gamma^{2}}} \exp \left( - \frac{x_{1}^{2} - 2 \gamma x_{1} x_{2} + x_{2}^{2}}{2 (1 - \gamma^{2})} \right), \tag{3}$$

where

$$\gamma \equiv \frac{\sum_{\bar{c} = 1}^{\bar{C}} (\tilde{q}_{\bar{c}} - \langle \tilde{q} \rangle) (\tilde{s}_{\bar{c}} - \langle \tilde{s} \rangle)}{\sqrt{\sum_{\bar{c} = 1}^{\bar{C}} (\tilde{q}_{\bar{c}} - \langle \tilde{q} \rangle)^{2}} \sqrt{\sum_{\bar{c} = 1}^{\bar{C}} (\tilde{s}_{\bar{c}} - \langle \tilde{s} \rangle)^{2}}}, \tag{4}$$

is the Pearson correlation coefficient between $\{\tilde{q}_{\bar{c}}\}_{\bar{c} = 1}^{\bar{C}}$ and $\{\tilde{s}_{\bar{c}}\}_{\bar{c} = 1}^{\bar{C}}$. The probability distribution estimated by the Gaussian kernel converges to the true probability distribution, whatever its form, as the number of samples increases¹⁷. Although there are also non-Gaussian kernels that share this property¹⁷, we used the Gaussian kernel, which is a state-of-the-art method. The width h is a free parameter that affects the speed of convergence to the true probability distribution. Optimising the value of h requires assumptions about the true probability distribution and intensive computation^18,19. Therefore, we set $h = \bar{C}^{- 1/6}$ according to Scott’s rule of thumb²⁰, which often provides a reasonable estimate in practice^18–20.

The conditional probability, $P (\tilde{q} \geq q_{c} | s_{c})$, is given by

$$P (\tilde{q} \geq q_{c} | s_{c}) = \frac{\int_{q_{c}}^{\infty} P (\tilde{q}, s_{c}) \, d \tilde{q}}{\int_{- \infty}^{\infty} P (\tilde{q}, s_{c}) \, d \tilde{q}} = \frac{\sum_{\bar{c} = 1}^{\bar{C}} \int_{q_{c}}^{\infty} f \left( \frac{\tilde{q} - \tilde{q}_{\bar{c}}}{h \sigma_{\tilde{q}}}, \frac{s_{c} - \tilde{s}_{\bar{c}}}{h \sigma_{\tilde{s}}} \right) d \tilde{q}}{\sum_{\bar{c} = 1}^{\bar{C}} \int_{- \infty}^{\infty} f \left( \frac{\tilde{q} - \tilde{q}_{\bar{c}}}{h \sigma_{\tilde{q}}}, \frac{s_{c} - \tilde{s}_{\bar{c}}}{h \sigma_{\tilde{s}}} \right) d \tilde{q}}. \tag{5}$$

The integration of f (x₁, x₂) over x₁ yields

$$\int_{y}^{\infty} f (x_{1}, x_{2}) \, d x_{1} = \frac{1}{\sqrt{2 \pi}} \exp \left( - \frac{x_{2}^{2}}{2} \right) \left[ 1 - \Phi \left( \frac{y - \gamma x_{2}}{\sqrt{1 - \gamma^{2}}} \right) \right], \tag{6}$$

where Φ (·) is the cumulative distribution function of the standard normal distribution. By substituting Eq. (6) into Eq. (5), we have

$$P (\tilde{q} \geq q_{c} | s_{c}) = 1 - \frac{\sum_{\bar{c} = 1}^{\bar{C}} \exp \left[ - \left( \frac{s_{c} - \tilde{s}_{\bar{c}}}{\sqrt{2} h \sigma_{\tilde{s}}} \right)^{2} \right] \Phi \left( \frac{1}{\sqrt{1 - \gamma^{2}}} \left( \frac{q_{c} - \tilde{q}_{\bar{c}}}{h \sigma_{\tilde{q}}} - \gamma \frac{s_{c} - \tilde{s}_{\bar{c}}}{h \sigma_{\tilde{s}}} \right) \right)}{\sum_{\bar{c} = 1}^{\bar{C}} \exp \left[ - \left( \frac{s_{c} - \tilde{s}_{\bar{c}}}{\sqrt{2} h \sigma_{\tilde{s}}} \right)^{2} \right]}. \tag{7}$$
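The integration over x₁ used above can be checked by completing the square in the exponent of the kernel f. The short derivation below is an added worked step rather than part of the original derivation.

$$x_{1}^{2} - 2 \gamma x_{1} x_{2} + x_{2}^{2} = (x_{1} - \gamma x_{2})^{2} + (1 - \gamma^{2}) x_{2}^{2},$$

so that

$$f (x_{1}, x_{2}) = \frac{1}{\sqrt{2 \pi}} \exp \left( - \frac{x_{2}^{2}}{2} \right) \cdot \frac{1}{\sqrt{2 \pi (1 - \gamma^{2})}} \exp \left( - \frac{(x_{1} - \gamma x_{2})^{2}}{2 (1 - \gamma^{2})} \right).$$

The second factor is the density in x₁ of a normal distribution with mean γx₂ and variance 1 − γ², whose upper tail from y is $1 - \Phi ((y - \gamma x_{2}) / \sqrt{1 - \gamma^{2}})$. In the ratio that defines the conditional probability, the change of variables $x_{1} = (\tilde{q} - \tilde{q}_{\bar{c}}) / (h \sigma_{\tilde{q}})$ contributes the same factor $h \sigma_{\tilde{q}}$ to every term of the numerator and the denominator, so this factor cancels, as does $1 / \sqrt{2 \pi}$.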

Finally, we regard community c as significant if $P (\tilde{q} \geq q_{c} | s_{c}) \leq α$, where α ∈ [0, 1] is the significance level. The conditional probability $P (\tilde{q} \geq q_{c} | s_{c})$ obeys a uniform probability distribution over [0, 1] for a community detected in a randomised network (see Supplementary Information 1). One can estimate the p-values (i.e., $P (\tilde{q} \geq q_{c} | s_{c})$) more accurately by using a larger number of randomised networks, which, however, requires additional computation time. We opt to use 500 randomised networks to obtain sufficiently accurate p-values in a reasonable time. In fact, the p-value does not change much if one increases the number of randomised networks beyond 500 or if one uses networks with different numbers of nodes and communities (Supplementary Information 2).
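The closed-form p-value can be evaluated in a few lines once the pooled samples from the randomised networks are available. The following is a minimal pure-Python sketch under stated assumptions: the function and argument names are illustrative and are not taken from the authors' qstest package, and the sketch assumes that both sample standard deviations are positive and that |γ| < 1.

```python
import math


def _std_normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def qs_test_pvalue(q_c, s_c, q_rand, s_rand):
    """Estimate P(q~ >= q_c | s_c) with a bivariate Gaussian kernel.

    q_rand, s_rand : qualities and sizes of the communities pooled over all
                     randomised networks (equal-length sequences).
    """
    n = len(q_rand)
    mean_q = sum(q_rand) / n
    mean_s = sum(s_rand) / n
    ss_q = sum((q - mean_q) ** 2 for q in q_rand)
    ss_s = sum((s - mean_s) ** 2 for s in s_rand)
    sigma_q = math.sqrt(ss_q / (n - 1))
    sigma_s = math.sqrt(ss_s / (n - 1))
    # Pearson correlation coefficient between quality and size.
    gamma = sum((q - mean_q) * (s - mean_s)
                for q, s in zip(q_rand, s_rand)) / math.sqrt(ss_q * ss_s)
    h = n ** (-1.0 / 6.0)                 # Scott's rule of thumb in two dimensions
    root = math.sqrt(1.0 - gamma ** 2)

    numerator = 0.0
    denominator = 0.0
    for q, s in zip(q_rand, s_rand):
        x1 = (q_c - q) / (h * sigma_q)
        x2 = (s_c - s) / (h * sigma_s)
        weight = math.exp(-0.5 * x2 * x2)  # samples of similar size count more
        numerator += weight * _std_normal_cdf((x1 - gamma * x2) / root)
        denominator += weight
    return 1.0 - numerator / denominator
```

The weights decay with the distance between s_c and the size of each sampled community, so a query size far outside the sampled range can make the denominator underflow to zero; a production implementation would need to guard against that case.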

As the number of communities, C, increases, some insignificant communities would be judged to be significant owing to the multiple comparison problem. To avoid this, we use the Šidák correction²¹, i.e., α = 1 − (1 − α′)^(1/C), where α′ ∈ [0, 1] is the targeted significance level. We set α′ = 0.05.
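As a small worked example of the correction, the sketch below converts the targeted level α′ into the per-community level α and applies it to a list of p-values. The function names are illustrative assumptions.

```python
def sidak_alpha(target_alpha, n_communities):
    """Per-community significance level under the Sidak correction."""
    return 1.0 - (1.0 - target_alpha) ** (1.0 / n_communities)


def significant_communities(p_values, target_alpha=0.05):
    """Return one boolean per community: True if it is judged significant."""
    alpha = sidak_alpha(target_alpha, len(p_values))
    return [p <= alpha for p in p_values]


# With C = 31 communities and a targeted level of 0.05, the per-community
# level is roughly 0.0017, so only the first p-value below passes.
flags = significant_communities([0.0005] + [0.01] * 30)
```

By construction, (1 − α)^C = 1 − α′, so the probability of at least one false positive among C independent tests equals the targeted level.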

Time complexity

The time complexity of the proposed statistical test is evaluated as follows. Generating one randomised network from the configuration model consumes $O (N + M)$ time using an efficient algorithm²², which is implemented in some network analysis software^23,24. For each generated randomised network, we detect communities. Any community-detection algorithm qualified for the present statistical test computes the quality and size of the individual communities and maximises the quality function for the entire network. We use the quality and size of the optimised communities in the statistical test. We carry out these procedures for each of the R randomised networks, consuming $O ((N + M + Z) R)$ time in total, where Z is the time complexity of the community-detection algorithm. We compute the p-value for each of the C communities in the original network using Eq. (7) with RC^conf samples on average, $\{\tilde{q}_{c}\}_{c = 1}^{R C^{conf}}$ and $\{\tilde{s}_{c}\}_{c = 1}^{R C^{conf}}$, where C^conf is the average number of communities detected in a randomised network. This incurs a time complexity of $O (C \times R C^{conf})$. In total, the proposed statistical test requires $O ((N + M + Z + C C^{conf}) R)$ time.
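For orientation, the following sketch draws one randomised network in which each node keeps its degree approximately in expectation, by connecting nodes i and j independently with probability d_i d_j / (2M). This is an assumption about the null model based on the description "random networks that preserve the expected degree of each node". It is a naive version that visits all node pairs and therefore takes O(N²) time, not the O(N + M) algorithm cited above; the function name and the capping of probabilities at 1 are also assumptions.

```python
import random


def expected_degree_network(degrees, seed=None):
    """Naive O(N^2) sampler for a network with given expected degrees.

    degrees : degree sequence of the original network (sum must be positive)
    Returns a symmetric 0/1 adjacency matrix without self-loops.
    """
    rng = random.Random(seed)
    n = len(degrees)
    two_m = sum(degrees)                      # 2M in the original network
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):             # each unordered pair once
            p = min(1.0, degrees[i] * degrees[j] / two_m)
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj


# One randomised counterpart of a small degree sequence.
sample = expected_degree_network([3, 3, 2, 2, 2, 2, 1, 1], seed=1)
```

Because self-loops are excluded here, the expected degrees of the sampled network are slightly smaller than the prescribed ones; library implementations may treat this differently.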

The computation time can be reduced using parallel computing. In other words, one runs multiple threads, each of which generates independent samples of $(\tilde{q}_{c}, \tilde{s}_{c})$. Once the sampling is completed in all the threads, one computes the p-value using Eq. (7). We used 16 threads on a computer with Intel 2.6 GHz Sandy Bridge processors and 4 GB of memory. For the largest network we analysed (i.e., Internet²⁵; N = 34,761 nodes), our statistical test needed 403 seconds using the Louvain community-detection algorithm, which has a time complexity of $O (M)$ ²⁶. With the Kernighan–Lin community-detection algorithm, which has a time complexity of $O (N^{2})$ ²⁷, it took 17,763 seconds (i.e., approximately 5 hours).

Community detection with different quality functions

Among various quality functions for individual communities apart from $q_{c}^{\mod}$ ^4,13,14, we consider the following three quality functions. The internal average degree¹⁴ (i.e., normalised number of intra-community edges), denoted by $q_{c}^{int}$ , is defined by

$$q_{c}^{int} \equiv \frac{1}{n_{c}} \sum_{\substack{1 \leq i, j \leq N \\ i, j \in \text{community } c}} A_{i j}. \tag{8}$$

The maximisation of $q_{c}^{int}$ yields a community having dense intra-community connectivity. The expansion¹⁴, denoted by $q_{c}^{\exp}$ , is defined by

$$q_{c}^{\exp} \equiv - \frac{1}{n_{c}} \sum_{\substack{1 \leq i, j \leq N \\ i \in \text{community } c \\ j \notin \text{community } c}} A_{i j}. \tag{9}$$

The maximisation of $q_{c}^{\exp}$ yields a community having sparse inter-community connectivity. Finally, the conductance¹⁴, denoted by $q_{c}^{cnd}$ , is defined by

$$q_{c}^{cnd} \equiv - \frac{1}{{vol}_{c}} \sum_{\substack{1 \leq i, j \leq N \\ i \in \text{community } c \\ j \notin \text{community } c}} A_{i j}, \tag{10}$$

where vol_c is the sum of degrees of nodes (i.e., volume) in a community c. Similar to the case of $q_{c}^{\exp}$ , the maximisation of $q_{c}^{cnd}$ yields a community having sparse inter-community connectivity. One can also interpret the maximisation of $q_{c}^{cnd}$ as the maximisation of the number of intra-community edges²⁸.
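The three quality functions differ only in which edges are counted and in the normalisation, which the following minimal sketch makes explicit. The function name and the dictionary keys are illustrative assumptions; the sketch also assumes a non-empty community with at least one incident edge, so that n_c and vol_c are positive.

```python
def community_qualities(adj, members):
    """Internal average degree, expansion and conductance of one community.

    adj     : symmetric 0/1 adjacency matrix (list of lists), no self-loops
    members : node indices belonging to community c
    """
    inside = set(members)
    n_c = len(inside)
    internal = 0   # sum of A_ij over ordered pairs i, j in c (edges counted twice)
    boundary = 0   # sum of A_ij over i in c and j outside c
    for i in inside:
        for j, a_ij in enumerate(adj[i]):
            if a_ij:
                if j in inside:
                    internal += a_ij
                else:
                    boundary += a_ij
    vol_c = internal + boundary   # sum of the degrees of the nodes in c
    return {
        "int": internal / n_c,
        "exp": -boundary / n_c,
        "cnd": -boundary / vol_c,
    }


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1
left = community_qualities(adj, [0, 1, 2])   # int = 2, exp = -1/3, cnd = -1/7
```

The expansion and conductance are defined with a negative sign so that, like the other quality functions, larger values correspond to better communities.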

For $q_{c}^{\mod}$, we adopt the Louvain algorithm to maximise the modularity (i.e., the sum of $q_{c}^{\mod}$ over the communities, $\sum_{c = 1}^{C} q_{c}^{\mod}$) to find communities in the original and randomised networks. However, the Louvain algorithm is not applicable to $Q = \sum_{c = 1}^{C} q_{c}$, where $q_{c} = q_{c}^{int}$, $q_{c}^{\exp}$ or $q_{c}^{cnd}$. Therefore, we adopt a variant of the Kernighan–Lin algorithm²⁹ used in a previous study²⁷. The algorithm seeks a partitioning of the network into communities that maximises Q. Suppose that each node i has a tentative label $\ell_{i}$ $(1 \leq \ell_{i} \leq C)$ indicating the index of the community to which node i belongs. First, we assign each node to one of the C communities selected uniformly at random. Second, for each node i, we tentatively relabel it to a different label and measure the increment in Q. Third, we select the node i and its new label c that maximise the increment in Q among all nodes i (1 ≤ i ≤ N) and all possible new labels. Regardless of whether Q increases or not, we accept the proposed relabelling of node i (i.e., set $\ell_{i} = c$). Fourth, we determine the pair of another node j (j ≠ i) and its tentative new label c′ that maximises the increment in Q, and change the label of j to c′ (i.e., $\ell_{j} = c'$). In this manner, we relabel nodes one by one. Here we do not relabel the nodes that have already been relabelled. After sequentially relabelling the N nodes, we select the labelling that yields the largest value of Q among the N + 1 labellings that have appeared in the course of relabelling the N nodes. If the initial labelling (before relabelling any node) yields the largest value of Q, we terminate the algorithm. Otherwise, we use the labelling that has yielded the largest Q value among the N + 1 labellings as the initial labelling in the next round of updating the labels. We repeat such rounds, each of which sequentially relabels the N nodes and selects the best labelling, until the initial labelling of a round is the best labelling in that round in terms of the Q value.
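The relabelling rounds described above can be sketched as follows. This is a brute-force illustration under stated assumptions, not the implementation used in the study: it recomputes Q from scratch for every candidate move (so it is only practical for tiny networks), breaks ties by keeping the first best move found, and takes the quality function as a user-supplied callable. The names `kl_partition`, `total_quality` and `init_labels`, and the helper `total_modularity` used as an example Q, are hypothetical.

```python
import random


def kl_partition(n_nodes, n_comms, total_quality, init_labels=None, seed=0):
    """Kernighan-Lin-style maximisation of total_quality(labels).

    total_quality : callable mapping a list of labels (0 .. n_comms - 1)
                    to the summed quality Q of the partition.
    Returns (labels, Q) for the best labelling found.
    """
    rng = random.Random(seed)
    if init_labels is None:
        labels = [rng.randrange(n_comms) for _ in range(n_nodes)]
    else:
        labels = list(init_labels)
    while True:
        start_q = total_quality(labels)
        best_q, best_labels = start_q, labels[:]
        current = labels[:]
        unlocked = set(range(n_nodes))      # nodes not yet relabelled this round
        while unlocked:
            move = None                     # (Q after move, node, new label)
            for i in unlocked:
                old = current[i]
                for c in range(n_comms):
                    if c == old:
                        continue
                    current[i] = c
                    q = total_quality(current)
                    if move is None or q > move[0]:
                        move = (q, i, c)
                current[i] = old            # undo the tentative relabelling
            if move is None:                # only one community: nothing to do
                break
            q, i, c = move
            current[i] = c                  # accept even if Q decreases
            unlocked.remove(i)
            if q > best_q:
                best_q, best_labels = q, current[:]
        if best_q <= start_q:               # initial labelling was the best
            return labels, start_q
        labels = best_labels                # start the next round from the best


def total_modularity(adj, labels):
    """Example Q: modularity, i.e. the sum of q_c^mod over the communities."""
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)
    n = len(adj)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Two triangles joined by the edge 2-3, starting with node 3 misplaced.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1
labels, q_best = kl_partition(6, 2, lambda lab: total_modularity(adj, lab),
                              init_labels=[0, 0, 0, 0, 1, 1])
```

Modularity is used as the example Q only because it remains well defined when a community becomes empty; for $q_{c}^{int}$, $q_{c}^{\exp}$ or $q_{c}^{cnd}$ the supplied callable would have to decide how to score empty or isolated communities.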

To find communities in networks using $q_{c}^{int}$ , $q_{c}^{\exp}$ or $q_{c}^{cnd}$ , we need to specify the number of communities, C. Otherwise, the maximisation of the quality functions may yield trivial communities. For example, $q_{c}^{\exp}$ is always the largest when each connected component constitutes a community because there is no inter-community edge. In the analysis of synthetic networks, we set C to the number of planted communities. For empirical networks, we set C to the number of communities identified by the Louvain algorithm.

Other statistical tests

We compare the (q, s)–test with two statistical tests, i.e., the test proposed by Spirin and Mirny¹⁰ and the test proposed by Lancichinetti, Radicchi and Ramasco⁸, which we refer to as the S–test and L–test, respectively. As is the case with the (q, s)–test, both the S–test and the L–test adopt the configuration model as the null model. For both statistical tests, we set the significance level for a single community to α = 1 − (1 − α′)^(1/C), where α′ = 0.05.

The S–test regards a community as significant if it has more intra-community edges than a community composed of the same number of nodes detected in randomised networks does. Their original algorithm¹⁰ is slow for large networks. Therefore, we adopt the Kernighan–Lin algorithm²⁹ to optimise the quality function for a community adopted in the S–test. As far as our numerical experiments indicate, our implementation is faster and also finds better community structure than their original algorithm does in terms of their quality function.

The L–test regards a community as significant if every node in the community has more neighbours within the community than that expected for the configuration model. In the original paper⁸, the authors defined two significance measures, i.e., the $\mathcal{C}$-score and the $\mathcal{B}$-score. We adopt the $\mathcal{B}$-score, which is less conservative than the $\mathcal{C}$-score. In the original article⁸, the $\mathcal{B}$-score is claimed to be more trustworthy than the $\mathcal{C}$-score because the $\mathcal{C}$-score, but not the $\mathcal{B}$-score, relies on extreme value statistics.

Data

We apply the statistical test to the 12 empirical networks listed in Table 1. We ignore the directions and weights of edges in the empirical networks.

Table 1.

Properties of 12 empirical networks.

Network	N	M	C	n_c (min)	n_c (max)	vol_c (min)	vol_c (max)
Karate³⁰	34	78	3	5	17	16	78
Dolphin³¹	62	159	4	7	22	37	123
Les Misérables³²	77	254	10	2	16	3	147
Email³³	151	1527	6	16	50	258	1081
Jazz³⁴	198	2742	6	3	63	9	2029
Network science⁷	379	914	11	6	65	27	290
Blog³⁵	1222	16,714	2	565	657	15,755	17,673
Airport^36,37	2939	15,677	20	2	712	2	12,638
Protein^38,39	3023	6149	161	2	312	2	1832
Chess²⁵	7115	55,779	409	2	812	3	23,034
Astro-ph (co-authorship)⁴⁰	18,771	198,050	116	2	3547	2	98,628
Internet²⁵	34,761	107,720	65	4	13,710	7	106,881


Column C indicates the number of communities detected by the Louvain algorithm. Columns n_c and vol_c indicate the number of nodes in a community and the sum of degrees of nodes in a community, respectively.

The karate club network represents the relationships among the members of a university’s karate club³⁰. Each node represents a member of the karate club. Two members are defined to be adjacent if they are friends outside of the club activities.

The dolphin social network represents the relationships of the dolphins living near Doubtful Sound in New Zealand³¹. Each node represents a dolphin. Two dolphins are defined to be adjacent if they are frequently observed in the same school.

The network of Les Misérables represents the relationships between the characters of a novel, Les Misérables³². Each node represents a character of the book. Each edge indicates that the two characters appear in the same chapter of the book.

The Enron email network represents the email interactions among the staff of Enron Inc³³. Each node represents an email account. Each edge indicates that an email is sent from one account to the other account.

The jazz network represents the collaborations among jazz musicians³⁴. Each node represents a jazz musician. Each edge indicates that two musicians belong to the same band.

The network of network scientists represents the collaborations between researchers in network science⁷. Each node represents a researcher. Two researchers are defined to be adjacent if they have published a co-authored paper cited by one of two popular review papers on network science. Then, some nodes and edges were added manually by the author of the article⁷. We only consider the largest connected component of the network.

The political blog network is the network of blogs on the United States presidential election in 2004³⁵. Each node represents a blog. Two blogs are defined to be adjacent if there is at least one hyperlink between the two blogs on their front page.

The airport network consists of nodes representing airports in the world^36,37. Two airports are defined to be adjacent if there is a direct commercial flight between the two airports.

The protein network represents the physical interactions among human proteins^38,39. Each node represents a protein. Two proteins are defined to be adjacent if they physically interact.

The Chess network represents the chess matches between players²⁵. Each node represents a chess player. Each edge indicates that the two players have played each other at least once.

The Astro-ph network represents the collaborations among the researchers who published a joint paper in the arXiv’s astro-ph section⁴⁰. Each node represents a researcher. Two researchers are defined to be adjacent if they have published a joint paper.

The Internet network represents the network of autonomous systems²⁵. A node represents an autonomous system, which is a group of routers maintained by a network operator. Two autonomous systems are defined to be adjacent if they have a logical peering relation.

Results

We measure the size of a community in two ways: the number of nodes in a community c, n_c, and the sum of degrees of nodes in a community c, vol_c. In the next two subsections, we consider the $(q_{c}^{\mod}, n_{c})$ –test and the $(q_{c}^{\mod}, {vol}_{c})$ –test. We show the results for other quality functions in the third subsection.

Synthetic networks

In this section, we examine synthetic networks with planted communities. We generate networks using the LFR model¹⁵, which places edges such that the degree of a node, d_i, and the number of nodes in a community, n_c, follow power-law distributions. We set the power-law exponent for the distributions of d_i and n_c to 2, the average degree to 10, the maximum degree to 100 and the range of n_c to [20,200]. The networks are composed of N = 10³ nodes. On average, a fraction 1 − μ of the neighbours of each node i belong to the same community as node i, where μ ∈ {0, 0.025, 0.05, …, 1} is a mixing parameter controlling the “strength” of community structure. With μ = 0, all edges are placed within communities, and the community structure is the strongest. With μ = 1, all edges are between different communities. We set the extent of overlaps between different communities to zero.

We generate 30 networks using the LFR model at each μ value. For each generated network, we classify the planted communities into significant and insignificant communities by each statistical test. Then, we compute the true positive rate (i.e., the fraction of planted communities in the network that are judged to be significant). Finally, we average the true positive rate over the 30 generated networks.

Figure 2 shows the true positive rate as a function of μ. The true positive rate for the S–test is the smallest for the entire range of μ, indicating that the S–test is the most conservative. The S–test does not regard all the planted communities as significant even at μ = 0 for the following reason. In the S–test, one detects the strongest community in each randomised network, where the strength of a community is measured by the number of intra-community edges. Then, a focal community in the original network is regarded as significant if it is stronger than the majority of the strongest communities detected in the randomised networks. The strongest communities in the randomised networks often contain almost the largest possible number of intra-community edges, whereas the planted communities do not always do so, even at μ = 0. Therefore, the S–test concludes that some planted communities are insignificant. The true positive rate for the L–test is 1 when μ = 0 and ranges between 0.55 and 0.95 for 0 < μ ≤ 0.5. The true positive rate for the $(q_{c}^{\mod}, n_{c})$–test and that for the $(q_{c}^{\mod}, {vol}_{c})$–test are comparable and close to 1 for 0 ≤ μ ≤ 0.3. In contrast, there is a visible difference between the results for the $(q_{c}^{\mod}, n_{c})$– and the $(q_{c}^{\mod}, {vol}_{c})$–tests for 0.3 < μ ≤ 0.5. This result suggests that the definition of the size of a community may affect the significance of weak communities but not of strong communities.

Figure 2. True positive rate for the statistical tests applied to the networks generated by the LFR model. Legends S, L, $(q_{c}^{\mod}, n_{c})$ and $(q_{c}^{\mod}, {vol}_{c})$ indicate the S–test, the L–test, the $(q_{c}^{\mod}, n_{c})$–test and the $(q_{c}^{\mod}, {vol}_{c})$–test, respectively. The error bars indicate ±1 standard deviation.

Empirical networks

We apply the statistical tests to the 12 empirical networks listed in Table 1 (see the Data section for details). In this section, we detect communities by modularity maximisation using the Louvain algorithm²⁶. Then, we apply the statistical tests to each detected community.

The fraction of significant communities for each statistical test is shown in Table 2. The $(q_{c}^{\mod}, n_{c})$– and the $(q_{c}^{\mod}, {vol}_{c})$–tests identify more significant communities than the S–test and the L–test do in a majority of the 12 empirical networks. This result indicates that the $(q_{c}^{\mod}, n_{c})$– and the $(q_{c}^{\mod}, {vol}_{c})$–tests are more generous than the S–test and the L–test, which is consistent with the results for the LFR model. This is probably because the $(q_{c}^{\mod}, n_{c})$– and the $(q_{c}^{\mod}, {vol}_{c})$–tests use $q_{c}^{\mod}$ to evaluate the quality of individual communities, which is consistent with the objective function of modularity maximisation, $\sum_{c = 1}^{C} q_{c}^{\mod}$.

Table 2.

Fraction of significant communities identified by the S–test, the L–test, the (q^mod, s)–test, the (q^int, s)–test, the (q^exp, s)–test and the (q^cnd, s)–test in the 12 empirical networks.

Network	S	L	$(q_{c}^{\mod}, n_{c})$	$(q_{c}^{\mod}, {vol}_{c})$	$(q_{c}^{int}, n_{c})$	$(q_{c}^{int}, {vol}_{c})$	$(q_{c}^{\exp}, n_{c})$	$(q_{c}^{\exp}, {vol}_{c})$	$(q_{c}^{cnd}, n_{c})$	$(q_{c}^{cnd}, {vol}_{c})$
Karate	1.00	0.33	0.67	1.00	0.00	0.00	0.00	0.00	0.00	0.33
Dolphin	1.00	0.50	1.00	0.75	0.00	0.00	0.00	0.00	0.50	0.50
Les Misérables	0.40	0.40	0.40	0.60	0.20	0.40	0.00	0.00	0.50	0.40
Enron	1.00	0.00	1.00	1.00	0.33	0.67	0.00	0.00	1.00	1.00
Jazz	0.67	0.67	0.67	1.00	0.67	0.83	0.00	0.00	1.00	1.00
Netscience	1.00	0.64	1.00	1.00	0.91	0.82	0.09	0.09	0.91	1.00
Blog	0.00	1.00	1.00	1.00	0.50	0.50	0.00	0.00	1.00	1.00
Airport	0.00	0.60	0.70	0.80	0.15	0.55	0.00	0.00	0.40	0.20
Protein	0.00	0.35	0.14	0.22	0.03	0.12	0.01	0.01	0.00	0.00
Chess	0.00	0.25	0.13	0.15	0.36	0.58	0.00	0.00	0.01	0.03
Astro-ph	—	0.61	0.24	0.53	1.00	1.00	0.00	0.00	0.33	0.12
Internet	—	0.55	0.65	0.60	0.00	0.18	0.00	0.00	0.00	0.02

A dash indicates that the test did not terminate within 64 days on our computer (Intel 2.6 GHz Sandy Bridge processors and 4 GB of memory).

To quantify the agreement between two statistical tests, we compute the level of agreement defined by τ = (C₁₁ + C₀₀)/C, where C is the number of communities, C₁₁ is the number of communities classified as significant by both tests and C₀₀ is the number of communities classified as insignificant by both tests. Note that 0 ≤ τ ≤ 1, τ = 1 if the two tests regard the same set of communities as significant, and τ = 0 if the two tests completely disagree. We compute τ between each pair of statistical tests for each empirical network and then average τ over the 12 empirical networks. The averaged τ values are shown in Table 3. We find τ = 0.42 between the S–test and the L–test, indicating that the two statistical tests disagree for a majority of communities. The L–test weakly agrees with the $(q_{c}^{\mod}, {vol}_{c})$ –test (i.e., τ = 0.58) but disagrees with the other tests for a majority of communities (i.e., τ < 0.5). The τ between the $(q_{c}^{\mod}, n_{c})$ – and the $(q_{c}^{\mod}, {vol}_{c})$ –tests is large (τ = 0.84), suggesting that the significance of a majority of communities is not strongly affected by the definition of the community size.
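The level of agreement τ defined above can be computed directly from the two lists of test outcomes. The following Python sketch is an illustration only; the function name and the example outcomes are hypothetical, and C is taken to be the number of communities being compared.

```python
def agreement(significant_a, significant_b):
    """Level of agreement tau = (C11 + C00) / C between two tests.

    Each argument holds one boolean per community: True if the test
    regards that community as significant, False otherwise.
    """
    if len(significant_a) != len(significant_b):
        raise ValueError("both tests must cover the same communities")
    if not significant_a:
        raise ValueError("at least one community is required")
    pairs = list(zip(significant_a, significant_b))
    c11 = sum(1 for a, b in pairs if a and b)          # significant in both tests
    c00 = sum(1 for a, b in pairs if not a and not b)  # insignificant in both tests
    return (c11 + c00) / len(pairs)


# Hypothetical outcomes of two tests for five communities.
test_a = [True, True, False, False, True]
test_b = [True, False, False, True, True]
tau = agreement(test_a, test_b)
```

Here the two tests agree on three of the five communities (two significant in both, one insignificant in both), which gives τ = 0.6.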

Table 3.

Agreement between pairs of statistical tests.

Test	S	L	$(q_{c}^{\mod}, n_{c})$	$(q_{c}^{\mod}, {vol}_{c})$
S	1.00	0.42	0.73	0.66
L	0.42	1.00	0.49	0.58
$(q_{c}^{\mod}, n_{c})$	0.73	0.49	1.00	0.84
$(q_{c}^{\mod}, {vol}_{c})$	0.66	0.58	0.84	1.00

Other quality functions

In this section, we examine the $(q_{c}^{int}, s_{c})$ –, the $(q_{c}^{\exp}, s_{c})$ – and the $(q_{c}^{cnd}, s_{c})$ –tests, where s_c is either n_c or vol_c. For the synthetic networks, the true positive rate for the $(q_{c}^{int}, n_{c})$ – and the $(q_{c}^{int}, {vol}_{c})$ –tests is small in the entire range of μ (Fig. 3). As is the case for the S–test, the quality function $q_{c}^{int}$ uses the number of intra-community edges. Some planted communities are regarded as insignificant because randomised networks often contain a community having almost the largest possible number of intra-community edges (Fig. 1(c)). The quality function $q_{c}^{\exp}$ is the largest when the community c is disconnected from the other nodes. Randomised networks often contain many disconnected components, yielding a large value of $q_{c}^{\exp}$ (Fig. 1(d)). Therefore, the true positive rate for the $(q_{c}^{\exp}, n_{c})$ – and the $(q_{c}^{\exp}, {vol}_{c})$ –tests is also close to zero in the entire range of μ. In contrast to the $(q_{c}^{int}, s_{c})$ – and the $(q_{c}^{\exp}, s_{c})$ –tests, the $(q_{c}^{cnd}, n_{c})$ – and the $(q_{c}^{cnd}, {vol}_{c})$ –tests yield a true positive rate close to one when μ ≤ 0.3. These results suggest that the outcome of the test depends considerably on the quality function. For all the (q, s)–tests, the definition of community size (i.e., n_c or vol_c) does not strongly influence the true positive rate.

True positive rate as a function of mixing parameter, μ, for the six (q, s)–tests.

For the empirical networks, we first detect communities by maximising q, where q is either $q_{c}^{int}$, $q_{c}^{\exp}$ or $q_{c}^{cnd}$, using the variant of the Kernighan–Lin algorithm (see the Other statistical tests section). Then, we apply the (q, s)–test to each detected community. The results for the $(q_{c}^{int}, s_{c})$ –, the $(q_{c}^{\exp}, s_{c})$ – and the $(q_{c}^{cnd}, s_{c})$ –tests applied to the 12 empirical networks are shown in Table 2. For most of the networks, the $(q_{c}^{cnd}, s_{c})$ –test regards more communities as significant than the $(q_{c}^{int}, s_{c})$ – and the $(q_{c}^{\exp}, s_{c})$ –tests, where s_c is either n_c or vol_c. This result is consistent with those obtained for the synthetic networks (Fig. 3). For each quality function q, the level of agreement (i.e., τ) between the different definitions of the community size (i.e., n_c or vol_c) is shown in Table 4. For most empirical networks, the agreement τ is larger than 0.8, indicating that the results of the statistical test do not strongly depend on the definition of community size in most cases.

Table 4.

Agreement between the (q_c, n_c)–test and the (q_c, vol_c)–test.

Network	$q_{c}^{int}$	$q_{c}^{\exp}$	$q_{c}^{cnd}$
Karate	1.00	1.00	0.67
Dolphin	1.00	1.00	1.00
Les Misérables	0.60	1.00	0.90
Enron	0.67	1.00	1.00
Jazz	0.50	1.00	1.00
Netscience	0.73	1.00	0.91
Blog	1.00	1.00	1.00
Airport	0.60	1.00	0.60
Protein	0.90	0.99	1.00
Chess	0.77	1.00	0.98
Astro-ph	1.00	1.00	0.76
Internet	0.82	1.00	0.98

Discussion

We proposed a non-parametric statistical test, called the (q, s)–test, for the significance of individual communities, which accounts for the correlation between the quality and the size of single communities. We demonstrated our test with several quality functions q including the one defined as the contribution of a single community to the modularity. In fact, the (q, s)–test accepts different quality functions for individual communities such as those described in the previous literature^{13,14,41–43}. In addition, the (q, s)–test does not prescribe how communities should be detected in a given network. We note, however, that one should use a q that is consistent with the objective function for community detection, because q is maximised in the (q, s)–test and the objective function is maximised in community detection.

We have used two definitions of the size of a community, i.e., the number of nodes in a community (i.e., n_c), and the sum of degrees of nodes in a community (i.e., vol_c). For degree-homogeneous networks, the choice does not matter because n_c ∝ vol_c. However, for degree-heterogeneous networks, significant communities may considerably depend on whether we use n_c or vol_c. If q explicitly uses its own measure of the size of a community, we should probably adopt the corresponding definition of the community size in the (q, s)–test. If a measure of community size is not explicit, we suggest that one selects a measure of community size that is more strongly correlated with q than others. If q is correlated with multiple quantities (e.g. both n_c and vol_c) that are not perfectly correlated with each other, one can extend the (q, s)–test by adopting multivariate Gaussian kernels with three or more variables instead of bivariate Gaussian kernels. A downside of this approach is that we would need more data to reliably estimate the distribution of (q, s), where s is at least two-dimensional.
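The two definitions of community size can be made concrete with a short Python sketch. The helper name and the two toy networks below are hypothetical; the adjacency structure is assumed to be a dictionary that maps each node to the set of its neighbours.

```python
def community_sizes(adjacency, community):
    """Return (n_c, vol_c) for a community given as a collection of nodes.

    n_c is the number of nodes in the community and vol_c is the sum of
    the degrees of those nodes.
    """
    n_c = len(community)
    vol_c = sum(len(adjacency[node]) for node in community)
    return n_c, vol_c


# Degree-homogeneous toy network: a ring of six nodes, every degree equals 2.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
ring_sizes = community_sizes(ring, {0, 1, 2})

# Degree-heterogeneous toy network: a star with hub 0 and five leaves.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
hub_sizes = community_sizes(star, {0, 1})
leaf_sizes = community_sizes(star, {2, 3})
```

In the ring, vol_c is always twice n_c, so either measure ranks communities identically. In the star, the two node sets have the same n_c but different vol_c, which is the situation in which the choice of size measure may matter.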

We can adopt the (q, s)–test to assess the significance of other structures of networks, such as bipartite communities⁴⁴ and core-periphery structure^45–47, provided that the quality function for the individual structure (e.g., a single bipartite community) is explicitly defined. In fact, we applied a variant of the (q, s)–test to core-periphery structure in our previous study⁴⁷.

Robustness of community structure against random perturbations (e.g., addition, removal and rewiring of edges) is an alternative measure of the significance of communities^14,48,49. With this approach, if small perturbations do not considerably change communities, then the communities are regarded as significant. Statistical tests based on quality functions including the (q, s)–test and those based on robustness may provide different results⁴⁹. As is the case of quality functions, the robustness of an individual community may be correlated with the size of a community. For example, removal of a small number of intra-community edges may destroy small communities, whereas large communities may survive the removal of more intra-community edges. If this is the case, it may be worthwhile to inform a robustness–based test of individual communities by the dependence of the robustness measure on the size of a community.
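The size dependence of robustness described above can be illustrated with a deliberately idealised Python sketch. It assumes that each community is a clique and that a community "survives" if it remains connected after edge removal; both are simplifying assumptions made for this example, not the robustness measures of the cited studies, and the function names are hypothetical.

```python
from itertools import combinations


def is_connected(nodes, edges):
    """Check by depth-first search whether the graph (nodes, edges) is connected."""
    nodes = set(nodes)
    if not nodes:
        return True
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == nodes


def survival_fraction(num_nodes, num_removed):
    """Fraction of all ways to remove num_removed edges from a clique
    of num_nodes nodes that leave the clique connected."""
    nodes = range(num_nodes)
    edges = list(combinations(nodes, 2))
    total = 0
    survived = 0
    for removed in combinations(edges, num_removed):
        removed = set(removed)
        remaining = [edge for edge in edges if edge not in removed]
        total += 1
        survived += is_connected(nodes, remaining)
    return survived / total


small = survival_fraction(3, 2)  # clique of three nodes, two edges removed
large = survival_fraction(6, 2)  # clique of six nodes, two edges removed
```

Removing any two edges always disconnects the three-node clique, whereas the six-node clique stays connected in every case, so a robustness score of this kind would favour the larger community regardless of how it was detected.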

Electronic supplementary material

Supplementary Information (PDF, 970 KB)

Acknowledgements

N.M. acknowledges the support provided through JST, CREST, and JST, ERATO, Kawarabayashi Large Graph Project.

Author Contributions

N.M. conceived and designed the research; S.K. performed the computational experiments; N.M. and S.K. wrote the paper.

Competing Interests

The authors declare no competing interests.

Footnotes

Electronic supplementary material

Supplementary information accompanies this paper at 10.1038/s41598-018-25560-z.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Newman, M. E. J. Networks: An Introduction (Oxford University Press, Oxford, 2010).
2. Barabási, A. L. Network Science (Cambridge University Press, Cambridge, 2016).
3. Fortunato S. Community detection in graphs. Phys. Rep. 2010;486:75–174. doi: 10.1016/j.physrep.2009.11.002.
4. Fortunato S, Hric D. Community detection in networks: A user guide. Phys. Rep. 2016;659:1–44. doi: 10.1016/j.physrep.2016.09.002.
5. Jonsson PF, Cavanna T, Zicha D, Bates PA. Cluster analysis of networks generated through homology: automatic identification of important protein communities involved in cancer metastasis. BMC Bioinf. 2006;7:2. doi: 10.1186/1471-2105-7-2.
6. Guimerà R, Mossa S, Turtschi A, Amaral LAN. The worldwide air transportation network: anomalous centrality, community structure, and cities’ global roles. Proc. Natl. Acad. Sci. USA. 2005;102:7794–7799. doi: 10.1073/pnas.0407994102.
7. Newman MEJ. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E. 2006;74:036104. doi: 10.1103/PhysRevE.74.036104.
8. Lancichinetti A, Radicchi F, Ramasco JJ. Statistical significance of communities in networks. Phys. Rev. E. 2010;81:046110. doi: 10.1103/PhysRevE.81.046110.
9. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S. Finding statistically significant communities in networks. PLOS ONE. 2011;6:e18961. doi: 10.1371/journal.pone.0018961.
10. Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. USA. 2003;100:12123–12128. doi: 10.1073/pnas.2032324100.
11. Wang, B. et al. Spatial scan statistics for graph clustering. In Proc. 2008 SIAM Int. Conf. Data Mining, 727–738 (SIAM, Philadelphia, 2008).
12. Zhao Y, Levina E, Zhu J. Community extraction for social networks. Proc. Natl. Acad. Sci. USA. 2011;108:7321–7326. doi: 10.1073/pnas.1006642108.
13. Leskovec, J., Lang, K. J. & Mahoney, M. W. Empirical comparison of algorithms for network community detection. In Proc. 19th Int. Conf. World Wide Web, 631–640 (ACM, New York, 2010).
14. Yang J, Leskovec J. Defining and evaluating network communities based on ground-truth. Know. Inf. Syst. 2015;42:181–213. doi: 10.1007/s10115-013-0693-z.
15. Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Phys. Rev. E. 2008;78:046110. doi: 10.1103/PhysRevE.78.046110.
16. Wand MP, Jones MC. Comparison of smoothing parameterizations in bivariate kernel density estimation. J. Am. Stat. Assoc. 1993;88:520–528. doi: 10.1080/01621459.1993.10476303.
17. Parzen E. On estimation of a probability density function and mode. Annal. Math. Stat. 1962;33:1065–1076. doi: 10.1214/aoms/1177704472.
18. Park BU, Marron JS. Comparison of data-driven bandwidth selectors. J. Am. Stat. Assoc. 1990;85:66–72. doi: 10.1080/01621459.1990.10475307.
19. Jones MC, Marron JS, Sheather SJ. A brief survey of bandwidth selection for density estimation. J. Am. Stat. Assoc. 1996;91:401–407. doi: 10.1080/01621459.1996.10476701.
20. Scott, D. W. Multivariate density estimation and visualization (Springer, Berlin, 2012).
21. Šidák Z. Rectangular confidence regions for the means of multivariate normal distributions. J. Am. Stat. Assoc. 1967;62:626–633.
22. Miller, J. C. & Hagberg, A. Efficient generation of networks with given expected degrees. In Frieze, A., Horn, P. & Prałat, P. (eds) Algorithms and Models for the Web Graph, vol. 6732 LNCS, 115–126 (Springer Berlin Heidelberg, Berlin, Heidelberg, 2011).
23. Staudt CL, Sazonovs A, Meyerhenke H. Networkit: A tool suite for large-scale complex network analysis. Network Science. 2016;4:508–530. doi: 10.1017/nws.2016.20.
24. Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using networkx. In Varoquaux, G., Vaught, T. & Millman, J. (eds) Proc. 7th Python in Sci. Conf., 11–15 (Pasadena, CA USA, 2008).
25. Kunegis, J. Available at, http://konect.uni-koblenz.de [Accessed: 2 Sep 2017].
26. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008;2008:P10008. doi: 10.1088/1742-5468/2008/10/P10008.
27. Karrer B, Newman MEJ. Stochastic blockmodels and community structure in networks. Phys. Rev. E. 2011;83:016107. doi: 10.1103/PhysRevE.83.016107.
28. von Luxburg U. A tutorial on spectral clustering. Stat. Comput. 2007;17:395–416. doi: 10.1007/s11222-007-9033-z.
29. Kernighan BW, Lin S. An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J. 1970;49:291–307. doi: 10.1002/j.1538-7305.1970.tb01770.x.
30. Zachary WW. An information flow model for conflict and fission in small groups. J. Anthropol. Res. 1977;33:452–473. doi: 10.1086/jar.33.4.3629752.
31. Lusseau D, et al. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav. Ecol. Sociobiol. 2003;54:396–405. doi: 10.1007/s00265-003-0651-y.
32. Knuth, D. E. The Stanford GraphBase: A Platform for Combinatorial Computing (ACM Press, New York, 1993).
33. Klimt, B. & Yang, Y. The Enron corpus: A new dataset for email classification research. In Proc. 15th European Conf. Machine Learning, 217–226 (Springer, Berlin, 2004).
34. Gleiser PM, Danon L. Community structure in jazz. Adv. Comp. Syst. 2003;6:565–573. doi: 10.1142/S0219525903001067.
35. Adamic, L. A. & Glance, N. The political blogosphere and the 2004 U.S. election: divided they blog. In Proc. 3rd Int. Workshop on Link Discovery, 36–43 (ACM, New York, 2005).
36. Patokallio, J. Available at, http://openflights.org [Accessed: 24 Sep 2016].
37. Opsahl, T. Available at, https://toreopsahl.com/2011/08/12/why-anchorage-is-not-that-important-binary-ties-and-sample-selection [Accessed: 24 Sep 2016].
38. Rual J, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005;437:1173–1178. doi: 10.1038/nature04209.
39. Ma’ayan, A. Available at, http://research.mssm.edu/maayan/datasets/qualitative_networks.shtml [Accessed: 2 Sep 2017].
40. Leskovec J, Kleinberg J, Faloutsos C. Graph evolution: densification and shrinking diameters. ACM Trans. Knowl. Discov. Data. 2007;1:2. doi: 10.1145/1217299.1217301.
41. Chen M, Kuzmin K, Szymanski BK. Community detection via maximization of modularity and its variants. IEEE Trans. Comput. Soc. Syst. 2014;1:46–65. doi: 10.1109/TCSS.2014.2307458.
42. Lambiotte R, Delvenne JC, Barahona M. Random walks, Markov processes and the multiscale modular organization of complex networks. IEEE Trans. Netw. Sci. Eng. 2014;1:76–90. doi: 10.1109/TNSE.2015.2391998.
43. Zhang P, Moore C. Scalable detection of statistically significant communities and hierarchies, using message passing for modularity. Proc. Natl. Acad. Sci. USA. 2014;111:18144–18149. doi: 10.1073/pnas.1409770111.
44. Newman MEJ, Leicht EA. Mixture models and exploratory analysis in networks. Proc. Natl. Acad. Sci. USA. 2007;104:9564–9569. doi: 10.1073/pnas.0610537104.
45. Borgatti SP, Everett MG. Models of core/periphery structures. Soc. Netw. 2000;21:375–395. doi: 10.1016/S0378-8733(99)00019-2.
46. Rombach MP, Porter MA, Fowler JH, Mucha PJ. Core-periphery structure in networks (revisited). SIAM Rev. 2017;59:619–646. doi: 10.1137/17M1130046.
47. Kojaku, S. & Masuda, N. Core-periphery structure requires something else in the network. New J. Phys. 20, 043012 (2018).
48. Gfeller D, Chappelier JC, De Los Rios P. Finding instabilities in the community structure of complex networks. Phys. Rev. E. 2005;72:056135. doi: 10.1103/PhysRevE.72.056135.
49. Karrer B, Levina E, Newman MEJ. Robustness of community structure in networks. Phys. Rev. E. 2008;77:046119. doi: 10.1103/PhysRevE.77.046119.

